Running LLMs in-browser via WebGPU, Transformers.js, and Chrome's Prompt API—no Ollama, no server
Posted by psgganesh@reddit | LocalLLaMA | 4 comments
Been experimenting with browser-based inference and wanted to share what I've learned packaging it into a usable Chrome extension.
Three backends working together:
- WebLLM (MLC): Llama 3.2, DeepSeek-R1, Qwen3, Mistral, Gemma, Phi, SmolLM2, Hermes 3
- Transformers.js: HuggingFace models via ONNX Runtime
- Browser AI / Prompt API: Chrome's built-in Gemini Nano and Phi (no download required)
Models are cached in the browser and chat messages are stored in IndexedDB, so everything works offline after the first download. I also added a memory monitor that warns at 80% usage and helps clear unused weights, since browser-based inference eats RAM fast.
I'm curious what this community thinks about WebGPU as a viable inference path for everyday use, which is why I built this project. Is anyone else building in this space?
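For anyone wondering what an 80% warning check might look like, here is a minimal sketch. It is not the extension's actual code: `checkMemory` and `sampleJsHeap` are hypothetical names, and the sampling relies on Chrome's non-standard `performance.memory`, which reports the JS heap only and does not see GPU buffers holding model weights.

```typescript
// Sketch only: helper names are hypothetical; the 0.8 threshold is from the post.
interface MemorySample {
  usedBytes: number;
  limitBytes: number;
}

interface MemoryStatus {
  ratio: number; // used / limit, 0 when the limit is unknown
  warn: boolean; // true once the ratio reaches the threshold
}

const WARN_THRESHOLD = 0.8;

// Pure check, so it can be unit-tested without a browser.
function checkMemory(
  sample: MemorySample,
  threshold: number = WARN_THRESHOLD,
): MemoryStatus {
  if (sample.limitBytes <= 0) {
    return { ratio: 0, warn: false };
  }
  const ratio = sample.usedBytes / sample.limitBytes;
  return { ratio, warn: ratio >= threshold };
}

// Reads Chrome's non-standard performance.memory if it exists.
// Returns null elsewhere (Firefox, Safari, Node).
function sampleJsHeap(): MemorySample | null {
  const mem = (globalThis as any).performance?.memory;
  if (!mem) {
    return null;
  }
  return { usedBytes: mem.usedJSHeapSize, limitBytes: mem.jsHeapSizeLimit };
}
```

A caller could poll `sampleJsHeap()` on a timer and show a prompt to unload models when `checkMemory(sample).warn` is true. GPU memory would need a separate estimate, for example from the known size of the loaded weights.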
Project: https://noaibills.app/?utm_source=reddit&utm_medium=social&utm_campaign=launch_localllama
stefferri@reddit
Similar stack, different use case. We use Transformers.js v4 + WebGPU EP to run embedding models for semantic search in an Obsidian plugin, rather than for LLM inference. Same WebGPU probe at startup (navigator.gpu.requestAdapter()), same fallback to CPU when the adapter is null.
One thing we hit with 768-dim models: fp32 on WASM triggers a SafeInt overflow in onnxruntime-web 1.26.0. WebGPU EP sidesteps it entirely since it bypasses the WASM runtime. Worth knowing before you add larger models.
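The startup probe described in this comment can be sketched roughly as follows. This is an illustration, not the plugin's code: `pickBackend` and the `NavigatorLike` type are hypothetical, and the only real API used is `navigator.gpu.requestAdapter()`, which resolves to `null` when no suitable adapter is available.

```typescript
// Sketch only: pickBackend and these types are hypothetical names.
type Backend = "webgpu" | "wasm";

// Minimal structural types so the sketch compiles without @webgpu/types.
interface GpuLike {
  requestAdapter(): Promise<unknown | null>;
}
interface NavigatorLike {
  gpu?: GpuLike;
}

async function pickBackend(nav?: NavigatorLike): Promise<Backend> {
  // Fall back to the global navigator when none is injected.
  const n: NavigatorLike | undefined =
    nav ?? (globalThis as any).navigator;

  // No navigator.gpu at all: the browser has no WebGPU support.
  if (!n?.gpu) {
    return "wasm";
  }
  try {
    // Resolves to null when no adapter is available.
    const adapter = await n.gpu.requestAdapter();
    return adapter ? "webgpu" : "wasm";
  } catch {
    // Treat any probe failure as "no WebGPU" and use the CPU path.
    return "wasm";
  }
}
```

The result could then be passed on to whatever selects the execution provider, for example as the `device` option when creating a Transformers.js pipeline. Accepting the navigator as a parameter keeps the probe testable outside a browser.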
InvertedVantage@reddit
Cool, I've been wondering about how webllm performs, will check this out when I can!
psgganesh@reddit (OP)
Appreciate it! 🤗 WebLLM performance is surprisingly good with WebGPU acceleration—obviously not matching native speeds, but very usable for everyday tasks.
A few things you'll notice:
- First model download takes time (models are 1-4GB), but they cache in IndexedDB for instant reuse
- Smaller models (Llama 3.2 3B, Qwen 2.5 3B) are snappy on decent hardware
- There's a memory monitor that alerts at 80% usage so you can clear unused models
Would love to hear your experience once you try it.
Sea_Bed_9754@reddit
I'm using it heavily on a Mac M2 with 64 GB. Hermes 3B runs quite well. Could you please advise on memory cleaning: how does model memory actually get overloaded?