Running Llama 3.2 100% locally in the browser on WebGPU w/ Transformers.js
Posted by xenovatech@reddit | LocalLLaMA | View on Reddit | 37 comments
neo_fpv@reddit
Can you run this on CPU with WASM?
xenovatech@reddit (OP)
The [model](https://huggingface.co/onnx-community/Llama-3.2-1B-Instruct-q4f16) runs 100% locally in the browser w/ Transformers.js and ONNX Runtime Web, meaning no data leaves your device! Important links, for those interested:
Demo: https://huggingface.co/spaces/webml-community/llama-3.2-webgpu
Source code: https://github.com/huggingface/transformers.js-examples/tree/main/llama-3.2-webgpu
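For reference, here is a minimal sketch of what loading the same model with Transformers.js looks like, assuming Transformers.js v3 (the `@huggingface/transformers` package) with the WebGPU backend; the option names follow the library's pipeline API, and the exact output shape may differ between versions:

```js
import { pipeline } from "@huggingface/transformers";

// Downloads the ONNX weights once, caches them in the browser,
// and runs generation entirely client-side.
const generator = await pipeline(
  "text-generation",
  "onnx-community/Llama-3.2-1B-Instruct-q4f16",
  { device: "webgpu" },
);

const messages = [{ role: "user", content: "Explain WebGPU in one sentence." }];
const output = await generator(messages, { max_new_tokens: 128 });

// With chat-style input, generated_text holds the updated conversation;
// the last message is the assistant's reply.
console.log(output[0].generated_text.at(-1).content);
```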
waiting_for_zban@reddit
I'm curious how that would work if you want to build and serve an app on top of it. How many resources would be needed from the client?
sourceholder@reddit
What is the overhead relative to running on platform natively (i.e. llama.cpp)?
privacyparachute@reddit
Someone tested this a while back. It's surprisingly small.
irvollo@reddit
I think the main advantage of this would be serving LLM applications client-side.
Not a lot of people want to, or know how to, set up their own llama server.
pkmxtw@reddit
Would be cool as a truly zero-setup LLM extension for summarizing, grammar checking, etc., where those 1-3B models are more than sufficient.
estebansaa@reddit
That is a great question. I'd imagine llama.cpp is much faster? Also, how big is the weight file?
bwjxjelsbd@reddit
Can I change it to the 3B model? For me the 1B model is not that great kek
Rangizingo@reddit
It's pretty fucking cool that we have good small models now that can be run locally, much less in browser. This is sweet.
Mkengine@reddit
Unfortunately the answers are much better in English than in German. Would it be possible to finetune it for a specific language?
silenceimpaired@reddit
I’m curious how this would work… ask it to translate a question in German to English. Ask the English question… then ask it to translate the English answer to German.
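A rough sketch of that round trip, assuming the same Transformers.js pipeline as in the earlier sketch; the `ask` helper and the prompts are illustrative only, not a tested recipe:

```js
import { pipeline } from "@huggingface/transformers";

const generator = await pipeline(
  "text-generation",
  "onnx-community/Llama-3.2-1B-Instruct-q4f16",
  { device: "webgpu" },
);

// Helper: one chat turn, returning only the assistant's reply text.
async function ask(prompt) {
  const out = await generator([{ role: "user", content: prompt }], {
    max_new_tokens: 256,
  });
  return out[0].generated_text.at(-1).content;
}

const germanQuestion = "Warum ist der Himmel blau?";
const englishQuestion = await ask(`Translate this to English: ${germanQuestion}`);
const englishAnswer = await ask(englishQuestion);
const germanAnswer = await ask(`Translate this to German: ${englishAnswer}`);
console.log(germanAnswer);
```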
privacyparachute@reddit
Since there is a base model available this should be possible. Transformers.js should be able to run the finetune once you export an ONNX version of the model.
Time-Plum-7893@reddit
What does it mean to run it locally? Can it run offline? So is it ready for local production deployment?
awomanaftermidnight@reddit
Mikitz@reddit
That's literally the same output I get for every single message I send to it 😂
omercelebi00@reddit
Can't wait to run it on my smart watch..
Lechowski@reddit
Maybe I was too optimistic trying to run this on my Android phone....
It loaded at least
_meaty_ochre_@reddit
Yeah, WebGPU is a lot closer to full support than it used to be, but it’s nowhere near universal yet. https://caniuse.com/webgpu
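For anyone wiring this into their own page, a small feature check before trying to load the model (standard WebGPU API, no library assumptions):

```js
// `navigator.gpu` only exists in browsers that ship WebGPU;
// requestAdapter() can still resolve to null on unsupported hardware.
async function hasWebGPU() {
  if (!("gpu" in navigator)) return false;
  const adapter = await navigator.gpu.requestAdapter();
  return adapter !== null;
}

if (!(await hasWebGPU())) {
  console.warn("WebGPU not available - fall back to a WASM backend.");
}
```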
Captain_Pumpkinhead@reddit
I've never heard of WebGPU before today. I might have to try it out!
privacyparachute@reddit
You could try the Wllama or WebLLM version.
Wllama demo:
https://huggingface.co/spaces/ngxson/wllama
WebLLM demo:
https://chat.webllm.ai/
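For comparison, a hedged sketch of the WebLLM route via the `@mlc-ai/web-llm` package; the prebuilt model ID below is an assumption, so check WebLLM's model list for the exact identifier:

```js
import { CreateMLCEngine } from "@mlc-ai/web-llm";

// Model ID is an assumption: look it up in WebLLM's prebuilt model list.
const engine = await CreateMLCEngine("Llama-3.2-1B-Instruct-q4f16_1-MLC");

// WebLLM exposes an OpenAI-style chat completions interface.
const reply = await engine.chat.completions.create({
  messages: [{ role: "user", content: "Hello!" }],
});
console.log(reply.choices[0].message.content);
```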
hummingbird1346@reddit
Wait, is the web version the 1B one or the 3B? I was able to run the 1B smoothly on Android but it wasn't coherent at all.
Any attempt to even load the 3B crashed the app, though. The RAM was just not enough. (Samsung A52 5G)
khromov@reddit
Crashes for me on compiling shaders, even though the phone should have enough RAM to handle it. 😿 (Chrome/Android 14)
Lechowski@reddit
Same, it's crashing on my S24 Ultra, so it seems shader compilation isn't supported on Android.
Hubrex@reddit
Soon.
ScoreUnique@reddit
Feel you bruh
Worldly_Dish_48@reddit
Really cool! What are your specs?
Shot_Platypus4420@reddit
Cool. I’m not an expert on LLMs, but it seems to me that the models from Meta are the most censored and inclined to give general answers.
Original_Finding2212@reddit
Which number of parameters?
CommunismDoesntWork@reddit
How does it handle OOM issues?
privacyparachute@reddit
This is a bit of a sore point with WASM (WebAssembly). I couldn't find the article I wanted to link to here, but the gist is that it's hard to predict how much memory you want to reserve, or even to know how much is really available.
You can of course catch OOM events, and inform the user that the WASM instance has crashed. RangeErrors galore.
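A minimal sketch of that "catch it and tell the user" approach, assuming the Transformers.js pipeline from earlier with the WASM backend; this is best-effort detection after the fact, not prevention:

```js
import { pipeline } from "@huggingface/transformers";

let generator;
try {
  generator = await pipeline(
    "text-generation",
    "onnx-community/Llama-3.2-1B-Instruct-q4f16",
    { device: "wasm" },
  );
} catch (err) {
  // WASM allocation failures often surface as RangeErrors,
  // but the instance may also simply abort, so treat this as best effort.
  if (err instanceof RangeError) {
    alert("Ran out of memory while loading the model. Try a smaller model or reload the page.");
  } else {
    throw err;
  }
}
```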
Due_Effect_5414@reddit
Since WebGPU runs on Vulkan, Direct3D and Metal, does that mean it's basically agnostic for inference on Mac/NVIDIA/AMD?
privacyparachute@reddit
Yes. There are however other limitations: Safari and Firefox still don't have WebGPU support enabled by default in the stable releases. Shouldn't be too far off though.
No_Afternoon_4260@reddit
You only have q4 and q8 with Transformers, right?
CheatCodesOfLife@reddit
FP16 as well if you want
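A quick sketch of how that choice is expressed in Transformers.js, assuming the `dtype` pipeline option; which values ("fp16", "q8", "q4", "q4f16", ...) actually load depends on the ONNX weight files the model repo ships:

```js
import { pipeline } from "@huggingface/transformers";

// dtype selects which quantized ONNX weights to fetch; the repo linked
// above ships q4f16, so other dtypes may not be available for it.
const generator = await pipeline(
  "text-generation",
  "onnx-community/Llama-3.2-1B-Instruct-q4f16",
  { device: "webgpu", dtype: "q4f16" },
);
```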
After-Main567@reddit
Starting out, I got 10 tokens/s on my Google Pixel 9 Pro. It went slower and slower as the context got longer.
habiba2000@reddit
This is honestly quite cool. I tried it with coding, and unfortunately it would get stuck repeating particular lines (is there a technical term for this?).
It may just be a parameter issue and not the model itself.