Running Llama 3.2 100% locally in the browser on WebGPU w/ Transformers.js
Posted by xenovatech@reddit | LocalLLaMA | View on Reddit | 37 comments
neo_fpv@reddit
Can you run this on CPU with WASM?
xenovatech@reddit (OP)
The [model](https://huggingface.co/onnx-community/Llama-3.2-1B-Instruct-q4f16) runs 100% locally in the browser w/ Transformers.js and ONNX Runtime Web, meaning no data leaves your device! Important links, for those interested:
Demo: https://huggingface.co/spaces/webml-community/llama-3.2-webgpu
Source code: https://github.com/huggingface/transformers.js-examples/tree/main/llama-3.2-webgpu
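For reference, here is a minimal sketch of what loading the same model with Transformers.js looks like, assuming Transformers.js v3 (the `@huggingface/transformers` package) with the WebGPU backend; the option names follow the library's pipeline API, and the exact output shape may differ between versions:

```js
import { pipeline } from "@huggingface/transformers";

// Downloads the ONNX weights once, caches them in the browser,
// and runs generation entirely client-side.
const generator = await pipeline(
  "text-generation",
  "onnx-community/Llama-3.2-1B-Instruct-q4f16",
  { device: "webgpu" },
);

const messages = [{ role: "user", content: "Explain WebGPU in one sentence." }];
const output = await generator(messages, { max_new_tokens: 128 });

// With chat-style input, generated_text holds the updated conversation;
// the last message is the assistant's reply.
console.log(output[0].generated_text.at(-1).content);
```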
waiting_for_zban@reddit
I'm curious how that would work if you want to build and serve an app on top of it. How many resources would be needed from the client?
sourceholder@reddit
What is the overhead relative to running on platform natively (i.e. llama.cpp)?
privacyparachute@reddit
Someone tested this a while back. It's surprisingly small.
irvollo@reddit
I think the main advantage of this would be serving LLM applications client-side.
Not a lot of people want to, or know how to, set up their own llama server.
pkmxtw@reddit
Would be cool as a truly zero-setup LLM extension for summarizing, grammar checking, etc., where those 1-3B models are more than sufficient.
estebansaa@reddit
That is a great question. I'd imagine llama.cpp is much faster? Also, how big is the weight file?
bwjxjelsbd@reddit
Can I change it to the 3B model? For me the 1B model is not that great kek
Rangizingo@reddit
It's pretty fucking cool that we have good small models now that can be run locally, much less in browser. This is sweet.
Mkengine@reddit
Unfortunately the answers are much better in English than in German. Would it be possible to finetune it for a specific language?
silenceimpaired@reddit
I’m curious how this would work… ask it to translate a question in German to English. Ask the English question… then ask it to translate the English answer to German.
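A rough sketch of that round trip, assuming the same Transformers.js pipeline as in the earlier sketch; the `ask` helper and the prompts are illustrative only, not a tested recipe:

```js
import { pipeline } from "@huggingface/transformers";

const generator = await pipeline(
  "text-generation",
  "onnx-community/Llama-3.2-1B-Instruct-q4f16",
  { device: "webgpu" },
);

// Helper: one chat turn, returning only the assistant's reply text.
async function ask(prompt) {
  const out = await generator([{ role: "user", content: prompt }], {
    max_new_tokens: 256,
  });
  return out[0].generated_text.at(-1).content;
}

const germanQuestion = "Warum ist der Himmel blau?";
const englishQuestion = await ask(`Translate this to English: ${germanQuestion}`);
const englishAnswer = await ask(englishQuestion);
const germanAnswer = await ask(`Translate this to German: ${englishAnswer}`);
console.log(germanAnswer);
```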
privacyparachute@reddit
Since there is a base model available this should be possible. Transformers.js should be able to run the finetune once you export an ONNX version of the model.
Time-Plum-7893@reddit
What does it mean to run it locally? Can it run offline? So is it ready for local production deployment?
awomanaftermidnight@reddit
Mikitz@reddit
That's literally the same output I get for every single message I send to it 😂
omercelebi00@reddit
Can't wait to run it on my smart watch..
Lechowski@reddit
Maybe I was too optimistic trying to run this on my Android phone....
It loaded at least
_meaty_ochre_@reddit
Yeah, WebGPU is a lot closer to full support than it used to be, but it’s nowhere near universal yet. https://caniuse.com/webgpu
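For anyone wiring this into their own page, a small feature check before trying to load the model (standard WebGPU API, no library assumptions):

```js
// `navigator.gpu` only exists in browsers that ship WebGPU;
// requestAdapter() can still resolve to null on unsupported hardware.
async function hasWebGPU() {
  if (!("gpu" in navigator)) return false;
  const adapter = await navigator.gpu.requestAdapter();
  return adapter !== null;
}

if (!(await hasWebGPU())) {
  console.warn("WebGPU not available - fall back to a WASM backend.");
}
```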
Captain_Pumpkinhead@reddit
I've never heard of WebGPU before today. I might have to try it out!
privacyparachute@reddit
You could try the Wllama or WebLLM version.
Wllama demo:
https://huggingface.co/spaces/ngxson/wllama
WebLLM demo:
https://chat.webllm.ai/
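For comparison, a hedged sketch of the WebLLM route via the `@mlc-ai/web-llm` package; the prebuilt model ID below is an assumption, so check WebLLM's model list for the exact identifier:

```js
import { CreateMLCEngine } from "@mlc-ai/web-llm";

// Model ID is an assumption: look it up in WebLLM's prebuilt model list.
const engine = await CreateMLCEngine("Llama-3.2-1B-Instruct-q4f16_1-MLC");

// WebLLM exposes an OpenAI-style chat completions interface.
const reply = await engine.chat.completions.create({
  messages: [{ role: "user", content: "Hello!" }],
});
console.log(reply.choices[0].message.content);
```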
hummingbird1346@reddit
Wait, is the web version the 1B one or the 3B? I was able to run the 1B smoothly on Android but it wasn't coherent at all.
Any attempt to even load the 3B crashed the app, though. The RAM was just not enough. (Samsung A52 5G)
khromov@reddit
Crashes for me on compiling shaders, even though the phone should have enough RAM to handle it. 😿 (Chrome/Android 14)
Lechowski@reddit
Same, it's crashing on my S24 Ultra, so it seems shader compilation isn't supported on Android.
Hubrex@reddit
Soon.
ScoreUnique@reddit
Feel you bruh
Worldly_Dish_48@reddit
Really cool! What are your specs?
Shot_Platypus4420@reddit
Cool. I’m not an expert on LLMs, but it seems to me that the models from Meta are the most censored and inclined to give general answers.
Original_Finding2212@reddit
Which number of parameters?
CommunismDoesntWork@reddit
How does it handle OOM issues?
privacyparachute@reddit
This is a bit of a sore point with WASM (WebAssembly). I couldn't find the article I wanted to link to here, but the gist is that it's hard to predict how much memory you want to reserve, or even to know how much is really available.
You can of course catch OOM events, and inform the user that the WASM instance has crashed. RangeErrors galore.
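A minimal sketch of that "catch it and tell the user" approach, assuming the Transformers.js pipeline from earlier with the WASM backend; this is best-effort detection after the fact, not prevention:

```js
import { pipeline } from "@huggingface/transformers";

let generator;
try {
  generator = await pipeline(
    "text-generation",
    "onnx-community/Llama-3.2-1B-Instruct-q4f16",
    { device: "wasm" },
  );
} catch (err) {
  // WASM allocation failures often surface as RangeErrors,
  // but the instance may also simply abort, so treat this as best effort.
  if (err instanceof RangeError) {
    alert("Ran out of memory while loading the model. Try a smaller model or reload the page.");
  } else {
    throw err;
  }
}
```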
Due_Effect_5414@reddit
Since WebGPU runs on Vulkan, Direct3D and Metal, does that mean it's basically agnostic for inference on Mac/NVIDIA/AMD?
privacyparachute@reddit
Yes. There are however other limitations: Safari and Firefox still don't have WebGPU support enabled by default in the stable releases. Shouldn't be too far off though.
No_Afternoon_4260@reddit
You only have q4 and q8 with Transformers, right?
CheatCodesOfLife@reddit
FP16 as well if you want
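A quick sketch of how that choice is expressed in Transformers.js, assuming the `dtype` pipeline option; which values ("fp16", "q8", "q4", "q4f16", ...) actually load depends on the ONNX weight files the model repo ships:

```js
import { pipeline } from "@huggingface/transformers";

// dtype selects which quantized ONNX weights to fetch; the repo linked
// above ships q4f16, so other dtypes may not be available for it.
const generator = await pipeline(
  "text-generation",
  "onnx-community/Llama-3.2-1B-Instruct-q4f16",
  { device: "webgpu", dtype: "q4f16" },
);
```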
After-Main567@reddit
Starting out, I got 10 tokens/s on my Google Pixel 9 Pro. It went slower and slower as the context got longer.
habiba2000@reddit
This is honestly quite cool. I tried it with coding, and unfortunately it would get stuck repeating particular lines (is there a technical term for this?).
It may just be a parameter issue and not the model itself.