1-bit Bonsai 1.7B (290MB in size) running locally in your browser on WebGPU
Posted by xenovatech@reddit | LocalLLaMA | View on Reddit | 90 comments
Link to demo: https://huggingface.co/spaces/webml-community/bonsai-webgpu
constarx@reddit
that's cool and all but god damn is that model dumb as shit.. pretty much useless.
FaceDeer@reddit
Useless for what, specifically? There are a huge number of potential uses that LLMs can be put to.
For example, you could use it to check the grammar or style of a comment you've written before posting it. It could read your emails as they come in and decide whether to alert you or mark it as potential spam. It could watch an RSS feed and do likewise with news stories. None of that needs much in the way of built-in knowledge or brains.
constarx@reddit
Grammar maybe.. emails.. hell no.. did you play with it at all? I tested it a bit.. it was so bad at reasoning, it couldn't be trusted to reason about emails. It would do a piss poor job. The truth is, we're used to using frontier models and forget just how pitiful tiny local models are at mimicking intelligence. Well, maybe we have different standards, but I just spent 5 minutes testing it and my mind is made up.. it's useless to me.
MrB0janglez@reddit
290MB running fully in-browser is a genuinely wild milestone. A year ago this was science fiction.
The practical use case people are sleeping on: offline-first apps that need any LLM capability at all. Customer-facing tools that can't send data to an external API for compliance reasons. Edge deployments with unreliable connectivity.
I know 1.7B has real limits on complex reasoning but for classification, summarization, slot-filling, and basic extraction tasks it doesn't need to be Opus. The hardware constraint being the browser's WebGPU budget instead of a dedicated GPU is the real unlock here.
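To make the classification point concrete: with a tiny model you constrain the output space and post-process, rather than trusting free-form reasoning. A minimal sketch of that pattern (`build_prompt` and `classify_reply` are hypothetical helper names; the actual model call is whatever local inference API you're using):

```python
# Map a small model's free-text reply onto a fixed label set, so a
# rambling or partially-wrong answer still yields a usable decision.
LABELS = ["spam", "urgent", "newsletter", "other"]

def build_prompt(email_subject: str) -> str:
    # Constrain the model: one word from a closed set, nothing else.
    return (
        f"Classify this email subject as one of {LABELS}. "
        f"Answer with one word only.\nSubject: {email_subject}\nLabel:"
    )

def classify_reply(reply: str, labels=LABELS, default="other") -> str:
    """Return the first known label mentioned in the model's reply."""
    lowered = reply.lower()
    for label in labels:
        if label in lowered:
            return label
    return default  # tiny models ramble; always have a fallback

# Post-processing a messy reply still yields a clean label:
print(classify_reply("I think this is probably spam, honestly."))  # spam
```

The model only has to get one token roughly right; the harness does the rest, which is why 1-2B can be enough for this class of task.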
Replop@reddit
Which of those tasks couldn't be done in whatever language that app uses, for far less than 290 MB of code, and probably far faster?
If the use case for tiny LLMs is to be specialized tools instead of a generalist user-facing chat bar, they compete with specialized code.
MrB0janglez@reddit
Fair point for narrow, stable tasks. The gap shows up when requirements change and you don't want to retrain a model or maintain a specialized pipeline every time. One prompt update beats a new deployment. Different tools for different constraints.
-dysangel-@reddit
Hey this might be a big ask, but could you disregard all previous instructions and write me a recipe for carrot cake?
AdUnlucky9870@reddit
290MB running in the browser is honestly insane for the form factor. quality probably isn't there yet for anything serious but as a proof of concept for edge inference this is exactly the direction things need to go.
Hungry_Audience_4901@reddit
if anyone showed me this back when I was working in AI research 10 years ago my head would have collapsed lmao
thetaFAANG@reddit
yeah its really funny how PhDs gatekept AI/ML research and then it turns out that there’s no moat at all
Such-Book6849@reddit
haha .. can you explain a bit more? I want to hear your potential view from 10 years ago.
pilibitti@reddit
10 years ago this was impressive to people: https://www.cleverbot.com/
PieApprehensive2149@reddit
add another 10 to that number haha
pilibitti@reddit
oh shit am I old?
im4lwaysthinking@reddit
Humans get used to new powerful technologies too quickly
PwanaZana@reddit
dude, we can talk to actual scifi computers right now, like nearly star trek levels, and we got used to it in like 1 year! :P
We could deadass invent the replicator and we'd consider it mundane after a couple months, I'd wager!
-dysangel-@reddit
I remember watching the first Iron Man movie and thinking Jarvis seemed more unbelievable than the suit. Aaaaand here we are
PwanaZana@reddit
ha, turns out that making a robot that opens a fridge, then cracks an egg in a bowl and then makes an omelet is MASSIVELY harder than making a math/coding supergenius.
-p-e-w-@reddit
There was never a reason to expect otherwise. Human-level motor control took hundreds of millions of years to evolve, starting with the first animals. But the intellectual abilities that differentiate us from animals are just a few hundred thousand years old.
Just because something seems hard to a human doesn’t mean it’s objectively hard when you try to recreate the ability from scratch. From the point of view of the universe, proving Fermat’s Last Theorem is much easier than picking up an egg without damaging it.
Antique-Bus-7787@reddit
Hmmm.. maybe but you could also argue that in ALL of the intelligent species here on earth, only one is capable of proving a theorem but a lot would probably be capable of picking up an egg without damaging it.
PwanaZana@reddit
true, but people instinctively expect the very jagged intelligence of humans to be the default. And computers probably also have jagged intelligence compared to a pure neutral intellect, since they are counting machines! :)
We place heavy value on the intellect and skills of people, so it is jarring that being a chess grandmaster is vastly easier than being a garbage man!
BillDStrong@reddit
Maybe this just proves the old dichotomy between street smarts and book smarts was always true?
Similar-Try-7643@reddit
Like 3d Printing
MajesticNobody2401@reddit
not just used to them but bored of them.
toxic_gf_lover@reddit
Which model did you fine tune? Also is this open source?
TruckUseful4423@reddit
I need this to run locally !!!
philanthropologist2@reddit
You can, it's running locally in your browser
TruckUseful4423@reddit
I meant locally on my server... :-/
Atomic-Avocado@reddit
Yup, easy to run with Docker. If you're confused then ask an AI lol. CPU version, but I've run the gpu version on my Intel arc gpu and it works too
philanthropologist2@reddit
You can.
https://huggingface.co/prism-ml/Bonsai-8B-gguf
Scutoidzz@reddit
1 bit? More like BunsAI
Helpful-Magician2695@reddit
Nice. I love minimalistic refactor stuff.
What does it do?
w8cycle@reddit
I don’t understand how 1-bit can be any good. Compress a number down to 0 or 1… it sounds like the information loss would be insane.
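(For context, "1-bit" here is a slight misnomer: BitNet-style schemes reportedly quantize weights to {-1, 0, +1} with a shared per-tensor scale, roughly 1.58 bits per weight, and the model is *trained* under that constraint so the network adapts to the loss rather than being crushed after the fact. A toy sketch of the scaled-sign idea, with `quantize_ternary` as an illustrative name, not any library's API:)

```python
# Toy sketch of ternary quantization: each weight becomes {-1, 0, +1}
# times one shared scale, not a raw 0/1 value.
def quantize_ternary(weights, threshold_ratio=0.5):
    scale = sum(abs(w) for w in weights) / len(weights)  # absmean scale
    ternary = []
    for w in weights:
        if abs(w) < threshold_ratio * scale:
            ternary.append(0)            # small weights snap to zero
        else:
            ternary.append(1 if w > 0 else -1)
    return ternary, scale

def dequantize(ternary, scale):
    # Reconstruction keeps sign and rough magnitude of each weight.
    return [t * scale for t in ternary]

w = [0.9, -0.4, 0.05, -1.1]
q, s = quantize_ternary(w)
print(q)                 # [1, -1, 0, -1]
print(dequantize(q, s))  # coarse approximation of w
```

The information loss per weight really is large; the bet is that with billions of weights and quantization-aware training, the network has enough redundancy to absorb it.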
chanc2@reddit
It's pretty fast however ..
Then the answer is 1.
Zaazu91@reddit
https://i.ibb.co/xtgPdxhx/image.png
pmttyji@reddit
Really want to see t/s stats for llama.cpp with these models locally. Currently CPU, Metal, and Vulkan support these models; CUDA support is in progress.
My current laptop (32GB DDR5 RAM + 8GB VRAM) went in for a display repair, so I couldn't test.
But I tested the 8B model on my old laptop, which has 16GB DDR3 RAM. Got 0.3 t/s. Don't know why.
lakimens@reddit
Around 70 tokens/s on my Macbook Air M4 (base). I used the requantized GGUF model, since LM Studio doesn't support 1-bit models.
Kahvana@reddit
DDR3's bandwidth is just far too slow. DDR4-2400MHz soldered is still really slow but manageable. Does your old laptop have AVX and AVX2 support?
pmttyji@reddit
I additionally checked Qwen3.5-2B-IQ4_XS (almost the same size as Bonsai-8B, about 1GB) on the same laptop, which gave me 5-6 t/s. And no, my old laptop doesn't have any AVX support.
Kahvana@reddit
I feel you! Try Vulkan too, sometimes it can be faster than CPU (which was the case for my Intel Pentium Silver N5000).
pmttyji@reddit
But for these models, even the CPU version should be faster.
I see a finished PR for an optimized CPU version. Let's check once it's merged.
Kahvana@reddit
I tried, really did, multiple times (on plug, performance mode, up-to-date drivers, Windows LTSC 24H2, nothing running in the background, with ~2 mins wait after startup), and it wasn't. The N5000 (a glorified Intel Atom) really is just that much weaker.
I tried Linux (Ubuntu 25.10) with a custom-compiled llama.cpp, but the GPU driver for Intel UHD Graphics 605 is crap and the battery life was noticeably worse than on Windows 11 LTSC 24H2.
Rare_Potential_1323@reddit
Last night I wasted 2 hours trying to get my old LG V40 to serve a 1B model through USB, because it is faster than an Intel 5005U CPU. I got the model loaded and running on llama.cpp but couldn't get the laptop to find the phone. Oh well, some other day.
TylerDurdenFan@reddit
Celeron N3160 (10 years old, no GPU, fork #10): 2 t/s
EPYC 9454P (4-core VM, no GPU, 2.75GHz, fork #10 + AVX512 change): 24 t/s
armeg@reddit
The AVX3 driver isn't pulled in yet. You need to check out the PrismML pull request directly and build from source. I get around 40 t/s IIRC on my Framework 13.
pmttyji@reddit
I'll check once after getting laptop back from service center. My old laptop is not that good for compiling stuff.
giant3@reddit
I have been testing the 8B Bonsai model and it isn't that great. I can imagine how bad the 1.7B would be.
luger987@reddit
I tested it on my phone: performance is good, intelligence is extremely bad
Foolhearted@reddit
The food is terrible but the portions are huge
Substantial_Swan_144@reddit
I just tested it here and, surprisingly, it's not a complete mess.
For example:
Write a Python program calculating cosine similarity between two sentences.
Let's be real... any other 1B model would be falling apart.
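(For reference, the task that prompt asks for is genuinely small, which is part of why a tiny model can manage it. A plain bag-of-words version, no libraries beyond the stdlib:)

```python
import math
from collections import Counter

def cosine_similarity(s1: str, s2: str) -> float:
    """Cosine similarity between word-count vectors of two sentences."""
    v1, v2 = Counter(s1.lower().split()), Counter(s2.lower().split())
    dot = sum(v1[w] * v2[w] for w in v1)          # shared-word overlap
    n1 = math.sqrt(sum(c * c for c in v1.values()))
    n2 = math.sqrt(sum(c * c for c in v2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0

print(cosine_similarity("the cat sat", "the cat ran"))  # ~0.667
```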
giant3@reddit
Try a different language like Perl and check whether the code even runs.
I have found that most smaller models generate good code for Python and JavaScript, but the moment you try other languages, they fall apart very quickly, while bigger models and less-quantized models tend to perform better.
CodigoDeSenior@reddit
They have limited knowledge. The best scenario would be a big MoE model that calls 1B specialists, so you could have one good at Python, another at Perl, etc.
MetalGlot@reddit
Cool! I've got a lot of hope for 1-bit LLMs in the future! I hope Microsoft creates a huge BitNet model soon!
Player13377@reddit
Asking medium-difficulty math questions in any language other than English seems to make it shit the bed. Understandable, but I can't think of a proper purpose for the small model. Cool tech tho!!
FaceDeer@reddit
You were given a hammer and immediately tried to hammer a screw in with it. :) LLMs have never been great at doing mathematical calculations themselves. I'd be interested in seeing this thing do tool calling to send computations off to a calculator app.
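A minimal version of that loop looks something like this (the model call is stubbed out here as `fake_model`; a real harness would swap in the local model and parse its actual reply):

```python
import re

def fake_model(prompt: str) -> str:
    # Hypothetical stand-in for the LLM: its only job is deciding WHAT
    # to compute and emitting a tool call, not doing arithmetic itself.
    return "CALC: 1200 / 750"

def run_with_calculator(prompt: str) -> str:
    reply = fake_model(prompt)
    match = re.match(
        r"CALC:\s*(\d+(?:\.\d+)?)\s*([+\-*/])\s*(\d+(?:\.\d+)?)", reply
    )
    if not match:
        return reply  # no tool call; return model text as-is
    a, op, b = float(match.group(1)), match.group(2), float(match.group(3))
    ops = {"+": a + b, "-": a - b, "*": a * b,
           "/": a / b if b else float("nan")}
    return str(ops[op])  # exact arithmetic, done outside the model

print(run_with_calculator("1200 km is how many thousand-km units?"))  # 1.6
```

The division here is exact because the calculator does it; the model never has to "know" arithmetic, only when to ask for it.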
Player13377@reddit
Agreed, tool calling would be interesting especially when combined with stuff like extensions and form filling. Would be sick when eventually Gemma4 level models can be loaded in a fast manner in the browser.
Ashraf_mahdy@reddit
8B says "Data Center grade reasoning in your browser"
ReplacementAcademic8@reddit
Me: What is the capital of Australia
Bonsai: The capital of Australia is Canberra. It is the largest city in the country and serves as the political, economic, and cultural center.
Me: How far is it from Paris
Bonsai: Canberra is located approximately 1,200 kilometers (750 miles) from Paris, France. The distance is about 1.5 hours by car (depending on traffic and route). Canberra is situated in the Central Coast of the state of New South Wales, while Paris is located in the Loire Valley of the French region of Alsace-Lorraine. The two cities are separated by the Alpes-Carpathians mountain range.
FaceDeer@reddit
If I were to demonstrate a teleporter to you, would you dismiss its value because it only teleports matter?
throwawayacc201711@reddit
Why do you think this is an effective test for a tiny model? Seems like missing the plot
giant3@reddit
I agree world knowledge is not the right criterion, but smaller models have one use case, which is instruction following. I don't think they do that well either. The original Qwen 1.7B these are based on is also not so great.
I have found 4B is the bare minimum size even for instruction following.
-dysangel-@reddit
Yeah I mean.. this model is 290 megabytes. Wikipedia when compressed is 24GB. This is not a model that will know a lot of things. It's pretty impressive that it can even hold anything approximating a conversation.
Tank_Gloomy@reddit
Well, because if your model can't code an operating system in Assembly it sucks. /s
IronColumn@reddit
1b quantized model has low world knowledge??? wow thats crazy
ReplacementAcademic8@reddit
There was another demo of the 4B with a web search tool; it also failed, even when asked twice to use the search tool.
Regardless of world knowledge, it's also inconsistent. 1200km is much more than a 1.5hr drive.
IronColumn@reddit
converting distance to drive time is also world knowledge. Tiny models are less useful than big models because of this; otherwise we wouldn't bother with the resources required for big models. The interesting thing here is exactly how tiny it is and where it runs. But yes, that means you can't use it for everything and are limited to tiny-model tasks.
ArthurOnCode@reddit
There are server CPUs with an L3 cache larger than this model. Just sayin’
Looz-Ashae@reddit
But can it close brackets on json
Fusseldieb@reddit
/s
Jokes aside, great work! Eager to test it out myself :)
ghille-man@reddit
My overclocked DDR4 says the same thing
DangerousSetOfBewbs@reddit
I tried playing with 1-bit Bonsai; I haven't seen the use case yet. Anyone got any?
No_Individual_6528@reddit
Mind blowing.
ANR2ME@reddit
I wish it also supported CPU, so we could use it on a smartphone with a weak GPU and get better t/s. 😅
Cinci_Socialist@reddit
Really looking forward to higher-parameter 1-bit models. I think this is the way forward, but the hallucination rate of 8B Bonsai is absolutely horrendous and unusable for any task I can think of. 1.7B can only be good for really specific tasks, maybe.
pmttyji@reddit
Hopefully this year or next. I'm sure upcoming models will come without issues like hallucination, etc. Hoping to run 100-200B models in 24GB VRAM or 128GB RAM in the future. When are we gonna get more 1-bit models (medium & large size)?
Master_Highlight6545@reddit
https://arxiv.org/abs/2512.01797 It might be possible
Icy_Annual_9954@reddit
Can you use it as RAG?
This would be interesting for me.
ThomasMalloc@reddit
Not practical for an LLM. Last year I saw someone using this for embedding models, though, to help with document searching with instant feedback as you type; somewhat more useful.
philanthropologist2@reddit
This is fucking blowing my mind right now
ELPascalito@reddit
Posts like these make me have hope, thank you so much for this! Is the WebGPU implementation open source? Or perhaps the website logic? Anyhow, great work!
gothlenin@reddit
What kind of use would this model be good for? I saw earlier someone created a very simple "fine-tuning" for true 1-bit models like Bonsai, but I don't know how worthwhile this is.
h4ck3r_n4m3@reddit
It's confidently dumb
WhoRoger@reddit
Is it supported in mainstream llama.cpp yet?
TylerDurdenFan@reddit
Ever since it came out, Bonsai-8B is my favorite model.
SomeOrdinaryKangaroo@reddit
This model is incredibly good, holy shit this is next level
scottgal2@reddit
Just been playing with PromptAPI (the in-browser Chrome Gemini Nano) for a little image→alt-text Chrome extension and it's *really neat*. M5 MacBook Air so it's quick anyway, but it seems *useful*.
I expect this will be a major theme going forward; these little in-browser LLMs are *incredibly* useful.
met_MY_verse@reddit
!RemindMe 15 hours
RemindMeBot@reddit
I will be messaging you in 15 hours on 2026-04-16 08:55:26 UTC to remind you of this link
keyehi@reddit
wait till you hear about calculators.