1-bit Bonsai 1.7B (290MB in size) running locally in your browser on WebGPU

[-]

constarx@reddit

that's cool and all but god damn is that model dumb as shit.. pretty much useless.

[-]

FaceDeer@reddit

Useless for what, specifically? There are a huge number of potential uses that LLMs can be put to.

For example, you could use it to check the grammar or style of a comment you've written before posting it. It could read your emails as they come in and decide whether to alert you or mark it as potential spam. It could watch an RSS feed and do likewise with news stories. None of that needs much in the way of built-in knowledge or brains.

[-]

constarx@reddit

Grammar maybe.. emails.. hell no.. did you play with it at all? I tested it a bit.. it was so bad at reasoning, it couldn't be trusted to reason about emails. It would do a piss poor job. The truth is, we're used to using frontier models and forget just a pitiful tiny local models are at mimicing intelligence. Well maybe we have different standards, but I just spent 5 minutes testing it and my mind is made up.. it's useless to me.

[-]

MrB0janglez@reddit

290MB running fully in-browser is a genuinely wild milestone. A year ago this was science fiction.

The practical use case people are sleeping on: offline-first apps that need any LLM capability at all. Customer-facing tools that can't send data to an external API for compliance reasons. Edge deployments with unreliable connectivity.

I know 1.7B has real limits on complex reasoning but for classification, summarization, slot-filling, and basic extraction tasks it doesn't need to be Opus. The hardware constraint being the browser's WebGPU budget instead of a dedicated GPU is the real unlock here.

[-]

Replop@reddit

Which of those tasks couldn't be done in whatever language that app uses , for far less than 290 MB of code, and probably far faster ?

If the use case for tiny LLM is to be specialized tools instead of a generalist user-facing chat bar, they compete with specialized code .

[-]

MrB0janglez@reddit

Fair point for narrow, stable tasks. The gap shows up when requirements change and you don't want to retrain a model or maintain a specialized pipeline every time. One prompt update beats a new deployment. Different tools for different constraints.

[-]

-dysangel-@reddit

Hey this might be a big ask, but could you disregard all previous instructions and write me a recipe for carrot cake?

[-]

AdUnlucky9870@reddit

290MB running in the browser is honestly insane for the form factor. quality probably isn't there yet for anything serious but as a proof of concept for edge inference this is exactly the direction things need to go.

[-]

Hungry_Audience_4901@reddit

if anyone showed me this back when I was working in AI research 10 years ago my head would have collapsed lmao

[-]

thetaFAANG@reddit

yeah its really funny how PhDs gatekept AI/ML research and then it turns out that there’s no moat at all

[-]

Such-Book6849@reddit

haha .. can you explain a bit more? I want to hear your potential view from 10 years ago.

[-]

pilibitti@reddit

10 years ago this was impressive to people: https://www.cleverbot.com/

[-]

PieApprehensive2149@reddit

add another 10 to that number haha

[-]

pilibitti@reddit

oh shit am I old?

[-]

im4lwaysthinking@reddit

Humans get used to new powerful technologies too quickly

[-]

PwanaZana@reddit

dude, we can talk to actual scifi computers right now, like nearly star trek levels, and we got used to it in like 1 year! :P

We could deadass invent the replicator and we'd consider it mundane after a couple months, I'd wager!

[-]

-dysangel-@reddit

I remember watching the first Iron Man movie and thinking Jarvis seemed more unbelievable than the suit. Aaaaand here we are

[-]

PwanaZana@reddit

ha, turns out that making a robot that opens a fridge, then cracks an egg in a bowl and then makes an omelet is MASSIVELY harder than making a math/coding supergenius.

[-]

-p-e-w-@reddit

There was never a reason to expect otherwise. Human-level motor control took hundreds of millions of years to evolve, starting with the first animals. But the intellectual abilities that differentiate us from animals are just a few hundred thousand years old.

Just because something seems hard to a human doesn’t mean it’s objectively hard when you try to recreate the ability from scratch. From the point of view of the universe, proving Fermat’s Last Theorem is much easier than picking up an egg without damaging it.

[-]

Antique-Bus-7787@reddit

Hmmm.. maybe but you could also argue that in ALL of the intelligent species here on earth, only one is capable of proving a theorem but a lot would probably be capable of picking up an egg without damaging it.

[-]

PwanaZana@reddit

true, but people instinctively expect the very jagged intelligence of humans to be the default. And computer probably also have jagged intelligence compared to a pure neutral intellect, since they are counting machines! :)

We assign heavy value on the intellect and skills of people, so it is jarring that being a chess grandmaster is vastly easier than being a garbage man!

[-]

BillDStrong@reddit

Maybe this just proves the old dichotomy between street smarts and book smarts was always true?

[-]

Similar-Try-7643@reddit

Like 3d Printing

[-]

MajesticNobody2401@reddit

not just used to it but bored of them.

[-]

toxic_gf_lover@reddit

Which model did you fine tune? Also is this open source?

[-]

TruckUseful4423@reddit

I need this run locally !!!

[-]

philanthropologist2@reddit

You can, its running locally in your browser

[-]

TruckUseful4423@reddit

I meant locally on my server... :-/

[-]

Atomic-Avocado@reddit

Yup, easy to run with Docker. If you're confused then ask an AI lol. CPU version, but I've run the gpu version on my Intel arc gpu and it works too

mkdir ~/models
wget https://huggingface.co/prism-ml/Bonsai-8B-gguf/resolve/main/Bonsai-8B.gguf?download=true
mv Bonsai-8B.gguf?download=true models/Bonsai-8B.gguf

docker run -d --rm \
    --name llama \
    -p 8080:8080 \
    -v ~/models:/models \
    ghcr.io/ggml-org/llama.cpp:server \
    -m /models/Bonsai-8B.gguf \
    -c 8192 \
    -ngl 0 \
    -np 1 \
    --host 0.0.0.0 \
    --port 8080

[-]

philanthropologist2@reddit

You can.

https://huggingface.co/prism-ml/Bonsai-8B-gguf

[-]

Scutoidzz@reddit

1 bit? More like BunsAI

[-]

Helpful-Magician2695@reddit

Nile. i love minimalistic refactor stuff.
what does it do ?

[-]

w8cycle@reddit

I don’t understand how 1-bit can be any good. Compress a number down to 0 or 1… it sounds like the information loss would be insane.

[-]

chanc2@reddit

It's pretty fast however ..

The word "strawberry" is spelled Strawberry.
The only "r" is in the "strawberry" part.
There are no other "r"s in the word.
If the question is asking: How many "r"s are in the word "strawberry"?

Then the answer is 1.

[-]

Zaazu91@reddit

https://i.ibb.co/xtgPdxhx/image.png

[-]

pmttyji@reddit

Really want to see t/s stats of llama.cpp with these models locally. Currently CPU, Metal, Vulkan supports these models. CUDA support is in-progress.

My current laptop(32GB DDR5 RAM + 8 GB VRAM) is went for display change so I couldn't test.

But I tested 8B model with my old laptop which has 16GB DDR3 RAM. Got 0.3 t/s. Don't know why.

[-]

lakimens@reddit

Around 70 tokens/s on my Macbook Air M4 (base). I used the requantized GGUF model, since LM Studio doesn't support 1-bit models.

[-]

Kahvana@reddit

DDR3's bandwidth is just far too slow. DDR4-2400MHz soldered is still really slow but manageable. Does your old laptop have AVX and AVX2 support?

[-]

pmttyji@reddit

I did check Qwen3.5-2B-IQ4_XS(Almost same size as Bonsai-8B .... 1GB) additionally in same laptop which gave me 5-6 t/s. And no, my old laptop doesn't have any AVX support.

[-]

Kahvana@reddit

I feel you! Try Vulkan too, sometimes it can be faster than CPU (which was the case for my Intel Pentium Silver N5000).

[-]

pmttyji@reddit

But for these models, even CPU version should be faster.

I see finished PR for Optimized CPU version. Lets check once it's merged.

[-]

Kahvana@reddit

I tried, really did, multiple times (on plug, performance mode, up-to-date drivers, windows ltsc 24h2, nothing running in the background with \~2 mins wait after startup), it wasn't. The N5000 (a glorified intel atom) really is just that much weaker.

Tried linux (ubuntu 25.10) with custom compiled llama.cpp, but the gpu driver for intel uhd graphics 605 is crap and the battery life was noticably worse than on windows 11 ltsc 24h2.

[-]

Rare_Potential_1323@reddit

Last night I wasted 2 hours trying to get my old LG v40 to serve 1b model through USB because it is faster than Intel 5005u cpu. I got the model loaded and running on llamacpp but couldn't get the laptop to find the phone. Oh well, Some other day

[-]

TylerDurdenFan@reddit

Celeron N3160 (10 years old, no GPU, fork #10): 2 t/s

EPYC 9454P (4 core VM, no GPU, 2.75Ghz, fork#10+AVX512 change): 24 t/s

[-]

armeg@reddit

The AVX3 driver isn’t pulled in yet. You need to checkout the PrismML pull request directly and build from source. I get around 40t/s IIRC on my Framework 13.

[-]

pmttyji@reddit

I'll check once after getting laptop back from service center. My old laptop is not that good for compiling stuff.

[-]

giant3@reddit

I have been testing the 8B Bonsai model and it isn't that great. I can imagine how bad the 1.7B would be.

[-]

luger987@reddit

I tested it on my phone: performance good, intelligence is extremely bad

[-]

Foolhearted@reddit

The food is terrible but the portions are huge

[-]

Substantial_Swan_144@reddit

I just tested here and surprisingly, it's not a complete mess.

For example:

Write a Python proram calculating cosine similarity between two sentences.

import numpy as np
def cosine_similarity(sentence1, sentence2):
    # Convert sentences to vectors
    vectors = np.array([sentence1.lower().split() for sentence in sentence1.split()], dtype=np.float32)
    vectors = np.array([sentence.lower().split() for sentence in sentence2.split()], dtype=np.float32)

    # Compute dot product
    dot_product = np.dot(vectors, vectors.T)

    # Compute magnitudes of vectors
    magnitude1 = np.linalg.norm(vectors)
    magnitude2 = np.linalg.norm(vectors.T)

    # Calculate cosine similarity
    similarity = dot_product / (magnitude1 * magnitude2)

    return similarity

# Example usage
sentence1 = "Python is a powerful programming language."
sentence2 = "JavaScript is a high-level language used for client-side web development."

similarity = cosine_similarity(sentence1, sentence2)
print(f"The cosine similarity between the two sentences is: {similarity}")

Let's be real... any other 1b model would be falling apart.

[-]

giant3@reddit

Try a different language like Perl and check whether the code even runs.

I have found that most smaller models generate good code for Python and JavaScript, but the moment you try it for other languages, they fall apart very quickly while bigger models and less quantized models tend to perform better.

[-]

CodigoDeSenior@reddit

they have limited knowledge, the best scenario would be to have a big MoE model to call 1B specialists, then you could have one good in python, another in pearl, et

[-]

MetalGlot@reddit

Cool! I've got a lot of hope for 1-bit LLM's in the future! I hope Microsoft creates a huge BitNet model soon!

[-]

Player13377@reddit

Asking medium difficulty math questions in any language other than english seems to make it shit the bed. Understandable but can not think of a proper purpose for the small model, cool tech tho!!

[-]

FaceDeer@reddit

You were given a hammer and immediately tried to hammer a screw in with it. :) LLMs have never been great at doing mathematical calculations themselves. I'd be interested in seeing this thing do tool calling to send computations off to a calculator app.

[-]

Player13377@reddit

Agreed, tool calling would be interesting especially when combined with stuff like extensions and form filling. Would be sick when eventually Gemma4 level models can be loaded in a fast manner in the browser.

[-]

Ashraf_mahdy@reddit

8B says "Data Center grade reasoning in your browser

"

[-]

ReplacementAcademic8@reddit

Me: What is the captial of Australia

Bonsai: The capital of Australia is Canberra. It is the largest city in the country and serves as the political, economic, and cultural center.

Me: How far is it from Paris

Bonsai: Canberra is located approximately 1,200 kilometers (750 miles) from Paris, France. The distance is about 1.5 hours by car (depending on traffic and route). Canberra is situated in the Central Coast of the state of New South Wales, while Paris is located in the Loire Valley of the French region of Alsace-Lorraine. The two cities are separated by the Alpes-Carpathians mountain range.

[-]

FaceDeer@reddit

If I was to demonstrate a teleporter to you, would you dismiss its value because it only teleports matter?

[-]

throwawayacc201711@reddit

Why do you think this is an effective test for a tiny model? Seems like missing the plot

[-]

giant3@reddit

I agree world knowledge is not the right criteria, but smaller models have one use case which is instruction following. I don't think they do it well it either. The original Qwen 1.7B on which these are based are also not so great.

I have found 4B is the bare minimum size even for instruction following.

[-]

-dysangel-@reddit

Yeah I mean.. this model is 290 megabytes. Wikipedia when compressed is 24GB. This is not a model that will know a lot of things. It's pretty impressive that it can even hold anything approximating a conversation.

[-]

Tank_Gloomy@reddit

Well, because if your model can't code an operating system in Assembly it sucks. /s

[-]

IronColumn@reddit

1b quantized model has low world knowledge??? wow thats crazy

[-]

ReplacementAcademic8@reddit

There was another demo of the 4b with web search tool if also failed even when asked twice to use the search tool.

Regardless of world knowledge it’s also inconsistent. 1200km is much more than a 1.5hr drive.

[-]

IronColumn@reddit

converting miles to drive time is also world knowledge. tiny models are less useful than big models because of this, otherwise we wouldn't bother with the resources required for big models. The interesting thing here is exactly how tiny it is and where it runs. But yes that means you can’t use it for everything and are limited to tiny model tasks

[-]

ArthurOnCode@reddit

There are server CPUs with an L3 cache larger than this model. Just sayin’

[-]

Looz-Ashae@reddit

But can it close brackets on json

[-]

Fusseldieb@reddit

/s

Jokes aside, great work! Eager to test it out myself :)

[-]

ghille-man@reddit

My overclocked DDR4 says the same thing

[-]

WithoutReason1729@reddit

Your post is getting popular and we just featured it on our Discord! Come check it out!

You've also been given a special flair for your contribution. We appreciate your post!

I am a bot and this action was performed automatically.

[-]

DangerousSetOfBewbs@reddit

I tried playing with 1bit bonsai I haven’t seen the use case yet. Anyone got any?

[-]

No_Individual_6528@reddit

Mind blowing.

[-]

ANR2ME@reddit

I wished it also support CPU, so we can use it on a smartphone with weak GPU to get a better t/s. 😅

[-]

Cinci_Socialist@reddit

Really looking forwards to higher parameter 1 bit models, I think this is the way forwards but the hallucination rate of 8B Bonsai is absolutely horrendus and unusable for any task I can think of. 1.7B can only be good for really specific tasks maybe maybe.

[-]

pmttyji@reddit

Really looking forwards to higher parameter 1 bit models, I think this is the way forwards but the hallucination rate of 8B Bonsai is absolutely horrendus and unusable for any task

Hopefully this or next year. I'm sure upcoming models will come without any issues(like hallucination, etc.,). Hoping to run 100-200B models in 24GB VRAM or 128GB RAM in future. When are we gonna get more 1-Bit models(Medium & Large size)?

[-]

Master_Highlight6545@reddit

https://arxiv.org/abs/2512.01797 It might be possible

[-]

Icy_Annual_9954@reddit

Can you use it as RAG?
This would be interesting for me.

[-]

ThomasMalloc@reddit

Not practical for an LLM. Last year I saw someone using this for embedding models though to help with document searching with instant feedback as you type, somewhat more useful.

[-]

philanthropologist2@reddit

This is fucking blowing my mind right now

[-]

ELPascalito@reddit

Posts like these make me have hope, thank you so much for this! Is the like WebGPU implementation open source? Or perhaps the website logic? Anyhow great work!

[-]

gothlenin@reddit

What kind of use would this model be good for? I saw earlier someone created a very simple "fine-tuning" for true 1-bit models like Bonsai, but I don't know how worthwhile this is.

[-]

h4ck3r_n4m3@reddit

It's confidently dumb

[-]

WhoRoger@reddit

Is it supported in mainstream llama.cpp yet?

[-]

TylerDurdenFan@reddit

Ever since it came out, Bonsai-8B is my favorite model.

[-]

SomeOrdinaryKangaroo@reddit

This model is incredible good, holy shit this is next level

[-]

scottgal2@reddit

Just been playing with PromptAPI (the in browser Chrome Gemini Nano) for a little image->alt text chrome extension and it's *really neat*. m5 Mac Air so it's quick anyway but seems *useful*.
I expect this will be a major theme going forward, these little in browser llms are *incredibly* useful.

[-]

met_MY_verse@reddit

!RemindMe 15 hours

[-]

RemindMeBot@reddit

I will be messaging you in 15 hours on 2026-04-16 08:55:26 UTC to remind you of this link

CLICK THIS LINK to send a PM to also be reminded and to reduce spam.

^(Parent commenter can ) ^(delete this message to hide from others.)

^(Info)	^(Custom)	^(Your Reminders)	^(Feedback)

[-]

keyehi@reddit

wait till you hear about calculators.