Bonsai models are pure hype: Bonsai-8B is MUCH dumber than Gemma-4-E2B
Posted by WeGoToMars7@reddit | LocalLLaMA | View on Reddit | 68 comments
I'm using the https://github.com/PrismML-Eng/llama.cpp fork for Bonsai, regular llama.cpp for Gemma.
Without embedding parameters:
Gemma 4 has 2.3B at 4.8 bpw (Q4_K_M) = 1104 MB
Bonsai-8B has 6.95B at 1.125 bpw (Q1_0) = 782 MB (-29% smaller)
I could've gone with a smaller quant of Gemma 4; it's just conventional wisdom not to push small models below Q4_K_M.
I might try their ternary model later, but I don't have much hope...
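The size arithmetic in the post can be sanity-checked with a few lines; the absolute MB figures depend on how quant-format overhead and units are counted, but the relative saving falls straight out of params × bits-per-weight:

```python
def weight_bits(params_billion: float, bits_per_weight: float) -> float:
    """Total bits spent on transformer weights (embeddings excluded)."""
    return params_billion * 1e9 * bits_per_weight

gemma = weight_bits(2.3, 4.8)      # Q4_K_M effective bpw
bonsai = weight_bits(6.95, 1.125)  # Q1_0 effective bpw

savings = 1 - bonsai / gemma
print(f"Bonsai is {savings:.0%} smaller")  # -> Bonsai is 29% smaller
```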
itsArmanJr@reddit
they don't seem too bad on my benchmark. decent performance. https://github.com/ArmanJR/PrismML-Bonsai-vs-Qwen3.5-Benchmark
Snoo_28140@reddit
Considering this is created from qwen 3, you could have included the original qwen 3. If it retains close to qwen 3 full precision performance, that would be quite a feat. It also helps to make an apples to apples comparison to an 8b model rather than 9b.
Feztopia@reddit
To be fair, Bonsai is based on Qwen, and the new Gemma models are just awesome for their size. They could use their quantization on the small Gemma, but do we need such small models? It would be more interesting if Google had a Gemma 4 E12B and Bonsai would shrink that.
Snoo_28140@reddit
Both are interesting. Lightning fast inference with negligible footprint AND running bigger models than we currently reasonably can.
egomarker@reddit
Everyone knows the fix. You can't rely on the world knowledge of a 1-bit model.
KaroYadgar@reddit
Indeed, it should be noted that Bonsai was built on Qwen3, not Qwen3.5, so its issues may stem from the fact that it's built on the previous generation rather than purely its quantization impacting it.
I would like to see them apply the same thing to Qwen3.5, as it'd probably result in an even faster model, and we'd be able to properly test them with other current-generation models.
WeGoToMars7@reddit (OP)
Qwen3-8B-Q4_K_M has no issue with all three questions, so it's PrismML's quant process that lobotomised it! With almost 3x the RAM used, it's not a fair comparison, but PrismML was the one to claim that their quantisation barely cost the models any score drop in benchmarks.
KaroYadgar@reddit
Ah! Such a bummer. I am truly hoping that their 1-bit quantization is still better than previous 1-bit quantized models. I am disappointed; I really hoped that PrismML's quantization scheme would live up to what they claimed.
I still wonder why nobody has tried to make a larger BitNet model.
deleted-account69420@reddit
Someone is trying
https://huggingface.co/CMSManhattan
Pyros-SD-Models@reddit
It actually lives up to the hype, but you need to understand that to reach 1-bit, something has to die. Prism’s whole claim is that if you sacrifice basically all factual knowledge of a model, you can still preserve its reasoning abilities, and yes, it does. Which is amazing, because it suggests that reasoning does not depend heavily on factual knowledge.
But this also means that asking the model random factual questions is not a good evaluation and does not say anything about the core claim. It only validates that they did, in fact, nuke factuality.
Pyros-SD-Models@reddit
I mean, it is also not fair to judge models based on three random questions (google “sample size”), and your questions do not disprove their benchmark scores either. The benchmarks are open, you can literally run the exact same evaluations yourself.
I find it quite amusing that people here love to shit-talk benchmarks, and then you look at how this sub actually tests models and it’s just some dude asking three questions.
No methodology, no ablations, no correlation study, no de-biasing. Nothing. Not even a definition of what is actually being measured. What does “dumb” even mean in your context? Are humans who don’t know bun dumber than people who do? How does “knowing bun” correlate with intelligence? Do they even correlate? Congrats, you just invented the worst benchmark possible.
And if you think this somehow disproves the work Prism did... you know, actual scientific work: building a theory, running experiments, measuring results (with quite a bit more than just three questions), and even providing the exact tooling in their repo so you can reproduce the experiments, which hundreds of people already did — then I have bad news for you.
And the benchmarks themselves make this pretty clear: there is not a single claim that Bonsai preserves factual grounding. The benchmark suite includes MMLU-R, MuSR, GSM8K, HumanEval+, IFEval, and BFCL. That covers reasoning, math, coding, instruction following, and function calling. Their core metric is “intelligence density per gigabyte.”
The entire thesis is that 1-bit quantization preserves reasoning capability at a fraction of the size... not that it preserves encyclopedic knowledge. Those are different things.
It's literally the point of their work to sacrifice factual grounding for reasoning stability, and you just tested that they did in fact sacrifice factual grounding. Amazing.
DangerousSetOfBewbs@reddit
I actually ran 4 of them in a council-type array and still could not get a simple web scraper correctly implemented. So yeah, it's good at really simple tasks, I think, just not IT.
ecompanda@reddit
the base model thing is the whole story tbh. qwen3 8B gets lapped by qwen3.5 8B on most benchmarks even at normal quants. building 1 bit compression on top of a model that's already behind is going to lose to a well quantized current gen model almost every time.
charlesrwest0@reddit
If we are being fair, Google has way more resources than the bonsai team. It's a cool proof of their concept, but I'm not sure it's really super production ready.
xadiant@reddit
Yep seems like a proof of concept, and it's super impressive.
OP is comparing apples to oranges
xrvz@reddit
Doesn't sound like a POC to me.
xadiant@reddit
"our core technology will enable industry-changing intelligence in the cloud."
Meaning the technology is allegedly there. If there is interest, there will be newer and better models.
dahood10@reddit
what kind of Gemma-4-e2b Q4_K_M is 1.1gb?
the Q4_K_M from all providers is around ~4.4gb
WeGoToMars7@reddit (OP)
It includes the embeddings and the multimodal projection matrix, I counted transformer parameters only.
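The transformer-only count OP describes amounts to subtracting the embedding table from the total parameter count. A sketch with made-up numbers (the vocab size, hidden dimension, and total are placeholders, not Gemma's actual config):

```python
# Hypothetical sizes for illustration -- not Gemma's real architecture.
vocab_size = 256_000
d_model = 2048
total_params = 2.8e9

# A tied input/output embedding table costs vocab_size * d_model params.
embedding_params = vocab_size * d_model
transformer_params = total_params - embedding_params
print(f"{transformer_params / 1e9:.2f}B transformer params")
```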
droans@reddit
If you really want to do an apples-to-apples comparison, you would use an equivalent-sized Qwen3 model. The Bonsai 8B is 1.16GB so you should find a Qwen3 model which is about 1.16GB. Qwen3-1.7B-Q4_K_M comes close.
I don't talk about how fast my car is by comparing it to a bike or to a Lambo. I would compare it to other similar cars. Why would you use Gemma4? It's an entirely different model. It's a VLM, too, so it's got quite a bit of memory dedicated to non-language parameters. And its size is a lot different than Bonsai.
It's not just you - it seems like every LLM community goes to great lengths to avoid true comparisons and it annoys me to no end.
Mushoz@reddit
But the embeddings are actually used and result in better performance. Ignoring them when comparing file sizes makes no sense. Not counting the vision part of the model does make sense.
dahood10@reddit
who cares, people still load 3.5gb in vram
the difference is 4x in size
rerri@reddit
You and op are both wrong lol
Unsloth 3.11 GB
Bartowski 3.46 GB
dahood10@reddit
my bad, but 3.5gb is still huge compared to 800mb
FullOf_Bad_Ideas@reddit
Gemma E2B is a new architecture that also attempts to pack a punch in small size. It's expected that Bonsai won't necessarily beat it here or there (I think those questions are too limited to judge a model from answers anyway). I like how different organizations try to solve this problem for us.
lobabobloblaw@reddit
Shit talking at the speed of life…and life is speeding, isn’t it?
lobabobloblaw@reddit
One can’t deny that it’s proof of something
ai_without_borders@reddit
the right comparison is probably: for a given RAM budget, what's the best quality you can get? bonsai's value prop is fitting 8b parameters in ~1gb. the question is whether that beats a smaller model (2-3b fp16 or q4_k_m) in the same budget. if gemma-4-e2b at 1.1gb q4 outperforms bonsai-8b at 0.8gb, bonsai's param count advantage is just marketing. the size/quality tradeoff is what matters for actual deployment, not the parameter count headline.
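That selection criterion is easy to mechanize. A minimal sketch, with placeholder sizes and scores (not real benchmark results):

```python
# Hypothetical catalogue: (name, file_size_gb, benchmark_score).
# Sizes are from the thread; scores are made-up placeholders.
candidates = [
    ("bonsai-8b-q1", 0.78, 55.0),
    ("gemma-4-e2b-q4_k_m", 1.10, 62.0),
    ("qwen3-1.7b-q4_k_m", 1.16, 58.0),
]

def best_under_budget(models, ram_gb):
    """Pick the highest-scoring model that fits the RAM budget."""
    fitting = [m for m in models if m[1] <= ram_gb]
    return max(fitting, key=lambda m: m[2]) if fitting else None

print(best_under_budget(candidates, 1.2))  # highest-scoring model that fits
```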
Weak-Shelter-1698@reddit
Dude. It's not the dumbness. :\
There's a difference between knowledge and intelligence.
WeGoToMars7@reddit (OP)
How many days of the week end with "day"?
Gemma answers correctly, Bonsai can't.
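For reference, the expected answer is 7, since every English weekday name ends in "day"; a one-liner confirms it:

```python
days = ["Monday", "Tuesday", "Wednesday", "Thursday",
        "Friday", "Saturday", "Sunday"]
count = sum(1 for d in days if d.lower().endswith("day"))
print(count)  # -> 7
```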
amunozo1@reddit
These issues have more to do with memorizing and tokenization than anything else. I am not saying that you're wrong, but both this example and the one provided above are just bad.
WeGoToMars7@reddit (OP)
I welcome everyone here to contribute better comparisons - these models are extremely easy to run. It's not easy to find a good task for these small LLMs, especially one short enough to fit the results into one screenshot.
A few years ago, it was common to evaluate quantisation by a model's ability to add up two numbers 50 times in a row, and that's a much more unfair test.
RXK_Nephew@reddit
A use I tried it for was a quiz platform where the model would be provided with a question, the answer to the question, and the user's answer to the question, and given the task to evaluate whether the user's answer was correct or incorrect. It seemed too dumb for even that task, unfortunately.
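The grading setup described above can be sketched in a few lines. `build_grading_prompt` and `parse_verdict` are hypothetical helper names, and the actual inference call (llama.cpp server, llama-cli, etc.) is deliberately left abstract:

```python
def build_grading_prompt(question: str, reference: str, user_answer: str) -> str:
    """Assemble the judge prompt: question, reference answer, student answer."""
    return (
        "Question: " + question + "\n"
        "Reference answer: " + reference + "\n"
        "Student answer: " + user_answer + "\n"
        "Reply with exactly one word: CORRECT or INCORRECT."
    )

def parse_verdict(reply: str) -> bool:
    """Small models often pad the answer, so scan for the keyword."""
    text = reply.upper()
    return "INCORRECT" not in text and "CORRECT" in text
```

Constraining the model to a one-word verdict is about the simplest task you can give it, which is what makes the failure reported above notable.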
Negative-Magazine174@reddit
tried on Locally AI on my mac 🤔
tiffanytrashcan@reddit
I'm super curious what your generation parameters are: temperature, minP, and TopK. Qwen3 was really sensitive to TopK settings, and that's probably exaggerated with this experiment.
arkuto@reddit
That's an even worse test of intelligence. It requires reasoning about tokens. It is like asking someone how many neurons fired when thinking about a concept. It's got nothing to do with intelligence or reasoning, but with very specific and esoteric knowledge about how its internals work.
WaveCut@reddit
It's a bad example; any model would answer it if trained for this case, and fail if not.
WeGoToMars7@reddit (OP)
Would this be a better example? Bonsai is unfortunately lobotomised...
worldwidesumit@reddit
You should compare equal quants. Q_1 is very aggressive and unusable.
stddealer@reddit
For Bonsai, Q_1 is not a quant; it's the native type of the full model. Losing to a non-QAT quantized model that's about the same file size is not a good look. But to be fair, this quant of Gemma is a bit bigger than Bonsai, so maybe a smaller quant would perform worse.
Party-Special-5177@reddit
I keep saying this: PrismML saying 'there is no novelty in the model' (I don't remember the exact words; white paper, page 6) tells us exactly how it was made: with standard tools and the same techniques we've had for years now (bitnet.cpp et al.).
There are only two ways to make bitnets currently, and that still seems to be true, as Prism hasn't disclosed any new method.
StillWastingAway@reddit
The value of having a functional Q_1 is way higher than just its file size. Computation-wise it might be a game changer and allow a lot of tricks that are not practical with weights outside {-1, 0, 1}, not to mention the FPGA/ASIC and other specialized-hardware implications.
It's one of the most interesting directions the field might take, even if it's currently underdelivering.
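The computational upside is concrete: with weights restricted to {-1, 0, 1}, a matrix-vector product needs no multiplications at all, only adds and subtracts. A toy sketch (real kernels pack weights into 2-bit lanes; this just shows the arithmetic):

```python
def ternary_matvec(W, x):
    """W: rows of weights in {-1, 0, 1}; x: input vector.
    Every multiply becomes an add, a subtract, or a skip."""
    out = []
    for row in W:
        acc = 0.0
        for w, xi in zip(row, x):
            if w == 1:
                acc += xi
            elif w == -1:
                acc -= xi
            # w == 0: contributes nothing, skip entirely
        out.append(acc)
    return out

print(ternary_matvec([[1, -1, 0], [0, 1, 1]], [2.0, 3.0, 5.0]))  # -> [-1.0, 8.0]
```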
KaroYadgar@reddit
Bonsai is made for Q1. IIRC it's the only quant they provide. Comparing a smaller model at Q4 vs Bonsai at Q1 is completely fair, as those who made Bonsai claim that it performs similarly/closely to full precision models, even at similar parameter ranges.
Although, it should be noted that Bonsai is built on Qwen3 and not Qwen3.5, so subpar intelligence compared to the modern generation is sort of expected.
Alex_L1nk@reddit
Also they are experimenting with various techniques. Today they released 1.58-bit quantized versions.
KaroYadgar@reddit
That's very interesting.
PromptInjection_@reddit
Possible reasons:
a) The training was significantly worse than Google's (this is likely the case)
b) 1-bit simply can't work miracles after all and involves trade-offs
c) Alignment tax. The model is extremely safety-tuned, sometimes even against its own context
tchek@reddit
Bonsai is not meant to be powerful; it's just a miracle that a 1-bit model works at all, and it's promising for 1-bitting bigger models like Qwen 27B.
Party-Special-5177@reddit
All of their marketing is based around ‘intelligence density’. It is supposed to be powerful, compared to its weight (which OP is suggesting it isn’t).
Remember 1bit bitnet models (not unsloth etc, true bitnets) have been around since 2023, ternary models since 2024. There is no miracle anymore. Anyone can make them these days.
Effective-Drawer9152@reddit
Gemma 4 2B is pretty good. Even on my OnePlus phone it gives very good output speed.
Tam1@reddit
How can you run such a big model on a phone? How many tokens a second do you get?
Cinci_Socialist@reddit
Alright so I'm very excited by 1bit and 1.58bit models getting more research resources but y'all need to think about this for a fucking second.
What makes an FP16 model better and more accurate than a Q8 or Q4 quant? There is more information encoded into each weight. You can do a lot to smooth this over and adjust things to get similar performance, but at the end of the day it's like lossy compression: you're going to lose something.
Bonsai models are quantized down to 1 bit. It doesn't make any sense to try to compare it to a model of similar parameters even at a low quant.
Now, I understand you're using a 2B model at Q4 here, which is admittedly very small, but it's not as simple as 2x4=8 and 1x8=8 for similar complexity. You just gotta reason about this a little bit. Does it seem like the domain of knowledge and depth of the 2B Q4 model is smaller, but more accurate? Does it seem like the expression and range of knowledge of the 8B 1-bit model is wider but less accurate? That's what I would expect to find. Parameters drive complexity and depth; more bits per weight drive more accuracy per parameter.
I think what all this misses is that previously it's been impossible, or close to it, to get anything even slightly usable out of a quant smaller than 4 bits. 1-bit and 1.58-bit models will show their unique strengths as parameter count gets scaled up; with so little information encoded per weight, that's the only way they could hope to arrive at similar levels of complexity and utility.
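The lossy-compression analogy can be made concrete with a toy uniform quantizer. This is a simplification (real GGUF quants use per-block scales, and 1-bit schemes are ternary rather than uniform), but the trend it shows is the point: reconstruction error grows as bits shrink.

```python
def quantize_roundtrip(xs, bits):
    """Snap each value to the nearest of 2**bits uniform levels,
    then map back. Lossy by construction."""
    levels = 2 ** bits - 1
    lo, hi = min(xs), max(xs)
    scale = (hi - lo) / levels if levels else 1.0
    return [lo + round((x - lo) / scale) * scale for x in xs]

xs = [0.13, -0.82, 0.55, 0.91, -0.40]
errs = []
for bits in (8, 4, 1):
    rt = quantize_roundtrip(xs, bits)
    errs.append(max(abs(a - b) for a, b in zip(xs, rt)))
print(errs)  # worst-case error grows as bits shrink
```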
scknkkrer@reddit
Oh, god! I'm not alone on this. Thank you! This model is pure hype and BS.
ThatRandomJew7@reddit
I look at it as a proof of concept. If they can get something even vaguely coherent out of a 1bit quant of an 8b model, imagine what they can do with a much larger model
ImJustHereToShare25@reddit
First off, go run the Q4_K_M Qwen3 4B benchmarks they provide and compare to the Qwen3 8B Bonsai ternary model benchmarks. They should be roughly the same size, and I would expect the Bonsai to be 10-15% better.
Second off: they are getting reasonable degradation at Q1 and Q1.58 vs Q4_K_M. Even if they are worse at Q1.58 or Q1, that opens up possibilities to run big models like DeepSeek at Q1/Q1.58 where Q4 absolutely cannot. Even if Q1.58 is slightly worse on the pure efficiency tradeoff, everyone knows that bigger models quantized down are better than medium-small models quantized less.
Nice_Database_9684@reddit
That's not really what these models are for though. Your comparison asking it how many days end with "day" is a much better question, imo, than this knowledge-based question.
I think smaller models are always going to be used for smaller tasks like this. Sentiment analysis, summarisation, etc. No one should be using them for general intelligence or knowledge questions. They're just too small.
yami_no_ko@reddit
Q4_K_M and Q1_0 just don't compare.
z_latent@reddit
No sane person would do that, but PrismML themselves compared their Q1_0 model to full FP16. I think the scrutiny is valid.
yami_no_ko@reddit
I've tried it. Not properly tested, but it sure felt way dumber than what I'd expect from an 8B model these days. Fitting an 8B model within 1GB is where it shines, and even though it clearly falls short for an 8B model, it is way better than what you can usually get at around 1GB file size.
Its main drawback to me was how slowly it performs on edge HW (specifically ARM), so to have it run at decent speeds you need a system that isn't really restricted to 1 or 2 GB of RAM.
I found LFM to fit the bill when it comes to constrained HW.
stddealer@reddit
Model file size is very relevant. You could also say 8B and 2B just don't compare.
LagOps91@reddit
by this standard, every human is much dumber than any given llm.
seamonn@reddit
no shit
charmander_cha@reddit
Comparing it with other models doesn't really seem to do justice to understanding the model and its limitations.
donk8r@reddit
I've been testing extreme quantization for edge deployment, and here's the reality:
The comparison is apples-to-oranges, but the point stands:
The issue isn't just quantization - it's architectural. Gemma 4 was designed from scratch for efficiency. Bonsai is post-training quantization on a model that wasn't.
What actually works at 1-bit:
I've had success with QAT (quantization-aware training) models, but Bonsai isn't doing QAT - they're doing post-training ternary conversion. That's why you see the "lobotomy" effect on reasoning.
The real test:
Run both on code generation or multi-step reasoning tasks; that's where 1-bit models usually fail.
Gemma 4's E2B architecture handles these because the efficiency is in the architecture, not just the weights.
My take:
Bonsai is a cool research project, but for production? Gemma 4 2B/4B is the pragmatic choice. The 29% size savings isn't worth the capability drop unless you're on extremely constrained hardware (think microcontrollers, not consumer GPUs).
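Post-training ternary conversion of the kind described above can be sketched in a few lines. This follows the "absmean" recipe from the BitNet b1.58 literature; whether PrismML's pipeline actually works this way is an assumption:

```python
def ternarize(weights):
    """Absmean ternarization: scale by the mean |w|, then round each
    weight to {-1, 0, 1}. Applied after training, so the model never
    learns to compensate -- the PTQ-vs-QAT gap discussed above."""
    scale = sum(abs(w) for w in weights) / len(weights)
    q = [max(-1, min(1, round(w / scale))) for w in weights]
    return q, scale

q, s = ternarize([0.9, -0.05, 0.4, -1.2, 0.02])
print(q)  # -> [1, 0, 1, -1, 0]
```

In QAT, by contrast, this rounding happens inside the training loop, so gradients push the weights toward values that survive it.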
Lines25@reddit
I think that's cuz of 1-bit quant ?
Idk, I use my own models I train locally lmao
SomeOrdinaryKangaroo@reddit
This is not true, Bonsai is a GREAT model
DeepOrangeSky@reddit
I wonder if that day a couple days ago when Micron's stock price shot back up by like 10% in a day was when all the big wall street guys decided to stop looking at the benchmarks and try Bonsai out in real life for the first time.
Mike: "WHOA... this Bonsai thing SUCKS. Hey Brian, go buy like a billion dollars worth of Micron stock, immediately."
Brian: "Yea, but what about that TurboQuant thing though?"
Mike: "TurboQuant?... more like TurboCu-"
Brian (interrupting): "-If you finish that word, they're definitely going to fire you this time. The H.R. lady already doesn't like you."
Mike: "Alright, half a billion dollars worth of Micron, then." - every guy on wall street a couple days ago, apparently
Healthy-Nebula-3603@reddit
Really? What do you expect with such extremely high compression?
Nexter92@reddit
It's not dumb, it's just training data...