Bonsai models are pure hype: Bonsai-8B is MUCH dumber than Gemma-4-E2B
Posted by WeGoToMars7@reddit | LocalLLaMA | View on Reddit | 68 comments
I'm using the https://github.com/PrismML-Eng/llama.cpp fork for Bonsai, regular llama.cpp for Gemma.
Without embedding parameters:
Gemma 4 has 2.3B at 4.8 bpw (Q4_K_M) = 1104 MB
Bonsai-8B has 6.95B at 1.125 bpw (Q1_0) = 782 MB (-29% smaller)
I could've gone with a smaller quant of Gemma 4; it's just conventional wisdom not to push small models below Q4_K_M.
I might try their ternary model later, but I don't have much hope...
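The size arithmetic in the post can be sanity-checked with a few lines; the absolute MB figures depend on how quant-format overhead and units are counted, but the relative saving falls straight out of params × bits-per-weight:

```python
def weight_bits(params_billion: float, bits_per_weight: float) -> float:
    """Total bits spent on transformer weights (embeddings excluded)."""
    return params_billion * 1e9 * bits_per_weight

gemma = weight_bits(2.3, 4.8)      # Q4_K_M effective bpw
bonsai = weight_bits(6.95, 1.125)  # Q1_0 effective bpw

savings = 1 - bonsai / gemma
print(f"Bonsai is {savings:.0%} smaller")  # -> Bonsai is 29% smaller
```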
itsArmanJr@reddit
they don't seem too bad on my benchmark. decent performance. https://github.com/ArmanJR/PrismML-Bonsai-vs-Qwen3.5-Benchmark
Snoo_28140@reddit
Considering this is created from qwen 3, you could have included the original qwen 3. If it retains close to qwen 3 full precision performance, that would be quite a feat. It also helps to make an apples to apples comparison to an 8b model rather than 9b.
Feztopia@reddit
To be fair, Bonsai is based on Qwen, and the new Gemma models are just awesome for their size. They could use their quantization on the small Gemma, but do we need such small models? It would be more interesting if Google had a Gemma 4 E12B and Bonsai would shrink that.
Snoo_28140@reddit
Both are interesting. Lightning fast inference with negligible footprint AND running bigger models than we currently reasonably can.
egomarker@reddit
Everyone knows the fix. You can't rely on the world knowledge of a 1-bit model.
KaroYadgar@reddit
Indeed, it should be noted that Bonsai was built on Qwen3, not Qwen3.5, so its issues may stem from the fact that it's built on the previous generation rather than purely its quantization impacting it.
I would like to see them apply the same thing to Qwen3.5, as it'd probably result in an even faster model, and we'd be able to properly test them with other current-generation models.
WeGoToMars7@reddit (OP)
Qwen3-8B-Q4_K_M has no issue with all three questions, so it's PrismML's quant process that lobotomised it! With almost 3x the RAM used, it's not a fair comparison, but PrismML was the one to claim that their quantisation barely cost the models any score drop in benchmarks.
KaroYadgar@reddit
Ah! Such a bummer. I am truly hoping that their 1-bit quantization is still better than previous 1-bit quantized models. I am disappointed; I really hoped that PrismML's quantization scheme would live up to what they claimed.
I still wonder why nobody has tried to make a larger BitNet model.
deleted-account69420@reddit
Someone is trying
https://huggingface.co/CMSManhattan
Pyros-SD-Models@reddit
It actually lives up to the hype, but you need to understand that to reach 1-bit, something has to die. Prism’s whole claim is that if you sacrifice basically all factual knowledge of a model, you can still preserve its reasoning abilities, and yes, it does. Which is amazing, because it suggests that reasoning does not depend heavily on factual knowledge.
But this also means that asking the model random factual questions is not a good evaluation and does not say anything about the core claim. It only validates that they did, in fact, nuke factuality.
Pyros-SD-Models@reddit
I mean, it is also not fair to judge models based on three random questions (google “sample size”), and your questions do not disprove their benchmark scores either. The benchmarks are open, you can literally run the exact same evaluations yourself.
I find it quite amusing that people here love to shit-talk benchmarks, and then you look at how this sub actually tests models and it’s just some dude asking three questions.
No methodology, no ablations, no correlation study, no de-biasing. Nothing. Not even a definition of what is actually being measured. What does “dumb” even mean in your context? Are humans who don’t know bun dumber than people who do? How does “knowing bun” correlate with intelligence? Do they even correlate? Congrats, you just invented the worst benchmark possible.
And if you think this somehow disproves the work Prism did... you know, actual scientific work: building a theory, running experiments, measuring results (with quite a bit more than just three questions), and even providing the exact tooling in their repo so you can reproduce the experiments, which hundreds of people already did — then I have bad news for you.
And the benchmarks themselves make this pretty clear: there is not a single claim that Bonsai preserves factual grounding. The benchmark suite includes MMLU-R, MuSR, GSM8K, HumanEval+, IFEval, and BFCL. That covers reasoning, math, coding, instruction following, and function calling. Their core metric is “intelligence density per gigabyte.”
The entire thesis is that 1-bit quantization preserves reasoning capability at a fraction of the size... not that it preserves encyclopedic knowledge. Those are different things.
It's literally the point of their work to sacrifice factual grounding for reasoning stability, and you just tested that they did in fact sacrifice factual grounding. Amazing.
DangerousSetOfBewbs@reddit
I actually ran 4 of them in a council-type array and still could not get a simple web scraper correctly implemented. So yeah, it's good at really simple tasks, I think, just not IT.
ecompanda@reddit
the base model thing is the whole story tbh. qwen3 8B gets lapped by qwen3.5 8B on most benchmarks even at normal quants. building 1 bit compression on top of a model that's already behind is going to lose to a well quantized current gen model almost every time.
charlesrwest0@reddit
If we are being fair, Google has way more resources than the bonsai team. It's a cool proof of their concept, but I'm not sure it's really super production ready.
xadiant@reddit
Yep seems like a proof of concept, and it's super impressive.
OP is comparing apples to oranges
xrvz@reddit
Doesn't sound like a POC to me.
xadiant@reddit
"our core technology will enable industry-changing intelligence in the cloud."
Meaning the technology is allegedly there. If there is interest, there will be newer and better models.
dahood10@reddit
what kind of Gemma-4-e2b Q4_K_M is 1.1gb?
the Q4_K_M from all providers is around ~4.4gb
WeGoToMars7@reddit (OP)
It includes the embeddings and the multimodal projection matrix, I counted transformer parameters only.
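The transformer-only count OP describes amounts to subtracting the embedding table from the total parameter count. A sketch with made-up numbers (the vocab size, hidden dimension, and total are placeholders, not Gemma's actual config):

```python
# Hypothetical sizes for illustration -- not Gemma's real architecture.
vocab_size = 256_000
d_model = 2048
total_params = 2.8e9

# A tied input/output embedding table costs vocab_size * d_model params.
embedding_params = vocab_size * d_model
transformer_params = total_params - embedding_params
print(f"{transformer_params / 1e9:.2f}B transformer params")
```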
droans@reddit
If you really want to do an apples-to-apples comparison, you would use an equivalent-sized Qwen3 model. The Bonsai 8B is 1.16GB so you should find a Qwen3 model which is about 1.16GB. Qwen3-1.7B-Q4_K_M comes close.
I don't talk about how fast my car is by comparing it to a bike or to a Lambo. I would compare it to other similar cars. Why would you use Gemma4? It's an entirely different model. It's a VLM, too, so it's got quite a bit of memory dedicated to non-language parameters. And its size is a lot different than Bonsai.
It's not just you - it seems like every LLM community goes to great lengths to avoid true comparisons and it annoys me to no end.
Mushoz@reddit
But the embeddings are actually used and result in better performance. Ignoring them when comparing file sizes makes no sense. Not counting the vision part of the model does make sense.
dahood10@reddit
who cares, people still load 3.5gb in vram
the difference is 4x in size
rerri@reddit
You and op are both wrong lol
Unsloth 3.11 GB
Bartowski 3.46 GB
dahood10@reddit
my bad, but 3.5gb is still huge compared to 800mb
FullOf_Bad_Ideas@reddit
Gemma E2B is a new architecture that also attempts to pack a punch in small size. It's expected that Bonsai won't necessarily beat it here or there (I think those questions are too limited to judge a model from answers anyway). I like how different organizations try to solve this problem for us.
lobabobloblaw@reddit
Shit talking at the speed of life…and life is speeding, isn’t it?
lobabobloblaw@reddit
One can’t deny that it’s proof of something
ai_without_borders@reddit
the right comparison is probably: for a given RAM budget, what's the best quality you can get? bonsai's value prop is fitting 8b parameters in ~1gb. the question is whether that beats a smaller model (2-3b fp16 or q4_k_m) in the same budget. if gemma-4-e2b at 1.1gb q4 outperforms bonsai-8b at 0.8gb, bonsai's param count advantage is just marketing. the size/quality tradeoff is what matters for actual deployment, not the parameter count headline.
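That selection criterion is easy to mechanize. A minimal sketch, with placeholder sizes and scores (not real benchmark results):

```python
# Hypothetical catalogue: (name, file_size_gb, benchmark_score).
# Sizes are from the thread; scores are made-up placeholders.
candidates = [
    ("bonsai-8b-q1", 0.78, 55.0),
    ("gemma-4-e2b-q4_k_m", 1.10, 62.0),
    ("qwen3-1.7b-q4_k_m", 1.16, 58.0),
]

def best_under_budget(models, ram_gb):
    """Pick the highest-scoring model that fits the RAM budget."""
    fitting = [m for m in models if m[1] <= ram_gb]
    return max(fitting, key=lambda m: m[2]) if fitting else None

print(best_under_budget(candidates, 1.2))  # highest-scoring model that fits
```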
Weak-Shelter-1698@reddit
Dude. It's not the dumbness. :\
There's a difference between knowledge and intelligence.
WeGoToMars7@reddit (OP)
How many days of the week end with "day"?
Gemma answers correctly, Bonsai can't.
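For reference, the expected answer is 7, since every English weekday name ends in "day"; a one-liner confirms it:

```python
days = ["Monday", "Tuesday", "Wednesday", "Thursday",
        "Friday", "Saturday", "Sunday"]
count = sum(1 for d in days if d.lower().endswith("day"))
print(count)  # -> 7
```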
amunozo1@reddit
These issues have more to do with memorizing and tokenization than anything else. I am not saying that you're wrong, but both this example and the one provided above are just bad.
WeGoToMars7@reddit (OP)
I welcome everyone here to contribute better comparisons - these models are extremely easy to run. It's not easy to find a good task for these small LLMs, especially one short enough to fit the results into one screenshot.
A few years ago, it was common to evaluate quantisation by a model's ability to add up two numbers 50 times in a row, and that's a much more unfair test.
RXK_Nephew@reddit
A use I tried it for was a quiz platform where the model would be provided with a question, the answer to the question, and the user's answer to the question, and given the task to evaluate whether the user's answer was correct or incorrect. It seemed too dumb for even that task, unfortunately.
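The grading setup described above can be sketched in a few lines. `build_grading_prompt` and `parse_verdict` are hypothetical helper names, and the actual inference call (llama.cpp server, llama-cli, etc.) is deliberately left abstract:

```python
def build_grading_prompt(question: str, reference: str, user_answer: str) -> str:
    """Assemble the judge prompt: question, reference answer, student answer."""
    return (
        "Question: " + question + "\n"
        "Reference answer: " + reference + "\n"
        "Student answer: " + user_answer + "\n"
        "Reply with exactly one word: CORRECT or INCORRECT."
    )

def parse_verdict(reply: str) -> bool:
    """Small models often pad the answer, so scan for the keyword."""
    text = reply.upper()
    return "INCORRECT" not in text and "CORRECT" in text
```

Constraining the model to a one-word verdict is about the simplest task you can give it, which is what makes the failure reported above notable.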
Negative-Magazine174@reddit
tried on Locally AI on my mac 🤔
tiffanytrashcan@reddit
I'm super curious what your generation parameters are: temperature, minP, and TopK. Qwen3 was really sensitive to TopK settings, and that's probably exaggerated with this experiment.
arkuto@reddit
That's an even worse test of intelligence. It requires reasoning about tokens. It is like asking someone how many neurons fired when thinking about a concept. It's got nothing to do with intelligence or reasoning, but with very specific and esoteric knowledge about how its internals work.
WaveCut@reddit
It's a bad example; any model would answer it if trained for this case, and fail if not.
WeGoToMars7@reddit (OP)
Would this be a better example? Bonsai is unfortunately lobotomised...
worldwidesumit@reddit
You should compare equal quants. Q_1 is very aggressive and unusable.
stddealer@reddit
For Bonsai, Q_1 is not a quant; it's the native type of the full model. Losing to a non-QAT quantized model that's about the same file size is not a good look. But to be fair, this quant of Gemma is a bit bigger than Bonsai, so maybe a smaller quant would perform worse.
Party-Special-5177@reddit
I keep saying this: PrismML saying 'there is no novelty in the model' (I don't remember the exact words; white paper, page 6) tells us exactly how it was made: with standard tools and the same techniques we've had for years now (bitnet.cpp et al.).
There are only two ways to make bitnets currently, and that still seems to be true, as Prism hasn't disclosed any new method.
StillWastingAway@reddit
The value of having a functional Q_1 is way higher than just its file size. Computation-wise it might be a game changer and allow a lot of tricks that are not practical with weights outside {-1, 0, 1}, not to mention the FPGA/ASIC and other specialized-hardware implications.
It's one of the most interesting directions the field might take, even if it's currently underdelivering.
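The computational upside is concrete: with weights restricted to {-1, 0, 1}, a matrix-vector product needs no multiplications at all, only adds and subtracts. A toy sketch (real kernels pack weights into 2-bit lanes; this just shows the arithmetic):

```python
def ternary_matvec(W, x):
    """W: rows of weights in {-1, 0, 1}; x: input vector.
    Every multiply becomes an add, a subtract, or a skip."""
    out = []
    for row in W:
        acc = 0.0
        for w, xi in zip(row, x):
            if w == 1:
                acc += xi
            elif w == -1:
                acc -= xi
            # w == 0: contributes nothing, skip entirely
        out.append(acc)
    return out

print(ternary_matvec([[1, -1, 0], [0, 1, 1]], [2.0, 3.0, 5.0]))  # -> [-1.0, 8.0]
```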
KaroYadgar@reddit
Bonsai is made for Q1. IIRC it's the only quant they provide. Comparing a smaller model at Q4 vs Bonsai at Q1 is completely fair, as those who made Bonsai claim that it performs similarly/closely to full precision models, even at similar parameter ranges.
Although, it should be noted that Bonsai is built on Qwen3 and not Qwen3.5, so subpar intelligence compared to the modern generation is sort of expected.
Alex_L1nk@reddit
Also they are experimenting with various techniques. Today they released 1.58-bit quantized versions.
KaroYadgar@reddit
That's very interesting.
PromptInjection_@reddit
Possible reasons:
a) The training was significantly worse than Google's (this is likely the case)
b) 1-bit simply can't work miracles after all and involves trade-offs
c) Alignment tax. The model is extremely safety-tuned, sometimes even against its own context
tchek@reddit
Bonsai is not meant to be powerful; it's just a miracle that a 1-bit model works at all, and it's promising for 1-bitting bigger models like Qwen 27B.
Party-Special-5177@reddit
All of their marketing is based around ‘intelligence density’. It is supposed to be powerful, compared to its weight (which OP is suggesting it isn’t).
Remember 1bit bitnet models (not unsloth etc, true bitnets) have been around since 2023, ternary models since 2024. There is no miracle anymore. Anyone can make them these days.
Effective-Drawer9152@reddit
Gemma 4 2B is pretty good. Even on my OnePlus phone it gives very good output speed.
Tam1@reddit
How can you run such a big model on a phone? How many tokens a second do you get?
Cinci_Socialist@reddit
Alright so I'm very excited by 1bit and 1.58bit models getting more research resources but y'all need to think about this for a fucking second.
What makes an FP16 model better and more accurate than a Q8 or Q4 quant? There is more information encoded into each weight. You can do a lot to smooth this over and adjust things to get similar performance, but at the end of the day it's like lossy compression: you're going to lose something.
Bonsai models are quantized down to 1 bit. It doesn't make any sense to try to compare it to a model of similar parameters even at a low quant.
Now, I understand you're using a 2B model at Q4 here, which is admittedly very small, but it's not as simple as 2x4=8 and 1x8=8 for similar complexity. You just gotta reason about this a little bit. Does it seem like the domain of knowledge and depth of the 2B Q4 model is smaller, but more accurate? Does it seem like the expression and range of knowledge of the 8B 1-bit model is wider but less accurate? That's what I would expect to find. Parameters drive complexity and depth; more bits per weight drive more accuracy per parameter.
I think what all this misses is that previously it's been impossible, or close to it, to get anything even slightly usable out of a quant smaller than 4 bits. 1-bit and 1.58-bit models will show their unique strengths as parameter count gets scaled up; with so little information encoded per weight, that's the only way they could hope to arrive at similar levels of complexity and utility.
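The lossy-compression analogy can be made concrete with a toy uniform quantizer. This is a simplification (real GGUF quants use per-block scales, and 1-bit schemes are ternary rather than uniform), but the trend it shows is the point: reconstruction error grows as bits shrink.

```python
def quantize_roundtrip(xs, bits):
    """Snap each value to the nearest of 2**bits uniform levels,
    then map back. Lossy by construction."""
    levels = 2 ** bits - 1
    lo, hi = min(xs), max(xs)
    scale = (hi - lo) / levels if levels else 1.0
    return [lo + round((x - lo) / scale) * scale for x in xs]

xs = [0.13, -0.82, 0.55, 0.91, -0.40]
errs = []
for bits in (8, 4, 1):
    rt = quantize_roundtrip(xs, bits)
    errs.append(max(abs(a - b) for a, b in zip(xs, rt)))
print(errs)  # worst-case error grows as bits shrink
```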
scknkkrer@reddit
Oh, god! I'm not alone on this. Thank you! This model is pure hype and BS.
ThatRandomJew7@reddit
I look at it as a proof of concept. If they can get something even vaguely coherent out of a 1bit quant of an 8b model, imagine what they can do with a much larger model
ImJustHereToShare25@reddit
First off, go run the Q4_K_M Qwen3 4B benchmarks they provide and compare to the Qwen3 8B Bonsai ternary model benchmarks. They should be roughly the same size, and I would expect the Bonsai to be 10-15% better.
Second off: they are getting reasonable degradation at Q1 and Q1.58 vs Q4_K_M. Even if they are worse at Q1.58 or Q1, that opens up possibilities to run big models like DeepSeek at Q1/Q1.58 where Q4 absolutely cannot. Even if Q1.58 is slightly worse on the pure efficiency tradeoff, everyone knows that bigger models quantized down are better than medium-small models quantized less.
Nice_Database_9684@reddit
That's not really what these models are for though. Your comparison asking it how many days end with "day" is a much better question, imo, than this knowledge-based question.
I think smaller models are always going to be used for smaller tasks like this. Sentiment analysis, summarisation, etc. No one should be using them for general intelligence or knowledge questions. They're just too small.
yami_no_ko@reddit
Q4_K_M and Q1_0 just don't compare.
z_latent@reddit
No sane person would do that, but PrismML themselves compared their Q1_0 model to full FP16. I think the scrutiny is valid.
yami_no_ko@reddit
I've tried it. Not properly tested, but it sure felt way dumber than what I'd expect from an 8B model these days. Fitting an 8B model within 1GB is where it shines, and even though it clearly falls short for an 8B model, it is way better than what you can usually get at around 1GB file size.
Its main drawback to me was how slowly it performs on edge HW (specifically ARM), so to have it run at decent speeds you need a system that isn't really restricted to 1 or 2 GB of RAM.
I found LFM to fit the bill when it comes to constrained HW.
stddealer@reddit
Model file size is very relevant. You could also say 8B and 2B just don't compare.
LagOps91@reddit
by this standard, every human is much dumber than any given llm.
seamonn@reddit
no shit
charmander_cha@reddit
Comparing it with other models doesn't really seem to do justice to understanding the model and its limitations.
donk8r@reddit
I've been testing extreme quantization for edge deployment, and here's the reality:
The comparison is apples-to-oranges, but the point stands:
The issue isn't just quantization - it's architectural. Gemma 4 was designed from scratch for efficiency. Bonsai is post-training quantization on a model that wasn't.
What actually works at 1-bit:
I've had success with QAT (quantization-aware training) models, but Bonsai isn't doing QAT - they're doing post-training ternary conversion. That's why you see the "lobotomy" effect on reasoning.
The real test:
Run both on code generation or multi-step reasoning tasks; that's where 1-bit models usually fail.
Gemma 4's E2B architecture handles these because the efficiency is in the architecture, not just the weights.
My take:
Bonsai is a cool research project, but for production? Gemma 4 2B/4B is the pragmatic choice. The 29% size savings isn't worth the capability drop unless you're on extremely constrained hardware (think microcontrollers, not consumer GPUs).
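Post-training ternary conversion of the kind described above can be sketched in a few lines. This follows the "absmean" recipe from the BitNet b1.58 literature; whether PrismML's pipeline actually works this way is an assumption:

```python
def ternarize(weights):
    """Absmean ternarization: scale by the mean |w|, then round each
    weight to {-1, 0, 1}. Applied after training, so the model never
    learns to compensate -- the PTQ-vs-QAT gap discussed above."""
    scale = sum(abs(w) for w in weights) / len(weights)
    q = [max(-1, min(1, round(w / scale))) for w in weights]
    return q, scale

q, s = ternarize([0.9, -0.05, 0.4, -1.2, 0.02])
print(q)  # -> [1, 0, 1, -1, 0]
```

In QAT, by contrast, this rounding happens inside the training loop, so gradients push the weights toward values that survive it.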
Lines25@reddit
I think that's cuz of 1-bit quant ?
Idk, I use my own models I train locally lmao
SomeOrdinaryKangaroo@reddit
This is not true, Bonsai is a GREAT model
DeepOrangeSky@reddit
I wonder if that day a couple days ago when Micron's stock price shot back up by like 10% in a day was when all the big wall street guys decided to stop looking at the benchmarks and try Bonsai out in real life for the first time.
Mike: "WHOA... this Bonsai thing SUCKS. Hey Brian, go buy like a billion dollars worth of Micron stock, immediately."
Brian: "Yea, but what about that TurboQuant thing though?"
Mike: "TurboQuant?... more like TurboCu-"
Brian (interrupting): "-If you finish that word, they're definitely going to fire you this time. The H.R. lady already doesn't like you."
Mike: "Alright, half a billion dollars worth of Micron, then." - every guy on wall street a couple days ago, apparently
Healthy-Nebula-3603@reddit
Really? What do you expect with such extremely high compression?
Nexter92@reddit
It's not dumb, it's just training data...