Ternary Bonsai: Top intelligence at 1.58 bits
Posted by pmttyji@reddit | LocalLLaMA | View on Reddit | 76 comments
Today, we’re announcing Ternary Bonsai, a new family of 1.58-bit language models designed to balance strict memory constraints with high accuracy requirements.
This release builds on the efficiency frontier we began exploring with the recently released 1-bit Bonsai models. The 1-bit family showed that extreme compression could still produce commercially useful language models. Ternary Bonsai targets a different point on that curve: a modest increase in size for a meaningful gain in performance.
The models are available in three sizes: 8B, 4B, and 1.7B parameters. By using ternary weights {-1, 0, +1}, these models achieve a memory footprint approximately 9x smaller than standard 16-bit models while outperforming most peers in their respective parameter classes on standard benchmarks.
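As a rough sanity check on the footprint claim, here is a back-of-the-envelope sketch assuming the common scheme of packing 5 ternary weights per byte; the blog's ~9x figure (vs. the ~10x below) presumably reflects embeddings and norms kept at higher precision, which this ignores:

```python
# Back-of-the-envelope model size: fp16 vs packed ternary weights.
def fp16_bytes(n_params: int) -> int:
    return n_params * 2  # 16 bits = 2 bytes per weight

def ternary_bytes(n_params: int) -> int:
    # 5 ternary weights ("trits") fit in one byte, since 3^5 = 243 <= 256,
    # i.e. 8/5 = 1.6 bits per weight. Ceil-divide to count whole bytes.
    return -(-n_params // 5)

n = 8_000_000_000  # the 8B model
print(fp16_bytes(n) / 2**30)     # ~14.9 GiB at fp16
print(ternary_bytes(n) / 2**30)  # ~1.49 GiB packed, ~10x smaller
```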
Blog post : https://prismml.com/news/ternary-bonsai
Models : https://huggingface.co/collections/prism-ml/ternary-bonsai
FP16 safetensors (HuggingFace format) of the ternary Bonsai-8B model. This repo exists for users who want to run Ternary Bonsai with stock HuggingFace tooling, or with frameworks that don't yet support the packed ternary formats. The MLX 2-bit format is currently the only packed format available; more formats for other backends are coming soon.
Hope these ternary Bonsai models come with fewer hallucinations.
Waiting for 20-40B models (like Qwen3.5-27B, Qwen3.5-35B-A3B, Gemma-4-31B, Gemma-4-26B-A4B, etc.) from them soon! That would be the start of a game change for big/large models.
DefNattyBoii@reddit
This is cool, but they are comparing against very obsolete full-weight models. They could easily run the quantized versions of new models (Qwen3.5, Gemma 4, etc.) on these same benchmarks, so I'm chalking it up to intellectual dishonesty and overselling. Their work is impressive, but combined with the fact that they are not working with mainstream inference frameworks (llama.cpp, vLLM, SGLang), it raises some red flags.
keyboardhack@reddit
They do work with llama.cpp. The Bonsai devs were responsible for adding support for their 1-bit models to the CPU, CUDA, and Metal backends. https://github.com/ggml-org/llama.cpp/pull/21273 https://github.com/ggml-org/llama.cpp/pull/21629 https://github.com/ggml-org/llama.cpp/pull/21528
I agree with everything else you said.
DefNattyBoii@reddit
Thank you for correcting me, I was misinformed. Looking forward to trying these over the weekend.
pmttyji@reddit (OP)
Vulkan done https://github.com/ggml-org/llama.cpp/pull/21539
Waiting for below CPU optimization PR to be merged
https://github.com/ggml-org/llama.cpp/pull/21636
minkyuthebuilder@reddit
The efficiency gain is impressive, but I'm curious how ternary weights affect consistency across repeated queries. Benchmarks show strong average performance, but does the quantization introduce more variance in outputs compared to full-precision models? That trade-off matters a lot for use cases where reliability is more important than raw benchmark scores.
Silver_Bug8527@reddit
Bonsai 35B when?
FatheredPuma81@reddit
No please Bonsai 122B. Could be run on almost any system at like 30GB.
Hugi_R@reddit
No, you could not. Context is not free.
Uncle___Marty@reddit
Not free but getting cheaper with turboquant, that is, if it ever gets properly implemented in anything :/
Far-Low-4705@reddit
Qwen 3.5 context is actually MUCH cheaper, especially on a MoE.
I only spend about 10 GB on the full 262k fp16 context on Qwen3.5 122B. And with turboquant or a Q8 KV cache you could easily cut that in half.
pmttyji@reddit (OP)
The 120B comes to around 24GB, so 32K context is possible in 32GB VRAM. 64K context is too much; it might additionally need system RAM, or a bigger 40 or 48 GB GPU.
Far-Low-4705@reddit
This 1000%
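The VRAM arithmetic being traded in this thread can be sketched as follows; the layer count, KV-head count, and head dimension below are made-up placeholders, since the actual configs of these hypothetical models aren't given here:

```python
# Rough KV-cache size estimate for a transformer with grouped-query
# attention (GQA). All model dimensions here are illustrative guesses.
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, ctx_len, bytes_per_elem=2):
    # 2x for keys and values, per layer, per KV head, per token
    return 2 * n_layers * n_kv_heads * head_dim * ctx_len * bytes_per_elem

# e.g. a model with 60 layers, 8 KV heads of dim 128, fp16 cache, 32K context:
gib = kv_cache_bytes(60, 8, 128, 32_768) / 2**30
print(f"{gib:.1f} GiB")  # 7.5 GiB; a Q8 KV cache would roughly halve it
```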
Dany0@reddit
Bonsai E122B if you get what I'm saying, clean yo mouth it's drooling
charmander_cha@reddit
Yes, but a 35B or 27B one would be much more accessible to everyone.
CallumCarmicheal@reddit
Holy shit, I just got whiplash from imagining that.
Fault23@reddit
Qwen3? Are we serious?
RickyRickC137@reddit
I think they're working on qwen 3.5 397b model. Source
pmttyji@reddit (OP)
:) No, not them (Prism-ML).
Dude u/Party-Special-5177 .... you're getting popular.
RickyRickC137@reddit
My bad then. Can you help me understand the difference?
pmttyji@reddit (OP)
He's a user from this sub who said he's cooking a 1-bit version of that model. He's not from the Prism-ML team.
RickyRickC137@reddit
With all due respect, what's so special about Party Special? Because people like Unsloth already have 1-bit quants, right?
Party-Special-5177@reddit
In my opinion: in general, absolutely nothing. lol
Not quite: those guys use bit packing and lookup tables (traditional quantization), meaning they may take e.g. 5 weights and aggregate them into a single lookup table (which still requires matmuls; by definition, Bitnets are designed to only use the ALU on a processor). Now, if the output model averages out to around 3 lookup-table options per original weight, it will be billed as 1.58-bit, since log2(3) ≈ 1.58, etc.
Very different beasts.
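The "ALU only" distinction above can be sketched in a few lines: with weights restricted to {-1, 0, +1}, a dot product needs no multiplications at all, just a running tally of additions and subtractions. This is a hypothetical illustration of the idea, not Prism-ML's actual kernels:

```python
# Multiplication-free dot product for ternary weights: each weight
# either adds the activation, subtracts it, or skips it ("votes").
def ternary_dot(weights, activations):
    total = 0.0
    for w, x in zip(weights, activations):
        if w == 1:
            total += x
        elif w == -1:
            total -= x
        # w == 0 contributes nothing
    return total

print(ternary_dot([1, -1, 0, 1], [2.0, 3.0, 5.0, 4.0]))  # 3.0
```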
RickyRickC137@reddit
Thank you both for the explanation! Please keep posting your quantized models! I'm following you!
pmttyji@reddit (OP)
They're different things. Search online for 1.58-bit models and you'll find detailed posts.
https://huggingface.co/blog/1_58_llm_extreme_quantization
https://en.wikipedia.org/wiki/1.58-bit_large_language_model
https://enerzai.com/resources/blog/small-but-mighty-a-technical-deep-dive-into-1-58-bit-quantization
https://arxiv.org/abs/2402.17764
Party-Special-5177@reddit
How?
Edit: lol it’s you. I’ll go respond to your other update comment now.
pmttyji@reddit (OP)
Don't break your promise. We want 1.58bit version of Qwen3.5-397B model.
Party-Special-5177@reddit
I won’t. Wait until you see what I had to cook up to make it possible. I took all last week off work just to get the pipeline kinks out lol.
pmttyji@reddit (OP)
Yay!
r4in311@reddit
Isn't it kind of dishonest of these guys to show the full weights of the 8B/9B models in these tables? If you were to quantize them to Q4, the performance wouldn't drop that much, and the size difference would be far less noticeable.
WeGoToMars7@reddit
They also don't mention anywhere, except one spot in the whitepaper, that these are actually quants of Qwen3 (not even 3.5), not something they trained from scratch to be quantization-aware. Sounds a lot more like they are building hype without much substance behind it.
lobabobloblaw@reddit
Building hype without much substance is something we're used to seeing from a lot of different parties. But ternary architecture is barely out the door, let alone a revolving one.
M0ULINIER@reddit
I think it's even better that it doesn't require training a whole model. Qwen 3.6 27B at 1 bit would be *chef's kiss*.
Thomas-Lore@reddit
You already have quants like that available. It is unlikely ternary will be better, considering how modern quants work. Ternary is limited to three specific values; a proper 1.6-bit quant can offer a higher range, more aligned with the original model.
Party-Special-5177@reddit
Similar in size, but not in architecture or methodology. Bitnets can't be compared, as they literally don't operate like modern quants lol.
More explanation here: https://old.reddit.com/r/LocalLLaMA/comments/1snqo1f/ternary_bonsai_top_intelligence_at_158_bits/ogp9jij/
kaeptnphlop@reddit
The point of Bonsai was to show that the relative reduction in quality is far lower than with typical quants. The original release showed a drop of only 8 points (79 vs. 71) in average benchmark results compared to 16-bit, I think it was. I may be wrong about the exact numbers, I don't have them in front of me. But the relative drop is significantly lower than for any other low-bit quant. I don't think you can compare it with current 2-bit quants.
It all still has to bear out for modern, larger and MoE models so comparisons can be made.
I use bonsai for its small footprint to generate chat titles and search terms in Open WebUI and can’t complain about its performance so far.
Kind of what you'd come to expect from last year's model, which lags behind frontier capability by another year or so.
One_Key_8127@reddit
I disagree that they are building hype without much substance behind it. If there is a way to quantize a model to 1.6 bits while preserving most of the model's intelligence, that would be huge. We could have Minimax at 46GB with great speeds, or a Gemma 4 that fits on an RTX 3050. However, from my experience, quants under Q4 degrade a lot; even when people claim great results on benchmarks, the models fail at the real job: they get stupid, enter thinking loops, and quality degrades a lot even if the benchmarks say otherwise.
So, if they know how to quantize a model to Q1.6 and it performs similarly to Q4, it's huge. If they can reproduce it on other models, that's amazing; every big studio like Anthropic or OpenAI would want that. They could offer "-fast" variants of their models for half the price and still profit. They could maybe even drop some models (for example, serve quantized Opus instead of Sonnet and save time, effort, and compute).
Party-Special-5177@reddit
Traditional post-training quants (e.g. Unsloth etc.) are bit packing and lookup tables. They take groups of weights and combine them: if you pack 5 ternary-valued weights into a single byte (3^5 = 243 possible combinations), that averages out to 8/5 = 1.6 bits per weight, which gets billed as '1.58b'. And yes, while it works amazingly around 5-6 bits, it destroys models around 1-1.58 bits.
Bitnets have nothing in common with standard post-training quant methods. As I was explaining elsewhere, bit-packed/LUT model quants still require matmuls. Bitnets are designed to only use a processor's 1-clock-per-instruction ALU. They almost turn model weights into votes, where you can just loop over the layer with a running total of the (1, 0, -1) values you've seen so far.
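The 1.6-bits-per-weight arithmetic can be illustrated with a generic base-3 packing scheme; this is an assumption for illustration and not llama.cpp's exact TQ byte layout. Five trits fit in one byte because 3^5 = 243 ≤ 256:

```python
# Pack 5 ternary weights ("trits" in {-1, 0, +1}) into one byte by
# treating them as base-3 digits, and unpack them back.
def pack5(trits):
    byte = 0
    for t in reversed(trits):       # little-endian base-3 encoding
        byte = byte * 3 + (t + 1)   # map {-1, 0, +1} -> {0, 1, 2}
    return byte                     # always <= 242, fits in a byte

def unpack5(byte):
    trits = []
    for _ in range(5):
        byte, digit = divmod(byte, 3)
        trits.append(digit - 1)     # map {0, 1, 2} -> {-1, 0, +1}
    return trits

w = [1, -1, 0, 0, 1]
assert unpack5(pack5(w)) == w       # lossless round trip
print(pack5(w))                     # 200
```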
WeGoToMars7@reddit
You can already download and run a ternary quant of a large model from Unsloth, they perform surprisingly well for some tasks: https://kaitchup.substack.com/p/lessons-from-gguf-evaluations-ternary Qwen3 in particular seems to be quite resistant to degradation.
pmttyji@reddit (OP)
But you can't find benchmarks for quantized versions of models on the original model cards. Every model card comes with only the 16-bit benchmark values. So the size column of the table is important for them.
WeGoToMars7@reddit
These are 8B models; re-running the full benchmark on a Q4 model shouldn't be that hard or expensive.
Far-Low-4705@reddit
Right, but no one does it.
And it is not as trivial or cheap as you think.
RedditUsr2@reddit
Literally no one does it. Remember when Google released Gemma 3 QAT (Quantization-Aware Training)? Their scores were still posted at BF16, and I never found the quantized Q8 scores anywhere.
LagOps91@reddit
If they can run those benchmarks for their own models, they could also run them for some quantized models to get a fair comparison.
MaxKruse96@reddit
It's more misleading than it is genuine.
Septerium@reddit
If these guys are showing official benchmarks only, then what is the problem? You should ask the other teams to publish results for compressed versions of their models.
Randomdotmath@reddit
WHAT
tmvr@reddit
Yes, they would need to create and show good results with some larger 25-35B models. Right now they match Qwen3 4B at a size slightly below Qwen3 4B at Q4, so Qwen3.5 4B at Q4 size is probably significantly better, with minimal speed difference.
Then-Indication7672@reddit
What if you combine the 1.58-bit quantization from Prism with the tensor-swapping compression method of compactifyAI? Since they are two different techniques, it should theoretically be possible to combine them.
pmttyji@reddit (OP)
Bumping this, as I have no idea. Experts might answer this correctly.
Waste-Intention-2806@reddit
Opus 4.7 bonsai when? When? Lol
ghulamalchik@reddit
I'm curious why they stopped at 8B. Why not go much higher, since the models will be tiny anyway?
Beginning-Window-115@reddit
probs cos of compute
z_latent@reddit
Yes, their quants likely require running the base model at full precision to distill it.
FierceDeity_@reddit
Unsloth has research on making that more efficient in VRAM/RAM usage, don't they?
CodeCatto@reddit
Will we get GGUFs to run on Windows/Linux? I only see MLX downloads.
Far-Low-4705@reddit
“Waiting for 20-40B models(like Qwen3.5-27B, Qwen3.5-35B-A3B, Gemma-4-31B, Gemma-4-26B-A4B, etc.,) from them soon! That would be start of game change for big/large models.”
I would die for a qwen3.5 122b…
Heck, a 1bit qwen 3.6 400b would absolutely be insane
IrisColt@reddit
IT CANNOT BE!
smart4@reddit
They should release a version based on Qwen3.6-35B-A3B at 1.58 bits.
ComplexType568@reddit
WAITING FOR STUFF LIKE KIMI OR GLM 5.1 TO BE BONSAIED PLEASE PLEASE IM ON MY KNEEEES
charmander_cha@reddit
Is this article related to that bit-distillation paper? If so, it said the technique didn't seem viable for large models.
pmttyji@reddit (OP)
https://github.com/PrismML-Eng/Bonsai-demo
https://github.com/PrismML-Eng/Bonsai-demo/blob/main/1-bit-bonsai-8b-whitepaper.pdf
charmander_cha@reddit
But isn't that the original paper? Did it mention ternary quantization? I'll look again, but I don't remember seeing anything about it.
Kaljuuntuva_Teppo@reddit
Too bad we are limited to small models.
Something that better utilizes 24-32 GB consumer GPUs would be preferable.
pmttyji@reddit (OP)
I'm sure that from this year onwards we'll be getting medium/big/large models.
TruckUseful4423@reddit
How to make GGUF?
pmttyji@reddit (OP)
They'll release them soon; just watch their HF page.
https://huggingface.co/prism-ml/models?sort=created
power97992@reddit
GLM 5.1 Bonsai and Minimax 2.7 when?
Silver-Champion-4846@reddit
magic magic?
Skyline34rGt@reddit
That's cool, but why still use the obsolete Qwen3 as the base?
MuDotGen@reddit
I mean, Qwen 3.5 is actually only 2 months old today, believe it or not. Qwen 3 came out a year ago, so if they started their research and training more than 2 months ago, they would not have had access to anything after Qwen 3, to be fair. I imagine if they make a next version of Bonsai, they'd likely use Qwen 3.5's or 3.6's weights. Chances are they're already doing so with 3.5 at least.
Skyline34rGt@reddit
Yeah, that's true. For me, 2 months in the AI race is like 2 years (or more) of progress in other technologies. I hope the next projects will be based on Qwen3.5 or other newer models.
Yu2sama@reddit
Because that's the only Qwen in the 8B range? It's pretty evident they aren't competing against bigger ones (Qwen 3.5 9B in this case).
MuDotGen@reddit
Will there be gguf too? Seems it's just MLX.
lobabobloblaw@reddit
Well now