Ternary Bonsai: Top intelligence at 1.58 bits
Posted by pmttyji@reddit | LocalLLaMA | View on Reddit | 76 comments
Today, we’re announcing Ternary Bonsai, a new family of 1.58-bit language models designed to balance strict memory constraints with high accuracy requirements.
This release builds on the efficiency frontier we began exploring with the recently released 1-bit Bonsai models. The 1-bit family showed that extreme compression could still produce commercially useful language models. Ternary Bonsai targets a different point on that curve: a modest increase in size for a meaningful gain in performance.
The models are available in three sizes: 8B, 4B, and 1.7B parameters. By using ternary weights {-1, 0, +1}, these models achieve a memory footprint approximately 9x smaller than standard 16-bit models while outperforming most peers in their respective parameter classes on standard benchmarks.
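As a rough sanity check on the footprint claim, here is a back-of-the-envelope sketch assuming the common scheme of packing 5 ternary weights per byte; the blog's ~9x figure (vs. the ~10x below) presumably reflects embeddings and norms kept at higher precision, which this ignores:

```python
# Back-of-the-envelope model size: fp16 vs packed ternary weights.
def fp16_bytes(n_params: int) -> int:
    return n_params * 2  # 16 bits = 2 bytes per weight

def ternary_bytes(n_params: int) -> int:
    # 5 ternary weights ("trits") fit in one byte, since 3^5 = 243 <= 256,
    # i.e. 8/5 = 1.6 bits per weight. Ceil-divide to count whole bytes.
    return -(-n_params // 5)

n = 8_000_000_000  # the 8B model
print(fp16_bytes(n) / 2**30)     # ~14.9 GiB at fp16
print(ternary_bytes(n) / 2**30)  # ~1.49 GiB packed, ~10x smaller
```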
Blog post : https://prismml.com/news/ternary-bonsai
Models : https://huggingface.co/collections/prism-ml/ternary-bonsai
FP16 safetensors (HuggingFace format) of the ternary Bonsai-8B model. This repo exists for users who want to run Ternary Bonsai with stock HuggingFace tooling, or with frameworks that don't yet support the packed ternary formats. The MLX 2-bit format is currently the only packed format available; more formats for other backends are coming soon.
Hope these ternary Bonsai models come with fewer hallucinations.
Waiting for 20-40B models (like Qwen3.5-27B, Qwen3.5-35B-A3B, Gemma-4-31B, Gemma-4-26B-A4B, etc.) from them soon! That would be the start of a game change for big/large models.
DefNattyBoii@reddit
This is cool, but they are comparing against very obsolete full-weight models. They could easily run the quantized versions of new models (Qwen3.5, Gemma 4, etc.) on these same benchmarks, so I'm chalking it up to intellectual dishonesty and overselling. Their work is impressive, but combined with the fact that they are not working with mainstream inference frameworks (llama.cpp, vLLM, SGLang), it raises some red flags.
keyboardhack@reddit
They do work with llama.cpp. The Bonsai devs were responsible for adding support for their 1-bit models to the CPU, CUDA, and Metal backends. https://github.com/ggml-org/llama.cpp/pull/21273 https://github.com/ggml-org/llama.cpp/pull/21629 https://github.com/ggml-org/llama.cpp/pull/21528
I agree with everything else you said.
DefNattyBoii@reddit
Thank you for correcting me, I was misinformed. Looking forward to trying these over the weekend.
pmttyji@reddit (OP)
Vulkan done https://github.com/ggml-org/llama.cpp/pull/21539
Waiting for below CPU optimization PR to be merged
https://github.com/ggml-org/llama.cpp/pull/21636
minkyuthebuilder@reddit
The efficiency gain is impressive, but I'm curious how ternary weights affect consistency across repeated queries. Benchmarks show strong average performance, but does the quantization introduce more variance in outputs compared to full-precision models? That trade-off matters a lot for use cases where reliability is more important than raw benchmark scores.
Silver_Bug8527@reddit
Bonsai 35B when?
FatheredPuma81@reddit
No please Bonsai 122B. Could be run on almost any system at like 30GB.
Hugi_R@reddit
No, you could not. Context is not free.
Uncle___Marty@reddit
Not free but getting cheaper with turboquant, that is, if it ever gets properly implemented in anything :/
Far-Low-4705@reddit
Qwen 3.5 context is actually MUCH cheaper, especially on a MoE.
I only spend about 10 GB on the full 262k fp16 context on Qwen3.5 122B. And with turboquant or a Q8 KV cache you could easily cut that in half.
pmttyji@reddit (OP)
The 120B comes to around 24GB, so 32K context is possible in 32GB VRAM. 64K context is too much; it might additionally need system RAM, or a bigger 40 or 48 GB GPU.
Far-Low-4705@reddit
This 1000%
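The VRAM arithmetic being traded in this thread can be sketched as follows; the layer count, KV-head count, and head dimension below are made-up placeholders, since the actual configs of these hypothetical models aren't given here:

```python
# Rough KV-cache size estimate for a transformer with grouped-query
# attention (GQA). All model dimensions here are illustrative guesses.
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, ctx_len, bytes_per_elem=2):
    # 2x for keys and values, per layer, per KV head, per token
    return 2 * n_layers * n_kv_heads * head_dim * ctx_len * bytes_per_elem

# e.g. a model with 60 layers, 8 KV heads of dim 128, fp16 cache, 32K context:
gib = kv_cache_bytes(60, 8, 128, 32_768) / 2**30
print(f"{gib:.1f} GiB")  # 7.5 GiB; a Q8 KV cache would roughly halve it
```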
Dany0@reddit
Bonsai E122B if you get what I'm saying, clean yo mouth it's drooling
charmander_cha@reddit
Yes, but a 35B or 27B one would be much more accessible to everyone.
CallumCarmicheal@reddit
Holy shit, I just got whiplash from imagining that.
Fault23@reddit
Qwen3? Are we serious?
RickyRickC137@reddit
I think they're working on qwen 3.5 397b model. Source
pmttyji@reddit (OP)
:) No, not them (Prism-ML).
Dude u/Party-Special-5177 .... you're getting popular.
RickyRickC137@reddit
My bad then. Can you help me understand the difference?
pmttyji@reddit (OP)
He's a user from this sub who said he's cooking a 1-bit version of that model. He's not from the Prism-ML team.
RickyRickC137@reddit
With all due respect, what's so special about Party Special? Because people like Unsloth already have 1-bit quants, right?
Party-Special-5177@reddit
In my opinion: in general, absolutely nothing. lol
Not quite: those guys use bit packing and lookup tables (traditional quantization), meaning they may take e.g. 5 weights and aggregate them into a single lookup table (which still requires matmuls; by definition, Bitnets are designed to only use the ALU on a processor). Now, if the output model averages out to around 3 lookup-table options per original weight, it will be billed as 1.58-bit, since log2(3) ≈ 1.58, etc.
Very different beasts.
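The "ALU only" distinction above can be sketched in a few lines: with weights restricted to {-1, 0, +1}, a dot product needs no multiplications at all, just a running tally of additions and subtractions. This is a hypothetical illustration of the idea, not Prism-ML's actual kernels:

```python
# Multiplication-free dot product for ternary weights: each weight
# either adds the activation, subtracts it, or skips it ("votes").
def ternary_dot(weights, activations):
    total = 0.0
    for w, x in zip(weights, activations):
        if w == 1:
            total += x
        elif w == -1:
            total -= x
        # w == 0 contributes nothing
    return total

print(ternary_dot([1, -1, 0, 1], [2.0, 3.0, 5.0, 4.0]))  # 3.0
```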
RickyRickC137@reddit
Thank you both for the explanation! Please keep posting your quantized models! I'm following you!
pmttyji@reddit (OP)
They're different things. Search online for 1.58-bit models and you'll find detailed posts.
https://huggingface.co/blog/1_58_llm_extreme_quantization
https://en.wikipedia.org/wiki/1.58-bit_large_language_model
https://enerzai.com/resources/blog/small-but-mighty-a-technical-deep-dive-into-1-58-bit-quantization
https://arxiv.org/abs/2402.17764
Party-Special-5177@reddit
How?
Edit: lol it’s you. I’ll go respond to your other update comment now.
pmttyji@reddit (OP)
Don't break your promise. We want 1.58bit version of Qwen3.5-397B model.
Party-Special-5177@reddit
I won’t. Wait until you see what I had to cook up to make it possible. I took all last week off work just to get the pipeline kinks out lol.
pmttyji@reddit (OP)
Yay!
r4in311@reddit
Isn't it kind of dishonest of these guys to show the full weights of the 8B/9B models in these tables? If you were to quantize them to Q4, the performance wouldn't drop that much, and the size difference would be far less noticeable.
WeGoToMars7@reddit
They also don't mention anywhere, except one spot in the whitepaper, that these are actually quants of Qwen3 (not even 3.5), not something they trained from scratch to be quantization-aware. Sounds a lot more like they are building hype without much substance behind it.
lobabobloblaw@reddit
Building hype without much substance is something we're used to seeing from a lot of different parties. But ternary architecture is barely out the door, let alone a revolving one.
M0ULINIER@reddit
I think it's even better that it doesn't require training a whole model. Qwen 3.6 27B at 1 bit would be *chef's kiss*.
Thomas-Lore@reddit
You already have quants like that available. It is unlikely ternary will be better, considering how modern quants work. Ternary is limited to three specific values; a proper 1.6-bit quant can offer a higher range, more aligned with the original model.
Party-Special-5177@reddit
Similar in size, but not in architecture or methodology. Bitnets can't be compared, as they literally don't operate like modern quants lol.
More explanation here: https://old.reddit.com/r/LocalLLaMA/comments/1snqo1f/ternary_bonsai_top_intelligence_at_158_bits/ogp9jij/
kaeptnphlop@reddit
The point of Bonsai was to show that the relative reduction in quality is far lower than with typical quants. The original release showed a drop of only 8 points (79 vs. 71) in average benchmark results compared to 16-bit, I think it was. I may be wrong about the exact numbers, I don't have them in front of me. But the relative drop is significantly lower than for any other low-bit quant. I don't think you can compare it with current 2-bit quants.
It all still has to bear out for modern, larger and MoE models so comparisons can be made.
I use bonsai for its small footprint to generate chat titles and search terms in Open WebUI and can’t complain about its performance so far.
Kind of what you'd come to expect from last year's model, which lags behind frontier capability by another year or so.
One_Key_8127@reddit
I disagree that they are building hype without much substance behind it. If there is a way to quantize a model to 1.6 bits while preserving most of the model's intelligence, that would be huge. We could have Minimax at 46GB with great speeds, or a Gemma 4 that fits on an RTX 3050. However, from my experience, quants under Q4 degrade a lot; even when people claim great results on benchmarks, the models fail at the real job: they get stupid, enter thinking loops, and quality degrades a lot even if the benchmarks say otherwise.
So, if they know how to quantize a model to Q1.6 and it performs similarly to Q4, it's huge. If they can reproduce it on other models, that's amazing; every big studio like Anthropic or OpenAI would want that. They could offer "-fast" variants of their models for half the price and still profit. They could maybe even drop some models (for example, serve quantized Opus instead of Sonnet and save time, effort, and compute).
Party-Special-5177@reddit
Traditional post-training quants (e.g. Unsloth etc.) are bit packing and lookup tables. They take groups of weights and combine them: if you pack 5 ternary-valued weights into a single byte (3^5 = 243 possible combinations), that averages out to 8/5 = 1.6 bits per weight, which gets billed as '1.58b'. And yes, while it works amazingly around 5-6 bits, it destroys models around 1-1.58 bits.
Bitnets have nothing in common with standard post-training quant methods. As I was explaining elsewhere, bit-packed/LUT model quants still require matmuls. Bitnets are designed to only use a processor's 1-clock-per-instruction ALU. They almost turn model weights into votes, where you can just loop over the layer with a running total of the (1, 0, -1) values you've seen so far.
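The 1.6-bits-per-weight arithmetic can be illustrated with a generic base-3 packing scheme; this is an assumption for illustration and not llama.cpp's exact TQ byte layout. Five trits fit in one byte because 3^5 = 243 ≤ 256:

```python
# Pack 5 ternary weights ("trits" in {-1, 0, +1}) into one byte by
# treating them as base-3 digits, and unpack them back.
def pack5(trits):
    byte = 0
    for t in reversed(trits):       # little-endian base-3 encoding
        byte = byte * 3 + (t + 1)   # map {-1, 0, +1} -> {0, 1, 2}
    return byte                     # always <= 242, fits in a byte

def unpack5(byte):
    trits = []
    for _ in range(5):
        byte, digit = divmod(byte, 3)
        trits.append(digit - 1)     # map {0, 1, 2} -> {-1, 0, +1}
    return trits

w = [1, -1, 0, 0, 1]
assert unpack5(pack5(w)) == w       # lossless round trip
print(pack5(w))                     # 200
```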
WeGoToMars7@reddit
You can already download and run a ternary quant of a large model from Unsloth, they perform surprisingly well for some tasks: https://kaitchup.substack.com/p/lessons-from-gguf-evaluations-ternary Qwen3 in particular seems to be quite resistant to degradation.
pmttyji@reddit (OP)
But you can't find benchmarks for quantized versions of models on the original model cards. Every model card comes with only the 16-bit benchmark values. So the size column of the table is important for them.
WeGoToMars7@reddit
These are 8B models; re-running the full benchmark on a Q4 model shouldn't be that hard or expensive.
Far-Low-4705@reddit
Right, but no one does it.
And it is not as trivial or cheap as you think.
RedditUsr2@reddit
Literally no one does it. Remember when Google released Gemma 3 QAT (Quantization-Aware Training)? Their scores were still posted at BF16, and I never found the quantized Q8 scores anywhere.
LagOps91@reddit
If they can run those benchmarks for their own models, they could also run them for some quantized models to get a fair comparison.
MaxKruse96@reddit
It's more misleading than it is genuine.
Septerium@reddit
If these guys are showing official benchmarks only, then what is the problem? You should ask the other teams to publish results for compressed versions of their models.
Randomdotmath@reddit
WHAT
tmvr@reddit
Yes, they would need to create and show good results with some larger 25-35B models. Right now they match Qwen3 4B at a size slightly below Qwen3 4B at Q4, so Qwen3.5 4B at Q4 size is probably significantly better, with minimal speed difference.
Then-Indication7672@reddit
What if you combine the 1.58-bit quantization from Prism with the tensor-swapping compression method of compactifyAI? Since they are two different techniques, it should theoretically be possible to combine them.
pmttyji@reddit (OP)
Bumping this, as I have no idea. Experts might answer this correctly.
Waste-Intention-2806@reddit
Opus 4.7 bonsai when? When? Lol
ghulamalchik@reddit
I'm curious why they stopped at 8B. Why not go much higher, since the models will be tiny anyway?
Beginning-Window-115@reddit
probs cos of compute
z_latent@reddit
Yes, their quants likely require running the base model at full precision to distill it.
FierceDeity_@reddit
Unsloth has research on making that more efficient in VRAM/RAM usage, don't they?
CodeCatto@reddit
Will we get GGUFs to run on Windows/Linux? I only see MLX downloads.
Far-Low-4705@reddit
“Waiting for 20-40B models(like Qwen3.5-27B, Qwen3.5-35B-A3B, Gemma-4-31B, Gemma-4-26B-A4B, etc.,) from them soon! That would be start of game change for big/large models.”
I would die for a qwen3.5 122b…
Heck, a 1bit qwen 3.6 400b would absolutely be insane
IrisColt@reddit
IT CANNOT BE!
smart4@reddit
They should release a version based on Qwen3.6-35B-A3B at 1.58 bits.
ComplexType568@reddit
WAITING FOR STUFF LIKE KIMI OR GLM 5.1 TO BE BONSAIED PLEASE PLEASE IM ON MY KNEEEES
charmander_cha@reddit
Is this article related to that bit-distillation paper? If so, it said the technique didn't seem viable for large models.
pmttyji@reddit (OP)
https://github.com/PrismML-Eng/Bonsai-demo
https://github.com/PrismML-Eng/Bonsai-demo/blob/main/1-bit-bonsai-8b-whitepaper.pdf
charmander_cha@reddit
But isn't that the original paper? Did it mention ternary quantization? I'll look again, but I don't remember seeing anything about it.
Kaljuuntuva_Teppo@reddit
Too bad we are limited to small models.
Something that better utilizes 24-32 GB consumer GPUs would be preferable.
pmttyji@reddit (OP)
I'm sure that from this year onwards we'll be getting medium/big/large models.
TruckUseful4423@reddit
How to make GGUF?
pmttyji@reddit (OP)
They'll release them soon; just watch their HF page.
https://huggingface.co/prism-ml/models?sort=created
power97992@reddit
GLM 5.1 Bonsai and Minimax 2.7 when?
Silver-Champion-4846@reddit
magic magic?
Skyline34rGt@reddit
That's cool, but why still use the obsolete Qwen3 as the base?
MuDotGen@reddit
I mean, Qwen 3.5 is actually only 2 months old today, believe it or not. Qwen 3 came out a year ago, so if they started their research and training more than 2 months ago, they would not have had access to anything after Qwen 3, to be fair. I imagine if they make a next version of Bonsai, they'd likely use Qwen 3.5's or 3.6's weights. Chances are they're already doing so with 3.5 at least.
Skyline34rGt@reddit
Yeah, that's true. For me, 2 months in the AI race is like 2 years (or more) of progress in other technologies. I hope the next projects will be based on Qwen3.5 or other newer models.
Yu2sama@reddit
Because that's the only Qwen in the 8B range? It's pretty evident they aren't competing against bigger ones (Qwen 3.5 9B in this case).
MuDotGen@reddit
Will there be gguf too? Seems it's just MLX.
lobabobloblaw@reddit
Well now