You guys seen this? 1-bit model with an MMLU-R of 65.7, 8B params
Posted by OmarBessa@reddit | LocalLLaMA | 49 comments
This is nuts.
prism-ml/Bonsai-8B-gguf · Hugging Face
has anyone tested this thing?
denoflore_ai_guy@reddit
Said it elsewhere. The whitepaper is deliberately vague on the actual compression method - they call it “proprietary Caltech IP” and “mathematically grounded advances” without publishing the technique.
So you can use the models but you can’t reproduce the compression pipeline.
No native 1-bit hardware exists yet, so the speed gains come purely from software kernel optimizations on standard GPUs.
OmarBessa@reddit (OP)
that's a necessity nowadays, every lab with tons more gpus is looking for alpha to justify capex
no matter who they have to step on
denoflore_ai_guy@reddit
Wait till they find out you can hit SOTA in self-directed compound learning with a P40, a sketchy PCIe 3.0 riser cable, a no-name 450W ATX PSU, and a server running RAM and CPUs from 2-3 years before Attention Is All You Need was even published. Arch is the moat not the compression pipeline. 🤷♂️
Silver-Champion-4846@reddit
How do you do that?
denoflore_ai_guy@reddit
I’m really smart and good at math and science without the benefit of institutional dogmatic indoctrination leading to specialization/the need to only care about getting tenure.
Silver-Champion-4846@reddit
Any results we can test?
denoflore_ai_guy@reddit
Oh and I’m driven by spite and the fear of homelessness.
Look_0ver_There@reddit
Kind of reminds me of the Microsoft "1-bit" models. There's a good video explaining them here: https://youtu.be/WBm0nyDkVYM?si=d6fWhRmlcHJ6sOhn
Technically the MS versions are 1.58 bit, because they encode -1, 0, and 1, unlike Bonsai, which is just -1 and 1. The video I linked to explains why having at least 3 values is better than just 2.
So, this sort of thing seems to have been done before, but it looks like prism-ml is picking up the torch that MS dropped.
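For intuition, here's a rough per-tensor sketch of the two schemes in numpy. The ternary branch follows the absmean recipe described in the BitNet b1.58 paper; the binary branch is just sign-plus-scale. Neither is necessarily what Bonsai actually does, since the pipeline isn't published:

```python
import numpy as np

def binarize(w: np.ndarray) -> np.ndarray:
    """1-bit: every weight becomes -1 or +1, with one shared scale per tensor."""
    scale = np.abs(w).mean()
    return scale * np.sign(w)

def ternarize(w: np.ndarray) -> np.ndarray:
    """1.58-bit: weights become -1, 0, or +1; near-zero weights get zeroed out."""
    scale = np.abs(w).mean()
    q = np.clip(np.round(w / (scale + 1e-8)), -1, 1)  # values in {-1, 0, +1}
    return scale * q

w = np.random.randn(4, 4).astype(np.float32)
print(binarize(w))   # the sign carries all the information
print(ternarize(w))  # the extra 0 lets the model effectively prune weak connections
```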
live_love_laugh@reddit
Yeah, I don't understand why they didn't go for ternary. Having the 0 helps a ton AFAIK.
kaeptnphlop@reddit
You have to completely train ternary models from scratch. That is not what is happening here. This is a new quantization approach from my understanding. The model is based on Qwen3 8B IIRC
lemon07r@reddit
I think a proper 1.58 bit version of qwen 3.5 27b would be super cool. 1 bit might be a bit too neutered.
ParticularSoftware28@reddit
Quants this low and strange usually need to be integrated during training. Losses, optimizer etc. need to adapt to make it work afaik. So at least some fine-tuning stage would be needed. One could finetune on the f16 outputs of the same model here tho.
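For reference, the usual way to make training-time adaptation work at these bit widths is a straight-through estimator over full-precision "shadow" weights. A generic PyTorch sketch of that pattern (standard QAT, not prism-ml's actual pipeline), on top of which a distillation loss against the f16 teacher could be added:

```python
import torch
import torch.nn.functional as F

class SignSTE(torch.autograd.Function):
    """Binarize to {-a, +a} on the forward pass, pass gradients straight through."""
    @staticmethod
    def forward(ctx, w):
        return w.abs().mean() * torch.sign(w)  # one scale per tensor

    @staticmethod
    def backward(ctx, grad_output):
        # Pretend the quantizer was the identity, so the fp shadow weights keep learning.
        return grad_output

def binary_linear(x, w_fp, bias=None):
    # The optimizer updates w_fp; only its binarized copy is used in the matmul.
    return F.linear(x, SignSTE.apply(w_fp), bias)
```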
OmarBessa@reddit (OP)
Ok.
Ran some simple benches.
* hallucinates some simple country information
* can't pass the strawberry test
* can count words
* can do multi-digit addition
* can write small stories
* can do FizzBuzz
I'm not disappointed at all. I'm actually surprised that this thing works.
GravitasIsOverrated@reddit
IMO this is completely meaningless as a metric. The LLM never sees characters, so you're only testing its ability to memorize the answer to this question rather than any sort of reasoning.
Also IMO this is a bad metric for small LLMs in 2025 - they'll never have "enough" built-in knowledge to be reliable. Rather, for these tiny LLMs, the ability to use tools and understand their output is way more interesting to me.
Educational_Win_2982@reddit
The strawberry test is a fair metric; there are other approaches with byte-level / character accuracy, such as the Byte Latent Transformer (BLT).
GravitasIsOverrated@reddit
But that’s not something you benchmark. You can report the tokenizer type, but benchmarking on “the strawberry test” when the model doesn’t have a byte/character level tokenizer is just measuring whether the model has memorized the answer to that specific benchmark. It’s not measuring anything useful at that point.
Educational_Win_2982@reddit
That's fair. I suppose a better benchmark would be esoteric programming languages with randomized single character syntax and no whitespace, where the model is given the syntax in context and has to create or debug a program.
GravitasIsOverrated@reddit
But what is that actually telling you, the user? Is that representative of the type of tasks you want to use LLMs for?
The problems most people hit day-to-day are not caused by the choice to use multi-character tokenizers (if they were you’d see SOTA LLMs trained with character or byte level tokenizers), so gotcha benchmarks that test character level handling are meaningless.
Educational_Win_2982@reddit
It's hard for me to agree, but most benchmarks are superficial to begin with. Also, we're supposed to be researching towards "AGI", but if a model can't reason at a byte level then it's going to struggle at edge cases regardless of how much you train it. It's exactly as you say, a model is guessing at strawberry-like questions. But how much is a model guessing at other things as well? What if models attempting to use command line interfaces are struggling with edge cases because some CLI commands use single-character arguments (like -c, -s, +x, etc.)? What if they're struggling with some math problems because the tokenizer is splitting the numbers down the middle, creating two or more tokens for one value and making the model treat it as two discrete values? That's even ignoring algebra and calculus, which also rely on single-character syntax a lot of the time. And that's ignoring creative writing, poetry and puns, which can be phonetic, which is almost a byte-level problem (at the very least you'd need a perfect tokenizer that splits phonetically, otherwise you lose meaning). It's perhaps not important for a model to write good jokes, puns or sarcasm, but it's important for it to be able to detect and reason over them.
Overall I think byte-level reasoning is important and should have been used early on, and my proposed byte-level benchmark would be reasonable enough to see generalized reasoning.
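For what it's worth, the splitting is easy to inspect with any BPE vocabulary. A quick illustration using tiktoken's cl100k_base encoding purely as a stand-in (whatever tokenizer Bonsai actually inherits will split differently, so treat the exact pieces as illustrative only):

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # GPT-4's BPE, used here only as an example

for text in ["strawberry", "3141592653", "tar -czvf backup.tar.gz"]:
    ids = enc.encode(text)
    pieces = [enc.decode([i]) for i in ids]
    print(f"{text!r} -> {pieces}")

# Whatever the exact splits look like, the model only ever receives these chunks
# as opaque IDs, never the individual characters or digits.
```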
Whole-Remove-3120@reddit
Does it get dramatically more tokens per second than other models with similar parameter count?
Alarming-Ad8154@reddit
It does on my Mac/iPhone
OmarBessa@reddit (OP)
not sure, tested from Colab
AppealSame4367@reddit
I think if they refine it for real use cases this could be great.
OmarBessa@reddit (OP)
small story written by it:
Once upon a time, in a quiet village nestled between misty hills, there lived a young girl named Lila. She was known for her curious mind and a heart that beat faster than the wind when she heard a story.
One morning, as the sun peeked over the horizon, Lila found an old, dusty book in the village library. Its cover was cracked and faded, but the title was still legible: *The Lantern of Eternity*.
The book was written in a language long forgotten, but Lila could still read it. Inside, it told the tale of a girl just like her, who discovered a magical lantern that could grant one wish. The lantern, however, had a twist—it could only grant the wish of the person who truly needed it.
Lila’s eyes sparkled with wonder. She couldn’t wait to find the lantern.
She spent weeks searching the village, asking elders, and following clues left in the pages. Finally, she found the lantern hidden in the roots of an ancient tree, glowing faintly with a golden light.
With a deep breath, Lila stepped forward and whispered her wish: *"I wish for a world where kindness is never forgotten."*
The lantern pulsed once, then dimmed. Lila looked around and saw something different. The villagers began to smile more, help each other more, and share stories with genuine warmth. The town, once quiet and solitary, now buzzed with life and connection.
From that day on, Lila became the village’s storyteller. She never spoke of the lantern again, but the warmth it brought lingered in every heart.
And so, the story of Lila and the Lantern of Eternity was passed down, a reminder that even the smallest wish can change the world.
**The End.**
Long_Homework3634@reddit
Explained and with live test here: https://youtu.be/0fWFetwHkVE?is=0pEfTPy22ubDiyzJ
cnmoro@reddit
Tried it, it's really fast, solid performance
xandep@reddit
Just because YOU said it works, I believe. Otherwise, it's April Fools. 🤔
IrisColt@reddit
h-heh
DangerousSetOfBewbs@reddit
Explain or you sound like an employee
cr0wburn@reddit
How did you test it? I tried with llama.cpp (latest as of now) and I got a weird error.
cnmoro@reddit
On Hugging Face they have the link to their llama.cpp fork that is compatible
Fireflykid1@reddit
It’s running on a forked version atm
42GOLDSTANDARD42@reddit
I don't get the hype, their own Hugging Face page shows the 8B as barely better than Qwen3 1.7B
sudochmod@reddit
They also might not have the same level of training data? Idk
Alarming-Ad8154@reddit
They don't say they're training from scratch anywhere; I suspect it's a Qwen3 quant that's finetuned/trained to retain quality at a lower-bit quant…
Basic_Extension_5850@reddit
Remember that it's 1/16 the size of a model trained in fp16, and (I'm assuming) the company has far less funding and compute than Qwen does
kulchacop@reddit
I think everyone is excited because 1 bit models will be faster on cheap hardware.
Their 1-bit 8B might be faster than a quantized Qwen3 1.7B while both occupy ≈ 1 GB of VRAM.
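Napkin math on the footprint (weights only, ignoring KV cache and activations), just to show where the ≈ 1 GB figure comes from:

```python
params = 8e9  # Bonsai-8B parameter count
for name, bits in [("fp16", 16), ("int4", 4), ("1.58-bit ternary", 1.58), ("1-bit", 1)]:
    gib = params * bits / 8 / 2**30  # bits -> bytes -> GiB
    print(f"{name:>16}: ~{gib:4.1f} GiB of weights")
# fp16 ~14.9, int4 ~3.7, ternary ~1.5, 1-bit ~0.9 GiB (hence "fits in about 1 GB")
```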
a_beautiful_rhind@reddit
To their credit, they tried. The architecture could be viable even if they turn out to be bad at making models, or at least at benchmaxxing.
working_too_much@reddit
I tried loading it in LM Studio and I get errors for both the MLX and GGUF versions of Bonsai 8B from Prism-ML.
GGUF version error:
```
🥲 Failed to load the model
Error loading model.
(Exit code: null). Please check settings and try loading the model again.
```
MLX version error:
```
🥲 Failed to load the model
Failed to load model.
Error when loading model: ValueError: [quantize] The requested number of bits 1 is not supported. The supported bits are 2, 3, 4, 5, 6 and 8.
```
Iory1998@reddit
LM Studio is using an older version of llama.cpp for now. You need to wait for an updated llama.cpp.
uti24@reddit
So they claim their 8B model (8B bits ≈ 1 GB) is on par with modern 8B unquantized models, that's interesting.
ILoveMy2Balls@reddit
I don't think they claimed it to be on par with other 8B models; it is comparable to Qwen3 1.7B but a lot smaller
aaronr_90@reddit
I wouldn't say "on par", but in the same ballpark. According to the chart at the top of the model card, Ministral 3B just barely averages better than 1-bit Bonsai 8B in the benchmarks. Qwen3-4B and Qwen3-8B are just slightly ahead.
Oatilis@reddit
This will be great if you can fine-tune it for specific purposes, e.g. an appliance SLM, and I'd love to benchmark it. A first look at the repo doesn't mention anything regarding training. Worth looking into when I have some time.
Positive-Stock6444@reddit
Curious how a larger-parameter 1-bit model would do. The intelligence-density metric is interesting.
Educational_Mud4588@reddit
Wow.. this thing works... Thank you for posting, I would have never seen this.
Arrowstar@reddit
I tried to load it in LM Studio but I got an error:
```
Error loading model.
(Exit code: 18446744072635810000). Unknown error. Try a different model and/or config.
```
XccesSv2@reddit
Read the model card properly, there are forks for this.
Frosty_Chest8025@reddit
we tried to test it, but it got afraid of the testing stick to its nose