You guys seen this? 1-bit model with an MMLU-R of 65.7, 8B params
Posted by OmarBessa@reddit | LocalLLaMA | 49 comments
This is nuts.
prism-ml/Bonsai-8B-gguf · Hugging Face
has anyone tested this thing?
denoflore_ai_guy@reddit
Said it elsewhere. The whitepaper is deliberately vague on the actual compression method - they call it “proprietary Caltech IP” and “mathematically grounded advances” without publishing the technique.
So you can use the models but you can’t reproduce the compression pipeline.
No native 1-bit hardware exists yet, so the speed gains come purely from software kernel optimizations on standard GPUs.
OmarBessa@reddit (OP)
that's a necessity nowadays, every lab with tons more gpus is looking for alpha to justify capex
no matter who they have to step on
denoflore_ai_guy@reddit
Wait till they find out you can hit SOTA in self-directed compound learning with a P40, a sketchy PCIe 3.0 riser cable, a no-name 450W ATX PSU, and a server running RAM and CPUs from 2-3 years before Attention Is All You Need was even published. Arch is the moat not the compression pipeline. 🤷♂️
Silver-Champion-4846@reddit
How do you do that?
denoflore_ai_guy@reddit
I’m really smart and good at math and science without the benefit of institutional dogmatic indoctrination leading to specialization/the need to only care about getting tenure.
Silver-Champion-4846@reddit
Any results we can test?
denoflore_ai_guy@reddit
Oh and I’m driven by spite and the fear of homelessness.
Look_0ver_There@reddit
Kind of reminds me of the Microsoft "1-bit" models. There's a good video explaining them here: https://youtu.be/WBm0nyDkVYM?si=d6fWhRmlcHJ6sOhn
Technically the MS versions are 1.58 bit, because they encode -1, 0, and 1, unlike Bonsai, which is just -1 and 1. The video I linked to explains why having at least 3 values is better than just 2.
So, this sort of thing seems to have been done before, but it looks like prism-ml is picking up the torch that MS dropped.
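For intuition, here's a rough per-tensor sketch of the two schemes in numpy. The ternary branch follows the absmean recipe described in the BitNet b1.58 paper; the binary branch is just sign-plus-scale. Neither is necessarily what Bonsai actually does, since the pipeline isn't published:

```python
import numpy as np

def binarize(w: np.ndarray) -> np.ndarray:
    """1-bit: every weight becomes -1 or +1, with one shared scale per tensor."""
    scale = np.abs(w).mean()
    return scale * np.sign(w)

def ternarize(w: np.ndarray) -> np.ndarray:
    """1.58-bit: weights become -1, 0, or +1; near-zero weights get zeroed out."""
    scale = np.abs(w).mean()
    q = np.clip(np.round(w / (scale + 1e-8)), -1, 1)  # values in {-1, 0, +1}
    return scale * q

w = np.random.randn(4, 4).astype(np.float32)
print(binarize(w))   # the sign carries all the information
print(ternarize(w))  # the extra 0 lets the model effectively prune weak connections
```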
live_love_laugh@reddit
Yeah, I don't understand why they didn't go for ternary. Having the 0 helps a ton AFAIK.
kaeptnphlop@reddit
You have to completely train ternary models from scratch. That is not what is happening here. This is a new quantization approach from my understanding. The model is based on Qwen3 8B IIRC
lemon07r@reddit
I think a proper 1.58 bit version of qwen 3.5 27b would be super cool. 1 bit might be a bit too neutered.
ParticularSoftware28@reddit
Quants this low and strange usually need to be integrated during training. Losses, optimizer etc. need to adapt to make it work afaik. So at least some fine-tuning stage would be needed. One could finetune on the f16 outputs of the same model here tho.
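For reference, the usual way to make training-time adaptation work at these bit widths is a straight-through estimator over full-precision "shadow" weights. A generic PyTorch sketch of that pattern (standard QAT, not prism-ml's actual pipeline), on top of which a distillation loss against the f16 teacher could be added:

```python
import torch
import torch.nn.functional as F

class SignSTE(torch.autograd.Function):
    """Binarize to {-a, +a} on the forward pass, pass gradients straight through."""
    @staticmethod
    def forward(ctx, w):
        return w.abs().mean() * torch.sign(w)  # one scale per tensor

    @staticmethod
    def backward(ctx, grad_output):
        # Pretend the quantizer was the identity, so the fp shadow weights keep learning.
        return grad_output

def binary_linear(x, w_fp, bias=None):
    # The optimizer updates w_fp; only its binarized copy is used in the matmul.
    return F.linear(x, SignSTE.apply(w_fp), bias)
```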
OmarBessa@reddit (OP)
Ok.
Ran some simple benches.
* hallucinates some simple country information
* can't pass the strawberry test
* can count words
* can do multi-digit addition
* can write small stories
* can do FizzBuzz
I'm not disappointed at all. I'm actually surprised that this thing works.
GravitasIsOverrated@reddit
IMO this is completely meaningless as a metric. The LLM never sees characters, so you're only testing its ability to memorize the answer to this question rather than any sort of reasoning.
Also IMO this is a bad metric for small LLMs in 2025 - they'll never have "enough" built-in knowledge to be reliable. Rather, for these tiny LLMs, the ability to use tools and understand their output is way more interesting to me.
Educational_Win_2982@reddit
The strawberry test is a fair metric; there are other approaches with byte-level / character accuracy, such as the Byte Latent Transformer (BLT).
GravitasIsOverrated@reddit
But that’s not something you benchmark. You can report the tokenizer type, but benchmarking on “the strawberry test” when the model doesn’t have a byte/character level tokenizer is just measuring whether the model has memorized the answer to that specific benchmark. It’s not measuring anything useful at that point.
Educational_Win_2982@reddit
That's fair. I suppose a better benchmark would be esoteric programming languages with randomized single character syntax and no whitespace, where the model is given the syntax in context and has to create or debug a program.
GravitasIsOverrated@reddit
But what is that actually telling you, the user? Is that representative of the type of tasks you want to use LLMs for?
The problems most people hit day-to-day are not caused by the choice to use multi-character tokenizers (if they were you’d see SOTA LLMs trained with character or byte level tokenizers), so gotcha benchmarks that test character level handling are meaningless.
Educational_Win_2982@reddit
It's hard for me to agree, but most benchmarks are superficial to begin with. Also, we're supposed to be researching towards "AGI", but if a model can't reason at a byte level then it's going to struggle at edge cases regardless of how much you train it. It's exactly as you say, a model is guessing at strawberry-like questions. But how much is a model guessing at other things as well? What if models attempting to use command line interfaces are struggling with edge cases because some CLI commands use single-character arguments (like -c, -s, +x, etc.)? What if they're struggling with some math problems because the tokenizer is splitting the numbers down the middle, creating two or more tokens for one value and making the model treat it as two discrete values? That's even ignoring algebra and calculus, which also rely on single-character syntax a lot of the time. And that's ignoring creative writing, poetry and puns, which can be phonetic, which is almost a byte-level problem (at the very least you'd need a perfect tokenizer that splits phonetically, otherwise you lose meaning). It's perhaps not important for a model to write good jokes, puns or sarcasm, but it's important for it to be able to detect and reason over them.
Overall I think byte-level reasoning is important and should have been used early on, and my proposed byte-level benchmark would be reasonable enough to see generalized reasoning.
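For what it's worth, the splitting is easy to inspect with any BPE vocabulary. A quick illustration using tiktoken's cl100k_base encoding purely as a stand-in (whatever tokenizer Bonsai actually inherits will split differently, so treat the exact pieces as illustrative only):

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # GPT-4's BPE, used here only as an example

for text in ["strawberry", "3141592653", "tar -czvf backup.tar.gz"]:
    ids = enc.encode(text)
    pieces = [enc.decode([i]) for i in ids]
    print(f"{text!r} -> {pieces}")

# Whatever the exact splits look like, the model only ever receives these chunks
# as opaque IDs, never the individual characters or digits.
```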
Whole-Remove-3120@reddit
Does it get dramatically more tokens per second than other models with similar parameter count?
Alarming-Ad8154@reddit
It does on my Mac/iPhone
OmarBessa@reddit (OP)
not sure, tested from Colab
AppealSame4367@reddit
I think if they refine it for real use cases this could be great.
OmarBessa@reddit (OP)
small story written by it:
Once upon a time, in a quiet village nestled between misty hills, there lived a young girl named Lila. She was known for her curious mind and a heart that beat faster than the wind when she heard a story.
One morning, as the sun peeked over the horizon, Lila found an old, dusty book in the village library. Its cover was cracked and faded, but the title was still legible: *The Lantern of Eternity*.
The book was written in a language long forgotten, but Lila could still read it. Inside, it told the tale of a girl just like her, who discovered a magical lantern that could grant one wish. The lantern, however, had a twist—it could only grant the wish of the person who truly needed it.
Lila’s eyes sparkled with wonder. She couldn’t wait to find the lantern.
She spent weeks searching the village, asking elders, and following clues left in the pages. Finally, she found the lantern hidden in the roots of an ancient tree, glowing faintly with a golden light.
With a deep breath, Lila stepped forward and whispered her wish: *"I wish for a world where kindness is never forgotten."*
The lantern pulsed once, then dimmed. Lila looked around and saw something different. The villagers began to smile more, help each other more, and share stories with genuine warmth. The town, once quiet and solitary, now buzzed with life and connection.
From that day on, Lila became the village’s storyteller. She never spoke of the lantern again, but the warmth it brought lingered in every heart.
And so, the story of Lila and the Lantern of Eternity was passed down, a reminder that even the smallest wish can change the world.
**The End.**
Long_Homework3634@reddit
Explained and with live test here: https://youtu.be/0fWFetwHkVE?is=0pEfTPy22ubDiyzJ
cnmoro@reddit
Tried it, it's really fast, solid performance
xandep@reddit
Just because YOU said it works, I believe. Otherwise, it's April Fools. 🤔
IrisColt@reddit
h-heh
DangerousSetOfBewbs@reddit
Explain or you sound like an employee
cr0wburn@reddit
How did you test it? I tried with llama.cpp (latest as of now) and I got a weird error.
cnmoro@reddit
On Hugging Face they have the link to their llama.cpp fork that is compatible
Fireflykid1@reddit
It’s running on a forked version atm
42GOLDSTANDARD42@reddit
I don't get the hype, their own Hugging Face page shows the 8B as barely better than Qwen3 1.7B
sudochmod@reddit
They also might not have the same level of training data? Idk
Alarming-Ad8154@reddit
They don't say they're training from scratch anywhere; I suspect it's a Qwen3 quant that's finetuned/trained to retain quality at a lower-bit quant…
Basic_Extension_5850@reddit
Remember that it's 1/16 the size of a model trained in fp16, and (I'm assuming) the company has far less funding and compute than Qwen does
kulchacop@reddit
I think everyone is excited because 1 bit models will be faster on cheap hardware.
Their 1-bit 8B might be faster than a quantized Qwen3 1.7B while both occupy ≈ 1 GB of VRAM.
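Napkin math on the footprint (weights only, ignoring KV cache and activations), just to show where the ≈ 1 GB figure comes from:

```python
params = 8e9  # Bonsai-8B parameter count
for name, bits in [("fp16", 16), ("int4", 4), ("1.58-bit ternary", 1.58), ("1-bit", 1)]:
    gib = params * bits / 8 / 2**30  # bits -> bytes -> GiB
    print(f"{name:>16}: ~{gib:4.1f} GiB of weights")
# fp16 ~14.9, int4 ~3.7, ternary ~1.5, 1-bit ~0.9 GiB (hence "fits in about 1 GB")
```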
a_beautiful_rhind@reddit
To their credit, they tried. The architecture could be viable even if they turn out to be bad at making models, or at least at benchmaxxing.
working_too_much@reddit
I tried loading it in LM Studio and I get errors for both the MLX and GGUF versions of Bonsai 8B from Prism-ML.
GGUF version error:
```
🥲 Failed to load the model
Error loading model.
(Exit code: null). Please check settings and try loading the model again.
```
MLX version error:
```
🥲 Failed to load the model
Failed to load model.
Error when loading model: ValueError: [quantize] The requested number of bits 1 is not supported. The supported bits are 2, 3, 4, 5, 6 and 8.
```
Iory1998@reddit
LM Studio is using an older version of llama.cpp for now. You need to wait for an updated llama.cpp.
uti24@reddit
So they claim their 8B model (8B bits ≈ 1 GB) is on par with modern 8B unquantized models, that's interesting.
ILoveMy2Balls@reddit
I don't think they claimed it to be on par with other 8B models; it is comparable to Qwen3 1.7B but a lot smaller
aaronr_90@reddit
I wouldn't say "on par", but in the same ballpark. According to the chart at the top of the model card, Ministral 3B just barely averages better than 1-bit Bonsai 8B in the benchmarks. Qwen3-4B and Qwen3-8B are just slightly ahead.
Oatilis@reddit
This will be great if you can fine-tune it for specific purposes, e.g. an appliance SLM, and I'd love to benchmark it. A first look at the repo doesn't mention anything regarding training. Worth looking into when I have some time.
Positive-Stock6444@reddit
Curious how a larger-parameter 1-bit model would do. The intelligence-density metric is interesting.
Educational_Mud4588@reddit
Wow.. this thing works... Thank you for posting, I would have never seen this.
Arrowstar@reddit
I tried to load it in LM Studio but I got an error:
```
Error loading model.
(Exit code: 18446744072635810000). Unknown error. Try a different model and/or config.
```
XccesSv2@reddit
Read the model card properly, there are forks for this.
Frosty_Chest8025@reddit
we tried to test it, but it got afraid of the testing stick to its nose