PrismML — Announcing 1-bit Bonsai: The First Commercially Viable 1-bit LLMs
Posted by brown2green@reddit | LocalLLaMA | View on Reddit | 178 comments
rkbala@reddit
I have an edge device (AMD Ryzen 7 AI laptop). Will it work? What I see in their llama.cpp fork is CUDA only. I am a noob. Any suggestions pls
Disonantemus@reddit
https://old.reddit.com/r/LocalLLaMA/comments/1s90wo4/prismml_announcing_1bit_bonsai_the_first/ofdfrha/
Languages_Learner@reddit
Just found official development branch for cpu inference - PrismML-Eng/llama.cpp at q1-cpu
rkbala@reddit
Thanks a lot. Will check it out
Languages_Learner@reddit
Here's a llama.cpp fork that seems to fix the bug with CPU-only inference: philtomson/llama.cpp: LLM inference in C/C++ (a fork of the PrismML fork that enables CPU (incl. AVX2 and AVX512) and ROCm for AMD GPUs)
Languages_Learner@reddit
Here's the answer for your question - No cpu-only build · Issue #6 · PrismML-Eng/Bonsai-demo
Disonantemus@reddit
System
shockwaverc13@reddit
am i the only one who gets extremely slow CPU performance?
build/bin/llama-bench -m models/Bonsai-8B.gguf -r 1 -p 8 -n 8 --mmap 1
| model | size | params | backend | threads | mmap | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | ---: | --------------: | -------------------: |
| qwen3 8B Q1_0_g128 | 1.07 GiB | 8.19 B | CPU | 4 | 1 | pp8 | 0.36 ± 0.00 |
| qwen3 8B Q1_0_g128 | 1.07 GiB | 8.19 B | CPU | 4 | 1 | tg8 | 0.29 ± 0.00 |
build: 1179bfc82 (8194)
Disonantemus@reddit
Fun-Property-5964@reddit
Me too, I'm having the same problem. Have you found a solution for it?
shockwaverc13@reddit
there are other quants of it that seem to work the same and are faster (but are bigger in size) https://huggingface.co/lilyanatia/Bonsai-8B-requantized
or you can use forks of the prismml fork that implement AVX and Vulkan https://github.com/PrismML-Eng/llama.cpp/pulls
brown2green@reddit (OP)
From the X post:
Aaaaaaaaaeeeee@reddit
Is it a binary QAT (-1,+1), not ternary (-1,0,+1)?
brown2green@reddit (OP)
Just binary, it seems.
DistanceSolar1449@reddit
It’s probably 0/1 and not -1/1. I doubt you can make an LLM work without multiplying a lot of tensors by 0.
That’s still fucking insane. I’m mindblown that activations can be just binary and still work. Usually you NEED -1/0/1. BitNet, for example, is ternary (1.58-bit) and not 1-bit.
Party-Special-5177@reddit
It is both: the original BitNet paper was a binary model ("BitNet: Scaling 1-bit Transformers for Large Language Models", 2023 iirc).
They came back a year later and said the models improved significantly when made ternary (-1, 0, 1), to the point that they could compete with the models they were quanted from ("The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits", 2024).
SHOR-LM@reddit
Both "work", but the 0 state is required if you want sparsity in a model, as far as I know. That's why they changed it.
Party-Special-5177@reddit
It isn’t for sparsity, it improves the representational capacity of a bitlinear layer.
The rest of your comment makes little sense to me. The weights aren’t ‘forced to choose’; the optimizer optimizes a set of master weights in full precision, which bitnets then round to binary or ternary. That’s it.
SHOR-LM@reddit
I was conflating representational capacity with sparsity, and the deterministic rounding is the forced choice: "what does the weight mean", A or B? That's limited representation. If it were as simple as what you're saying, Microsoft wouldn't have backtracked on their decision.
Alarming-Ad8154@reddit
According to the whitepaper its -1/1… pretty insane it’s this good (or very benchmaxed??)
Alarming-Ad8154@reddit
It’s actually -1/1 scaled by a 16-bit scaling factor shared by 128 weights. Also since they don’t describe any of their training I am near certain it’s quantization or quantization + finetuning not base training…
HopePupal@reddit
yeah i noticed that too. they handwaved with "proprietary Caltech IP", which honestly is not surprising for a university spinoff, but looking for recent patents from people at Prism or Hassibi's lab, i immediately found
US-20250348715-A1: "Response-Adaptive Calibration for Post-Training Quantization of Large Language Models", Edalati, Ali et al.
kaibee@reddit
Plasmx@reddit
I guess nobody was crazy enough to write billions of if branches before. They could have invented modern AI models. /s
Top-Handle-5728@reddit
Where did they, or even BitNet, claim that the activations are binary? Isn't it more about the weights? Assuming strictly 0/1 weights, how do you make the signal negative to suppress the activation? If they are truly using 0/1 without a -1 'inhibit' state, they'd have to rely heavily on Biases or Normalization layers to shift the signal into the negative range, which technically means those higher precision 'escape hatches' still exist in the norm layers.
MaraPapewaio@reddit
BitNet b1.58 2B4T from Microsoft proved this is a highly viable option compared to full-precision FP16.
The 0 state changes everything. It's an "I don't know" option.
Moreover, Microsoft trained this model natively like this.
So maybe you, at Prism, could give it a try.
This may allow the model not to "make up" answers like its big brothers do.
Thanks for the share. I'm delighted to see this technology come to life.
Small is beautiful. Running on consumer hardware is the key to full integration.
Keep up the good work, and thanks for the open source release.
SHOR-LM@reddit
They dropped the zero state and went pure binary? How is that not a MAJOR step backward from the BitNet b1.58 architecture? The zero is supposed to give sparsity. Does this mean every connection would be forced to have an "opinion"? I wonder about the implications. This can't be the whole story. How can it isolate a "this thing doesn't matter" situation in a user's prompt?
CryptoUsher@reddit
1-bit models sound wild, but i'm curious how they handle edge cases without falling off a cliff in accuracy.
have you tested on tasks that require nuanced reasoning, or does the compression favor speed over depth?
SHOR-LM@reddit
It lacks overall nuance but it's actually really good for what they've done to it. I'm actually surprised
l33tkvlthax42069@reddit
Given that you posted this when there were less than 20 downloads, I'll assume you are part of the team? Impressed with the llama cpp performance and output quality. MLX auto install did not work on Sequoia, but will try when I have more than 2 minutes later...
Hoping that batching is viable, super interested to see how this develops!
brown2green@reddit (OP)
No, I just saw the announcement on X and posted it here.
cafedude@reddit
1-bit models... wouldn't these be well-suited for running on an FPGA?
ObviousLavishness537@reddit
I'm not very familiar with LLMs either, but I was thinking the same thing: maybe just passing the signals through FPGA flip-flops would be enough.
X3liteninjaX@reddit
We got LLMs made of booleans now /s
cafedude@reddit
I mean, if they're 1-bit end-to-end as they say then how are they not boolean? Could these models be converted to logic gate networks somehow? (something like difflogic: https://github.com/Felix-Petersen/difflogic )
Icy_Butterscotch6661@reddit
Isn't a bool 8 bits in most languages?
AdventurousFly4909@reddit
Bro
Icy_Butterscotch6661@reddit
idk i mean a boolean variable takes up a whole byte (even if only one bit inside is flipped) in all languages i've worked with.
guess it doesn't really matter for the discussion at hand so idk why I even said that
AdventurousFly4909@reddit
Yeah, I get where the confusion comes from. After all, the lowest unit the code can operate on is a byte. All instructions (I think; maybe there exists a super weird instruction?) operate at the byte level. Most likely multiple weights are packed into one byte.
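As a toy illustration of that packing (my own encoding choice, not the actual GGUF layout), eight ±1 weights fit in one byte:

```python
import numpy as np

def pack_signs(signs):
    # Map -1 -> 0 and +1 -> 1, then pack 8 sign bits per byte.
    bits = (np.asarray(signs) > 0).astype(np.uint8)
    return np.packbits(bits)

def unpack_signs(packed, n):
    # Unpack the bits and map them back to -1/+1 values.
    bits = np.unpackbits(packed)[:n]
    return np.where(bits == 1, 1, -1).astype(np.int8)

signs = [1, -1, -1, 1, 1, 1, -1, 1]
packed = pack_signs(signs)          # 8 weights stored in a single byte
restored = unpack_signs(packed, 8)  # round-trips back to the signs
```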
VolkoTheWorst@reddit
I'm currently working on an implementation of an AI network on FPGA
Several-Tax31@reddit
What is the max parameter count model a FPGA can run? 100B? 1B? Less?
VolkoTheWorst@reddit
Technically nothing prevents you from running a 100B or more. It's just gonna probably require a custom made insanely big/expensive FPGA and run very slowly
VolkoTheWorst@reddit
Depends on which FPGA you have. My work is on a very small AI niche, we will have like 1k neurons so not a lot. And we're already limited by the BRAM size. But we are at the start of the project so we might find workarounds. We are using 7000s FPGAs
Leo_hofstadter@reddit
I have been thinking about when we will have FPGAs made for LLMs (expensive to develop and earn ROI on), but something close to that is what the company Groq is doing (at least that's my vague understanding): they sell crazy fast inference, as their chips are tuned for inference only!
Plasmx@reddit
There is an ASIC Llama chip, Taalas HC1. It’s just a very specific use case since there is no way you can change the model.
Leo_hofstadter@reddit
Interesting. I read that they are already in business with a lot of clients somehow. Wouldn't it be a shame that you can't upgrade the LLM? Almost like throwaway/burner phones. What's the concept here that lets them do it profitably?
Plasmx@reddit
They are insanely fast and energy efficient. It is like 17k TPS at 700 W. There are use cases where it is feasible to choose a model and stay with it for a while when inference is cheap. Whether that model should be Llama 3.1 is another question, but I think they mainly wanted a proof of concept.
randylush@reddit
There are tons of chips that are made specifically for ML inference. The device you’re using to read this comment probably has an ML accelerator built in. So far FPGAs have only been useful for prototyping; it’s always more efficient to run a workload on a bespoke chip than an FPGA.
CaptBrick@reddit
Actually it’s always been that way, it’s all 1 and 0 at hardware level
Lucky-Necessary-8382@reddit
Before GTA IV
INtuitiveTJop@reddit
Hey, isn’t this a lot easier to place on an asic with the fact that it’s all 0s and 1s?
Legitimate-Handle390@reddit
Taalas is on a mission. I'm waiting for Cerebras to acquire Taalas. Imagine a wafer-scale ASIC where memory = compute.
INtuitiveTJop@reddit
I know, this is the future. I cannot yet comprehend a Claude level asic running at 15k tokens a second
cafedude@reddit
...or an FPGA.
fotcorn@reddit
Also works on ROCM.
Getting roughly 150 t/s generation on my 9070 XT for the 8B model.
Output is hard to judge, but seeing 1bit working at all is already impressive, especially because it sounds like it was quantized from Qwen3.5, and not retrained from scratch like the BitNet 1.58 models
alpay_kasal@reddit
Hey u/fotcorn was the output from ROCM garbage output like someone reported on CPU? or did it look somewhat useful?
fotcorn@reddit
No, it worked fine on GPU, both 1.7B and 8B. Not very intelligent/knowledgeable, but that is expected.
CPU took forever to load and then only produced garbage output. From reading the PR in llama.cpp, it was only tested on ARM CPUs, so not surprising it's broken on x86.
alpay_kasal@reddit
That's amazing to hear!!! Thank you. One of the guys on the PrismML Discord said they only implemented a CUDA backend, so they'll be surprised to hear it works. Thx again for the speedy reply.
OkSun5433@reddit
which llama.cpp build and model did you use? the model won't load for me using windows HIP build
fotcorn@reddit
They have their own fork: https://github.com/PrismML-Eng/llama.cpp
They say only cuda/metal support, but HIP build worked just fine. Using ROCM 7.12 preview.
OkSun5433@reddit
thanks
madtopo@reddit
Do you have it in you to test the setup against a harness like opencode, pi or charm? It'd be nice to know how it performs on agentic coding tasks
lemon07r@reddit
There is no 8b qwen 3.5 model. it's a qwen 3 model.
Worried_Drama151@reddit
Thx bro, we are all pretty dense here, so people couldn't infer that maybe he meant 9B and fat-fingered it, or meant 8B on Qwen 3
lemon07r@reddit
type shit brother
Interpause@reddit
gimme a while, I'm gonna squash their llama.cpp changes on top of llama.cpp and see if it really works, cuz that's real crazy if it does
-dysangel-@reddit
I just tried it on their mlx fork - it works.
zh1412@reddit
how did you install? conda then pip? I tried following their installation guide and it failed
-dysangel-@reddit
I had Claude set it up. Yeah I think pip wasn't working - in the end I had to download the xcode metal compiler and build their custom mlx
zh1412@reddit
got it thanks!
-dysangel-@reddit
I seriously doubt the performance is going to match 8b f16 models as they claim, but it's good to see 1 bit models making progress
audioen@reddit
They didn't bother to place regular models and normal PTQ methods in the same picture when they posted this: https://huggingface.co/prism-ml/Bonsai-8B-gguf/resolve/main/assets/frontier.svg
But you can imagine that e.g. Qwen3-1.7B at bf16 can easily be shrunk by 75 % by PTQ'ing to something like IQ4_XS, and it would move that point left near their 1-bit frontier line. It looks mostly like it might give an incremental improvement to model quantization, possibly is indeed the most memory-efficient way to do it. I mean, it is 1-bit logic, multiplication of 32 weights in their quantized form is now an XOR operation.
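That XOR trick can be made concrete: encode +1 as bit 1 and -1 as bit 0, and a ±1 dot product reduces to XOR plus a popcount. A minimal sketch with a hypothetical LSB-first encoding (not Prism's actual kernel):

```python
def binary_dot(a_bits, b_bits, n):
    # bit i of a_bits/b_bits encodes the i-th value: 1 -> +1, 0 -> -1.
    # Each product is +1 where the bits match and -1 where they differ,
    # so dot = n - 2 * popcount(a XOR b).
    mismatches = bin((a_bits ^ b_bits) & ((1 << n) - 1)).count("1")
    return n - 2 * mismatches

# a = [+1, -1, +1, +1] encoded LSB-first -> 0b1101
# b = [+1, +1, -1, +1] encoded LSB-first -> 0b1011
result = binary_dot(0b1101, 0b1011, 4)  # products: +1, -1, -1, +1 -> 0
```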
TylerDurdenFan@reddit
I was going to say 30 years ago CPUs weren't 32 bits yet, but, indeed they were. Damn I'm getting old.
Double_Cause4609@reddit
Tbh, they don't really need to. Per unit of silicon 1bit is faster than you'd think.
Like, if you have $100 of silicon, you'd expect 1bit to be ~16x as fast as FP16, but it's actually faster due to a few weird things about hardware scales.
So, if you only need 1/16th the price to run the model, as long as it's more than 1/16th as good as the FP16 model, you're still coming out ahead.
I find that usually 1bit methods are ~3/4 as good as the FP16 models when they're quantization aware, which still gives you more value for your money.
-dysangel-@reddit
sure I'm not saying that I don't want 1 bit models, I'm just saying it's odd to claim the quality is as nuanced as f16. I would definitely like to see some scaled up bit models, so that the model itself is as efficient as can be without needing quantisation.
EstarriolOfTheEast@reddit
If it crosses a certain quality threshold/noise floor then because it takes less memory and is so fast, you can match or beat the fp16 by simply drawing more samples. The caveat as usual is this only works for problems which can be either reliably verified or aggregated automatically.
-dysangel-@reddit
well, I think a better comparison would be a 1-bit model of the same size as an 8B f16 model. At the moment they're saying that an 8-billion-param 1-bit model can match an 8-billion-param 16-bit model. Maybe on some tasks it can, but there is simply less capacity in that model. I think it would be fairer to compare a 128-billion-param 1-bit model with an 8-billion-param 16-bit model, as they both contain the same number of bits.
EstarriolOfTheEast@reddit
My point is actually independent of that. Because what the LLM encodes are conditioned probabilities, its lack of capacity can be made up for by sampling more, as long as the computed probabilities are already directionally close enough. This is similar to how a say, 16x sampled 4B can match or exceed a 1 sample 8B chain, depending on the task.
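The draw-more-samples idea is essentially self-consistency: sample several answers and aggregate, e.g. by majority vote. A toy sketch (hypothetical data, no real model involved):

```python
from collections import Counter

def majority_vote(answers):
    # Aggregate independent samples; the most common answer wins.
    return Counter(answers).most_common(1)[0][0]

# Toy example: a weak model that is right only 6 times out of 10 per
# sample still yields the correct answer after voting over 10 samples.
samples = ["42"] * 6 + ["41"] * 3 + ["40"]
best = majority_vote(samples)  # "42"
```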
-dysangel-@reddit
Sure I understand that doing pass@n can be powerful if the model is fast enough. I think intelligence density is about improving pass@1 though. It's like the old saying:
Amateurs practice till they get it right; professionals practice till they can’t get it wrong.
EstarriolOfTheEast@reddit
It's actually the opposite, trying to overoptimize for pass at 1 is what leads to entropy loss, uncalibrated uncertainty, reduced creativity and a tendency for "slop". As it's a distribution, what we actually want is to sample from the correct parts of the space (the thin shell or region away from the mode where most of the probability mass lives) and to draw enough samples to get a quality answer to our query in expectation. That expectation is better approximated by drawing more samples as is done in self-consistency for example. That and smarter samples (which the field except for a few of us largely gave up on to focus on agents) is what maxes out intelligence quality.
-dysangel-@reddit
I guess it really depends what type of work you're doing. For pure logic, fixing bugs, coding etc, you should be able to aim for pass@1 being 100%. For creativity, design work and thinking outside the box, a distribution is great.
EstarriolOfTheEast@reddit
For fixing bugs and coding, unless you're only doing completely unoriginal work with all bugs common and well documented, then the ability to think outside the box is important. Distribution quality will also improve writing, explanations and produce higher quality and richer reasoning chains. If you optimize for pass@1 you instead significantly reduce the intelligence of the model (by reducing model entropy you also damage its ability to sample the hardest things it learned). And there's much more to this than just pass@n as we can write samplers to better extract model intelligence.
The only reason people in the open community want high pass@1 is because of our hardware limitations. Closed models with pro tiers which draw more samples have to get this right since they need their higher sample count tiers to not be degenerate due to too low entropy, they can also afford the hardware.
the__storm@reddit
They're claiming 5-9x speedup vs fp16 version of their own model in the linked paper. In what scenario would you expect more than 16x speedup?
Double_Cause4609@reddit
I was making an information theoretic argument per unit of silicon area and theoretical silicon efficiency. They were making a practical argument when running their quants on existing hardware. Both claims can be true.
the__storm@reddit
I do not dispute it. Would you be willing to tell us more about how the greater speedup can theoretically be achieved, or link to similar? I couldn't find anything with some quick googling.
Double_Cause4609@reddit
Well, I'm not really sure if that's something you need to google. You can reason about it from first principles.
Search up how many transistors it is to do an FP16 MAC operation. Then search up how many transistors it is to do a binary add / subtract.
It's not even in the same league.
You can do binary operations with extremely cheap operators when you're designing the transistor layout to the operation.
the__storm@reddit
I see what you're saying, but at least for localllama (single data) purposes you're still bandwidth constrained. Although I guess if you're designing custom silicon you can then afford to reallocate die space from arithmetic to memory and come out ahead. Interesting line of exploration to be sure.
DangerousSetOfBewbs@reddit
They won’t ever. As someone who has created LLMs from scratch until my eyes bled dry (pruning, selective graph pruning, quantization, purposefully building small models, shrinking larger models, etc.):
There are only so many areas you can cram data into. And these just can’t hold a ton.
Now, are these models great for on-device use with no GPU and very limited RAM/CPU? Yes. But their intelligence is greatly lacking. They can be effective in very small areas, but the reasoning is dumb. They essentially become a yes-or-no gate.
heliosythic@reddit
Honestly I'd love to see more models that somehow mostly cram in language understanding rather than knowledge and use RAG instead (vector db and/or internet search) for knowledge. But language understanding + knowledge are kinda a chicken and egg problem.
SkyFeistyLlama8@reddit
Great as a classifier.
jusio@reddit
Glad to see that there is movement in this area. I haven't tried the model yet, but according to the charts in the whitepaper, converting a model to 1-bit really dumbs it down. In Table 6 they list all of the Bonsai models vs all other models, and Bonsai 8B has a lower score than Qwen3 4B.
And I guess if we quantize Qwen3 4B to 4 bits, it will have a very similar size and performance compared to Bonsai 8B 1-bit.
Table 6 for reference from: https://github.com/PrismML-Eng/Bonsai-demo/blob/main/1-bit-bonsai-8b-whitepaper.pdf
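That size comparison can be sanity-checked on the back of an envelope (weights only; real GGUF files add embeddings, norm layers and metadata, and the 4.5 bits/weight figure for a typical 4-bit group quant is my assumption):

```python
GIB = 2**30  # bytes per GiB

# 4B params at ~4.5 bits/weight (4-bit group quants also store per-group scales)
qwen3_4b_q4 = 4e9 * 4.5 / 8 / GIB   # about 2.1 GiB

# 8B params at 1.125 bits/weight (1 sign bit + a shared FP16 scale per 128)
bonsai_8b = 8e9 * 1.125 / 8 / GIB   # about 1.05 GiB

print(f"Qwen3-4B @ ~Q4: {qwen3_4b_q4:.2f} GiB")
print(f"Bonsai-8B @ 1.125 bit: {bonsai_8b:.2f} GiB")
```

By this rough count the 1-bit 8B lands close to the ~1.07 GiB GGUF reported in the benchmark table earlier in the thread, at about half the size of a Q4 4B.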
Addyad@reddit
I managed to make this model work in LM studio. the following are the details.
Since the Prism models are 1-bit quantized, they need their own version of llama.cpp, which we will use to replace the current LM Studio llama.cpp (thanks to this redditor's comment).
Ctrl + Shift + R and check the current runtime version of CUDA. The current runtime as of April 2, 2026 is 2.8.0.
Download llama-prism-b1-1179bfc-bin-win-cuda-12.4-x64.zip.
Go to %UserProfile%\.lmstudio\extensions\backends\llama.cpp-win-x86_64-nvidia-cuda12-avx2-2.8.0 (this is the current runtime location as of April 2, 2026).
Copy the .dll files from the llama-prism-b1-1179bfc-bin-win-cuda-12.4-x64 zip into that folder.
power97992@reddit
Now make a glm 5 1 bit version and a minimax 2.7 1 bit version and a qwen 3.5 27b 1bit version
_-Nightwalker-_@reddit
When I tried to build it with CUDA it just ramped up my memory to 100% and crashed
Emotional-Ad5025@reddit
I did the same flow without noticing it might be specific for cuda on my m1 pro, after building and running it, it went to 100% too
_-Nightwalker-_@reddit
You are correct. I got help from ChatGPT and modified the demo files to suit my 1050 Ti, and the 8B model generates 18 tps; it's fast. But I noticed the tps degrades slowly as the total tokens go up.
Legitimate-Pumpkin@reddit
I was waiting for this since I saw the research… 3 years ago? Let’s see how it goes!
9r4n4y@reddit
Which research, can you give me the link?
IrisColt@reddit
Bi-bitNets?
9r4n4y@reddit
Nah, bitnet was like 1 year back not 3 years
IrisColt@reddit
https://arxiv.org/abs/2310.11453 heh
IrisColt@reddit
2 years, 5 months ago, heh
Legitimate-Pumpkin@reddit
I couldn’t remember
Poki6041@reddit
So for people here, it's neither a pure {-1, 0, +1} representation nor strictly {-1, +1} in the usual sense; it's something slightly different.
In a standard model, weights are real-valued numbers, for example: 0.753453, -1.1757562, 0.005435344, etc. These are typically stored in FP16, meaning each weight uses 16 bits. So for 128 weights: 128 × 16 bits = 2048 bits.
In Bonsai, weights are quantized differently. Each weight is approximated as +scale or -scale. Instead of storing full-precision values, we store:
- the sign of each weight (positive or negative)
- a shared scale factor for a group of weights
For a group of 128 weights:
- each weight is represented by 1 bit (its sign: + or -) → 128 weights = 128 bits
- instead of 128 different values, we store one shared scale value in FP16 → 1 scale = 16 bits
So total storage becomes:
128 bits (signs)
+ 16 bits (scale)
= 144 bits
Per weight: 144 / 128 = 1.125 bits
👉 That's why Bonsai is effectively a 1.125-bit model, compared to 16 bits per weight in FP16: 144 bits vs 2048 bits.
Finally, the scale is not arbitrary: it is chosen to minimize the approximation error between the original weights and their quantized version. In practice (simplified case), the optimal scale is the average of the absolute values of the weights.
So the idea is:
- keep only the direction (sign) of each weight
- approximate its magnitude using a shared scale
- drastically reduce memory while preserving overall structure
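The sign-plus-scale scheme above can be sketched in a few lines of NumPy (an illustrative toy with my own function names, not Prism's actual code):

```python
import numpy as np

def quantize_group(w):
    """Quantize one group of weights to 1 sign bit each plus a shared FP16 scale."""
    scale = np.float16(np.mean(np.abs(w)))  # optimal scale: mean absolute value
    signs = np.where(w >= 0, 1.0, -1.0)     # 1 bit of information per weight
    return signs, scale

def dequantize_group(signs, scale):
    """Reconstruct the approximate weights: each is +scale or -scale."""
    return signs * np.float32(scale)

rng = np.random.default_rng(0)
w = rng.normal(size=128).astype(np.float32)  # one 128-weight group
signs, scale = quantize_group(w)
w_hat = dequantize_group(signs, scale)

# Storage: 128 sign bits + one 16-bit scale = 144 bits -> 1.125 bits/weight
bits_per_weight = (128 + 16) / 128
```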
promobest247@reddit
it works & is very fast on my laptop
spartanOrk@reddit
What could you do with it though? Can it code at all? Can it read a document and analyze it well?
Internal_Newt_7343@reddit
Looks really interesting! But I couldn't get it to load in LM Studio:
```
🥲 Failed to load the model
Error loading model.
(Exit code: 18446744072635810000). Unknown error. Try a different model and/or config.
```
Any ideas?
Iory1998@reddit
Ofc it won't work. LM Studio uses Llama.cpp, and usually there is a lag in implementation. You have to wait 1-2 weeks.
nemuro87@reddit
so it usually takes 1-2 weeks for LM studio to catch up?
would Ollama or something else catch up faster?
Iory1998@reddit
I don't use Ollama, so I can't provide you with suggestions. The LM Studio team takes some time to update llama.cpp, as they make sure it works fine for all users.
drFennec@reddit
It won't work, you'll need to use their fork of llama.cpp which has support for 1bit quants.
Internal_Newt_7343@reddit
lol, how should I know :D. On their Hugging Face page under "Use this model" the LM Studio option was there, so that's why I tried LM Studio! But thanks for clarifying this, guys.
Stepfunction@reddit
This feels like marketing hype bullshit. No information provided about the training.
Murgatroyd314@reddit
This feels like an April fool’s joke, but apparently they posted it yesterday.
JsThiago5@reddit
What is this underground https://github.com/PrismML-Eng/llama.cpp repo? After what happened with LiteLLM I do not trust running this.
Interpause@reddit
the best way to do it is to squash the fork changes into a single git diff, ask your favourite AI to double-check it's safe if you can't read code, then apply it on top of mainline llama.cpp and build it yourself
tarruda@reddit
Would love to see that applied to the new Qwen 3.5 models. If the intelligence density scales, that would mean the RAM requirements would drastically reduce:
-dysangel-@reddit
Definitely want to see 27B or larger with this method. Bonsai feels impressive for its size, but it's not able to produce working code yet.
35B would be craaaazy fast..
valuat@reddit
What day is today, again?
Cool-Chemical-5629@reddit
Great, we still have yet to see someone make that notoriously praised 200B 1bit model that can supposedly run on a regular home computer.
Shifty_13@reddit
I guess FP4 is not the limit.
We will get FP1 acceleration in the future.
-dysangel-@reddit
fp1? :P
Guilty-Science9966@reddit
Its just all 0s
green-coder@reddit
always has been
Sioluishere@reddit
just make it int at this point
eat_my_ass_n_balls@reddit
Wait till this mf hears about 0 bit quantization
pmp22@reddit
My P40 is ready for 0-bit quants
m0j0m0j@reddit
This is my quant
Then-Salary-6859@reddit
This is where I heal my weights
thrownawaymane@reddit
How dare you post about my brain’s architecture
last_llm_standing@reddit
My intel celron dekstop from 2007 performs better than P40
asssuber@reddit
Where exactly is your floating point with just 1 bit?
drFennec@reddit
A 1 bit FP is just a point.
Shifty_13@reddit
I saw this post very early and needed to write something stupid to test the theory that new comments get all the upvotes. And well, I was right about that
But yeah, 1 bit is called boolean.
wonderwind271@reddit
If my understanding is correct, 4-bit quantization is not FP4. You are not literally representing a floating number in 4 bits in regular sense
hazmatika@reddit
Am I the only one that thought this might be an April Fool’s joke?
TopChard1274@reddit
The Locally AI app has the Bonsai model ready to download and try, so I tried it on my 8GB-RAM M1 iPad Pro. It's blazingly fast, and a miracle that something like this even works. But for my use case (understanding a creative text) it seems dumb as a rock, which I guess is tied to the model itself and the 1-bit compression.
Stunning_Mast2001@reddit
We needs a hybrid 1 bit diffusion mamba multimodal models with turbo quant caches
ganonfirehouse420@reddit
mamba mamba mamba
It's sure gonna be fast.
pulse77@reddit
From whitepaper (https://github.com/PrismML-Eng/Bonsai-demo/blob/main/1-bit-bonsai-8b-whitepaper.pdf):
"1-bit Bonsai 8B is built from Qwen3-8B"
So this seems to be a transformation of the Qwen model. I wonder if the same transformation may be applied to Qwen3.5-27B or even larger MoE models...
Marcuss2@reddit
Went through the paper; their methodology for measuring knowledge density is somewhat questionable.
For example, we already quantize models to 4 bits, yet they almost always take the full bf16 weights for the other models.
Also, they measure intelligence per GB, but intelligence scales logarithmically, not linearly.
tarruda@reddit
Even if they were comparing with current SOTA 4-bit quantizations, it is still impressive if the 1-bit can deliver what they promise.
Due_Net_3342@reddit
cant wait for the 0 bit version
ketosoy@reddit
Rand_between() rides again!
Shiny-Squirtle@reddit
Can't wait for the 1 qubit version
charmander_cha@reddit
Proprietary? If it were made open source, it would cause the AI bubble to burst.
bolmer@reddit
Open weight. It's not really a revolution. Even Unsloth (a mainstream local LLM org) has 1-bit quants of Qwen 3
9r4n4y@reddit
Yeah, Unsloth also has 1-bit quants, but those quants are very unstable and very poor in quality, while this LLM is very high quality, as we can see in the benchmarks.
AnonymousTransfem@reddit
tried Bonzai 8B gguf on their fork, prompt: "hii how are you !!", output was this
cafedude@reddit
Similar (and it's dog slow even though I built their llama.cpp fork with AVX2 enabled):
Inside-Spot4136@reddit
Can you try building it like they do in the Colab notebook? I tested that, and it is slow, but the output is OK. I even asked it to summarize some PDFs of papers and I liked the summaries.
https://www.reddit.com/r/LocalLLaMA/comments/1s90wo4/comment/odml25w/?utm_source=share&utm_medium=web3x&utm_name=web3xcss&utm_term=1&utm_content=share_button
cafedude@reddit
ah, there's the rub: before you build you have to check out their branch: git checkout prism. But you only see that in the Colab code.
Far_Composer_5714@reddit
Just yesterday I was looking at PaddleOCR, and it has a very similar installation, requiring pulling a specific branch in order to install properly.
Firepal64@reddit
Fork lesson: when you clone a repo for the first time, unless you specify a branch, it'll use the default branch. But most forks keep the default branch unchanged and work in new branches instead.
Use `git branch` to list the branches in a cloned repo.
Opposite_Parsley677@reddit
It worked for me using the instructions on the Github and was fast: https://github.com/PrismML-Eng/Bonsai-demo
Initially didn't realize they are working off their own llama.cpp fork - won't work without it
./scripts/run_llama.sh -p "How to grow a Bonsai tree?"
> How to grow a Bonsai tree?
Growing a **Bonsai tree** is a rewarding and artistic endeavor that requires patience, care, and attention to detail. Bonsai is a Japanese art form of cultivating small, carefully pruned trees in a pot, often representing a larger tree in a miniature form. Here's a comprehensive guide to growing a bonsai tree:
Inside-Spot4136@reddit
I tried their Colab (basically just ran all the cells). The 10th cell gave me url in output, which I opened, and it showed a chat interface. I entered the same prompt, but the response I got was:
```Hello! I'm an AI assistant, so I don't have feelings or emotions like humans, but I'm here to help with any questions or tasks you have. How can I assist you today?```
hideo_kuze_@reddit
using the wrong parameters?
either you're doing something wrong or this model is a scam
because the benchmarks look good https://huggingface.co/prism-ml/Bonsai-8B-gguf#benchmarks
cafedude@reddit
I'm getting similar results using their llama.cpp fork. It's pretty brain-dead. And slow even though I built for CPU with AVX2 enabled.
Bubbly-Staff-9452@reddit
About what I'd expect lol. In theory this has the potential to be amazing for something like sorting or classification on low-power devices, but with quants this low I've never had a good experience, so I just move to a smaller model at a higher quant. I've settled on 4B models at 4-bit quant as the smallest usable models for my fine-tuned scenarios.
w8cycle@reddit
What is a 1-bit model? How is 1 bit going to be enough?
MonkeyOnFire120@reddit
It can only answer yes or no questions
dark-light92@reddit
Chain enough yes/no and you can get pretty complex behavior. Fun fact: All modern computing is built on top of yes/no.
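Less jokingly: "1-bit" usually means each weight is stored as just its sign, with a shared floating-point scale per tensor or group. A minimal sketch of BitNet-style absmean binarization (illustrative; PrismML haven't published their actual method):

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(256, 256)).astype(np.float32)  # full-precision weight matrix

# Binarize: keep only the sign of each weight (1 bit) plus one shared fp scale.
alpha = np.mean(np.abs(W))   # shared scale via the "absmean" trick
W_bin = np.sign(W)           # entries in {-1, +1} (0 almost never occurs)
W_hat = alpha * W_bin        # dequantized 1-bit approximation of W

x = rng.normal(size=256).astype(np.float32)
y_full = W @ x
y_1bit = W_hat @ x
# The 1-bit output stays strongly correlated with the full-precision output.
corr = np.corrcoef(y_full, y_1bit)[0, 1]
print(round(float(corr), 2))
```

One fp32 scale per 256×256 block means storage is essentially 1 bit per weight; real schemes use per-group scales (like the `g128` in the GGUF name above) to cut the approximation error further.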
Ok_Reference_1100@reddit
What’s the quality tradeoff?
M0ULINIER@reddit
As with all quants: knowledge and edge cases suffer. Will have to see if it's still good generally tho
Adventurous-Okra-407@reddit
hmm... exact same parameters and chat template as Qwen. Looks sus to me.
M0ULINIER@reddit
It's just a Qwen3 model under the hood
redonculous@reddit
https://youtu.be/LRq_SAuQDec
denoflore_ai_guy@reddit
What they don’t say is the whitepaper is deliberately vague on the actual compression method - they call it “proprietary Caltech IP” and “mathematically grounded advances” without publishing the technique. So you can use the models but you can’t reproduce the compression pipeline. No native 1-bit hardware exists yet, so the speed gains come purely from software kernel optimizations on standard GPUs.
Alarming-Ad8154@reddit
Looks like they just quantized Qwen3 8B to me, but it would def require some innovation in quant-aware finetuning? Or just a lot of it?
alexchen_gamer@reddit
The memory footprint angle is what caught my eye here. Been running a local AI companion setup and the whisper + LLM stack already eats through RAM fast. A solid 8B at ~1GB would genuinely change what's possible on a mid-range laptop without a dedicated GPU. The conversational task performance is the real question though - benchmarks always look better than real-world dialogue quality in my experience.
Worried_Drama151@reddit
Way too fragile at 1-bit; abstract things make it go bananas
AppealSame4367@reddit
wtf! wow
Cinci_Socialist@reddit
This is the way.
nicholas_the_furious@reddit
Gimme a big one.
the__storm@reddit
It'd be nice if they compared to some quantized models, or at least something with natively lower precision weights like GPT-OSS. Running all the competition at fp16 is a bit disingenuous when it's well known that fp16 models retain a lot of their capability down to 5-6 bpw and are still usable even at 3-4.
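For scale, here's the raw weight storage at different bits per weight for an 8.19B-parameter model (back-of-the-envelope only; real GGUF files run slightly larger because of per-group scales and higher-precision embeddings):

```python
PARAMS = 8.19e9  # Qwen3-8B parameter count as reported by llama-bench

def model_gib(params, bpw):
    """Raw weight storage in GiB at a given bits-per-weight."""
    return params * bpw / 8 / 2**30

for bpw in (16, 8, 5, 4, 1):
    print(f"{bpw:>2} bpw: {model_gib(PARAMS, bpw):5.2f} GiB")
```

At 1 bpw the raw weights come to roughly 0.95 GiB, consistent with the ~1.07 GiB GGUF reported above once group scales are added, versus ~15.3 GiB at fp16 and ~3.8 GiB at a typical 4-bit quant.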
Due_Net_3342@reddit
so this is a fancy binary tree?
silentus8378@reddit
How much did it cost to make those 1-bit models?