PrismML — Announcing 1-bit Bonsai: The First Commercially Viable 1-bit LLMs
Posted by brown2green@reddit | LocalLLaMA | View on Reddit | 178 comments
rkbala@reddit
I have an edge device (AMD Ryzen 7 AI laptop). Will it work? What I see in their llama.cpp fork is CUDA only. I am a noob. Any suggestions pls
Disonantemus@reddit
https://old.reddit.com/r/LocalLLaMA/comments/1s90wo4/prismml_announcing_1bit_bonsai_the_first/ofdfrha/
Languages_Learner@reddit
Just found official development branch for cpu inference - PrismML-Eng/llama.cpp at q1-cpu
rkbala@reddit
Thanks a lot. Will check it out
Languages_Learner@reddit
Here's a llama.cpp fork that seems to fix the bug with CPU-only inference: philtomson/llama.cpp: LLM inference in C/C++ (a fork of the PrismML fork that enables CPU (incl. AVX2 and AVX512) and ROCm for AMD GPUs)
Languages_Learner@reddit
Here's the answer for your question - No cpu-only build · Issue #6 · PrismML-Eng/Bonsai-demo
Disonantemus@reddit
System
shockwaverc13@reddit
am i the only one who gets extremely slow CPU performance?
build/bin/llama-bench -m models/Bonsai-8B.gguf -r 1 -p 8 -n 8 --mmap 1
| model | size | params | backend | threads | mmap | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | ---: | --------------: | -------------------: |
| qwen3 8B Q1_0_g128 | 1.07 GiB | 8.19 B | CPU | 4 | 1 | pp8 | 0.36 ± 0.00 |
| qwen3 8B Q1_0_g128 | 1.07 GiB | 8.19 B | CPU | 4 | 1 | tg8 | 0.29 ± 0.00 |
build: 1179bfc82 (8194)
Disonantemus@reddit
Fun-Property-5964@reddit
Me too, I'm having the same problem. Have you found a solution for it?
shockwaverc13@reddit
there are other quants of it that seem to work the same and are faster (but are bigger in size) https://huggingface.co/lilyanatia/Bonsai-8B-requantized
or you can use forks of the prismml fork that implement AVX and Vulkan https://github.com/PrismML-Eng/llama.cpp/pulls
brown2green@reddit (OP)
From the X post:
Aaaaaaaaaeeeee@reddit
Is it a binary QAT (-1,+1), not ternary (-1,0,+1)?
brown2green@reddit (OP)
Just binary, it seems.
DistanceSolar1449@reddit
It’s probably 0/1 and not -1/1. I doubt you can make an LLM work without multiplying a lot of tensors by 0.
That’s still fucking insane. I’m mindblown that activations can be just binary and still work. Usually you NEED -1/0/1. BitNet, for example, is ternary (1.58-bit) and not 1-bit.
Party-Special-5177@reddit
It is both: the original BitNet paper was a binary model ("BitNet: Scaling 1-bit Transformers for Large Language Models", 2023 iirc).
They came back a year later and said the models improved significantly when made ternary (-1, 0, 1), to the point that they could compete with the models they were quanted from ("The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits", 2024).
SHOR-LM@reddit
Both "work", but the 0 state is required if you want sparsity in a model, as far as I know. That's why they changed it.
Party-Special-5177@reddit
It isn’t for sparsity, it improves the representational capacity of a bitlinear layer.
The rest of your comment makes little sense to me. The weights aren’t ‘forced to choose’; the optimizer optimizes a set of master weights in full precision, which bitnets then round to binary or ternary. That’s it.
SHOR-LM@reddit
I was conflating representational capacity with sparsity, and the deterministic rounding is the forced choice: "what does the weight mean", A or B? That's limited representation. If it were as simple as what you're saying, Microsoft wouldn't have backtracked on their decision.
Alarming-Ad8154@reddit
According to the whitepaper its -1/1… pretty insane it’s this good (or very benchmaxed??)
Alarming-Ad8154@reddit
It’s actually -1/1 scaled by a 16-bit scaling factor shared by 128 weights. Also since they don’t describe any of their training I am near certain it’s quantization or quantization + finetuning not base training…
HopePupal@reddit
yeah i noticed that too. they handwaved with "proprietary Caltech IP", which honestly is not surprising for a university spinoff, but looking for recent patents from people at Prism or Hassibi's lab, i immediately found
US-20250348715-A1: "Response-Adaptive Calibration for Post-Training Quantization of Large Language Models", Edalati, Ali et al.
kaibee@reddit
Plasmx@reddit
I guess nobody was crazy enough to write billions of if branches before. They could have invented modern AI models. /s
Top-Handle-5728@reddit
Where did they, or even BitNet, claim that the activations are binary? Isn't it more about the weights? Assuming strictly 0/1 weights, how do you make the signal negative to suppress the activation? If they are truly using 0/1 without a -1 'inhibit' state, they'd have to rely heavily on Biases or Normalization layers to shift the signal into the negative range, which technically means those higher precision 'escape hatches' still exist in the norm layers.
MaraPapewaio@reddit
BitNet b1.58 2B4T from Microsoft proved this is a highly viable option compared to full-precision FP16.
The 0 state changes everything. It's an "I don't know" option.
Moreover, Microsoft trained this model natively like this.
So maybe you, at Prism, could give it a try.
This may allow the model not to "make up" answers like its big brothers do.
Thanks for the share. I'm delighted to see this technology come to life.
Small is beautiful. Running on consumer hardware is the key to full integration.
Keep up the good work, and thanks for the open source release.
SHOR-LM@reddit
They dropped the zero state and went pure binary? How is that not a MAJOR step backward from the BitNet b1.58 architecture? The zero is supposed to give sparsity. Does this mean every connection would be forced to have an "opinion"? I wonder about the implications. This can't be the whole story. How can it isolate a "this thing doesn't matter" situation in a user's prompt?
CryptoUsher@reddit
1-bit models sound wild, but i'm curious how they handle edge cases without falling off a cliff in accuracy.
have you tested on tasks that require nuanced reasoning, or does the compression favor speed over depth?
SHOR-LM@reddit
It lacks overall nuance but it's actually really good for what they've done to it. I'm actually surprised
l33tkvlthax42069@reddit
Given that you posted this when there were less than 20 downloads, I'll assume you are part of the team? Impressed with the llama cpp performance and output quality. MLX auto install did not work on Sequoia, but will try when I have more than 2 minutes later...
Hoping that batching is viable, super interested to see how this develops!
brown2green@reddit (OP)
No, I just saw the announcement on X and posted it here.
cafedude@reddit
1-bit models... wouldn't these be well-suited for running on an FPGA?
ObviousLavishness537@reddit
I'm not very familiar with LLMs either, but I was thinking the same thing: maybe just passing the signals through FPGA flip-flops would be enough.
X3liteninjaX@reddit
We got LLMs made of booleans now /s
cafedude@reddit
I mean, if they're 1-bit end-to-end as they say then how are they not boolean? Could these models be converted to logic gate networks somehow? (something like difflogic: https://github.com/Felix-Petersen/difflogic )
Icy_Butterscotch6661@reddit
Isn't a bool 8 bits in most languages?
AdventurousFly4909@reddit
Bro
Icy_Butterscotch6661@reddit
idk i mean a boolean variable takes up a whole byte (even if only one bit inside is flipped) in all languages i've worked with.
guess it doesn't really matter for the discussion at hand so idk why I even said that
AdventurousFly4909@reddit
Yeah, I get where the confusion comes from. After all, the lowest unit the code can operate on is a byte. All instructions (I think; maybe there exists a super weird instruction?) operate at the byte level. Most likely multiple weights are packed into one byte.
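As a toy illustration of that packing (my own encoding choice, not the actual GGUF layout), eight ±1 weights fit in one byte:

```python
import numpy as np

def pack_signs(signs):
    # Map -1 -> 0 and +1 -> 1, then pack 8 sign bits per byte.
    bits = (np.asarray(signs) > 0).astype(np.uint8)
    return np.packbits(bits)

def unpack_signs(packed, n):
    # Unpack the bits and map them back to -1/+1 values.
    bits = np.unpackbits(packed)[:n]
    return np.where(bits == 1, 1, -1).astype(np.int8)

signs = [1, -1, -1, 1, 1, 1, -1, 1]
packed = pack_signs(signs)          # 8 weights stored in a single byte
restored = unpack_signs(packed, 8)  # round-trips back to the signs
```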
VolkoTheWorst@reddit
I'm currently working on an implementation of an AI network on FPGA
Several-Tax31@reddit
What is the max parameter count model a FPGA can run? 100B? 1B? Less?
VolkoTheWorst@reddit
Technically nothing prevents you from running a 100B or more. It's just gonna probably require a custom made insanely big/expensive FPGA and run very slowly
VolkoTheWorst@reddit
Depends on which FPGA you have. My work is on a very small AI niche, we will have like 1k neurons so not a lot. And we're already limited by the BRAM size. But we are at the start of the project so we might find workarounds. We are using 7000s FPGAs
Leo_hofstadter@reddit
I have been thinking about when we will have FPGAs made for LLMs (expensive to develop and earn ROI on), but something close to that is what the company Groq is doing (at least that's my vague understanding): they sell crazy fast inference, as their chips are tuned for inference only!
Plasmx@reddit
There is an ASIC Llama chip, Taalas HC1. It’s just a very specific use case since there is no way you can change the model.
Leo_hofstadter@reddit
Interesting. I read that they are already in business with a lot of clients somehow. Wouldn't it be a shame that you can't upgrade the LLM? Almost like throwaway/burner phones. What's the concept here that lets them do it profitably?
Plasmx@reddit
They are insanely fast and energy efficient. It is like 17k TPS at 700 W. There are use cases where it is feasible to choose a model and stay with it for a while when inference is cheap. Whether that model should be Llama 3.1 is another question, but I think they mainly wanted a proof of concept.
randylush@reddit
There are tons of chips that are made specifically for ML inference. The device you’re using to read this comment probably has an ML accelerator built in. So far FPGAs have only been useful for prototyping; it’s always more efficient to run a workload on a bespoke chip than an FPGA.
CaptBrick@reddit
Actually it’s always been that way, it’s all 1 and 0 at hardware level
Lucky-Necessary-8382@reddit
Before GTA IV
INtuitiveTJop@reddit
Hey, isn’t this a lot easier to place on an asic with the fact that it’s all 0s and 1s?
Legitimate-Handle390@reddit
Taalas is on a mission. I'm waiting for Cerebras to acquire Taalas. Imagine a wafer-scale ASIC where memory = compute.
INtuitiveTJop@reddit
I know, this is the future. I cannot yet comprehend a Claude level asic running at 15k tokens a second
cafedude@reddit
...or an FPGA.
fotcorn@reddit
Also works on ROCM.
Getting roughly 150 t/s generation on my 9070 XT for the 8B model.
Output is hard to judge, but seeing 1bit working at all is already impressive, especially because it sounds like it was quantized from Qwen3.5, and not retrained from scratch like the BitNet 1.58 models
alpay_kasal@reddit
Hey u/fotcorn was the output from ROCM garbage output like someone reported on CPU? or did it look somewhat useful?
fotcorn@reddit
No, it worked fine on GPU, both 1.7B and 8B. Not very intelligent/knowledgeable, but that is expected.
CPU took forever to load and then only produced garbage output. From reading the PR in llama.cpp, it was only tested on ARM CPUs, so not surprising it's broken on x86.
alpay_kasal@reddit
That's amazing to hear!!! Thank you. One of the guys on the PrismML Discord said they only implemented a CUDA backend, so they'll be surprised to hear it works. Thx again for the speedy reply.
OkSun5433@reddit
which llama.cpp build and model did you use? the model won't load for me using windows HIP build
fotcorn@reddit
They have their own fork: https://github.com/PrismML-Eng/llama.cpp
They say only cuda/metal support, but HIP build worked just fine. Using ROCM 7.12 preview.
OkSun5433@reddit
thanks
madtopo@reddit
Do you have it in you to test the setup against a harness like opencode, pi or charm? It'd be nice to know how it performs on agentic coding tasks
lemon07r@reddit
There is no 8b qwen 3.5 model. it's a qwen 3 model.
Worried_Drama151@reddit
Thx bro, we are all pretty dense here, so people couldn't infer that maybe he meant 9B and fat-fingered it, or meant 8B on Qwen 3
lemon07r@reddit
type shit brother
Interpause@reddit
gimme a while, I'm gonna squash their llama.cpp changes on top of llama.cpp and see if it really works, cuz that's real crazy if it does
-dysangel-@reddit
I just tried it on their mlx fork - it works.
zh1412@reddit
how did you install? conda then pip? I tried following their installation guide and it failed
-dysangel-@reddit
I had Claude set it up. Yeah I think pip wasn't working - in the end I had to download the xcode metal compiler and build their custom mlx
zh1412@reddit
got it thanks!
-dysangel-@reddit
I seriously doubt the performance is going to match 8b f16 models as they claim, but it's good to see 1 bit models making progress
audioen@reddit
They didn't bother to place regular models and normal PTQ methods in the same picture when they posted this: https://huggingface.co/prism-ml/Bonsai-8B-gguf/resolve/main/assets/frontier.svg
But you can imagine that e.g. Qwen3-1.7B at bf16 can easily be shrunk by 75 % by PTQ'ing to something like IQ4_XS, and it would move that point left near their 1-bit frontier line. It looks mostly like it might give an incremental improvement to model quantization, possibly is indeed the most memory-efficient way to do it. I mean, it is 1-bit logic, multiplication of 32 weights in their quantized form is now an XOR operation.
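That XOR trick can be made concrete: encode +1 as bit 1 and -1 as bit 0, and a ±1 dot product reduces to XOR plus a popcount. A minimal sketch with a hypothetical LSB-first encoding (not Prism's actual kernel):

```python
def binary_dot(a_bits, b_bits, n):
    # bit i of a_bits/b_bits encodes the i-th value: 1 -> +1, 0 -> -1.
    # Each product is +1 where the bits match and -1 where they differ,
    # so dot = n - 2 * popcount(a XOR b).
    mismatches = bin((a_bits ^ b_bits) & ((1 << n) - 1)).count("1")
    return n - 2 * mismatches

# a = [+1, -1, +1, +1] encoded LSB-first -> 0b1101
# b = [+1, +1, -1, +1] encoded LSB-first -> 0b1011
result = binary_dot(0b1101, 0b1011, 4)  # products: +1, -1, -1, +1 -> 0
```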
TylerDurdenFan@reddit
I was going to say 30 years ago CPUs weren't 32 bits yet, but, indeed they were. Damn I'm getting old.
Double_Cause4609@reddit
Tbh, they don't really need to. Per unit of silicon 1bit is faster than you'd think.
Like, if you have $100 of silicon, you'd expect 1bit to be ~16x as fast as FP16, but it's actually faster due to a few weird things about hardware scales.
So, if you only need 1/16th the price to run the model, as long as it's more than 1/16th as good as the FP16 model, you're still coming out ahead.
I find that usually 1bit methods are ~3/4 as good as the FP16 models when they're quantization aware, which still gives you more value for your money.
-dysangel-@reddit
sure I'm not saying that I don't want 1 bit models, I'm just saying it's odd to claim the quality is as nuanced as f16. I would definitely like to see some scaled up bit models, so that the model itself is as efficient as can be without needing quantisation.
EstarriolOfTheEast@reddit
If it crosses a certain quality threshold/noise floor then because it takes less memory and is so fast, you can match or beat the fp16 by simply drawing more samples. The caveat as usual is this only works for problems which can be either reliably verified or aggregated automatically.
-dysangel-@reddit
well, I think a better comparison would be a 1-bit model of the same size as an 8B f16 model. At the moment they're saying that an 8-billion-param 1-bit model can match an 8-billion-param 16-bit model. Maybe on some tasks it can, but there is simply less capacity in that model. I think it would be fairer to compare a 128-billion-param 1-bit model with an 8-billion-param 16-bit model, as they both contain the same number of bits.
EstarriolOfTheEast@reddit
My point is actually independent of that. Because what the LLM encodes are conditioned probabilities, its lack of capacity can be made up for by sampling more, as long as the computed probabilities are already directionally close enough. This is similar to how a say, 16x sampled 4B can match or exceed a 1 sample 8B chain, depending on the task.
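The draw-more-samples idea is essentially self-consistency: sample several answers and aggregate, e.g. by majority vote. A toy sketch (hypothetical data, no real model involved):

```python
from collections import Counter

def majority_vote(answers):
    # Aggregate independent samples; the most common answer wins.
    return Counter(answers).most_common(1)[0][0]

# Toy example: a weak model that is right only 6 times out of 10 per
# sample still yields the correct answer after voting over 10 samples.
samples = ["42"] * 6 + ["41"] * 3 + ["40"]
best = majority_vote(samples)  # "42"
```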
-dysangel-@reddit
Sure I understand that doing pass@n can be powerful if the model is fast enough. I think intelligence density is about improving pass@1 though. It's like the old saying:
Amateurs practice till they get it right; professionals practice till they can’t get it wrong.
EstarriolOfTheEast@reddit
It's actually the opposite, trying to overoptimize for pass at 1 is what leads to entropy loss, uncalibrated uncertainty, reduced creativity and a tendency for "slop". As it's a distribution, what we actually want is to sample from the correct parts of the space (the thin shell or region away from the mode where most of the probability mass lives) and to draw enough samples to get a quality answer to our query in expectation. That expectation is better approximated by drawing more samples as is done in self-consistency for example. That and smarter samples (which the field except for a few of us largely gave up on to focus on agents) is what maxes out intelligence quality.
-dysangel-@reddit
I guess it really depends what type of work you're doing. For pure logic, fixing bugs, coding etc, you should be able to aim for pass@1 being 100%. For creativity, design work and thinking outside the box, a distribution is great.
EstarriolOfTheEast@reddit
For fixing bugs and coding, unless you're only doing completely unoriginal work with all bugs common and well documented, then the ability to think outside the box is important. Distribution quality will also improve writing, explanations and produce higher quality and richer reasoning chains. If you optimize for pass@1 you instead significantly reduce the intelligence of the model (by reducing model entropy you also damage its ability to sample the hardest things it learned). And there's much more to this than just pass@n as we can write samplers to better extract model intelligence.
The only reason people in the open community want high pass@1 is because of our hardware limitations. Closed models with pro tiers which draw more samples have to get this right since they need their higher sample count tiers to not be degenerate due to too low entropy, they can also afford the hardware.
the__storm@reddit
They're claiming 5-9x speedup vs fp16 version of their own model in the linked paper. In what scenario would you expect more than 16x speedup?
Double_Cause4609@reddit
I was making an information theoretic argument per unit of silicon area and theoretical silicon efficiency. They were making a practical argument when running their quants on existing hardware. Both claims can be true.
the__storm@reddit
I do not dispute it. Would you be willing to tell us more about how the greater speedup can theoretically be achieved, or link to similar? I couldn't find anything with some quick googling.
Double_Cause4609@reddit
Well, I'm not really sure if that's something you need to google. You can reason about it from first principles.
Search up how many transistors it is to do an FP16 MAC operation. Then search up how many transistors it is to do a binary add / subtract.
It's not even in the same league.
You can do binary operations with extremely cheap operators when you're designing the transistor layout to the operation.
the__storm@reddit
I see what you're saying, but at least for localllama (single data) purposes you're still bandwidth constrained. Although I guess if you're designing custom silicon you can then afford to reallocate die space from arithmetic to memory and come out ahead. Interesting line of exploration to be sure.
DangerousSetOfBewbs@reddit
They won’t ever. As someone who has created LLMs from scratch until my eyes bled dry (pruning, selective graph pruning, quantization, purposefully building small models, shrinking larger models, etc.):
There are only so many areas you can cram data into. And these just can’t hold a ton.
Now, are these models great for on-device use with no GPU and very limited RAM/CPU? Yes. But their intelligence is greatly lacking. They can be effective in very small areas, but the reasoning is dumb. They essentially become a yes-or-no gate.
heliosythic@reddit
Honestly I'd love to see more models that somehow mostly cram in language understanding rather than knowledge and use RAG instead (vector db and/or internet search) for knowledge. But language understanding + knowledge are kinda a chicken and egg problem.
SkyFeistyLlama8@reddit
Great as a classifier.
jusio@reddit
Glad to see that there is movement in this area. I haven't tried the model yet, but according to the charts in the whitepaper, converting a model to 1-bit really dumbs it down. In Table 6 they list all of the Bonsai models vs all other models, and Bonsai 8B has a lower score than Qwen3 4B.
And I guess if we quantize Qwen3 4B to 4 bits, it will have a very similar size and performance compared to Bonsai 8B 1-bit.
Table 6 for reference from: https://github.com/PrismML-Eng/Bonsai-demo/blob/main/1-bit-bonsai-8b-whitepaper.pdf
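That size comparison can be sanity-checked on the back of an envelope (weights only; real GGUF files add embeddings, norm layers and metadata, and the 4.5 bits/weight figure for a typical 4-bit group quant is my assumption):

```python
GIB = 2**30  # bytes per GiB

# 4B params at ~4.5 bits/weight (4-bit group quants also store per-group scales)
qwen3_4b_q4 = 4e9 * 4.5 / 8 / GIB   # about 2.1 GiB

# 8B params at 1.125 bits/weight (1 sign bit + a shared FP16 scale per 128)
bonsai_8b = 8e9 * 1.125 / 8 / GIB   # about 1.05 GiB

print(f"Qwen3-4B @ ~Q4: {qwen3_4b_q4:.2f} GiB")
print(f"Bonsai-8B @ 1.125 bit: {bonsai_8b:.2f} GiB")
```

By this rough count the 1-bit 8B lands close to the ~1.07 GiB GGUF reported in the benchmark table earlier in the thread, at about half the size of a Q4 4B.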
Addyad@reddit
I managed to make this model work in LM studio. the following are the details.
Since the Prism models are 1-bit quantized, they need their own version of llama.cpp, which we will use to replace the current LM Studio llama.cpp (thanks to this redditor's comment).
Ctrl + Shift + R and check the current runtime version of CUDA. The current runtime as of April 2, 2026 is 2.8.0.
Download llama-prism-b1-1179bfc-bin-win-cuda-12.4-x64.zip.
Go to %UserProfile%\.lmstudio\extensions\backends\llama.cpp-win-x86_64-nvidia-cuda12-avx2-2.8.0 (this is the current runtime location as of April 2, 2026).
Copy the .dll files from the llama-prism-b1-1179bfc-bin-win-cuda-12.4-x64 zip into that folder.
power97992@reddit
Now make a glm 5 1 bit version and a minimax 2.7 1 bit version and a qwen 3.5 27b 1bit version
_-Nightwalker-_@reddit
When I tried to build it with CUDA it just ramped up my memory to 100% and crashed
Emotional-Ad5025@reddit
I did the same flow without noticing it might be specific for cuda on my m1 pro, after building and running it, it went to 100% too
_-Nightwalker-_@reddit
You are correct. I got help from ChatGPT and modified the demo files to suit my 1050 Ti, and the 8B model generates 18 tps; it's fast. But I noticed the tps degrades slowly as the total tokens go up.
Legitimate-Pumpkin@reddit
I was waiting for this since I saw the research… 3 years ago? Let’s see how it goes!
9r4n4y@reddit
Which research, can you give me the link?
IrisColt@reddit
Bi-bitNets?
9r4n4y@reddit
Nah, bitnet was like 1 year back not 3 years
IrisColt@reddit
https://arxiv.org/abs/2310.11453 heh
IrisColt@reddit
2 years, 5 months ago, heh
Legitimate-Pumpkin@reddit
I couldn’t remember
Poki6041@reddit
So for people here, it's neither a pure {-1, 0, +1} representation nor strictly {-1, +1} in the usual sense; it's something slightly different.
In a standard model, weights are real-valued numbers, for example: 0.753453, -1.1757562, 0.005435344, etc. These are typically stored in FP16, meaning each weight uses 16 bits. So for 128 weights: 128 × 16 bits = 2048 bits.
In Bonsai, weights are quantized differently. Each weight is approximated as +scale or -scale. Instead of storing full-precision values, we store:
- the sign of each weight (positive or negative)
- a shared scale factor for a group of weights
For a group of 128 weights:
- each weight is represented by 1 bit (its sign: + or -) → 128 weights = 128 bits
- instead of 128 different values, we store one shared scale value in FP16 → 1 scale = 16 bits
So total storage becomes:
128 bits (signs)
+ 16 bits (scale)
= 144 bits
Per weight: 144 / 128 = 1.125 bits
👉 That's why Bonsai is effectively a 1.125-bit model, compared to 16 bits per weight in FP16: 144 bits vs 2048 bits.
Finally, the scale is not arbitrary: it is chosen to minimize the approximation error between the original weights and their quantized version. In practice (simplified case), the optimal scale is the average of the absolute values of the weights.
So the idea is:
- keep only the direction (sign) of each weight
- approximate its magnitude using a shared scale
- drastically reduce memory while preserving overall structure
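The sign-plus-scale scheme above can be sketched in a few lines of NumPy (an illustrative toy with my own function names, not Prism's actual code):

```python
import numpy as np

def quantize_group(w):
    """Quantize one group of weights to 1 sign bit each plus a shared FP16 scale."""
    scale = np.float16(np.mean(np.abs(w)))  # optimal scale: mean absolute value
    signs = np.where(w >= 0, 1.0, -1.0)     # 1 bit of information per weight
    return signs, scale

def dequantize_group(signs, scale):
    """Reconstruct the approximate weights: each is +scale or -scale."""
    return signs * np.float32(scale)

rng = np.random.default_rng(0)
w = rng.normal(size=128).astype(np.float32)  # one 128-weight group
signs, scale = quantize_group(w)
w_hat = dequantize_group(signs, scale)

# Storage: 128 sign bits + one 16-bit scale = 144 bits -> 1.125 bits/weight
bits_per_weight = (128 + 16) / 128
```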
promobest247@reddit
it works & is very fast on my laptop
spartanOrk@reddit
What could you do with it though? Can it code at all? Can it read a document and analyze it well?
Internal_Newt_7343@reddit
Looks really interesting! But I couldn't get it to load in LM Studio:
```
🥲 Failed to load the model
Error loading model.
(Exit code: 18446744072635810000). Unknown error. Try a different model and/or config.
```
Any ideas?
Iory1998@reddit
Ofc it won't work. LM Studio uses Llama.cpp, and usually there is a lag in implementation. You have to wait 1-2 weeks.
nemuro87@reddit
so it usually takes 1-2 weeks for LM studio to catch up?
would Ollama or something else catch up faster?
Iory1998@reddit
I don't use Ollama, so I can't provide you with suggestions. The LM Studio team takes some time to update llama.cpp, as they make sure it works fine for all users.
drFennec@reddit
It won't work, you'll need to use their fork of llama.cpp which has support for 1bit quants.
Internal_Newt_7343@reddit
lol, how should I know :D. On their Hugging Face page under "Use this model" the LM Studio option was there, so that's why I tried LM Studio! But thanks for clarifying this, guys.
Stepfunction@reddit
This feels like marketing hype bullshit. No information provided about the training.
Murgatroyd314@reddit
This feels like an April fool’s joke, but apparently they posted it yesterday.
JsThiago5@reddit
What is this underground https://github.com/PrismML-Eng/llama.cpp repo? After what happened with LiteLLM I do not trust running this.
Interpause@reddit
the best way to do it is to squash the fork changes into a single git diff, ask your favourite AI to double-check it's safe if you can't read code, then apply it on top of mainline llama.cpp and build it yourself
tarruda@reddit
Would love to see that applied to the new Qwen 3.5 models. If the intelligence density scales, that would mean the RAM requirements would drastically reduce:
-dysangel-@reddit
Definitely want to see 27B or larger with this method. Bonsai feels impressive for its size, but it's not able to produce working code yet.
35B would be craaaazy fast..
valuat@reddit
What day is today, again?
Cool-Chemical-5629@reddit
Great, we still have yet to see someone make that notoriously praised 200B 1bit model that can supposedly run on a regular home computer.
Shifty_13@reddit
I guess FP4 is not the limit.
We will get FP1 acceleration in the future.
-dysangel-@reddit
fp1? :P
Guilty-Science9966@reddit
Its just all 0s
green-coder@reddit
always has been
Sioluishere@reddit
just make it int at this point
eat_my_ass_n_balls@reddit
Wait till this mf hears about 0 bit quantization
pmp22@reddit
My P40 is ready for 0-bit quants
m0j0m0j@reddit
This is my quant
Then-Salary-6859@reddit
This is where I heal my weights
thrownawaymane@reddit
How dare you post about my brain’s architecture
last_llm_standing@reddit
My intel celron dekstop from 2007 performs better than P40
asssuber@reddit
Where exactly is your floating point with just 1 bit?
drFennec@reddit
A 1 bit FP is just a point.
Shifty_13@reddit
I saw this post very early and needed to write something stupid to test the theory that new comments get all the upvotes. And well, I was right about that
But yeah, 1 bit is called boolean.
wonderwind271@reddit
If my understanding is correct, 4-bit quantization is not FP4. You are not literally representing a floating number in 4 bits in regular sense
hazmatika@reddit
Am I the only one that thought this might be an April Fool’s joke?
TopChard1274@reddit
The Locally AI app has the Bonsai model ready to download and try, so I tried it on my 8GB-RAM M1 iPad Pro. It's blazingly fast, and a miracle that something like this even works. But for my use case (understanding a creative text) it seems dumb as a rock, which I guess is tied to the model itself and the 1-bit compression.
Stunning_Mast2001@reddit
We needs a hybrid 1 bit diffusion mamba multimodal models with turbo quant caches
ganonfirehouse420@reddit
mamba mamba mamba
It's sure gonna be fast.
pulse77@reddit
From whitepaper (https://github.com/PrismML-Eng/Bonsai-demo/blob/main/1-bit-bonsai-8b-whitepaper.pdf):
"1-bit Bonsai 8B is built from Qwen3-8B"
So this seems to be a transformation of the Qwen model. I wonder if the same transformation may be applied to Qwen3.5-27B or even larger MoE models...
Marcuss2@reddit
Went through the paper; their methodology for measuring knowledge density is somewhat questionable.
For example, we already quantize models to 4 bits, yet they almost always take the full bf16 weights for the other models.
Also, they measure intelligence per GB, but intelligence scales logarithmically, not linearly.
tarruda@reddit
Even if they were comparing with current SOTA 4-bit quantizations, it is still impressive if the 1-bit can deliver what they promise.
Due_Net_3342@reddit
cant wait for the 0 bit version
ketosoy@reddit
Rand_between() rides again!
Shiny-Squirtle@reddit
Can't wait for the 1 qubit version
charmander_cha@reddit
Proprietary? If it were made open source, it would cause the AI bubble to burst.
bolmer@reddit
Open weight. It's not really a revolution. Even Unsloth (a mainstream local LLM org) has 1-bit quants of Qwen 3
9r4n4y@reddit
Yeah, Unsloth also has 1-bit quants, but those quants are very unstable and very poor in quality, while this LLM is very high quality, as we can see in the benchmarks.
AnonymousTransfem@reddit
tried Bonzai 8B gguf on their fork, prompt: "hii how are you !!", output was this
cafedude@reddit
Similar (and it's dog slow even though I built their llama.cpp fork with AVX2 enabled):
Inside-Spot4136@reddit
Can you try building it like they do in the Colab notebook? I tested that, and it is slow, but the output is OK. I even asked it to summarize some PDFs of papers and I liked the summaries.
https://www.reddit.com/r/LocalLLaMA/comments/1s90wo4/comment/odml25w/?utm_source=share&utm_medium=web3x&utm_name=web3xcss&utm_term=1&utm_content=share_button
cafedude@reddit
ah, there's the rub: before you build you have to check out their branch: git checkout prism. But you only see that in the Colab code.
Far_Composer_5714@reddit
Just yesterday I was looking at PaddleOCR, and it has a very similar installation, requiring pulling a specific branch in order to install properly.
Firepal64@reddit
Fork lesson: when you clone a repo for the first time, unless you specify a branch, it'll use the default branch. But most forks keep the default branch unchanged and work in new branches instead.
Use `git branch` to list the branches in a cloned repo.
Opposite_Parsley677@reddit
It worked for me using the instructions on the Github and was fast: https://github.com/PrismML-Eng/Bonsai-demo
Initially didn't realize they are working off their own llama.cpp fork - won't work without it
./scripts/run_llama.sh -p "How to grow a Bonsai tree?"
> How to grow a Bonsai tree?
Growing a **Bonsai tree** is a rewarding and artistic endeavor that requires patience, care, and attention to detail. Bonsai is a Japanese art form of cultivating small, carefully pruned trees in a pot, often representing a larger tree in a miniature form. Here's a comprehensive guide to growing a bonsai tree:
Inside-Spot4136@reddit
I tried their Colab (basically just ran all the cells). The 10th cell gave me url in output, which I opened, and it showed a chat interface. I entered the same prompt, but the response I got was:
```Hello! I'm an AI assistant, so I don't have feelings or emotions like humans, but I'm here to help with any questions or tasks you have. How can I assist you today?```
hideo_kuze_@reddit
using the wrong parameters?
either you're doing something wrong or this model is a scam
because the benchmarks look good https://huggingface.co/prism-ml/Bonsai-8B-gguf#benchmarks
cafedude@reddit
I'm getting similar results using their llama.cpp fork. It's pretty brain-dead. And slow even though I built for CPU with AVX2 enabled.
Bubbly-Staff-9452@reddit
About what I'd expect lol. In theory this has the potential to be amazing for something like sorting or classification on low-power devices, but with quants this low I've never had a good experience, so I just move to a smaller model at a higher quant. I've settled on 4B models at 4-bit quant as the smallest usable models for my fine-tuned scenarios.
w8cycle@reddit
What is a 1-bit model? How is 1 bit going to be enough?
MonkeyOnFire120@reddit
It can only answer yes or no questions
dark-light92@reddit
Chain enough yes/no and you can get pretty complex behavior. Fun fact: All modern computing is built on top of yes/no.
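Less jokingly: "1-bit" usually means each weight is stored as just its sign, with a shared floating-point scale per tensor or group. A minimal sketch of BitNet-style absmean binarization (illustrative; PrismML haven't published their actual method):

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(256, 256)).astype(np.float32)  # full-precision weight matrix

# Binarize: keep only the sign of each weight (1 bit) plus one shared fp scale.
alpha = np.mean(np.abs(W))   # shared scale via the "absmean" trick
W_bin = np.sign(W)           # entries in {-1, +1} (0 almost never occurs)
W_hat = alpha * W_bin        # dequantized 1-bit approximation of W

x = rng.normal(size=256).astype(np.float32)
y_full = W @ x
y_1bit = W_hat @ x
# The 1-bit output stays strongly correlated with the full-precision output.
corr = np.corrcoef(y_full, y_1bit)[0, 1]
print(round(float(corr), 2))
```

One fp32 scale per 256×256 block means storage is essentially 1 bit per weight; real schemes use per-group scales (like the `g128` in the GGUF name above) to cut the approximation error further.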
Ok_Reference_1100@reddit
What’s the quality tradeoff?
M0ULINIER@reddit
As with all quants: knowledge and edge cases suffer. Will have to see if it's still good generally tho
Adventurous-Okra-407@reddit
hmm... exact same parameters and chat template as Qwen. Looks sus to me.
M0ULINIER@reddit
It's just a Qwen3 model under the hood
redonculous@reddit
https://youtu.be/LRq_SAuQDec
denoflore_ai_guy@reddit
What they don’t say is the whitepaper is deliberately vague on the actual compression method - they call it “proprietary Caltech IP” and “mathematically grounded advances” without publishing the technique. So you can use the models but you can’t reproduce the compression pipeline. No native 1-bit hardware exists yet, so the speed gains come purely from software kernel optimizations on standard GPUs.
Alarming-Ad8154@reddit
Looks like they just quantized Qwen3 8B to me, but it would def require some innovation in quant-aware finetuning? Or just a lot of it?
alexchen_gamer@reddit
The memory footprint angle is what caught my eye here. Been running a local AI companion setup and the whisper + LLM stack already eats through RAM fast. A solid 8B at ~1GB would genuinely change what's possible on a mid-range laptop without a dedicated GPU. The conversational task performance is the real question though - benchmarks always look better than real-world dialogue quality in my experience.
Worried_Drama151@reddit
Way too fragile at 1-bit; abstract things make it go bananas
AppealSame4367@reddit
wtf! wow
Cinci_Socialist@reddit
This is the way.
nicholas_the_furious@reddit
Gimme a big one.
the__storm@reddit
It'd be nice if they compared to some quantized models, or at least something with natively lower precision weights like GPT-OSS. Running all the competition at fp16 is a bit disingenuous when it's well known that fp16 models retain a lot of their capability down to 5-6 bpw and are still usable even at 3-4.
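For scale, here's the raw weight storage at different bits per weight for an 8.19B-parameter model (back-of-the-envelope only; real GGUF files run slightly larger because of per-group scales and higher-precision embeddings):

```python
PARAMS = 8.19e9  # Qwen3-8B parameter count as reported by llama-bench

def model_gib(params, bpw):
    """Raw weight storage in GiB at a given bits-per-weight."""
    return params * bpw / 8 / 2**30

for bpw in (16, 8, 5, 4, 1):
    print(f"{bpw:>2} bpw: {model_gib(PARAMS, bpw):5.2f} GiB")
```

At 1 bpw the raw weights come to roughly 0.95 GiB, consistent with the ~1.07 GiB GGUF reported above once group scales are added, versus ~15.3 GiB at fp16 and ~3.8 GiB at a typical 4-bit quant.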
Due_Net_3342@reddit
so this is a fancy binary tree?
silentus8378@reddit
How much did it cost to make those 1-bit models?