ggml: add Q1_0 1-bit quantization support (CPU) - 1-bit Bonsai models
Posted by pmttyji@reddit | LocalLLaMA | View on Reddit | 43 comments
Bonsai's 8B model is just 1.15GB so CPU alone is more than enough.
aero-spike@reddit
So should I install the Windows x64 CPU version or the Vulkan version?
pmttyji@reddit (OP)
CPU, Metal & Vulkan support done. So yes.
aero-spike@reddit
Oh sorry, I mean I’m using a Ryzen 5600G, which specific version should I install?
ChatGPT told me to install the CPU version, Gemini & Claude told me to install the Vulkan version. So I’m confused.
pmttyji@reddit (OP)
GPUs will give you the best t/s, since VRAM is (10X) faster than system RAM. So if you're using an NVIDIA GPU, use the CUDA version. If you're using an AMD GPU, use the Vulkan or ROCm version. If you don't have a GPU, use the CPU version.
tarruda@reddit
Will this quantization be available to other models or is it only for Bonsai's models?
ilintar@reddit
It's available for any model (but YMMV for models that weren't explicitly trained for this :>)
pmttyji@reddit (OP)
Haven't tried yet. Hoping this update can run models like Falcon3-10B-Instruct-1.58bit
Silver-Champion-4846@reddit
Why 1bit and not 1.58bit ternary?
Party-Special-5177@reddit
Smoke and mirrors and PrOpRiEtArY AlGoRiThMs. I still don’t know why Prism didn’t use any of the industry standard naming conventions for derived models - the model isn’t theirs, it’s just Qwen 3 quantized and healed.
The damn thing should be named Qwen-3-Q1-xxx like everyone else who quants someone else’s model into bitnets (e.g. the TII guys out of the UAE made the Falcon-E series)
Silver-Champion-4846@reddit
How did they heal a fricking 1bit llm?
Party-Special-5177@reddit
The method’s pretty ridiculous, but you generally turn the donor model’s weights into your master weights, then ‘gently’ turn the quantization up on your model until it is a bitnet lol.
More on the process in general (the blog isn’t mine): https://www.emergentmind.com/topics/bitnet-b1-58
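That "gently turn the quantization up" idea can be sketched in a toy numpy example (my own illustration of BitNet-style quantization-aware training with a straight-through estimator, not Prism's actual recipe): the forward pass uses weights blended toward their ternary projection, with the blend ramping from 0 to 1, while gradients keep updating the full-precision master weights.

```python
import numpy as np

rng = np.random.default_rng(0)

def ternary(w, scale):
    # project weights onto {-1, 0, +1} * scale
    return np.clip(np.round(w / scale), -1, 1) * scale

# Toy linear model: fit y = x @ w_true while "healing" toward a bitnet.
x = rng.normal(size=(256, 8))
w_true = rng.normal(size=8)
y = x @ w_true

w = 0.1 * rng.normal(size=8)     # full-precision master/shadow weights
lr, steps = 0.05, 300
for t in range(steps):
    lam = t / (steps - 1)        # quantization strength ramps 0 -> 1
    scale = np.mean(np.abs(w)) + 1e-8
    w_eff = (1 - lam) * w + lam * ternary(w, scale)  # blended forward weights
    err = x @ w_eff - y
    # straight-through estimator: the gradient w.r.t. w_eff updates w directly
    w -= lr * (x.T @ err) / len(x)

scale = np.mean(np.abs(w)) + 1e-8
w_q = ternary(w, scale)          # final, fully quantized weights
print(np.round(w_q / scale))     # ternary codes, each in {-1, 0, 1}
```

By the last step the forward pass is fully ternary, yet the model was never forced to quantize all at once, which is the point of the gradual schedule.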
Silver-Champion-4846@reddit
I'm confused, is this q1_0 1bit or 1.58bit?
Party-Special-5177@reddit
Blog is 1.58 bit; both are bitnets and the process is the same.
The 1-bit definition is two evenly weighted levels (-1, 1) across the -1 to 1 range; 1.58-bit (log2(3) ≈ 1.58) adds a zero level, giving (-1, 0, 1) across the same range.
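Concretely, the two schemes can be sketched like this (a toy numpy illustration, not the actual ggml kernels; the ternary scale rule follows the BitNet b1.58 paper's mean-absolute-weight convention):

```python
import numpy as np

def quantize_binary(w: np.ndarray) -> np.ndarray:
    """1-bit: every weight becomes -1 or +1 (its sign)."""
    return np.where(w >= 0, 1.0, -1.0)

def quantize_ternary(w: np.ndarray) -> np.ndarray:
    """1.58-bit (log2(3) ~ 1.58): each weight becomes -1, 0, or +1,
    scaled by the mean absolute weight as in BitNet b1.58."""
    scale = np.mean(np.abs(w)) + 1e-8
    return np.clip(np.round(w / scale), -1, 1)

w = np.array([0.9, -0.05, -0.7, 0.02])
print(quantize_binary(w))   # every weight forced to -1 or +1
print(quantize_ternary(w))  # small weights can drop to 0
```

The extra zero level is why ternary models can simply prune near-zero weights instead of rounding them up to ±1.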
Silver-Champion-4846@reddit
I was asking about this Bonsai thing. The 0 adds more power to the model so they should maybe have used 1.58, unless they actually did that and they're just mindfricking us with the naming convention
Party-Special-5177@reddit
Ahh, sorry; bonsai is properly 1 bit binary.
Silver-Champion-4846@reddit
Still waiting for support of that model on Jan, but that would require llama.cpp to support it fully.
whitestuffonbirdpoop@reddit
I thought they had trained a 1 bit model from scratch. It's just a quant of qwen 3?
Party-Special-5177@reddit
Completely; it's just an unattributed distill of Qwen 3 8B. When you quant/distill, you need an architecture and a donor/teacher model. They used Qwen 3 8B as both the donor and the architecture.
They acknowledge this deep in their white paper (page 6, section 4), which hilariously is the first page to be hidden from the preview on GitHub https://github.com/PrismML-Eng/Bonsai-demo/blob/main/1-bit-bonsai-8b-whitepaper.pdf :
Please contrast that to how they’ve marketed it here.
whitestuffonbirdpoop@reddit
lmao shameful display
is it too difficult to train a 1 bit model from scratch? if the efficiency gains are so good, wouldn't it be worth doing it?
Silver-Champion-4846@reddit
It would probably need a lot more pretraining, and since the shadow weights are 16-bit, the compute cost is not small. Companies be stingy in the interest of time and lessening headaches lol
lolwutdo@reddit
Qwen 3 or Qwen 3.5? Would be neat if they could 1bit Qwen 3.5 397b.
Party-Special-5177@reddit
Qwen 3 8B
Silver-Champion-4846@reddit
Please keep us posted when it's done! slirp slirp. I wonder, does it use Imatrix? If so then the calibration dataset might just not account for some of my usecases like Arabic language processing.
pmttyji@reddit (OP)
1 bit version? Please do it
Party-Special-5177@reddit
I’ll run it both ways if it actually turns out to be good. I put a system together that actually adds parameters to the model to ensure certain loss targets are hit.
My hope is to be able to guarantee that the output will be indistinguishable from the original model within some error tolerance, and I mapped the error tolerances onto standard naive quants (e.g. 6-bit, 4-bit, etc). I have high hopes but the system is unproven and I’m quite worried about failure.
pmttyji@reddit (OP)
Any plans to try medium-size models like Qwen3.5-27B, Qwen3.5-35B, Gemma4-26B, or Gemma4-31B first? Medium-size models won't take as long as large ones like Qwen3.5-397B, so you could get results quickly.
Thanks again
Then-Topic8766@reddit
Something is wrong. Just updated llama.cpp and Bonsai works but incredibly slow (0.5 t/s). With prism fork generation speed is 165 t/s.
Silver-Champion-4846@reddit
Someone needs to migrate that implementation into mainstream llama.cpp
pmttyji@reddit (OP)
I tested with my old laptop which has 16GB DDR3 RAM. Got 0.3 t/s. Don't know why. I'll check with my current laptop (32GB DDR5) soon/later.
FastDecode1@reddit
yeah, came here to say this... 0.09 tps
Feels like the model took a long time to load as well, though I'm using router mode and just tabbed out for a while.
121507090301@reddit
I got 0.06T/s too, running fully on the cpu...
ilintar@reddit
Backends will follow don't worry :)
Kahvana@reddit
Awesome! Can't wait to try this out on my Intel N5000 + Intel UHD Graphics 605 with Vulkan.
Speaking of which, I saw other models being quantized to Q1_0. Anything special I need to do to reproduce these, or could I simply target `Q1_0` in llama-quantize?
Silver-Champion-4846@reddit
Wait how are you running Vulkan on CPU + UHD iGPU?
Kahvana@reddit
Llama.cpp (Windows, Vulkan build). Llama.cpp on Linux didn't work since the iGPU has bad drivers, but the Windows version did (and supports Vulkan 1.3). BF16 models aren't supported, but F16, Q8_0 and Q4_K_S work fine. IQ4 models don't run as well, and Unsloth's XL quants run terribly.
Silver-Champion-4846@reddit
What about Apexquants?
Skyline34rGt@reddit
Wondering whether a dense Qwen3.5 27B or Gemma 31B at 1-bit fits fully into 8-10GB VRAM.
Or, if my math is correct, whether the MoE MiniMax 2.5-2.7 at 1-bit fits into 12GB and 48GB VRAM.
That will be something!
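The back-of-envelope arithmetic is easy to check. A sketch, assuming the effective ~1.15 bits/weight implied by the 8B Bonsai GGUF being 1.15GB (that figure already folds in higher-precision embedding/output tensors and metadata):

```python
def quant_size_gb(n_params: float, bits_per_weight: float = 1.15) -> float:
    """Very rough GGUF file size: params * effective bits per weight / 8.
    1.15 bits/weight is implied by the 8B Bonsai GGUF being 1.15 GB."""
    return n_params * bits_per_weight / 8 / 1e9

for name, n in [("Qwen3.5 27B", 27e9), ("Gemma 31B", 31e9)]:
    print(f"{name}: ~{quant_size_gb(n):.1f} GB")
# Qwen3.5 27B: ~3.9 GB
# Gemma 31B: ~4.5 GB
```

So both dense models would land well under 8GB of weights, leaving room for KV cache and context; actual sizes depend on which tensors stay at higher precision.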
pmttyji@reddit (OP)
Just rough math from an AI. But yes, MiniMax will fly with 48GB VRAM.
Skyline34rGt@reddit
Amazing. Can't wait to see what's next
pmttyji@reddit (OP)
u/Party-Special-5177 Please cook small/medium models
Foreign-Beginning-49@reddit
it's moving like molasses....but at least it generated a few words, so we are on our way towards it working! using the gguf from the huggingface prism repo...and newest llama.cpp fetched....
Zestyclose_Yak_3174@reddit
I am looking forward to giving this a try on edge devices and smartphones. Could be a lot faster even on slower hardware. Hard to believe it really does deliver in terms of its coherence and intelligence. If so, it can give us a small glimpse of what might be possible in the future in terms of better quantization and compression.
spaceman_@reddit
Looking forward to trying this in pocketpal!