ggml: add Q1_0 1-bit quantization support (CPU) - 1-bit Bonsai models
Posted by pmttyji@reddit | LocalLLaMA | View on Reddit | 43 comments
Bonsai's 8B model is just 1.15GB so CPU alone is more than enough.
aero-spike@reddit
So should I install the Windows x64 CPU version or the Vulkan version?
pmttyji@reddit (OP)
CPU, Metal & Vulkan support done. So yes.
aero-spike@reddit
Oh sorry, I mean I’m using a Ryzen 5600G, which specific version should I install?
ChatGPT told me to install the CPU version, Gemini & Claude told me to install the Vulkan version. So I’m confused.
pmttyji@reddit (OP)
GPUs will give you the best t/s, since VRAM is (10X) faster than system RAM. So if you're using an NVIDIA GPU, use the CUDA version. If you're using an AMD GPU, use the Vulkan or ROCm version. If you don't have a GPU, use the CPU version.
tarruda@reddit
Will this quantization be available to other models or is it only for Bonsai's models?
ilintar@reddit
It's available for any model (but YMMV for models that weren't explicitly trained for this :>)
pmttyji@reddit (OP)
Haven't tried yet. Hoping this update can run models like Falcon3-10B-Instruct-1.58bit
Silver-Champion-4846@reddit
Why 1bit and not 1.58bit ternary?
Party-Special-5177@reddit
Smoke and mirrors and PrOpRiEtArY AlGoRiThMs. I still don’t know why Prism didn’t use any of the industry standard naming conventions for derived models - the model isn’t theirs, it’s just Qwen 3 quantized and healed.
The damn thing should be named Qwen-3-Q1-xxx like everyone else who quants someone else’s model into bitnets (e.g. the TII guys out of the UAE made the Falcon-E series)
Silver-Champion-4846@reddit
How did they heal a fricking 1bit llm?
Party-Special-5177@reddit
The method’s pretty ridiculous, but you generally turn the donor model’s weights into your master weights, then ‘gently’ turn the quantization up on your model until it is a bitnet lol.
More on the process in general (the blog isn’t mine): https://www.emergentmind.com/topics/bitnet-b1-58
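That "gently turn the quantization up" idea can be sketched in a toy numpy example (my own illustration of BitNet-style quantization-aware training with a straight-through estimator, not Prism's actual recipe): the forward pass uses weights blended toward their ternary projection, with the blend ramping from 0 to 1, while gradients keep updating the full-precision master weights.

```python
import numpy as np

rng = np.random.default_rng(0)

def ternary(w, scale):
    # project weights onto {-1, 0, +1} * scale
    return np.clip(np.round(w / scale), -1, 1) * scale

# Toy linear model: fit y = x @ w_true while "healing" toward a bitnet.
x = rng.normal(size=(256, 8))
w_true = rng.normal(size=8)
y = x @ w_true

w = 0.1 * rng.normal(size=8)     # full-precision master/shadow weights
lr, steps = 0.05, 300
for t in range(steps):
    lam = t / (steps - 1)        # quantization strength ramps 0 -> 1
    scale = np.mean(np.abs(w)) + 1e-8
    w_eff = (1 - lam) * w + lam * ternary(w, scale)  # blended forward weights
    err = x @ w_eff - y
    # straight-through estimator: the gradient w.r.t. w_eff updates w directly
    w -= lr * (x.T @ err) / len(x)

scale = np.mean(np.abs(w)) + 1e-8
w_q = ternary(w, scale)          # final, fully quantized weights
print(np.round(w_q / scale))     # ternary codes, each in {-1, 0, 1}
```

By the last step the forward pass is fully ternary, yet the model was never forced to quantize all at once, which is the point of the gradual schedule.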
Silver-Champion-4846@reddit
I'm confused, is this q1_0 1bit or 1.58bit?
Party-Special-5177@reddit
Blog is 1.58 bit; both are bitnets and the process is the same.
The 1-bit definition is two evenly weighted levels (-1, 1) across the -1 to 1 range; 1.58-bit (log2(3) ≈ 1.58) adds a zero level, giving (-1, 0, 1) across the same range.
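Concretely, the two schemes can be sketched like this (a toy numpy illustration, not the actual ggml kernels; the ternary scale rule follows the BitNet b1.58 paper's mean-absolute-weight convention):

```python
import numpy as np

def quantize_binary(w: np.ndarray) -> np.ndarray:
    """1-bit: every weight becomes -1 or +1 (its sign)."""
    return np.where(w >= 0, 1.0, -1.0)

def quantize_ternary(w: np.ndarray) -> np.ndarray:
    """1.58-bit (log2(3) ~ 1.58): each weight becomes -1, 0, or +1,
    scaled by the mean absolute weight as in BitNet b1.58."""
    scale = np.mean(np.abs(w)) + 1e-8
    return np.clip(np.round(w / scale), -1, 1)

w = np.array([0.9, -0.05, -0.7, 0.02])
print(quantize_binary(w))   # every weight forced to -1 or +1
print(quantize_ternary(w))  # small weights can drop to 0
```

The extra zero level is why ternary models can simply prune near-zero weights instead of rounding them up to ±1.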
Silver-Champion-4846@reddit
I was asking about this Bonsai thing. The 0 adds more power to the model so they should maybe have used 1.58, unless they actually did that and they're just mindfricking us with the naming convention
Party-Special-5177@reddit
Ahh, sorry; bonsai is properly 1 bit binary.
Silver-Champion-4846@reddit
Still waiting for support of that model on Jan, but that would require llama.cpp to support it fully.
whitestuffonbirdpoop@reddit
I thought they had trained a 1 bit model from scratch. It's just a quant of qwen 3?
Party-Special-5177@reddit
Completely; it's just an unattributed distill of Qwen 3 8B. When you quant/distill, you need an architecture and a donor/teacher model. They used Qwen 3 8B as both the donor and the architecture.
They acknowledge this deep in their white paper (page 6, section 4), which hilariously is the first page to be hidden from the preview on GitHub https://github.com/PrismML-Eng/Bonsai-demo/blob/main/1-bit-bonsai-8b-whitepaper.pdf :
Please contrast that to how they’ve marketed it here.
whitestuffonbirdpoop@reddit
lmao shameful display
is it too difficult to train a 1 bit model from scratch? if the efficiency gains are so good, wouldn't it be worth doing it?
Silver-Champion-4846@reddit
It would probably need a lot more pretraining, and since the shadow weights are 16-bit, the compute cost is not small. Companies be stingy in the interest of time and lessening headaches lol
lolwutdo@reddit
Qwen 3 or Qwen 3.5? Would be neat if they could 1bit Qwen 3.5 397b.
Party-Special-5177@reddit
Qwen 3 8B
Silver-Champion-4846@reddit
Please keep us posted when it's done! slirp slirp. I wonder, does it use Imatrix? If so then the calibration dataset might just not account for some of my usecases like Arabic language processing.
pmttyji@reddit (OP)
1 bit version? Please do it
Party-Special-5177@reddit
I’ll run it both ways if it actually turns out to be good. I put a system together that actually adds parameters to the model to ensure certain loss targets are hit.
My hope is to be able to guarantee that the output will be indistinguishable from the original model within some error tolerance, and I mapped the error tolerances onto standard naive quants (e.g. 6-bit, 4-bit, etc). I have high hopes but the system is unproven and I’m quite worried about failure.
pmttyji@reddit (OP)
Any plans to try medium-size models like Qwen3.5-27B, Qwen3.5-35B, Gemma4-26B, or Gemma4-31B first? Medium-size models won't take as long as large ones like Qwen3.5-397B, so you could get results quickly.
Thanks again
Then-Topic8766@reddit
Something is wrong. Just updated llama.cpp and Bonsai works but incredibly slow (0.5 t/s). With prism fork generation speed is 165 t/s.
Silver-Champion-4846@reddit
Someone needs to migrate that implementation into mainstream llama.cpp
pmttyji@reddit (OP)
I tested with my old laptop which has 16GB DDR3 RAM. Got 0.3 t/s. Don't know why. I'll check with my current laptop (32GB DDR5) soon/later.
FastDecode1@reddit
yeah, came here to say this... 0.09 tps
Feels like the model took a long time to load as well, though I'm using router mode and just tabbed out for a while.
121507090301@reddit
I got 0.06T/s too, running fully on the cpu...
ilintar@reddit
Backends will follow don't worry :)
Kahvana@reddit
Awesome! Can't wait to try this out on my Intel N5000 + Intel UHD Graphics 605 with Vulkan.
Speaking of which, I saw other models being quantized to Q1_0. Anything special I need to do to reproduce these, or could I simply target `Q1_0` in llama-quantize?
Silver-Champion-4846@reddit
Wait how are you running Vulkan on CPU + UHD iGPU?
Kahvana@reddit
Llama.cpp (Windows, Vulkan build). Llama.cpp on Linux didn't work since the iGPU has bad drivers, but the Windows version did (and supports Vulkan 1.3). BF16 models aren't supported, but F16, Q8_0 and Q4_K_S work fine. IQ4 models don't run as well, and Unsloth's XL quants run terribly.
Silver-Champion-4846@reddit
What about Apexquants?
Skyline34rGt@reddit
Wondering whether a dense Qwen3.5 27B or Gemma 31B at 1-bit fits fully into 8-10GB VRAM.
Or, if my math is correct, whether the MoE MiniMax 2.5-2.7 at 1-bit fits into 12GB and 48GB VRAM.
That will be something!
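The back-of-envelope arithmetic is easy to check. A sketch, assuming the effective ~1.15 bits/weight implied by the 8B Bonsai GGUF being 1.15GB (that figure already folds in higher-precision embedding/output tensors and metadata):

```python
def quant_size_gb(n_params: float, bits_per_weight: float = 1.15) -> float:
    """Very rough GGUF file size: params * effective bits per weight / 8.
    1.15 bits/weight is implied by the 8B Bonsai GGUF being 1.15 GB."""
    return n_params * bits_per_weight / 8 / 1e9

for name, n in [("Qwen3.5 27B", 27e9), ("Gemma 31B", 31e9)]:
    print(f"{name}: ~{quant_size_gb(n):.1f} GB")
# Qwen3.5 27B: ~3.9 GB
# Gemma 31B: ~4.5 GB
```

So both dense models would land well under 8GB of weights, leaving room for KV cache and context; actual sizes depend on which tensors stay at higher precision.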
pmttyji@reddit (OP)
Just rough math from an AI. But yes, MiniMax will fly with 48GB VRAM.
Skyline34rGt@reddit
Amazing. Can't wait to see what's next
pmttyji@reddit (OP)
u/Party-Special-5177 Please cook small/medium models
Foreign-Beginning-49@reddit
it's moving like molasses....but at least it generated a few words, so we are on our way towards it working! using the gguf from the huggingface prism repo...and newest llama.cpp fetched....
Zestyclose_Yak_3174@reddit
I am looking forward to giving this a try on edge devices and smartphones. Could be a lot faster even on slower hardware. Hard to believe it really does deliver in terms of its coherence and intelligence. If so, it can give us a small glimpse of what might be possible in the future in terms of better quantization and compression.
spaceman_@reddit
Looking forward to trying this in pocketpal!