Are 1-bit models and TurboQuant the future of OSS? A simulation for Qwen3.5 models.
Posted by GizmoR13@reddit | LocalLLaMA | 84 comments
A simulation of what the Qwen3.5 model family would look like with 1-bit weights and TurboQuant. The table below shows the results; this would be a revolution:
| Model | Parameters | Q4_K_M File (Current) | KV Cache (256K) (Current) | Hypothetical 1-bit Weights | KV Cache 256K with TurboQuant | Hypothetical Total Memory Usage |
|---|---|---|---|---|---|---|
| Qwen3.5-122B-A10B | 122B total / 10B active | 74.99 GB | 81.43 GB | 17.13 GB | 1.07 GB | 18.20 GB |
| Qwen3.5-35B-A3B | 35B total / 3B active | 21.40 GB | 26.77 GB | 4.91 GB | 0.89 GB | 5.81 GB |
| Qwen3.5-27B | 27B | 17.13 GB | 34.31 GB | 3.79 GB | 2.86 GB | 6.65 GB |
| Qwen3.5-9B | 9B | 5.89 GB | 14.48 GB | 1.26 GB | 1.43 GB | 2.69 GB |
| Qwen3.5-4B | 4B | 2.87 GB | 11.46 GB | 0.56 GB | 1.43 GB | 1.99 GB |
| Qwen3.5-2B | 2B | 1.33 GB | 4.55 GB | 0.28 GB | 0.54 GB | 0.82 GB |
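For reference, a rough sketch of the back-of-the-envelope arithmetic behind estimates like these (the post's exact methodology isn't given; the bits-per-weight figures and the layer/head counts below are illustrative assumptions, not Qwen3.5 specs):

```python
# Back-of-the-envelope memory estimates for weights and KV cache.
# All config numbers here are illustrative assumptions, not real model specs.

def weights_gb(n_params: float, bits_per_weight: float) -> float:
    """Weight storage in GB at a given average precision."""
    return n_params * bits_per_weight / 8 / 1e9

def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                context: int, bits_per_elem: float) -> float:
    """KV cache in GB: K and V for every layer, KV head, and position."""
    return 2 * n_layers * n_kv_heads * head_dim * context * bits_per_elem / 8 / 1e9

# Hypothetical 27B dense config: 64 layers, 4 KV heads (GQA), head_dim 128.
print(weights_gb(27e9, 5.0))                  # ~16.9 GB at a Q4_K_M-like ~5 bits/weight
print(weights_gb(27e9, 1.1))                  # ~3.7 GB at ~1.1 bits/weight average
print(kv_cache_gb(64, 4, 128, 262_144, 16))   # ~34.4 GB f16 cache at 256K context
print(kv_cache_gb(64, 4, 128, 262_144, 4))    # ~8.6 GB at a 4-bit cache
```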
No-Refrigerator-1672@reddit
Why stop at 1-bit? Let's go with 0 bit! Who even needs weights at all? Imagine running a model with literally zero vram needed!
bapuc@reddit
This is what i am working on
Silver-Champion-4846@reddit
Where have you reached so far?
bapuc@reddit
GLM 4.7 at 4-bit quant, 2 tokens per second, CPU only. I need at least 50, and I need to scale up until no quant is needed.
Silver-Champion-4846@reddit
What do you mean Quant 4? Can you please explain in detail?
bapuc@reddit
4-bit quantization
https://huggingface.co/unsloth/GLM-4.7-GGUF/tree/main/Q4_0
Silver-Champion-4846@reddit
So, you're trying to go from that one into 1.58bit?
bapuc@reddit
No, I am trying to:
Or
Or
Or
Or
There are many more than those. I'm at experiment 90 or so, have learned so many things, and will continue to experiment now.
pmttyji@reddit
Hope your experiments went/are going well. Please share any updates here.
bapuc@reddit
Still experimenting, nothing new for now. I am currently experimenting with training from scratch / compressing a MoE into a multi-dimensional spectrogram (like the ones you get from audio files).
Modal.com funded me with some startup credits, so things should go faster now
Will make a post in the sub with all the experiments once I get into something truly interesting
pmttyji@reddit
Nice to hear this!
live_love_laugh@reddit
Why the sarcasm? Maybe the wins are overblown a bit and the performance loss underplayed, but I still think the benefits are real and significant.
I'm still surprised that PrismML went for 1-bit and not for 1.58-bit (i.e. ternary) parameters. Intuitively I would think that having both 0 and -1 at your disposal would be a massive win for the expressiveness of the network. But I'm not really educated enough for my intuition to be worth much.
I have seen people talk about how the real world performance of PrismML's Bonsai models is disappointing. But I mean, if the performance loss can be mitigated by adding 15% more parameters then it would still be a net win.
I just wish we would see more companies pour more resources into trying to get 1(.58)-bit models to work. It's not just the memory savings, but also the compute savings from simplifying the matrix multiplications.
Brou1298@reddit
Also confused by the no ternary
Flinchie76@reddit
I think they're literally just partitioning the input activations into two groups based on the weight signs, summing each group separately, and subtracting the two sums. Basically 0 and 1 map to -1 and +1, so you don't need the "zero escape hatch".
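A minimal sketch of that trick, assuming weights restricted to {-1, +1} (the helper name is made up): the dot product needs no multiplications at all, just two group sums and a subtraction.

```python
import numpy as np

# Binary-weight dot product: partition activations by weight sign,
# sum each group, subtract. No multiplications involved.
def binary_dot(x: np.ndarray, w_bits: np.ndarray) -> float:
    """x: float activations; w_bits: 0/1 array mapping to weights -1/+1."""
    return x[w_bits == 1].sum() - x[w_bits == 0].sum()

x = np.random.randn(8).astype(np.float32)
w_bits = np.random.randint(0, 2, size=8)
w = np.where(w_bits == 1, 1.0, -1.0)            # the weights the bits encode
assert np.isclose(binary_dot(x, w_bits), x @ w)
```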
Brou1298@reddit
1.58-bit has 3 possible values (-1, 0, 1); theirs has (0, 1), or (-1, 1), if I'm not mistaken. 1.58-bit does give a lot more capacity, so I'm just expressing confusion at the choice lol
Flinchie76@reddit
Yeah, but it depends on how much information the zero carries in the ternary case relative to the extra 0.58 bits of storage. The matrix ops go from multiplications to sums, and if the zeros are effectively a no-op in many cases and you have essentially 2 subspace partitions rather than 3, then it would make sense.
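For comparison, a ternary sketch of the same idea (again just an illustration): the zero-weight terms contribute nothing to the sum, so they are skipped work paid for with the extra ~0.58 bits per weight.

```python
import numpy as np

# Ternary-weight dot product: weights in {-1, 0, +1}; zero-weight terms
# simply drop out, so only the +/-1 partitions are summed.
def ternary_dot(x: np.ndarray, w: np.ndarray) -> float:
    return x[w == 1].sum() - x[w == -1].sum()

x = np.random.randn(8).astype(np.float32)
w = np.random.choice([-1.0, 0.0, 1.0], size=8)
assert np.isclose(ternary_dot(x, w), x @ w)
```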
No-Refrigerator-1672@reddit
Because there are no real models listed, no real tests run, not even a theoretical proposition on how to quantize to 1-bit without lobotomizing the model. Just some numbers that are completely made up and have nothing behind them. Why would anyone take it seriously?
JayPSec@reddit
The PrismML folks did their benchmarks, and I'm sure we'll get plenty more as the days go by. So far it all seems good and legit. Have you tried it?
live_love_laugh@reddit
I see your point. Some people, me included, just like to fantasize about what could be achieved in the future given what's happening today on the cutting edge. And then those people want to share the excitement they feel about the great potential they see.
So yeah, nothing to take seriously. Just something to either enjoy or ignore.
sonicnerd14@reddit
I think the way to look at it is: what would this do for the quantized versions that are already very coherent at sizes like 3 and 4 bits, maybe even 2-bit? If this is what a 1-bit model can do, then just imagine what a usable-sized model would be able to do with the same optimizations applied.
gigaflops_@reddit
I run all my LLMs at -1 bit quantization and that way they increase the amount of available VRAM on my graphics card
ohgoditsdoddy@reddit
I thought models with ternary weights (~1.58-bit) are possible if they are trained for it (as opposed to being quantized after the fact).
Silver-Champion-4846@reddit
They should be, but there's not much support by companies as of yet
TopChard1274@reddit
Fun at parties \(ϋ)/♩
JsThiago5@reddit
You can simply imagine it running and then type the answer
Constant-Simple-1234@reddit
This is already possible. Just switch from Qwen at 27B to the one at 2B. Seems like thinking is very compressible and can span wide range of sizes. It is just a lossy compression and the loss is real. (Partially joking, at least in tone ;) )
Koalateka@reddit
The kind of mentality that makes science advance...
OXKSA1@reddit
technically you could just use RAM instead of VRAM and use cloud AI models lol
sammcj@reddit
Ideally models would start giving bits back, it's about time
DR4G0NH3ART@reddit
Do it in your head, we might even call it, hmm.. Let us go with Natural intelligence.
dero_name@reddit
> Imagine running a model with literally zero vram needed!
You mean thinking? For myself? Heretic.
Pulselovve@reddit
At some point you reach reasonable physics limits. Weights store information, routines, reasoning patterns, etc. You can squeeze them to a point, but you can't have all human knowledge and thinking patterns (in text at least) compressed into 8 GB...
Poluact@reddit
I don't believe people reason with no information either. We have an immense amount of prior knowledge and reasoning patterns acquired over a lifetime, starting from infancy.
MartiniCommander@reddit
This is exactly like Xvid → x264 → x265 → AV1.
We always say there's a limit but then we find ways to keep going around them.
Feztopia@reddit
You don't need perfect compression. You just need compression that gives you a Qwen3.5-35B-A3B in the size of Qwen3.5-9B Q4 that's still better than the 9B. That would be already progress.
plaintexttrader@reddit
The “very good reasoner without information” is an interesting point, though I think that might be impossible. LLMs reason through chains of thought, which is possible via language training on tons of data, and knowledge is an inseparable “side effect”. You do need knowledge to develop common sense for CoTs. It is not possible to reason that “1 kg of feathers weighs the same as 1 kg of steel” without knowing what a kg, a feather, or steel is.
waruby@reddit
The latest paper from DeepSeek kind of does that, and it is orthogonal to MoE, so it further reduces the number of active parameters required for the same quality of answers from the model.
ai_without_borders@reddit
Yeah, the DeepSeek paper is wild. I was reading some analysis on Zhihu about it, and the interesting context is that this efficiency research isn't just academic for them; it's directly motivated by the chip export restrictions. When you can't buy H100s, you have to squeeze every bit of performance out of what you have. So MoE, low-bit quantization, and KV compression aren't nice-to-haves, they're survival strategies. The fact that these techniques also happen to benefit the local LLM community running on consumer GPUs is kind of a happy accident. Basically, Chinese labs are speed-running efficient inference because they literally have no choice, and we all benefit from it.
Imaginary-Unit-3267@reddit
Tfw someone named "ai_without_borders" accidentally makes a case for AI with borders. :P (since in this case it's the borders that are leading to the restrictions and thus to the improved outcomes for local LLM users)
pointer_to_null@reddit
The borders/restrictions impact is overblown, mostly for political reasons.
Arguably, the export restrictions had limited impact. Chinese firms had numerous proxies in Singapore and other countries where trade enforcement was lax, and the limitations in Nvidia's compliant versions were often circumvented via a combination of Nvidia's own half-assed protections and Chinese tech ingenuity. The restrictions were a hindrance at most, not a hard stop.
The export restrictions are one of many factors applying downward pressure on model scale, driving down the threshold where the ROI on optimization begins to outweigh the benefits of scaling. Once you include power costs (which greatly favor China), memory scarcity, rising hardware costs, and growing public negativity towards datacenters/big silicon/government/etc., you'll find this pressure to be universal.
Worth mentioning that BitNet came from Microsoft and TurboQuant came from Google. This isn't simply China needing smaller, efficient models to survive; everyone wants tiny models. Even closed/hosted AI companies want cheap-yet-powerful LLMs. Without optimizations, OpenAI and Anthropic have zero hope of ever breaking even on inference costs.
Opposite-Swimmer2752@reddit
Lol
oxygen_addiction@reddit
And if engram separation between knowledge/logic ends up working in practice and we get better RAM+VRAM utilization, a bit of all of the above will lead to better local inference.
Ell2509@reddit
But you aren't, in the case of TurboQuant. They have changed the format the data is stored in, without losing accuracy. That's the "polar" stuff, and it relates to the KV cache (conversation history).
For model weights, they are actually building models from the ground up in 1 bit, rather than training a model and then compressing. The claim is that this also changes the form without losing accuracy compared to unquantised models.
The nature of it is changing how you navigate the matrix of relational probabilities, using more concise instructions. Instead of "go left, up 2 floors, forward 4 meters, left, etc. ... until you arrive",
the data is instead stored as (X, Y), where X is direction and Y is distance to the destination. You find the same answer by using a coordinate system that has more possible characters per "digit". Smaller file, same accuracy.
I am not claiming any results on this. I haven't tested it yet, but I'm eager to. However, from what I understand, the claim is indeed that we can keep the same accuracy with less data. Remember, the file is not full of facts, it is just relational statistics about words.
In a sentence, they have learned to navigate meaning without grammar.
Pulselovve@reddit
Yes, they found a way to compress context efficiently, but that's orders of magnitude away from the knowledge that gets compressed into the weights. We are already using similar optimisations in quantisation.
Lorian0x7@reddit
Sure, that's true, but I've been reading this argument for 3 years, and look how far we've come since then. People were already doom-and-gloom about knowledge density 3 years ago; what makes you think now is different? I'm pretty sure in another 3 years we will be having the same discussion.
tmjumper96@reddit
122B models down to 18 GB would be insane, but what about quality degradation with 1-bit? Have you actually tested any of these, or is it just theoretical math?
Savantskie1@reddit
Can you not read? He simulated it, so it's not a test, just theory.
Colecoman1982@reddit
Joke's on him, I'm theorizing that dropping it down to 1-bit will actually IMPROVE performance!
Savantskie1@reddit
At the cost of intelligence and tool use
tmjumper96@reddit
lmaooo you're right
geneusutwerk@reddit
0.5-bit when?
_-_David@reddit
I heard a rumor something like six months ago that Gemma 4 would be a BitNet and push their QAT to the limit. I didn't really put my faith into that, but I do think that is ultimately the better architecture. But of course, there are often esoteric reasons why things don't work the way a curious layperson might think. Training stability? Inference efficiency? Don't know. But it wouldn't surprise me in the least if it eventually turned out that way, and models over 2-bit precision became a relic.
AI_Enhancer@reddit
This aged like fine milk
_-_David@reddit
Why? I said I didn't anticipate it happening; "I didn't really put my faith into that" was what I said, and I never claimed anything beyond that it might "eventually" turn out that way. This seems like you just wanted a "gotcha!" pretty badly.
AI_Enhancer@reddit
Woah, the hell? That wasn't a personal attack, I just thought it would be a funny comment lol. Maybe I should go back to abstaining from commenting on stuff on the internet. Anyways, maybe the proper wording would've been "THAT RUMOR aged like fine milk". Have a good day.
_-_David@reddit
I didn't take it as a personal attack. I just thought it didn't make a whole lot of sense. I totally get the "maybe I will just go back to not engaging people online" feeling. I'm honestly fairly new to all of this. You have a good day as well :)
overand@reddit
Good to see people de-escalating stuff. I got accused of being an LLM or using an LLM (by someone who's been on reddit for 2 years vs my 18 years lol), and it Feels Bad, so it's super nice to see folks trying to be kind to each other.
Hell, even if the "person" one responds to is a bot, at the very least we're modeling positive behavior for other people. (And I guess for bots to train on 🙃)
Soft_Match5737@reddit
The numbers are exciting but one thing the simulation misses is attention compute overhead. Even with 1-bit weights shrinking the model file dramatically, attention is still the bottleneck at long contexts. KV cache compression via TurboQuant helps with memory, but the actual compute for attending over 256K tokens hits a wall regardless of weight precision. The real unlock would be 1-bit weights paired with some form of sparse attention that lets you skip cache entries entirely. That combo would make 122B on consumer hardware genuinely practical, not just technically possible with heroic memory paging.
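The attention point is easy to see with a back-of-the-envelope FLOPs count; a rough sketch below, where every architecture number is an assumption for a hypothetical 122B-A10B-style MoE, not a published Qwen3.5 spec:

```python
# Rough per-token FLOPs during decode, with the KV cache already populated.
# All numbers below are assumptions for a hypothetical 122B-A10B MoE.
active_params = 10e9       # active parameters per token (MoE routing)
n_layers      = 60
n_q_heads     = 40
head_dim      = 128
context       = 262_144    # 256K tokens already in the cache

# Weight matmuls: ~2 FLOPs per active parameter per generated token.
weight_flops = 2 * active_params

# Attention: QK^T plus attention-weighted V, per layer and per query head.
attn_flops = n_layers * n_q_heads * 4 * context * head_dim

print(f"weight FLOPs/token:    {weight_flops:.1e}")   # ~2.0e10
print(f"attention FLOPs/token: {attn_flops:.1e}")     # ~3.2e11 at 256K
```

Under these assumptions, attending over a full 256K cache costs over an order of magnitude more compute per token than the weight matmuls, regardless of weight precision.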
prudant@reddit
Would be really usable at those quant limits o_O. At Q4 with the KV cache at FP8, MoEs suffer a lot of degradation.
rm-rf-rm@reddit
methodology?
cnmoro@reddit
I was just wondering about this today, and it's pretty exciting imo
Background-Initial13@reddit
Wouldn’t this also show that this is the best way to compress information? Like asking these LLMs to recite a book they were trained on.
jaker86@reddit
TurboQuant is great, but it does not apply linearly to cache numbers for models like Qwen3.5; due to their hybrid architecture, some of the cache is not K or V.
Source: running turboquant’d 27b on my 3090
YearnMar10@reddit
But how would NVIDIA earn any money then, if even a Jetson Orin Nano Super could run those models? They'd be ruined!
unbannedfornothing@reddit
Where did you get these numbers for the K/V cache? They are incorrect. Even a 397B model gives `llama_kv_cache: size = 7680.00 MiB (262144 cells, 15 layers, 4/1 seqs), K (f16): 3840.00 MiB, V (f16): 3840.00 MiB` at 256K context for me. And for q8_0: `llama_kv_cache: size = 4080.00 MiB (262144 cells, 15 layers, 4/1 seqs), K (q8_0): 2040.00 MiB, V (q8_0): 2040.00 MiB`
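For what it's worth, the standard KV-cache arithmetic reproduces that f16 log line; the 4 KV heads × 128 head-dim split below is a guess, only their product (512) is pinned down by the log:

```python
# Standard KV-cache size arithmetic, matching the quoted llama.cpp f16 line.
# The 4-head x 128-dim split is an assumption; only KV width = 512 is implied.
cells      = 262_144   # 256K context positions
kv_layers  = 15        # only these layers store K/V in the hybrid architecture
n_kv_heads = 4
head_dim   = 128
bytes_f16  = 2

kv_bytes = cells * kv_layers * n_kv_heads * head_dim * bytes_f16 * 2  # K and V
print(kv_bytes / 2**20, "MiB")   # -> 7680.0 MiB, as in the log line above
```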
TopChard1274@reddit
I wonder which one would run on my M1 iPad Pro with 8 GB RAM. Right now I use Rosetta 4b q6_k for rough translation and Qwen3.5 4b Claude Abliterated q6_k. With the current architecture, around 4.60 GB is the maximum my iPad can even load. Would a 1-bit 27B model potentially work on it? That honestly seems too good to be true. But when did impossible things ever stop anyone from dreaming?
spaceman_@reddit
The 1-bit models which Microsoft (BitNet) and PrismML (Bonsai) developed are NOT 1-bit quantized versions of other models. They are specialized models. You cannot have a 1-bit 8B model that competes against a 4, 8 or 16-bit 8B model and expect the same level of quality.
droans@reddit
I don't think the thought is that a 1-bit model would compete against models with the same number of parameters. It's more a question on how it would compete against an equivalent sized model.
anykeyh@reddit
What's good about 1-bit or 1-trit (-1, 0, 1) models is that they work with additions only. Even better, AND and XOR operations are all you need. No need for floating-point multiplications.
One_Key_8127@reddit
Bonsai is a quantized Qwen3 8B. I wonder whether you can quantize the Qwen3.5 MoE models to 1-bit, but the dense 27B Qwen3.5 should be within PrismML's reach.
a_beautiful_rhind@reddit
Ahh.. ok.. then it's just more fucking grift. Fool me once.
Computationally heavy conversion to low-bit that gets meh performance has been done before. It basically never goes anywhere.
In before a bunch of downvotes saying "n-n-ooo you're wrong this time, its good... :rocket: :rocket:"
I also see why Revolutionalredstone made that mistake. It was a bit misrepresented.
ambient_temp_xeno@reddit
I think the catch could be that they lost 17.3 points on the MMLU-Redux score compared to the original Qwen3 8B.
Aha, but it lets you run a much bigger model than you otherwise could... or does it? Maybe larger models take an even worse hit from the treatment.
a_beautiful_rhind@reddit
Yea no way to know. Super secret proprietary at the moment.
Is everything a literal scam now? Companies using this sub to spread their shaky misrepresented projects.
They really did seem to imply they had made another BitNet. "Ok, we quantized Qwen 8B to 1-bit and it's now as good as a 2B model" doesn't have quite the same ring to it.
Revolutionalredstone@reddit
Bonsai's page claims it is a new model trained on 23 trillion tokens.
Makers7886@reddit
Why do people pull shit out of their ass? Like, is it fun or is it laziness? Like you heard something on the street and just parrot it, without even caring to look it up. Sorry, I simply hate misinformation. It's like asking for directions and the person saying with confidence "Yes, I know exactly where that is, go right" while being completely full of shit. Why do people do that? What compels you?
One_Key_8127@reddit
From their own whitepaper.
Revolutionalredstone@reddit
Wow Okay Straight Up
Odd-Ordinary-5922@reddit
I don't understand why you say things with such certainty when the optimization improvements of LLMs have been crazy this past year.
exaknight21@reddit
I saw the info on their website and a video of it performing with AnythingLLM. Its responses are coherent, and the deep research has me blown away.
linumax@reddit
Cool down. Let’s just wait for real test results once it’s out
retireb435@reddit
but when
ketosoy@reddit
I think you’re double-counting the KV cache. TurboQuant works by exploiting kurtosis, Gaussian normalization, and sparsity to store most or all of what matters from 16 bits of information in an average of 4 bits.
So you could, theoretically and fairly practically, use TurboQuant on a 1-bit cache and convert it into a 4-bit representation. But it's pretty obvious why you don't win when you do that.
There are likely to be exploitable patterns for compression in the TurboQuant cache, but it's unlikely to be another 4x compression like TurboQuant itself.
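To make the "16 bits into roughly 4" idea concrete, here is a generic normalize-then-round sketch; this is not TurboQuant's actual pipeline (the comment above only names the ingredients it exploits), just an illustration of a lossy 16-bit to 4-bit KV scheme:

```python
import numpy as np

# Generic 4-bit KV-cache quantization sketch: per-channel scale, round to
# 16 signed levels, keep the scales in higher precision. NOT TurboQuant's
# actual algorithm, just an illustration of 16-bit -> ~4-bit compression.
def quantize_4bit(kv: np.ndarray):
    """kv: (tokens, channels) float tensor from the K or V cache."""
    scale = np.abs(kv).max(axis=0, keepdims=True) / 7.0   # per-channel scale
    q = np.clip(np.round(kv / scale), -8, 7).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    return q.astype(np.float32) * scale

kv = np.random.randn(1024, 512).astype(np.float32)
q, scale = quantize_4bit(kv)
print("mean abs error:", np.abs(dequantize(q, scale) - kv).mean())  # lossy, but small
```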
ambient_temp_xeno@reddit
I'm not sure how I ended up in a 1.25 bit model quant timeline. I had chest pains the night before.
Due_Net_3342@reddit
no