Are 1-bit models and TurboQuant the future of OSS? A simulation for Qwen3.5 models.
Posted by GizmoR13@reddit | LocalLLaMA | 84 comments
A simulation of what the Qwen3.5 model family would look like with 1-bit weights and TurboQuant. The table below shows the results; this would be a revolution:
| Model | Parameters | Q4_K_M File (Current) | KV Cache (256K) (Current) | Hypothetical 1-bit Weights | KV Cache 256K with TurboQuant | Hypothetical Total Memory Usage |
|---|---|---|---|---|---|---|
| Qwen3.5-122B-A10B | 122B total / 10B active | 74.99 GB | 81.43 GB | 17.13 GB | 1.07 GB | 18.20 GB |
| Qwen3.5-35B-A3B | 35B total / 3B active | 21.40 GB | 26.77 GB | 4.91 GB | 0.89 GB | 5.81 GB |
| Qwen3.5-27B | 27B | 17.13 GB | 34.31 GB | 3.79 GB | 2.86 GB | 6.65 GB |
| Qwen3.5-9B | 9B | 5.89 GB | 14.48 GB | 1.26 GB | 1.43 GB | 2.69 GB |
| Qwen3.5-4B | 4B | 2.87 GB | 11.46 GB | 0.56 GB | 1.43 GB | 1.99 GB |
| Qwen3.5-2B | 2B | 1.33 GB | 4.55 GB | 0.28 GB | 0.54 GB | 0.82 GB |
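For reference, a rough sketch of the back-of-the-envelope arithmetic behind estimates like these (the post's exact methodology isn't given; the bits-per-weight figures and the layer/head counts below are illustrative assumptions, not Qwen3.5 specs):

```python
# Back-of-the-envelope memory estimates for weights and KV cache.
# All config numbers here are illustrative assumptions, not real model specs.

def weights_gb(n_params: float, bits_per_weight: float) -> float:
    """Weight storage in GB at a given average precision."""
    return n_params * bits_per_weight / 8 / 1e9

def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                context: int, bits_per_elem: float) -> float:
    """KV cache in GB: K and V for every layer, KV head, and position."""
    return 2 * n_layers * n_kv_heads * head_dim * context * bits_per_elem / 8 / 1e9

# Hypothetical 27B dense config: 64 layers, 4 KV heads (GQA), head_dim 128.
print(weights_gb(27e9, 5.0))                  # ~16.9 GB at a Q4_K_M-like ~5 bits/weight
print(weights_gb(27e9, 1.1))                  # ~3.7 GB at ~1.1 bits/weight average
print(kv_cache_gb(64, 4, 128, 262_144, 16))   # ~34.4 GB f16 cache at 256K context
print(kv_cache_gb(64, 4, 128, 262_144, 4))    # ~8.6 GB at a 4-bit cache
```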
No-Refrigerator-1672@reddit
Why stop at 1-bit? Let's go with 0 bit! Who even needs weights at all? Imagine running a model with literally zero vram needed!
bapuc@reddit
This is what i am working on
Silver-Champion-4846@reddit
Where have you reached so far?
bapuc@reddit
GLM 4.7 at 4-bit quant, 2 tokens per second, CPU only. I need at least 50, and I need to scale up until no quant is needed.
Silver-Champion-4846@reddit
What do you mean Quant 4? Can you please explain in detail?
bapuc@reddit
4-bit quantization
https://huggingface.co/unsloth/GLM-4.7-GGUF/tree/main/Q4_0
Silver-Champion-4846@reddit
So, you're trying to go from that one into 1.58bit?
bapuc@reddit
No, I am trying to:
Or
Or
Or
Or
There are many more than those. I'm at experiment 90 or so, have learned so many things, and will continue to experiment now.
pmttyji@reddit
Hope your experiments went/are going well. Please share any updates here.
bapuc@reddit
Still experimenting, nothing new for now. I am currently experimenting with training from scratch / compressing a MoE into a multi-dimensional spectrogram (like the ones you get from audio files).
Modal.com funded me with some startup credits, so things should go faster now
Will make a post in the sub with all the experiments once I get into something truly interesting
pmttyji@reddit
Nice to hear this!
live_love_laugh@reddit
Why the sarcasm? Maybe the wins are overblown a bit and the performance loss underplayed, but I still think the benefits are real and significant.
I'm still surprised that PrismML went for 1-bit and not for 1.58-bit (i.e. ternary) parameters. Intuitively I would think that having both 0 and -1 at your disposal would be a massive win for the expressiveness of the network. But I'm not really educated enough for my intuition to be worth much.
I have seen people talk about how the real world performance of PrismML's Bonsai models is disappointing. But I mean, if the performance loss can be mitigated by adding 15% more parameters then it would still be a net win.
I just wish we would see more companies pour more resources into trying to get 1(.58)-bit models to work. It's not just the memory savings, but also the compute savings from simplifying the matrix multiplications.
Brou1298@reddit
Also confused by the no ternary
Flinchie76@reddit
I think they're literally just partitioning the input activations into two groups based on the weight signs, summing each group separately, and subtracting the two sums. Basically 0 and 1 map to -1 and +1, so you don't need the "zero escape hatch".
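A minimal sketch of that trick, assuming weights restricted to {-1, +1} (the helper name is made up): the dot product needs no multiplications at all, just two group sums and a subtraction.

```python
import numpy as np

# Binary-weight dot product: partition activations by weight sign,
# sum each group, subtract. No multiplications involved.
def binary_dot(x: np.ndarray, w_bits: np.ndarray) -> float:
    """x: float activations; w_bits: 0/1 array mapping to weights -1/+1."""
    return x[w_bits == 1].sum() - x[w_bits == 0].sum()

x = np.random.randn(8).astype(np.float32)
w_bits = np.random.randint(0, 2, size=8)
w = np.where(w_bits == 1, 1.0, -1.0)            # the weights the bits encode
assert np.isclose(binary_dot(x, w_bits), x @ w)
```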
Brou1298@reddit
1.58-bit has 3 possible values (-1, 0, 1); theirs has (0, 1), or (-1, 1), if I'm not mistaken. 1.58-bit does give a lot more capacity, so I'm just expressing confusion at the choice lol
Flinchie76@reddit
Yeah, but it depends on how much information the zero carries in the ternary case relative to the extra 0.58 bits of storage. The matrix ops go from multiplications to sums, and if the zeros are effectively a no-op in many cases and you have essentially 2 subspace partitions rather than 3, then it would make sense.
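For comparison, a ternary sketch of the same idea (again just an illustration): the zero-weight terms contribute nothing to the sum, so they are skipped work paid for with the extra ~0.58 bits per weight.

```python
import numpy as np

# Ternary-weight dot product: weights in {-1, 0, +1}; zero-weight terms
# simply drop out, so only the +/-1 partitions are summed.
def ternary_dot(x: np.ndarray, w: np.ndarray) -> float:
    return x[w == 1].sum() - x[w == -1].sum()

x = np.random.randn(8).astype(np.float32)
w = np.random.choice([-1.0, 0.0, 1.0], size=8)
assert np.isclose(ternary_dot(x, w), x @ w)
```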
No-Refrigerator-1672@reddit
Because there are no real models listed, no real tests run, not even a theoretical proposition on how to quantize to 1-bit without lobotomizing the model. Just some numbers that are completely made up and have nothing behind them. Why would anyone take it seriously?
JayPSec@reddit
The PrismML folks did their benchmarks, and I'm sure we'll get plenty more as the days go by. So far it all seems good and legit. Have you tried it?
live_love_laugh@reddit
I see your point. Some people, me included, just like to fantasize about what could be achieved in the future given what's happening today on the cutting edge. And then those people want to share the excitement they feel about the great potential they see.
So yeah, nothing to take seriously. Just something to either enjoy or ignore.
sonicnerd14@reddit
I think the way to look at it is: what would this do for the quantized versions that are already very coherent at sizes like 3 and 4 bits, maybe even 2-bit? If this is what a 1-bit model can do, then just imagine what a usable-sized model would be able to do with the same optimizations applied.
gigaflops_@reddit
I run all my LLMs at -1 bit quantization and that way they increase the amount of available VRAM on my graphics card
ohgoditsdoddy@reddit
I thought models with ternary weights (~1.58-bit) are possible if they are trained for it (as opposed to being quantized after the fact).
Silver-Champion-4846@reddit
They should be, but there's not much support by companies as of yet
TopChard1274@reddit
Fun at parties \(ϋ)/♩
JsThiago5@reddit
You can simply imagine it running and then type the answer
Constant-Simple-1234@reddit
This is already possible. Just switch from Qwen at 27B to the one at 2B. Seems like thinking is very compressible and can span wide range of sizes. It is just a lossy compression and the loss is real. (Partially joking, at least in tone ;) )
Koalateka@reddit
The kind of mentality that makes science advance...
OXKSA1@reddit
technically you could just use RAM instead of VRAM and use cloud AI models lol
sammcj@reddit
Ideally models would start giving bits back, it's about time
DR4G0NH3ART@reddit
Do it in your head, we might even call it, hmm.. Let us go with Natural intelligence.
dero_name@reddit
> Imagine running a model with literally zero vram needed!
You mean thinking? For myself? Heretic.
Pulselovve@reddit
At some point you reach reasonable physics limits. Weights store information, routines, reasoning patterns, etc. You can squeeze them to a point, but you can't have all human knowledge and thinking patterns (in text at least) compressed into 8 GB...
Poluact@reddit
I don't believe people reason with no information either. We have an immense amount of prior knowledge and reasoning patterns acquired over a lifetime, starting from infancy.
MartiniCommander@reddit
This is exactly like Xvid → x264 → x265 → AV1.
We always say there's a limit but then we find ways to keep going around them.
Feztopia@reddit
You don't need perfect compression. You just need compression that gives you a Qwen3.5-35B-A3B in the size of Qwen3.5-9B Q4 that's still better than the 9B. That would be already progress.
plaintexttrader@reddit
The “very good reasoner without information” is an interesting point, though I think that might be impossible. LLMs reason through chains of thought, which is possible via language training on tons of data, and knowledge is an inseparable “side effect”. You do need knowledge to develop common sense for CoTs. It is not possible to reason that “1 kg of feathers weighs the same as 1 kg of steel” without knowing what a kg, a feather, or steel is.
waruby@reddit
The latest paper from DeepSeek kind of does that, and it is orthogonal to MoE, so it further reduces the number of active parameters required for the same quality of answers from the model.
ai_without_borders@reddit
Yeah, the DeepSeek paper is wild. I was reading some analysis on Zhihu about it, and the interesting context is that this efficiency research isn't just academic for them; it's directly motivated by the chip export restrictions. When you can't buy H100s, you have to squeeze every bit of performance out of what you have. So MoE, low-bit quantization, and KV compression aren't nice-to-haves, they're survival strategies. The fact that these techniques also happen to benefit the local LLM community running on consumer GPUs is kind of a happy accident. Basically, Chinese labs are speed-running efficient inference because they literally have no choice, and we all benefit from it.
Imaginary-Unit-3267@reddit
Tfw someone named "ai_without_borders" accidentally makes a case for AI with borders. :P (since in this case it's the borders that are leading to the restrictions and thus to the improved outcomes for local LLM users)
pointer_to_null@reddit
The borders/restrictions impact is overblown, mostly for political reasons.
Arguably, the export restrictions had limited impact. Chinese firms had numerous proxies in Singapore and other countries where trade enforcement was lax, and the limitations in Nvidia's compliant versions were often circumvented via a combination of Nvidia's own half-assed protections and Chinese tech ingenuity. The restrictions were a hindrance at most, not a hard stop.
The export restrictions are one of many factors applying downward pressure on model scale, driving down the threshold where the ROI on optimization begins to outweigh the benefits of scaling. Once you include power costs (which greatly favor China), memory scarcity, rising hardware costs, and growing public negativity towards datacenters/big silicon/government/etc., you'll find this pressure to be universal.
Worth mentioning that BitNet came from Microsoft and TurboQuant came from Google. This isn't simply China needing smaller, efficient models to survive; everyone wants tiny models. Even closed/hosted AI companies want cheap-yet-powerful LLMs. Without optimizations, OpenAI and Anthropic have zero hope of ever breaking even on inference costs.
Opposite-Swimmer2752@reddit
Lol
oxygen_addiction@reddit
And if engram separation between knowledge/logic ends up working in practice and we get better RAM+VRAM utilization, a bit of all of the above will lead to better local inference.
Ell2509@reddit
But you aren't, in the case of TurboQuant. They have changed the format the data is stored in, without losing accuracy. That's the "polar" stuff, and it relates to the KV cache (conversation history).
For model weights, they are actually building models from the ground up in 1 bit, rather than training a model and then compressing. The claim is that this also changes the form without losing accuracy compared to unquantised models.
The nature of it is changing how you navigate the matrix of relational probabilities, using more concise instructions. Instead of "go left, up 2 floors, forward 4 meters, left, etc. ... until you arrive",
the data is instead stored as (X, Y), where X is direction and Y is distance to the destination. You find the same answer by using a coordinate system that has more possible characters per "digit". Smaller file, same accuracy.
I am not claiming any results on this. I haven't tested it yet, but I'm eager to. However, from what I understand, the claim is indeed that we can keep the same accuracy with less data. Remember, the file is not full of facts, it is just relational statistics about words.
In a sentence, they have learned to navigate meaning without grammar.
Pulselovve@reddit
Yes, they found a way to compress context efficiently, but that's orders of magnitude away from the knowledge that gets compressed into the weights. We are already using similar optimisations in quantisation.
Lorian0x7@reddit
Sure, that's true, but I've been reading this argument for 3 years, and look how far we've come since then. People were already doom-and-gloom about knowledge density 3 years ago; what makes you think now is different? I'm pretty sure in another 3 years we will be having the same discussion.
tmjumper96@reddit
122B models down to 18 GB would be insane, but what about quality degradation with 1-bit? Have you actually tested any of these, or is it just theoretical math?
Savantskie1@reddit
Can you not read? He simulated it, so it's not a test, just theory.
Colecoman1982@reddit
Joke's on him, I'm theorizing that dropping it down to 1-bit will actually IMPROVE performance!
Savantskie1@reddit
At the cost of intelligence and tool use
tmjumper96@reddit
lmaooo you're right
geneusutwerk@reddit
0.5-bit when?
_-_David@reddit
I heard a rumor something like six months ago that Gemma 4 would be a BitNet and push their QAT to the limit. I didn't really put my faith into that, but I do think that is ultimately the better architecture. But of course, there are often esoteric reasons why things don't work the way a curious layperson might think. Training stability? Inference efficiency? Don't know. But it wouldn't surprise me in the least if it eventually turned out that way, and models over 2-bit precision became a relic.
AI_Enhancer@reddit
This aged like fine milk
_-_David@reddit
Why? I said I didn't anticipate it happening; "I didn't really put my faith into that" was what I said, and I never claimed anything beyond that it might "eventually" turn out that way. This seems like you just wanted a "gotcha!" pretty badly.
AI_Enhancer@reddit
Woah, the hell? That wasn't a personal attack, I just thought it would be a funny comment lol. Maybe I should go back to abstaining from commenting on stuff on the internet. Anyways, maybe the proper wording would've been "THAT RUMOR aged like fine milk". Have a good day.
_-_David@reddit
I didn't take it as a personal attack. I just thought it didn't make a whole lot of sense. I totally get the "maybe I will just go back to not engaging people online" feeling. I'm honestly fairly new to all of this. You have a good day as well :)
overand@reddit
Good to see people de-escalating stuff. I got accused of being an LLM or using an LLM (by someone who's been on reddit for 2 years vs my 18 years lol), and it Feels Bad, so it's super nice to see folks trying to be kind to each other.
Hell, even if the "person" one responds to is a bot, at the very least we're modeling positive behavior for other people. (And I guess for bots to train on 🙃)
Soft_Match5737@reddit
The numbers are exciting but one thing the simulation misses is attention compute overhead. Even with 1-bit weights shrinking the model file dramatically, attention is still the bottleneck at long contexts. KV cache compression via TurboQuant helps with memory, but the actual compute for attending over 256K tokens hits a wall regardless of weight precision. The real unlock would be 1-bit weights paired with some form of sparse attention that lets you skip cache entries entirely. That combo would make 122B on consumer hardware genuinely practical, not just technically possible with heroic memory paging.
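The attention point is easy to see with a back-of-the-envelope FLOPs count; a rough sketch below, where every architecture number is an assumption for a hypothetical 122B-A10B-style MoE, not a published Qwen3.5 spec:

```python
# Rough per-token FLOPs during decode, with the KV cache already populated.
# All numbers below are assumptions for a hypothetical 122B-A10B MoE.
active_params = 10e9       # active parameters per token (MoE routing)
n_layers      = 60
n_q_heads     = 40
head_dim      = 128
context       = 262_144    # 256K tokens already in the cache

# Weight matmuls: ~2 FLOPs per active parameter per generated token.
weight_flops = 2 * active_params

# Attention: QK^T plus attention-weighted V, per layer and per query head.
attn_flops = n_layers * n_q_heads * 4 * context * head_dim

print(f"weight FLOPs/token:    {weight_flops:.1e}")   # ~2.0e10
print(f"attention FLOPs/token: {attn_flops:.1e}")     # ~3.2e11 at 256K
```

Under these assumptions, attending over a full 256K cache costs over an order of magnitude more compute per token than the weight matmuls, regardless of weight precision.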
prudant@reddit
Would be really usable at those quant limits o_O. At Q4 with the KV cache at FP8, MoEs suffer a lot of degradation.
rm-rf-rm@reddit
methodology?
cnmoro@reddit
I was just wondering about this today, and it's pretty exciting imo
Background-Initial13@reddit
Wouldn’t this also show that this is the best way to compress information? Like asking these LLMs to recite a book they were trained on.
jaker86@reddit
TurboQuant is great, but it does not apply linearly to cache numbers for models like Qwen3.5; due to their hybrid architecture, some of the cache is not K or V.
Source: running turboquant’d 27b on my 3090
YearnMar10@reddit
But how would NVIDIA earn any money then, if even a Jetson Orin Nano Super could run those models? They'd be ruined!
unbannedfornothing@reddit
Where did you get these numbers for the K/V cache? They are incorrect. Even a 397B model gives `llama_kv_cache: size = 7680.00 MiB (262144 cells, 15 layers, 4/1 seqs), K (f16): 3840.00 MiB, V (f16): 3840.00 MiB` at 256K context for me. And for q8_0: `llama_kv_cache: size = 4080.00 MiB (262144 cells, 15 layers, 4/1 seqs), K (q8_0): 2040.00 MiB, V (q8_0): 2040.00 MiB`
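For what it's worth, the standard KV-cache arithmetic reproduces that f16 log line; the 4 KV heads × 128 head-dim split below is a guess, only their product (512) is pinned down by the log:

```python
# Standard KV-cache size arithmetic, matching the quoted llama.cpp f16 line.
# The 4-head x 128-dim split is an assumption; only KV width = 512 is implied.
cells      = 262_144   # 256K context positions
kv_layers  = 15        # only these layers store K/V in the hybrid architecture
n_kv_heads = 4
head_dim   = 128
bytes_f16  = 2

kv_bytes = cells * kv_layers * n_kv_heads * head_dim * bytes_f16 * 2  # K and V
print(kv_bytes / 2**20, "MiB")   # -> 7680.0 MiB, as in the log line above
```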
TopChard1274@reddit
I wonder which one would run on my M1 iPad Pro with 8 GB RAM. Right now I use Rosetta 4b q6_k for rough translation and Qwen3.5 4b Claude Abliterated q6_k. With the current architecture, around 4.60 GB is the maximum my iPad can even load. Would a 1-bit 27B model potentially work on it? That honestly seems too good to be true. But when did impossible things ever stop anyone from dreaming?
spaceman_@reddit
The 1-bit models which Microsoft (BitNet) and PrismML (Bonsai) developed are NOT 1-bit quantized versions of other models. They are specialized models. You cannot have a 1-bit 8B model that competes against a 4, 8 or 16-bit 8B model and expect the same level of quality.
droans@reddit
I don't think the thought is that a 1-bit model would compete against models with the same number of parameters. It's more a question on how it would compete against an equivalent sized model.
anykeyh@reddit
What's good about 1-bit or 1-trit (-1, 0, 1) models is that they work with additions only. Even better, AND and XOR operations are all you need. No need for floating-point multiplications.
One_Key_8127@reddit
Bonsai is a quantized Qwen3 8B. I wonder whether you can quantize the Qwen3.5 MoE models to 1-bit, but the dense 27B Qwen3.5 should be within PrismML's reach.
a_beautiful_rhind@reddit
Ahh.. ok.. then it's just more fucking grift. Fool me once.
Computationally heavy conversion to low-bit that gets meh performance has been done before. It basically never goes anywhere.
In before a bunch of downvotes saying "n-n-ooo you're wrong this time, its good... :rocket: :rocket:"
I also see why Revolutionalredstone made that mistake. It was a bit misrepresented.
ambient_temp_xeno@reddit
I think the catch could be that they lost 17.3 points on the MMLU-Redux score compared to the original Qwen3 8B.
Aha, but it lets you run a much bigger model than you otherwise could... or does it? Maybe larger models take an even worse hit from the treatment.
a_beautiful_rhind@reddit
Yea no way to know. Super secret proprietary at the moment.
Is everything a literal scam now? Companies using this sub to spread their shaky misrepresented projects.
They really did seem to imply they had made another BitNet. "Ok, we quantized Qwen 8B to 1-bit and it's now as good as a 2B model" doesn't have quite the same ring to it.
Revolutionalredstone@reddit
Bonsai's page claims it is a new model trained on 23 trillion tokens.
Makers7886@reddit
Why do people pull shit out of their ass? Like, is it fun or is it laziness? Like you heard something on the street and just parrot it, without even caring to look it up. Sorry, I simply hate misinformation. It's like asking for directions and the person saying with confidence "Yes, I know exactly where that is, go right" while being completely full of shit. Why do people do that? What compels you?
One_Key_8127@reddit
From their own whitepaper.
Revolutionalredstone@reddit
Wow Okay Straight Up
Odd-Ordinary-5922@reddit
I don't understand why you say things with such certainty when the optimization improvements of LLMs have been crazy this past year.
exaknight21@reddit
I saw the info on their website and a video of it performing with AnythingLLM. Its responses are coherent, and the deep research has me blown away.
linumax@reddit
Cool down. Let’s just wait for real test results once it’s out
retireb435@reddit
but when
ketosoy@reddit
I think you’re double-counting the KV cache. TurboQuant works by exploiting kurtosis, Gaussian normalization, and sparsity to store most or all of what matters from 16 bits of information in an average of 4 bits.
So you could, theoretically and fairly practically, use TurboQuant on a 1-bit cache and convert it into a 4-bit representation. But it's pretty obvious why you don't win when you do that.
There are likely to be exploitable patterns for compression in the TurboQuant cache, but it's unlikely to be another 4x compression like TurboQuant itself.
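To make the "16 bits into roughly 4" idea concrete, here is a generic normalize-then-round sketch; this is not TurboQuant's actual pipeline (the comment above only names the ingredients it exploits), just an illustration of a lossy 16-bit to 4-bit KV scheme:

```python
import numpy as np

# Generic 4-bit KV-cache quantization sketch: per-channel scale, round to
# 16 signed levels, keep the scales in higher precision. NOT TurboQuant's
# actual algorithm, just an illustration of 16-bit -> ~4-bit compression.
def quantize_4bit(kv: np.ndarray):
    """kv: (tokens, channels) float tensor from the K or V cache."""
    scale = np.abs(kv).max(axis=0, keepdims=True) / 7.0   # per-channel scale
    q = np.clip(np.round(kv / scale), -8, 7).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    return q.astype(np.float32) * scale

kv = np.random.randn(1024, 512).astype(np.float32)
q, scale = quantize_4bit(kv)
print("mean abs error:", np.abs(dequantize(q, scale) - kv).mean())  # lossy, but small
```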
ambient_temp_xeno@reddit
I'm not sure how I ended up in a 1.25 bit model quant timeline. I had chest pains the night before.
Due_Net_3342@reddit
no