When do you think TurboQuant will get a proper release and be adopted by everyone?
Posted by Crystalagent47@reddit | LocalLLaMA | 32 comments
The gains when using an asymmetric setup on K and V are quite huge.
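For reference, a minimal llama-cpp-python sketch of what an asymmetric K/V cache setup looks like in practice. TQ's own cache types only exist in forks, so this uses the stock q8_0/q4_0 types to illustrate the idea; the model path is a placeholder.

```python
import llama_cpp

llm = llama_cpp.Llama(
    model_path="some-model-Q4_K_M.gguf",   # placeholder path
    n_ctx=32768,
    flash_attn=True,                        # a quantized V cache needs flash attention
    type_k=llama_cpp.GGML_TYPE_Q8_0,        # keys: 8-bit
    type_v=llama_cpp.GGML_TYPE_Q4_0,        # values: 4-bit (asymmetric vs. K)
)

out = llm("Explain KV cache quantization in one sentence.", max_tokens=64)
print(out["choices"][0]["text"])
```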
YehowaH@reddit
Ada 40xx and Ampere 30xx still have a problem with the implementation. Tom, the great mind behind the best fork, is working on getting it stable for these generations too. We'll see if it gets a big fix. However, the current implementations have issues with big contexts (>100k) and lose tg exponentially: from a possible 85 tg (f16/f16) down to only 24 tg at 130k context with Qwen3.6 35B A3B Q4 nl and the q8/tq4 combo. Fingers crossed there will be a solution soon.
Velocita84@reddit
Atom_101@reddit
Bruh you are like the number 1 TQ hater on this sub.
Velocita84@reddit
Negative-Web8619@reddit
I wanted to see your other posts and maybe a reason, but you're hiding them 😔
JamesEvoAI@reddit
I'm convinced the only reason this gained as much attention as it did was because of the name
EffectiveCeilingFan@reddit
Velocita is the only news source I trust 🫡
dinerburgeryum@reddit
Listen if it's good enough for CeilingFan it's good enough for me. 🫡
DefNattyBoii@reddit
I'm more interested in sub-4-bit weight quants to squeeze more out of our VRAM.
draconic_tongue@reddit
On which model? I tried Qwen 3.6 dense and MoE and the savings between q8/q4 and q8/tq4 are minuscule.
pmttyji@reddit
Don't know.
llama.cpp links related to TurboQuant are collected here to track progress.
Jatilq@reddit
I think I asked an AI this question a couple of months ago and tried KV cache quantization. I'd been mixing 2x 3060s with an AMD 6900 XT, so this could work better for others with a full CUDA or ROCm setup.
"What are some Turboquant alternatives that could be better"
If you are looking for alternatives to TurboQuant, the options generally fall into two categories: high-speed research-grade methods for KV cache compression or established production standards for weight quantization.
1. Research-Grade Alternatives (KV Cache Focus)
Since TurboQuant is specifically optimized for KV cache memory during inference, your closest alternatives are other methods that target memory bottlenecks in long-context tasks.
2. High-Performance Inference Engines
If your interest in TurboQuant is about maximizing "tokens per second" on high-parameter models like Qwen 3.5 or DeepSeek-R1, these engines often provide better alternatives to custom quantization scripts:
3. Established Standard Alternatives
For general-purpose use where compatibility and stability are more important than experimental KV cache compression:
Summary Comparison
Dany0@reddit
Keep your clanker discussions to yourself. This is like showing your genitals to other people
We don't care, we're not amazed, and you're making us feel uncomfortable and wasting our time
Jatilq@reddit
This is funny. It suggests an insecurity about the size of your "genitals". I was trying to help, so the "you're making us feel uncomfortable" maybe goes back to that insecurity you confess you have. Don't worry, your partner was never laughing at you, but with you.
Dany0@reddit
Did you ask a clanker to come up with a good comeback? Might want to find a burns & comebacks LoRA and give it another shot. Maybe you'll get a pass@5
Jatilq@reddit
I wish I could say there was some noble reason for posting something like this, but there isn't. I had some real health scares this week, and now I'm more in a "fuck it / fuck you" attitude.
Let's break this whole discussion down to its core: motivations. Your issue is that I was trying to help someone. You might not like how I was trying to do it, but that was the overall motivation.
For some strange reason, you mentioned genitals, and it's making people uncomfortable. See "motivations" above. This says more about the people who have a problem with it than it does about me. I picture those people as the type who jump out of the bushes to complain when you give a homeless person money.
I'm in my 50s, and for the most part, I've played a young man's game. What's that game? Not responding if you know the masses will have a problem with it. That comes from insecurity. I know people will downvote anything related to AI helping, but you can revisit my motivations.
Just in case you don't understand the young man's game, it's insecurity. Why else would you or others have a problem with me trying to help?
The amusing part is that you think anything you say has any real weight. It doesn't, because it's not rooted in anything remotely close to trying to help the OP—it's only there to feed your insecurities.
Dany0@reddit
No one's gonna read that bro, but thanks for helping us poison bad actors training on reddit data
randomfoo2@reddit
I've tested all of these btw in non-production code. I've found HIGGS to be the best in terms of quality (especially paired with some other minimization techniques that can be stacked); however, I've been unable to get it past ~50% prefill/decode speed. I do have something to announce soon that I think should be a big deal on the KV cache front, something faster and better than the current TurboQuant implementations.
_hephaestus@reddit
Fwiw it’s been in oMLX for a while now. Not really noticing speed/memory gains but haven’t done a thorough analysis
insanemal@reddit
Isn't TurboQuant in 0.20.0?
Middle_Bullfrog_6173@reddit
Are there any benchmarks now that it's out there? And I don't mean speed, I've seen those.
inky_wolf@reddit
It is, but hybrid mamba models aren't supported.
stoppableDissolution@reddit
Likely never, because even q8 context quantization hurts the models very big time.
edsonmedina@reddit
How come? Do you have a source?
stoppableDissolution@reddit
https://github.com/ggml-org/llama.cpp/pull/21038#issuecomment-4150413357
SexyAlienHotTubWater@reddit
That shows very minimal loss to the model when you apply TurboQuant Q8. 37.9% vs. 37.1% - noticeable, but not "very big time"
a_beautiful_rhind@reddit
That test doesn't repeat for other models. Everyone took it at face value to confirm their beliefs.
dsanft@reddit
Yup.
The real win is activation rotation to minimise quantisation error for high kurtosis tensors. You don't need low-bit TQ for that. It will actually make Q8 kv cache precision feasible.
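For anyone unfamiliar with the idea, here's a rough NumPy sketch of why rotating activations helps quantize high-kurtosis (outlier-heavy) tensors. This is the generic Hadamard-rotation trick, not TQ's or llama.cpp's actual implementation, and the sizes and scales are made up for illustration.

```python
import numpy as np
from scipy.linalg import hadamard

def int8_roundtrip(x):
    # symmetric per-tensor int8 quantization, then dequantize
    scale = np.abs(x).max() / 127.0
    return np.clip(np.round(x / scale), -127, 127) * scale

rng = np.random.default_rng(0)
d = 128
x = rng.standard_normal((1024, d))
x[:, :4] *= 20.0  # a few outlier channels -> high kurtosis, large quantization step

# orthonormal Hadamard rotation spreads the outliers across all channels
H = hadamard(d) / np.sqrt(d)

err_plain = np.abs(int8_roundtrip(x) - x).mean()
err_rot = np.abs(int8_roundtrip(x @ H) @ H.T - x).mean()
print(f"mean abs error without rotation: {err_plain:.4f}")
print(f"mean abs error with rotation:    {err_rot:.4f}")
```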
Mashic@reddit
Apparently there is friction and the llama.cpp devs don't like it. I don't think they want to implement it in the first place.
Dany0@reddit
Yes, because it's not so simple, and ggerganov made the right call. Besides, attn-rot, which brings 80% of the benefits with almost none of the downsides, has already been merged and is automatically on.
I think KVTC/DeepSeek V4-style cache compression has a higher likelihood of getting merged, to be perfectly honest. But it'll be a few months.
TQ forks exist for those that want it!
DigRealistic2977@reddit
I guess nobody will know... it seems there are many versions coming out that claim to be better than TurboQuant, and it's a wild west out there for KV cache right now, with so many claims of this being better than that, etc.
So it seems understandable that they won't stick to it right away.
Crystalagent47@reddit (OP)
Makes sense, great take