When do you think TurboQuant will get a proper release and be adopted by everyone?
Posted by Crystalagent47@reddit | LocalLLaMA | 32 comments
The gains when using an asymmetric setup on K and V are quite huge.
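For reference, a minimal llama-cpp-python sketch of what an asymmetric K/V cache setup looks like in practice. TQ's own cache types only exist in forks, so this uses the stock q8_0/q4_0 types to illustrate the idea; the model path is a placeholder.

```python
import llama_cpp

llm = llama_cpp.Llama(
    model_path="some-model-Q4_K_M.gguf",   # placeholder path
    n_ctx=32768,
    flash_attn=True,                        # a quantized V cache needs flash attention
    type_k=llama_cpp.GGML_TYPE_Q8_0,        # keys: 8-bit
    type_v=llama_cpp.GGML_TYPE_Q4_0,        # values: 4-bit (asymmetric vs. K)
)

out = llm("Explain KV cache quantization in one sentence.", max_tokens=64)
print(out["choices"][0]["text"])
```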
YehowaH@reddit
Ada 40xx and Ampere 30xx still have a problem with the implementation. Tom, the great mind behind the best fork, is working on getting it stable for these generations too. We'll see if it gets a big fix. However, the current implementations have issues with big contexts (>100k) and lose tg exponentially: from a possible 85 tg (f16/f16) down to only 24 tg at 130k context with Qwen3.6 35B A3B Q4 nl and the q8/tq4 combo. Fingers crossed there will be a solution soon.
Velocita84@reddit
Atom_101@reddit
Bruh you are like the number 1 TQ hater on this sub.
Velocita84@reddit
Negative-Web8619@reddit
I wanted to see your other posts and maybe a reason, but you're hiding them 😔
JamesEvoAI@reddit
I'm convinced the only reason this gained as much attention as it did was because of the name
EffectiveCeilingFan@reddit
Velocita is the only news source I trust 🫡
dinerburgeryum@reddit
Listen if it's good enough for CeilingFan it's good enough for me. 🫡
DefNattyBoii@reddit
I'm more interested in sub-4-bit weight quants to squeeze more out of our VRAM.
draconic_tongue@reddit
On which model? I tried Qwen 3.6 dense and MoE and the savings between q8/q4 and q8/tq4 are minuscule.
pmttyji@reddit
Don't know.
llama.cpp links related to TurboQuant are collected here to track progress.
Jatilq@reddit
I think I asked an AI this question a couple of months ago and tried KV cache quantization. I'd been mixing 2x 3060s with an AMD 6900 XT, so this could work better for others with a full CUDA or ROCm setup.
"What are some Turboquant alternatives that could be better"
If you are looking for alternatives to TurboQuant, the options generally fall into two categories: high-speed research-grade methods for KV cache compression or established production standards for weight quantization.
1. Research-Grade Alternatives (KV Cache Focus)
Since TurboQuant is specifically optimized for KV cache memory during inference, your closest alternatives are other methods that target memory bottlenecks in long-context tasks.
2. High-Performance Inference Engines
If your interest in TurboQuant is about maximizing "tokens per second" on high-parameter models like Qwen 3.5 or DeepSeek-R1, these engines often provide better alternatives to custom quantization scripts:
3. Established Standard Alternatives
For general-purpose use where compatibility and stability are more important than experimental KV cache compression:
Summary Comparison
Dany0@reddit
Keep your clanker discussions to yourself. This is like showing your genitals to other people
We don't care, we're not amazed, and you're making us feel uncomfortable and wasting our time
Jatilq@reddit
This is funny. It suggests an insecurity about the size of your "genitals". I was trying to help, so the "you're making us feel uncomfortable" maybe goes back to that insecurity you confess you have. Don't worry, your partner was never laughing at you, but with you.
Dany0@reddit
Did you ask a clanker to come up with a good comeback? Might want to find a burns & comebacks LoRA and give it another shot. Maybe you'll get a pass@5
Jatilq@reddit
I wish I could say there was some noble reason for posting something like this, but there isn't. I had some real health scares this week, and now I'm more in a "fuck it / fuck you" attitude.
Let's break this whole discussion down to its core: motivations. Your issue is that I was trying to help someone. You might not like how I was trying to do it, but that was the overall motivation.
For some strange reason, you mentioned genitals, and it's making people uncomfortable. See "motivations" above. This says more about the people who have a problem with it than it does about me. I picture those people as the type who jump out of the bushes to complain when you give a homeless person money.
I'm in my 50s, and for the most part, I've played a young man's game. What's that game? Not responding if you know the masses will have a problem with it. That comes from insecurity. I know people will downvote anything related to AI helping, but you can revisit my motivations.
Just in case you don't understand the young man's game, it's insecurity. Why else would you or others have a problem with me trying to help?
The amusing part is that you think anything you say has any real weight. It doesn't, because it's not rooted in anything remotely close to trying to help the OP—it's only there to feed your insecurities.
Dany0@reddit
No one's gonna read that bro, but thanks for helping us poison bad actors training on reddit data
randomfoo2@reddit
I've tested all of these btw in non-production code. I've found HIGGS to be the best in terms of quality (especially paired with some other minimization techniques that can be stacked); however, I've been unable to get it past ~50% prefill/decode speed. I do have something to announce soon that I think should be a big deal on the KV cache front, something faster and better than the current TurboQuant implementations.
_hephaestus@reddit
Fwiw it’s been in oMLX for a while now. Not really noticing speed/memory gains but haven’t done a thorough analysis
insanemal@reddit
Isn't TurboQuant in 0.20.0?
Middle_Bullfrog_6173@reddit
Are there any benchmarks now that it's out there? And I don't mean speed, I've seen those.
inky_wolf@reddit
It is, but hybrid mamba models aren't supported.
stoppableDissolution@reddit
Likely never, because even q8 context quantization hurts the models very big time.
edsonmedina@reddit
How come? Do you have a source?
stoppableDissolution@reddit
https://github.com/ggml-org/llama.cpp/pull/21038#issuecomment-4150413357
SexyAlienHotTubWater@reddit
That shows very minimal loss to the model when you apply TurboQuant Q8. 37.9% vs. 37.1% - noticeable, but not "very big time"
a_beautiful_rhind@reddit
That test doesn't repeat for other models. Everyone took it at face value to confirm their beliefs.
dsanft@reddit
Yup.
The real win is activation rotation to minimise quantisation error for high kurtosis tensors. You don't need low-bit TQ for that. It will actually make Q8 kv cache precision feasible.
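For anyone unfamiliar with the idea, here's a rough NumPy sketch of why rotating activations helps quantize high-kurtosis (outlier-heavy) tensors. This is the generic Hadamard-rotation trick, not TQ's or llama.cpp's actual implementation, and the sizes and scales are made up for illustration.

```python
import numpy as np
from scipy.linalg import hadamard

def int8_roundtrip(x):
    # symmetric per-tensor int8 quantization, then dequantize
    scale = np.abs(x).max() / 127.0
    return np.clip(np.round(x / scale), -127, 127) * scale

rng = np.random.default_rng(0)
d = 128
x = rng.standard_normal((1024, d))
x[:, :4] *= 20.0  # a few outlier channels -> high kurtosis, large quantization step

# orthonormal Hadamard rotation spreads the outliers across all channels
H = hadamard(d) / np.sqrt(d)

err_plain = np.abs(int8_roundtrip(x) - x).mean()
err_rot = np.abs(int8_roundtrip(x @ H) @ H.T - x).mean()
print(f"mean abs error without rotation: {err_plain:.4f}")
print(f"mean abs error with rotation:    {err_rot:.4f}")
```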
Mashic@reddit
Apparently there is friction and the llama.cpp devs don't like it. I don't think they want to implement it in the first place.
Dany0@reddit
Yes, because it's not so simple, and ggerganov made the right call. Besides, attn-rot, which brings 80% of the benefits with almost none of the downsides, has already been merged and is automatically on.
I think KVTC/DeepSeek V4-style cache compression has a higher likelihood of getting merged, to be perfectly honest. But it'll be a few months.
TQ forks exist for those that want it!
DigRealistic2977@reddit
I guess nobody will know... it seems there are many versions coming out that claim to be better than TurboQuant, and it's a wild west out there for KV cache right now, with so many claims of this being better than that, etc.
So it seems understandable that they won't stick to it right away.
Crystalagent47@reddit (OP)
Makes sense, great take