Can we already use Google's TurboQuant (TQ) for KV Cache in llama-server? Or are we waiting for a PR?
Posted by DjsantiX@reddit | LocalLLaMA | 20 comments
Hey everyone,
Ever since the day Google announced TurboQuant, I've been following the news about its extreme compression capabilities without noticeable quality degradation. I see it mentioned constantly on this sub, but despite all the discussions, I'm honestly still a bit confused: is it actually applicable for us right now? And if so, how?
I recently saw an article/post where someone applied this TQ quantization directly to the model weights. They managed to get Qwen3.5-27B running at near-Q4_0 quality, making it about 10% smaller, which finally allowed it to fit comfortably on a 16GB card (specifically an RTX 5060 Ti). This is huge for us with consumer GPUs.
However, since TurboQuant was initially heavily pitched for its efficiency with context and memory, my main question is about the KV Cache.
As we know, context length is the real VRAM killer. So my doubts are:
- Can we currently apply TQ quantization to the KV cache when using llama-server (llama.cpp)?
- If yes, how do we enable it? Is there already a CLI flag similar to --cache-type-k q4_0 / --cache-type-v q8_0?
- Or is this strictly limited to model weights right now, and we are still waiting for an official PR/release from the llama.cpp team to implement TQ for the KV cache?
I'd love to hear if anyone has tested this or knows the current development status. Thanks!
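For context, here's roughly how I'm already shrinking the cache with mainline llama-server today (model path, context size, and port are placeholders, and exact flag spellings can vary between builds):

```shell
# Mainline llama.cpp KV cache quantization (no TurboQuant types).
# Quantizing the V cache generally requires flash attention to be enabled.
./llama-server -m ./model.gguf \
  --ctx-size 65536 \
  -fa \
  --cache-type-k q4_0 \
  --cache-type-v q4_0
```

What I'm asking is whether a TurboQuant-style cache type exists or is planned on top of this.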
Suspicious-Talk-5703@reddit
I'd be glad to finally find people who could test it. It's my own fork; it combines approaches from several other forks with a huge helping of self-research. I optimized it for an unconventional setup (2× RTX 2060 12GB, asymmetric PCIe), but according to my benchmarks it's very solid, and as far as I know it's currently the only fork that uses separate algorithms for K and V (asymmetric KV quantization).
https://github.com/LL4nc33/llama-tq
Pitpeaches@reddit
There's a fork that has it. You set it with -kvalue tsomething and -vvalue tsomething.
I'm using it on a 3090 to have 256k context. Works well
send-moobs-pls@reddit
Crazy how every time Google does something, it gets hyped beyond belief by people who have never used it and likely don't even do much direct work with LLMs, and then everyone slowly stops talking about it.
unjustifiably_angry@reddit
Turboquant is a scam and Google's numbers were based on fantasy-land worst-case vs best-case
Velocita84@reddit
This is what GGerganov thinks about the only turboquant PR left open, so i don't think it will be implemented any time soon. Hopefully in a month's time people will forget about it and stop asking for turboquant like a dog begging for boiling water on the stove
_underlines_@reddit
There's a windows compiled fork of llama.cpp / server somewhere on github I loaded.
Tests with sparse Qwen3.6 35B yielded almost no benefit; to my understanding, the Qwen3.6 sparse architecture keeps the KV cache fairly small even at large context lengths.
putrasherni@reddit
TurboQuant does not help retail GPU setups. Yeah, great, I have 32GB, my model takes up 24GB, and I have 8GB left for my KV cache.
TurboQuant helps multi-GPU setups that host concurrent sessions on an LLM.
Your excitement is unwarranted
Instead, hope for dflash at large context sizes, speculative decoding, and MTP.
DefNattyBoii@reddit
It does help, by increasing the accuracy of the V cache (and the K cache, if you're willing to quant that too) when quanted to ~4 bits, which on large (256k+) contexts can be huge.
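Back-of-envelope math on why the bit width matters at long context (the layer count and head dimensions below are illustrative assumptions for a 30B-class dense model, not measurements of any real model):

```python
# Rough KV cache sizing at different element widths.
def kv_cache_bytes(seq_len, n_layers=48, n_kv_heads=8, head_dim=128,
                   bits_per_elem=16):
    # K and V each store n_layers * n_kv_heads * head_dim values per token.
    values_per_token = 2 * n_layers * n_kv_heads * head_dim
    return seq_len * values_per_token * bits_per_elem // 8

for bits, label in [(16, "f16"), (8, "q8-ish"), (4, "q4-ish"), (3, "q3-ish")]:
    gib = kv_cache_bytes(256_000, bits_per_elem=bits) / 2**30
    print(f"{label:>7}: {gib:.1f} GiB at 256k context")
```

Even going from an 8-bit to a ~4-bit cache frees over 10 GiB at 256k in this sketch.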
MrMisterShin@reddit
From what I understand, it's not fully here yet for llama.cpp. Partial implementations exist (I think this is called attn-rot), but not the actual thing (TurboQuant).
Equivalent-Repair488@reddit
Is TheTom's implementation not it either?
Gemma 4 31B fit 30+k tokens of context in LM Studio, but 160+k on TheTom's llama.cpp-turboquant. I've only tried it a little, and I can't send a second prompt; still to fix.
DefNattyBoii@reddit
I tried TheTom's fork but hit some regression bugs which I couldn't debug. It was working well otherwise, though! Right now I'm just using q8_0 and q4_1 for most models.
Kyuiki@reddit
I didn’t see this at all. The only way I could see this being true is if in LM Studio you weren’t quantizing the KV cache at all and then suddenly switched to turbo4 in llama.cpp.
TurboQuant, from my understanding, only provides a q3_0-level quant for the KV cache with no impact on performance or quality. That’s not a 30k -> 160k difference going from q4_0 -> q3_0, and q4_0 barely had a noticeable impact on performance and quality already.
Equivalent-Repair488@reddit
Yes I wasn't
Middle_Bullfrog_6173@reddit
Yes, and attention rotation is enabled by default when quantizing the KV cache.
MachineZer0@reddit
https://github.com/richginsberg/llama-cpp-turboquant
It’s a fork of https://github.com/TheTom/llama-cpp-turboquant but with two weeks of commits from https://github.com/ggml-org/llama.cpp
Add: -ctk turbo4 -ctv turbo4
Thump604@reddit
It’s barely worth it - the redeeming parts have been implemented in various engines.
Kyuiki@reddit
There is a TurboQuant fork you can build a llama.cpp image from. Once you build it, it runs just like llama.cpp, and you use the cache types turbo3 and turbo4.
I'll be honest in saying it didn't do anything really noticeable for me, because I was already using q4_0 for my cache and had not noticed any degradation in the generated content.
Enabling TurboQuant, based on my research and understanding (which could be flawed!), provides something similar to q3_0 with no impact on performance compared to q8_0.
But since I didn’t see noticeable impact using q4_0 all it bought me was a few thousand extra context length before offloading to RAM. So not a huge win but it helps? I also don’t get the latest and greatest llama-cpp features without rebuilding the Turboquant image.
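For anyone who wants to try it, the build is the standard llama.cpp CMake flow (repo URL is the fork linked elsewhere in this thread; the exact CMake options are assumptions and may differ for the fork):

```shell
# Sketch of building the fork from source and running with its cache types.
git clone https://github.com/richginsberg/llama-cpp-turboquant
cd llama-cpp-turboquant
cmake -B build -DGGML_CUDA=ON           # drop the CUDA flag for CPU-only
cmake --build build --config Release -j
# model.gguf is a placeholder; turbo3/turbo4 are the fork's cache types.
./build/bin/llama-server -m ./model.gguf -ctk turbo4 -ctv turbo4
```

This is also why you fall behind on mainline features: every upstream change means rebuilding the image from the fork.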
MaxKruse96@reddit
turboquant = noob bait.
ridablellama@reddit
Last time I looked into the discussion on ik_llama.cpp, it was no better than q8_0 and it's a nothingburger. Just go read the comments and you'll get a sense of the current state of affairs regarding TurboQuant. Maybe it's changed since then, I dunno.
a_beautiful_rhind@reddit
We're waiting for people to get a clue.