TurboQuant - Extreme KV Cache Quantization · ggml-org/llama.cpp · Discussion #20969

Posted by pmttyji@reddit | LocalLLaMA | 23 comments

14+ independent validators now, across Metal, CUDA, HIP, Vulkan, and MLX: Apple Silicon, NVIDIA (4090, 5090, H100, A100, V100, 1080 Ti), and AMD (RX 9070 XT, RX 6600), from M1 to Blackwell.
This is what open source research looks like: the data converges.

- u/Pidtom

This is an all-in-one thread collecting the discussions and benchmarks on TurboQuant.
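For context, mainline llama.cpp already supports quantized KV caches via the `--cache-type-k` and `--cache-type-v` flags; a minimal sketch of using them is below. This shows the existing quantized cache types (e.g. `q8_0`), not TurboQuant itself, and the model path is a placeholder:

```shell
# Quantized KV cache with mainline llama.cpp (existing q8_0 cache types,
# not TurboQuant): q8_0 for both K and V roughly halves KV cache memory
# versus the default f16.
# Note: quantizing the V cache generally requires flash attention to be
# enabled; check `llama-server --help` for the flag on your build.
llama-server \
  -m ./models/your-model.gguf \
  --cache-type-k q8_0 \
  --cache-type-v q8_0 \
  -c 32768
```

Lower-bit cache types (e.g. `q4_0`) trade more quality for memory; the benchmarks linked in this thread are the place to check how far that trade can be pushed.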