TurboQuant seems to work very well on Gemma 4 — and separately, per-layer outlier-aware K quantization is beating current public fork results on Qwen PPL

Posted by Fearless-Wear8100@reddit | LocalLLaMA | View on Reddit | 15 comments

I’ve been experimenting with TurboQuant KV cache quantization in llama.cpp (CPU + Metal) on Gemma 4 26B A4B-it Q4_K_M on an Apple M4 Pro 48GB, and the results look surprisingly strong.

Gemma 4 findings

On Gemma 4, QJL seems to work well, and FWHT as a structured rotation substitute also looks like a good fit for the large attention heads (dk=256/512).

My benchmark results:

So on this setup, \~3.1 bits per K channel gets near-zero accuracy loss with a meaningful long-context speedup.

What’s also interesting is that this looks better than the public Gemma 4 fork results I’ve seen so far. In the linked 512-d Gemma 4 experiments, 512-WHT + global norm reaches 31/65, while the TBQP3 512 + QJL variants land around 23–28/65. That’s a very different outcome from what I’m seeing with the Metal implementation above.

Also worth noting: I’m not using Gemma 4 PPL right now, because PPL seems unreliable / broken there in llama.cpp at the moment, so for Gemma 4 I’m judging mostly from direct quality evals, NIAH, and long-context speed.

Separate result: Qwen PPL

Separately from the Gemma 4 work, I also have a per-layer / per-channel outlier-aware adaptive K quantization setup for Qwen2.5 / Qwen3.

Those results seem to beat current public fork-style implementations on PPL at comparable bpv:

That makes me think a lot of the gap is in per-layer allocation / calibration / outlier handling, not just in the base quantizer.

I also did some per-layer variance analysis on Gemma 4, and the spread differs a lot across layers, so there’s probably still room to improve further with mixed per-layer K types instead of one fixed recipe everywhere.
Gemma 4 benchmarks / details:

https://github.com/andrei-ace/llama.cpp/tree/turboquant-gemma/benches/tq-metal

Qwen per-layer / outlier-aware PPL results:

https://github.com/ggml-org/llama.cpp/discussions/21297

Gemma 4 comparison point in the TurboQuant thread:

https://github.com/ggml-org/llama.cpp/discussions/20969#discussioncomment-16450839