TurboQuant seems to work very well on Gemma 4 — and separately, per-layer outlier-aware K quantization is beating current public fork results on Qwen PPL
Posted by Fearless-Wear8100@reddit | LocalLLaMA | 15 comments
I’ve been experimenting with TurboQuant KV cache quantization in llama.cpp (CPU + Metal) on Gemma 4 26B A4B-it Q4_K_M on an Apple M4 Pro 48GB, and the results look surprisingly strong.
Gemma 4 findings
On Gemma 4, QJL (quantized Johnson-Lindenstrauss) seems to work well, and the FWHT (fast Walsh-Hadamard transform) as a structured rotation substitute also looks like a good fit for the large attention heads (dk=256/512).
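For anyone unfamiliar with the rotation trick: the idea is that an orthogonal transform smears outlier channels across the whole head before low-bit quantization. Here's a minimal numpy sketch of a normalized FWHT (my illustration of the general technique, not the fork's Metal kernels):

```python
import numpy as np

def fwht(x):
    """Normalized fast Walsh-Hadamard transform.

    Acts as a structured orthogonal rotation in O(n log n), with no
    stored rotation matrix. n must be a power of two (e.g. dk=256/512).
    The normalized transform is orthogonal and self-inverse.
    """
    x = np.array(x, dtype=np.float64, copy=True)
    n = x.shape[-1]
    assert n & (n - 1) == 0, "length must be a power of two"
    x = x.reshape(-1, n)
    h = 1
    while h < n:
        for i in range(0, n, 2 * h):
            a = x[:, i:i + h].copy()
            b = x[:, i + h:i + 2 * h].copy()
            x[:, i:i + h] = a + b
            x[:, i + h:i + 2 * h] = a - b
        h *= 2
    return x / np.sqrt(n)

# A spiky K vector gets smeared across channels, which helps low-bit
# quantizers that hate per-channel outliers.
k = np.zeros(256)
k[3] = 10.0                       # one outlier channel
rot = fwht(k)
print(np.abs(rot).max())          # 0.625: peak is 10/sqrt(256), down from 10.0
print(np.allclose(fwht(rot), k))  # True: self-inverse, so it's losslessly undone
```

The norm is preserved exactly, so nothing is lost by rotating; only the quantization error changes.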
My benchmark results:
- tq3j/q4_0: 37/37 on quality tests, 8/8 on NIAH
- tq2j/q4_0: 36/37, with the only miss being an empty response
- +34% faster than q4_0/q4_0 at 131K context
- TurboQuant overtakes q4_0 from 4K context onward
So on this setup, ~3.1 bits per K channel gets near-zero accuracy loss with a meaningful long-context speedup.
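For context on where bit counts that low come from: a QJL-style scheme stores only the sign bits of a random projection of each key, plus the key's norm, and still recovers attention scores in expectation. A rough numpy sketch of that estimator, as I understand the idea (not the actual tq kernels):

```python
import numpy as np

rng = np.random.default_rng(0)
d, m = 256, 8192                   # head dim, sketch dim

S = rng.standard_normal((m, d))    # shared Gaussian JL projection

def qjl_encode(k):
    """Store 1 bit per sketch row plus the key's norm."""
    return np.sign(S @ k), np.linalg.norm(k)

def qjl_score(q, bits, k_norm):
    """Estimate <q, k> from sign bits.

    For Gaussian rows s: E[<s,q> * sign(<s,k>)] = sqrt(2/pi) * <q,k>/||k||,
    so rescaling by sqrt(pi/2) * ||k|| / m gives an unbiased estimate.
    """
    return np.sqrt(np.pi / 2) / m * k_norm * (S @ q) @ bits

q = rng.standard_normal(d)
k = rng.standard_normal(d)
bits, k_norm = qjl_encode(k)
print(q @ k, qjl_score(q, bits, k_norm))  # close, within sketch noise
```

In a real kernel the sketch dimension is far smaller and fused with the head transform, but the same sign-bit-plus-norm structure is what pushes the effective bits per channel down.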
What’s also interesting is that this looks better than the public Gemma 4 fork results I’ve seen so far. In the linked 512-d Gemma 4 experiments, 512-WHT + global norm reaches 31/65, while the TBQP3 512 + QJL variants land around 23–28/65. That’s a very different outcome from what I’m seeing with the Metal implementation above.
Also worth noting: I’m not using PPL for Gemma 4 right now, because PPL currently seems unreliable / broken for it in llama.cpp, so for Gemma 4 I’m judging mostly from direct quality evals, NIAH, and long-context speed.
Separate result: Qwen PPL
Separately from the Gemma 4 work, I also have a per-layer / per-channel outlier-aware adaptive K quantization setup for Qwen2.5 / Qwen3.
Those results seem to beat current public fork-style implementations on PPL at comparable bpv:
- Qwen2.5 1.5B: 11.514 vs q8_0 11.524 at 6.21 bpv
- Qwen2.5 7B: 8.927 vs q8_0 8.949 at 6.41 bpv
- Qwen3 8B: 10.848, within CI of both f16 and q8_0, at 5.125 bpv
That makes me think a lot of the gap is in per-layer allocation / calibration / outlier handling, not just in the base quantizer.
I also did some per-layer variance analysis on Gemma 4, and the spread differs a lot across layers, so there’s probably still room to improve further with mixed per-layer K types instead of one fixed recipe everywhere.
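The mixed per-layer recipe idea can be sketched as: measure per-channel spread per layer on a calibration run, then spend more bits on layers with heavy outlier channels. Everything below is illustrative (the thresholds, the allocation rule, and the use of "tq2j" as a type name are my toy choices, not what the fork does):

```python
import numpy as np

rng = np.random.default_rng(1)

def make_acts(rng, outlier_scale=1.0):
    """Fake per-layer K activations (tokens x dk); the first few
    channels get blown up to mimic an outlier-heavy layer."""
    a = rng.standard_normal((1024, 256))
    a[:, :4] *= outlier_scale
    return a

# llama.cpp-style tensor names, synthetic data.
layers = {
    "blk.0.attn_k": make_acts(rng),
    "blk.1.attn_k": make_acts(rng, 2.5),
    "blk.2.attn_k": make_acts(rng, 8.0),
    "blk.3.attn_k": make_acts(rng),
}

def pick_ktype(acts, lo=2.0, hi=4.0):
    """Toy allocation rule: spread = max channel std / median channel std."""
    std = acts.std(axis=0)
    spread = std.max() / np.median(std)
    if spread > hi:
        return "q8_0"   # outlier-heavy layer: spend bits
    if spread > lo:
        return "q5_1"
    return "tq2j"       # well-behaved layer: go low-bit

plan = {name: pick_ktype(a) for name, a in layers.items()}
print(plan)
```

With real per-layer variance profiles like the ones I measured, the same ranking step would produce a non-uniform recipe instead of one fixed K type everywhere.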
Gemma 4 benchmarks / details:
https://github.com/andrei-ace/llama.cpp/tree/turboquant-gemma/benches/tq-metal
Qwen per-layer / outlier-aware PPL results:
https://github.com/ggml-org/llama.cpp/discussions/21297
Gemma 4 comparison point in the TurboQuant thread:
https://github.com/ggml-org/llama.cpp/discussions/20969#discussioncomment-16450839
Overall-Somewhere760@reddit
Wait. Has llama.cpp already merged the TurboQuant support? Or are you guys using forks?
akavel@reddit
He's linking a fork in the post.
paperboyg0ld@reddit
TurboQuant is also working very well on other topologies like SSM and Linear RNN for me.
Jemito2A@reddit
This is exactly what we've been waiting for. Running a bio-inspired autonomous system 24/7 on a single 5070 Ti (16GB), and VRAM is our hard limit — we can only load one model at a time.
We use gemma4:e4b for introspective tasks and qwen3.5:9b for everything else. The problem: when Ollama swaps between them, both stay cached in VRAM (~9.6 + 6.6 = 16.2GB). Today that caused a TDR crash (VIDEO_TDR_FAILURE with STATUS_INSUFFICIENT_RESOURCES). We had to build a VRAM watchdog that auto-unloads models when usage exceeds 85%.
TurboQuant KV cache compression would be a game changer for us — not just for longer context, but because a smaller KV cache means less VRAM pressure during model transitions. The +34% speed gain at 131K context is impressive, but honestly, even at 8K context, freeing up 2-3GB of VRAM would let us run two models simultaneously for the first time.
Tracking the llama.cpp discussion and the Ollama PR closely. Any ETA on when this might land in Ollama builds?
Rich_Artist_8327@reddit
Could you do similar testing with professional inferencing engines like vLLM? Would be useful for those who use LLMs in production, not at home.
Fearless-Wear8100@reddit (OP)
I haven’t tested vLLM yet, so I can’t speak to exact engine-specific numbers. But I’d expect the main findings to transfer, because the important part here seems to be the calibration, not llama.cpp itself.
What I found is that calibration is architecture-specific, not weight-specific: the set of “important” / outlier channels is mostly determined by the model architecture, and calibrating on fp16 / q8_0 / q4_k_m versions of the same model gave 96%+ identical channel selections.
So in practice you can probably calibrate once and reuse the same channel ordering / outlier split across quantizations of the same model. The main caveat is that calibration has to be done pre-RoPE — post-RoPE gave garbage because RoPE changes the channel variance structure. And you don’t need much data either: PTB train with around 4096 tokens was already enough.
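The "calibrate once, reuse across quantizations" claim can be illustrated with a toy experiment: rank channels by pre-RoPE variance, then check how much the top-k set overlaps when the same activations are perturbed the way a different weight quantization would perturb them. All numbers below are synthetic, not my actual calibration pipeline:

```python
import numpy as np

rng = np.random.default_rng(42)
dk, tokens, topk = 256, 4096, 32

# Toy pre-RoPE K activations with a heavy-tailed per-channel scale,
# mimicking a model whose "important" channels are architectural.
scales = rng.pareto(2.0, size=dk) + 1.0
acts_fp16 = rng.standard_normal((tokens, dk)) * scales

# A different weight quantization of the same model perturbs the
# activations slightly but leaves the channel structure intact.
acts_q4 = acts_fp16 + 0.05 * rng.standard_normal((tokens, dk))

def outlier_channels(acts, k=topk):
    """Top-k channels by variance (pre-RoPE in the real setup:
    RoPE mixes channel pairs and destroys this variance structure)."""
    return set(np.argsort(acts.var(axis=0))[-k:])

sel_fp16 = outlier_channels(acts_fp16)
sel_q4 = outlier_channels(acts_q4)
overlap = len(sel_fp16 & sel_q4) / topk
print(f"channel-selection overlap: {overlap:.0%}")
```

When the perturbation is small relative to the gaps between channel variances, the selected sets are nearly identical, which matches the 96%+ agreement I saw between fp16 / q8_0 / q4_k_m calibrations.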
Sabin_Stargem@reddit
There is an extensive thread about TurboQuant for llama.cpp, and discussions are starting to move toward how to get TQ tweaked for model families.
https://github.com/ggml-org/llama.cpp/discussions/20969
dsanft@reddit
There's some very weird shit going on in that thread.
fragment_me@reddit
LLMs talking to LLMs.
imp_12189@reddit
It's not in yet.
https://github.com/vllm-project/vllm/issues/38171
Sweet-Argument-7343@reddit
Will Turboquant affect the quality of smaller models like Gemma 26B and 4B as well on local hardware?
Gringe8@reddit
How do you know it's better than the public fork when you aren't running the same benchmark?
GWGSYT@reddit
Well, Google made both.
Fearless-Wear8100@reddit (OP)
Yeah, exactly. That’s why I pushed the quantization pretty aggressively - I had a feeling QJL might actually work on Gemma, unlike what people were seeing on other models.
Sambojin1@reddit
See if you can still make a q4_0_4_4 gguf out of it. It'll be funny as fuck!