TurboQuant seems to work very well on Gemma 4 — and separately, per-layer outlier-aware K quantization is beating current public fork results on Qwen PPL
Posted by Fearless-Wear8100@reddit | LocalLLaMA | 15 comments
I’ve been experimenting with TurboQuant KV cache quantization in llama.cpp (CPU + Metal) on Gemma 4 26B A4B-it Q4_K_M on an Apple M4 Pro 48GB, and the results look surprisingly strong.
Gemma 4 findings
On Gemma 4, QJL (quantized Johnson-Lindenstrauss) seems to work well, and the FWHT (fast Walsh-Hadamard transform) as a structured rotation substitute also looks like a good fit for the large attention heads (dk=256/512).
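For anyone unfamiliar with the rotation trick: the idea is that an orthogonal transform smears outlier channels across the whole head before low-bit quantization. Here's a minimal numpy sketch of a normalized FWHT (my illustration of the general technique, not the fork's Metal kernels):

```python
import numpy as np

def fwht(x):
    """Normalized fast Walsh-Hadamard transform.

    Acts as a structured orthogonal rotation in O(n log n), with no
    stored rotation matrix. n must be a power of two (e.g. dk=256/512).
    The normalized transform is orthogonal and self-inverse.
    """
    x = np.array(x, dtype=np.float64, copy=True)
    n = x.shape[-1]
    assert n & (n - 1) == 0, "length must be a power of two"
    x = x.reshape(-1, n)
    h = 1
    while h < n:
        for i in range(0, n, 2 * h):
            a = x[:, i:i + h].copy()
            b = x[:, i + h:i + 2 * h].copy()
            x[:, i:i + h] = a + b
            x[:, i + h:i + 2 * h] = a - b
        h *= 2
    return x / np.sqrt(n)

# A spiky K vector gets smeared across channels, which helps low-bit
# quantizers that hate per-channel outliers.
k = np.zeros(256)
k[3] = 10.0                       # one outlier channel
rot = fwht(k)
print(np.abs(rot).max())          # 0.625: peak is 10/sqrt(256), down from 10.0
print(np.allclose(fwht(rot), k))  # True: self-inverse, so it's losslessly undone
```

The norm is preserved exactly, so nothing is lost by rotating; only the quantization error changes.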
My benchmark results:
- tq3j/q4_0: 37/37 on quality tests, 8/8 on NIAH
- tq2j/q4_0: 36/37, with the only miss being an empty response
- +34% faster than q4_0/q4_0 at 131K context
- TurboQuant overtakes q4_0 from 4K context onward
So on this setup, ~3.1 bits per K channel gets near-zero accuracy loss with a meaningful long-context speedup.
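For context on where bit counts that low come from: a QJL-style scheme stores only the sign bits of a random projection of each key, plus the key's norm, and still recovers attention scores in expectation. A rough numpy sketch of that estimator, as I understand the idea (not the actual tq kernels):

```python
import numpy as np

rng = np.random.default_rng(0)
d, m = 256, 8192                   # head dim, sketch dim

S = rng.standard_normal((m, d))    # shared Gaussian JL projection

def qjl_encode(k):
    """Store 1 bit per sketch row plus the key's norm."""
    return np.sign(S @ k), np.linalg.norm(k)

def qjl_score(q, bits, k_norm):
    """Estimate <q, k> from sign bits.

    For Gaussian rows s: E[<s,q> * sign(<s,k>)] = sqrt(2/pi) * <q,k>/||k||,
    so rescaling by sqrt(pi/2) * ||k|| / m gives an unbiased estimate.
    """
    return np.sqrt(np.pi / 2) / m * k_norm * (S @ q) @ bits

q = rng.standard_normal(d)
k = rng.standard_normal(d)
bits, k_norm = qjl_encode(k)
print(q @ k, qjl_score(q, bits, k_norm))  # close, within sketch noise
```

In a real kernel the sketch dimension is far smaller and fused with the head transform, but the same sign-bit-plus-norm structure is what pushes the effective bits per channel down.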
What’s also interesting is that this looks better than the public Gemma 4 fork results I’ve seen so far. In the linked 512-d Gemma 4 experiments, 512-WHT + global norm reaches 31/65, while the TBQP3 512 + QJL variants land around 23–28/65. That’s a very different outcome from what I’m seeing with the Metal implementation above.
Also worth noting: I’m not using PPL for Gemma 4 right now, because PPL currently seems unreliable / broken for it in llama.cpp, so for Gemma 4 I’m judging mostly from direct quality evals, NIAH, and long-context speed.
Separate result: Qwen PPL
Separately from the Gemma 4 work, I also have a per-layer / per-channel outlier-aware adaptive K quantization setup for Qwen2.5 / Qwen3.
Those results seem to beat current public fork-style implementations on PPL at comparable bpv:
- Qwen2.5 1.5B: 11.514 vs q8_0 11.524 at 6.21 bpv
- Qwen2.5 7B: 8.927 vs q8_0 8.949 at 6.41 bpv
- Qwen3 8B: 10.848, within CI of both f16 and q8_0, at 5.125 bpv
That makes me think a lot of the gap is in per-layer allocation / calibration / outlier handling, not just in the base quantizer.
I also did some per-layer variance analysis on Gemma 4, and the spread differs a lot across layers, so there’s probably still room to improve further with mixed per-layer K types instead of one fixed recipe everywhere.
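The mixed per-layer recipe idea can be sketched as: measure per-channel spread per layer on a calibration run, then spend more bits on layers with heavy outlier channels. Everything below is illustrative (the thresholds, the allocation rule, and the use of "tq2j" as a type name are my toy choices, not what the fork does):

```python
import numpy as np

rng = np.random.default_rng(1)

def make_acts(rng, outlier_scale=1.0):
    """Fake per-layer K activations (tokens x dk); the first few
    channels get blown up to mimic an outlier-heavy layer."""
    a = rng.standard_normal((1024, 256))
    a[:, :4] *= outlier_scale
    return a

# llama.cpp-style tensor names, synthetic data.
layers = {
    "blk.0.attn_k": make_acts(rng),
    "blk.1.attn_k": make_acts(rng, 2.5),
    "blk.2.attn_k": make_acts(rng, 8.0),
    "blk.3.attn_k": make_acts(rng),
}

def pick_ktype(acts, lo=2.0, hi=4.0):
    """Toy allocation rule: spread = max channel std / median channel std."""
    std = acts.std(axis=0)
    spread = std.max() / np.median(std)
    if spread > hi:
        return "q8_0"   # outlier-heavy layer: spend bits
    if spread > lo:
        return "q5_1"
    return "tq2j"       # well-behaved layer: go low-bit

plan = {name: pick_ktype(a) for name, a in layers.items()}
print(plan)
```

With real per-layer variance profiles like the ones I measured, the same ranking step would produce a non-uniform recipe instead of one fixed K type everywhere.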
Gemma 4 benchmarks / details:
https://github.com/andrei-ace/llama.cpp/tree/turboquant-gemma/benches/tq-metal
Qwen per-layer / outlier-aware PPL results:
https://github.com/ggml-org/llama.cpp/discussions/21297
Gemma 4 comparison point in the TurboQuant thread:
https://github.com/ggml-org/llama.cpp/discussions/20969#discussioncomment-16450839
Overall-Somewhere760@reddit
Wait. Has llama.cpp already merged the TurboQuant support? Or are you guys using forks?
akavel@reddit
He's linking a fork in the post.
paperboyg0ld@reddit
TurboQuant is also working very well on other topologies like SSM and Linear RNN for me.
Jemito2A@reddit
This is exactly what we've been waiting for. Running a bio-inspired autonomous system 24/7 on a single 5070 Ti (16GB), and VRAM is our hard limit — we can only load one model at a time.
We use gemma4:e4b for introspective tasks and qwen3.5:9b for everything else. The problem: when Ollama swaps between them, both stay cached in VRAM (~9.6 + 6.6 = 16.2GB). Today that caused a TDR crash (VIDEO_TDR_FAILURE with STATUS_INSUFFICIENT_RESOURCES). We had to build a VRAM watchdog that auto-unloads models when usage exceeds 85%.
TurboQuant KV cache compression would be a game changer for us — not just for longer context, but because a smaller KV cache means less VRAM pressure during model transitions. The +34% speed gain at 131K context is impressive, but honestly, even at 8K context, freeing up 2-3GB of VRAM would let us run two models simultaneously for the first time.
Tracking the llama.cpp discussion and the Ollama PR closely. Any ETA on when this might land in Ollama builds?
Rich_Artist_8327@reddit
Could you do similar testing with professional inferencing engines like vLLM? Would be useful for those who use LLMs in production, not at home.
Fearless-Wear8100@reddit (OP)
I haven’t tested vLLM yet, so I can’t speak to exact engine-specific numbers. But I’d expect the main findings to transfer, because the important part here seems to be the calibration, not llama.cpp itself.
What I found is that calibration is architecture-specific, not weight-specific: the set of “important” / outlier channels is mostly determined by the model architecture, and calibrating on fp16 / q8_0 / q4_k_m versions of the same model gave 96%+ identical channel selections.
So in practice you can probably calibrate once and reuse the same channel ordering / outlier split across quantizations of the same model. The main caveat is that calibration has to be done pre-RoPE — post-RoPE gave garbage because RoPE changes the channel variance structure. And you don’t need much data either: PTB train with around 4096 tokens was already enough.
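The "calibrate once, reuse across quantizations" claim can be illustrated with a toy experiment: rank channels by pre-RoPE variance, then check how much the top-k set overlaps when the same activations are perturbed the way a different weight quantization would perturb them. All numbers below are synthetic, not my actual calibration pipeline:

```python
import numpy as np

rng = np.random.default_rng(42)
dk, tokens, topk = 256, 4096, 32

# Toy pre-RoPE K activations with a heavy-tailed per-channel scale,
# mimicking a model whose "important" channels are architectural.
scales = rng.pareto(2.0, size=dk) + 1.0
acts_fp16 = rng.standard_normal((tokens, dk)) * scales

# A different weight quantization of the same model perturbs the
# activations slightly but leaves the channel structure intact.
acts_q4 = acts_fp16 + 0.05 * rng.standard_normal((tokens, dk))

def outlier_channels(acts, k=topk):
    """Top-k channels by variance (pre-RoPE in the real setup:
    RoPE mixes channel pairs and destroys this variance structure)."""
    return set(np.argsort(acts.var(axis=0))[-k:])

sel_fp16 = outlier_channels(acts_fp16)
sel_q4 = outlier_channels(acts_q4)
overlap = len(sel_fp16 & sel_q4) / topk
print(f"channel-selection overlap: {overlap:.0%}")
```

When the perturbation is small relative to the gaps between channel variances, the selected sets are nearly identical, which matches the 96%+ agreement I saw between fp16 / q8_0 / q4_k_m calibrations.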
Sabin_Stargem@reddit
There is an extensive thread about TurboQuant for llama.cpp, and discussions are starting to move toward how to get TQ tweaked for model families.
https://github.com/ggml-org/llama.cpp/discussions/20969
dsanft@reddit
There's some very weird shit going on in that thread.
fragment_me@reddit
LLMs talking to LLMs.
imp_12189@reddit
It's not in yet.
https://github.com/vllm-project/vllm/issues/38171
Sweet-Argument-7343@reddit
Will Turboquant affect the quality of smaller models like Gemma 26B and 4B as well on local hardware?
Gringe8@reddit
How do you know it's better than the public fork when you aren't running the same benchmark?
GWGSYT@reddit
Well, Google made both.
Fearless-Wear8100@reddit (OP)
Yeah, exactly. That’s why I pushed the quantization pretty aggressively - I had a feeling QJL might actually work on Gemma, unlike what people were seeing on other models.
Sambojin1@reddit
See if you can still make a q4_0_4_4 gguf out of it. It'll be funny as fuck!