KV cache compression on Qwen 3.6 — 1M context: 10.7GB → 6.9GB (V: 3.5× smaller)
Posted by Spirited-Toe-3988@reddit | LocalLLaMA | View on Reddit | 12 comments
Quick demo of KV cache compression on Qwen 3.6 at 1M context.
In this run:
KV cache: 10.74 GB → 6.92 GB
V cache: 5.37 GB → 1.55 GB (~3.5× reduction)
Still seeing near-zero perplexity (PPL) change in early tests (3 seeds), but I'm focusing mainly on memory + long-context behavior for now.
Curious how people think about structured compression vs eviction approaches for KV cache.
MmmmMorphine@reddit
I mean... You gotta actually tell us what's going on. What type of compression, impact on speed, etc.
m3kw@reddit
Lossy compression
MmmmMorphine@reddit
Gee thanks
Spirited-Toe-3988@reddit (OP)
here’s what’s going on in the video (trying not to dump the whole paper 😅)
Model is Qwen 3.6 35B (hybrid MoE). Only the standard attention layers have a KV cache, so that’s what’s getting compressed — the linear attention layers are untouched.
Method-wise it’s KV cache quantization (not rank reduction or token eviction). For V: rotate into a data-driven basis, keep a few important dims in FP16, quantize the rest (3 bits). K is similar but I left it FP16 in this demo. No retraining, just a forward hook on HF.
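The V-side scheme described above can be sketched roughly like this (assumptions: SVD as the data-driven basis and per-dimension uniform quantization; the thread doesn't specify the paper's exact choices, so treat this as an illustration, not fraQtl's implementation):

```python
import numpy as np

def compress_v(V, n_keep=8, bits=3):
    """Rotate V into a data-driven basis, keep the top dims in FP16,
    uniformly quantize the rest to `bits` bits. Illustrative sketch only."""
    # Basis from SVD of V itself (assumption: a real system would likely
    # derive the basis from a calibration set, not the live cache)
    _, _, Vt = np.linalg.svd(V, full_matrices=False)
    R = Vt.T                                  # columns = principal directions
    Z = V @ R                                 # rotated cache
    keep = Z[:, :n_keep].astype(np.float16)   # "important" dims stay FP16
    rest = Z[:, n_keep:]
    lo = rest.min(axis=0)
    scale = (rest.max(axis=0) - lo) / (2**bits - 1) + 1e-12
    q = np.round((rest - lo) / scale).astype(np.uint8)  # 3-bit codes
    return keep, q, lo, scale, R

def decompress_v(keep, q, lo, scale, R):
    rest = q.astype(np.float32) * scale + lo
    Z = np.concatenate([keep.astype(np.float32), rest], axis=1)
    return Z @ R.T                            # rotate back to original space
```

The point of the rotation is that after moving into a basis where a few directions carry most of the energy, the aggressive 3-bit quantization only touches the low-energy remainder.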
What you’re seeing in the clip:
V: 5.37 GB → 1.55 GB (~3.5×)
Total KV: 10.74 GB → 6.92 GB (~1.6×) since K is untouched here
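Quick arithmetic check that these figures hang together (no assumptions beyond the numbers in the comment):

```python
# Implied average bits/element for the V cache, and the total-KV ratio
# given that K stays FP16 in this run.
fp16_bits = 16
v_before, v_after = 5.37, 1.55              # GB, from the post
avg_bits = fp16_bits * v_after / v_before   # effective bits per element
kv_ratio = 10.74 / 6.92                     # overall KV shrink factor
print(round(avg_bits, 1), round(kv_ratio, 2))
```

~4.6 average bits/element is consistent with "a few dims in FP16, the rest at 3 bits", and the ~1.55× total matches V shrinking 3.5× while K is left alone.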
Speed-wise: haven’t benchmarked properly yet — this is still a PyTorch/HF wrapper, so current win is memory, not latency.
On quality: the generation in the video is fine, but I've mainly been checking long-context retrieval. Seeing ~98.5% on NIAH vs ~97% for FP16 on smaller models (Mistral 7B, Qwen 2.5 3B). Haven't run the full harness on this 35B yet.
Still early — mostly trying to understand what holds up and what breaks.
There’s a short writeup on arXiv (“Quantization Dominates Rank Reduction for KV-Cache Compression”) if anyone wants more detail.
FourSquash@reddit
Can you post this comment again, but with more emdashes?
superdariom@reddit
I thought qwen 3.6 was limited to 256k context?
qwen_next_gguf_when@reddit
Loading it with ctx 1024000 doesn't mean you can actually use it in full. It will crash when you load a genuinely big context.
jack-in-the-sack@reddit
I thought you needed tens to hundreds of GB for 1M context... I must have been living under a rock.
No-Refrigerator-1672@reddit
I'm running Qwen 3.5 35B with 256k context on 2x20GB GPUs, so 40 total; and that's with unquantized FP16 KV cache, and like 5GB are left unallocated. Their new MoEs are crazy efficient on memory.
suicidaleggroll@reddit
Depends on the model. Qwen is already very small, others not so much (MiniMax is 240 GB for 1M for example)
TheQuantumPhysicist@reddit
What software are you using?
Spirited-Toe-3988@reddit (OP)
fraQtl — our KV cache compression library. Runs as a forward hook on top of HuggingFace transformers, pure PyTorch, no custom kernels. Works on any HF causal LM.
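For anyone wondering what "forward hook" means concretely: below is the generic PyTorch mechanism, not fraQtl's actual API (which isn't shown in this thread). A hook registered on a module can rewrite its output in flight, here simulating a 3-bit quantize/dequantize round trip on a stand-in layer:

```python
import torch
from torch import nn

def fake_quant_hook(module, inputs, output, bits=3):
    # Uniformly quantize the module's output to 2**bits levels,
    # then dequantize; the returned tensor replaces the real output.
    lo, hi = output.min(), output.max()
    scale = (hi - lo) / (2**bits - 1)
    q = torch.round((output - lo) / scale)
    return q * scale + lo

layer = nn.Linear(16, 16)
handle = layer.register_forward_hook(fake_quant_hook)
x = torch.randn(4, 16)
y = layer(x)        # output now passes through the fake-quant hook
handle.remove()     # detach the hook when done
```

The appeal of this approach is exactly what OP says: it sits on top of any HF causal LM with no retraining and no custom kernels, at the cost of Python-side overhead.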