KV cache compression on Qwen 3.6 — 1M context: 10.7GB → 6.9GB (V: 3.5× smaller)
Posted by Spirited-Toe-3988@reddit | LocalLLaMA | View on Reddit | 12 comments
Quick demo of KV cache compression on Qwen 3.6 at 1M context.
In this run:
KV cache: 10.74 GB → 6.92 GB
V cache: 5.37 GB → 1.55 GB (~3.5× reduction)
Still seeing near-zero perplexity (PPL) change in early tests (3 seeds), but I'm focusing mainly on memory + long-context behavior for now.
Curious how people think about structured compression vs eviction approaches for KV cache.
MmmmMorphine@reddit
I mean... You gotta actually tell us what's going on. What type of compression, impact on speed, etc.
m3kw@reddit
Lossy compression
MmmmMorphine@reddit
Gee thanks
Spirited-Toe-3988@reddit (OP)
here’s what’s going on in the video (trying not to dump the whole paper 😅)
Model is Qwen 3.6 35B (hybrid MoE). Only the standard attention layers have a KV cache, so that’s what’s getting compressed — the linear attention layers are untouched.
Method-wise it’s KV cache quantization (not rank reduction or token eviction). For V: rotate into a data-driven basis, keep a few important dims in FP16, quantize the rest (3 bits). K is similar but I left it FP16 in this demo. No retraining, just a forward hook on HF.
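The V-side scheme described above can be sketched roughly like this (assumptions: SVD as the data-driven basis and per-dimension uniform quantization; the thread doesn't specify the paper's exact choices, so treat this as an illustration, not fraQtl's implementation):

```python
import numpy as np

def compress_v(V, n_keep=8, bits=3):
    """Rotate V into a data-driven basis, keep the top dims in FP16,
    uniformly quantize the rest to `bits` bits. Illustrative sketch only."""
    # Basis from SVD of V itself (assumption: a real system would likely
    # derive the basis from a calibration set, not the live cache)
    _, _, Vt = np.linalg.svd(V, full_matrices=False)
    R = Vt.T                                  # columns = principal directions
    Z = V @ R                                 # rotated cache
    keep = Z[:, :n_keep].astype(np.float16)   # "important" dims stay FP16
    rest = Z[:, n_keep:]
    lo = rest.min(axis=0)
    scale = (rest.max(axis=0) - lo) / (2**bits - 1) + 1e-12
    q = np.round((rest - lo) / scale).astype(np.uint8)  # 3-bit codes
    return keep, q, lo, scale, R

def decompress_v(keep, q, lo, scale, R):
    rest = q.astype(np.float32) * scale + lo
    Z = np.concatenate([keep.astype(np.float32), rest], axis=1)
    return Z @ R.T                            # rotate back to original space
```

The point of the rotation is that after moving into a basis where a few directions carry most of the energy, the aggressive 3-bit quantization only touches the low-energy remainder.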
What you’re seeing in the clip:
V: 5.37 GB → 1.55 GB (~3.5×)
Total KV: 10.74 GB → 6.92 GB (~1.6×) since K is untouched here
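Quick arithmetic check that these figures hang together (no assumptions beyond the numbers in the comment):

```python
# Implied average bits/element for the V cache, and the total-KV ratio
# given that K stays FP16 in this run.
fp16_bits = 16
v_before, v_after = 5.37, 1.55              # GB, from the post
avg_bits = fp16_bits * v_after / v_before   # effective bits per element
kv_ratio = 10.74 / 6.92                     # overall KV shrink factor
print(round(avg_bits, 1), round(kv_ratio, 2))
```

~4.6 average bits/element is consistent with "a few dims in FP16, the rest at 3 bits", and the ~1.55× total matches V shrinking 3.5× while K is left alone.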
Speed-wise: haven’t benchmarked properly yet — this is still a PyTorch/HF wrapper, so current win is memory, not latency.
On quality: the generation in the video is fine, but I've mainly been checking long-context retrieval. Seeing ~98.5% on NIAH vs ~97% for FP16 on smaller models (Mistral 7B, Qwen 2.5 3B). Haven't run the full harness on this 35B yet.
Still early — mostly trying to understand what holds up and what breaks.
There’s a short writeup on arXiv (“Quantization Dominates Rank Reduction for KV-Cache Compression”) if anyone wants more detail.
FourSquash@reddit
Can you post this comment again, but with more emdashes?
superdariom@reddit
I thought qwen 3.6 was limited to 256k context?
qwen_next_gguf_when@reddit
Loading it with ctx 1024000 doesn't mean you can actually use it in full. It will crash when you load a genuinely big context.
jack-in-the-sack@reddit
I thought you needed tens to hundreds of GB for 1M context... I must have been living under a rock.
No-Refrigerator-1672@reddit
I'm running Qwen 3.5 35B with 256k context on 2x20GB GPUs, so 40 total; and that's with unquantized FP16 KV cache, and like 5GB are left unallocated. Their new MoEs are crazy efficient on memory.
suicidaleggroll@reddit
Depends on the model. Qwen is already very small, others not so much (MiniMax is 240 GB for 1M for example)
TheQuantumPhysicist@reddit
What software are you using?
Spirited-Toe-3988@reddit (OP)
fraQtl — our KV cache compression library. Runs as a forward hook on top of HuggingFace transformers, pure PyTorch, no custom kernels. Works on any HF causal LM.
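For anyone wondering what "forward hook" means concretely: below is the generic PyTorch mechanism, not fraQtl's actual API (which isn't shown in this thread). A hook registered on a module can rewrite its output in flight, here simulating a 3-bit quantize/dequantize round trip on a stand-in layer:

```python
import torch
from torch import nn

def fake_quant_hook(module, inputs, output, bits=3):
    # Uniformly quantize the module's output to 2**bits levels,
    # then dequantize; the returned tensor replaces the real output.
    lo, hi = output.min(), output.max()
    scale = (hi - lo) / (2**bits - 1)
    q = torch.round((output - lo) / scale)
    return q * scale + lo

layer = nn.Linear(16, 16)
handle = layer.register_forward_hook(fake_quant_hook)
x = torch.randn(4, 16)
y = layer(x)        # output now passes through the fake-quant hook
handle.remove()     # detach the hook when done
```

The appeal of this approach is exactly what OP says: it sits on top of any HF causal LM with no retraining and no custom kernels, at the cost of Python-side overhead.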