FastDMS: 6.4X KV-cache compression running faster than vLLM BF16/FP8
Posted by randomfoo2@reddit | LocalLLaMA | View on Reddit | 16 comments
Last year researchers affiliated with NVIDIA, University of Warsaw, and University of Edinburgh published Dynamic Memory Sparsification (DMS), a KV-cache sparsification technique using learned per-head token eviction, reporting up to 8x KV-cache compression.
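To make the mechanism concrete, here's a minimal sketch of per-head learned eviction as I understand it (my own toy illustration, not the paper's exact formulation or the FastDMS code): a tiny predictor scores each token for each KV head, recent tokens are always kept, and older tokens flagged by the predictor get dropped.

```python
import torch
import torch.nn as nn

class PerHeadEvictionPredictor(nn.Module):
    """Toy per-head eviction predictor: one scalar 'evict?' logit per KV head
    per token. Illustrative only - not the paper's exact parameterization."""

    def __init__(self, hidden_dim: int, num_kv_heads: int):
        super().__init__()
        self.probe = nn.Linear(hidden_dim, num_kv_heads)

    def forward(self, hidden: torch.Tensor) -> torch.Tensor:
        # hidden: [seq, hidden_dim] -> eviction logits [seq, num_kv_heads]
        return self.probe(hidden)


def evict_per_head(keys, values, evict_logits, retention_window=128, threshold=0.0):
    """Keep recent tokens unconditionally; older tokens survive only if their
    head's predictor logit stays at or below the eviction threshold."""
    # keys/values: [num_kv_heads, seq, head_dim]; evict_logits: [seq, num_kv_heads]
    num_heads, seq_len, _ = keys.shape
    recent = torch.arange(seq_len, device=keys.device) >= seq_len - retention_window
    kept_k, kept_v = [], []
    for h in range(num_heads):
        keep = (evict_logits[:, h] <= threshold) | recent
        kept_k.append(keys[h, keep])   # each head ends up with a different live length
        kept_v.append(values[h, keep])
    return kept_k, kept_v
```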
I found the results intriguing enough to build a small reference implementation and trainer to sanity-check the idea. On WikiText-2 with Llama 3.2 1B, I was able to get a rough replication:
| Configuration | PPL | Delta | KLD (nats/tok) | Compression |
|---|---|---|---|---|
| Vanilla Llama-3.2-1B | 9.226 | - | - | 1x |
| DMS (trained, eviction active) | 9.200 | -0.28% | 0.026 | 6.4x |
Training the DMS predictors took about 20 minutes on the PRO 6000 and the compression looked basically lossless. One small problem though: my HF reference implementation ran at about... 18 tok/s.
So, after a few weeks of kernel grinding, I'm pleased to announce FastDMS, an MIT-licensed implementation of DMS with compact KV storage that physically reclaims evicted slots. It is tested on NVIDIA's original Qwen 3 8B DMS checkpoint as well as my own Llama 3.2 1B DMS checkpoint (the original HF reference version and my trainer are in the repo as well): https://github.com/shisa-ai/FastDMS
On my benchmark setup, FastDMS uses 5-8x less KV memory than vLLM BF16 KV at 8K context while also decoding 1.5-2X faster than vLLM.
Compact DMS saves real allocator/device memory, not just theoretical KV bytes. The table below uses ctx_len=8192, gen_len=128. All vLLM baselines use exact-sized token pools matching the workload. KV/stage memory is the cache or cache-plus-staging footprint. vLLM BF16 means dtype=bfloat16 with kv_cache_dtype=auto; vLLM FP8 means kv_cache_dtype=fp8.
| Model / compact-DMS row | c | vLLM BF16 KV → FastDMS KV | BF16 KV saved | vLLM FP8 KV → FastDMS KV | FP8 KV saved | vLLM TQ4 KV → FastDMS KV | TQ4 KV saved |
|---|---|---|---|---|---|---|---|
| Llama-3.2-1B FastDMS default | 1 | 0.312 → 0.056 GiB | 5.6x | 0.156 → 0.056 GiB | 2.8x | 0.142 → 0.056 GiB | 2.5x |
| Llama-3.2-1B FastDMS default | 8 | 2.062 → 0.431 GiB | 4.8x | 1.031 → 0.431 GiB | 2.4x | 0.939 → 0.431 GiB | 2.2x |
| Qwen3-8B FastDMS compact DMS | 1 | 1.406 → 0.184 GiB | 7.6x | 0.703 → 0.184 GiB | 3.8x | — | — |
| Qwen3-8B FastDMS compact DMS | 8 | 9.281 → 1.462 GiB | 6.3x | 4.641 → 1.462 GiB | 3.2x | — | — |
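As a rough sanity check on those numbers, dense KV bytes scale with layers × KV heads × head_dim × tokens × dtype size, and compact DMS keeps roughly 1/compression_ratio of that. The sketch below is my own back-of-envelope approximation with hypothetical Llama-3.2-1B-ish shapes; vLLM's paged allocator rounds up to block granularity, so the measured GiB figures in the table come out somewhat higher.

```python
def dense_kv_bytes(num_layers, num_kv_heads, head_dim, tokens, batch, dtype_bytes=2):
    """Approximate dense KV-cache size: K and V for every layer and KV head.
    Real engines (vLLM) allocate fixed-size pages, so actual usage is higher."""
    return 2 * num_layers * num_kv_heads * head_dim * tokens * batch * dtype_bytes

def compact_dms_kv_bytes(dense_bytes, compression_ratio):
    """Compact DMS keeps roughly 1/compression_ratio of the tokens and physically
    reclaims the rest, so the footprint shrinks about linearly with eviction."""
    return dense_bytes / compression_ratio

# Hypothetical Llama-3.2-1B-ish shapes: 16 layers, 8 KV heads, head_dim 64, BF16.
dense = dense_kv_bytes(16, 8, 64, tokens=8192 + 128, batch=1)
compact = compact_dms_kv_bytes(dense, compression_ratio=6.4)  # headline ratio from above
print(f"dense ≈ {dense / 2**30:.3f} GiB, compact ≈ {compact / 2**30:.3f} GiB")
```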
For those who are curious: yes, this beats out TurboQuant in both speed and memory usage:
| Path | c | Prefill tok/s | Prefill vs BF16 | Decode tok/s | Decode vs BF16 | KV / stage memory | Status |
|---|---|---|---|---|---|---|---|
| vLLM BF16 | 1 | 123098.0 | 1.00x | 459.4 | 1.00x | 0.312 GiB BF16 KV | dense BF16-KV baseline |
| vLLM FP8 | 1 | 119991.3 | 0.97x | 489.4 | 1.07x | 0.156 GiB FP8 KV | dense FP8-KV baseline |
| vLLM TurboQuant 4bit_nc | 1 | 126429.0 | 1.03x | 333.4 | 0.73x | 0.142 GiB TQ4 KV | 4-bit KV baseline |
| FastDMS FP8 compact-DMS default | 1 | 123194.6 | 1.00x | 698.9 | 1.52x | 0.056 GiB | promoted zero-BF16 row |
| FastDMS B46 int4 speed profile | 1 | 121489.9 | 0.99x | 1060.0 | 2.31x | 0.056 GiB + 0.719 GiB int4 shadow | default-off storage-for-speed |
| vLLM BF16 | 8 | 103668.5 | 1.00x | 2357.5 | 1.00x | 2.062 GiB BF16 KV | dense BF16-KV baseline |
| vLLM FP8 | 8 | 102959.5 | 0.99x | 2888.7 | 1.23x | 1.031 GiB FP8 KV | dense FP8-KV baseline |
| vLLM TurboQuant 4bit_nc | 8 | 104409.9 | 1.01x | 1696.0 | 0.72x | 0.939 GiB TQ4 KV | 4-bit KV baseline |
| FastDMS FP8 compact-DMS default | 8 | 105531.7 | 1.02x | 3606.9 | 1.53x | 0.431 GiB | promoted zero-BF16 row |
| FastDMS B25 narrow int4 speed profile | 8 | 104753.7 | 1.01x | 3640.7 | 1.54x | 0.431 GiB + 0.078 GiB int4 shadow | default-off storage-for-speed |
| FastDMS BF16-attention speed control | 8 | 108070.5 | 1.04x | 3745.3 | 1.59x | 0.429 GiB + 0.312 GiB BF16 backing | explicit speed control |
Of course, none of this matters if the compression tanks output quality. In theory, DMS eviction is applied before FP8 quantization and only decides which tokens to keep or drop, so FastDMS compact-DMS should land at roughly the same quality as FP8 quantization alone, but it's still worth double-checking.
This is measured by generating tokens with a compressed KV cache and comparing against an uncompressed reference, token by token. Lower KLD (KL divergence) is better - it means the compressed model's next-token probabilities are closer to the reference. Higher token match is better - it means greedy decoding produces the same output.
How to read the columns:
- KLD vs ref - KL divergence in nats/token between the compressed and reference logits. Measures how much the probability distribution over next tokens shifts due to compression. Lower is better; 0.000 means identical.
- Token match - percentage of greedy-decoded tokens that are identical to the reference. 96.9% means ~2 out of 64 tokens differed.
- Tokens scored - how many decode steps could be compared. Once the candidate produces a different token than the reference, the sequences diverge and later steps aren't comparable. 33/60 means quality metrics only cover the first 33 tokens before divergence - the reported KLD and PPL are over that prefix, not the full generation. A higher ratio means the comparison is more complete.
Test setup: ctx_len=1024, decode_len=16, four prompts (60-64 total decode steps). vLLM rows compare against vLLM BF16 full-KV logits. FastDMS rows compare against FastDMS with eviction disabled (reference window of 1M tokens, effectively keeping the full KV cache).
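For anyone who wants to reproduce this kind of scoring, here's a minimal sketch of how the metrics can be computed (my own illustration of the methodology described above, not the repo's benchmark code):

```python
import torch
import torch.nn.functional as F

def kld_and_match(ref_logits, cand_logits, stop_on_divergence=True):
    """Compare per-step logits from a compressed-KV run against an uncompressed
    reference run over the same prompt. ref_logits / cand_logits: [steps, vocab]."""
    kld, matches, scored = [], 0, 0
    for ref, cand in zip(ref_logits, cand_logits):
        ref_logp = F.log_softmax(ref.float(), dim=-1)
        cand_logp = F.log_softmax(cand.float(), dim=-1)
        # KL(ref || cand) in nats for this decode step
        kld.append(torch.sum(ref_logp.exp() * (ref_logp - cand_logp)).item())
        same = bool(ref.argmax() == cand.argmax())
        matches += int(same)
        scored += 1
        if stop_on_divergence and not same:
            break  # later steps condition on different tokens and aren't comparable
    return {
        "kld_nats_per_token": sum(kld) / max(scored, 1),
        "token_match_pct": 100.0 * matches / max(scored, 1),
        "tokens_scored": scored,
    }
```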
shisa-ai/Llama-3.2-1B-DMS-8x
| Path | Reference | KLD vs ref | Token match | PPL | Tokens scored |
|---|---|---|---|---|---|
| vLLM BF16 full KV | self | 0.000000 | 100.0% | 2.3748 | 60/60 |
| vLLM FP8 KV | vLLM BF16 | 0.005110 | 92.2% | 2.0893 | 33/60 |
| vLLM TurboQuant 4bit_nc | vLLM BF16 | 0.012730 | 76.6% | 1.9606 | 22/60 |
| FastDMS FP8 compact-DMS | FastDMS no-evict | 0.003009 | 96.9% | 2.2810 | 64/64 |
nvidia/Qwen3-8B-DMS-8x
| Path | Reference | KLD vs ref | Token match | PPL | Tokens scored |
|---|---|---|---|---|---|
| vLLM BF16 full KV | self | 0.000000 | 100.0% | 1.6738 | 60/60 |
| vLLM FP8 KV | vLLM BF16 | 0.001042 | 70.3% | 1.1971 | 32/60 |
| vLLM TurboQuant 4bit_nc | vLLM BF16 | 0.006039 | 84.4% | 1.4910 | 45/60 |
| FastDMS FP8 compact-DMS | FastDMS no-evict | 0.005284 | 95.3% | 1.8301 | 64/64 |
FastDMS compact-DMS scores 64/64 tokens on both models - every decode step was comparable to the reference, and the KLD is lower than or comparable to vLLM's own FP8 and TurboQuant compression. Note that PPL values across rows are not directly comparable when Tokens scored differs, because each row's PPL is computed over a different-length prefix.
What's the catch?
So, if this is so darn great, why wasn't everyone using it already? Well, it turns out that if you want to implement this in a production engine like vLLM, you have to do major surgery on it. DMS compact KV touches nearly every serving-engine subsystem (a toy sketch of the compact storage follows the table):
| Subsystem | What changes for DMS |
|---|---|
| PagedAttention / KV memory pool | DMS needs per-layer, per-head variable token counts with partial block deallocation - not standard fixed-page blocks |
| Prefill kernel | Must stream surviving K/V into compact per-layer storage after DMS extraction, rather than writing dense KV pages |
| Decode kernel | Each decode step evaluates per-head keep/evict, manages a sliding retention window, and appends to compact storage |
| Attention scoring | Replaced entirely: split-K grouped compact decode attention over variable-length per-head live spans |
| Scheduler / admission | Must admit requests based on compact KV capacity, not dense full-sequence page count - this is the hardest boundary |
| Prefix caching | DMS eviction is per-sequence and per-head; shared prefix blocks need per-sequence eviction overlays or must be disabled |
| Continuous batching | Memory accounting must reflect actual surviving token count, not logical sequence length |
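To make the storage problem concrete, here's a toy sketch of per-head compact KV storage (my own illustration, not FastDMS internals): each head keeps its own variable-length buffer, and eviction physically compacts it instead of masking slots. Everything downstream in a serving engine - paging, scheduling, prefix caching - assumes fixed dense pages per sequence, which is why retrofitting this is so invasive.

```python
import torch

class CompactHeadKV:
    """Toy per-head compact KV buffer: append survivors, physically compact away
    evicted slots. A real engine needs one of these per layer, per KV head, per
    sequence (FastDMS stores FP8; bf16 here just keeps the sketch simple)."""

    def __init__(self, head_dim: int, capacity: int, dtype=torch.bfloat16):
        self.k = torch.empty(capacity, head_dim, dtype=dtype)
        self.v = torch.empty(capacity, head_dim, dtype=dtype)
        self.live = 0  # surviving token count for this head

    def append(self, k_new: torch.Tensor, v_new: torch.Tensor):
        # The new token's K/V goes into the next free compact slot.
        self.k[self.live] = k_new
        self.v[self.live] = v_new
        self.live += 1

    def evict(self, keep_mask: torch.Tensor):
        # keep_mask: bool [live]. Compact in place so attention reads a contiguous
        # span of the new (smaller) length - slots are actually reclaimed, not masked.
        n_keep = int(keep_mask.sum())
        self.k[:n_keep] = self.k[:self.live][keep_mask]
        self.v[:n_keep] = self.v[:self.live][keep_mask]
        self.live = n_keep

    def attend_span(self):
        # The decode kernel only ever touches k[:live] / v[:live].
        return self.k[:self.live], self.v[:self.live]
```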
God bless anyone who wants to give this a swing. The kvcache compression seems real, with a correct implementation there's no quality hit, and as the FastDMS implementation shows, it can run faster than non-DMS inference.
(lots more perf benchmarks, comparisons, and raw logs in the repo for those interested)
47FsXMj@reddit
This is awesome!
schuttdev@reddit
Good post! Very similar to my own implementation on CASK it seems - I'll look into what can be done with it on the AMD side
randomfoo2@reddit (OP)
Nice. I feel like I burnt way past my time and token limits on this but will be cheering you on!
tomByrer@reddit
Thanks for all you do, & I don't even own anything AMD. 😄
FaustAg@reddit
what about vs turboquant at 8 bit?
randomfoo2@reddit (OP)
TurboQuant at 8-bit would be slower, worse quality, and larger than FP8, so it wouldn't make sense, but the neat thing about DMS is that it can basically be composed w/ any quant scheme since they work at different layers of the kvcache.
The "optimal" quality/memory combo that gave positive results from my testing was DMS+HIGGS+AQUA, however I wasn't able to get HIGGS to the speed I wanted so just dropped it and took the "reasonable" win.
FaustAg@reddit
wait wait wait. how would turboquant 8 bit be worse than fp8? it's literally 8 bit with compression vs 8 bit without compression.
PaceZealousideal6091@reddit
Wow! Thanks a lot for your work to get a PoC. This is some of the best KV cache compression performance I have seen out here. I know for a fact that some of the developers taking care of Llama.cpp are madlads indeed. If they find your work useful, they won't leave any stone unturned to implement it.
randomfoo2@reddit (OP)
Yeah, it started off as a review of everything out there and I sort of just kept grinding. TBT, if you're ok w/ 50% throughput, you can get a 20-25X smaller kvcache w/ HIGGS+AQUA added to the mix w/ basically 0 perplexity loss, which is even more eye-popping. Maybe for another project if/when I get bored 😄
FullOf_Bad_Ideas@reddit
How well would DMS translate to modern MoEs like MiMo V2.5 or Deepseek V4 Flash? I assume they've mostly squeezed the juice out of KV cache optimization during architecture design, so there are no more easy gains left.
randomfoo2@reddit (OP)
The eviction would still largely work the same I'd imagine, although smaller activations and fewer attention layers ofc mean less kvcache to start with. I'd bet you'd get similar kvcache memory savings. The good thing btw is that the kernels I built actually scale pretty well to max context length for the models I tested (128K and 256K). I bet at 1M w/ DSv4 it'd still be worth it.
ikkiho@reddit
Per-head learned eviction is the reason -0.28% PPL holds at 6.4x while heuristic eviction breaks. The predictor is trained to align with where attention mass already concentrates, so what gets evicted is the tail of the attention distribution that was contributing near-zero to the output anyway. That's why the curve is flat: you're not really "compressing" KV, you're learning that the head didn't need those tokens to begin with. SnapKV, H2O and StreamingLLM share that insight but the policy is hand-tuned (recency + attention-sum heuristic), and the heuristic fails on retrieval-style queries where the relevant token sits in the middle. Per-head learning fixes the tail estimation directly.
The decode speedup is what's actually interesting. Most KV compression schemes lose decode throughput because unpacking plus paged-mask redirection eats the bandwidth win. Compact storage that physically reclaims evicted slots avoids the index indirection: the next-step QK matmul reads contiguous DRAM at the new (smaller) length, so SMEM working set drops linearly with eviction ratio, and on H100 decode-1 the bottleneck IS SMEM bandwidth not compute. That's where the 1.52x and 2.31x come from, not the compression itself.
For FullOf_Bad_Ideas on MoE: DMS is orthogonal to expert routing. KV cache is per-attention-head not per-expert, so the predictor cost rides on K-heads only, and GQA already shrinks that 4 to 8x. Deepseek V4 and MiMo V2.5 use MLA which compresses KV along a different axis (low-rank latent), so DMS-on-MLA would need to predict eviction in latent space, and the loss landscape is non-trivial there because the latent dimension mixes the token dimension. That's the real open question.
One test I'd want to see: 32k context with retrieval-style needle-in-haystack. Per-head learned eviction should beat StreamingLLM and H2O on the hard cases by exactly the gap where heuristics misjudge the tail. RULER or LongBench-Recall would lock that in.
AccomplishedFix3476@reddit
6.4x compression while running faster than fp8 vllm is a different scale of result than ive seen in the kv cache space this year. the per head learned eviction is the part that probably kills naive caching on long contexts. ran a quick test on a 32k prompt last week and the prefill speed was way better than i expected
Silver-Champion-4846@reddit
LLAMA CPP Hope it slurps it up fast
Dany0@reddit
Thank you!!!! I wanted to see someone try to implement this oss since the paper came out
Ye it will probably take a long time for engines to pick this up. You can suggest it though as a feature request. Maybe someone will pick it up. This has more potential than TQ. But ye... I wouldn't want to be the one implementing it
randomfoo2@reddit (OP)
I very briefly considered doing an actual vLLM or SGLang implementation and then after looking at the lift that'd be involved, noped out real fast. 😅
But I hope some madlad does it! DMS, unlike most things I test, legit works! (I'm not so impressed by TQ - HIGGS+AQUA tests much better for me, but the problem is always getting it fast)