The exact KV cache usage of DeepSeek V4
Posted by Ok_Warning2146@reddit | LocalLLaMA | 40 comments
Figure 1 of the DSV4 paper seems to imply that DSV3.2 uses ~50GB at 1m context and DSV4 uses ~5GB:
https://huggingface.co/deepseek-ai/DeepSeek-V4-Pro/blob/main/DeepSeek_V4.pdf
From my own calculations, the correct FP16 KV cache at 1m context should be:
| Model | Params | 128k | 160k | 1m | KV% |
|---|---|---|---|---|---|
| V3.x | 671B | 8.58GiB | 10.72GiB | 68.63GiB | 5.11% |
| V4 Flash | 284B | 0.76GiB | 0.95GiB | 6.08GiB | 1.07% |
| V4 Pro | 1600B | 1.09GiB | 1.36GiB | 8.71GiB | 0.272% |
So the KV cache saving is not 9.5x but 7.879x. It is still very impressive. If you look at the KV% metric, we are seeing close to a 20x gain. This basically obliterates all current transformer-SSM hybrid models' KV cache usage. But the transformer-SSM crowd can just use DSV4's CSA and HCA on their transformer layers to catch up.
At this KV cache usage, it also means that when DSV4 is supported in llama.cpp, we can easily run 1m context for DSV4 Flash on 256GB RAM and a 3090, or for DSV4 Pro on 1.5TB RAM and an RTX 6000 Blackwell. I suppose the various speed gains mentioned in the paper can make this viable.
While DSV4 Pro doesn't do well at Artificial Analysis, we can expect Kimi and Zhipu to make derivatives of it, so that we get a beast that uses very little KV cache.
All in all, DS is still doing very well as the research backbone of the Chinese AI scene.
PS: More detailed calculations for those interested. Please let me know if I got any math wrong:
Based on what I see by actually running V3.2 with llama.cpp, the actual FP16 KV cache usage for DSV3.2 is 10.72GiB at 160k context and 68.625GiB at hypothetical 1m context.
This number can be validated with the per-token per-layer MLA KV cache formula: (kv_lora_rank + qk_rope_head_dim) * precision = (512 + 64) * 2 = 1152 bytes. So for 61 layers and 1m tokens, it will be 1152*61*1024*1024 = 68.625GiB, which is not 50GB.
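A small Python sketch of this arithmetic (my own reproduction, not llama.cpp code; parameter names follow the DeepSeek config):

```python
GiB = 1024 ** 3

# Per-token per-layer MLA cache: compressed latent c_t^KV plus RoPE key k_t^R,
# assuming an FP16 (2-byte) cache.
kv_lora_rank = 512
qk_rope_head_dim = 64
precision = 2
per_token_per_layer = (kv_lora_rank + qk_rope_head_dim) * precision  # 1152 bytes

n_layers = 61            # DeepSeek V3.x
n_tokens = 1024 * 1024   # "1m" context

total = per_token_per_layer * n_layers * n_tokens
print(f"V3.x MLA cache @1m: {total / GiB:.3f} GiB")  # 68.625 GiB
```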
On the other hand, DSV4 Pro has 30 CSA layers and 31 HCA layers interleaved. My understanding is that CSA only stores 1/4 of the MLA KV cache, so per token per layer it is 288 bytes, and HCA only stores 1/128 of the MLA KV cache, so per token per layer it is 9 bytes. Therefore, the KV cache at 1m context is (288*30+9*31)*1024*1024 ≈ 8.70996GiB. So the KV cache saving is 7.879x, not 9.5x.
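Continuing the sketch for V4 Pro (the 1/4 and 1/128 ratios and the 30/31 layer split are my reading of the paper, not confirmed constants):

```python
GiB = 1024 ** 3
n_tokens = 1024 * 1024

mla_entry = 1152                 # FP16 MLA bytes per token per layer (from above)
csa_bytes = mla_entry // 4       # 288 bytes, assuming CSA keeps 1/4 of the entry
hca_bytes = mla_entry // 128     # 9 bytes, assuming HCA keeps 1/128 of the entry

v4_pro = (csa_bytes * 30 + hca_bytes * 31) * n_tokens
print(f"V4 Pro @1m: {v4_pro / GiB:.5f} GiB")            # ~8.70996 GiB
print(f"saving vs V3.x: {68.625 * GiB / v4_pro:.3f}x")  # ~7.879x
```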
For DSV4 Flash, the first two layers are Sliding Window Attention with a window size of 128 tokens. Normally, for these two layers, the per-layer KV cache for any length longer than 128 should be 2*n_head_kv*head_dim*precision*window = 2*1*128*2*128 = 65536 bytes. The current llama.cpp implementation pads the window by 256 tokens for better batching, so it becomes 2*1*128*2*(128+256) = 196608 bytes.
There are 21 CSA layers and 20 HCA layers in DSV4 Flash, so the KV cache at 1m context is (288*21+9*20)*1024*1024+2*196608 = 6.0824GiB. This is an 11.3x saving compared to DSV3.2, not 13.7x as claimed.
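And the V4 Flash version of the sketch, including the two padded sliding-window layers as llama.cpp would allocate them (again my own reproduction, using my assumed layer counts):

```python
GiB = 1024 ** 3
n_tokens = 1024 * 1024

# Two SWA layers: K and V, n_head_kv = 1, head_dim = 128, FP16,
# window padded from 128 to 384 tokens for batching.
swa_per_layer = 2 * 1 * 128 * 2 * (128 + 256)     # 196608 bytes

csa_bytes, hca_bytes = 288, 9                     # per token per layer, FP16 assumption
v4_flash = (csa_bytes * 21 + hca_bytes * 20) * n_tokens + 2 * swa_per_layer
print(f"V4 Flash @1m: {v4_flash / GiB:.4f} GiB")          # ~6.0824 GiB
print(f"saving vs V3.x: {68.625 * GiB / v4_flash:.2f}x")  # ~11.3x
```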
cantgetthistowork@reddit
Any ELI5 version?
Ok_Warning2146@reddit (OP)
The point of DSV4 is KV cache VRAM saving, as well as algorithms to speed up inference, not raw intelligence.
cantgetthistowork@reddit
Yes but ELI5 version of how it achieves these savings?
Ok_Warning2146@reddit (OP)
The main idea is that it processes input tokens in batches of 4 or 128 so that it can do the compression magic.
JayPSec@reddit
Yes, but ELI5 how you so good with ELI5??
Technical-Earth-3254@reddit
V4 Pro uses the same amount of KV cache at 1m(!) as my Gemma 31B at below 100k. smh.
Ok_Warning2146@reddit (OP)
Yeah. Now serving more users at 1m context becomes much easier thanks to DS.
guiopen@reddit
If I'm not mistaken, DeepSeek 3.2 is already VERY efficient on KV cache, right? So this massive improvement is actually even bigger if you compare it to, for example, Kimi.
FullOf_Bad_Ideas@reddit
V3.2 is about as efficient on KV cache as Kimi K2.6 and GLM 5.1.
DSA reduces arithmetical complexity but increases KV cache usage.
Ok_Warning2146@reddit (OP)
How does DSA increase KV cache? Does the Lightning Indexer use KV cache to store k_s^I, which depends on the preceding token?
FullOf_Bad_Ideas@reddit
It's mentioned in the vLLM write-up.
The indexer stores a per-layer per-token cache.
It should end up being about 18% of the total KV cache.
Ok_Warning2146@reddit (OP)
That's a lot of KV cache. Is it worth it for the speed gain from 3.1 to 3.2?
FullOf_Bad_Ideas@reddit
Probably no, but I think it's just a stepping stone architecturally towards scalable 1M and longer context that we see in Deepseek V4 Pro. DSA is basically a necessity if you want to predict tokens when you have 900k tokens loaded up on the cheap.
FullOf_Bad_Ideas@reddit
vLLM has a good blog entry on V4 and they break down the KV cache usage; I think we can say it's an authoritative source - https://vllm.ai/blog/deepseek-v4
The indexer takes a lot of prefill time and Zhipu uses IndexCache, while DS doesn't seem to be fixing it themselves, so it may be a roadblock to fast long-context inference - https://github.com/THUDM/IndexCache
Ok_Warning2146@reddit (OP)
Thanks for the link. It is interesting that vLLM cited 83.9GiB for V3.2, which is even higher than mine as well as the 50GB in Figure 1. V4 Pro is also 9.62GiB, which is higher than mine. I suppose maybe the Lightning Indexer needs KV cache? I think I need to look into it more carefully...
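A quick back-of-envelope check using the ~18% indexer share mentioned above (my own assumption about how that percentage is meant, i.e. as a share of the total cache):

```python
# If the Lightning Indexer cache is ~18% of the *total* KV cache,
# then the MLA part is the remaining ~82% of it.
mla_gib = 68.625           # FP16 MLA cache for V3.2 at 1m tokens (my number)
indexer_share = 0.18       # rough figure from the thread, not an official number
total_gib = mla_gib / (1 - indexer_share)
print(f"implied total: {total_gib:.1f} GiB")  # ~83.7 GiB, close to vLLM's 83.9
```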
fairydreaming@reddit
Yes, the indexer needs KV cache (specifically key cache; there are no value vectors in the indexer). In the V3.2 model the indexer cache is FP8.
Ok_Warning2146@reddit (OP)
How much faster is 3.2 compared to 3.1? Does it make sense to spend this much KV cache?
Ok_Warning2146@reddit (OP)
The Lightning Indexer does indeed use KV cache. I have updated my numbers; now they match the vLLM numbers.
Middle_Bullfrog_6173@reddit
I think this part of your calculation is wrong:
My understanding is that the 64 RoPE dimensions are part of the 512 total. Also, the 512 - 64 = 448 dimensions use FP8. So this should be:
64 * 2 + 448 * 1 = 576
But honestly there's so much new that I may be misunderstanding.
Ok_Warning2146@reddit (OP)
Updated my numbers using the real life numbers in vllm. Should be correct now.
https://vllm.ai/blog/deepseek-v4
Ok_Warning2146@reddit (OP)
Hmm... I was talking about the MLA KV cache that is used for all of V3.x. I have checked my calculation quite a few times and I believe it is correct, and I also verified it against llama.cpp's KV cache in real life.
Middle_Bullfrog_6173@reddit
Ah, I misunderstood. I thought that was where you derived the numbers for the later calculations. Anyway, I still think your numbers look 2x as large as they should.
HCA layers are 576 bytes / 128 tokens = 4.5 bytes per token rather than your 9. CSA layers are 576 bytes / 4 tokens = 144 bytes per token rather than your 288. (Although that doesn't include the indexer, so somewhere between is probably correct.)
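A quick sketch reproducing both sets of per-token numbers quoted in this sub-thread (my own arithmetic; the 576-byte mixed-precision entry and the 1152-byte all-FP16 entry are the two readings being compared, not confirmed values):

```python
# Per-token bytes for CSA (1/4 of the entry) and HCA (1/128 of the entry)
# under the two cache-entry sizes discussed here.
for name, entry_bytes in (("mixed FP8/FP16 (576 B)", 576), ("all FP16 (1152 B)", 1152)):
    print(f"{name}: CSA {entry_bytes / 4} B/token, HCA {entry_bytes / 128} B/token")
# 576  -> CSA 144.0, HCA 4.5
# 1152 -> CSA 288.0, HCA 9.0
```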
Ok_Warning2146@reddit (OP)
I am assuming an FP16 KV cache, so you need to multiply by 2. You can assume FP8 and get another set of numbers, but the ratio will remain the same.
Anyway, my numbers may be an underestimate, as vLLM quotes higher KV cache for both V3.2 and V4:
https://vllm.ai/blog/deepseek-v4
Maybe the Lightning Indexer uses KV cache? But re-reading the paper, it doesn't seem so. I think I need to look into the vLLM code to see what's going on.
Middle_Bullfrog_6173@reddit
If you assume FP32 you get even larger numbers... But that split FP16/8 precision is part of how they are achieving the memory reduction.
Ok_Warning2146@reddit (OP)
If you read the DeepSeek V3 technical report, it says you need to store both c_t^KV and k_t^R, which have dimensions of 512 and 64 respectively. So it should be addition, not subtraction.
Luca3700@reddit
And this is the citation for the FP8 part:
Luca3700@reddit
Yes, I understood the same. From the deepseek V4 paper they say:
QuackerEnte@reddit
Anyone who CAN run the model in the first place wouldn't complain about whether it's 5 or 8 GBs for 1M context. Like come on.
ResidentPositive4122@reddit
It's not about complaining, but about how many "slots" you can fill on a given unit of compute. Say you have 10 users per unit of compute. Whether those 10 users take ~50GB or ~80GB is different. There are 6 more users you can serve in the first case, on the same unit of compute. It's relevant to usage / pricing overall.
QuackerEnte@reddit
I don't understand localllamars. Are you for or against cloud inference? Where's the privacy in that?
inevitabledeath3@reddit
The people who can run this model are serving multiple users and so need more than 1m of total context. They might be serving 20 users, all with 1m of context, so they need 20x that amount.
Worried-Squirrel2023@reddit
50GB to 5GB at 1m context, if those numbers hold, is a bigger story than the model itself. That's the difference between needing a server and running on a workstation. The architecture changes there matter more than the benchmark scores everyone is debating.
-dysangel-@reddit
They both matter. I could make you an algo that uses 0GB, but the benchmark scores are also going to be 0
Ok_Warning2146@reddit (OP)
Yeah. Not to mention now you can serve a few more users simultaneously.
Monkey_1505@reddit
Flash does well on Artificial Analysis. They clearly struggled a little more with the larger model, which probably delayed their release.
Ok_Warning2146@reddit (OP)
That can mean the bigger model is under-trained. Kimi might do a better job of training big models.
Monkey_1505@reddit
Yeah, it could be. DS has been relying on a much larger weight of post-training than is normal for other labs (from their V3.2-exp experimentation), and the larger model is larger than any they have tried before.
So it could be that the Flash model got the optimal amount of post-training, and the 'Pro' model needed more. Their approach here has been experimental in that regard, weighting post-training more and pre-training less than is typical, so a wrong ratio for a new model size makes a lot of sense.
LegacyRemaster@reddit
The technology developed by Deepseek continues to be state-of-the-art. What I regret is that, unlike Qwen, Minimax, etc., they never actively supported llama.cpp. DSA was never fully implemented, and who knows when we'll have V4 at 100% (if we ever will).
power97992@reddit
I counted the xet files and calculated the size for V4 Pro: it is around 866GB. It would be 1.6TB in FP8 precision, but the model itself is FP4+FP8 mixed precision.
power97992@reddit
Dude, did you test it on a set of GPUs yet?