Nemotron-49B uses 70% less KV cache compared to its source Llama-70B

Posted by Ok_Warning2146@reddit | LocalLLaMA | View on Reddit | 45 comments

While studying how much KV cache major models use, both by formula and by running them empirically in llama.cpp where possible, I found that the Nemotron models are not only 30% smaller in model size, their KV cache is also 70% smaller. Overall, that's about a 38% VRAM saving if you run at 128k context.

This is because the non-self-attention layers don't have any KV cache at all. For Nemotron-49B, 31 out of 80 layers are non-self-attention; for the 51B, it's 26 out of 80 layers.
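
For anyone who wants to check the numbers themselves, here's the rough formula I used. The per-model configs (KV head counts, head dim) are my assumptions from the public config files, so double-check them; the Nemotron line only accounts for the dropped layers, not whatever the NAS search did to the surviving attention blocks, which is presumably where the rest of the ~70% saving comes from.

```python
# Rough fp16 KV cache size: 2 (K and V) * layers_with_attention * kv_heads * head_dim
#                           * context_length * bytes_per_element
def kv_cache_bytes(attn_layers, kv_heads, head_dim, ctx_len, bytes_per_elem=2):
    return 2 * attn_layers * kv_heads * head_dim * ctx_len * bytes_per_elem

CTX = 128 * 1024  # 128k context

# Configs below are assumptions taken from the models' config.json (GQA, head_dim 128):
print(kv_cache_bytes(80, 8, 128, CTX) / 2**30)  # Llama-3.3-70B: 80 attn layers -> 40.0 GiB
print(kv_cache_bytes(64, 8, 128, CTX) / 2**30)  # QwQ-32B:       64 attn layers -> 32.0 GiB
print(kv_cache_bytes(49, 8, 128, CTX) / 2**30)  # Nemotron-49B:  only 49 of 80 layers have
                                                #  attention -> 24.5 GiB upper bound
```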

So if you are into 128k context and have 48GB VRAM, Nemotron can run at Q5_K_M with unquantized KV cache. QwQ, on the other hand, can only run at IQ3_M because its KV cache alone takes 32GB (rough numbers below).
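
Back-of-envelope check on that 48GB scenario. The weight sizes are rough bits-per-weight estimates, not exact GGUF file sizes, and the Nemotron KV figure just takes the ~70% reduction from the measurement above:

```python
GiB = 2**30

# Nemotron-49B at Q5_K_M: ~49e9 params * ~5.5 bits/param (rough average)
nemotron_weights = 49e9 * 5.5 / 8 / GiB   # ~31 GiB
nemotron_kv      = 12                     # ~70% less than Llama-70B's 40 GiB at 128k

# QwQ-32B at IQ3_M: ~32.8e9 params * ~3.7 bits/param (rough average)
qwq_weights = 32.8e9 * 3.7 / 8 / GiB      # ~14 GiB
qwq_kv      = 32                          # full 64-layer GQA cache at 128k, fp16

print(f"Nemotron-49B: ~{nemotron_weights + nemotron_kv:.0f} GiB total")  # ~43 GiB
print(f"QwQ-32B:      ~{qwq_weights + qwq_kv:.0f} GiB total")            # ~46 GiB
```

Both technically fit in 48GB, but QwQ leaves almost no headroom, and anything above IQ3_M pushes it over.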

https://www.reddit.com/r/LocalLLaMA/comments/1jl33br/qwq32b_has_the_highest_kv_cachemodel_size_ratio/

Other things I learned:

  1. gemma-3 is pretty bad at KV cache when run with llama.cpp, but that's because llama.cpp doesn't implement interleaved sliding window attention (iSWA), which can cut the KV cache to roughly one sixth (quick sanity check of that figure after this list). Probably HF's transformers is the only framework that supports iSWA?

  2. Deepseek should make smaller MLA models that fit in 24GB or 48GB VRAM. That would blow the competition out of the water for local long-context use.
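
Re: point 1, here's where the "one sixth" figure comes from. Assuming the layer layout described in the gemma-3 tech report (5 sliding-window layers for every 1 global layer, 1024-token window), the ratio doesn't depend on the KV head count as long as every layer shares the same attention config:

```python
# Rough effect of gemma-3-style interleaved SWA (iSWA) on KV cache size at 128k context.
ctx, window = 128 * 1024, 1024
local_frac, global_frac = 5 / 6, 1 / 6    # 5 sliding-window layers per 1 global layer

full_kv = 1.0                                        # every layer caches all ctx tokens
iswa_kv = global_frac + local_frac * (window / ctx)  # local layers cap at the window size
print(f"iSWA cache is ~{iswa_kv / full_kv:.2f}x the full cache")  # ~0.17x, i.e. roughly 1/6
```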