Nemotron-49B uses 70% less KV cache compared to its source Llama-70B
Posted by Ok_Warning2146@reddit | LocalLLaMA | 45 comments
While studying how much KV cache major models use, both by formula and by running them empirically with llama.cpp where possible, I found that the Nemotron models are not only 30% smaller in model size, their KV cache is also 70% smaller. Overall, that's a 38% VRAM saving if you run at 128k context.
This is because the non-self-attention layers don't have any KV cache at all. For Nemotron-49B, 31 out of 80 layers are non-self-attention; for the 51B, it's 26 out of 80.
So if you are into 128k context and have 48GB VRAM, Nemotron can run at Q5_K_M at 128k with unquantized KV cache. QwQ, on the other hand, can only run at IQ3_M due to its 32GB KV cache.
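For reference, a minimal Python sketch of the usual GQA KV-cache estimate. The QwQ-32B and Llama-3.3-70B configs below (layer counts, 8 KV heads, 128 head dim) are the commonly published ones and are assumptions here, not figures taken from this thread:

```python
def kv_cache_bytes(n_attn_layers, n_kv_heads, head_dim, n_ctx, bytes_per_el=2):
    # K and V caches, one pair per attention layer, fp16 by default
    return 2 * n_attn_layers * n_kv_heads * head_dim * n_ctx * bytes_per_el

ctx = 131072  # 128k
print(kv_cache_bytes(64, 8, 128, ctx) / 2**30)  # QwQ-32B-like config: ~32 GiB, matching the figure above
print(kv_cache_bytes(80, 8, 128, ctx) / 2**30)  # Llama-3.3-70B-like config: ~40 GiB
```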
Other things I learned:
- gemma-3's KV cache is pretty bad when running with llama.cpp, but that's because llama.cpp doesn't implement interleaved sliding window attention (iSWA), which can reduce the KV cache to one sixth (a rough sketch of the saving is below the list). Probably HF transformers is the only one that supports iSWA?
- Deepseek should make smaller MLA models that fit in 24GB or 48GB VRAM. That would blow the competition out of the water for local long-context use.
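A rough sketch of the iSWA saving mentioned in the first bullet, assuming Gemma 3's reported 5:1 ratio of local (1024-token sliding window) to global attention layers; the exact factor depends on context length:

```python
ctx, window = 131072, 1024          # 128k context, assumed 1024-token sliding window
local_frac, global_frac = 5/6, 1/6  # assumed 5 local layers per 1 global layer
cached_tokens_per_layer = global_frac * ctx + local_frac * min(window, ctx)
print(cached_tokens_per_layer / ctx)  # ~0.17, i.e. roughly one sixth of the full-attention cache at 128k
```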
dinerburgeryum@reddit
I’d be curious to know how Nemotron does with lcpp’s Q8_0 KV cache quant, or better, EXL2’s Q4.
Ok_Warning2146@reddit (OP)
Quantized KV cache seems to break some models, e.g. gemma 3. Not sure about Llama/Nemotron.
exllamav2 doesn't support nemotron, but I created a hack to support it. You can try it and see if you can convert and run. I believe it should work with a single GPU. Not sure about multi-GPU.
https://github.com/ymcki/exllamav2
turboderp says there will soon be an exllamav3 that can handle layers with different configs, so Nemotron and OpenELM can be supported easily.
RebornZA@reddit
> exllamav2 doesn't support nemotron
Is THAT why I haven't seen any exl2 quants for so long? Every day, checking. Sadge.
ICanSeeYou7867@reddit
I'm running Q8 for gemma 3 and I have been pleased with it so far.
Ok_Warning2146@reddit (OP)
https://github.com/ggml-org/llama.cpp/issues/12352
Some people also reported gemma 3 being very slow with the KV cache quantized.
ICanSeeYou7867@reddit
Yeah, one of those posts is mine, plus a similar one for koboldcpp. But there have been a couple of fixes for gemma since that post.
Though maybe I should try again to make sure I am not getting my models mixed up :D
Ok_Warning2146@reddit (OP)
Oh I see. Maybe I should also update my llama.cpp.
AppearanceHeavy6724@reddit
So am I. Saw no visible difference so far.
Ok_Warning2146@reddit (OP)
q8_0 for both K and V and flash attention on?
I can run it too. It just took me 16hrs to finish 70k context for 12b q4km with a 3090.
H3PO@reddit
sure this isn't a typo? with which inference software? with 128k context and no cache quant, llama.cpp tries to allocate 19.5gb for context on top of the 35gb model. not even the Q4 model with q8 v cache fits on my 2x24gb.
Ok_Warning2146@reddit (OP)
Oops. I made a mistake in multiplying the KV cache. The correct number for the 49B is 24.5GB of unquantized KV cache at 128k. Sorry about that. So you can only run the IQ3_M model at 128k.
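A quick sanity check on that 24.5GB figure, assuming the 49 retained self-attention layers keep Llama-3.3-70B's GQA config (8 KV heads, 128 head dim) — an assumption, not something stated in the thread:

```python
# 49 attention layers, 8 KV heads, 128 head dim, 128k context, K+V, fp16
print(2 * 49 * 8 * 128 * 131072 * 2 / 2**30)  # ~24.5 GiB unquantized at 128k
```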
H3PO@reddit
thanks for checking!
Ok_Warning2146@reddit (OP)
That's only true if you have a single 48GB GPU. With multiple GPUs, llama.cpp just splits the LLM at layer 40, and since the Nemotron model's self-attention layers are concentrated in the first 40 layers, the model size and KV cache allocation are uneven for 2x24GB. Someone discovered this and reported it on github:
https://github.com/ggml-org/llama.cpp/issues/12654
AppearanceHeavy6724@reddit
They actually have one, called Deepseek V2 Lite; there is no support for that model's cache in llama.cpp whatsoever, so in llama.cpp it has no KV cache at all afaik. Which is strange, since DS V3 runs fine in llama.cpp.
Ok_Warning2146@reddit (OP)
No KV cache? Wouldn't it be very slow? Or do you mean almost no cache?
AppearanceHeavy6724@reddit
no, none at all. all context is processed on the fly. yes, very slow.
Ok_Warning2146@reddit (OP)
I see. That would mean MLA is not implemented. But I remember it being implemented since V3 came out. Have u upgraded to the latest llama.cpp?
AppearanceHeavy6724@reddit
DS V2 MLA does not work even though DS V3 works.
It did not work even in the llama.cpp I built after the release of DS V3.
Ok_Warning2146@reddit (OP)
https://github.com/ggml-org/llama.cpp/pull/11446
After reading multiple sources, it seems like MLA is not yet supported by llama.cpp. :(
AppearanceHeavy6724@reddit
I wonder how DS V3 works then.
Ok_Warning2146@reddit (OP)
Maybe in real life people are using HF transformers to run it? At least we can be sure HF transformers supports everything, since deepseek v3 provides its own modeling file.
Ok_Warning2146@reddit (OP)
Probably it gets converted to MHA and uses a lot of KV cache?
At least it seems like someone is working on it. Maybe we will see true MLA support in the near future.
Ok_Warning2146@reddit (OP)
https://github.com/ikawrakow/ik_llama.cpp/pull/188
There is a fork of llama.cpp that claims to support MLA.
Ok_Warning2146@reddit (OP)
https://github.com/ggml-org/llama.cpp/discussions/8589
It was not supported back in July 2024. Back then, MLA got converted to MHA, so it used too much VRAM.
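For a rough sense of why the MHA conversion blows up the cache, here is a minimal sketch. The hyperparameters are assumed DeepSeek-V2-Lite-like values (27 layers, 16 heads, 512-dim compressed latent, 64-dim decoupled RoPE key, 128-dim key/value heads) and are not taken from the thread:

```python
# Rough per-token KV cache comparison: native MLA vs. decompressed to MHA.
# All hyperparameters below are assumptions for illustration only.
n_layers     = 27
n_heads      = 16
kv_lora_rank = 512   # compressed latent cached by MLA
rope_dim     = 64    # decoupled RoPE key, also cached
nope_dim     = 128   # per-head key dim after decompression
v_dim        = 128   # per-head value dim after decompression
ctx, fp16    = 32768, 2  # 32k context, 2 bytes per element

mla_per_tok = n_layers * (kv_lora_rank + rope_dim)                  # latent + rope key
mha_per_tok = n_layers * n_heads * ((nope_dim + rope_dim) + v_dim)  # full K and V per head

print(f"MLA cache @32k: {mla_per_tok * ctx * fp16 / 2**30:.2f} GiB")  # ~1 GiB
print(f"MHA cache @32k: {mha_per_tok * ctx * fp16 / 2**30:.2f} GiB")  # ~8.4 GiB, roughly 9x larger
```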
Ok_Warning2146@reddit (OP)
But DSV2 Lite is only 32k context. They should make a 128k version to make the KV cache saving more noticeable.
perelmanych@reddit
In LM Studio I ran into a problem with this model. I have dual RTX 3090s, and in the new version of LM Studio I chose to load the model evenly onto the two cards. However, it completely fills the first card and uses only 13GB of the second. If I try to increase the context I get OOM on the first card. This is the first model I've had this problem with on my dual GPU setup. All other models, including QwQ 32B, R1-32B, and Llama 3.3 70B, are evenly distributed between the two GPUs.
Am I alone, or do some of you have a similar problem with this model?
PS: I am using nvidia_Llama-3_3-Nemotron-Super-49B-v1-GGUF/nvidia_Llama-3_3-Nemotron-Super-49B-v1-Q4_K_M.gguf with 25k context and flash attention on. KV cache turned off.
Ok_Warning2146@reddit (OP)
Maybe you can post this bug on the llama.cpp github issues to see if it can be fixed?
perelmanych@reddit
Thanks for the answer. I am not quite sure that this is a llama.cpp bug; I'm more inclined to believe it's an LM Studio bug. I posted the issue on the LM Studio GitHub page, but so far I haven't gotten a response.
Ok_Warning2146@reddit (OP)
Ah. Maybe it's not a bug. For the 49B and 51B, most self-attention layers are concentrated in the first 40 layers, so if the model is split down the middle, the first card ends up with way more KV cache.
perelmanych@reddit
I tried to run the model with the llama.cpp CLI, and with an equal split I saw the same picture as in LM Studio. When I used --tensor-split 4,6 to load only 40% onto GPU0 and 60% onto GPU1, I actually got an even split in terms of VRAM use. So it seems you were right: the first layers of the model are much bigger, and splitting the layers equally leads to uneven VRAM use.
Currently there is no option to manually set the splitting percentages in LM Studio, but as I understand it they are promising to implement it in a newer version. Still, it will be a bit inconvenient, since each time I want to use Nemotron or another model with asymmetric layers I will have to change the split. The better option would be to split the model according to the actual size of the layers rather than the number of layers. I should probably report the issue on the llama.cpp GitHub as you suggested.
Ok_Warning2146@reddit (OP)
Maybe u can request that llama.cpp add a feature to split every other layer? I think if this is doable it could solve your problem to some extent.
perelmanych@reddit
This would increase inter-GPU communication by a factor of n/2, where n is the number of layers. And if a model for some reason has big/small layers in sequence, it would result in the same behavior. It's much easier to just solve one equation for k, where k is the number of layers offloaded to GPU0, such that the sum of the sizes of the first k layers approximately equals the sum of the sizes of the last n-k layers. They have full info about the layer sizes, so it shouldn't be a big deal for them.
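A minimal sketch of that balancing idea: given per-layer sizes (weights plus KV cache), pick the split index k that best equalizes the two halves. The layer sizes here are made-up placeholders, not real Nemotron numbers:

```python
def best_split(layer_sizes):
    """Pick k so that the first k layers and the remaining n-k layers
    are as close in total size as possible (for a 2-GPU split)."""
    total = sum(layer_sizes)
    prefix, best_k, best_gap = 0.0, 0, float("inf")
    for k, size in enumerate(layer_sizes, start=1):
        prefix += size
        gap = abs(prefix - (total - prefix))
        if gap < best_gap:
            best_k, best_gap = k, gap
    return best_k

# Toy example: attention-heavy layers first (bigger due to KV cache), lighter layers after.
sizes = [1.5] * 40 + [0.8] * 40   # GiB per layer, purely illustrative
print(best_split(sizes))          # 31, i.e. fewer than half the layers go to GPU0
```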
Ok_Warning2146@reddit (OP)
Are you going to send a feature request at llama.cpp github? If not, I can do it for you.
perelmanych@reddit
Already sent)) Feel free to comment on it: https://github.com/ggml-org/llama.cpp/issues/12654
Ok_Warning2146@reddit (OP)
Oh I see. So probably there is no easy solution.
humanoid64@reddit
Has anyone tested AWQ? I like to use vLLM
AppearanceHeavy6724@reddit
Is this really true? Google's own tech report confirms that cache requirements are unusually high.
Ok_Warning2146@reddit (OP)
Figure 6 of the technical report says it's one sixth the KV cache at 128k context.
LagOps91@reddit
I am running the model at IQ3_XXS and 16k context on a single 24GB VRAM setup. The model holds up surprisingly well even at Q3, and yeah, I was also surprised that I could fit that much into VRAM.
tmvr@reddit
What is your KV cache configuration? With the model itself being 19.52GB, I guess you'd need Q8 or even lower KV cache to fit the 16K context into 24GB?
LagOps91@reddit
actually, no i don't have any KV cache enabled. the context is really memory-friendly.
AppearanceHeavy6724@reddit
Nitpick: no, you do have the cache enabled, otherwise it'd be painfully slow. You do not have cache quantisation enabled.
LagOps91@reddit
Yeah that's what I meant to say. Don't know how I mixed that up
tmvr@reddit
You are right, just tried it and it indeed fits nicely with 16K and FA and there is still some VRAM left (about 2GB). That's pretty wild, I like it.
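Rough numbers for why it fits, again assuming the 49 attention layers keep 8 KV heads and a 128 head dim (an assumption, not stated in the thread):

```python
# Nemotron-49B-like config at 16k context: K+V, fp16
kv_gib = 2 * 49 * 8 * 128 * 16384 * 2 / 2**30
print(f"{kv_gib:.2f} GiB")  # ~3.1 GiB, so the ~19.5GB model plus cache and buffers squeezes into 24GB
```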
LagOps91@reddit
yeah i was also positively surprised. i had expected the model to be too large to be usable on 24gb vram, but it works surprisingly well!