llama.cpp Gemma 4 using up all system RAM on larger prompts
Posted by GregoryfromtheHood@reddit | LocalLLaMA | View on Reddit | 37 comments
Something I'm noticing that I don't think I've seen before. I've been testing Gemma 4 31B with 32GB of VRAM and 64GB of DDR5. I can load the UD_Q5_K_XL Unsloth quant with about 100k context and plenty of VRAM headroom, but what ends up killing me is that after sending a few prompts the system RAM fills up and the process gets terminated for OOM. Not a GPU or CUDA OOM: Linux kills the process because llama.cpp was using 63GB of system RAM.
I've since switched to another slower PC with a bunch of older GPUs where I have 128GB of DDR4, and while I've got heaps of GPU VRAM spare there, it still eats into the system RAM. It gives me a bigger buffer before the large prompts kill the process, so it's more usable. Although I've been running a process for a little while now that has been prompting a bit and has done a few ~25k token prompts, and I'm sitting at 80GB of system RAM and climbing, so I don't think it'll make it anywhere near 100k.
I even tried switching to the Q4, which only used ~23GB of my 32GB of VRAM, but still, throw a few large prompts at it and the system RAM fills up quickly and kills llama.cpp.
I'm using the latest llama.cpp as of 2 hours ago and have tested across a couple of different machines and am seeing the same thing.
It's weird that I would need to lower the context of the model so that it takes up only like 18GB of my 32GB of VRAM just because my system RAM isn't big enough, right?
Running with params: -ngl 999 -c 102400 -fa on --cache-type-k q8_0 --cache-type-v q8_0 --temp 1.0 --top-k 64 --top-p 0.95
dampflokfreund@reddit
It's because, for some reason, context checkpoints take up a lot of memory.
https://github.com/ggml-org/llama.cpp/discussions/21480
GregoryfromtheHood@reddit (OP)
Ah, ok that explains it now
finevelyn@reddit
You can use --ctx-checkpoints to limit the maximum number of checkpoints created to work around the issue for now. It's by default set to 32.
Also --checkpoint-every-n-tokens can be used to adjust how often a checkpoint is made (8192 by default).
With Gemma 4 31B the checkpoints are more than 3GB each. I use --ctx-checkpoints 4 and now llama-server doesn't use more than 20GB max.
shroddy@reddit
What do these checkpoints do? Does the quality of the answers decrease with less checkpoints?
finevelyn@reddit
It saves the KV cache of the prompt at that point; if you later send another prompt with the exact same content up to that point, it can load the checkpoint and skip re-processing that part of the prompt. Assuming no bugs in the implementation, it shouldn't make any difference to the final result, just potentially faster prompt processing in some cases.
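Conceptually, the reuse logic described above looks something like the following sketch (this is an illustration of the technique, not llama.cpp's actual code; the checkpoint interval is shrunk from the real 8192-token default so the example is small):

```python
# Illustrative sketch of prefix-checkpoint reuse, NOT llama.cpp's real code.
# A "checkpoint" here is just a saved token prefix plus an opaque KV blob.

CHECKPOINT_EVERY = 4  # llama.cpp defaults to 8192 tokens; tiny here for demo

def process_prompt(tokens, checkpoints, max_checkpoints=32):
    """Return how many tokens actually need processing, updating checkpoints.

    checkpoints: list of (prefix_tokens, kv_state) pairs, oldest first.
    """
    # Find the longest saved prefix that exactly matches the new prompt.
    start = 0
    for prefix, _kv in checkpoints:
        if len(prefix) <= len(tokens) and tokens[:len(prefix)] == prefix:
            start = max(start, len(prefix))
    # "Process" the rest, snapshotting every CHECKPOINT_EVERY tokens.
    for end in range(start + CHECKPOINT_EVERY, len(tokens) + 1, CHECKPOINT_EVERY):
        checkpoints.append((tokens[:end], f"kv@{end}"))  # KV blob stand-in
        if len(checkpoints) > max_checkpoints:           # each blob can be GBs
            checkpoints.pop(0)
    return len(tokens) - start  # tokens that still had to be recomputed
```

Resending the same prefix then only costs the suffix, but every checkpoint holds a full KV copy, which is why capping the count with --ctx-checkpoints caps the RAM.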
ProfessionalSpend589@reddit
Thanks for giving a layman explanation.
idiotiesystemique@reddit
Interesting. Would it be realistic to offload the least recent ones to the SSD as a background CPU task, at the cost of a slightly longer load time if reverting to one?
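Nothing in this thread suggests llama.cpp does this today; purely as a hypothetical, such a spill policy could look like an LRU store that serializes the oldest checkpoints to disk and reloads them on demand:

```python
# Hypothetical sketch of spill-to-SSD checkpointing. NOT a llama.cpp feature;
# class name and policy are invented for illustration.
import os
import pickle
import tempfile
from collections import OrderedDict

class SpillingCheckpointStore:
    def __init__(self, max_in_ram=2, spill_dir=None):
        self.max_in_ram = max_in_ram
        self.spill_dir = spill_dir or tempfile.mkdtemp(prefix="kv-spill-")
        self.in_ram = OrderedDict()   # key -> KV blob, oldest first
        self.on_disk = {}             # key -> file path

    def put(self, key, kv_blob):
        self.in_ram[key] = kv_blob
        self.in_ram.move_to_end(key)
        while len(self.in_ram) > self.max_in_ram:
            old_key, old_blob = self.in_ram.popitem(last=False)
            path = os.path.join(self.spill_dir, f"{old_key}.kv")
            with open(path, "wb") as f:        # slow path: SSD write
                pickle.dump(old_blob, f)
            self.on_disk[old_key] = path

    def get(self, key):
        if key in self.in_ram:
            self.in_ram.move_to_end(key)
            return self.in_ram[key]
        path = self.on_disk.pop(key)           # slow path: SSD read
        with open(path, "rb") as f:
            blob = pickle.load(f)
        self.put(key, blob)                    # promote back into RAM
        return blob
```

The trade-off is exactly the one proposed: reverting to an evicted checkpoint pays one SSD read instead of re-processing the whole prefix.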
DeepOrangeSky@reddit
So, is this some fixable error with the model, or is it just the way the model is? Also, I don't really know what any of this checkpoint stuff is/what it means. Is it related to the KV Cache thing that others are mentioning in this thread, or is it a separate issue?
ambient_temp_xeno@reddit
Seems fixable from what I've been told.
ambient_temp_xeno@reddit
A little bird told me it might be fixable:
"the current checkpointing defaults were changed to work better with linear models (qwen 3.5) and SSM/mamba types, which have very small non-traditional KV bits compared to SWA. the flag --ctx-checkpoints affects the same things as --swa-checkpoints because they never bothered to separate the logic. for a model like qwen 3.5, 32 checkpoints is indeed truly a nothingburger, but assuming the same thing of SWA models is insane"
shockwaverc13@reddit
I find it a bit dumb that there isn't a RAM limit on context checkpoints; I don't even know how much memory 8k tokens takes.
Sadman782@reddit
use --cache-ram 0 --ctx-checkpoints 1
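Putting that together with the OP's original parameters, a full invocation might look like this (the model filename is an assumption; the two workaround flags are the ones suggested in this thread):

```shell
# OP's original flags plus the checkpoint-limiting workaround.
# Model filename is a placeholder; substitute your own GGUF path.
./llama-server -m gemma-4-31b-UD_Q5_K_XL.gguf \
  -ngl 999 -c 102400 -fa on \
  --cache-type-k q8_0 --cache-type-v q8_0 \
  --temp 1.0 --top-k 64 --top-p 0.95 \
  --cache-ram 0 --ctx-checkpoints 1
```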
DeepOrangeSky@reddit
Is this something I would need to be using llama.cpp to be able to type somewhere? I am a noob and am just using LM Studio, and don't really know how to do anything fancy yet. Do I need to open the CLI or something and enter that flag or command or whatever it is called, in there somewhere? Or is it something I can put into the template/Jinja thing in the LM Studio app for the model?
Btw, I've noticed that if I keep reloading the model after each reply during a long interaction, that keeps bringing the memory usage back down, compared to just using it like a normal model without ejecting and reloading after each reply. (Pretty annoying ad hoc fix I figured out by random trial and error as a total noob who doesn't know how/where to put that --cache-ram 0 --ctx-checkpoints 1 thing yet, lol, but it seems to help quite a bit so far, even if it's a super annoying and ridiculous thing to keep doing after every reply.)
sersoniko@reddit
Thank you for posting this, it was driving me insane. I was testing with Gemma 4 31B but had to switch to Qwen3.5 27B for long coding tasks, or keep unloading the model manually with LM Studio at 64k context.
matt-k-wong@reddit
KV cache takes up way more memory than you might assume. Check out Google's turboquant. Here's one I did, but it's made for Mac: https://github.com/matt-k-wong/turboquant-mlx-full there are others for Windows.
GregoryfromtheHood@reddit (OP)
Oh, I never knew KV cache sits in system RAM. Wild. I always assumed it was all GPU if you weren't offloading to CPU, because you can lower the cache size and quant to fit more in VRAM, and I've never seen other models do this.
Velocita84@reddit
It doesn't if you max ngl and don't have -nkvo, this guy is talking out his ass and of course shilling turboquant because that's a trend now
matt-k-wong@reddit
You’d have to double check your settings but it sounds like your system is offloading to system ram which you can control
hurdurdur7@reddit
Yeah, KV cache should not end up in system RAM under ideal conditions. This sounds like offloading and subpar performance.
Gemma 4 has a bunch of issues with llama.cpp still, erratic repeats, weird memory handling, crashes. The llama.cpp team needs to iron out the kinks before it's casually usable.
GregoryfromtheHood@reddit (OP)
I thought -ngl 999 would stop it offloading. It doesn't seem to be using much CPU at all during inference, so it doesn't look like usual offloading. Unless Ubuntu is doing something weird.
But I'm nowhere near filling VRAM and I will usually OOM there without it spilling into RAM with other models when I'm testing how much context I can fit.
mr_Owner@reddit
Try llama.cpp with --cache-ram at 0? Gemma 4 doesn't grow in my setup, or I don't use it long enough, aha
ambient_temp_xeno@reddit
Sounds like a memory leak (on linux anyway)
-np 1 might help a bit
Aizen_keikaku@reddit
Facing the same on Windows 11 with -np 1. So most likely llama.cpp is the culprit.
ambient_temp_xeno@reddit
I think it's just to do with the way Gemma 4 works.
GregoryfromtheHood@reddit (OP)
I've noticed that I can get the memory to go down if I resend the same prompt or part of the same prompt. Seems to be a prompt caching thing of some sort maybe? Like it CAN free up the RAM if it wants, so it's not a proper leak, but if I'm just continuing a conversation or keeping a thread going, it'll keep filling.
I tried parallel at 1 but it didn't help.
ambient_temp_xeno@reddit
-np 1 sets the number of slots to 1 instead of 4. On Gemma 4, more slots = more system RAM used, afaik.
GregoryfromtheHood@reddit (OP)
Yeah, I tried --no-mmap and no difference, sadly.
ambient_temp_xeno@reddit
I can't reproduce it so far on Windows.
If it's memory pressure doing it on Linux, I found this stops that kind of crap on Ubuntu with GNOME:
sudo systemd-run --scope -p MemoryMax=infinity ./llama-server (etc)
dampflokfreund@reddit
Windows has the same issue; it's the context checkpoints. You probably don't notice because you have a lot of RAM, but watch the usage climb steadily with every 8192 tokens of your prompt.
ambient_temp_xeno@reddit
I did see it increase for the checkpoints, but with only 1 slot it wasn't a huge amount. I tested with q8 K and V cache at a filled-up 131070 context.
Igot1forya@reddit
I have roughly the same issue. If I run the BF16, Q8, or Q4, the system eats up to 107GB of memory with just a handful of prompts.
ambient_temp_xeno@reddit
Uh oh. I just noticed: --cache-type-k q8_0 --cache-type-v q8_0
Maybe the lack of testing on the rotation adventure is to blame. I'm not seeing the same problem with fp16 kv cache.
Gringe8@reddit
I use koboldcpp, and the amount of RAM it uses when it loads is the most it will use. I did notice it working the way you describe when I tried tabbyAPI though, probably because I didn't have it set up correctly. Make sure you have SWA enabled when using Gemma; it uses much less VRAM.
Aizen_keikaku@reddit
Facing the exact same issue. And -np 1 doesn't help.
matt-k-wong@reddit
You’d have to double check your settings but it sounds like your system is offloading to system ram which you can control
AdamFields@reddit
I have a similar issue on a 5090 with 32GB of DDR5 system RAM. My VRAM is at 26GB usage (weights + context) while my RAM fills up and my pagefile grows from the usual 20GB to nearly 100GB as the context grows. I have run models with close to 29GB VRAM usage (weights + context) and never had an issue before. I also get these random crashes with Gemma 4. The crashes typically occur while the model is processing the prompt: it reaches 100% and then unloads the model with an error message instead of generating anything.
LM Studio error message: "Failed to send message. The model has crashed without additional information. (Exit code: 18446744072635812000)"
I have also tried llama.cpp and have the exact same issue when using it with SillyTavern.
sterby92@reddit
I have the same issue...