llama.cpp oom issue

Posted by TheTerrasque@reddit | LocalLLaMA | View on Reddit | 23 comments

I'm having an issue with llama.cpp going OOM *(system RAM, not VRAM)* after some time, roughly 20-40 minutes of active use. I'm now running it in a cgroup with about 20 GB allocated to it, so at least it gets killed and restarted before it starts messing with other services on the machine.

Command:

~/llama.cpp/build/bin/llama-server -hf unsloth/Qwen3.6-27B-MTP-GGUF:UD-Q4_K_XL --temp 0.6 --top-p 0.95 --top-k 20 -cram 4096 -c 90000 --min-p 0.00 --spec-draft-p-min 0.75 -np 1 -t 4 -ctk q5_1 -ctv q5_1 --cache-type-k-draft q5_1 --cache-type-v-draft q5_1 --spec-type draft-mtp --spec-draft-n-max 3 --fit off --image-min-tokens 1024 --image-max-tokens 2048 --chat-template-kwargs '{"preserve_thinking":true}'

I've tried various settings, builds and even the Docker image, but over time the problem is the same: the process slowly takes more memory and is eventually killed. I tried --no-mmap and --cache-ram 0; the latter delayed the OOM but it still happened. I also tried without MTP.

Is this expected behavior? I have another server with a weaker GPU that runs llama.cpp server via llama-swap and that doesn't have the same problem, but then again the server process is not usually running for long periods there.
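For anyone wanting to confirm the "slowly takes more memory" part with numbers, here is a rough Python sketch (not something from the thread) that reads the resident set size of a process from /proc/<pid>/status on Linux and estimates the growth rate. The function names and the sampling approach are my own assumptions; only the VmRSS field itself is standard Linux.

```python
import re
from typing import List, Optional, Tuple


def parse_vm_rss_kib(status_text: str) -> Optional[int]:
    """Extract VmRSS (in KiB) from the text of /proc/<pid>/status."""
    match = re.search(r"^VmRSS:\s+(\d+)\s+kB", status_text, re.MULTILINE)
    return int(match.group(1)) if match else None


def read_rss_kib(pid: int) -> Optional[int]:
    """Read the current resident set size of a running process (Linux only)."""
    with open(f"/proc/{pid}/status") as f:
        return parse_vm_rss_kib(f.read())


def growth_mib_per_min(samples: List[Tuple[float, int]]) -> float:
    """Average RSS growth between the first and last (seconds, rss_kib) sample."""
    (t0, r0), (t1, r1) = samples[0], samples[-1]
    if t1 == t0:
        return 0.0
    return (r1 - r0) / 1024 / ((t1 - t0) / 60)
```

Calling read_rss_kib with the llama-server PID every minute or so and feeding the (timestamp, value) pairs to growth_mib_per_min should show whether memory climbs steadily or jumps in steps.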

superdariom@reddit

I've seen massive memory spikes with Vulkan, like consuming double what it should for a short period of time. I had to increase swap memory to deal with it. Looks like a bug to me and feels like it only started happening recently.

Wrong_Mushroom_7350@reddit

I am wondering why you do not have flash attention on?

TheTerrasque@reddit (OP)

it's on, via env vars. Besides, currently FA is auto as default, so should be on for most hardware.

JGeek00@reddit

I had the same issue. The solution is to reduce the number of checkpoints and their size, although I finally ended up installing more memory.

TheTerrasque@reddit (OP)

yeah, spot on. Reducing the checkpoints made the problem go poof

shamitv@reddit

To troubleshoot, do you see "created context checkpoint 1 of X" messages in the logs?
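A quick way to check this without scrolling through the whole log is a small Python sketch like the one below. It only assumes that the log contains the phrase "created context checkpoint N of M" as quoted in this thread; the rest of each log line is not parsed, and the function name is made up for illustration.

```python
import re
from typing import Iterable, Optional, Tuple

# Matches the message quoted in the thread; anything before or after it is ignored.
CHECKPOINT_RE = re.compile(r"created context checkpoint (\d+) of (\d+)")


def checkpoint_usage(log_lines: Iterable[str]) -> Optional[Tuple[int, int]]:
    """Return (highest checkpoint index seen, configured limit), or None if absent."""
    best = None
    for line in log_lines:
        m = CHECKPOINT_RE.search(line)
        if m:
            used, limit = int(m.group(1)), int(m.group(2))
            if best is None or used > best[0]:
                best = (used, limit)
    return best
```

Passing an open log file (any iterable of lines works) returns something like (32, 32) when the server has filled every checkpoint slot, which is the situation worth investigating.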

TheTerrasque@reddit (OP)

yeah, it seems like this was the source of the issue. Jacek2023 figured it out :)

jacek2023@reddit

do you have any logs?

TheTerrasque@reddit (OP)

dmesg: https://pastebin.com/4FRxDpX4
service log: https://pastebin.com/hQqpVvvM

jacek2023@reddit

"created context checkpoint 32 of 32" - start by decreasing the number of checkpoints to, let's say, 4, then try to reproduce the OOM

TheTerrasque@reddit (OP)

I've changed it, will see if I can push it a bit. But I would expect that would be bound by the -cram setting? Or have I misunderstood how that works?

jacek2023@reddit

First let's find out whether this is the root cause. I have 128GB of RAM and I had some OOM with Gemma because of checkpoints. With Qwen 27B I use 24 without issues.

llama-impersonator@reddit

gemma checkpoints are like 10x the size of qwen. i think i remember 330MB vs something in the 3GB range
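To put those sizes in perspective, here is a back-of-the-envelope Python sketch. It assumes every checkpoint is kept in system RAM at roughly the same size, and it takes the 330MB and ~3GB figures above as recalled, not measured. The function name is made up for illustration.

```python
def checkpoint_ram_gib(max_checkpoints: int, checkpoint_mib: float) -> float:
    """Worst-case RAM (GiB) if every checkpoint slot is filled at the given size."""
    return max_checkpoints * checkpoint_mib / 1024


# Figures recalled in this thread, treated as rough estimates:
qwen_32 = checkpoint_ram_gib(32, 330)    # 32 checkpoints at ~330 MiB each
qwen_4 = checkpoint_ram_gib(4, 330)      # 4 checkpoints at ~330 MiB each
gemma_32 = checkpoint_ram_gib(32, 3072)  # 32 checkpoints at ~3 GiB each

print(f"Qwen, 32 checkpoints:  {qwen_32:.1f} GiB")
print(f"Qwen, 4 checkpoints:   {qwen_4:.1f} GiB")
print(f"Gemma, 32 checkpoints: {gemma_32:.1f} GiB")
```

Under those assumptions, 32 Qwen-sized checkpoints come to roughly 10 GiB, which is a large share of a 20 GB cgroup limit before counting the 4 GB cache and anything else the server holds, while 32 Gemma-sized ones would approach 100 GiB.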

TheTerrasque@reddit (OP)

Preliminary results look really good. It has maxed out at about 5-6 GB of RAM so far, which is reasonable for a 4 GB cache plus server overhead.

jacek2023@reddit

Now find a good value between 4 and 32

ali0une@reddit

When using MTP try to either lower context or use fit-ctx.

cptbeard@reddit

I had a system lock up with the same model when a coding agent started trying to compact its context of 131073 with --no-mmap --mlock, kv q8 and draft kv q4, and not much else besides llama.cpp running, on a 7900xtx with 64GB of system RAM. The solution for me was to drop the context size to ~123k. It hasn't happened since, so I didn't really bother investigating (although a few times the system got pretty slow with full context, so I suspect some memory management shenanigans).

Formal-Exam-8767@reddit

How much RAM do you have? I don't think it pre-allocates context on start.

TheTerrasque@reddit (OP)

machine has 24gb, the service llama-server runs under is limited to 20gb. The GPU has 24gb too, a 3090. But it's not cuda OOM, it's system ram OOM.

Anbeeld@reddit

Try --no-mmap --mlock

TheTerrasque@reddit (OP)

I have tried that, sadly that made no difference :(

xeroskiller@reddit

I'm using q4_k_m on a 7900xtx (so 24gb vram). I'm doing -np 1 -c 131073 and q8 on kv cache. It BARELY fits, but it's stable. How much VRAM are you working with?

TheTerrasque@reddit (OP)

As I mentioned, this is system memory OOM, not vram. I posted the logs of one oom kill in a different comment here.