llama.cpp OOM issue

Posted by TheTerrasque@reddit | LocalLLaMA

I'm having an issue with llama.cpp going OOM *(system RAM, not VRAM)* after roughly 20-40 minutes of active use. I'm now running it in a cgroup with about 20 GB allocated to it, so at least it gets killed and restarted before it starts messing with other services on the machine.

Command:

~/llama.cpp/build/bin/llama-server -hf unsloth/Qwen3.6-27B-MTP-GGUF:UD-Q4_K_XL --temp 0.6 --top-p 0.95 --top-k 20 -cram 4096 -c 90000 --min-p 0.00 --spec-draft-p-min 0.75 -np 1 -t 4 -ctk q5_1 -ctv q5_1 --cache-type-k-draft q5_1 --cache-type-v-draft q5_1 --spec-type draft-mtp --spec-draft-n-max 3 --fit off --image-min-tokens 1024 --image-max-tokens 2048 --chat-template-kwargs '{"preserve_thinking":true}'

I've tried various settings, builds, and even the Docker image, but the problem is always the same: the process slowly takes more memory and is eventually killed. I tried --no-mmap and --cache-ram 0; the latter delayed the OOM, but it still happened. I also tried without MTP.

Is this expected behavior? I have another server with a weaker GPU that runs the llama.cpp server via llama-swap and doesn't have the same problem, but then again the server process there doesn't usually run for long periods.
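
For anyone trying to reproduce this, here is a minimal sketch of one way to log the growth over time. It is not part of the setup described above. It assumes Linux, where the kernel reports resident memory as `VmRSS` in `/proc/<pid>/status`. The function names and the 30-second interval are illustrative choices.

```python
import re
import sys
import time
from pathlib import Path
from typing import Optional


def parse_vmrss_kb(status_text: str) -> Optional[int]:
    """Return the VmRSS value in kB from /proc/<pid>/status text, or None if absent."""
    match = re.search(r"^VmRSS:\s+(\d+)\s+kB", status_text, re.MULTILINE)
    return int(match.group(1)) if match else None


def read_rss_kb(pid: int) -> Optional[int]:
    """Read the resident set size of a process (Linux only)."""
    return parse_vmrss_kb(Path(f"/proc/{pid}/status").read_text())


def watch(pid: int, interval_s: float = 30.0) -> None:
    """Print one timestamped RSS sample per interval until the process exits."""
    while True:
        try:
            rss_kb = read_rss_kb(pid)
        except (FileNotFoundError, ProcessLookupError):
            print("process exited")
            return
        if rss_kb is not None:
            print(f"{time.strftime('%H:%M:%S')} rss_mib={rss_kb / 1024:.0f}", flush=True)
        time.sleep(interval_s)


# Usage (hypothetical file name): python rss_watch.py <pid of llama-server>
if __name__ == "__main__" and len(sys.argv) == 2 and sys.argv[1].isdigit():
    watch(int(sys.argv[1]))
```

Redirecting the output to a file gives a timeline that can be lined up with requests to see whether the growth is steady or comes in steps.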