llama.cpp oom issue
Posted by TheTerrasque@reddit | LocalLLaMA | 23 comments
I'm having an issue with llama.cpp going OOM *(system RAM, not VRAM)* after roughly 20-40 minutes of active use. I'm now running it in a cgroup with about 20 GB allocated to it, so at least it gets killed and restarted before it starts interfering with other services on the machine.
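To see how close the process is to the cgroup limit before it gets killed, the cgroup's memory files can be read directly. This is a minimal sketch, assuming cgroup v2; `cgroup_mem_usage` is a hypothetical helper name, and the directory to pass depends on how the cgroup was created:

```shell
#!/bin/sh
# Print usage of a cgroup v2 directory as "current_bytes max_bytes percent".
# cgroup_mem_usage is a hypothetical helper; pass the cgroup's directory,
# e.g. a path under /sys/fs/cgroup (the exact path depends on your setup).
cgroup_mem_usage() {
    dir=$1
    current=$(cat "$dir/memory.current") || return 1
    max=$(cat "$dir/memory.max") || return 1
    if [ "$max" = "max" ]; then
        # No limit is configured for this cgroup.
        echo "$current max -"
    else
        # Integer percentage of the limit currently in use.
        echo "$current $max $((current * 100 / max))"
    fi
}
```

Calling it in a loop (for example once a minute) shows whether usage climbs steadily toward the limit or jumps in steps.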
Command:
~/llama.cpp/build/bin/llama-server \
  -hf unsloth/Qwen3.6-27B-MTP-GGUF:UD-Q4_K_XL \
  --temp 0.6 \
  --top-p 0.95 \
  --top-k 20 \
  -cram 4096 \
  -c 90000 \
  --min-p 0.00 \
  --spec-draft-p-min 0.75 \
  -np 1 \
  -t 4 \
  -ctk q5_1 -ctv q5_1 \
  --cache-type-k-draft q5_1 \
  --cache-type-v-draft q5_1 \
  --spec-type draft-mtp \
  --spec-draft-n-max 3 \
  --fit off \
  --image-min-tokens 1024 \
  --image-max-tokens 2048 \
  --chat-template-kwargs '{"preserve_thinking":true}'
I've tried various settings, builds, and even the Docker image, but over time the problem is the same: the process slowly takes more memory and is eventually killed. I tried --no-mmap and --cache-ram 0; the latter delayed the OOM, but it still happened. I also tried running without MTP.
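To put numbers on the slow growth, resident memory can be sampled at a fixed interval and compared across builds or settings. This is a minimal sketch, assuming Linux with /proc available; `log_rss`, the interval, and the sample count are illustrative and not part of llama.cpp:

```shell
#!/bin/sh
# Print "epoch_seconds rss_kib" lines for a PID, reading VmRSS from /proc.
# log_rss is a hypothetical helper name, not a llama.cpp tool.
log_rss() {
    pid=$1
    interval=${2:-60}   # seconds between samples
    count=${3:-0}       # number of samples; 0 means run until the process exits
    n=0
    while [ -r "/proc/$pid/status" ]; do
        rss=$(awk '/^VmRSS:/ {print $2}' "/proc/$pid/status")
        # Stop if the process has no resident set (e.g. it has exited).
        [ -n "$rss" ] || break
        echo "$(date +%s) $rss"
        n=$((n + 1))
        if [ "$count" -gt 0 ] && [ "$n" -ge "$count" ]; then
            break
        fi
        sleep "$interval"
    done
}
```

For example, `log_rss "$(pidof llama-server)" 60 > rss.log` would record one sample per minute until the server exits, assuming a single llama-server process is running. A steady slope in the log, rather than a plateau after warm-up, is what would suggest a leak.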
Is this expected behavior? I have another server with a weaker GPU that runs the llama.cpp server via llama-swap and doesn't have the same problem, but then again the server process there usually doesn't run for long periods.