GPU VRAM only for small models with llama.cpp: is it possible?
Posted by Ps3Dave@reddit | LocalLLaMA | 22 comments
I'm still in my learning process and so far I've been able to make satisfying use of my setup (4070 with 12GB VRAM + 32GB RAM and iGPU for my GUI). I've been able to run both Gemma4 26B and Qwen 3.6 35B MoEs up to high quants with large context and have about 40 t/s with both.
However, I'd like to try a smaller model, ideally a quant of Qwen3.5-9B, with full VRAM usage and no host memory to slow things down. In theory it should be possible, but even gemma4-e2b with a low quant (IQ4_XS) and a small context (8192) ends up using about 3.5 GB of RAM on top of the GPU.
I've tried all the command line options I could find with llama-server, but so far...no cigar.
What am I doing wrong?
AdvisorIllustrious15@reddit
You’re probably not doing anything wrong. llama.cpp still keeps some CPU RAM usage even when weights fit in VRAM — KV cache, buffers, pinned memory, allocator overhead, metadata, etc. ‘Full VRAM’ usually doesn’t mean literally 0 host RAM. 3.5GB for small context does seem high though, maybe check if mmap/mlock, flash attention, embeddings, or server overhead are enabled.
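To check how much host RAM the server process really holds, and whether it is file-backed (mmap'd weights) or anonymous memory (buffers, caches), something like the following may help. It is a rough sketch that assumes Linux, where `/proc/<pid>/status` exposes `VmRSS`, `RssAnon` and `RssFile`; the function names are made up for illustration.

```python
import re
from pathlib import Path


def parse_status_kib(status_text: str, field: str = "VmRSS") -> int:
    """Return one memory field (in KiB) from the text of /proc/<pid>/status."""
    pattern = rf"^{re.escape(field)}:\s+(\d+)\s+kB"
    match = re.search(pattern, status_text, re.MULTILINE)
    if match is None:
        raise ValueError(f"{field} not found in status text")
    return int(match.group(1))


def resident_ram_gib(pid: int, field: str = "VmRSS") -> float:
    """Resident host memory of a running process in GiB (Linux only)."""
    text = Path(f"/proc/{pid}/status").read_text()
    return parse_status_kib(text, field) / (1024 ** 2)
```

Calling `resident_ram_gib(pid, "RssFile")` and `resident_ram_gib(pid, "RssAnon")` with the llama-server PID splits the total into file-backed and anonymous pages, which is a quick way to see whether mmap is the culprit.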
Ps3Dave@reddit (OP)
Yeah I checked all of them, still getting GBs of RAM used up (as per the llama-server log) and bottlenecked by RAM and CPU during token generation. I can do 5000 t/s in prompt processing though.
ea_man@reddit
Maybe it is a lot of
Ps3Dave@reddit (OP)
Ok, fit-target I did not try yet. Also switching to qwen 4B as you suggested. Maybe it's gemma's architecture. Will report back.
ea_man@reddit
Do consider that you will still need some VRAM to run the desktop (if you are not headless). You should post the VRAM usage when you launch llama-server, e.g.: https://www.reddit.com/r/LocalLLaMA/comments/1tau4bk/comment/olf48kb/?context=3&utm_source=share&utm_medium=web3x&utm_name=web3xcss&utm_term=1&utm_content=share_button
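For grabbing the VRAM numbers to post, a small sketch along these lines could work. It assumes an NVIDIA card with `nvidia-smi` on the PATH; the helper names are hypothetical.

```python
import subprocess


def parse_gpu_memory(csv_text: str) -> list[tuple[int, int]]:
    """Parse 'used, total' MiB pairs from nvidia-smi CSV output, one row per GPU."""
    rows = []
    for line in csv_text.strip().splitlines():
        used, total = (int(part.strip()) for part in line.split(","))
        rows.append((used, total))
    return rows


def query_gpu_memory() -> list[tuple[int, int]]:
    """Ask nvidia-smi for used/total VRAM in MiB (needs the NVIDIA driver tools)."""
    result = subprocess.run(
        [
            "nvidia-smi",
            "--query-gpu=memory.used,memory.total",
            "--format=csv,noheader,nounits",
        ],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_gpu_memory(result.stdout)
```

Running `query_gpu_memory()` once before and once after starting llama-server gives the delta the server is responsible for, separate from whatever the desktop already uses.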
Ps3Dave@reddit (OP)
More details.
With this:
llama-server -m models/Qwen3.5-9B-IQ4_XS.gguf --no-mmap -ngl 999 -ctk q5_0 -ctv q4_0 --cache-ram 0 --fit-target 50 --flash-attn on -v -lv 4
I get this:
So still using a lot of host RAM even with more than 3GB VRAM free.
tonyboi76@reddit
yes it's possible but llama.cpp defaults work against you here. couple flags:
--no-mmap forces the weights to load fully into VRAM instead of being memory-mapped from disk (which is what keeps some host RAM in play). --n-gpu-layers 999 to put everything on the gpu. and if you can, set the cpu pools to 0 (--n-cpu-moe 0 for MoEs).
also quantize the kv cache, -ctk q8_0 -ctv q8_0. that frees up real VRAM for the actual weights+context rather than padding for the kv side. on a 12GB card a 7-9b q4 model with kv-quant and --no-mmap should fit comfortably and host RAM use stays minimal.
if you want truly pure-VRAM with no llama.cpp host overhead, vLLM is a cleaner answer. it's a different beast architecturally (gpu-resident the whole time) but the install is way heavier than llama.cpp.
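As a rough illustration, the flags quoted in this thread can be collected into one launch command. This is only a sketch: the function name and defaults are made up, `-c` (context size) is added on the assumption that your build supports it, and the other flags are copied from the posts here rather than checked against any particular llama.cpp version.

```python
import shlex


def build_llama_server_cmd(
    model_path: str,
    ctx_size: int = 8192,
    kv_type_k: str = "q8_0",
    kv_type_v: str = "q8_0",
) -> list[str]:
    """Assemble a llama-server argument list from flags mentioned in the thread."""
    return [
        "llama-server",
        "-m", model_path,
        "-c", str(ctx_size),      # assumption: context-size flag, not quoted in the thread
        "--no-mmap",              # do not memory-map the model file
        "-ngl", "999",            # offload all layers to the GPU
        "-ctk", kv_type_k,        # K cache type
        "-ctv", kv_type_v,        # V cache type
        "--flash-attn", "on",
        "--cache-ram", "0",       # disable the host-side prompt cache
    ]


cmd = build_llama_server_cmd("models/Qwen3.5-9B-IQ4_XS.gguf")
print(shlex.join(cmd))
```

The list form can be passed straight to `subprocess.run(cmd)`, which avoids shell quoting problems with model paths.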
m31317015@reddit
Easy answer is vLLM, no ram offloading by default. llama.cpp will offload some cache to CPU by default, maybe try `--no-mmap`?
Ps3Dave@reddit (OP)
Yeah I'm looking into vLLM. Got it running but still need to learn how to decipher the logs. Glad to learn new things anyway! :)
YearnMar10@reddit
Ye, not sure what you’re doing there. I run gemma4 e2b with max context on my ~5 gigs of RAM (unsloth q4km, no kv cache quantization).
NelsonMinar@reddit
Have you ever tried LM Studio? It has a nice GUI. I've definitely seen it load small models fully into VRAM.
koflerdavid@reddit
LM Studio uses llama.cpp internally and therefore would have a similar issue. This question is about llama.cpp itself.
vastaaja@reddit
Did you find `--cache-ram`? https://github.com/ggml-org/llama.cpp/pull/16391
Ps3Dave@reddit (OP)
Yeah, I disabled it.
CooperDK@reddit
The idea was to hold the model in VRAM, not to offload it.
vastaaja@reddit
If OP doesn't set cache ram to zero, he'll still see up to 8GiB of prompt cache in RAM with the default setting. It's not clear from the post if he knows what the RAM is used for.
CooperDK@reddit
You can quantize the k/v caches into q8 and save according to your context size. That will lower the VRAM required to hold it all.
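To get a feel for how much the cache type saves at a given context size, here is a back-of-the-envelope estimate. The model shape in the example (32 layers, 8 KV heads, head dimension 128) is hypothetical, not taken from any model in this thread, and the formula assumes plain attention in every layer; hybrid or sliding-window architectures will differ. The bytes-per-element figures follow the ggml block layouts (32 values per block plus an fp16 scale).

```python
# Bytes per cached element for each cache type. Quantized types store
# blocks of 32 values plus a 2-byte scale; q5_0 adds 4 bytes of high bits.
BYTES_PER_ELEMENT = {
    "f16": 2.0,
    "q8_0": 34 / 32,
    "q5_0": 22 / 32,
    "q4_0": 18 / 32,
}


def kv_cache_bytes(n_layers, n_kv_heads, head_dim, ctx_size,
                   type_k="f16", type_v="f16"):
    """Estimate KV cache size in bytes for a plain-attention model."""
    elements = n_layers * n_kv_heads * head_dim * ctx_size  # per K or per V
    return elements * (BYTES_PER_ELEMENT[type_k] + BYTES_PER_ELEMENT[type_v])


# Hypothetical model shape at 8192 context, for illustration only.
for k, v in [("f16", "f16"), ("q8_0", "q8_0"), ("q5_0", "q4_0")]:
    size = kv_cache_bytes(32, 8, 128, 8192, k, v)
    print(f"K={k:5s} V={v:5s} -> {size / 2**20:.0f} MiB")
```

For this made-up shape the estimate drops from 1024 MiB at f16 to 544 MiB at q8_0 and 320 MiB with q5_0/q4_0. The saving scales linearly with context size.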
Sidran@reddit
My thought exactly. To optimize cache, flash attention is also important.
I would try with this:
Ps3Dave@reddit (OP)
Yup, did this. I went down to q5_0 for k and q4_0 for v. With a small context I get 600MB of kv cache, and still a few GB offloaded to RAM.
arnav080@reddit
p sure llama.cpp still keeps some buffers / KV cache allocations in system RAM even when all layers are offloaded to VRAM. does `--cache-type-k q4_0` / `--cache-type-v q4_0` change it for you? (im still learning, just my two cents)
tomByrer@reddit
You might be thinking about a megakernel, for which you might not have enough VRAM
https://github.com/Luce-Org/lucebox-hub#01--megakernel-qwen35-08b-on-rtx-3090
Independent_Exit_260@reddit
Damn, I love this community