GPU VRAM only for small models with llama.cpp: is it possible?

Posted by Ps3Dave@reddit | LocalLLaMA | 22 comments

I'm still in my learning process and so far I've been able to make satisfying use of my setup (4070 with 12GB VRAM + 32GB RAM and iGPU for my GUI). I've been able to run both Gemma4 26B and Qwen 3.6 35B MoEs up to high quants with large context and have about 40 t/s with both.

However, I'd like to try a smaller model, ideally a quant of Qwen3.5-9B, with full VRAM usage and no host memory to slow things down. In theory it should be possible, but even gemma4-e2b with a low quant (IQ4_XS) and a small context (8192) ends up using about 3.5 GB of RAM on top of the GPU.

I've tried all the command line options I could find with llama-server, but so far...no cigar.

What am I doing wrong?

AdvisorIllustrious15@reddit

You’re probably not doing anything wrong. llama.cpp still keeps some CPU RAM usage even when weights fit in VRAM — KV cache, buffers, pinned memory, allocator overhead, metadata, etc. ‘Full VRAM’ usually doesn’t mean literally 0 host RAM. 3.5GB for small context does seem high though, maybe check if mmap/mlock, flash attention, embeddings, or server overhead are enabled.

Ps3Dave@reddit (OP)

Yeah, I checked all of them. I'm still getting GBs of RAM used (per the llama-server log) and I'm bottlenecked by RAM and CPU during token generation (tg). I can do 5000 t/s in prompt processing, though.

ea_man@reddit

Maybe it is a lot of `--cache-ram`

Ps3Dave@reddit (OP)

Ok, I haven't tried fit-target yet. I'm also switching to qwen 4B as you suggested. Maybe it's gemma's architecture. Will report back.

ea_man@reddit

Do consider that you will still need some VRAM to run the desktop (if you are not headless). You should post the VRAM usage when you launch llama-server, e.g.: https://www.reddit.com/r/LocalLLaMA/comments/1tau4bk/comment/olf48kb/?context=3&utm_source=share&utm_medium=web3x&utm_name=web3xcss&utm_term=1&utm_content=share_button
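One way to capture the VRAM usage for a post is to query nvidia-smi before and after launching llama-server. The sketch below is a minimal example that assumes an NVIDIA driver with `nvidia-smi` on the PATH; the helper names `parse_gpu_memory` and `read_gpu_memory` are made up for illustration, and the sample line in the usage note uses invented numbers.

```python
import subprocess

# nvidia-smi query that prints one CSV line per GPU, values in MiB.
QUERY = [
    "nvidia-smi",
    "--query-gpu=name,memory.used,memory.total",
    "--format=csv,noheader,nounits",
]

def parse_gpu_memory(csv_text):
    """Parse the CSV output into a list of (name, used_mib, total_mib)."""
    gpus = []
    for line in csv_text.strip().splitlines():
        # Split from the right so a comma inside the GPU name cannot break parsing.
        name, used, total = [field.strip() for field in line.rsplit(",", 2)]
        gpus.append((name, int(used), int(total)))
    return gpus

def read_gpu_memory():
    """Run nvidia-smi and return the parsed per-GPU memory usage."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_gpu_memory(result.stdout)
```

Calling `read_gpu_memory()` once with only the desktop running and once with llama-server loaded gives two numbers whose difference is roughly what the server itself occupies. The parser can be tried without a GPU on an invented line such as `parse_gpu_memory("NVIDIA GeForce RTX 4070 SUPER, 8000, 12000")`.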

Ps3Dave@reddit (OP)

More details.

With this:

llama-server -m models/Qwen3.5-9B-IQ4_XS.gguf --no-mmap -ngl 999 -ctk q5_0 -ctv q4_0 --cache-ram 0 --fit-target 50 --flash-attn on -v -lv 4

I get this:

0.07.998.744 I common_memory_breakdown_print: | memory breakdown [MiB]     | total   free    self   model   context   compute    unaccounted |
0.07.998.746 I common_memory_breakdown_print: |   - CUDA0 (RTX 4070 SUPER) | 11876 = 3659 + (7950 =  4373 +    2761 +     816) +         266 |
0.07.998.747 I common_memory_breakdown_print: |   - Host                   |                 1321 =   545 +       0 +     776                |

So still using a lot of host RAM even with more than 3GB VRAM free.
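To see where the host RAM goes, the two data rows of the log above can be parsed and checked against their own sums. This is a rough sketch: the field order is inferred from the header row of the log (total, free, self, model, context, compute, unaccounted), and `parse_row` is a made-up helper, not part of llama.cpp.

```python
import re

# The two data rows from the log above, with timestamps and prefixes dropped.
ROWS = [
    "|   - CUDA0 (RTX 4070 SUPER) | 11876 = 3659 + (7950 =  4373 +    2761 +     816) +         266 |",
    "|   - Host                   |                 1321 =   545 +       0 +     776                |",
]

# Field order assumed from the header row of the breakdown table.
GPU_FIELDS = ("total", "free", "self", "model", "context", "compute", "unaccounted")
HOST_FIELDS = ("self", "model", "context", "compute")

def parse_row(row):
    """Return (device, {field: MiB}) for one breakdown row."""
    label, values = [cell.strip() for cell in row.strip().strip("|").split("|")]
    device = label.lstrip("- ")
    numbers = [int(n) for n in re.findall(r"\d+", values)]
    fields = GPU_FIELDS if len(numbers) == len(GPU_FIELDS) else HOST_FIELDS
    return device, dict(zip(fields, numbers))

breakdown = dict(parse_row(row) for row in ROWS)

for device, mem in breakdown.items():
    # "self" should be the sum of the model, context and compute buffers.
    assert mem["self"] == mem["model"] + mem["context"] + mem["compute"]
    print(f"{device}: self={mem['self']} MiB "
          f"(model={mem['model']}, context={mem['context']}, compute={mem['compute']})")
```

Read this way, the host row reports 0 MiB of context, so the KV cache is entirely on the GPU. The 1321 MiB on the host is split into 545 MiB tagged as model and 776 MiB of compute buffers.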

tonyboi76@reddit

yes it's possible but llama.cpp defaults work against you here. couple flags:

--no-mmap makes llama.cpp read the weights into its own buffers instead of memory-mapping the file from disk (the mapping is what keeps some host RAM in play). --n-gpu-layers 999 to put everything on the gpu. and if you're running an MoE, --n-cpu-moe 0 so no expert weights stay on the cpu.

also quantize the kv cache, -ctk q8_0 -ctv q8_0. that frees up real VRAM for the actual weights+context rather than padding for the kv side. on a 12GB card a 7-9b q4 model with kv-quant and --no-mmap should fit comfortably and host RAM use stays minimal.

if you want truly pure-VRAM with no llama.cpp host overhead, vLLM is a cleaner answer. it's a different beast architecturally (gpu-resident the whole time), but the install is way heavier than llama.cpp.

m31317015@reddit

Easy answer is vLLM, no ram offloading by default. llama.cpp will offload some cache to CPU by default, maybe try `--no-mmap`?

Ps3Dave@reddit (OP)

Yeah I'm looking into vLLM. Got it running but still need to learn how to decipher the logs. Glad to learn new things anyway! :)

YearnMar10@reddit

> even gemma4-e2b with a low quant (IQ4_XS) with small context (8192) ends up using about 3.5 GB of RAM on top of the GPU.

Yeah, not sure what you’re doing there. I run gemma4 e2b with max context on my ~5 gigs of RAM (unsloth Q4_K_M, no KV cache quantization).

NelsonMinar@reddit

Have you ever tried LM Studio? It has a nice GUI. I've definitely seen it load small models fully into VRAM.

koflerdavid@reddit

LM Studio uses llama.cpp internally and therefore would have a similar issue. This question is about llama.cpp itself.

vastaaja@reddit

> I've tried all the command line options I could find

Did you find --cache-ram? https://github.com/ggml-org/llama.cpp/pull/16391

Ps3Dave@reddit (OP)

Yeah, I disabled it.

CooperDK@reddit

The idea was to hold the model in VRAM, not to offload it.

vastaaja@reddit

If OP doesn't set cache ram to zero, he'll still see up to 8GiB of prompt cache in RAM with the default setting. It's not clear from the post if he knows what the RAM is used for.

CooperDK@reddit

You can quantize the K/V caches to q8_0; the savings scale with your context size. That will lower the VRAM required to hold it all.
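How much this saves can be estimated with a back-of-the-envelope calculation. The sketch below assumes a plain transformer in which every layer keeps a full K and V cache, which is not true of every architecture, and the model dimensions in the usage note are placeholders rather than the real values of any model in this thread. The block sizes are those of the ggml formats: q8_0, q5_0 and q4_0 store 32 values in 34, 22 and 18 bytes respectively.

```python
# (bytes per block, values per block) for the cache types discussed here.
BLOCK_BYTES = {
    "f16": (2, 1),
    "q8_0": (34, 32),
    "q5_0": (22, 32),
    "q4_0": (18, 32),
}

def kv_cache_mib(n_layers, n_kv_heads, head_dim, n_ctx, k_type="f16", v_type="f16"):
    """Estimate KV cache size in MiB for a plain attention stack.

    Assumes every layer stores n_kv_heads * head_dim values per token,
    once for K and once for V.
    """
    values_per_tensor = n_layers * n_kv_heads * head_dim * n_ctx
    total_bytes = 0.0
    for cache_type in (k_type, v_type):
        block_bytes, block_values = BLOCK_BYTES[cache_type]
        total_bytes += values_per_tensor * block_bytes / block_values
    return total_bytes / 2**20

# Placeholder dimensions, chosen only to make the arithmetic easy to follow.
dims = dict(n_layers=32, n_kv_heads=8, head_dim=128, n_ctx=8192)

for k_type, v_type in [("f16", "f16"), ("q8_0", "q8_0"), ("q5_0", "q4_0")]:
    size = kv_cache_mib(k_type=k_type, v_type=v_type, **dims)
    print(f"K={k_type:5s} V={v_type:5s} -> {size:.0f} MiB")
```

Because the size is linear in `n_ctx`, doubling the context doubles every figure, which is why the savings from quantizing only become large at long contexts.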

Sidran@reddit

My thought exactly. To optimize cache, flash attention is also important.
I would try with this:

--flash-attn on ^
--cache-type-k q8_0 ^
--cache-type-v q4_0 ^

Ps3Dave@reddit (OP)

Yup, did this. I went down to q5_0 for K and q4_0 for V. With a small context I get 600MB of KV cache, and still a few GBs of RAM offloaded.

arnav080@reddit

Pretty sure llama.cpp still keeps some buffers / KV cache allocations in system RAM even when all layers are offloaded to VRAM. Does --cache-type-k q4_0 / --cache-type-v q4_0 change it for you? (I'm still learning, just my two cents.)

tomByrer@reddit

You might be thinking about a megakernel, for which you might not have enough VRAM

https://github.com/Luce-Org/lucebox-hub#01--megakernel-qwen35-08b-on-rtx-3090

Independent_Exit_260@reddit

Damn, I love this community