To 16GB VRAM users, plug in your old GPU

Posted by akira3weet@reddit | LocalLLaMA | View on Reddit | 95 comments

For those who want to run the latest dense ~30B models but only have 16GB of VRAM: if you have an old card with 6GB of VRAM or more, plug it in.

What matters is that everything fits in VRAM, even if it's split across two cards, and even if one of them is quite weak.

I have a 5070 Ti 16GB and an old 2060 6GB. The common wisdom is that you need two identical GPUs to maximize performance, but one day I was struck by the idea: why not give it a try?

Let's see: if you didn't buy a motherboard just for LLMs, it's very likely you have one true PCIe x16 slot plus a couple that look like x16 but are actually wired as x4, just like me. That's a perfect slot for an old card.

16GB + 6GB = 22GB, which is getting close to a 24GB-class card. If you have a better old card, lucky you!
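
As a rough sanity check (my own back-of-the-envelope figure, assuming Q4_K_M averages around 4.8 bits per weight): 27B × 4.8 / 8 ≈ 16 GB just for the weights, so that extra 6 GB is what leaves room for the KV cache and compute buffers.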

Then you run llama-server with a config like this:

[*]
jinja = true
cache-prompt = true
n-gpu-layers = 999
no-mmap = true
mlock = false
np = 1
t = 0

[qwen/qwen3.6-27b]
model = ./Qwen3.6-27B-GGUF/Qwen3.6-27B-Q4_K_M.gguf
mmproj = ./Qwen3.6-27B-GGUF/mmproj-Qwen3.6-27B-BF16.gguf
reasoning = on
dev = Vulkan1,Vulkan2
c = 128000
no-mmproj-offload = true
cache-type-k = q8_0
cache-type-v = q8_0
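
For reference, if you'd rather skip the config file, a roughly equivalent plain command line would look something like the sketch below. The flag mapping is my best guess (I left out cache-prompt and reasoning since I'm not sure of their exact CLI spellings), so double-check against `llama-server --help` on your build:

```
REM rough CLI equivalent of the config above (Windows cmd; use \ instead of ^ on Linux)
llama-server.exe ^
  -m ./Qwen3.6-27B-GGUF/Qwen3.6-27B-Q4_K_M.gguf ^
  --mmproj ./Qwen3.6-27B-GGUF/mmproj-Qwen3.6-27B-BF16.gguf ^
  --jinja --no-mmap -ngl 999 -np 1 -t 0 ^
  --device Vulkan1,Vulkan2 -c 128000 ^
  --no-mmproj-offload --cache-type-k q8_0 --cache-type-v q8_0
```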

A couple of specific points:
- dev=Vulkan1,Vulkan2 enables both GPUs; run `llama-server.exe --list-devices` to see what you should set.
- no-mmap and mlock=false keep the model out of your system RAM.
- np=1, no-mmproj-offload (or just don't supply the mmproj model), plus the q8_0 cache-type-k / cache-type-v, minimize the VRAM needed.
- n-gpu-layers=999 to prefer GPU offloading; this may be unnecessary, but I'd keep it.
- split-mode=layer splits the layers asymmetrically across the devices; "layer" is the default, so you don't see it above. If you want to bias the split yourself, see the sketch after this list.
- c=128000 might be a bit of a stretch, but it works well enough for me.
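
One extra knob, not in my config above: if the automatic layer split doesn't balance well and the smaller card runs out of memory, you can bias it with --tensor-split, roughly proportional to each card's VRAM. A minimal sketch for my 16GB + 6GB pair (check the device order with --list-devices first):

```
REM bias the layer split roughly by VRAM (16:6 for my pair)
llama-server.exe -m ./Qwen3.6-27B-GGUF/Qwen3.6-27B-Q4_K_M.gguf ^
  --device Vulkan1,Vulkan2 --split-mode layer --tensor-split 16,6 ^
  -ngl 999 -c 128000 --no-mmap --cache-type-k q8_0 --cache-type-v q8_0
```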

BTW, I also have an Intel integrated GPU that I plug the monitors into, so neither discrete card spends VRAM driving the display.

Some numbers: at 128k max context with ~71k of actual context used, pp=186 t/s and tg=19 t/s, which is quite a usable speed.

[56288] prompt eval time =    5761.53 ms /  1076 tokens (    5.35 ms per token,   186.76 tokens per second)
[56288]        eval time =   58000.15 ms /  1114 tokens (   52.06 ms per token,    19.21 tokens per second)
[56288]       total time =   63761.69 ms /  2190 tokens
[56288] slot      release: id  0 | task 654 | stop processing: n_tokens = 71703, truncated = 0