Can I combine a RTX5060ti 16gb with 7900XTX 24gb for llama.cpp?
Posted by soyalemujica@reddit | LocalLLaMA | 9 comments
I bought this 7900XTX for 905 euro in Spain, and I'm wondering if I can combine them to run, for example, Qwen 3.5 27B?
Using an MSI B650 Gaming Plus WiFi and 64GB DDR5 6400 MT/s.
Flamenverfer@reddit
If I'm not mistaken, there have been some improvements to multi-backend setups.
I highly recommend testing different combinations. (All-Vulkan is where I'd put my money for the best experience, depending on your needs.)
But you might have good success running your 5060 on CUDA and the XTX on Vulkan, or ROCm for your XTX and Vulkan for your 5060.
Play around with --tensor-split 50,50. You might see a good performance increase with 60,40 (60 being your XTX, i.e. 60% of the model sitting on the XTX in that example).
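A sketch of what that invocation might look like. `--tensor-split` and `-ngl` are real llama.cpp flags; the binary name assumes a recent build, and the model path is a placeholder:

```shell
# Sketch: Vulkan build of llama.cpp that sees both GPUs.
# The order of values in --tensor-split follows the device order
# llama.cpp prints at startup, so check the log to confirm which
# GPU is device 0 before tuning the ratio.
./llama-server \
  -m ./model.gguf \
  -ngl 99 \
  --tensor-split 60,40
```

The 60/40 ratio roughly mirrors the VRAM ratio here: 24GB / (24GB + 16GB) = 60%.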
Typical-Arugula-8555@reddit
No, the VRAM on the two GPUs has to be the same.
soyalemujica@reddit (OP)
Why would they have to be the same?
Typical-Arugula-8555@reddit
Inference frameworks like vLLM and SGLang require it; otherwise they just won't run.
Typical-Arugula-8555@reddit
Oh, I’ll give llama.cpp a try and see how it goes.
Typical-Arugula-8555@reddit
Llama.cpp does support using multiple GPUs with different VRAM sizes, but it can slow things down a bit.
With my setup, I get about 95 tokens per second on a single GPU. When I use two GPUs (one 48GB and one 16GB), it drops to around 64 tokens per second.
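One way to measure that kind of difference yourself is `llama-bench`, which ships with llama.cpp (note it uses `/` rather than `,` to separate split values; the model path is a placeholder):

```shell
# Compare tokens/s for two split ratios on the same model.
./llama-bench -m ./model.gguf -ngl 99 -ts 50/50
./llama-bench -m ./model.gguf -ngl 99 -ts 60/40
```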
ForsookComparison@reddit
Yes, running Vulkan.
soyalemujica@reddit (OP)
Any bottlenecks at all, or things to take into account?
ForsookComparison@reddit
Nah. Vulkan has great token-generation speeds but somewhat worse prompt-processing speeds.
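For reference, a Vulkan build of llama.cpp can be produced roughly like this (assumes the Vulkan SDK/headers are installed; `GGML_VULKAN` is the documented CMake option):

```shell
# Build llama.cpp with the Vulkan backend enabled.
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build -DGGML_VULKAN=ON
cmake --build build --config Release -j
# At startup, both the 7900 XTX and the 5060 Ti should appear
# in the Vulkan device list printed to the log.
```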