Dual GPU setup (yes, no)?
Posted by PiHeich@reddit | LocalLLaMA | 8 comments
I have the following llama.cpp setups available:
- RTX 3080, 10 GB VRAM + 32 GB RAM + i9-9900K
- RX 9070 XT, 16 GB VRAM + 32 GB RAM + 9800X3D (iGPU; llama.cpp reports 18 GB RAM)
- RX 9070 XT 16 GB VRAM + RTX 3080 10 GB VRAM + 32 GB RAM + 9800X3D
I tried Qwen3.6-35B-A3B-UD-Q3_K_S.gguf on the 9070 XT and I get around 34 tok/s, while with Qwen3.6-35B-A3B-UD-Q4_K_S.gguf in a hybrid 9070 XT + iGPU setup I was getting 18 tok/s, but I ran into crashes as well (I’m still a beginner). I’d like to work out, with your help, whether and how I can improve my setup for coding by making the best use of my hardware.
The issue is that the 3080 alone (which is in my secondary machine) would be perfect, but I have to go much lower with quantization and I’m afraid of losing too much quality. On the other hand, the system with the 9070 XT is my main PC, which I also use for gaming and other things, but with only 16 GB I’m a bit limited.
However, I noticed that even just 6 GB more VRAM lets me use much less aggressive quantization and go up in bit quality.
I’d like to figure out with you which configuration is the best, and above all how to optimize it, since I’m still new to llama.cpp.
Do you think I can keep the hybrid 9070 XT + iGPU setup and tell llama to prioritize everything on the discrete GPU and only use the rest on the iGPU? I noticed that the load gets assigned to the iGPU first instead (I assume because it has more RAM), and I don’t like that very much.
Or would it be better to also install the 3080 in my main PC? Does llama handle GPUs from different brands?
How do you configure two GPUs entirely from the terminal? At the moment I start the hybrid setup like this:
llama-server ^
-m "C:\Users\user\Documents\modelli\qwen3punto6_20e9GB\Qwen3.6-35B-A3B-UD-Q4_K_S.gguf" ^
--mmproj "C:\Users\user\Documents\modelli\qwen3punto6_20e9GB\mmproj-F16.gguf" ^
--jinja ^
-c 91750 ^
--host 0.0.0.0 ^
--port 8033 ^
--chat-template-kwargs "{\"enable_thinking\":true}" ^
--temp 0.6 ^
--top_p 0.95 ^
--top_k 20 ^
--min_p 0.0 ^
--presence-penalty 0.0 ^
--repeat-penalty 1.0 ^
-dev Vulkan0,Vulkan1 ^
-sm layer ^
-ts 3,1 ^
-ngl all
Thanks
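On the question of prioritizing the discrete GPU over the iGPU: one option is to keep the same command but weight the layer split heavily toward the discrete card. A minimal sketch, assuming the Vulkan backend enumerates the 9070 XT as Vulkan0 (that ordering is an assumption; if it is reversed on your machine, swap the -ts ratio):

```shell
:: Sketch: same hybrid setup, but ~8 layers on the discrete GPU for
:: every 1 on the iGPU. The 8,1 ratio and device order are assumptions.
llama-server ^
  -m "C:\Users\user\Documents\modelli\qwen3punto6_20e9GB\Qwen3.6-35B-A3B-UD-Q4_K_S.gguf" ^
  -dev Vulkan0,Vulkan1 ^
  -sm layer ^
  -ts 8,1 ^
  -ngl 99
```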
Ok-Relationship9111@reddit
You can use rpc to connect both, I do it here with nvidia + amd
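The RPC route mentioned here can be sketched roughly as follows, assuming a llama.cpp build with the RPC backend enabled (GGML_RPC=ON) and a placeholder LAN address for the second machine:

```shell
:: On the secondary machine with the 3080: expose its GPU over the network.
rpc-server -H 0.0.0.0 -p 50052

:: On the main PC: treat the remote GPU as an extra backend.
:: 192.168.1.50 is a placeholder for the 3080 machine's address.
llama-server -m model.gguf --rpc 192.168.1.50:50052 -ngl 99
```

Layer activations cross the network on every token, so link speed matters a lot with this approach.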
lemondrops9@reddit
What is your network speed, and do you have a guess at the % of performance lost by going RPC?
PiHeich@reddit (OP)
How?
andy2na@reddit
Do the 9070 and 3080: hitting 26 GB of VRAM opens up a whole new world. You'll be able to fit 3.4_35b iq4_nl with the full 256k context and it should run MUCH faster
PiHeich@reddit (OP)
Which llama do I need to download? Can you give me a terminal setup to run it?
andy2na@reddit
latest llama.cpp
you will want to use split-mode layer commands
the first number in --tensor-split is whatever your system sees as GPU 0 and the second is GPU 1; just ask Gemini or ChatGPT to help you with your particular setup
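A sketch of that split for the proposed 9070 XT + 3080 pairing, dividing layers roughly in proportion to VRAM (16 GB vs 10 GB) and assuming the 9070 XT enumerates as GPU 0:

```shell
:: Sketch: proportional layer split across the two cards.
:: Swap the -ts values if your system enumerates the 3080 as GPU 0.
llama-server -m model.gguf -sm layer -ts 16,10 -ngl 99
```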
PiHeich@reddit (OP)
ty! But the latest llama with which backend: Vulkan? CUDA?
andy2na@reddit
You'll have to use Vulkan, since you can't use CUDA with a Radeon
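Before settling on a split ratio, it can help to confirm which devices the Vulkan backend actually sees and in what order. Recent llama.cpp builds can list them directly (flag availability depends on the build):

```shell
:: Prints the backends/devices llama.cpp detected (e.g. Vulkan0, Vulkan1)
:: so you know which card the first -ts value applies to.
llama-server --list-devices
```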