Dual GPU setup (yes, no)?
Posted by PiHeich@reddit | LocalLLaMA | 8 comments
I have the following llama.cpp setups available:
- RTX 3080, 10 GB VRAM + 32 GB RAM + i9-9900K
- RX 9070 XT, 16 GB VRAM + 32 GB RAM + 9800X3D (iGPU; llama.cpp reports 18 GB RAM)
- RX 9070 XT 16 GB VRAM + RTX 3080 10 GB VRAM + 32 GB RAM + 9800X3D
I tried Qwen3.6-35B-A3B-UD-Q3_K_S.gguf on the 9070 XT and I get around 34 tok/s, while with Qwen3.6-35B-A3B-UD-Q4_K_S.gguf in a hybrid 9070 XT + iGPU setup I was getting 18 tok/s, but I ran into crashes as well (I’m still a beginner). I’d like to work out, with your help, whether and how I can improve my setup for coding by making the best use of my hardware.
The issue is that the 3080 alone (which is in my secondary machine) would be perfect, but I have to go much lower with quantization and I’m afraid of losing too much quality. On the other hand, the system with the 9070 XT is my main PC, which I also use for gaming and other things, but with only 16 GB I’m a bit limited.
However, I noticed that even just 6 GB more VRAM lets me use much less aggressive quantization and go up in bit quality.
I’d like to figure out with you which configuration is the best, and above all how to optimize it, since I’m still new to llama.cpp.
Do you think I can keep the hybrid 9070 XT + iGPU setup and tell llama to prioritize everything on the discrete GPU and only use the rest on the iGPU? I noticed that the load gets assigned to the iGPU first instead (I assume because it has more RAM), and I don’t like that very much.
Or would it be better to also install the 3080 in my main PC? Does llama handle GPUs from different brands?
How do you configure two GPUs entirely from the terminal? At the moment I start the hybrid setup like this:
llama-server ^
-m "C:\Users\user\Documents\modelli\qwen3punto6_20e9GB\Qwen3.6-35B-A3B-UD-Q4_K_S.gguf" ^
--mmproj "C:\Users\user\Documents\modelli\qwen3punto6_20e9GB\mmproj-F16.gguf" ^
--jinja ^
-c 91750 ^
--host 0.0.0.0 ^
--port 8033 ^
--chat-template-kwargs "{\"enable_thinking\":true}" ^
--temp 0.6 ^
--top_p 0.95 ^
--top_k 20 ^
--min_p 0.0 ^
--presence-penalty 0.0 ^
--repeat-penalty 1.0 ^
-dev Vulkan0,Vulkan1 ^
-sm layer ^
-ts 3,1 ^
-ngl all
Thanks
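On the question of prioritizing the discrete GPU over the iGPU: one option is to keep the same command but weight the layer split heavily toward the discrete card. A minimal sketch, assuming the Vulkan backend enumerates the 9070 XT as Vulkan0 (that ordering is an assumption; if it is reversed on your machine, swap the -ts ratio):

```shell
:: Sketch: same hybrid setup, but ~8 layers on the discrete GPU for
:: every 1 on the iGPU. The 8,1 ratio and device order are assumptions.
llama-server ^
  -m "C:\Users\user\Documents\modelli\qwen3punto6_20e9GB\Qwen3.6-35B-A3B-UD-Q4_K_S.gguf" ^
  -dev Vulkan0,Vulkan1 ^
  -sm layer ^
  -ts 8,1 ^
  -ngl 99
```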
Ok-Relationship9111@reddit
You can use rpc to connect both, I do it here with nvidia + amd
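The RPC route mentioned here can be sketched roughly as follows, assuming a llama.cpp build with the RPC backend enabled (GGML_RPC=ON) and a placeholder LAN address for the second machine:

```shell
:: On the secondary machine with the 3080: expose its GPU over the network.
rpc-server -H 0.0.0.0 -p 50052

:: On the main PC: treat the remote GPU as an extra backend.
:: 192.168.1.50 is a placeholder for the 3080 machine's address.
llama-server -m model.gguf --rpc 192.168.1.50:50052 -ngl 99
```

Layer activations cross the network on every token, so link speed matters a lot with this approach.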
lemondrops9@reddit
What is your network speed, and do you have a guess at the % of performance lost by going RPC?
PiHeich@reddit (OP)
How?
andy2na@reddit
Do the 9070 and 3080: hitting 26 GB of VRAM opens up a whole new world. You'll be able to fit 3.4_35b iq4_nl with the full 256k context and it should run MUCH faster
PiHeich@reddit (OP)
Which llama do I need to download? Can you give me a terminal setup to run it?
andy2na@reddit
latest llama.cpp
you will want to use split-mode layer commands
the first number in --tensor-split is whatever your system sees as GPU 0 and the second is GPU 1; just ask Gemini or ChatGPT to help you with your particular setup
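A sketch of that split for the proposed 9070 XT + 3080 pairing, dividing layers roughly in proportion to VRAM (16 GB vs 10 GB) and assuming the 9070 XT enumerates as GPU 0:

```shell
:: Sketch: proportional layer split across the two cards.
:: Swap the -ts values if your system enumerates the 3080 as GPU 0.
llama-server -m model.gguf -sm layer -ts 16,10 -ngl 99
```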
PiHeich@reddit (OP)
ty! But the latest llama with which backend: Vulkan? CUDA?
andy2na@reddit
You'll have to use Vulkan, since you can't use CUDA with a Radeon
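Before settling on a split ratio, it can help to confirm which devices the Vulkan backend actually sees and in what order. Recent llama.cpp builds can list them directly (flag availability depends on the build):

```shell
:: Prints the backends/devices llama.cpp detected (e.g. Vulkan0, Vulkan1)
:: so you know which card the first -ts value applies to.
llama-server --list-devices
```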