Dual GPU Setup for LLMs – Notes from a Newbie

Posted by DrRamorey@reddit | LocalLLaMA

Some lessons I learned the hard way. These points might be obvious to some, but I wasn’t fully aware of them before I built my LLM workstation. Hopefully this helps other newbies like me.

Context:

I was using my AMD RX 6800 mostly for LLM workloads and wanted more VRAM to test larger models. I built a PC to accommodate two GPUs for this use case.
The plan was to use my RX 6800 plus a newer GPU. I knew it should be an AMD card, and the RX 9070 XT seemed like the best value.
I’m still an amateur with LLMs—mostly using them in LM Studio—but I’ve started experimenting with dedicated servers and Docker setups.

Learning 1 - You can’t assume gaming benchmarks reflect LLM performance

Standard benchmarks like 3DMark, Heaven, or Superposition showed my new 9070 XT was 51–64% faster than my old card. I kind of expected similar gains in LLM performance.

That was clearly not the case. Here are my llama-bench results (ROCm):

./llama-bench -m gemma-3-12B-it-qat-GGUF/gemma-3-12B-it-QAT-Q4_0.gguf -mg 0,1 -sm none 
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no 
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no 
ggml_cuda_init: found 2 ROCm devices:
Device 0: AMD Radeon RX 9070 XT, gfx1201 (0x1201), VMM: no, Wave Size: 32 
Device 1: AMD Radeon RX 6800, gfx1030 (0x1030), VMM: no, Wave Size: 32
| model           |     size |  params | backend | ngl | main_gpu | sm   |  test |            t/s |
| --------------- | -------: | ------: | ------- | --: | -------: | ---- | ----: | -------------: |
| gemma3 12B Q4_0 | 6.41 GiB | 11.77 B | ROCm    |  99 |        0 | none | pp512 | 1420.48 ± 4.73 |
| gemma3 12B Q4_0 | 6.41 GiB | 11.77 B | ROCm    |  99 |        0 | none | tg128 |   47.74 ± 0.14 |
| gemma3 12B Q4_0 | 6.41 GiB | 11.77 B | ROCm    |  99 |        1 | none | pp512 |  947.02 ± 0.82 |
| gemma3 12B Q4_0 | 6.41 GiB | 11.77 B | ROCm    |  99 |        1 | none | tg128 |   43.23 ± 0.01 |
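
For context, this is how I understand the flags above: -sm none disables splitting, so the whole model runs on the main GPU, and the comma-separated -mg 0,1 makes llama-bench repeat the test once with each device as main GPU. A rough equivalent (just a sketch; the device indices match my output above, check rocm-smi or rocminfo on your own system) is to hide one card from ROCm entirely:

# whole model on device 0 only (the 9070 XT in my case)
HIP_VISIBLE_DEVICES=0 ./llama-bench -m gemma-3-12B-it-qat-GGUF/gemma-3-12B-it-QAT-Q4_0.gguf

# whole model on device 1 only (the RX 6800)
HIP_VISIBLE_DEVICES=1 ./llama-bench -m gemma-3-12B-it-qat-GGUF/gemma-3-12B-it-QAT-Q4_0.gguf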

The 9070 XT is about 50% faster in prompt processing, but only about 10% faster in token generation.
That was really disappointing.

It’s a different picture for image generation (results below are generation times in seconds; lower is better).
As this isn’t my main interest, I only did very basic testing. ComfyUI had some weird issues with the 9070 XT but worked flawlessly with the RX 6800 (as of July 2025).

| Task                                  | RX 6800 (s) | RX 9070 XT (s) |
| ------------------------------------- | ----------: | -------------: |
| Stable Diffusion 3.5 simple 1024x1024 |         115 |             77 |
| SDXL simple 1024x1024                 |          25 |              - |
| Flux schnell 1024x1024                |          38 |             32 |
| Flux checkpoint 1024x1024             |         171 |             61 |

As my 9070 XT was also way too loud, I returned it and picked up a second-hand RX 6800 XT. It’s only slightly faster than my old card, but €450 cheaper than the 9070 XT.

Lesson: ignore standard gaming benchmarks when choosing a GPU for LLMs.
Check LLM-specific benchmark lists like https://github.com/ggml-org/llama.cpp/discussions/10879 and, if you’re not buying two identical cards, pick a GPU whose LLM performance is close to your existing one’s.

Learning 2 - Two GPUs do not double LLM token generation performance

This should be obvious, but I never really thought about it and assumed overall performance would scale with two GPUs.

Wrong again.
The main benefit of the second GPU is extra VRAM. Larger models can be split across both GPUs—but performance is actually worse than with a single card (see next point).
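
If you do want to use that extra VRAM, a split with llama.cpp looks roughly like this (just a sketch; the model file is a placeholder and -ts 1,1 simply splits evenly across my two 16 GB cards):

# spread the layers across both GPUs and serve the model locally
./llama-server -m ./models/some-larger-model-Q4_K_M.gguf -ngl 99 -sm layer -ts 1,1 --host 127.0.0.1 --port 8080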

Learning 3 - Splitting models across two GPUs can tank performance

Using the same model as before, now with the RX 6800 XT:

./llama-bench -m gemma-3-12B-it-qat-GGUF/gemma-3-12B-it-QAT-Q4_0.gguf -mg 0,1 -sm none
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no 
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no 
ggml_cuda_init: found 2 ROCm devices: 
Device 0: AMD Radeon RX 6800 XT, gfx1030 (0x1030), VMM: no, Wave Size: 32
Device 1: AMD Radeon RX 6800, gfx1030 (0x1030), VMM: no, Wave Size: 32
| model           |     size |  params | backend | ngl | main_gpu | sm   |  test |            t/s |
| --------------- | -------: | ------: | ------- | --: | -------: | ---- | ----: | -------------: |
| gemma3 12B Q4_0 | 6.41 GiB | 11.77 B | ROCm    |  99 |        0 | none | pp512 | 1070.86 ± 2.28 |
| gemma3 12B Q4_0 | 6.41 GiB | 11.77 B | ROCm    |  99 |        0 | none | tg128 |   44.78 ± 0.06 |
| gemma3 12B Q4_0 | 6.41 GiB | 11.77 B | ROCm    |  99 |        1 | none | pp512 |  875.96 ± 1.29 |
| gemma3 12B Q4_0 | 6.41 GiB | 11.77 B | ROCm    |  99 |        1 | none | tg128 |   43.25 ± 0.01 |

The 6800 XT is only 3% faster than my old RX 6800 in token generation.

Now splitting the model across both cards:

./llama-bench -m gemma-3-12B-it-qat-GGUF/gemma-3-12B-it-QAT-Q4_0.gguf -mg 0,1
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no 
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no 
ggml_cuda_init: found 2 ROCm devices: 
Device 0: AMD Radeon RX 6800 XT, gfx1030 (0x1030), VMM: no, Wave Size: 32 
Device 1: AMD Radeon RX 6800, gfx1030 (0x1030), VMM: no, Wave Size: 32
| model           |     size |  params | backend | ngl | main_gpu |  test |           t/s |
| --------------- | -------: | ------: | ------- | --: | -------: | ----: | ------------: |
| gemma3 12B Q4_0 | 6.41 GiB | 11.77 B | ROCm    |  99 |        0 | pp512 | 964.56 ± 2.04 |
| gemma3 12B Q4_0 | 6.41 GiB | 11.77 B | ROCm    |  99 |        0 | tg128 |  31.92 ± 0.03 |
| gemma3 12B Q4_0 | 6.41 GiB | 11.77 B | ROCm    |  99 |        1 | pp512 | 962.75 ± 1.56 |
| gemma3 12B Q4_0 | 6.41 GiB | 11.77 B | ROCm    |  99 |        1 | tg128 |  31.90 ± 0.03 |

Interpretation (my guess): with the default layer split, each GPU holds a block of layers and a token has to pass through them in order, so during token generation the cards work one after the other rather than in parallel, and the activations also have to hop between GPUs over PCIe.

In my case, splitting gave me roughly 26% lower token generation speed compared to my slower card.
The upside: I now have 32GB of VRAM for bigger models.
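
If you end up in the same spot, llama-bench can compare the split modes directly. As far as I can tell it takes comma-separated lists for most parameters, so something like this runs the layer and row split back to back (a sketch, not a tuned setup):

# compare the two split modes on the same model
./llama-bench -m gemma-3-12B-it-qat-GGUF/gemma-3-12B-it-QAT-Q4_0.gguf -sm layer,row

There is also --tensor-split to shift a larger share of the model onto the faster card; whether that or row split helps depends on the cards and the PCIe layout, so it’s worth benchmarking on your own system.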

Learning 4 - Consumer hardware (mainboard and case) pitfalls

Mainboard

Case

I’m still learning, and most of this is based on my own trial and error.
If I’ve misunderstood something, overlooked a better method, or drawn the wrong conclusions, I’d appreciate corrections.
Feel free to share your own benchmarks, tweaks, or experiences so others (including me) can learn from them.