Dual GPU Setup for LLMs – Notes from a Newbie
Posted by DrRamorey@reddit | LocalLLaMA | 30 comments
Some lessons I learned the hard way. These points might be obvious to some, but I wasn't fully aware of them before I built my LLM workstation. Hopefully this helps other newbies like me.
Context:
I was using my AMD RX 6800 mostly for LLM workloads and wanted more VRAM to test larger models. I built a PC to accommodate two GPUs for this use case.
The plan was to use my RX 6800 plus a newer GPU. I knew it should be an AMD card, and the RX 9070 XT seemed like the best value.
I’m still an amateur with LLMs—mostly using them in LM Studio—but I’ve started experimenting with dedicated servers and Docker setups.
Learning 1 - You can’t assume gaming benchmarks reflect LLM performance
Standard benchmarks like 3DMark, Heaven, or Superposition showed my new 9070 XT was 51–64% faster than my old card. I kind of expected similar gains in LLM performance.
That was clearly not the case. Here are my llama-bench results (ROCm):
./llama-bench -m gemma-3-12B-it-qat-GGUF/gemma-3-12B-it-QAT-Q4_0.gguf -mg 0,1 -sm none
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 ROCm devices:
Device 0: AMD Radeon RX 9070 XT, gfx1201 (0x1201), VMM: no, Wave Size: 32
Device 1: AMD Radeon RX 6800, gfx1030 (0x1030), VMM: no, Wave Size: 32
| model | size | params | backend | ngl | main_gpu | sm | test | t/s |
|---|---|---|---|---|---|---|---|---|
| gemma3 12B Q4_0 | 6.41 GiB | 11.77 B | ROCm | 99 | 0 | none | pp512 | 1420.48 ± 4.73 |
| gemma3 12B Q4_0 | 6.41 GiB | 11.77 B | ROCm | 99 | 0 | none | tg128 | 47.74 ± 0.14 |
| gemma3 12B Q4_0 | 6.41 GiB | 11.77 B | ROCm | 99 | 1 | none | pp512 | 947.02 ± 0.82 |
| gemma3 12B Q4_0 | 6.41 GiB | 11.77 B | ROCm | 99 | 1 | none | tg128 | 43.23 ± 0.01 |
The 9070 XT is 50% faster in prompt processing (pp512), but only 10% faster in token generation (tg128).
That was really disappointing.
It’s a different picture for image generation (results below are generation times in seconds; lower is better).
As this isn’t my main interest, I only did very basic testing. ComfyUI had some weird issues with the 9070 XT but worked flawlessly with the RX 6800 (as of July 2025).
| Task | RX 6800 (s) | RX 9070 XT (s) |
|---|---|---|
| Stable Diffusion 3.5 simple 1024x1024 | 115 | 77 |
| SDXL simple 1024x1024 | 25 | - |
| Flux schnell 1024x1024 | 38 | 32 |
| Flux checkpoint 1024x1024 | 171 | 61 |
As my 9070 XT was also way too loud, I returned it and picked up a second-hand RX 6800 XT. It’s only slightly faster than my old card, but €450 cheaper than the 9070 XT.
Lesson: ignore standard gaming benchmarks when choosing a GPU for LLMs.
Check LLM-specific benchmark lists like https://github.com/ggml-org/llama.cpp/discussions/10879 and pick a GPU that matches your existing one if you’re not going for identical models.
Learning 2 - Two GPUs do not double LLM token generation performance
This should be obvious, but I never really thought about it and assumed overall performance would scale with two GPUs.
Wrong again.
The main benefit of the second GPU is extra VRAM. Larger models can be split across both GPUs—but performance is actually worse than with a single card (see next point).
Learning 3 - Splitting models across two GPUs can tank performance
Using the same model as before, now with the RX 6800 XT:
./llama-bench -m gemma-3-12B-it-qat-GGUF/gemma-3-12B-it-QAT-Q4_0.gguf -mg 0,1 -sm none
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 ROCm devices:
Device 0: AMD Radeon RX 6800 XT, gfx1030 (0x1030), VMM: no, Wave Size: 32
Device 1: AMD Radeon RX 6800, gfx1030 (0x1030), VMM: no, Wave Size: 32
| model | size | params | backend | ngl | main_gpu | sm | test | t/s |
|---|---|---|---|---|---|---|---|---|
| gemma3 12B Q4_0 | 6.41 GiB | 11.77 B | ROCm | 99 | 0 | none | pp512 | 1070.86 ± 2.28 |
| gemma3 12B Q4_0 | 6.41 GiB | 11.77 B | ROCm | 99 | 0 | none | tg128 | 44.78 ± 0.06 |
| gemma3 12B Q4_0 | 6.41 GiB | 11.77 B | ROCm | 99 | 1 | none | pp512 | 875.96 ± 1.29 |
| gemma3 12B Q4_0 | 6.41 GiB | 11.77 B | ROCm | 99 | 1 | none | tg128 | 43.25 ± 0.01 |
The 6800 XT is only 3% faster than my old RX 6800 in token generation.
Now splitting the model across both cards:
./llama-bench -m gemma-3-12B-it-qat-GGUF/gemma-3-12B-it-QAT-Q4_0.gguf -mg 0,1
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 ROCm devices:
Device 0: AMD Radeon RX 6800 XT, gfx1030 (0x1030), VMM: no, Wave Size: 32
Device 1: AMD Radeon RX 6800, gfx1030 (0x1030), VMM: no, Wave Size: 32
| model | size | params | backend | ngl | main_gpu | test | t/s |
|---|---|---|---|---|---|---|---|
| gemma3 12B Q4_0 | 6.41 GiB | 11.77 B | ROCm | 99 | 0 | pp512 | 964.56 ± 2.04 |
| gemma3 12B Q4_0 | 6.41 GiB | 11.77 B | ROCm | 99 | 0 | tg128 | 31.92 ± 0.03 |
| gemma3 12B Q4_0 | 6.41 GiB | 11.77 B | ROCm | 99 | 1 | pp512 | 962.75 ± 1.56 |
| gemma3 12B Q4_0 | 6.41 GiB | 11.77 B | ROCm | 99 | 1 | tg128 | 31.90 ± 0.03 |
Interpretation (my guess):
- Prompt processing (pp512) seems truly parallelized: the result is roughly the average of both cards.
- Token generation (tg128) is slower than on a single card. This makes sense: both cards must work in sync, so there is extra overhead, likely from synchronization and maybe PCIe bandwidth limits.
In my case, splitting gave me 26% lower token generation speed compared to my slowest card.
The upside: I now have 32GB of VRAM for bigger models.
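For reference, this is roughly how the split can be controlled in plain llama.cpp (llama-server or llama-cli). Treat it as a sketch based on my reading of the -sm/-ts/-mg options, and note that the larger model path is just a placeholder; double-check the flags against your build.
# Keep the whole model on GPU 0 with no split (fastest when it fits into one card's VRAM):
./llama-server -m gemma-3-12B-it-qat-GGUF/gemma-3-12B-it-QAT-Q4_0.gguf -ngl 99 -sm none -mg 0
# Split by layers across both cards (the default split mode) for models too big for one card;
# -ts weights the split, e.g. roughly proportional to each card's VRAM:
./llama-server -m some-larger-model.gguf -ngl 99 -sm layer -ts 16,16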
Learning 4 - Consumer hardware (mainboard and case) pitfalls
Mainboard
- You need two physical PCIe x16 slots.
- These must support lane splitting (x16 → x8/x8). Some boards don't support this or use weird splits (x8/x1). x8 speeds didn't cause me performance issues (even in gaming), but x4 or x1 would likely bottleneck. See the commands after this list for how to check the link width your cards actually negotiated.
- Slot spacing matters—many modern GPUs are 3+ slots thick, which can block the second slot.
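On Linux you can verify what the slots actually negotiated. A rough sketch; the PCI address 03:00.0 is just an example, replace it with your GPU's address from lspci:
# List GPUs and their PCI addresses:
lspci | grep -i vga
# Show the negotiated link speed/width for one card (hypothetical address):
sudo lspci -vv -s 03:00.0 | grep LnkSta
# Or read it per DRM card from sysfs (16 = x16, 8 = x8, ...):
cat /sys/class/drm/card*/device/current_link_width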
Case
- I underestimated the heat and noise from two GPUs.
- With one GPU, my airflow-oriented build was fine.
- Adding the second GPU was a massive change—temps spiked and noise went from “barely there” to “annoying constant GPU roar.”
- If your build sits on your desk, fan performance and possibly sound-dampening panels become very important.
I’m still learning, and most of this is based on my own trial and error.
If I’ve misunderstood something, overlooked a better method, or drawn the wrong conclusions, I’d appreciate corrections.
Feel free to share your own benchmarks, tweaks, or experiences so others (including me) can learn from them.
LicensedTerrapin@reddit
Thank you for your service. I have a spare Intel Arc A770 16GB on my desk and a 24GB 3090 in my PC. I've been tempted to put the Arc to use and run them mixed in Vulkan. Thoughts as a newbie expert?
androidGuy547@reddit
I have 2 A770s (one in use, one spare) and want to pair them together, but I only have one PCIe x4 slot to spare. Wondering if that's going to bottleneck.
DrRamorey@reddit (OP)
Give it a try. I think it should work, at least under Linux. In LM Studio you can select only one runtime, Vulkan, which should abstract the hardware differences. My guess is it will work, but it would waste performance on the 3090, as CUDA is likely faster.
Would be interesting to see how much the Intel Arc card slows down the Nvidia 3090.
LicensedTerrapin@reddit
llama.cpp can do Vulkan as well; it would be considerably slower for sure. The other option is to get another 3090 and hope for the best, i.e. that they don't cook each other. But I don't really have money to invest in this right now 😭 I've seen people getting MI50s with 32GB; now that's a bargain, but I'd need a server board with loads of RAM. Local LLaMA is a pricey hobby.
Neither_Bath_5775@reddit
You can try using RPC to run a Vulkan host for the Arc and a CUDA one for the 3090.
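A rough sketch of how that could look with llama.cpp's RPC backend; the binary name, flags, and port are from memory, so treat them as assumptions and check the current rpc example docs:
# On the Vulkan build that owns the Arc, start a remote worker (port is arbitrary):
./rpc-server -p 50052
# On the CUDA build that owns the 3090, point inference at that worker:
./llama-server -m model.gguf -ngl 99 --rpc 127.0.0.1:50052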
Professional_Top3747@reddit
Thanks for the post. I'm planning to do a similar build. Based on my measurements, I will have roughly a 1 cm gap between the two GPUs. Is that okay?
DrRamorey@reddit (OP)
Personally I would say yes. My setup is similar. My RX 6800 XT is 2.5 slots wide, so I have about half a slot (~1 cm) of gap between my cards.
If you have good airflow in your case, then you should be able to keep temperatures and fan speed under control.
Professional_Top3747@reddit
Thank you!
lord_penguin77@reddit
Great write-up. Basically what I assumed about splitting across two GPUs on two PCIe slots. I don't know how production-grade GPUs work, as a cluster usually has multiple GPUs; probably some technique to link the cards in a fast way so they aren't bottlenecked by PCIe.
donatas_xyz@reddit
Just to add: PCIe x4 wouldn't bottleneck anything in any meaningful way. I've moved my RTX 4070 from x16 to x4, and the drop in FPS was about 1%. Same with token processing: no meaningful downgrade.
The same goes for the RTX 5070 Ti. There is no meaningful downgrade from x16 to x8.
I thought this might help someone trying to decide whether an x8/x4 split would work for them.
pyr0kid@reddit
Now this has me wondering if PCIe 4.0 x1 would also work.
divergentchessboard@reddit
Try it and find out.
I lose 9% performance running in PCIe x4 for 3D tasks and 50% when it comes to video encoding. Video encode hates latency for some reason.
DrRamorey@reddit (OP)
This is for one GPU only. Any idea or data on whether this would affect a dual-GPU LLM use case? Here, data would need to be transferred between the GPUs, if I'm not mistaken.
donatas_xyz@reddit
I'm using said 4070 + 5070 Ti combo and haven't noticed any drops in t/s. If VRAM is shared, I suspect the t/s will be limited by the slowest card in the combo. However, now I have 28GB of VRAM instead of 12GB, and even if the CPU is being used, I can offload more layers to VRAM and get a much better t/s overall than I would've with a single 12GB GPU :) I guess what I'm saying is that, at least in my case, the benefits of using 2 GPUs far outweigh any theoretical data throughput loss because of PCIe downgrades or whatnot.
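In llama.cpp terms that partial offload is just the layer count passed to -ngl; a minimal sketch with a placeholder model and layer count:
# Offload only 40 layers to the GPUs and keep the remaining layers on the CPU:
./llama-server -m some-30B-model-Q4_K_M.gguf -ngl 40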
v01dm4n@reddit
This is brilliant and very useful! Thanks for the efforts OP.
gounesh@reddit
So much to learn here! Ty for sharing. I have GPUs lying around from the household PCs: 4x 1080 Ti, 1x 2080 Ti, and my main 3080 Ti. I was considering buying a used 3090 or going all in with the new 5090. I think I'll give six cards a try. Hopefully the additional 2080 Ti and 3080 Ti work okay?
DarkEngine774@reddit
What to do on local devices
Marksta@reddit
So do you normally use -sm split (the default) or row? It's pretty interesting that you're facing heat issues if you're just splitting; realistically you're only adding ~20 watts net, as both cards take turns being idle. Overall heat in the case should be relatively identical, maybe a degree more.
Yeah, splitting drops performance in scenarios where you could have avoided splitting, but otherwise the perf increase is "infinite" on a model that's impossible for you to run without the split. So it isn't really an apples-to-apples comparison with that in mind.
DrRamorey@reddit (OP)
Normal split with -sm, which I assume is the default 50%/50% split for my cards, and also in LM Studio.
Row split as in moving certain layers/rows to specific cards? No, haven't tried this yet.
Confused about you saying the cards take turns being idle? I see this behavior sometimes, but not always. I have observed both cards at 100%, for sure during prompt processing, but also during generation.
Would this depend on the model used?
No, the card fans are not blocked. Lucky me, one card is 2.5 slots and the other 2 slots. Also the PSU shroud is perforated.
Marksta@reddit
So split is 'serial' and row is 'parallel'; in the terms of other inference engines like vLLM, split is pipeline parallel and row is tensor parallel.
In split mode, some of the work gets done on card 0, then some on card 1, then back to card 0... (or the CPU, if the model isn't fully offloaded to VRAM). So in your case there is roughly 50% downtime for each GPU per batch. Your example 12B Q4 is on the smaller side, so each turn might be less than a second and you wouldn't see it. With a larger model that fills up VRAM, or with more cards, it gets more apparent in nvtop or whatever activity graph you watch.
That said, it's pretty odd you're getting that much of a heat increase, but I guess that's just how physics works if the total heat overwhelms the case's ability to exhaust it. Maybe power limit to 80% or something; you can usually take some power off the table without losing much performance.
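If someone wants to measure the two modes directly, llama-bench should be able to sweep them in one run, along the lines of OP's commands above. A sketch only; I'm assuming -sm accepts comma-separated layer and row values here:
# Benchmark the same model with layer split vs. row split across both GPUs:
./llama-bench -m gemma-3-12B-it-qat-GGUF/gemma-3-12B-it-QAT-Q4_0.gguf -ngl 99 -sm layer,row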
DrRamorey@reddit (OP)
Tested with a larger model and could watch this switching from one card to the other.
I could also pinpoint the high-load scenario on both GPUs: it happens during the initial prompt processing, as I work with quite large input texts (10k-25k). It might also be model dependent; I get this initial full load with Mistral Small and Qwen3, but not with Qwen3-A3B.
DrRamorey@reddit (OP)
I think I understood your explanation of the split scenario. Will test with larger models and watch the utilization.
On my heat problem: likely a "standard" airflow problem. Right now most air is moved out by a single fan at the back of the case. The radiator is at the top, but its fans are not spinning fast enough to move much air out up there. So with two active cards this one fan struggles to move enough air. When I find time, I will change the configuration to get more exhaust.
Agreeable-Prompt-666@reddit
Have you tried Vulkan?
DrRamorey@reddit (OP)
Yes, but not really tested enough to get data; LM Studio only.
But I like the idea and will compile llama.cpp to run the same benchmark again.
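For anyone who wants to reproduce this, the Vulkan build should look roughly like the following; the GGML_VULKAN flag is taken from the llama.cpp build docs, so verify it for your version:
# Configure and build llama.cpp with the Vulkan backend, then rerun the same benchmark:
cmake -B build -DGGML_VULKAN=ON
cmake --build build --config Release -j
./build/bin/llama-bench -m gemma-3-12B-it-qat-GGUF/gemma-3-12B-it-QAT-Q4_0.gguf -mg 0,1 -sm none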
DrRamorey@reddit (OP)
I was curious to see Vulkan performance actually measured. It's slower than ROCm for my GPUs. Fresh compile of llama.cpp with Vulkan. I compared the ROCm results to the previous benchmark above; they were the same.
| model | size | params | backend | ngl | main_gpu | test | t/s |
|---|---|---|---|---|---|---|---|
| gemma3 12B Q4_0 | 6.41 GiB | 11.77 B | Vulkan | 99 | 0 | pp512 | 793.41 ± 0.88 |
| gemma3 12B Q4_0 | 6.41 GiB | 11.77 B | Vulkan | 99 | 0 | tg128 | 26.91 ± 0.07 |
| gemma3 12B Q4_0 | 6.41 GiB | 11.77 B | Vulkan | 99 | 1 | pp512 | 793.58 ± 0.68 |
| gemma3 12B Q4_0 | 6.41 GiB | 11.77 B | Vulkan | 99 | 1 | tg128 | 25.21 ± 0.24 |
deepspace_9@reddit
I was using a 7900 XTX and a 7800 XT. When using two GPUs, performance was much worse than with just one 7900 XTX, so I thought the 7800 XT was the bottleneck. Recently the 7900 XTX went on sale in my town (it looks like a local dealer is clearing out their inventory), so I bought one. But the speed of 2x 7900 XTX is almost the same as the old configuration.
DrRamorey@reddit (OP)
Thx for confirming my assumption that this communication overhead and PCIe bandwidth are causing it.
Zomboe1@reddit
Really awesome post, I like your formatting and the way you explain the things you've learned, and that you included your data. I only have a single GPU and have only played with image/video generation, but at some point I'd like to get another and play around with LLMs so I'll keep your lessons in mind.
Individual-Cattle-15@reddit
I'm not a thermals expert either. Similar learnings to you, except I went and got workstation-grade GPUs in a Dell 7960. My RTX 4500 Ada card draws 210W peak vs. a comparable consumer card (an RTX 4090, maybe?) that would draw 400+W for the same LLM capabilities.
Thermals, noise, and energy efficiency are all done well by Dell; it managed to cool decently with fans and shrouds alone! But my experiences requesting upgrades/modifications and spares for proprietary power cables have all been negative. I suppose they prefer selling you a built system, servicing it for 3 years, then selling you another new system, and so on. No upgrades allowed even when it's possible.
DrRamorey@reddit (OP)
I based my text on this PC build https://de.pcpartpicker.com/b/XXbKHx