Dual 3090 setup - performance optimization
Posted by PaMRxR@reddit | LocalLLaMA | View on Reddit | 37 comments
I have this machine right now:

- MSI B550-A PRO
- Ryzen 5 5600X, 4x16GB DDR4 3200 MHz
- RTX 3090 - PCIe4 x16 (~25GB/s)
- RTX 3090 - PCIe3 x4 (<3GB/s..)
I added the second GPU just recently and after a day of optimizing stuff settled on this setup:
| Model name | Model quant | KV quant | --ctx-size | pp/s | tg/s | Engine |
|---|---|---|---|---|---|---|
| Qwen3.5-122B-A10B | AesSedai Q4_K_M | q8_0 | 80000 | 1000 | 22 | ik_llama.cpp |
| Qwen3.5-27B | PaMRxR Q8_K_L | bf16 | 200000 | 1950 | 25 | llama.cpp |
| Qwen3.5-35B-A3B | PaMRxR Q8_K_L | bf16 | 260000 | 4366 | 102 | llama.cpp |
With --split-mode layer things work well, especially pp, but tg is not so ideal. With vLLM I got 50-60 tg/s on the 27B, but with a worse quant, a much worse 600 pp/s, and abysmal startup time. Overall not really worth it.
I wonder what others with dual 3090 get with these or similar models, especially if you have better transfer speeds between the GPUs? I suspect an X570 motherboard with PCIe4 8x/8x could improve tg especially with --split-mode row / graph. I just don't want to go into replacing it blindly because everything is wired in a water cooling loop which took a lot of time to setup. NVLink is unfortunately not possible as the GPUs are different brands.
Side note: the Q8_K_L are my own quantizations, basically Q8_0 with a few tensors selectively overridden to BF16. Still smaller than UD-Q8_K_XL while achieving better KLD. Credits to /u/TitwitMuffbiscuit and his kld-sweep tool which makes it easy to compare ppl/kld of multiple quants.
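For anyone wanting to reproduce this kind of custom quant, here is a rough sketch of the workflow with llama.cpp's tools. The tensor patterns and file names are placeholders, not the OP's exact recipe, and the `--tensor-type` override flag is assumed from recent llama-quantize builds:

```shell
# Produce a Q8_0 quant with a few tensors selectively overridden to BF16
# (which tensors to override is exactly what a KLD sweep helps decide):
./llama-quantize \
  --tensor-type "output.weight=bf16" \
  --tensor-type "token_embd.weight=bf16" \
  model-bf16.gguf model-Q8_K_L.gguf Q8_0

# Compare quants by KL divergence: first save base-model logits,
# then evaluate the quant against them:
./llama-perplexity -m model-bf16.gguf  -f calib.txt --kl-divergence-base logits.kld
./llama-perplexity -m model-Q8_K_L.gguf -f calib.txt --kl-divergence-base logits.kld --kl-divergence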
Minimum-Lie5435@reddit
Can get you more stats later, but I use vLLM with cyankiwi AWQ models: about 60 tps with the 27B at low input context and 130-140 with the 35B. I also run max_num_seqs=2 with the 35B and can get 110-120 TPS on both streams in parallel, which totals ~220. I have a Z490 board and an NVLink. Didn't find TP to be as good on llama.cpp or anything else.
Minimum-Lie5435@reddit
Also I'm pretty sure you can run NVLink on different brand cards? Assuming they're both 3090s.
McSendo@reddit
I think you still need to line them up right?
RedShiftedTime@reddit
How are you making bf16 fit full context on just 48 GB of VRAM?
PaMRxR@reddit (OP)
Which one do you mean? I run models at most at ~Q8_0 quantization.
jikilan_@reddit
Unsloth Q8 Qwen3.5 27B is about 20 t/s, 131k context. Unsloth Q8 Qwen3.5 35B is about 102 t/s, 256k context.
All using a release version of llama.cpp from 2-3 days ago. Z790, PCIe5 x16 + PCH PCIe4 x4, power limit at 70%.
PaMRxR@reddit (OP)
Thanks for sharing mate, it looks very similar to my numbers. My 27B quant is 31GB and I can fit 200k context. I don't power limit, which maybe explains my slightly faster tg of 25.
Have you tried --split-mode row in llama.cpp, or maybe vLLM with tensor parallel=2?
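For reference, a row-split invocation might look something like this (a sketch; the model path, context size, and tensor split are placeholders, flags from llama-server):

```shell
# --split-mode row splits individual tensors across both GPUs instead of
# whole layers, which can raise tg but is sensitive to inter-GPU bandwidth.
llama-server -m Qwen3.5-27B-Q8_0.gguf \
  -ngl 99 --split-mode row --tensor-split 1,1 \
  -c 65536 -fa auto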
jikilan_@reddit
As my second 3090 is on PCH PCIe, row mode makes the performance a lot worse.
I'm actually surprised you can achieve 200k+ context with Q8.
Have you tried the latest release? A form of tensor parallelism is now implemented in llama.cpp. I'm in the midst of moving to Ubuntu and haven't had the chance to continue my migration.
PaMRxR@reddit (OP)
I just rebuilt it to try -sm tensor, but it keeps crashing as soon as it gets done with prompt processing unfortunately. Probably needs some time for issues to be ironed out.
Pattinathar@reddit
Custom Q8_K_L quants with selective BF16 overrides is clever; getting better KLD than UD-Q8_K_XL at a smaller size is a solid win. Curious how much the PCIe3 x4 bottleneck actually hits during generation vs. prefill.
PaMRxR@reddit (OP)
It's difficult to find quantizations that are well optimized for 2x3090s (48GB VRAM + Q8/BF16 native support). I really think more people can benefit by tuning models specifically for their systems.
With --split-mode layer the slow transfer doesn't really matter as far as I know, but the GPUs are only utilized like 50% so I think tg is at least 2x slower than it could be.
fragment_me@reddit
I haven't seen it be 2x slower but it's definitely a little slower than the baseline of 1x 3090.
PaMRxR@reddit (OP)
I meant slower compared to something like tensor parallel which utilizes the GPUs much better. Otherwise -sm layer is slower in comparison to one 3090 mainly if you load a larger quant I'd guess.
Makers7886@reddit
I did some comparisons for dual 3090s running qwen3.5 27b Q8 via ik_llama. The 3090s are on 4.0x16 slots (epyc server w/romed8-2t).
PaMRxR@reddit (OP)
So the bandwidth affects pp too, although a little less than tg (17% vs 26%). I just tried ik_llama with -sm graph and I'm getting 35.4 tg, pretty much the same as you! But pp tanks from 1950 to 770 (-b/-ub 512), or 870 (-b/-ub 4096) for me. Do you have the startup command? Maybe I'm missing/misusing some parameter. Here's what I used:
Poha_Best_Breakfast@reddit
I don’t think Qwen 3.5 122B fits on dual 3090.
I run a dual model setup on my dual 3090s.
GPU0: Gemma4 31B IQ4_XS, 128k KV cache Q8 with attn_rotation. TG: 38 tok/s PP: around 400 IIRC
GPU1: Gemma4 26B UD-Q4_K_XL, 256K KV cache Q8 with attn_rotation: TG: 115 tok/s, PP: 1100 tok/s.
I run them as an agent + subagent pair and the output is better than a single model's.
Earlier I was running Qwopus V3 27B on GPU 0 and Qwen 3.5 35B on GPU1.
In an ideal world I’d run a 70-80B model but currently all the 70B class models are outdated.
fragment_me@reddit
FYI, in case you weren't aware, Gemma4 supports speculative decoding (without a separate draft model). Add this to your llama-cpp invocation for free tokens.
--spec-type ngram-mod --spec-ngram-size-n 32 --draft-min 24 --draft-max 48
Poha_Best_Breakfast@reddit
Let me try that tonight and reply back how it worked. Currently getting 37-39 tok/s on the 31B and 115 on the 26B. If these are improved I'll love it.
PaMRxR@reddit (OP)
Qwen3.5 122B does not fit fully, of course; the quant I use is 72 GB, so ~36GB of experts run in system RAM. I find it generally a bit dumber than Qwen3.5 27B anyway, but occasionally it comes out ahead in debugging tasks, where its slightly deeper knowledge of obscure details helps.
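For anyone curious how this kind of partial offload works: llama.cpp can keep the MoE expert tensors in system RAM via a tensor-override regex while everything else stays on the GPUs. A sketch, with an illustrative regex and placeholder paths:

```shell
# Offload all layers to GPU (-ngl 99), but override the MoE expert tensors
# to CPU so the ~36GB of experts that don't fit live in system RAM:
llama-server -m Qwen3.5-122B-A10B-Q4_K_M.gguf \
  -ngl 99 -ot "ffn_.*_exps=CPU" \
  -c 80000 -fa auto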
Actually, I run a VERY similar system: Qwen3.5 27B agent + Qwen3.5 35B-A3B subagent. But I run each fully across both cards at Q8_0+, swapping them back and forth with llama-swap. Have you tried such an arrangement? The swapping makes it slower overall, pp is seriously faster, and tg is slower due to the larger weights, I think. Whether it has a significant impact on quality is hard to tell though.
Do you find the Gemma4 combo better than Qwen3.5 btw?
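The swapping here is handled by llama-swap, which proxies OpenAI-style requests and starts/stops a llama-server instance per model on demand. A minimal config might look roughly like this (a sketch; model names, paths, and flags are placeholders, check the llama-swap README for the exact schema):

```yaml
models:
  "qwen3.5-27b":
    cmd: llama-server --port ${PORT} -m Qwen3.5-27B-Q8_K_L.gguf -ngl 99 -fa auto
  "qwen3.5-35b-a3b":
    cmd: llama-server --port ${PORT} -m Qwen3.5-35B-A3B-Q8_K_L.gguf -ngl 99 -fa auto
```

Requests addressed to one model name unload the other, which is where the swap latency mentioned above comes from.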
Poha_Best_Breakfast@reddit
I benchmarked them, and for coding the quality difference between Q8 and Q4 was negligible, and much less than 2-pass (doing -> fixing). For text generation the difference will obviously be larger. This is of course using good-quality Q4 quants (like UD_Q4_K_XL, which intelligently uses 8+ bits on the important layers).
Yes, Gemma4 is definitely better after the llama.cpp fixes. Qwen overthinks a lot. The capabilities are similar, but Gemma takes 3-4x fewer tokens to arrive at the same conclusion.
IMO Gemma 4 31B > Qwen 3.5 27B, but Gemma4 26B ~ Qwen 3.5 35B. I still prefer the 26B due to less reasoning, and it allows me to fit 250k context.
Fidrick@reddit
Thanks for sharing, I'm interested in your setup..
If you mind clarifying, do you feed the output from the agent to the sub-agent to improve it, run them in parallel and compare results, or use one as a driver of the other?
Poha_Best_Breakfast@reddit
My setup works the way claude code's ULTRAPLAN mode works (but funnily I was using it before it got leaked).
There's a primary agent: the slower, smarter one (this will be something like qwen 3.5 27B, or gemma 4 31B, I prefer the latter), and the subagent: the faster, slightly less smart one (qwen 3.5 35B A3B or gemma 4 26B A4B).
Using opencode, you can have the primary agent manage and use the subagent. For example, before a task it can use the subagent to do research and search online. The subagent is fast and doesn't pollute the context window of the primary agent. For doing a task, the primary agent can use the subagent multiple times (100-130 tok/s is very fast), review its output, and ask it to either re-do or move on. It's all about agentic planning and skill files from there, but you can use this pattern to create something which mimics a much more capable model.
Now if you want to take it one step further you can have a frontier model to oversee these (like claude opus or gemini 3.1 Pro) and help when needed and allow these to escalate to cloud model when needed. You get the latency and speed of the local models + capability of the frontier when needed without paying the full task cost.
Ok-Measurement-1575@reddit
Don't forget -sm tensor
viperx7@reddit
I have a 4090+3090ti and I get 42t/s on Qwen3.5 27B Q8_XL with -sm tensor
viperx7@reddit
u/PaMRxR u/jikilan_ u/eribob
Qwen3.5 27B
Tensor config (speed 42t/s)
llama-server --host 0.0.0.0 --port 5000 -fa auto --no-mmap --jinja -fit off --no-op-offload -sm tensor -m Qwen3.5-27B-Q8_0.gguf

This leaves 4.9 GB free VRAM.
Note: up until yesterday I was able to load the mmproj alongside the 27B in this config, but since another llama.cpp update I no longer can (I hope it will be fixed soon, as enough VRAM is available).
Alternative config with mmproj (speed 29t/s)
llama-server --host 0.0.0.0 --port 5000 -fa auto --no-mmap --jinja -fit off --no-op-offload -m Qwen3.5-27B-Q8_0.gguf --mmproj mmproj-F16.gguf

Qwen3.5 35B
llama-server --host 0.0.0.0 --port 5000 -fa auto --no-mmap --jinja -fit off --no-op-offload -m Qwen_Qwen3.5-35B-A3B-Q8_0.gguf --mmproj mmproj-Qwen_Qwen3.5-35B-A3B-f16.gguf -ts 22,24

Correction: speeds with Qwen3.5 35B are 128t/s, not 120t/s; when starting with a context of 100k it goes down to 98t/s.
PCIe config: 4090 at 5GB/s, 3090 Ti at 25GB/s
eribob@reddit
Care to share the 27B Q8 setup? I can only fit about 130k context on my dual 3090s without KV quantization. Running vllm.
fragment_me@reddit
Here's what I use for 2x 3090 on llama-cpp. First one is 190K KV at Q8 and the second is 145K KV at F16.
PaMRxR@reddit (OP)
Do you run llama.cpp or vLLM? Could you also share what PCIe configuration you have?
jikilan_@reddit
Can you share the parameters for the 27B? Whose quant are you using?
raketenkater@reddit
You guys should try my auto optimization script to get better performance without the hassle of tuning flags manually https://github.com/raketenkater/llm-server
fragment_me@reddit
Yo this is kind of cool!
AdamDhahabi@reddit
My Frankenstein build runs Qwen 3.5 122B IQ4_XS GGUF (Bartowski) with 200K context at 50 t/s (first few thousands of tokens). Specs: 2x 5070 Ti + 3090 + 5060 Ti 16GB (mix of expensive Blackwells and a single 3090 to keep it affordable).
a_beautiful_rhind@reddit
P2P driver and I guess subdivide the x4-16x
nicholas_the_furious@reddit
Can you expand on the P2P driver? Is this more than just the latest Nvidia driver?
12bitmisfit@reddit
AFAIK the P2P-patched drivers allow the cards to skip system memory and talk directly to each other. It's a feature explicitly not enabled on consumer cards.
a_beautiful_rhind@reddit
It's a patched open-source driver that allows you to do P2P on cards that don't support it or don't have NVLink.
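Whether P2P is actually in effect can be checked with standard NVIDIA/CUDA tooling before and after installing the patched driver (the sample binary path varies by cuda-samples version):

```shell
# Show the link topology between the GPUs (PHB, PIX, NV#, etc.):
nvidia-smi topo -m

# The CUDA samples include a P2P bandwidth/latency test which reports
# whether peer access is enabled and the achieved GB/s between devices:
./p2pBandwidthLatencyTest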
PaMRxR@reddit (OP)
Just yesterday I actually tried compiling that P2P driver patch and ran into various issues, both compilation and linking errors. I don't expect miracles from it anyway; the PCIe3 slot is physically max x4, ~4GB/s.