Dual 3090 setup - performance optimization
Posted by PaMRxR@reddit | LocalLLaMA | View on Reddit | 37 comments
I have this machine right now:

- MSI B550-A PRO
- Ryzen 5 5600X, 4x16GB DDR4 3200 MHz
- RTX 3090 - PCIe4 x16 (~25GB/s)
- RTX 3090 - PCIe3 x4 (<3GB/s..)
I added the second GPU just recently and after a day of optimizing stuff settled on this setup:
| Model name | Model quant | KV quant | --ctx-size | pp/s | tg/s | Engine |
|---|---|---|---|---|---|---|
| Qwen3.5-122B-A10B | AesSedai Q4_K_M | q8_0 | 80000 | 1000 | 22 | ik_llama.cpp |
| Qwen3.5-27B | PaMRxR Q8_K_L | bf16 | 200000 | 1950 | 25 | llama.cpp |
| Qwen3.5-35B-A3B | PaMRxR Q8_K_L | bf16 | 260000 | 4366 | 102 | llama.cpp |
With --split-mode layer things work well, especially pp, but tg is not so ideal. With vLLM I got 50-60 tg/s on the 27B, but with a worse quant, a much worse 600 pp/s, and abysmal startup time. Overall not really worth it.
I wonder what others with dual 3090 get with these or similar models, especially if you have better transfer speeds between the GPUs? I suspect an X570 motherboard with PCIe4 8x/8x could improve tg especially with --split-mode row / graph. I just don't want to go into replacing it blindly because everything is wired in a water cooling loop which took a lot of time to setup. NVLink is unfortunately not possible as the GPUs are different brands.
Side note: the Q8_K_L are my own quantizations, basically Q8_0 with a few tensors selectively overridden to BF16. Still smaller than UD-Q8_K_XL while achieving better KLD. Credits to /u/TitwitMuffbiscuit and his kld-sweep tool which makes it easy to compare ppl/kld of multiple quants.
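For anyone wanting to reproduce this kind of custom quant, here is a rough sketch of the workflow with llama.cpp's tools. The tensor patterns and file names are placeholders, not the OP's exact recipe, and the `--tensor-type` override flag is assumed from recent llama-quantize builds:

```shell
# Produce a Q8_0 quant with a few tensors selectively overridden to BF16
# (which tensors to override is exactly what a KLD sweep helps decide):
./llama-quantize \
  --tensor-type "output.weight=bf16" \
  --tensor-type "token_embd.weight=bf16" \
  model-bf16.gguf model-Q8_K_L.gguf Q8_0

# Compare quants by KL divergence: first save base-model logits,
# then evaluate the quant against them:
./llama-perplexity -m model-bf16.gguf  -f calib.txt --kl-divergence-base logits.kld
./llama-perplexity -m model-Q8_K_L.gguf -f calib.txt --kl-divergence-base logits.kld --kl-divergence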
Minimum-Lie5435@reddit
Can get you more stats later, but I use vLLM with cyankiwi AWQ models: about 60 tps with the 27B at low input context and 130-140 with the 35B. I also run max_num_seqs=2 with the 35B and can get 110-120 TPS on both streams in parallel, which totals ~220. I have a Z490 board and an NVLink. Didn't find TP to be as good on llama.cpp or anything else.
Minimum-Lie5435@reddit
Also I'm pretty sure you can run NVLink on different brand cards? Assuming they're both 3090s.
McSendo@reddit
I think you still need to line them up right?
RedShiftedTime@reddit
How are you making bf16 fit full context on just 48 GB of VRAM?
PaMRxR@reddit (OP)
Which one do you mean? I run models at most at ~Q8_0 quantization.
jikilan_@reddit
Unsloth Q8 Qwen3.5 27B is about 20 t/s, 131k context. Unsloth Q8 Qwen3.5 35B is about 102 t/s, 256k context.
All using a release version of llama.cpp from 2-3 days ago. Z790, PCIe5 x16 + PCH PCIe4 x4, power limit at 70%.
PaMRxR@reddit (OP)
Thanks for sharing mate, it looks very similar to my numbers. My 27B quant is 31GB and I can fit 200k context. I don't power limit, which maybe explains my slightly faster tg of 25.
Have you tried --split-mode row in llama.cpp, or maybe vLLM with tensor parallel=2?
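For reference, a row-split invocation might look something like this (a sketch; the model path, context size, and tensor split are placeholders, flags from llama-server):

```shell
# --split-mode row splits individual tensors across both GPUs instead of
# whole layers, which can raise tg but is sensitive to inter-GPU bandwidth.
llama-server -m Qwen3.5-27B-Q8_0.gguf \
  -ngl 99 --split-mode row --tensor-split 1,1 \
  -c 65536 -fa auto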
jikilan_@reddit
As my second 3090 is on PCH PCIe, row mode makes the performance a lot worse.
I'm actually surprised you can achieve 200k+ context with Q8.
Have you tried the latest release? A form of tensor parallelism is now implemented in llama.cpp. I'm in the midst of moving to Ubuntu and haven't had the chance to continue my migration.
PaMRxR@reddit (OP)
I just rebuilt it to try -sm tensor, but it keeps crashing as soon as it gets done with prompt processing unfortunately. Probably needs some time for issues to be ironed out.
Pattinathar@reddit
Custom Q8_K_L quants with selective BF16 overrides is clever; getting better KLD than UD-Q8_K_XL at a smaller size is a solid win. Curious how much the PCIe3 x4 bottleneck actually hits during generation vs. prefill.
PaMRxR@reddit (OP)
It's difficult to find quantizations that are well optimized for 2x3090s (48GB VRAM + Q8/BF16 native support). I really think more people can benefit by tuning models specifically for their systems.
With --split-mode layer the slow transfer doesn't really matter as far as I know, but the GPUs are only utilized like 50% so I think tg is at least 2x slower than it could be.
fragment_me@reddit
I haven't seen it be 2x slower but it's definitely a little slower than the baseline of 1x 3090.
PaMRxR@reddit (OP)
I meant slower compared to something like tensor parallel which utilizes the GPUs much better. Otherwise -sm layer is slower in comparison to one 3090 mainly if you load a larger quant I'd guess.
Makers7886@reddit
I did some comparisons for dual 3090s running qwen3.5 27b Q8 via ik_llama. The 3090s are on 4.0x16 slots (epyc server w/romed8-2t).
PaMRxR@reddit (OP)
So the bandwidth affects pp too, although a little less than tg (17% vs 26%). I just tried ik_llama with -sm graph and I'm getting 35.4 tg, pretty much the same as you! But pp tanks from 1950 to 770 (-b/-ub 512), or 870 (-b/-ub 4096) for me. Do you have the startup command? Maybe I'm missing/misusing some parameter. Here's what I used:
Poha_Best_Breakfast@reddit
I don’t think Qwen 3.5 122B fits on dual 3090.
I run a dual model setup on my dual 3090s.
GPU0: Gemma4 31B IQ4_XS, 128k KV cache Q8 with attn_rotation. TG: 38 tok/s PP: around 400 IIRC
GPU1: Gemma4 26B UD-Q4_K_XL, 256K KV cache Q8 with attn_rotation: TG: 115 tok/s, PP: 1100 tok/s.
I run them as an agent + subagent pair and the output is better than a single model's.
Earlier I was running Qwopus V3 27B on GPU 0 and Qwen 3.5 35B on GPU1.
In an ideal world I’d run a 70-80B model but currently all the 70B class models are outdated.
fragment_me@reddit
FYI, in case you weren't aware, Gemma4 supports speculative decoding (without a separate draft model). Add this to your llama-cpp invocation for free tokens.
--spec-type ngram-mod --spec-ngram-size-n 32 --draft-min 24 --draft-max 48
Poha_Best_Breakfast@reddit
Let me try that tonight and reply back how it worked. Currently getting 37-39 tok/s on the 31B and 115 on the 26B. If these are improved I'll love it.
PaMRxR@reddit (OP)
Qwen3.5 122B does not fit fully, of course; the quant I use is 72 GB, so ~36GB of experts run in system RAM. I find it generally a bit dumber than Qwen3.5 27B anyway, but occasionally it comes out ahead in debugging tasks, where its slightly deeper knowledge of obscure details helps.
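For anyone curious how this kind of partial offload works: llama.cpp can keep the MoE expert tensors in system RAM via a tensor-override regex while everything else stays on the GPUs. A sketch, with an illustrative regex and placeholder paths:

```shell
# Offload all layers to GPU (-ngl 99), but override the MoE expert tensors
# to CPU so the ~36GB of experts that don't fit live in system RAM:
llama-server -m Qwen3.5-122B-A10B-Q4_K_M.gguf \
  -ngl 99 -ot "ffn_.*_exps=CPU" \
  -c 80000 -fa auto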
Actually, I run a VERY similar system: Qwen3.5 27B agent + Qwen3.5 35B-A3B subagent. But I run each fully across both cards at Q8_0+, swapping them back and forth with llama-swap. Have you tried such an arrangement? The swapping makes it slower overall, pp is seriously faster, and tg is slower due to the larger weights, I think. Whether it has a significant impact on quality is hard to tell though.
Do you find the Gemma4 combo better than Qwen3.5 btw?
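The swapping here is handled by llama-swap, which proxies OpenAI-style requests and starts/stops a llama-server instance per model on demand. A minimal config might look roughly like this (a sketch; model names, paths, and flags are placeholders, check the llama-swap README for the exact schema):

```yaml
models:
  "qwen3.5-27b":
    cmd: llama-server --port ${PORT} -m Qwen3.5-27B-Q8_K_L.gguf -ngl 99 -fa auto
  "qwen3.5-35b-a3b":
    cmd: llama-server --port ${PORT} -m Qwen3.5-35B-A3B-Q8_K_L.gguf -ngl 99 -fa auto
```

Requests addressed to one model name unload the other, which is where the swap latency mentioned above comes from.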
Poha_Best_Breakfast@reddit
I benchmarked them, and for coding the quality difference between Q8 and Q4 was negligible, and much less than 2-pass (doing -> fixing). For text generation the difference will obviously be larger. This is of course using good-quality Q4 quants (like UD_Q4_K_XL, which intelligently uses 8+ bits on the important layers).
Yes, Gemma4 is definitely better after the llama.cpp fixes. Qwen overthinks a lot. The capabilities are similar, but Gemma takes 3-4x fewer tokens to arrive at the same conclusion.
IMO Gemma 4 31B > Qwen 3.5 27B, but Gemma4 26B ~ Qwen 3.5 35B. I still prefer the 26B due to less reasoning, and it allows me to fit 250k context.
Fidrick@reddit
Thanks for sharing, I'm interested in your setup..
If you mind clarifying, do you feed the output from the agent to the sub-agent to improve it, run them in parallel and compare results, or use one as a driver of the other?
Poha_Best_Breakfast@reddit
My setup works the way claude code's ULTRAPLAN mode works (but funnily I was using it before it got leaked).
There's a primary agent: the slower, smarter one (this will be something like qwen 3.5 27B, or gemma 4 31B, I prefer the latter), and the subagent: the faster, slightly less smart one (qwen 3.5 35B A3B or gemma 4 26B A4B).
Using opencode, you can have the primary agent manage and use the subagent. For example, before a task it can use the subagent to do research and search online. The subagent is fast and doesn't pollute the context window of the primary agent. For doing a task, the primary agent can use the subagent multiple times (100-130 tok/s is very fast), review its output, and ask it to either re-do or move on. It's all about agentic planning and skill files from there, but you can use this pattern to create something which mimics a much more capable model.
Now if you want to take it one step further you can have a frontier model to oversee these (like claude opus or gemini 3.1 Pro) and help when needed and allow these to escalate to cloud model when needed. You get the latency and speed of the local models + capability of the frontier when needed without paying the full task cost.
Ok-Measurement-1575@reddit
Don't forget -sm tensor
viperx7@reddit
I have a 4090+3090ti and I get 42t/s on Qwen3.5 27B Q8_XL with -sm tensor
viperx7@reddit
u/PaMRxR u/jikilan_ u/eribob
Qwen3.5 27B
Tensor config (speed 42t/s)
llama-server --host 0.0.0.0 --port 5000 -fa auto --no-mmap --jinja -fit off --no-op-offload -sm tensor -m Qwen3.5-27B-Q8_0.gguf

This leaves 4.9 GB free VRAM.
Note: up until yesterday I was able to load the mmproj alongside the 27B in this config, but since another llama.cpp update I no longer can (I hope it will be fixed soon, as enough VRAM is available).
Alternative config with mmproj (speed 29t/s)
llama-server --host 0.0.0.0 --port 5000 -fa auto --no-mmap --jinja -fit off --no-op-offload -m Qwen3.5-27B-Q8_0.gguf --mmproj mmproj-F16.gguf

Qwen3.5 35B
llama-server --host 0.0.0.0 --port 5000 -fa auto --no-mmap --jinja -fit off --no-op-offload -m Qwen_Qwen3.5-35B-A3B-Q8_0.gguf --mmproj mmproj-Qwen_Qwen3.5-35B-A3B-f16.gguf -ts 22,24

Correction: speeds with Qwen3.5 35B are 128t/s, not 120t/s; when starting with a context of 100k it goes down to 98t/s.
PCIe config: 4090 at 5GB/s, 3090 Ti at 25GB/s
eribob@reddit
Care to share the 27B Q8 setup? I can only fit about 130k context on my dual 3090s without KV quantization. Running vllm.
fragment_me@reddit
Here's what I use for 2x 3090 on llama-cpp. First one is 190K KV at Q8 and the second is 145K KV at F16.
PaMRxR@reddit (OP)
Do you run llama.cpp or vLLM? Could you also share what PCIe configuration you have?
jikilan_@reddit
Can you share the parameters for the 27B? Whose quant are you using?
raketenkater@reddit
You guys should try my auto optimization script to get better performance without the hassle of tuning flags manually https://github.com/raketenkater/llm-server
fragment_me@reddit
Yo this is kind of cool!
AdamDhahabi@reddit
My Frankenstein build runs Qwen 3.5 122B IQ4_XS GGUF (Bartowski) with 200K context at 50 t/s (first few thousands of tokens). Specs: 2x 5070 Ti + 3090 + 5060 Ti 16GB (mix of expensive Blackwells and a single 3090 to keep it affordable).
a_beautiful_rhind@reddit
P2P driver and I guess subdivide the x4-16x
nicholas_the_furious@reddit
Can you expand on the P2P driver? Is this more than just the latest Nvidia driver?
12bitmisfit@reddit
AFAIK the P2P-patched drivers allow the cards to skip system memory and talk directly to each other. It's a feature explicitly not enabled on consumer cards.
a_beautiful_rhind@reddit
It's a patched open-source driver that allows you to do P2P on cards that don't support it or don't have NVLink.
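Whether P2P is actually in effect can be checked with standard NVIDIA/CUDA tooling before and after installing the patched driver (the sample binary path varies by cuda-samples version):

```shell
# Show the link topology between the GPUs (PHB, PIX, NV#, etc.):
nvidia-smi topo -m

# The CUDA samples include a P2P bandwidth/latency test which reports
# whether peer access is enabled and the achieved GB/s between devices:
./p2pBandwidthLatencyTest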
PaMRxR@reddit (OP)
Just yesterday I actually tried compiling that P2P driver patch and ran into various issues, both compilation and linking errors. I don't expect miracles from it anyway; the PCIe3 slot is physically max x4, ~4GB/s.