Finding the 4x 3090 Sweet Spot
Posted by anitamaxwynnn69@reddit | LocalLLaMA | View on Reddit | 38 comments

In another post someone asked me about the power draw of my 4x 3090 setup, so I'm sharing a full test I ran to understand the efficiency curve. Used this blog post (not mine) as a reference.
Setup:
- GPUs: 4x RTX 3090 (Dell OEM, EVGA XC3, 2x ASUS Strix)
- PCIe Topology: Gen 3 (Bifurcated: x16 / x8 / x8 / x4)
- Model: Qwen3.6-27B (FP16)
- Backend: vLLM v0.20.2 (TP=4)
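For anyone who wants to reproduce the sweep, it was basically just setting the same per-GPU limit with nvidia-smi and re-running the identical benchmark at each level. A rough sketch (the benchmark script is a placeholder, not my actual command):

```bash
# Sweep sketch: apply the same power limit to all four cards, then benchmark.
# (The "unrestricted" row is just the cards at their stock 350/390 W limits.)
sudo nvidia-smi -pm 1                     # persistence mode so the limit sticks
for limit in 300 275 250 220 200; do
    sudo nvidia-smi -pl "$limit"          # applies to all GPUs; use -i <idx> for one card
    ./run_benchmark.sh "$limit"           # placeholder for your vLLM benchmark run
done
```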
| Power Limit per GPU (W) | Output (t/s) | Total Throughput (t/s) | Efficiency (t/joule) |
|---|---|---|---|
| 350/390 (Unrestricted) | 29 | 269 | 0.77 |
| 300 | 29 | 268 | 0.89 |
| 275 | 29 | 265 | 0.96 |
| 250 | 29 | 261 | 1.04 |
| 220 | 27 | 248 | 1.13 |
| 200 | 24 | 221 | 1.11 |
Takeaways:
- The 220W sweet spot: peak efficiency at 1.13 t/joule (matches the blog's findings)
- Diminishing returns: raising the limit beyond 250W buys almost no extra throughput (261 → 269 t/s total going from 250W to unrestricted)
Hope this helps someone. Happy to answer any questions.
I'm VERY satisfied with Qwen 3.6 27B as a daily driver, but I would still like to know if there are any better/bigger models I can run on this setup. My understanding is that the best I can do is DSv4 at Q2 - not sure if it's fully supported yet though.
Additional context: it's an open build on a generic mining frame. I'm cooling it with 10x TL-C12C-S fans (5 on each side of the GPUs, blowing perpendicular to the cards). I finished building this very recently so I'm open to suggestions on how to improve it.
laul_pogan@reddit
Output t/s is flat from 250W to 350W because decode is memory-bandwidth-bound, not compute-bound. 3090 GDDR6X bandwidth barely changes with power limit, so you hit the same ~29 t/s regardless. PP drops at 200W because prefill IS compute-bound. That's why 220W is the sweet spot: you're preserving the thing that matters (memory BW) while shedding watts on the thing that's already past diminishing returns (shader clock). For bigger models on your setup, 96GB VRAM fits a 70B in Q4 comfortably (~35-40GB). Qwen3-72B at Q4_K_M via vLLM TP=4 would be worth a shot before going to DSv4 Q2 territory.
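Rough back-of-envelope to see why (my numbers, approximate): 27B at FP16 is ~54 GB of weights, so with TP=4 each 3090 streams ~13.5 GB per decoded token; at ~936 GB/s of GDDR6X bandwidth that's a ceiling of roughly 65-70 t/s before any communication overhead, and that ceiling doesn't move when you drop the core clock, which is why the observed ~29 t/s stays flat across power limits.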
anitamaxwynnn69@reddit (OP)
I really appreciate you taking the time to explain that. It makes a lot more sense now. Regarding bigger models, has your experience with the 72B been better than 3.6 27B?
grunt_monkey_@reddit
Do you think it helps if you bifurcate the x16 so that everything is x8? Would it even out things for inference? Usually you get rate limited by the slowest card which is x4 in this instance.
anitamaxwynnn69@reddit (OP)
Yeah I would've loved to go x8 x8 x8 x8 lol. But my mobo (Gigabyte X299 Designare EX) just doesn't support bifurcation like that. And yes, I think that would've improved the whole situation quite a bit.
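If anyone wants to check what their cards actually negotiated, nvidia-smi can report it directly. A quick sketch (the "current" values can downshift at idle, so query while the GPUs are busy):

```bash
# Negotiated PCIe generation/width per GPU; poll under load, since the link
# can drop to a lower speed at idle for power saving.
nvidia-smi --query-gpu=index,name,pcie.link.gen.current,pcie.link.width.current,pcie.link.width.max \
           --format=csv
```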
a_beautiful_rhind@reddit
consider the p2p driver
Judtoff@reddit
3x 3090s checking in. This is the first I'm hearing about p2p. Is it available on Linux? Does it work with llama.cpp?
a_beautiful_rhind@reddit
Yep: https://github.com/aikitoria/open-gpu-kernel-modules
works with mixed nvlink/pcie now too.
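If you want to sanity-check that peer access is actually enabled after installing it, something like this (flags from memory; the -p2p option needs a reasonably recent nvidia-smi):

```bash
nvidia-smi topo -m          # link matrix between GPU pairs (NVLink vs PCIe paths)
nvidia-smi topo -p2p r      # peer-to-peer read capability matrix
```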
anitamaxwynnn69@reddit (OP)
Is it stable? I understand it's a bit invasive on the kernel/OS, right?
a_beautiful_rhind@reddit
The p2p part is fine. The open drivers from nvidia aren't exactly perfect.
anitamaxwynnn69@reddit (OP)
I re-tested everything with the patched vLLM but saw no improvement. I think the x4 link just chokes my entire system too hard. But it's definitely promising for anyone else reading this.
Judtoff@reddit
Dang what a wonderful thing to discover, thank you!!
anitamaxwynnn69@reddit (OP)
Wow, I am just now finding out about this. Thank you for opening my eyes. This seems very promising.
sparticleaccelerator@reddit
What are your idle temps and delta-T under sustained load with that fan setup? Considering a similar open-frame 4x build and trying to figure out if perpendicular intake actually beats the usual "fans blowing across the stack" approach.
anitamaxwynnn69@reddit (OP)
I actually ran a background script polling nvidia-smi every 5 seconds during the entire benchmark suite, so I have the exact numbers for you lol.
Idle Temps:
Max Temps (Sustained Load @ 350/390W):
That gives you a delta-T of roughly +31°C to +39°C under maximum unrestricted load. I wouldn't take these at face value though, because nvidia-smi doesn't report all the temps. I found out later about this, which reports individual temps for VRAM/junction/core and is a lot more useful.
About the fans, I've actually run the setup without the 10x fans and it was quite a bit worse (the Dell OEM going up to 110°C on the VRAM). With the fans, even during peak load it doesn't cross 103-104°C. So the fans have earned their place haha. Also, I have seen people literally use cheap used box fans for their quad 3090s and even that works very well XD. So as long as you're moving air you should be good.
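For anyone who wants to log the same way, the background script is basically just nvidia-smi's loop mode, roughly along these lines (field list is illustrative, not my exact script):

```bash
# Poll power and temps every 5 s and append to a CSV. Note nvidia-smi only
# exposes the core temperature here, not VRAM/junction, hence the caveat above.
nvidia-smi --query-gpu=timestamp,index,power.draw,temperature.gpu,utilization.gpu \
           --format=csv -l 5 >> gpu_log.csv
```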
sparticleaccelerator@reddit
Thanks -- this is super helpful
Xamanthas@reddit
This has been done so many times already. Please utilise search for your own benefit. Almost every time, a ~225W power limit was the sweet spot.
oxygen_addiction@reddit
For coding https://huggingface.co/unsloth/Qwen3-Coder-Next-GGUF will be better in certain tasks than Qwen 27B.
And Qwen 3.5 122B will have more world knowledge. https://huggingface.co/AesSedai/Qwen3.5-122B-A10B-GGUF
anitamaxwynnn69@reddit (OP)
Really appreciate the response, I will try them out. Do you think Qwen3.5 122B Q5 is a better general-purpose model than Qwen 3.6 27B FP16?
oxygen_addiction@reddit
It's for you to decide.
27B is smarter overall, while 122B has more world knowledge. With proper search tools, 27B is probably better overall. But there will be tasks where the nearly 100B extra parameters matter.
FullOf_Bad_Ideas@reddit
How much ram do you have?
You should be able to squeeze in Mistral Medium 3.5 128B but it's hard to say if it's any better than Qwen 3.6 27B based on public opinion. If you have some RAM maybe there's a way to get Minimax M2.7 working well.
lemondrops9@reddit
Are you running Windows? The speed seems slow, or is this not running in parallel? I'm asking because I got into the lower 30s with two 3090s running pipeline parallel in LM Studio.
Septerium@reddit
It is a dense model at full precision (FP16), so speed seems to be as expected
lemondrops9@reddit
Makes sense, guess I expected parallelism to help more? I got curious and limited my two 3090s to 285W; this is what I got in llama.cpp if you're curious.
anitamaxwynnn69@reddit (OP)
Yeah I believe it's more about the PCIe 3.0 x4. The entire setup is basically choked because of that. TP takes a hard hit with a bad interconnect, I believe. I'm looking to upgrade to EPYC if I can find a good deal on a motherboard.
Ok-Measurement-1575@reddit
He's handicapped by the x4 link, I believe, for the all reduce.
lemondrops9@reddit
Thank you. I missed that it was Gen3 x4.
Ok-Measurement-1575@reddit
I suppose it's quite likely all your numbers would change if you didn't have one card choking the ring at x4?
You'd see increased power draw and higher tokens/s as a result, is my guess.
anitamaxwynnn69@reddit (OP)
Nah, the setup is x16-x8-x8-x4 lol. But yes, you're 100% right that the x4 literally chokes up the whole thing. If I had a mobo which supported clean bifurcation to x8 x8 x8 x8, it would probably draw slightly higher power. Right now they rarely go much beyond 300W during inference (which I believe is also why the higher limit shows no benefit). I cheaped out on the CPU+mobo combo; should've opted for the EPYC for a cleaner setup.
starkruzr@reddit
a mining frame? what is the PCIe bandwidth to each one of those cards? and you're doing TP=4 with it successfully and it splits the layers successfully?
anitamaxwynnn69@reddit (OP)
This is the mining frame I'm using. Already mentioned the PCIe setup in the post but here's a more detailed breakdown -
It's not ideal for TP=4 because of the poor interconnect, but TP=4 is A LOT faster for decode than PP=4 lol. So yes it works 100%; in fact I've tried PP=4/TP=4 with the same GPUs at x1/x1/x1/x1 as well. Prefill is a mess but even that works. I should also mention I run vLLM with --disable-custom-all-reduce, which turns off vLLM's custom all-reduce kernel meant for speeding up TP, because of my uneven PCIe setup. I'm still learning a lot of this so please correct me if I'm wrong.
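For reference, the launch looks roughly like this (model path illustrative):

```bash
# --disable-custom-all-reduce makes TP fall back to the regular NCCL all-reduce,
# which behaved better on my uneven x16/x8/x8/x4 layout.
vllm serve Qwen/Qwen3.6-27B \
    --tensor-parallel-size 4 \
    --disable-custom-all-reduce
```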
starkruzr@reddit
it does, thanks. I need to figure out something that will let me run 4x 5060Tis effectively without breaking the bank.
anitamaxwynnn69@reddit (OP)
Funny you mention that, I actually started with 2x 5060 Tis lol. Tried the NVFP4 variant and made a post about it somewhere on this subreddit. It worked very well, but I could feel the bandwidth difference. After some research, I came to the conclusion that 2x 3090s > 4x 5060 Tis, so I ended up returning them. Once you move out of chat and into anything agentic, the t/s actually starts mattering a lot more. Take it with a grain of salt though :')
starkruzr@reddit
yeah, I just wish I'd known this back when you could get two 3090s for almost the cost of one today 🫠
Expert-Dig-1768@reddit
how much did you pay for the 4090's?
Expert-Dig-1768@reddit
3090's sorry
anitamaxwynnn69@reddit (OP)
$1650 for the Dell OEM + EVGA
$900 each for the ASUS Strix
All locally bought used.
Far_Course2496@reddit
PP speeds?
anitamaxwynnn69@reddit (OP)
Thanks for the note, added now