Local dual 5060 Ti, Qwen3 30B, full 40k context, >60 t/s
Posted by see_spot_ruminate@reddit | LocalLLaMA | View on Reddit | 24 comments
Hello all
I wanted to do a write-up of my setup for anyone considering a similar choice. I know it is not actually that cheap, but I think I get a good performance benefit. I live near a Microcenter, so a lot of this was purchased there.
I got the 7600X3D deal they have, but with the boost to 64 GB of RAM. Then I got 2x 5060 Ti 16GB. With this setup (thanks to the 32 GB of VRAM), I am able to load the full context for Qwen3 30B fully offloaded to GPU (via ollama, via Open WebUI, with the recommended settings). I get >60 tokens per second with this. I know that many, many people recommend getting used cards, but I just can't deal with the used market.
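For anyone reproducing this: ollama defaults to a much smaller context window, so the 40k context has to be set explicitly. A minimal sketch, assuming the model tag is `qwen3:30b` (check `ollama list` for your actual tag); `num_ctx` is ollama's context-length parameter:

```shell
# Hypothetical Modelfile that pins a ~40k context window
cat > Modelfile <<'EOF'
FROM qwen3:30b
PARAMETER num_ctx 40960
EOF
# then: ollama create qwen3-40k -f Modelfile && ollama run qwen3-40k
```

Open WebUI can also override the context length per model in its advanced parameters, which may be what "the recommended settings" above refers to.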
Anyway, this is mostly a post for those looking for dual 5060 ti use. Let me know if you have any questions.
Glittering-Cold-2981@reddit
Hello, what speeds do larger models get on this setup, e.g. qwen 27b or 32b in Q5/Q6?
see_spot_ruminate@reddit (OP)
So, I have not been using them that much. At Q4, probably less than 20 t/s, closer to 10 t/s. This is not the card to buy multiples of for loading dense models.
This is more of a setup to max out context on MoE models. Check my more recent post history for what I am getting now, because I have gone up to 4 cards.
ForsookComparison@reddit
I can't figure out why, but my two RX 6800s only get ~40 t/s max despite having faster theoretical memory bandwidth.
see_spot_ruminate@reddit (OP)
I will say that on other models I get more or less, and I bet there is some low-level technical reason why. I will also say that the motherboard I have is the Asus TUF 650 that came with the Microcenter deal. With the two cards loaded, one runs at Gen4 x8 and the other at Gen4 x1. While I thought the Gen4 x1 link would be limiting, it is still fast enough for me.
kmouratidis@reddit
Ollama (and everything llama.cpp-based) only does pipeline parallelism, so x1 is fine. In fact, even running over the network will likely not throttle you for "thin" models.
It starts making a big difference when each layer is wide, when you do tensor parallel, and most commonly when doing batch work.
Qwen3-30B-A3B expert matrices seem to be around 2048x768, so <1 MB in ollama's default quant (~Q4). Even if you transfer, say, the results of 9 of the roughly 10 matrices per layer (and some don't even have parameters), that's still under 100 Mbps.
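A quick sanity check on the per-matrix figure, using the numbers from the comment above and assuming roughly 4.5 bits per weight for a Q4-ish quant:

```shell
# one 2048x768 expert matrix at ~4.5 bits per weight (9/16 byte per weight)
BYTES=$(( 2048 * 768 * 9 / 16 ))
echo "${BYTES} bytes per matrix"   # 884736 bytes, i.e. under 1 MB
```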
TSG-AYAN@reddit
What backend are you using? ROCm token generation is much slower than Vulkan, at least on llama.cpp under Linux with the Mesa Vulkan driver.
AppearanceHeavy6724@reddit
What is OS and idle consumption in watts?
see_spot_ruminate@reddit (OP)
Ubuntu 25.04, with the graphics PPA for the drivers.
Says 72 watts (per my IKEA smart plug in Home Assistant: 120 V × 0.6 A).
Looking back at the graph, it sits between 0.55 A (idle) and 2 A (when running the model).
SandboChang@reddit
If you are already on Ubuntu, you could consider switching to vLLM/SGLang/Tabby for better performance.
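For reference, a minimal vLLM invocation for a dual-GPU box like this might look something like the following; the model name and context length are assumptions, and `--tensor-parallel-size 2` splits the model across both cards:

```shell
# hypothetical: serve Qwen3-30B-A3B across two GPUs with a ~40k context
vllm serve Qwen/Qwen3-30B-A3B \
    --tensor-parallel-size 2 \
    --max-model-len 40960
```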
AppearanceHeavy6724@reddit
Not the whole computer, only the cards. Just run nvidia-smi.
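If you just want the power numbers without the full nvidia-smi table, its query mode can print them directly (these are standard `--query-gpu` field names):

```shell
# per-card power draw and limit in CSV
nvidia-smi --query-gpu=index,power.draw,power.limit --format=csv
```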
see_spot_ruminate@reddit (OP)
The power limit for both cards is 180 W, but at idle they are at 5 W.
This is on driver 575.64.03.
AppearanceHeavy6724@reddit
5 W sounds great! Who is the manufacturer of the cards, and what is the model? Gigabyte? BTW, power-limit the cards down to 150 W; you won't lose performance. You can also try vLLM to use both cards in parallel, which should give you around 80 t/s on an empty context.
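Power-limiting as suggested is one command per card (needs root; 150 W is the value suggested above, and the setting resets on reboot unless persistence mode is enabled):

```shell
sudo nvidia-smi -pm 1          # enable persistence mode so the limit sticks
sudo nvidia-smi -i 0 -pl 150   # cap GPU 0 at 150 W
sudo nvidia-smi -i 1 -pl 150   # cap GPU 1 at 150 W
```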
see_spot_ruminate@reddit (OP)
I will work on that. I spent most of my time just trying to get the system to actually recognize the cards; on driver 570 it would just call them a "generic NVIDIA device" or something.
The cards are Zotac, just the two-fan variety. I think one is an "AMP OC" and the other is the lowest tier.
https://www.microcenter.com/product/694732/zotac-nvidia-geforce-rtx-5060-ti-twin-edge-overclocked-dual-fan-16gb-gddr7-pcie-50-graphics-card
Thanks, I will check out vLLM. I get 80 t/s on Gemma 3 4B; maybe I would get more on vLLM.
AppearanceHeavy6724@reddit
Oh great. I was thinking about buying Zotac, but it felt like kind of a sus brand. It seems to work fine though, so I guess I can buy one.
see_spot_ruminate@reddit (OP)
I feel like all the brands had something going on when I was searching.
Also, other good news: the power cable is unlikely to melt, since these cards still use the 8-pin PCIe power connector. I took that as a win too.
henfiber@reddit
Thanks. Could you also report the PP t/s you get with Qwen3 30b with a sizable prompt length (500+ tokens)?
see_spot_ruminate@reddit (OP)
no problem.
prompt: "help me tell reddit how well my computer setup is running"
response_token/s: 62.7
prompt_token/s: 86.23
approximate_total: "0h0m35s"
9 seconds were loading
26 seconds were responding, 22 of which look like thinking
total_tokens: 1669
henfiber@reddit
86 Prompt tokens/s seems very low for a 5060 ti. I get 80 with an old 6-core laptop CPU (no GPU).
Could you check with a longer prompt (5-6 paragraphs or so)?
E.g.,
summarize this text for me: <...pasted text from wikipedia article...>
see_spot_ruminate@reddit (OP)
I asked it to summarize some stuff about oranges, 699 tokens
response_token/s: 62.35
prompt_token/s: 1085.45
total_tokens: 1221
total_time: "0h0m18s"
henfiber@reddit
~1100 PP t/s seems more realistic, thanks.
see_spot_ruminate@reddit (OP)
Yeah, you are right, the original prompt was not long enough
legit_split_@reddit
What are the rest of your system's specs?
see_spot_ruminate@reddit (OP)
I have an NVMe SSD for boot, plus 2 HDDs in RAID 1 that act as a backup for another RAID setup on a different computer. Nothing else is fancy. I don't have all the specs memorized, sorry.
legit_split_@reddit
Oh cool, so really just a standard ATX board. Thanks!