Replace RTX 2060 12G with second RTX 5060 Ti 16G for Qwen 3.6 27B?
Posted by houchenglin@reddit | LocalLLaMA | 16 comments
Right now I'm running Qwen3-27B-Q4_K_M on a 2060 12G + 5060 Ti 16G with tensor split 15/7. Generation speed sits around 16.5 t/s, and prompt eval drops from 653 to 356 t/s as context grows. It works, but I'm thinking about replacing the 2060 with another 5060 Ti to get a balanced dual setup with 32GB total VRAM.
[bench] RTX 2060 12G (PCIe x16) + RTX 5060 Ti 16G (PCIe x4)
- Model: Unsloth Qwen3-27B-Q4_K_M
- PP: from 653 → 356 t/s as context grows (13K → 29.5K tokens).
- TG: flat at ~16.5 t/s
```
-m Qwen3-27B-Q4_K_M.gguf -ngl 999 -ts 15,7
-fa 1 --no-mmap -b 4096 -ub 4096
--spec-type ngram-mod --spec-ngram-size-n 24 --draft-min 12 --draft-max 48
-c 96000 -n 32768 -t 8 -ctk q8_0 -ctv q8_0 --parallel 1
--temperature 0.6 --jinja --min-p 0.0 --top-k 20 --top-p 0.95
```
My main question is whether the speed gain is actually worth it. One of the x16 slots on my board only runs at x4, so I'm worried the PCIe bottleneck eats most of the benefit. Anyone running dual 5060 Ti (or a similar dual mid-range setup) for 27B+ models? What kind of generation speed are you seeing?
Also curious about the VRAM side: going from 28GB to 32GB, does that meaningfully change what models I can run, or am I still capped around 27B either way? The net cost is basically one 5060 Ti minus whatever I get for the 2060, so I'm trying to figure out if the jump justifies it.
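For a rough sense of scale, here's the back-of-envelope I've been using (a sketch only: the bits-per-weight figure and the Qwen3-27B layer/head counts below are guesses, not the real config):

```python
# Back-of-envelope VRAM budget. Everything here is an assumption, not a
# measurement: ~4.8 bits/weight for Q4_K_M, and a guessed Qwen3-27B-like
# attention shape (the real layer/KV-head counts may differ).
GIB = 1024**3

weights = 27e9 * 4.8 / 8            # ~16.2 GB of Q4_K_M weights

# q8_0 KV cache: K and V, n_layers * n_kv_heads * head_dim per token,
# ~1.0625 bytes/element (q8_0 packs 32 values into 34 bytes).
n_layers, n_kv_heads, head_dim, ctx = 48, 8, 128, 96_000
kv = 2 * n_layers * n_kv_heads * head_dim * ctx * (34 / 32)

for total_gib in (28, 32):
    left = total_gib - (weights + kv) / GIB
    print(f"{total_gib} GiB total -> ~{left:.1f} GiB left for buffers")
```

If numbers in that ballpark are right, 32GB looks like context/KV-cache headroom (or a step up in quant) rather than a new model class, but I may be off on the shapes.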
KURD_1_STAN@reddit
I have never used dual cards and don't know how this whole splitting thing works, but can't you make it something like 17/5 or 18/4? I feel like the 2060 is slowing it down, and you might be able to fit more on the 5060 Ti if the monitor is connected to the 2060.
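Rough arithmetic on that idea (a sketch; it assumes ~16.2 GB of Q4_K_M weights and that the first -ts entry maps to the 5060 Ti, which I haven't verified):

```python
# How ~16.2 GB of Q4_K_M weights would divide under a few -ts ratios.
# Assumes the first entry maps to the 5060 Ti 16G; ignores the KV cache
# and compute buffers, which follow the same proportions.
weights_gb = 16.2
for a, b in ((15, 7), (17, 5), (18, 4)):
    print(f"-ts {a},{b}: {weights_gb * a / (a + b):.1f} GB on the 5060 Ti, "
          f"{weights_gb * b / (a + b):.1f} GB on the 2060")
```

Whether the bigger shares actually fit depends on how much KV cache lands on the same card.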
ea_man@reddit
Given where you are now, why not run Qwen3.6-27B.i1-IQ4_XS.gguf (15.1 GB) or Q3_K_L (14.3 GB) on a single 5060 Ti 16G?
You'll get maybe 2x speed.
https://huggingface.co/mradermacher/Qwen3.6-27B-i1-GGUF
houchenglin@reddit (OP)
I actually tried the IQ3 quant on a single card and it runs well, but since the KV cache uses ~3.3x the memory, the context window gets really small. Not enough for coding tasks, unfortunately.
ea_man@reddit
3.3x memory?
I use q_4 and get like 70k with less than 1GB.
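The formula itself is simple enough to sketch (the architecture numbers below are illustrative guesses, not any real config):

```python
# KV cache size is linear in context length but scales with the model's
# layer count and KV-head count (GQA), which is why two models, or two
# estimates, can differ by several x. Bytes/element for llama.cpp types:
BYTES_PER_ELT = {"f16": 2.0, "q8_0": 34 / 32, "q4_0": 18 / 32}

def kv_cache_gib(ctx, n_layers, n_kv_heads, head_dim, ctype):
    """K and V caches together, in GiB, for a dense-attention model."""
    elems = 2 * n_layers * n_kv_heads * head_dim * ctx
    return elems * BYTES_PER_ELT[ctype] / 1024**3

# Illustrative guesses, not real configs:
print(kv_cache_gib(70_000, 48, 8, 128, "q4_0"))  # ~3.6 GiB
print(kv_cache_gib(70_000, 40, 4, 128, "q4_0"))  # ~1.5 GiB, stronger GQA
```

So the same context can cost a few GiB or well under 2 GiB depending on the model's shape, which may be where the disagreement comes from.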
houchenglin@reddit (OP)
I asked Qwen to google the VRAM usage for 35B and 27B and it replied with this answer:
(WARNING! AI GENERATED DATA)
ea_man@reddit
You have to look at the logs when you load the model; they tell you what's happening.
see_spot_ruminate@reddit
With the 5060 Ti (or any card), once the model is fully loaded into VRAM, memory bandwidth becomes the main concern. If you're already loading the whole model into VRAM, you're likely hitting the bandwidth limit of the 2060, since it's the older card. I wouldn't worry much about the x4 lanes right now; that matters more for loading the model, though it can still have some effect.
I have a quad setup and once I fully load the model it is at the limit of the vram bandwidth.
Going from 28GB to 32GB can be a meaningful change, but going from the limits of the 2060 to the 5060 Ti can be even more impactful.
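To put rough numbers on that (a sketch: spec-sheet bandwidths, an assumed ~16.2 GB of weights, and a guessed -ts device order):

```python
# Bandwidth ceiling on token generation: each token reads all weights
# once, and with a layer split the cards run in series. Real speeds sit
# below this; it only bounds what the swap could buy.
weights_gb = 16.2
bw = {"5060ti": 448, "2060": 336}       # GB/s, spec-sheet numbers

def ceiling_tps(parts):                 # parts: [(gb_on_card, gb_per_s), ...]
    return 1 / sum(gb / b for gb, b in parts)

mixed = ceiling_tps([(weights_gb * 15 / 22, bw["5060ti"]),
                     (weights_gb * 7 / 22, bw["2060"])])
dual = ceiling_tps([(weights_gb, bw["5060ti"])])   # two identical cards
print(f"2060 + 5060 Ti ceiling: ~{mixed:.0f} t/s (observed: 16.5)")
print(f"2x 5060 Ti ceiling:     ~{dual:.0f} t/s")
```

Real numbers sit well below both ceilings, but it suggests the generation-speed gap between the two setups is modest once everything is in VRAM.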
On the other hand, I find that if you have enough system RAM, the larger MoE models can make more sense in a way. I have 64GB + 64GB and I just use the Q4 of the 122B Qwen3.5 with good results when I need something beefier.
houchenglin@reddit (OP)
That makes sense; bandwidth is probably the real bottleneck here. I hadn't considered trying larger MoE models with system RAM, that's a good idea. Thanks!
see_spot_ruminate@reddit
For sure. The primary concern is VRAM, then bandwidth. Since you have the VRAM covered, worry about the card's memory bandwidth, not the PCIe lanes. As long as you have at least Gen 4 PCIe, you will 'probably' have another bottleneck somewhere else.
Also, llama.cpp and the NVIDIA drivers (sometimes) keep getting better. Last year I was getting mid-50 t/s on gpt-oss-120b, but now I'm getting around 70 t/s. I keep coming back to this model because it's a nice mix.
Long_comment_san@reddit
Dual 5060 Ti seems to be the sweet spot, yeah. It will also unlock native 4-bit capabilities. Sadly there's just no alternative; I would have loved to say "hey, sell this junk and slap in twin 5080 Supers with 24GB VRAM each," but they never happened :(
You can also try a second-hand 3090, if your case and motherboard can take it, of course.
houchenglin@reddit (OP)
Thanks for the breakdown! I was considering a used 3090, but it's already ~6 years old now, so I'm worried about its reliability and lifespan. As for the 5080, it sounds like a solid option but also pretty expensive.
Charming-Author4877@reddit
The downside of the 2060 is not only the VRAM and speed, it's also its CUDA capabilities!
The extra 4GB of VRAM would also let you run a real draft model instead of the ngram-mod.
Your 2060 runs at PCIe Gen 3; the 5060 is Gen 5.
The 4 new lanes give roughly the bandwidth of the 16 old ones.
It's certainly an upgrade worth making.
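Quick check on the lane math (a sketch; ~0.985 GB/s per lane at Gen 3 after 128b/130b encoding, doubling each generation):

```python
# Per-direction PCIe throughput by generation and lane count.
GEN3_PER_LANE = 0.985                  # GB/s per lane at Gen 3
for gen, lanes in ((3, 16), (4, 4), (5, 4)):
    gbps = GEN3_PER_LANE * 2 ** (gen - 3) * lanes
    print(f"Gen {gen} x{lanes}: ~{gbps:.1f} GB/s")
```

So Gen 5 x4 roughly ties Gen 3 x16, and either way this mostly shows up in model load time, since generation barely touches the bus once the weights are resident.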
houchenglin@reddit (OP)
Yeah, I think you're right. With the PCIe gen difference, 4 new lanes keep up with 16 old ones. Thanks!
suprjami@reddit
Or, 3x 12GB lets you run Q6 with 128k ctx.
houchenglin@reddit (OP)
Thanks for the suggestion! Unfortunately my mATX board doesn't have room for that many cards.
ixdx@reddit
I have a 5070 Ti (PCIe 5 x16) with a 5060 Ti (PCIe 3 x4).
Qwen3.6-27B-Q4_K_L fits with 128k context and mmproj (pp512: 1240, tg128: 27)
Qwen3.6-27B-Q5_K_L fits with 128k context without mmproj (pp512: 1235, tg128: 24)
It's a little better than 2x5060Ti, but not much.