Multiple 5060 Ti's
Posted by snorixx@reddit | LocalLLaMA | 30 comments
Hi, I need to build a lab AI inference/training/development machine: basically something to just get started, gain experience, and burn as little money as possible. Due to availability problems my first choice (the cheaper RTX PRO Blackwell cards) isn't available. Now my question:
Would it be viable to use multiple 5060 Ti (16GB) cards on a server motherboard (a cheap EPYC 9004/8004)? In my opinion the card is relatively cheap, supports new CUDA versions, and I can start with one or two and later experiment with more (or other NVIDIA cards). The purpose of the machine is only to gain experience, so there is nothing to worry about regarding server deployment standards etc.
The card only uses 8 PCIe lanes, while a 5070 Ti (16GB) uses all 16 lanes of the slot and has much higher memory bandwidth, for much more money. What speaks for and against my planned setup?
8 PCIe 5.0 lanes give about 63 GB/s combined in both directions, roughly 31.5 GB/s each way (x16 would be double). But I don't know how much that matters...
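The arithmetic behind that, as a quick sketch:

```python
# Rough PCIe 5.0 throughput per direction: 32 GT/s per lane with 128b/130b encoding.
GT_PER_S = 32
ENCODING = 128 / 130

def pcie5_gb_per_s(lanes: int) -> float:
    """Approximate usable GB/s in one direction for a PCIe 5.0 link."""
    return GT_PER_S * ENCODING / 8 * lanes

print(f"x8:  {pcie5_gb_per_s(8):.1f} GB/s each way")   # ~31.5 GB/s (~63 GB/s both directions)
print(f"x16: {pcie5_gb_per_s(16):.1f} GB/s each way")  # ~63 GB/s
```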
Deep-Technician-8568@reddit
If you are running dense models, I don't really recommend getting more than 2x 5060 Ti. With my testing of 1x 4060 Ti and 1x 5060 Ti combined I was getting 11 tok/s on Qwen 32B. I don't really consider anything under 20 tok/s to be usable, and I don't think 2x 5060 Ti will even get to 20 tok/s. So, for dense models, I really don't see the point of getting more than 2x 5060 Ti.
snorixx@reddit (OP)
My focus will be more on development and gaining experience, but thanks, that helps. My only test right now is a Tesla P4 in an x4 slot together with an RTX 2070, and that runs 16B models fine across both GPUs with Ollama. But maybe I will have to invest, try and document…
sixx7@reddit
Not sure what the above person is doing, but I ran a 3090 + 5060 Ti together and had way better performance. Ubuntu + vLLM (tensor parallel), and I was seeing over 1000 tok/s prompt processing, generation of 30 tok/s for single prompts, and over 100 tok/s for batched/multiple prompts.
Excellent_Produce146@reddit
Which quant/inference server did you use? With vLLM and Qwen/Qwen3-32B-AWQ I get
Avg generation throughput: 23.4 tokens/s (Cherry Studio says 20 t/s)
out of my test system with 2x 4060 Ti. Using v0.9.2 (container version) with "--model Qwen/Qwen3-32B-AWQ --tensor-parallel-size 2 --kv-cache-dtype fp8 --max-model-len 24576 --gpu-memory-utilization 0.98" and VLLM_ATTENTION_BACKEND=FLASHINFER.
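Spelled out through vLLM's Python API, that configuration looks roughly like this (a sketch; the parameters mirror the CLI flags above and the prompt is just a placeholder):

```python
import os
os.environ["VLLM_ATTENTION_BACKEND"] = "FLASHINFER"  # set before the engine initializes

from vllm import LLM, SamplingParams

llm = LLM(
    model="Qwen/Qwen3-32B-AWQ",
    tensor_parallel_size=2,       # split the weights across both cards
    kv_cache_dtype="fp8",         # smaller KV cache leaves room for more context
    max_model_len=24576,
    gpu_memory_utilization=0.98,
)

out = llm.generate(["Hello"], SamplingParams(max_tokens=64))
print(out[0].outputs[0].text)
```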
Still in service for tests, because support for the previous generation is still better than for the Blackwell cards, at least in vLLM. Blackwell still needs some love:
https://github.com/vllm-project/vllm/issues/20605
Excellent_Produce146@reddit
FTR, before buying the (now) overpriced RTX 4060 Ti, I would get 2x 5060 Ti instead. I was just curious what is used in the backend.
snorixx@reddit (OP)
Thanks, I will consider that. I would buy an RTX PRO 2000 or 4000 card, but the Blackwell ones aren't available to buy yet, maybe in 1-3 months.
FieldProgrammable@reddit
I agree for dense-model multi-GPU LLM inference, but a third card could be useful for other workloads, e.g. having a third card dedicated to hosting a diffusion model, or in a coding scenario a second, smaller, lower-latency model suitable for FIM tab autocomplete (e.g. the smaller Qwen2.5 Coder models).
AppearanceHeavy6724@reddit
The 4060 Ti is absolute shit for LLMs, that's why. It has 288 GB/s of memory bandwidth, which is ass. With 2x 5060 Ti you'll easily get 20 t/s, especially using vLLM.
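As a rough sanity check (a sketch; the model size and efficiency factor are assumptions, and real throughput also depends on quant, context length and overhead):

```python
# Decoding a dense model is mostly memory-bound: each token reads roughly all the weights once.
model_bytes = 18e9      # ~18 GB for a 32B model at ~4-bit (assumption)
bw_per_card = 448e9     # 5060 Ti memory bandwidth in bytes/s
cards = 2               # tensor parallel splits the weight reads across both cards
efficiency = 0.6        # assumed fraction of peak bandwidth actually achieved

tok_per_s = bw_per_card * cards * efficiency / model_bytes
print(f"~{tok_per_s:.0f} tok/s ballpark")  # ~30 tok/s
```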
EthanMiner@reddit
Just get used 3090s. My 5070 Ti and 5090 are pains to get working with everything training-related in Linux (inference is fine). It's like GitHub whack-a-mole figuring out what else has to change once you update torch.
HelpfulHand3@reddit
Yes, Blackwell is still not well supported.
The only problem with 3090s is they are massive and huge power hogs, plus the OC models can require 3x 8-pin PCIe power connectors.
My 5070 Ti is much smaller than my 3090.
EthanMiner@reddit
I just use Founders Editions and don't have those issues. You can water block them too; I fit 3 in a Lian Li A3, no problem. Maxes out at 1295 W for 72 GB of VRAM.
snorixx@reddit (OP)
What platform do you use, EPYC or AM5?
EthanMiner@reddit
Ubuntu 22.04
AdamDhahabi@reddit
Fewer PCIe lanes won't impact things too much; I found a test showing a 5060 Ti on PCIe 3.0 x1 vs 5.0 x16. https://www.youtube.com/watch?v=qy0FWfTknFU
snorixx@reddit (OP)
Nice, thanks, I will watch it. I think it will only matter when running one model across many cards, because they have to communicate over PCIe, which is way slower than memory.
FieldProgrammable@reddit
That video is a single GPU running the entire workload from VRAM, so it's completely meaningless compared to multi-GPU inference, let alone training. For training you need to maximise inter-card bandwidth. One reason dual 3090s are so popular is that they support NVLink, which got dropped from consumer cards from Ada onwards.
Another thing to research is PCIe P2P transfers, which NVIDIA disables for gaming cards. Without that, data has to pass through system memory to get to another card, so way higher latency. I think there was a hack to enable this for 4090s. But this is a feature that would be supported on pro cards out of the box, giving them an edge in training that might not be obvious from a compute and memory bandwidth comparison alone.
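If you want to see what your own cards report, a minimal check with PyTorch (assuming a CUDA-enabled install; nvidia-smi topo -m gives a similar picture from the command line):

```python
import torch

# Ask CUDA whether each GPU can read another GPU's memory directly (P2P),
# instead of bouncing transfers through system RAM.
n = torch.cuda.device_count()
for i in range(n):
    for j in range(n):
        if i != j:
            ok = torch.cuda.can_device_access_peer(i, j)
            print(f"GPU {i} -> GPU {j}: peer access {'available' if ok else 'NOT available'}")
```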
snorixx@reddit (OP)
Thanks, that's interesting. My first choice was the RTX 4000 Blackwell, but that is not available. The big problem is that you need a server board if you want to upgrade over time, but that increases the initial cost substantially… And AMD cards are not an option for me atm.
FieldProgrammable@reddit
You realise that any test which uses a single GPU with all layers and cache in VRAM is a completely meaningless test for PCIE bandwidth?
This is basically just streaming tokens off the card one at a time. In a dual-GPU or CPU-offloaded scenario the weights are distributed across different memories connected by the PCIe bus, and to generate a single token the intermediate results have to be passed from one layer to the next in a different memory before the token comes out.
OP asked about a dual-GPU setup; the amount of data moving over the PCIe bus would be orders of magnitude higher than in a single-GPU scenario. In a tensor parallel configuration it would be higher again. That's not to say you absolutely need the full bandwidth, but that video is absolutely not representative of a multi-GPU setup.
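To put a rough number on it, here is a sketch of the per-token activation traffic in a simple two-way pipelined split (the hidden size and dtype are assumptions for a ~32B model; tensor parallel and training move far more, because they synchronize inside every layer or exchange gradients):

```python
# In a pipelined split, each decoded token pushes one hidden-state vector
# across the PCIe boundary between the two halves of the model.
hidden_size = 5120       # assumed hidden dimension of a ~32B dense model
bytes_per_value = 2      # fp16 activations
per_token = hidden_size * bytes_per_value   # ~10 KB per token per boundary

tokens_per_s = 20
print(f"~{per_token * tokens_per_s / 1e6:.2f} MB/s of activation traffic")
# Tiny compared to PCIe; prompt processing and tensor parallel push much more per step.
```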
AdamDhahabi@reddit
Fair enough. I found a related comment in my notes: "For the default modes the pci-e speed really does not matter much for inferencing. You will see a slow down while the model is loading. In the default mode the processing happens on one card, then the next, and so on. Only one card is active at a time. There is little card to card communication." https://www.reddit.com/r/LocalLLaMA/comments/1gossnd/when_using_multi_gpu_does_the_speed_between_the/
FieldProgrammable@reddit
"Little" is relative, especially compared to a task like training which needs massive intercard bandwidth. Also that quote applies to a classic pipelined mode where data passes serially from one card to the next. In a tensor parallel configuration the cards try to run in parallel requiring far more intercard communication.
Like I said, I'm not claiming that PCIe 5.0 x8 vs PCIe 4.0 x4 is going to make or break your speeds, but that's a far cry from claiming "PCIe 3.0 x1 is fine" based on a completely different memory configuration.
AdamDhahabi@reddit
You're right, I overlooked the training part the OP mentioned.
cybran3@reddit
I ordered 2x 5060 Ti 16GB, they should be arriving any time now. I chose the 5060 instead of a 3090 just because it's going to last me longer, and used GPUs are hit or miss and I don't want that kind of trouble.
snorixx@reddit (OP)
Okay, what mainboard/platform do you use, consumer or server?
FieldProgrammable@reddit
Dual PCIe 5.0 x8 consumer boards are rarer, but they do exist. You can also get a bifurcation riser for a slot wired for x16 to split it into two x8 slots (assuming the motherboard supports bifurcation, which many do). If you are only going to be using two cards, then a server CPU/motherboard doesn't make much sense cost-wise.
I have an Asus ProArt X870E, which also does dual PCIe 5.0 x8 and has plenty of other high-end features if you are looking for those (10Gb Ethernet, PCIe 5.0 x4 M.2, shitloads of USB 3.2 ports).
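Whichever board you go with, it's worth confirming what link each card actually negotiated once everything is installed; a minimal sketch using the pynvml bindings (from the nvidia-ml-py package):

```python
import pynvml  # pip install nvidia-ml-py

pynvml.nvmlInit()
for i in range(pynvml.nvmlDeviceGetCount()):
    handle = pynvml.nvmlDeviceGetHandleByIndex(i)
    gen = pynvml.nvmlDeviceGetCurrPcieLinkGeneration(handle)
    width = pynvml.nvmlDeviceGetCurrPcieLinkWidth(handle)
    # Note: links can downshift at idle, so check again under load.
    print(f"GPU {i}: PCIe gen {gen} x{width}")
pynvml.nvmlShutdown()
```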
cybran3@reddit
I ordered a Gigabyte B850 AI TOP. It supports 2x PCIe 5.0 x8 at maximum speed. It's a consumer board though; I'll pair it with a Ryzen 9 9900X and 128 GB of 5600 MT/s RAM.
AppearanceHeavy6724@reddit
I kinda agree. I also thought about a used 3090, but I figured that since I already have a 3060, to accommodate a 3090 I'd need to replace my PSU. So instead I am buying a 5060 Ti once they get below $500 on our local market.
tmvr@reddit
Two of those cards alone cost USD 800 (or in EU land about 860 EUR). Check how many hours of 80GB+ GPU time you can rent for that amount (and that's without any upfront payment).
Direct_Turn_1484@reddit
Sure. But that’s not local.
snorixx@reddit (OP)
Thanks, I will consider that too.
AmIDumbOrSmart@reddit
Sup, I have 2x 5060 Ti and a 5070 Ti. The 5060s are on PCIe 4.0 x4 (probably heavily bottlenecked) and the 5070 Ti is on a 5.0 x16. I can run Q4_K_M 70B models at 6k context at around 10 tokens a second or so.