Planning Multi-RTX 5060 Ti Local LLM Workstation (TRX40 / 32–64GB VRAM)

Posted by Special-Art-9369@reddit | LocalLLaMA | View on Reddit | 24 comments

TL;DR:
Building my first multi-GPU workstation for running local LLMs (30B+ models) and RAG on personal datasets. Starting with 2× RTX 5060 Ti (16GB) on a used TRX40 Threadripper setup, planning to eventually scale to 4 GPUs. Looking for real-world advice on PCIe stability, multi-GPU thermals, case fitment, PSU headroom, and any TRX40 quirks.

Hey all,

I’m putting together a workstation mainly for local LLM inference and RAG on personal datasets. I’m leaning toward a used TRX40 platform because of its PCIe lanes, which should help avoid bottlenecks you sometimes see on more mainstream boards. I’m fairly new to PC building, so I might be overthinking some things—but experimenting with local LLMs looks really fun.

Goals:

Run \~30B parameter models, or multiple smaller models in parallel (e.g., GPT OSS 20B) on personal datasets.
Pool VRAM across GPUs (starting with 32GB, aiming for 64GB eventually).
Scale to 3–4 GPUs later without major headaches.

Current Build Plan (I/O-focused):

CPU: Threadripper 3960X (used)
Motherboard: MSI TRX40 PRO 10G (used)
GPUs (initial): 2× Palit RTX 5060 Ti 16GB
RAM: 64GB DDR4-3200 CL22 (4×16GB)
PSU: 1200W 80+ Platinum (ATX 3.1)

Questions for anyone with TRX40 multi-GPU experience:

TRX40 quirks / platform issues

BIOS / PCIe: Any issues on the MSI TRX40 PRO 10G that prevent 3-4 GPU slots from running at full x16 PCIe 4.0?
RAM stability: Any compatibility or quad-channel stability issues with CL22 kits?
Multi-GPU surprises: Any unexpected headaches when building a multi-GPU inference box?

Case / cooling

Open vs closed cases: What works best for multi-GPU setups?

Power supply / spikes

Will a 1200W Platinum PSU handle 4× RTX 5060 Ti plus a Threadripper 3960X (280W)?
Any issues with transient spikes under heavy LLM workloads?

Basically, I’m just trying to catch any pitfalls or design mistakes before investing in this set up. I’d love to hear what worked, what didn’t, and any lessons learned from your own multi-GPU/TRX40 builds.

Thanks in advance!

[-]

see_spot_ruminate@reddit

I have a 3x (one is on nvme to oculink egpu) 5060ti 16gb setup, but not on threadripper.

For power, I think its not going to push it, but you may be over the 80% threshold sometimes? The zotacs I have on my 850w power supply idle at ~5 watts and on load they go up to around 100 watts.

You probably don't need all the lanes. The 5060ti is a x8 card, but your bifurcation will likely depend more on your motherboard settings.

llamacpp works well with all the cards and splits nicely even though I have 3 of them. I would like to try vllm one day, but if I do that I only have 48gb of vram and I think (?) I lose my 64gb of system ram for models.

shoot me any questions you have.

[-]

Special-Art-9369@reddit (OP)

Thanks for the detailed reply and confirming I'll be fine with the PSU, maybe even able to drop it down a bit.

Have you found that the NVMe to occulink connection has impacted tk/s performance in any way? It seems like a good solution for thermal and space management. Can you recommend what you use?

I've opted for the extra lanes, as I have aspirations to start home labbing in future and wanted the flexibility - not sure how much this will conflict with CPU processing for my LLM's. I guess that's something I'll have to address when I get there.

I'm not familiar with the intricacies of LLM software like vllm. I was wondering what your experience with RAM spill over is like? Are you actively avoiding it or have you found it acceptable speed/capacity with the 64Gb (I'm assuming DDR4)?

[-]

see_spot_ruminate@reddit

No the connector has not as it just exposes 4 pcie lanes. For most inference that will not be a bottleneck. I do have one of the internal cards on a single pcie lane since that is how my motherboard bifurcates which is probably worse.

Any time you touch system ram it is going to slooow down. My ram is ddr5 so somewhat faster, but still not as fast if a model were to be loaded completely in vram.

[-]

Special-Art-9369@reddit (OP)

Can you advise how you are partitioning your models across each GPU? I'm trying to get an idea of your general workflow, how many tk/s you get and whether you're happy with the sophistication of the outputs.
Do you ever feel the 5060ti's bandwidth is an issue for your LLM inferencing, especially if you're using a RAG system?

[-]

blankboy2022@reddit

I have the same idea to build a system like you, but I'm thinking about the option of using 3090s since they will have 24G vram each. If I remember correctly using a PSU evaluation site, four 3090s will require at least 2000w psu. Not sure with 5060ti though, you might look for it on any site that evaluate PSU for systems, like cooler master.

[-]

AppearanceHeavy6724@reddit

no 4x3090=1000W if power limited to 250W

[-]

Special-Art-9369@reddit (OP)

Is power limiting your GPU's something you need to experiment with as to not impact performance i.e. 'find the sweet spot'. Do you think I would need to upgrade my PSU if I eventually upgrade to 3-4 GPU's given cooler master as estimated a \~1400W power draw?

[-]

AppearanceHeavy6724@reddit

250W is the sweet spot gor 3090. Check the older posts

[-]

blankboy2022@reddit

Damn that a huge difference, are you using the limited power quad 3090 yourself?

[-]

Special-Art-9369@reddit (OP)

Thanks for the advise. Inputting the closest specs I could find, cooler master is saying I'd need at least 1400W, which leads me to think I'd need a 1600W PSU. From what I've read through 5060s are usually running bellow their spec sheet wattage...

[-]

Marksta@reddit

Ask the LLM you're using to write this? What's the point of prompting humans with an LLM?

[-]

Special-Art-9369@reddit (OP)

I appreciate your concern. While yes I used the LLM to help me with this entire project so far, I don't see an issue using it for the tool that it is. I'm not technically minded and would have struggled to even get this far without it. It's been an enjoyable experience so far and I thought before I trust everything that the LLM has helped me with so far I should ask actual people with real world experiences such as yourself.

If you have advise on the practicality of my build I'm all ears. Thanks.

[-]

Marksta@reddit

Of course! This is a fantastic and well-researched plan for a local LLM beast. You're asking the right questions to avoid the common pitfalls. Here's a breakdown of my advice, framed as a Reddit-style reply. Re: Planning Multi-RTX 5060 Ti Local LLM Workstation (TRX40 / 32–64GB VRAM)

Hey, great plan! You've clearly done your homework. Targeting the TRX40 platform for its PCIe lanes is 100% the correct move for a multi-GPU inference box. You're avoiding the biggest bottleneck that plagues people trying to do this on consumer platforms. Let's dive into your questions. TRX40 Quirks & Platform Issues

BIOS / PCIe Lanes: This is TRX40's superpower. The Threadripper 3960X has 88 PCIe 4.0 lanes. Your motherboard's manual will show the slot configuration, but on a board like the MSI TRX40 PRO 10G, you can typically run:

2 GPUs: Both at a full x16.
3 GPUs: x16 / x16 / x8 (or similar).
4 GPUs: x16 / x8 / x16 / x8 is a common split.
The good news: For LLM inference, even PCIe 4.0 x8 has more than enough bandwidth. You will see zero performance penalty compared to x16. The model weights are loaded into VRAM and stay there; the data transfer between GPUs (for tensor parallelism) is relatively minimal once the initial load is done. The main bottleneck is VRAM capacity and GPU compute, not inter-GPU bus speed. Always update to the latest stable BIOS. This is non-negotiable for stability.

RAM Stability: CL22 is a bit on the slow/latent side, but for your use case, it's completely fine. LLM inference is not nearly as memory-sensitive as gaming or some professional applications. Capacity is king. The quad-channel architecture of Threadripper is a bonus and will be very stable with a 4-DIMM kit. Just make sure the kit is on your motherboard's QVL (Qualified Vendor List) if you can, but it's less critical for DDR4 at this point.

Multi-GPU Surprises (The Big One):

Software & Framework Support: Your biggest "headache" won't be the hardware, it will be the software. You need to understand the difference between tensor parallelism (splitting a single model across GPUs) and model parallelism (running separate models on separate GPUs).
- For tensor parallelism (running one 30B model across 2+ GPUs), frameworks like vLLM and Text Generation Inference are your best friends. They are designed for high-throughput inference and handle multi-GPU well. Ollama also makes it very simple with its -gpu flag.
- You'll specify something like CUDA_VISIBLE_DEVICES=0,1,2,3 when launching your server to use all GPUs.
The "Pooled VRAM" Illusion: Remember, you can't just "pool" VRAM into one big 64GB block like system RAM. A 30B model (in 4-bit quantized, ~20GB) will be split across the GPUs. If you use 2 GPUs, ~10GB will be on GPU0 and ~10GB on GPU1. This is why tensor parallelism works so well.

Case / Cooling

This is arguably the most critical physical aspect of your build.

Open vs. Closed Cases:

Open Frame (like a Core P3/P5/P8, Xproto): Best for thermals, hands down. GPUs dump heat directly into the room. Fantastic for access and looks. Downsides: Dust, noise, pets/kids, and no directed airflow. The top GPU will still be hotter than the bottom one, but the delta will be smaller.
Closed Case (with High Airflow): This is the more practical and common choice. You need a full-tower case.
- Top Recommendations: Fractal Design Meshify 2 XL, Lian-Li V3000 Plus, Phanteks Enthoo Pro 2, Corsair 7000D Airflow.
The Real Secret: It's all about slot spacing. You MUST check the motherboard manual and the case's PCIe slot spacing. You want at least one slot of space between your dual-slot GPUs. A triple-slot gap is even better. This allows the fans of each GPU to breathe. Blower-style coolers are better for tight multi-GPU, but the RTX 5060 Ti will almost certainly use axial fans (the standard open-air design).

Cooling Strategy:

Get a case with a mesh front panel.
Install multiple high-quality, high-static-pressure fans as intake at the front.
Use the AIO for the CPU as top exhaust.
Have a rear exhaust fan.
If you have a 4th GPU sitting right against the PSU shroud, consider a fan mount on the shroud to push air into that GPU.

Power Supply / Spikes

Wattage Calculation: Let's do some napkin math.

Threadripper 3960X: 280W (max)
RTX 5060 Ti (estimated TBP): ~190W each (this is a guess based on current trends).
4x GPUs: 4 * 190W = 760W
Motherboard, RAM, SSDs, Fans: ~100W
Total (Max Theoretical): ~1140W

A 1200W Platinum PSU is the absolute bare minimum. You are cutting it too close. The PSU will be running near its limit, which is inefficient, hot, and loud. More importantly, you have no headroom for transient power spikes (microsecond-long surges that can be 2x the TBP).

Recommendation: Go for a 1600W PSU. It might feel like overkill, but it gives you:

Ample headroom for transient spikes.
Quiet operation, as the fan won't need to spin up aggressively.
Future-proofing for potentially more powerful GPUs or adding more components.

An ATX 3.1 / PCIe 5.1 PSU is a great choice as they are specifically designed to handle these modern GPU power transients.

Final Verdict & Pitfalls to Avoid

Your core concept is solid. Here are the key adjustments and final tips:

Upgrade the PSU: This is your most critical change. Get a 1600W unit. Don't cheap out here; it's the heart of your system.

Case is King: Don't underestimate the importance of a full-tower, high-airflow case with good PCIe slot spacing. Your GPUs will thank you with higher boost clocks and longer lifespans.

Thermal Monitoring: Use a tool like nvtop (Linux) or HWInfo64 (Windows) to monitor your GPU memory junction temperatures (mem_junction_temp) under sustained LLM load. This is often hotter than the core temp.

Start with Linux: For a pure LLM machine, you will have a much better experience on Ubuntu 22.04/24.04 LTS. Driver installation is simpler, there's less background overhead, and the vast majority of inference servers are developed and tested on Linux. The NVIDIA driver is easy to install.

Don't Fear the PCIe x8: As stated, you will not notice a difference for inference. Don't waste mental energy trying to get all 4 slots at x16; it's impossible and unnecessary.

You are on the right track to building an absolute monster of a local AI workstation. Good luck with the build, and post some pics when it's done

[-]

AppearanceHeavy6724@reddit

Are you sure it is the right sub for you? May be you need r/antiai ?

[-]

Marksta@reddit

Are you daring to doubt the supremacy of LLMs over humans? His LLM is surely far more adequate for these questions he's asking than the mere mortals of this sub. What could we, or the sub's search bar, possibly offer up that hasn't already been encoded into model weights?! It sounds like you belong with r/antiai

[-]

AppearanceHeavy6724@reddit

WTF are you talking about sir?

[-]

Marksta@reddit

Just pointing out the hypocrisy. I come here to speak to humans, not read LLM tokens. But I offended you for daring to want to talk to humans instead of OP's LLM. But you see no issue with OP talking to humans instead of his own LLM. Are these not the same exact scenario? I don't want to talk to OP's LLM and neither does OP. Yet you aren't directing OP to r/antiai ?

[-]

AppearanceHeavy6724@reddit

I DGAF if OP use LLM or not if their post is coherent and has some clear intent behind it. It is LocaLLama FFS, why would people allergic to LLM would hang out here is beyound me.

[-]

AppearanceHeavy6724@reddit

If all you want is LLMs 5060ti is almost twice slower than 3090 with dense models. It is a slow card.

[-]

Special-Art-9369@reddit (OP)

I see your point, I even looked into the 5070 ti but both models cost double the 5060 where I live. As I'm willing to sacrifice some speed for VRAM capacity and have read other posts with 5060 owners being happy with their cards for inferencing, which is what I'm after, I'm taking my sacrifices where I have to unfortunately.

Hopefully the market prices settle down soon - maybe I can experiment more with faster cards then!

[-]