Best way to build a 4× RTX 3090 AI server (with future upgrade to 8 GPUs)?
Posted by Lazy_Independent_541@reddit | LocalLLaMA | 24 comments
I'm planning to build a local AI workstation/server and would appreciate advice from people who have already done multi-GPU setups.
My current idea is to start with 4× RTX 3090 (24GB each) and possibly scale to 8× GPUs later if the setup proves useful.
My main workloads will be:
Coding LLMs for an agentic development setup
Running open-source coding models locally (DeepSeek, CodeLlama, etc.)
Using them with Claude Code–style workflows / coding agents
Image and video generation
Running ComfyUI workflows
Stable Diffusion / video models / multi-GPU inference if possible
Questions
- Hardware platform: What is the best platform for this type of build?
Options I’m considering:
Threadripper / Threadripper Pro
AMD EPYC
Intel Xeon
My goal is to start with 4 GPUs but keep the option to scale to 8 GPUs later without rebuilding everything.
- Motherboard recommendations: What boards work well for multi-GPU setups like this?
Things I’m trying to avoid:
PCIe lane bottlenecks
GPUs throttling due to slot bandwidth
Compatibility issues with risers
- Is 8× 3090 still worth it in 2026?
Since the 3090 is an older card now, I'm wondering:
Is it still a good investment for local AI servers?
What bottlenecks would I face with an 8×3090 system?
Possible concerns:
PCIe bandwidth
power consumption
NVLink usefulness
framework support for multi-GPU inference
- Real-world experiences
If you’re running 4× or 8× 3090 setups, I’d love to know:
what CPU / motherboard you used
how you handled power and cooling
whether you ran into scaling limitations
Goal
Ultimately I want a local AI server that can:
run strong coding models for agentic software development
run heavy ComfyUI image/video workflows
remain expandable for the next 2–3 years
Any build advice or lessons learned would be hugely appreciated.
zipperlein@reddit
I am running 4x 3090 in an open case in the basement. System is a 7900X + ASRock Livewire B650. I don't do anything special to keep it cool, tbh. A good lesson learned for 3090s is enabling P2P, which requires updating the VBIOS to enable ReBAR support (afaik every manufacturer has an update tool for this, Windows-only in my experience) plus a modified driver. I don't think 3090s will be worthless for local AI anytime soon; the memory bandwidth is just too good for that. If you don't specifically need it, I would avoid spending too much money on a processor with cores you won't use anyway. RAM is also not important if you don't want to do offloading. My vLLM LXC container runs on 16GB RAM.
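As a sanity check on that memory-bandwidth point, here is a back-of-envelope decode ceiling for a 4-card tensor-parallel setup. All figures are assumptions, not measurements: ~936 GB/s per 3090 and a ~70B model at ~4-bit weighing roughly 40 GB.

```python
# Rough token-rate ceiling from memory bandwidth alone (all figures assumed).
cards = 4
bw_per_card_gb_s = 936   # RTX 3090 GDDR6X bandwidth, approximate
model_gb = 40            # e.g. a ~70B model at ~4-bit quantization

# Each generated token streams the whole model through the cards' combined bandwidth.
tokens_per_s = cards * bw_per_card_gb_s / model_gb
print(f"theoretical ceiling: ~{tokens_per_s:.0f} tok/s")
```

Real throughput lands well below this (compute, communication, and kernel overheads), but it shows why high memory bandwidth keeps these cards relevant for LLM decoding.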
itroot@reddit
I also have a B650-based mobo and a 7700. Wondering how you managed to connect 4 3090s? Asking because I'd like to do the same :-)
zipperlein@reddit
My current setup is kinda janky. Two are connected over a repurposed PCIe switch on the x16 slot, and the other two via a riser off an M.2 slot and a regular x4 slot respectively. If I had to do it again, I would probably just go for a bifurcated x4 OCuLink setup on the main x16 slot. I had one of those Chinese x4 bifurcation boards; I'd advise avoiding them.
itroot@reddit
Thanks for sharing! OCuLink tops out at PCIe 4.0 x4, which puts a limit on it. It would be great if the market had an expansion card that could plug into a full-speed PCIe 5.0 slot and switch it out to 4 3090s.
zipperlein@reddit
The most crucial thing you need to respect is that when you bifurcate the x16 slot, each card will draw slot power as normal. This means the (dumb) splitter can draw up to 4x the slot power of one card, so you need some kind of powered splitter/riser.
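To put numbers on that, assuming the 75 W slot-power allowance from the PCIe spec:

```python
# Why a dumb x16 -> 4-way splitter over-draws the slot (assumed worst case).
SLOT_LIMIT_W = 75    # what one physical x16 slot is specified to supply
cards = 4

worst_case_w = cards * SLOT_LIMIT_W   # every card pulling its full slot allowance
print(worst_case_w)  # 300 -> 4x the slot's rating, hence powered risers
```

A riser with its own 6-pin or SATA power feed supplies that 75 W per card locally instead of routing it through the motherboard traces.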
ParaboloidalCrest@reddit
So you end up splitting the main (metal) PCIE slot?
zipperlein@reddit
No, I have a pcie-switch.
kidflashonnikes@reddit
Okay, so I can help you a lot here. First off, you never want to invest in more than 2 RTX 3090s. The price/compute is the best, but we are leaving the 3090s behind now for AI; that is why the market is being flooded with them. I run a lab at one of the largest privately funded AI labs in the world - you would know the name. We are already phasing out old cards for Blackwell architecture in the Nvidia chips. We are primarily interested in the new INT4 quant structure. I can assure you that this is what everyone is doing already. Older cards are not it anymore for AI. They are great for learning, hobbyists, etc., but for real work with AI you are better off either renting GPUs or buying a single RTX PRO 6000 and calling it a day.
For context, I have 4 RTX PRO 6000s, 1TB of RAM, 16TB of SSDs, and a Threadripper Pro 96-core CPU, all running on an ASUS WRX90 Sage SE mobo, and I still can't fully run the large models that I want to run. You will always be limited by compute. You should just buy 2 RTX 3090s and invest more right now in the platform - the motherboard, RAM (128GB), a 24-36 core CPU - and wait to save more money for the RTX PRO 6000. Your main goal now is to focus on the setup and architecture with the best compute/price entry point, which in this case is 2 RTX 3090s. It is extremely difficult to run an open-source model under 70B for agentic work that is hardcore. If anyone wants to call me out, feel free. My background is using LLMs to compress brainwave data in real time on live brain tissue. I can't really say what I am doing or who I work for, but recently a lot of the work was finally allowed by the DOJ/DARPA to be slowly released.
knob-0u812@reddit
Very cool of you to share. I have been using RunPod to do some light LoRA training, and all the buzz/hype about the quad-3090 setups has been giving me an unhealthy itch to take on hardware. It's tempting, but also feels foolish. I would call myself a hobbyist, but I have built LLM ETL workflows that support my professional work. The RTX 6000 call sounds wise. Keeping my hands out of the hardware sounds even more so.
SteppenAxolotl@reddit
What is the use case for 4 way local setups?
The leading edge is so close to being useful.
I would opt for the $108/year plan for GLM5 (3× the usage of the Claude Pro plan) and wait and see what hardware is needed for the leading edge at the end of this year. It would suck to pull the trigger less than a year too early and fall short of being able to run a distilled version of the first minimally dependable/competent model.
Lazy_Independent_541@reddit (OP)
That's a fair point about waiting for the next generation of models/hardware.
But I'm curious if you're building agentic workflows (coding agents, autonomous loops, tool use, etc.), wouldn't a local 4×3090 setup still be useful for experimentation and high volume runs without API limits or costs?
SteppenAxolotl@reddit
Yes. But that cost diff needs a good use case. Not just casual use.
My current use is casual, so running Qwen3.5-35B-A3B-UD-Q3_K_XL at ~100 t/s on an RTX 4090 for the price of electricity and sunk costs works for me.
Still looking for an RLM solution to escape context limits.
If I need more intelligence, all the cheap API tokens on tap will hold me for another year.
ImportancePitiful795@reddit
Considering the cost of RAM right now, where 128GB goes north of $1500, you should be looking for a miniPC with as much memory as possible.
Just 128GB of RAM and 4 3090s alone will get you into the $5000 range. You'd be better off with a DGX, a 395 + one 9700 or 5090, or an M5 Max/Ultra Studio.
SweetHomeAbalama0@reddit
I started with 2x 3090s, moved up to 4x, then 8x, and the setup now also includes 1-2x 5090s, so 9-10 cards total at any given time.
Any server-grade CPU should be fine compared to consumer processor options - just whatever has the most cores and the most recent architecture you can afford. DDR5 hardware simply costs more, so your budget may help determine the choice. I went with a TR Pro 3995WX/DDR4 and it does the job fine, and I focus on workloads like what you mentioned, with the caveat that I no longer use the 3090s for image/video stuff.
If you are going for high GPU density, focus on options that have the PCIe slots to support it - workstation or server boards with up to 7 slots, like the WRX80E Sage SE II.
Good investment for one/two persons? I mean, for a small operation on a personal budget, fuck yeah, but that's just my subjective opinion from my own experience. If this were for a professional production deployment, however, I would probably suggest a different route entirely. I usually delegate the 8x 3090s to LLM work like running DeepSeek while the 5090s do image/video gen work, and the 8x stack is excellent for this task. That said, my philosophy and strategy could be completely different from yours. I don't use the 3090s for image/video tasks at all really; they are "okay" for this, but the 5090s focus on that in my environment.

There aren't really any PCIe bandwidth constraints I've run into (just make sure slots and PCIe bifurcation settings are correctly configured in the BIOS where applicable, like if using risers rated for a certain gen or bifurcation cards); inferencing can be somewhat forgiving about this. There IS, however, a major inter-GPU bandwidth bottleneck by virtue of running a model across so many cards (assuming that's what you plan to do as well), and this creates a power bottleneck on the individual GPUs (meaning each card may only pull around 150W when inferencing together, even though their TDP is 350W+), which can actually be a positive thing, because it greatly reduces running inference costs and power infrastructure needs.

I've not needed NVLink for inferencing performance; I've only heard it's mostly useful for training. If you plan to train, that would change a lot of what I just said. I don't train at all, and don't plan to. Training, and/or running multiple LLMs simultaneously where each card could draw closer to its rated TDP at the same time, will require more power and hardware accommodations.
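Rough numbers on that power effect, using the figures from the comment above (illustrative, not measurements):

```python
# Per-card power during tensor-parallel inference vs. nameplate (figures assumed).
cards = 8
tdp_w = 350        # rated TDP of one 3090
observed_w = 150   # approximate per-card draw while all cards inference together

print("nameplate total:", cards * tdp_w, "W")       # 2800 W
print("observed total: ", cards * observed_w, "W")  # 1200 W
```

The gap is the inter-GPU communication stalls: each card spends much of its time waiting rather than computing, so the whole stack draws far less than the sum of the TDPs.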
TR Pro 3995WX + WRX80E Sage SE II. I used a 360mm Enermax AIO to manage CPU cooling; three of the GPUs are hybrid water-cooled, and the rest are cooled via multiple 140mm intake fans drawing fresh air into the enclosure. Power is managed by a 1600W and a 1300W PSU (2900W total), but the absolute maximum I've observed the unit pull during workloads is around 2000-2200W. It's theoretically possible to run on a single 20A circuit, but I would still recommend load balancing, especially if it's expected to run heavy workloads for extended periods of time.

Ambient cooling is managed by wheels. Wish I had a better answer for that, but there's only so much you can do putting 8+ high-power graphics cards in a box; the room will inevitably heat up. My workaround was putting the server on wheels so it can be rolled from location to location - at least we get to choose which room gets the dumped heat.

Space was the biggest scaling limitation I ran into, and it was resolved by the case/chassis. Dual-chamber cases may be something to look into if you'll have up to 4, but for more than that the options start to require some creativity, or just go the mining-rack route. I ended up finishing the project with a Thermaltake Core W200, which I highly recommend for this purpose if you are able to find one.
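A quick check of the single-20A-circuit claim, assuming a US 120 V circuit and the usual 80% continuous-load rule:

```python
# Can 2000-2200 W live on one 20 A / 120 V circuit? (assumptions as stated)
volts, amps = 120, 20
continuous_w = volts * amps * 0.8   # NEC 80% derating for continuous loads
observed_peak_w = 2200              # peak draw reported above

print(continuous_w)                    # 1920.0
print(observed_peak_w > continuous_w)  # True -> better to split across circuits
```

So the observed peak exceeds what one circuit should carry continuously, which is exactly why load balancing across circuits (or a 240 V line) is the safer call.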
Yeah, this is still a highly viable approach to getting 192GB of "pretty fast" VRAM; 3090s will just leave some room to be desired in the image/video gen department. It'll still work, but I've found they aren't the most efficient for this - power and heat become much more of a concern when relying on 3090s for image/vid gen. So maybe this is the only asterisk I can offer: they are EXCELLENT cards for LLMs, but only FINE for image/video, and only if you're not working in the same room where all the heat gets dumped. I recommend the 50 series for that task specifically; they are just so much more efficient in comparison for image/video gen.
AutomaticDriver5882@reddit
If you don’t do heavy loads, you can get Thunderbolt docks on eBay or Amazon. I have a NUC running 5 GPUs.
Sweet_Drama_5742@reddit
TLDR: 8x 3090s is fine for light hobby usage and experimentation/exploration, but not sufficient for real development workloads (due to speed + model sizes).
I had/have the same goals as you: I ran 10x 3090s at one point, but ran into lots of issues (blew up an add2psu adapter since I had designed 3 PSUs into the one system; constant headaches with PCIe connection reliability - GPUs would drop off the bus mid-run; etc.). Ultimately, as my primary use is serious code flows on medium/large codebases, there wasn't really anything that (1) would load without lobotomizing quantization into the realm of "not good enough", or (2) even with MoE models like gpt-oss-120b or Q5 MiniMax M2.5, was quite fast enough for "serious" development work over long contexts in code harnesses like opencode.
What has worked for me: upgrading 4 of the 3090s to RTX 6000 Pro (Max-Q) and consolidating to a single power supply. Currently using GLM 4.7 Q8 on the RTX 6000s (vLLM - 50 tokens/sec generation) and running smaller multimodal models on the 4 3090s (including ComfyUI, some STT/TTS, etc.). Obviously, this is extremely pricey and not cost-effective today.
DataGOGO@reddit
Xeon, hands down - you get AMX and a better memory controller. Any Sapphire or Emerald Rapids CPU is fine. All support 8-12 memory channels; Emerald or Granite Rapids is better than Sapphire Rapids.
If you are running LLMs and not training, then this works fine. If you are training, the lack of NVLink across more than 2 GPUs makes it impractical to run more than two GPUs due to the slow PCIe bandwidth; you will spend more time doing all-reduce than forward passes.
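A sketch of why that all-reduce cost dominates, using assumed figures (gradient size, GPU count, and PCIe throughput are all illustrative):

```python
# Illustrative ring all-reduce cost over PCIe (all figures assumed).
n = 4             # data-parallel GPUs
grad_gb = 48      # fp16 gradients for a ~24B-parameter model
pcie_gb_s = 25    # practical PCIe 4.0 x16 throughput

traffic_gb = 2 * (n - 1) / n * grad_gb   # per-GPU traffic in a ring all-reduce
seconds = traffic_gb / pcie_gb_s
print(f"{seconds:.2f} s of gradient sync per step")  # 2.88 s
```

Seconds of synchronization per optimizer step can easily exceed the forward/backward compute time at this scale, which is the bottleneck NVLink (at roughly an order of magnitude more bandwidth) removes.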
amejin@reddit
Do you need to get an electrician out to put in a socket for the larger power draw? That's a lot of pressure to put on a single outlet.
ryanp102694@reddit
I did it myself. For me however it was extremely straightforward because I had an existing dryer outlet in my basement that wasn't being used for a dryer.
My father in law is very handy and was able to walk me through it which also gave me peace of mind
amejin@reddit
Thanks. That just tells me that I will eventually have to do some electrical work if I want to host a home lab. Appreciate the response.
HugoCortell@reddit
Another thing to consider is using NVLink or whatever if you can get your hands on the right 3090s.
applegrcoug@reddit
I have a setup that actually runs six 3090s. The motherboard is an MSI Tomahawk X670E combined with a 9950X and 64GB of RAM. AM5 gives 28 PCIe lanes, right? Tomahawks are kind of cool in that the motherboard actually lets you use the lanes. In the primary slot, I have a 4x4x4x4 OCuLink bifurcation card, so those four GPUs all get four CPU lanes each. Then, in the primary NVMe slot, another OCuLink adapter, so the fifth GPU also gets four CPU lanes. Then on the X670E there is an x4 PCIe expansion slot that is also direct to the CPU, for a sixth OCuLink connection. What is nice is that the X870E can also be tweaked this way... it is the only AM5 board I've found where you can turn off the PCIe x4 lanes to the USB4 ports and dump those lanes to an NVMe slot.
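Tallying the lane budget described above (my assumed allocation; the chipset downlink figure is the standard AM5 arrangement):

```python
# Accounting for AM5's 28 CPU PCIe lanes in the six-GPU build above.
lanes = {
    "x16 slot, bifurcated 4x4x4x4 (GPUs 1-4)": 16,
    "primary M.2 -> OCuLink (GPU 5)": 4,
    "CPU-attached x4 slot -> OCuLink (GPU 6)": 4,
    "chipset downlink": 4,
}
print(sum(lanes.values()))  # 28 -> every CPU lane spoken for
```

Each GPU getting a direct x4 CPU link is why this works without a PCIe switch, at the cost of Gen4 x4 bandwidth per card.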
Another thing I found was that OCuLink was the only way to get the cards to detect and run at PCIe Gen4. Riser cables were too flaky even at Gen3.
Now then, I too run all of it on an open frame, and I set a power limit of 120W on the cards. Total draw when working is 1000W. I need to test tps, but I think it only cost them maybe 1 tps versus when they were at 180W or so. I also ran a 220V circuit to a PDU for my servers.
norofbfg@reddit
I would lean toward EPYC since the PCIe lanes give more breathing room once you move past four GPUs.
ryanp102694@reddit
I'm finishing up a 4x3090 build now. I have:
Epyc 7742 (IMO the best choice due to PCIe lanes and memory channels)
AsRock ROMED8-2T (so many PCIE slots!!)
256gb DDR4
I'm currently unable to install the 4th 3090 because I need to be able to safely run 2 PSUs, and I'm still waiting on my add2psu board.
For me the biggest thing I didn't see coming was dealing with the amount of power that I'd need. I just finished installing an L6-30R in place of an old dryer outlet, which I connected a PDU (https://www.walmart.com/ip/Valiant-Power-240V-30A-Vertical-Rackmount-PDU-4-C13-2-C19-Outlets-Digital-Display-Resettable-Breaker-L6-30P-Input-Heavy-Duty-Metal-Housing/16860552281) to. This was less scary than I thought it'd be and was a pretty easy DIY.
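For scale, here's what that L6-30R circuit buys, assuming US 240 V split-phase and the usual 80% continuous-load derating:

```python
# Sustained capacity of a 30 A / 240 V (L6-30R) circuit.
volts, amps = 240, 30
continuous_w = volts * amps * 0.8   # 80% continuous-load derating
print(continuous_w)  # 5760.0 W of sustained headroom
```

That is comfortably more than four 3090s plus an EPYC platform will ever draw, which is why repurposing a dryer circuit is such a common move for these builds.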
Once I have the 4 3090s running on the 2 PSUs and the software stack working how I want, I'll probably replace the 2 PSUs with a single 3000W PSU.
I don't do any nvlink or anything.
I just migrated to this single-socket mobo from a dual-socket one, so I've got an extra Epyc 7742 that I'm looking to sell. I've still got the dual-socket motherboard, but I don't really recommend it (not enough PCIe lanes, and it requires 2 CPUs to use all the memory channels).