what's the right motherboard/CPU to use for building a machine with 3 or 4 cards in it?
Posted by starkruzr@reddit | LocalLLaMA | View on Reddit | 42 comments
I've been looking around for boards that can support at least 3 x8 PCIe Gen 5 cards without loss of speed to any card and so far it's been very unclear what actually does this. I have the general idea that finding something with one 16-lane bifurcatable slot and one 8-lane slot at least shouldn't be that tough, but specific specs on this seem to be hard to find. it's also not super clear which CPUs I should be looking for in case I need to do offload, i.e. which have the best acceleration (anything with something like AVX-512, I guess?) usable for transformers. do we have a system building guide somewhere? TIA.
FoxiPanda@reddit
In general, I think this is poorly documented currently. I've seen a few github pages take a stab at it and I'd like to use something like pcpartpicker for it, but I think in general you have three options:
Personally, I went with the Threadripper route - you can get enough PCIe to have 4x cards natively on the board as long as you live within dual width PCIe cards (no chonkers).
The CPU matters less as long as you're loading up the entire model in VRAM, but if you are going to split between vram and system ram, the limiting factors rapidly become system memory bandwidth - the higher the better (AMD Genoa/Turin with 12x DIMMS in 1DPC can hit ~600GB/s-ish).
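For anyone wanting to sanity-check that ~600GB/s figure, theoretical DDR bandwidth is just channels × transfer rate × 8 bytes per transfer (a sketch with illustrative speeds; real sustained bandwidth is lower):

```python
# Theoretical DDR bandwidth: channels * MT/s * 8 bytes per 64-bit transfer
def ddr_bandwidth_gbs(channels: int, mts: int) -> float:
    return channels * mts * 8 / 1000  # GB/s

# 12-channel Genoa/Turin, 1 DIMM per channel
print(ddr_bandwidth_gbs(12, 4800))  # 460.8 GB/s at DDR5-4800
print(ddr_bandwidth_gbs(12, 6000))  # 576.0 GB/s at DDR5-6000
```

Compare that to a dual-channel desktop at DDR5-6000: `ddr_bandwidth_gbs(2, 6000)` is only 96GB/s, which is why system-RAM offload hurts so much.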
starkruzr@reddit (OP)
thanks for this. is something like Zen4's AVX-512 support much less important than system memory bandwidth if you have to do system RAM offload (i.e. if you are going to try to use something like Qwen3.5-397B)? basically I'm trying to figure out "is it ok going with a PCIe 4.0 machine and saving myself a lot of money." my plan is to run with multiple 5060Ti 16GBs as I already have one. three seems like the minimum to be able to run with a comfortable amount of context without quantizing the shit out of the kv cache and the model itself (for Qwen3.6-27B). would be nice to be able to grow into 4. these are 8-lane, PCIe Gen 5 cards though, so if you run them at Gen 4 you get ~16GB/s per card, and idk how well that scales for tensor parallelism.
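The per-card PCIe math here is easy to sketch. Per-direction throughput is roughly a fixed rate per lane per generation (approximate effective rates after encoding overhead, not exact spec numbers):

```python
# Approx. usable PCIe bandwidth per direction, GB/s per lane, by generation
PER_LANE_GBS = {3: 0.985, 4: 1.969, 5: 3.938}

def pcie_gbs(gen: int, lanes: int) -> float:
    return PER_LANE_GBS[gen] * lanes

print(round(pcie_gbs(4, 8), 1))  # ~15.8 GB/s: a Gen5 x8 card in a Gen4 x8 slot
print(round(pcie_gbs(5, 8), 1))  # ~31.5 GB/s: same card at full Gen5
print(round(pcie_gbs(4, 4), 1))  # ~7.9 GB/s: what a Gen4 x4 slot gives you
```

So dropping from Gen 5 to Gen 4 halves per-card bandwidth, which mostly matters for tensor parallelism's all-reduce traffic, not for plain pipeline splits.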
lemondrops9@reddit
Don't listen to the naysayers. I have 5060 Tis and am quite happy with them, as well as my 3090s. I'm able to run them off PCIe 3.0 x1 with no problem because I'm not using tensor parallelism.
starkruzr@reddit (OP)
I don't think I can get away from parallelism though to run something like Qwen3.6-27B.
ccbadd@reddit
You don't need parallelism to run Qwen3.6 models. Parallelism is used to speed performance when you have multiple users or agents hitting the cards simultaneously. If it's just for you and not a group of people you should be fine. You can still split a model across multiple cards you just can only do one request at a time serially.
lemondrops9@reddit
then you will need an even number of GPUs and decent-to-good PCIe bus speeds, depending on what you're willing to spend.
I'm not quite sure what would be necessary because I haven't tested tensor parallelism to see how much traffic goes over the PCIe bus. I have been tempted to hook up 4x 5060 Tis to my high-end mobo with one PCIe 4.0 x16 slot and three PCIe 4.0 x4 slots to see what it can do. But I would have to take apart my main AI PC to do that.
I doubt any CPU setup will give more speed vs. everything loaded onto 5060 Tis. At least I haven't seen anyone post anything faster than my setup. Unless the model doesn't fit into my VRAM; then those 8-channel systems blow mine away.
TheDailySpank@reddit
I'm building a new inference machine and decided triple 5060s on a 570, with everything else borrowed from existing hardware, should hold me over until hardware becomes reasonable again, or forever unobtainable.
starkruzr@reddit (OP)
570 as in X570?
TheDailySpank@reddit
Yes
FoxiPanda@reddit
Honestly at that point you start getting into implementation-specific territory and there are going to be a lot of variables to take into consideration. Realistically, you're going to take a significant performance hit offloading anything to system RAM, and the 5060 Ti's memory bandwidth isn't great to begin with (448GB/s - https://www.techpowerup.com/gpu-specs/geforce-rtx-5060-ti-16-gb.c4292 ), so this is not going to be a fast adventure.
I don't think PCIe Gen 4 is really going to be your problem; you're going to be thrashing memory copies across cards, to the root complex, and to and from system memory, and it's going to be real, real slow, I think. I'd have to really sit down and do a bunch of math to tell you how slow, but frankly, I wouldn't bother building this.
You're going to spend $500+ x4 on 5060 Tis, only end up with 48GB of VRAM spread across a bunch of extremely mediocre GPUs and mediocre memory bandwidth, and then you're going to spend $600+ on a board and $300 to $10000 on a CPU and then potentially $1000-7000 on RAM and end up with a machine that gets like 3-5 token/s... this is the opposite of what you really want.
You want fast VRAM tied to a GPU with strong compute and avoid system memory and the CPU as much as possible.
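The reasoning above can be sketched numerically: for single-stream decode, every generated token has to stream all active weights through the memory bus once, so tok/s is bounded by bandwidth / model size (rough ceiling only; real throughput is lower, and the model sizes here are illustrative):

```python
# Rough single-stream decode ceiling: tok/s <= memory_bandwidth / model_bytes,
# since each token reads every active weight once.
def decode_ceiling_tps(bandwidth_gbs: float, model_gb: float) -> float:
    return bandwidth_gbs / model_gb

# A dense 27B at ~Q4 is roughly 15 GB of weights (illustrative):
print(round(decode_ceiling_tps(448, 15), 1))  # one 5060 Ti, weights all in VRAM
print(round(decode_ceiling_tps(80, 15), 1))   # dual-channel DDR5 desktop RAM
```

This is why the offload estimates land in single-digit tok/s territory: once weights spill to dual-channel system RAM, the ceiling drops by almost an order of magnitude.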
starkruzr@reddit (OP)
... yeah. fuck. I guess I'm just not sure where to go to from where I'm at (single 5060Ti 16GB that I use for Qwen3.5-9B) when the goal is "run this particular model (27B dense) reasonably well with good context." I could sell the 5060Ti to help fund a new purchase. it's just that after that things get fuzzy.
veinamond@reddit
I've been researching in the same ballpark as you, more or less. The short conclusion I got to: if you want to run dense 27B / MoE 35B (Qwen 3.6) and you already have a 16GB card (e.g. 5060Ti), the easiest way is to buy another card. If your motherboard does not support an x8+x8 split, then a bifurcation card + risers is your path, or you cope with the possible ~10-20% performance loss. However, it will work, and work reasonably well. Not as well as on a 5090, but much cheaper and not too far off (probably 2 to 3 times worse, at a guess, but still usable, especially if MTP finally works for these models). Since you have a 5060Ti it is pointless to get a 3090 (different architecture versions) or AMD or Intel (too many differences). A 5070Ti is an option, but the 5060Ti will slow it down. The alternative (building another machine altogether) is a serious investment, and with the DRAM crisis going on, an expensive one. TL;DR: I would buy a second 5060Ti and call it a day.
My situation is worse - I have an RTX 3070 and *need* a bifurcation card and risers if I want a 2-GPU setup, because the second slot on my mobo is PCIe 4.0 x2.
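As a sanity check on whether two 16GB cards fit a dense 27B with decent context, one can sketch weights + KV cache. The layer/head numbers below are hypothetical stand-ins, not the real model config; check the actual model card before buying anything:

```python
# Rough VRAM estimate for a dense model: weights + KV cache.
# All architecture numbers in the example call are illustrative assumptions.
def vram_gb(params_b, bytes_per_weight, layers, kv_heads, head_dim,
            ctx, kv_bytes=2):
    weights = params_b * 1e9 * bytes_per_weight
    # K and V per layer, per token: 2 * kv_heads * head_dim * kv_bytes
    kv = 2 * layers * kv_heads * head_dim * ctx * kv_bytes
    return (weights + kv) / 1e9

# e.g. 27B at ~4.5 bits/weight (~0.56 B/weight), 48 layers, 8 KV heads,
# head dim 128, 32k context, fp16 KV cache:
print(round(vram_gb(27, 0.56, 48, 8, 128, 32768), 1))  # ~21.6 GB
```

Under those assumptions the model plus a 32k KV cache lands comfortably inside 2x 16GB, leaving headroom for activations and CUDA overhead.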
lemondrops9@reddit
I'm quite happy with my 5060 Tis. If you want me to run some benchmarks of 1-3 5060 Tis vs 1-3 3090s running from PCIe 3.0 x1, let me know.
DataGOGO@reddit
Better to use Xeon / Xeon W tbh.
Emf0rtaf1x@reddit
x1. This is the complete answer.
ccbadd@reddit
Last year I went with a HUANANZHI H12D 8D and an Epyc 7352. Total was about $400. I wish I had purchased the RAM before the RAM apocalypse, but I just pulled 128GB from my old Dell workstation. It has 4 PCIe 4.0 x16 slots, which I have filled with 2 W6800s and 2 V620s. It is a Chinese board I ordered from AliExpress, and so far it works great with no issues. No AVX-512, but that doesn't really matter to me as I run all my models in VRAM right now.
mjuevos@reddit
if you want to go a cheaper route than Threadripper >> go with the Gigabyte B850 AI Top.. it's quite the AI rig performer for the price. then a 9950X CPU. you can do 2 to 3 GPUs on this.
starkruzr@reddit (OP)
why the 9950X?
mjuevos@reddit
cheaper.. all the lanes you need, and the X3D really doesn't boost your AI much.
mjuevos@reddit
also check this vid.. he lays out another option that can get you 4x gpus https://www.youtube.com/watch?v=WRi0jApo9NM
Frizzy-MacDrizzle@reddit
I did a ton of research; about the only thing a server-level system may help with is the conversion of quants. My current mini PC will take a few minutes to convert a GGUF. But I'm training.
Note from the research I have done: the only bottleneck you might have is your number of lanes and their speed.
Let me back up. I have a mini PC with 4x OCuLink and a 5060 Ti 16GB. For running LLMs I don't see a need to go beyond my mini PC.
Now I'm into tokenization, GGUF writing, etc., where CPU processing comes in. I turned back the clock, did some research, and found a Xeon CPU that supposedly will support everything I need.
You sound like you're on a corporate budget.
czktcx@reddit
If offloading to RAM, RAM bandwidth is most important; choose a server motherboard.
If using a consumer platform, a PCIe expansion card can help, so don't worry too much about PCIe lanes...
ImportancePitiful795@reddit
Outside of workstation/server platforms you won't find any desktop motherboard with 3x PCIe 5.0 x8 to the CPU.
So there are several paths you can follow.
The hard way is to get a desktop board that supports x8/x8 bifurcation.
The easy "cheap" way: get an X399/X299 bundle (they usually come with some DDR4 RAM), plug all 3+ cards directly into the board, and use it.
Unfortunately what you're asking for needs Intel AMX or AMD/Intel ACE.
The latter, ACE, allegedly comes out for all the Zen6 CPUs next year. For Intel ACE we have no word outside the server CPUs. (ACE is Intel AMX on steroids, with specs agreed by both Intel and AMD in 2024.)
The Intel AMX solution means RDIMM DDR5 RAM, so get ready to put your hand very deep in your pocket, because RAM prices have gone up a lot. Last May I bought 1TB of RDIMM DDR5-5600 for €3600; today it's close to €22000.
Here you have 2 paths: 4th-gen Xeon with a QYFS (the 56-core ES CPU is cheap, around €100) and a server motherboard (around €700) like the MS3-0CP, or go down the Xeon 6 route with a 6980P ES (around €2100 for the CPU and another €1000 for the board). The latter is much better.
AVX-512 is not as good as Intel AMX and is way worse than AMD/Intel ACE when it comes to matrix computations.
What would I have done in your position? Either the AMD X399 or Intel X299 route, since almost everyone has DDR4 RAM kits lying around, and plug the GPUs in directly. No need for bifurcation etc., and you dodge the bullet of the more common X99 solution.
Or wait and see AMD Zen6 next year. We might be surprised not only by the desktop lineup but also by Medusa Halo (the replacement for the AMD 395/495).
Otherwise if you do not actually need all these GPUs, get a DGX Spark.
Vicar_of_Wibbly@reddit
I built one (https://blraaz.net) around EPYC Zen5 using the parts listed on the site. Happy to answer questions.
Emf0rtaf1x@reddit
TRX40, WRX80/Threadripper 3000 and up... SP3/Epyc Rome...
Technically you don't need anything special to do it. It's more a question of how much you want to pay for PCIe lanes/speed.
A 5950X on an X570 with x8/x8/x4 is not the fastest, but it will do. Workstation chipsets can get you 64+ PCIe lanes, so you can have the full x16 for every card.
Do you already have the cards?
veinamond@reddit
The x8/x8/x4 is available only on the very top-of-the-line X570 boards. My X570S UD, for example, doesn't even have x8/x8. Several years back I never thought I'd need more than one GPU =(
Emf0rtaf1x@reddit
🫤
I collect fancy motherboards....it's a stupid hobby for the most part.
veinamond@reddit
We all have unconventional hobbies =)
starkruzr@reddit (OP)
I have one 5060Ti 16GB and want to scale to 3, so just trying to figure out the best way to do that.
DataGOGO@reddit
Xeon W is the better path.
lemondrops9@reddit
Running 3x 5060 Ti 16GB and 3x 3090s on a cheap $100 mobo. I even have one GPU running off the WiFi M.2 slot. Get creative and it's easy to add more GPUs. The real issue after that: with 3 or more GPUs, Windows will slow things down by a lot.
starkruzr@reddit (OP)
thanks for this. what kind of parallelism are you doing between them?
lemondrops9@reddit
Using the default pipeline split; it works quite well overall. It's an easy way to keep adding cards. But if you need tensor parallelism you'll spend way more on a board and CPU. It really depends on the model you're trying to run, because it would be cheaper and faster to run 4x 5060 Ti 16GB if it fits in VRAM.
llm_practitioner@reddit
Finding enough PCIe lanes for 4 cards on a consumer board is a total headache. You really have to look at HEDT platforms like Threadripper or Epyc to get that kind of bandwidth without everything slowing down.
lemondrops9@reddit
If you don't need tensor parallelism it's quite easy to add 6-8 cards on a cheap mobo.
_shell-@reddit
Both of these boards have 7 PCIe 5.0 x16 slots:
W890E Sage SE with a Xeon 658X - has AMX instructions and will utilize 8-channel memory at full bandwidth (your RAM bandwidth/size will matter if running GPU + CPU inference, a la ktransformers)
WRX90E Sage SE with a Threadripper Pro 9955WX - AVX-512 instructions, but will be memory-bandwidth-capped with the 9955WX due to its having only two CCDs (fine if you will mainly run models in VRAM)
StardockEngineer@reddit
Consumer CPUs don't have enough PCIe lanes from just the motherboard and CPU. Step up to a used Xeon or Epyc.
DataGOGO@reddit
Xeon / Epyc / Xeon W / Threadripper
Drenlin@reddit
Something server or HEDT based would be best. They have a LOT more PCIe lanes.
jacek2023@reddit
check the price of X399 + 1920X; this is what I use currently
FullstackSensei@reddit
That's gen 3
FullstackSensei@reddit
Why do you need 3x8 Gen 5? What cards do you plan to have? If you plan to offload, a single Gen 5 lane is usually more than enough (or Gen 3 x4). If your GPUs have physically 16 lanes each, you'll save a kidney's worth of money by going with a PCIe 4 (and DDR4) server platform.
If this is for inference only, you might very well be overestimating how much bandwidth you need.
Gen 5 with a lot of lanes is the domain of workstation and server platforms. You'll pay several thousand for a motherboard and CPU, and several thousand more for RAM. Arguably the cheapest option would be Sapphire Rapids Xeon. It also has AMX, which is way, way better than anything AVX-512 can ever offer. Speaking of which, AVX-512 is overrated if you're offloading to GPU. All the heavy lifting will be done on the GPU; whatever is left for the CPU can be handled fine by AVX2, which is dual-ported on all modern CPUs anyway (i.e. each core has two AVX2 units that can execute two AVX2 instructions in parallel). Much more important than AVX-512 is core configuration. On Epyc, for example, you only get max memory bandwidth with all CCDs populated; otherwise infinity fabric limits you to ~25GB/s per CCD on DDR4 platforms (PCIe Gen 4) or ~50GB/s per CCD on DDR5 platforms (PCIe Gen 5).