How are some of you running 6x GPUs?
Posted by eat_those_lemons@reddit | LocalLLaMA | 69 comments
I am working on expanding my AI training and inference system and have not found a good way to expand beyond 4x GPUs without the mobo+chassis price jumping by $3-4k. Is there some secret way that you all are doing such high GPU-count setups for less, or is it really just that expensive?
x0xxin@reddit
I have a 2U Rackmount Gigabyte G292 running 7 GPUs. Best for blower style cards though. Here's an example: https://ebay.us/m/RettGL
CoupleJazzlike498@reddit
The 4x wall is real. Suddenly you are not just buying GPUs, you need a whole new mobo, bigger case, beefier PSU, and don't even get me started on cooling. The infrastructure costs hit harder than the actual cards.
How's your power/cooling situation now with 4x? That's usually where I start sweating lol.
I ended up saying screw it and went hybrid: keep my current setup for regular stuff and just rent GPUs when I actually need more. Saves me from dropping like $8k on hardware that will sit around doing nothing half the time. I have used deepinfra and a few others for the bigger jobs - works out way cheaper.
eat_those_lemons@reddit (OP)
Yeah, that is maybe a better way to describe the problem; even going from 3x to 4x was a huge jump in price because of how few cases have 8 PCIe slots.
My power situation is tight but workable. I have my NAS that idles at 200W, and the AI server draws between 1k-1.2k watts, all on an Eaton UPS for cleaner power. To keep temps down I have all the cards limited to 200W (which is pretty close to where they end up anyway once they thermal throttle).
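For reference, the 200W cap is just nvidia-smi under the hood; a minimal sketch of scripting it from Python (the GPU indices 0-3 are an assumption, adjust for your box, and it needs root):

```python
import subprocess

# Cap each of the four cards at 200 W (resets on reboot unless re-applied).
for gpu_index in range(4):
    subprocess.run(
        ["nvidia-smi", "-i", str(gpu_index), "-pl", "200"],
        check=True,
    )
```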
My cooling situation is going to be interesting. During the summer it has worked out in the limited testing I've done so far, since I've had the AC running and can open the vents in the basement. However, once it starts snowing I am worried my temps are going to rise. If that happens I'm unsure what I'm going to do. I rent, so I don't know if they would be happy with me installing a mini-split in the basement XD
I suspect that I will also go hybrid
takuonline@reddit
What kind of training do you do, if I may ask?
eat_those_lemons@reddit (OP)
I'm messing around with the architectures from various papers, training them on new stuff, making modifications, stuff like that. I'm also working on fine-tuning an LLM to analyze research papers for me.
takuonline@reddit
Hey, can I reach out to you? I am very interested in just understanding what you are doing and what papers you are implementing, because I have also been doing some fine-tuning. I am an ML engineer, by the way.
DisgustedApe@reddit
I just wonder how you guys power these things? At some point don’t you need a dedicated circuit?
MelodicRecognition7@reddit
Google "PCIe bifurcation" and "PCIe splitter cable". For inference you do not need a full x16 PCIe link, so you could split one x16 port in half and insert 2 GPUs at x8 speed, or 4 GPUs at x4 speed.
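To confirm what link each card actually negotiated after splitting, here's a rough sketch using the pynvml bindings (assumes the nvidia-ml-py package is installed; not something I've scripted exactly like this):

```python
from pynvml import (
    nvmlInit, nvmlShutdown, nvmlDeviceGetCount, nvmlDeviceGetHandleByIndex,
    nvmlDeviceGetName, nvmlDeviceGetCurrPcieLinkGeneration,
    nvmlDeviceGetCurrPcieLinkWidth, nvmlDeviceGetMaxPcieLinkWidth,
)

nvmlInit()
try:
    for i in range(nvmlDeviceGetCount()):
        handle = nvmlDeviceGetHandleByIndex(i)
        # Current width/gen reflect what was negotiated after bifurcation.
        print(
            nvmlDeviceGetName(handle),
            f"gen{nvmlDeviceGetCurrPcieLinkGeneration(handle)}",
            f"x{nvmlDeviceGetCurrPcieLinkWidth(handle)}",
            f"(slot max x{nvmlDeviceGetMaxPcieLinkWidth(handle)})",
        )
finally:
    nvmlShutdown()
```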
eat_those_lemons@reddit (OP)
Ah, there are PCIe bifurcation cards that do full x16. Does it mess up training if you use one of those?
Also, are there bifurcation cards you recommend? I'm seeing that they are like $300 a piece, so for 6 of them it's almost $2k. Am I looking at the wrong bifurcation cards?
Also, do you know what happens if you train at x8?
Also also, is there some special case people are using? Or just putting the GPUs in a pile on the floor?
MelodicRecognition7@reddit
Sorry, I can't recommend any splitters because I haven't used any; I use a generic tower case and a motherboard with full x16 ports. Yes, it will slow down training because data will be moving at x8 speed instead of x16 (if you do not connect the cards with NVLink).
Leopold_Boom@reddit
What is the current recommended PCIe 5.0 x16 to four-x4 splitter?
eat_those_lemons@reddit (OP)
It sounds like you've seen people use those cables before?
MelodicRecognition7@reddit
I've seen them used in gaming rigs and small computers with an external GPU and I haven't seen any bad reviews; anyway, DYOR.
deleted_by_reddit@reddit
[removed]
campr23@reddit
I myself am working on an ML350 G9 with 4x 5060 Ti 16GB. No need for splitters or anything, and 1600W of power as standard in that chassis. Up to 3200W (and more) is available if you go for the 4x power supply expansion.
Karyo_Ten@reddit
Bifurcation splits and shares the PCIe lanes. It's fine if you do independent compute with no communication between GPUs, and bad otherwise.
bennmann@reddit
There are also backplane options, although I have no hands-on experience. I assume if a motherboard lists bifurcation it might help? Maybe? PCIe backplanes are a black box:
https://www.ebay.com/itm/135189657675
+ a cardboard box
Karyo_Ten@reddit
Just buy 1 or 2 RTX Pro 6000.
eat_those_lemons@reddit (OP)
$4k solutions?
Also, mxfp4 sounds super interesting and I would love to do some training with it. I'm wondering how much faster 1 RTX Pro 6000 would be than the 4x 3090s at mxfp4 training? How would I find that out?
Karyo_Ten@reddit
I meant extra $4K, why not go all the way and do extra $8K.
Note that gpt-oss mxfp4 is different from nvfp4, but both can be accelerated in hardware; IIRC mxfp4 has a rescaling factor every 32 fp4 elements while for nvfp4 it's every 16 elements.
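Conceptually both are block-scaled fp4; a toy sketch of the idea (not the real hardware kernels or scale encodings, just illustrating the 32- vs 16-element block scaling):

```python
import torch

# E2M1 (fp4) representable magnitudes; signs handled separately.
FP4_GRID = torch.tensor([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def fake_quant_fp4(x: torch.Tensor, block_size: int) -> torch.Tensor:
    """Toy block-scaled fp4 round-trip: mxfp4-style uses block_size=32,
    nvfp4-style uses block_size=16 (the real formats also differ in how
    the per-block scale itself is encoded)."""
    blocks = x.reshape(-1, block_size)
    # One shared scale per block, chosen so the block max maps to 6.0.
    scale = blocks.abs().amax(dim=1, keepdim=True).clamp(min=1e-12) / 6.0
    scaled = blocks / scale
    # Snap each scaled value to the nearest fp4 grid point.
    idx = (scaled.abs().unsqueeze(-1) - FP4_GRID).abs().argmin(dim=-1)
    q = FP4_GRID[idx] * scaled.sign()
    return (q * scale).reshape(x.shape)

x = torch.randn(4, 64)
err32 = (x - fake_quant_fp4(x, 32)).abs().mean()
err16 = (x - fake_quant_fp4(x, 16)).abs().mean()
print(f"mean abs error, 32-elem blocks: {err32:.4f}, 16-elem blocks: {err16:.4f}")
```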
I suggest you prepare some benchmarks with your current hardware or for free with Google Colab or Kaggle GPU notebook.
Once they are ready, rent both for an hour, which should cost you less than $4, and compare. A very small investment to get hard numbers.
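Even something as simple as a timed GEMM loop works as a first sanity check before renting (a sketch; swap in the shapes and dtypes that match your actual training):

```python
import time
import torch

def bench_matmul(dtype=torch.bfloat16, n=8192, iters=50):
    """Rough GEMM throughput check; run the same script on each GPU you compare."""
    a = torch.randn(n, n, device="cuda", dtype=dtype)
    b = torch.randn(n, n, device="cuda", dtype=dtype)
    for _ in range(5):          # warmup
        a @ b
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        a @ b
    torch.cuda.synchronize()
    elapsed = time.perf_counter() - start
    tflops = 2 * n ** 3 * iters / elapsed / 1e12
    print(f"{dtype}: {tflops:.1f} TFLOPS")

bench_matmul()
```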
darkmaniac7@reddit
I have 6x 3090s watercooled: 4x on regular PCIe slots on a ROMED8-2T board, and the last 2 split out with SlimSAS 8i connectors to the other side of the case, all in a W200 case.
Aware_Photograph_585@reddit
If inference only:
PCIe bifurcation alone is probably fine
x4 is good enough, just might be a little slower to load the model
If training:
PCIe bifurcation with a retimer/redriver is mandatory
I spent months trying to figure out why I was getting occasional random errors and why training speed was subject to random fluctuations. Added redrivers, and everything worked perfectly
If you're using PCIe 3.0 with short cables, a retimer/redriver may not be 100% necessary. But personally I won't ever again run PCIe extension cables without a retimer/redriver
My prior tests with FSDP showed that x4 is only a couple percent slower than x8 when using a large gradient_accumulation and cpu_offload to enable large batch sizes
Also, what 6 gpus?
eat_those_lemons@reddit (OP)
I have 4x 3090's and would just add more
Oh, that is interesting that FSDP is only a couple of percent slower. Is x8 only a couple of percent slower than x16?
Also do you have a retimer/redriver you recommend? (and it sounds like use?)
Smeetilus@reddit
Read some of my recent posts
Aware_Photograph_585@reddit
I assume you mean the ones enabling P2P with 3090s?
I'm using 48GB 4090s now, and the vBIOS shows the BAR as 32GB. So I'm not sure if the tinygrad drivers will work.
Also, my motherboard BIOS doesn't have Resizable BAR (H12SSL-i), but it looks like you found an updated BIOS. I'll need to go find it.
Aware_Photograph_585@reddit
Yeah, I did the tests about a year ago with a full fine-tune of the SDXL UNet (SHARD_GRAD_OP, mixed precision fp16) on 2x 3090s. Noticed very little difference in training speeds between x4, x8, and x16; it's not sharing that much information between GPUs when doing a gradient update. Latency seems to be a bigger factor. FULL_SHARD completely kills training speed regardless of PCIe setup; you'd need NVLink to fix that.
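The setup I'm describing looks roughly like this in PyTorch FSDP (a sketch, not my exact training script; the Linear is just a stand-in module):

```python
import torch
import torch.distributed as dist
from torch.distributed.fsdp import (
    FullyShardedDataParallel as FSDP,
    ShardingStrategy, CPUOffload, MixedPrecision,
)

# Launch with: torchrun --nproc_per_node=2 this_script.py
dist.init_process_group("nccl")
torch.cuda.set_device(dist.get_rank())

model = torch.nn.Linear(1024, 1024).cuda()  # stand-in for the real UNet

fsdp_model = FSDP(
    model,
    # SHARD_GRAD_OP shards grads/optimizer state but keeps params gathered
    # between forward and backward, so per-step GPU traffic is mostly the
    # gradient reduction (much less chatty over PCIe than FULL_SHARD).
    sharding_strategy=ShardingStrategy.SHARD_GRAD_OP,
    cpu_offload=CPUOffload(offload_params=True),
    mixed_precision=MixedPrecision(
        param_dtype=torch.float16,
        reduce_dtype=torch.float16,
    ),
)
```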
Can't recommend one, since I don't know what brand they are; I just bought them locally in China. You will need to check the config of the redriver though. Mine have a 4x4 mode or a 2x8/1x16 mode, and require a firmware flash to change. I did have a retimer that worked fine before I burned the chip. Redrivers/retimers have small heatsinks since they're usually in servers, so you'll want to add a small fan.
eat_those_lemons@reddit (OP)
You said that latency is a bigger factor. I'm worried the bifurcation cards add a ton of latency, since your traces go from 6 inches to 20+ inches; is that correct?
Aware_Photograph_585@reddit
Not an expert on this, but the research I did said that the latency introduced by PLX cards is negligible. I assume it's the same for retimers/redrivers. They're also designed to be used in servers, which require the PCIe signal to remain within spec. Signal quality is the bigger issue, hence the need for retimers/redrivers.
I don't think cable length is an issue at the lengths we use. I use up to 40cm SFF-8654 cables with my retimers and have never had an issue.
The latency I talked about is consumer GPU-to-GPU communication over PCIe. Consumer GPUs don't support P2P over PCIe, so traffic has to go through the CPU/RAM, introducing considerable latency for GPU-to-GPU communication. I'm guessing it's a latency bottleneck, because if it were a raw speed issue, I would have seen differences between x4, x8, and x16 in the training gradient update sync times.
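You can check what the driver actually reports for P2P between your cards with a couple of lines of PyTorch (a sketch):

```python
import torch

# Which GPU pairs will the driver let talk directly over PCIe?
n = torch.cuda.device_count()
for i in range(n):
    for j in range(n):
        if i != j:
            ok = torch.cuda.can_device_access_peer(i, j)
            print(f"GPU {i} -> GPU {j}: "
                  f"{'P2P' if ok else 'no P2P (falls back through CPU/RAM)'}")
```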
nn0951123@reddit
There is a thing called a PCIe switch.
https://www.broadcom.com/products/pcie-switches-retimers/expressfabric/gen4/pex88096
But you will need p2p support or it will be slow.
eat_those_lemons@reddit (OP)
Ah, these look super interesting. I wonder how performance drops if you train using 2 of these, for example. The P2P between the boards would be slow, so I'm curious if that negates the benefits.
https://c-payne.com/products/pcie-gen4-switch-5x-x16-microchip-switchtec-pm40100
nn0951123@reddit
You will get bottlenecked by the uplink/downlink.
The inter-card bandwidth is physically limited, so you only get what you have; in your case, running 2 of those switches will result in a PCIe 4.0 x16 max speed between the boards.
FullstackSensei@reddit
Two cheap options if you can find them: X10DRX or X11DPG-QT.
Your GPUs will need to be two slots thick, or even better one slot thick, to achieve high density on the X10DRX. If you're running 3090s, then your only realistic option is to watercool them if you want to avoid using risers. If you don't mind risers, the X10DRX can let you plug in 10 GPUs using riser cables, with each getting its own x8 Gen 3 link.
This is all assuming your primary workload is inference. If you're training, I'd say just rent something from vast, runpod, or lambda cloud. You'll iterate much faster and unless you're training 24/7 for months at a time, it's much cheaper than doing it locally.
eat_those_lemons@reddit (OP)
Looking at the numbers, a model that takes 6 months to train on a 4x 3090 cluster (mxfp4) would take 6.5 days on a single RTX Pro 6000? Is that the sort of speedup that would actually happen?
FullstackSensei@reddit
The 3090 doesn't support mxfp4, but you can get SXM H100s for not much more, and those are very, very fast, even faster than the RTX Pro 6000 (even Blackwell) because of how much more memory bandwidth they have.
And who would want to do a 3-6 month training run???!!! To me that doesn't sound very smart. Even with checkpointing, it's very risky if you have any issues. And even if you don't have issues, your model will be outdated by the time it's finished training due to how quickly the landscape is still changing.
Your questions sound like you've never done any training before, nor have much info on hardware. You definitely shouldn't be spending anything on hardware, and focus your time and energy on learning the basics.
eat_those_lemons@reddit (OP)
What basics would you recommend starting with? While working on hardware I have been going through things like Andrej Karpathy's Zero to Hero course.
Are there other areas you would recommend that I work on learning?
FullstackSensei@reddit
seriously, have a chat with chatgpt.
eat_those_lemons@reddit (OP)
Well a) I'm doing that and more information is never bad and b) that was rude
CheatCodesOfLife@reddit
+1 for this (with the exception of smaller models that you can train with 1 GPU)
FullstackSensei@reddit
If it fits on one GPU, then you don't need that much PCIe bandwidth anyways and OP's question becomes moot.
Freonr2@reddit
Workstation/server boards have significantly more PCIe lanes than desktop parts. Like 96 to 128+ PCIe lanes vs 24 on consumer desktop boards/cpus. Many boards out there will have seven full x16 slots and still have 2-3 NVMe slots since PCIe lanes are in such abundance.
MCIO (up to PCIe 4.0 x8) or OCuLink breakout PCIe adapter cards and cables, with a mining rig chassis. The MCIO adapters and cables are readily available in PCIe 4.0 spec at least. With a proper server board you could end up with something like ~14 GPUs at PCIe 4.0 x8 each, if you can figure out how to physically orient them. You could get full x16 per card as well with the right adapters using two x8 MCIO cables, but you'd probably be back down to 6-8 total GPUs. Some boards even have a few MCIO x8 or OCuLink ports right on the board, so you can skip the PCIe-to-MCIO adapter cards and just get the MCIO-to-PCIe x16 adapters and cables.
Mining rig chassis are not very expensive.
Epyc 700x boards and CPUs are fairly reasonable (~$600-800 for a board, <$400 for CPUs), not much more than a high-end consumer desktop (i.e. a 9950X3D and an X870E board), but you get 128 PCIe lanes. You pay a penalty on single-thread performance, only get DDR4 (but 8 channels instead of 2), and you need to carefully select an 8-CCD CPU to make the most of the memory channels if you ever intend to use CPU offload. Some boards may lack typical desktop features like audio or WiFi, or have fewer USB4 ports.
You need to be very careful in researching parts. Read the motherboard manuals carefully before selecting. Make sure you understand CCD count vs useful memory channels on AMD platforms and choose the exact CPU according to intent.
eat_those_lemons@reddit (OP)
Thanks for the detailed answer! I'm curious, do you know if PCIe 4.0 vs 5.0 affects training times very much?
Freonr2@reddit
Don't know, I try to keep up on this sub for benchmarks but haven't seen any direct benchmarks.
For inference probably not a big deal?
C0smo777@reddit
I recently hooked up 6x 5090s doing this and it works really well. Not cheap, but the CPU and mobo are the cheap part.
HilLiedTroopsDied@reddit
Look at the cache size on AMD's spec sheet: some core configs below 64 cores also come in full-cache versions (the max without 3D V-Cache), and those are also the 8-CCD parts with full memory bandwidth.
silenceimpaired@reddit
Super helpful… and terrifying all at once.
Freonr2@reddit
Yeah you really need to do some research before going this route so you don't get surprised later as you try to expand or build.
Motherboard/platform selection is pretty important and where I might start. You can learn a lot just by reading motherboard manuals.
Swimming_Whereas8123@reddit
For inferencing a single model, go with 1, 2, 4, or 8 GPUs, or expect a headache in vLLM. Tensor and pipeline parallelism are not very flexible.
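E.g. with the offline vLLM API, something like this (a sketch; the model name is just an example):

```python
from vllm import LLM, SamplingParams

# tensor_parallel_size works best at powers of two; odd counts like 6 often
# fail the attention-head divisibility check or force pipeline parallelism.
llm = LLM(
    model="meta-llama/Llama-3.1-70B-Instruct",  # example model
    tensor_parallel_size=4,
)
out = llm.generate(["Hello"], SamplingParams(max_tokens=32))
print(out[0].outputs[0].text)
```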
eat_those_lemons@reddit (OP)
Oh that's right, you have to use powers of 2 if you want vLLM to work well; forgot about that.
Grouchy_Ad_4750@reddit
https://www.asrockrack.com/general/productdetail.asp?Model=ROMED8-2T with PCIe risers
Smeetilus@reddit
I built a system using this board. Works.
Amazon.com: AMD EPYC™ 7282, S SP3, 7nm, Infinity/Zen 2, 16 Core, 32 Thread, 2.8GHz, 3.2GHz Turbo, 64MB, 120W, CPU, OEM : Electronics
Amazon.com: ASRock Rack ROMED8-2T Single Socket SP3 LGA 4094/ DDR4/ SATA3&USB3.1/ V&2GbE/ ATX Motherboard : Electronics
koalfied-coder@reddit
In a server chassis and comparable mobo
Zyj@reddit
Get a used threadripper pro, sometimes you can get them for not too much money.
ArtisticKey4324@reddit
You can bifurcate, but consumer-grade mobos will fight you tooth and nail past two GPUs; they just aren't made for it. You can probably get away with four with an AMD chipset + bifurcation and just use USB/Thunderbolt past that, but your best bet for six without contemplating suicide the whole time is an old Threadripper + mobo with ample lanes. You can get the CPU + mobo for like $300.
Only_Situation_4713@reddit
2x eGPU on Thunderbolt, 2x PCIe, and another 2x via bifurcation. Does it work? Yes. Am I divorced now? Yes
ArtisticKey4324@reddit
“What did it cost you?”
“Everything”
Amazing_Trace@reddit
wow, AI didn't even need to be running a sex robot to break up marriages, wild omens
outerproduct@reddit
Tommy, do you know why divorces are so expensive? Because they're worth it!
xXprayerwarrior69Xx@reddit
You can now train your ai gf model without any distraction. Optimizations all around
diaperrunner@reddit
Was it worth it? Yes yes it was
DataGOGO@reddit
You can buy a used workstation/server class MB/CPU for pretty cheap.
MachineZer0@reddit
For consumer GPUs you have to use a mining rig. For datacenter GPUs there are plenty of 6- and 8-GPU 4U Gen 9 servers for $400-1000.
thekalki@reddit
https://www.supermicro.com/en/products/motherboard/m12swa-tf has 6 full PCIe x16 slots
munkiemagik@reddit
Not everyone likes janky setups; I get that there are some people who love everything to be ordered and neat and tidy. Those people need not read any further X-D
I wanted a multi-GPU setup with all the PCIe bandwidth for training (I'm just a hobby tinkerer, so my priority was just 'as long as it somehow works'). You absolutely do NOT need to spend $3-4k to have a chassis + mobo + CPU for multi-GPU.
The CPU doesn't need to be the latest, most powerful one. I chose a Threadripper Pro 3945WX (12c/24t) + WRX80 mobo + 128GB DDR4 + open-air mining frame. In total it was all less than £600; I came across a great deal on eBay so gladly took it. You can go even lower on CPU and mobo (someone in r/LocalLLaMA was advising budget-conscious individuals to go for an X399 mobo and a Threadripper 1000-series CPU (60 PCIe lanes), which would drop the cost even further by a couple hundred).
The TR Pro 3945WX has 128 PCIe lanes from the CPU, and the mobo has 7x PCIe x16 slots (with bifurcation if I ever need more than 7 GPUs).
My price doesn't include PCIe risers, nor does it include the PSU, as I already had one.
kryptkpr@reddit
Look into SFF-8654 (x8) and SFF-8611 (x4).
There are both passive and active versions of these; it depends on your cable lengths. Under 60cm, cheap passive is fine, but if you need a meter you're probably going to have to pony up for a retimer or a switch (switches are cheaper, but they come with latency).
Prudent-Ad4509@reddit
Assuming that you want to use a PCIe connection only, there is no *efficient* way to do that without switching to Threadripper/Epyc/Xeon platforms with PCIe extenders and a custom-built case. The most promising ones are Epyc-based. The reason is that both the number of PCIe slots and the available bifurcation options are very limited on consumer PCs these days.
For example, Z390 has 6 PCIe slots, but most of them are pretty slow.
Magnus114@reddit
As far as I know, x4 is perfectly fine for single-GPU inference. But if you split a model between 4 GPUs, don't you lose a substantial amount of performance due to low bandwidth? I have a single card and am considering getting 2 more.
eat_those_lemons@reddit (OP)
I'm not an expert by any means, but my understanding is that only the activations are sent between the GPUs during inference, not all the weights, and that's not much data.
CheatCodesOfLife@reddit
Prompt processing will be slower for sure if you're using tensor parallel, e.g.
-tp 4
in vLLM, or enabling it with exllamav2/exllamav3. Test it yourself if you want: send a 20k-token prompt and monitor the transfers with nvtop.
Text generation is mostly unaffected.
bullerwins@reddit
Server motherboard with a mining rig and PCIe risers