Anyone else struggling with multi-GPU stability when running larger local models?
Posted by Lyceum_Tech@reddit | LocalLLaMA | 21 comments
Been scaling up local LLM clusters and multi-GPU setups are still a pain.
Power throttling, ROCm bugs, and utilization dropping at scale are killing me.
What’s the biggest headache you’re facing with larger local setups right now?
ea_man@reddit
Can anyone recommend short PCIe x4 riser cables (like 20cm or less) on AliExpress for a good price?
Or any advice on what to look for in a *decent* cable for testing a dual setup with just an x4 slot available.
Lyceum_Tech@reddit (OP)
For short risers (20cm or less), look for ones with good shielding. I had issues with cheap ones causing interference. Search for "PCIe 4.0 riser cable shielded" on AliExpress. The ones with metal casing are usually better.
ea_man@reddit
Thanks, I guess the ~$35 ones are better than the $15 ones.
RedAdo2020@reddit
I find the ADT-Link stuff is good. I am running it without problems.
RedAdo2020@reddit
Yes, on my multi-GPU PC I get cudaStreamSynchronize(cuda_ctx->stream()) errors when context hits over ~20k, and it does my head in, but only with CPU offload. Problem is I'm not very technical with this stuff.
Lyceum_Tech@reddit (OP)
Try reducing the offload size, or increase the context window step by step to find where it breaks.
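If it helps, here's a rough sketch of what stepping those knobs looks like with llama.cpp's llama-server; the model path and the exact values are placeholders, not your actual config:

```shell
# Placeholder model path and values. -ngl controls how many layers go to
# the GPU (lower = more CPU offload); -c sets the context size.
llama-server -m model.gguf -ngl 99 -c 16384   # baseline
llama-server -m model.gguf -ngl 48 -c 16384   # smaller GPU offload
llama-server -m model.gguf -ngl 48 -c 24576   # then grow context stepwise
```

The idea is to change one knob at a time so you can tell whether the crash tracks the offload amount or the context length.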
RedAdo2020@reddit
Interesting, it does it on the other PC too.
Used Nvidia driver 595 instead of 590, CUDA Toolkit 12.8 instead of 13.1. Same OS, Mint. But all different GPUs, different motherboard, CPU, RAM. All the hardware is different.
lemondrops9@reddit
I'm running Linux Mint too, Nvidia 590 with CUDA Toolkit 13.1. I haven't had errors when context gets high. Max I've hit is 130k with Qwen3.5 122B.
RedAdo2020@reddit
Yes, it is weird. I was playing with it some more today and made an interesting discovery: if I use Split Mode Graph (ik_llama.cpp) it doesn't bug out. I got to 49k context without issues. Problem with that is I use 5 GPUs, so using SMG with some offloading to CPU, I lose a lot of VRAM to buffers. With a 140GB model I've got all five GPUs (16GB of VRAM each) with over 15GB in use per card, AND 88GB of my 96GB of system RAM in use just running Mint and the model.
lemondrops9@reddit
Good to know, I haven't tried ik_llama.cpp yet. I've been sticking to llama.cpp lately and some LM Studio.
RedAdo2020@reddit
I think I found a solution! GGML_CUDA_DISABLE_GRAPHS=1 seems to be a winner. I've managed to run up to 49k context once, and 64k context once, and now running a second time as I type. Looks like my problem might have been way too many Graphs on the main GPU!
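For anyone else hitting this: setting the variable per-invocation is enough, no need to export it globally. A sketch, assuming the stock llama-server binary (the model path is a placeholder):

```shell
# Disable CUDA graph capture for this run only (placeholder model path).
GGML_CUDA_DISABLE_GRAPHS=1 llama-server -m model.gguf -c 49152
```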
lemondrops9@reddit
Interesting, so many things to configure.
btw what gpus are you running?
RedAdo2020@reddit
4 x internal 5070Ti running on PCIe lanes. And a 5060 Ti 16GB running on Thunderbolt 4.
RedAdo2020@reddit
Can't offload less, the GPUs' memory is full. And I really don't want to go below IQ2 or IQ3 quants.
I'm in the middle of building a PC with my older gear, using the same OS but CUDA 12.8 instead of 13.1, plus a few other llama.cpp build-parameter changes, to see if anything helps and whether it might be a hardware problem with the current build.
Shipworms@reddit
No - but I have been using old mining hardware; also, a warning about server power supplies 😳
Server PSU warning first: breakout boards. The fancy ones with a proper ATX power connector on them have a high failure rate; often they just stop working. Looking at the ATX specs: ATX PSUs need to analyse the voltage outputs, then tell the motherboard the voltage rails have stabilised. Only then does the motherboard accept the 12V, 5V, and 3.3V rails.
ATX PSUs also need to tell the motherboard *before it happens* if any of the power rails are about to go out of spec. The PSU needs to monitor its internal hardware and remove the 'voltage rails are safe' signal if it's about to fail, so the motherboard can disconnect the rails before it gets fried!
Server PSUs only output 12 volts. In short: I don't trust breakout boards to be safe for the motherboard, and they may be dangerous, especially if they fail. I doubt the breakout boards have all the required safety monitoring devices…
That said, a very basic 12V-only breakout board did work, but the fancy ones? I won't go near them… especially as they only had 30-day warranties, and they are all rather old now!
Using a no-name 8-slot riserless motherboard, Intel Arc Pro B50s ran fine (as did a Radeon Pro W6600). Used the 12v only breakout board too!
Using a new ASRock H510 Pro BTC+ 6-slot riserless board, the Arc Pro B50s also run fine, and I can also use 5060 Ti cards. Am using a decent ATX PSU. No instability either. Not the fastest PCIe slots, but rock solid so far!
One thing to try would be llama.cpp compiled with Vulkan support; it can run mixed setups (I have had ATI, Nvidia, and Intel Arc Pro cards all running on one board with this). It could be a way to rule out most hardware issues (such as riser card signal quality), at least for initial troubleshooting?
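For reference, a Vulkan build of llama.cpp is just a cmake flag away. A sketch, assuming a current tree where the option is named GGML_VULKAN (older trees used LLAMA_VULKAN) and the Vulkan SDK is installed:

```shell
# Configure with the Vulkan backend, then build.
# Requires the Vulkan SDK/headers to be present on the system.
cmake -B build -DGGML_VULKAN=ON
cmake --build build --config Release -j
```

The resulting binaries will enumerate any Vulkan-capable GPU, which is what makes the mixed AMD/Nvidia/Intel setups possible.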
lemondrops9@reddit
Running 8 GPUs over 2 PCs now. It's crazy how well it works, so I guess my biggest problem is affording more cards.
CatalyticDragon@reddit
No such issues on 2xR9700 setup but admittedly that's not a very complicated setup.
see_spot_ruminate@reddit
None, it has just been working on my quad 5060ti setup.
braydon125@reddit
That's why we use nvidia lil bro!
Nepherpitu@reddit
Interference on long cheap riser cables, and delivery times for new ones.
Lyceum_Tech@reddit (OP)
Yes, the long riser cables have been giving us interference problems too. Shorter ones helped a lot. Delivery times suck though.