To what degree do PCIe lanes (x16 vs x4 or x1) matter in a multi-GPU setup for running LLMs?
Posted by fabkosta@reddit | LocalLLaMA | View on Reddit | 21 comments
Many mainboards offering multi-GPU setups only provide one primary PCIe slot with full x16 bandwidth, whereas the others run at e.g. x4 or often only x1. Let's assume I had one Nvidia RTX 3090 at x16 and 3 others at x1: how does this realistically impact the processing speed of an LLM vs having all four at x16? Does anyone have real-life experience?
Former-Tangerine-723@reddit
If it's loaded fully in VRAM, without spilling to RAM, then you're OK. If you're offloading and you have, for example, a GPU in a PCIe 3.0 x1 slot, then your prompt processing will be very, very bad. Just my experience.
Particular-Way7271@reddit
Same here.
Slow_Letterhead3830@reddit
The wider the link, the faster you can move data in. For actual inference you're not moving much at all, so I don't think it matters there. For loading you need to move the weights in, so an x1 link would be a major bottleneck for load times, but it wouldn't impact inference speeds.
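A back-of-the-envelope sketch of the load-time side of this. The numbers are illustrative assumptions, not measurements: PCIe 3.0 carries roughly 0.985 GB/s of payload per lane (8 GT/s with 128b/130b encoding), and real-world throughput is usually somewhat lower.

```python
PCIE3_GBPS_PER_LANE = 0.985  # approximate usable GB/s per PCIe 3.0 lane


def load_time_seconds(model_gb: float, lanes: int) -> float:
    """Seconds to copy model weights over a PCIe 3.0 link with `lanes` lanes."""
    return model_gb / (PCIE3_GBPS_PER_LANE * lanes)


model_gb = 20.0  # e.g. a ~20 GB quantized model shard going to one card
for lanes in (16, 4, 1):
    print(f"x{lanes}: {load_time_seconds(model_gb, lanes):.1f} s")
```

So x16 loads that shard in a second or two, while x1 takes on the order of twenty seconds — annoying at startup, but irrelevant once the weights are resident in VRAM.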
Particular-Way7271@reddit
Prompt processing is affected, yes, if the model doesn't fit entirely in VRAM.
Upstairs_Tie_7855@reddit
In llama.cpp, if you use tensor parallelism, it matters A LOT.
fabkosta@reddit (OP)
Like, how much? Can you somehow quantify that to give a rough indicator?
woahdudee2a@reddit
With llama.cpp it doesn't matter; with vLLM it matters a lot.
fabkosta@reddit (OP)
Oh, that’s interesting!
Aggressive-Bother470@reddit
...and llama.cpp is still faster for a single user, somehow.
Maleficent-Koalabeer@reddit
I use PCIe 4.0 x4 for inference with llama.cpp. I tested the difference against x8 and x16 but couldn't see any difference in llama-bench, so x4 works fine for inference.
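For anyone wanting to reproduce this kind of comparison on their own hardware, a minimal llama-bench invocation might look like the following (assuming llama.cpp built with GPU support; the model path is a placeholder):

```shell
# Measure prompt processing (pp) and token generation (tg) throughput with
# all layers offloaded to the GPU. Re-run after moving the card to a slot
# with a different lane width to compare x16 vs x4 vs x1.
./llama-bench -m ./model.gguf -ngl 99 -p 512 -n 128
```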
zandzpider@reddit
Ran vLLM against my two 3090s. They are both in PCIe gen 3 x8. It's literally faster to run Qwen 30B A3B on one card than doing tensor parallelism. Running both in single mode I get full speed on both.
zipperlein@reddit
Did you use --enable-expert-parallel?
zandzpider@reddit
I did not. Maybe I should have done that. Thoughts?
zipperlein@reddit
Afaik, it reduces PCIe usage because it distributes the experts across the GPUs instead of splitting them.
zandzpider@reddit
Thanks! Good idea and sounds like a solution. I'll try it out
zandzpider@reddit
Just FYI, I got around half speed, so a 50% downgrade. Time to get a new motherboard, I guess.
egnegn1@reddit
For the comparison, did you use a model that fits into the VRAM of a card and then compare it with half of the layers distributed across both cards?
zandzpider@reddit
I guess? Sorry, I didn't write down the config, but yes, the model easily fits on one card, and I did a pretty stock tensor parallelism of 2 in vLLM. I'll test it again when I get a new motherboard with more PCIe lanes.
mr_Owner@reddit
It's a mixed bag of solutions.
I think motherboard bandwidth capabilities are key for multi-GPU if RAM is going to be involved too.
a_beautiful_rhind@reddit
It takes a bite out of your tensor-parallel inference. At x1, TP might be slower than not using it at all.
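A rough sketch of why tensor parallelism stresses the link where plain layer-split inference doesn't. The figures below are illustrative assumptions, not a performance model: per transformer layer, TP typically performs about two all-reduces over the activations, each moving on the order of `hidden_size` values per token at batch size 1.

```python
PCIE3_GBPS_PER_LANE = 0.985  # approximate usable GB/s per PCIe 3.0 lane


def comm_ms_per_token(hidden: int, layers: int, lanes: int,
                      bytes_per_val: int = 2) -> float:
    """Milliseconds per token spent moving TP all-reduce traffic (fp16)."""
    bytes_per_token = layers * 2 * hidden * bytes_per_val  # ~2 all-reduces/layer
    return bytes_per_token / (PCIE3_GBPS_PER_LANE * lanes * 1e9) * 1e3


# A 70B-class model: hidden size ~8192, ~80 layers
for lanes in (16, 1):
    print(f"x{lanes}: ~{comm_ms_per_token(8192, 80, lanes):.2f} ms/token")
```

Even this simple bandwidth math understates the pain: TP issues many small transfers per layer, so link latency and synchronization overhead dominate at x1, which is how TP can end up slower than no parallelism at all.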
BoeJonDaker@reddit
I have a 5060 Ti 16GB and a 4060 Ti 16GB on my motherboard, and a 3060 12GB connected by an x1 riser. My load times are slower because of the x1 link, but after that initial load, my prompts mostly run at normal speed.