To what degree do PCIe lanes (x16 vs x4 or x1) matter in a multi-GPU setup for running LLMs?
Posted by fabkosta@reddit | LocalLLaMA | View on Reddit | 21 comments
Many mainboards offering multi-GPU setups only provide one primary PCIe slot with full x16 bandwidth, whereas the others run at e.g. x4 or often only x1. Let's assume I had one Nvidia RTX 3090 at x16 and 3 others at x1: how does this realistically impact the processing speed of an LLM vs having all four at x16? Does anyone have real-life experience?
Former-Tangerine-723@reddit
If it's loaded fully in VRAM, without spilling to RAM, then you're OK. If you're offloading and you have, for example, a GPU in a PCIe 3.0 x1 slot, then your prompt processing will be very, very bad. Just my experience.
Particular-Way7271@reddit
Same here.
Slow_Letterhead3830@reddit
The wider the link, the faster you can move data in. For actual inference you're not moving much at all, so I don't think it matters there. For loading you need to move the weights in, so an x1 link would be a major bottleneck for load times, but it wouldn't impact inference speeds.
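A back-of-the-envelope sketch of the load-time side of this. The numbers are illustrative assumptions, not measurements: PCIe 3.0 carries roughly 0.985 GB/s of payload per lane (8 GT/s with 128b/130b encoding), and real-world throughput is usually somewhat lower.

```python
PCIE3_GBPS_PER_LANE = 0.985  # approximate usable GB/s per PCIe 3.0 lane


def load_time_seconds(model_gb: float, lanes: int) -> float:
    """Seconds to copy model weights over a PCIe 3.0 link with `lanes` lanes."""
    return model_gb / (PCIE3_GBPS_PER_LANE * lanes)


model_gb = 20.0  # e.g. a ~20 GB quantized model shard going to one card
for lanes in (16, 4, 1):
    print(f"x{lanes}: {load_time_seconds(model_gb, lanes):.1f} s")
```

So x16 loads that shard in a second or two, while x1 takes on the order of twenty seconds — annoying at startup, but irrelevant once the weights are resident in VRAM.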
Particular-Way7271@reddit
Prompt processing is affected, yes, if the model doesn't fit entirely in VRAM.
Upstairs_Tie_7855@reddit
In llama.cpp, if you use tensor parallelism, it matters A LOT.
fabkosta@reddit (OP)
Like, how much? Can you somehow quantify that to give a rough indicator?
woahdudee2a@reddit
With llama.cpp it doesn't matter; with vLLM it matters a lot.
fabkosta@reddit (OP)
Oh, that’s interesting!
Aggressive-Bother470@reddit
...and llama.cpp is still faster for a single user, somehow.
Maleficent-Koalabeer@reddit
I use PCIe 4.0 x4 for inference with llama.cpp. I tested the difference against x8 and x16 but couldn't see any difference in llama-bench, so x4 works fine for inference.
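For anyone wanting to reproduce this kind of comparison on their own hardware, a minimal llama-bench invocation might look like the following (assuming llama.cpp built with GPU support; the model path is a placeholder):

```shell
# Measure prompt processing (pp) and token generation (tg) throughput with
# all layers offloaded to the GPU. Re-run after moving the card to a slot
# with a different lane width to compare x16 vs x4 vs x1.
./llama-bench -m ./model.gguf -ngl 99 -p 512 -n 128
```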
zandzpider@reddit
Ran vLLM against my two 3090s. They are both in PCIe gen 3 x8. It's literally faster to run Qwen 30B A3B on one card than doing tensor parallelism. Running both in single mode I get full speed on both.
zipperlein@reddit
Did you use --enable-expert-parallel?
zandzpider@reddit
I did not. Maybe I should have done that. Thoughts?
zipperlein@reddit
Afaik, it reduces PCIe usage because it distributes the experts across the GPUs instead of splitting them.
zandzpider@reddit
Thanks! Good idea and sounds like a solution. I'll try it out
zandzpider@reddit
Just FYI, I got around half speed, so a 50% downgrade. Time to get a new motherboard, I guess.
egnegn1@reddit
For the comparison, did you use a model that fits into the VRAM of a card and then compare it with half of the layers distributed across both cards?
zandzpider@reddit
I guess? Sorry, I didn't write down the config, but yes, the model easily fits on one card, and I did a pretty stock tensor parallelism of 2 in vLLM. I'll test it again when I get a new motherboard with more PCIe lanes.
mr_Owner@reddit
It's a mixed bag of solutions.
I think motherboard bandwidth capabilities are key for multi-GPU if RAM is going to be involved too.
a_beautiful_rhind@reddit
It takes a bite out of your tensor-parallel inference. At x1, TP might be slower than not using it at all.
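A rough sketch of why tensor parallelism stresses the link where plain layer-split inference doesn't. The figures below are illustrative assumptions, not a performance model: per transformer layer, TP typically performs about two all-reduces over the activations, each moving on the order of `hidden_size` values per token at batch size 1.

```python
PCIE3_GBPS_PER_LANE = 0.985  # approximate usable GB/s per PCIe 3.0 lane


def comm_ms_per_token(hidden: int, layers: int, lanes: int,
                      bytes_per_val: int = 2) -> float:
    """Milliseconds per token spent moving TP all-reduce traffic (fp16)."""
    bytes_per_token = layers * 2 * hidden * bytes_per_val  # ~2 all-reduces/layer
    return bytes_per_token / (PCIE3_GBPS_PER_LANE * lanes * 1e9) * 1e3


# A 70B-class model: hidden size ~8192, ~80 layers
for lanes in (16, 1):
    print(f"x{lanes}: ~{comm_ms_per_token(8192, 80, lanes):.2f} ms/token")
```

Even this simple bandwidth math understates the pain: TP issues many small transfers per layer, so link latency and synchronization overhead dominate at x1, which is how TP can end up slower than no parallelism at all.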
BoeJonDaker@reddit
I have a 5060 Ti 16GB and a 4060 Ti 16GB on my motherboard, and a 3060 12GB connected by an x1 riser. My load times are slower because of the x1 link, but after that initial load, my prompts mostly run at normal speed.