Multi-GPU? Check your PCI-E lanes! x570, Doubled my prompt proc. speed by switching 'primary' devices, on an asymmetrical x16 / x4 lane setup.
Posted by overand@reddit | LocalLLaMA | View on Reddit | 25 comments
Short version - in my situation, adding export CUDA_VISIBLE_DEVICES="1,0" to my llama.cpp launch script doubled prompt processing speed for me in some situations.
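In case it helps, the relevant part of my launch script looks roughly like this (the model path and the rest of the flags below are just placeholders, not my exact settings):

#!/bin/bash
# Put the GPU on the x16 slot first, so llama.cpp treats it as the primary (CUDA0) device
export CUDA_VISIBLE_DEVICES="1,0"
llama-server -m /path/to/model.gguf -ngl 99 --host 0.0.0.0 --port 8080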
Folks, I've been running a dual 3090 setup on a system that splits the PCI-E lanes 16x / 4x between the two "x16" slots (common on x570 boards, I believe). For whatever reason, by default, at least in my setup (Ubuntu-Server 24.04 Nvidia 580.126.20 drivers, x570 board), the CUDA0 device is the one on the 4-lane PCI express slot.
I added this line to my run-llama.cpp.sh script, and my prompt processing speed - at least for MoE models - has doubled. Don't do this unless you're similarly split up asymmetrically in terms of PCI-E lanes, or GPU performance order. Check your lanes with either nvtop or the more verbose lspci options to see the link speeds.
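If you want to sanity-check which card is on which link, something like this should work (the bus ID below is just an example; grab the real one from the first command):

# Currently negotiated PCIe generation and lane width per GPU
nvidia-smi --query-gpu=index,name,pci.bus_id,pcie.link.gen.current,pcie.link.width.current --format=csv
# Or the verbose lspci route for one specific GPU
sudo lspci -vv -s 01:00.0 | grep LnkSta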
For oversized MoE models, I've jumped from PP of 70 t/s to 140 t/s, and I'm thrilled. Had to share the love.
This is irrelevant if your system does an x8/x8 split, but relevant if you have two different lane counts or two different GPUs. It may not matter as much with something like ik_llama.cpp (which splits work between GPUs differently) or vLLM - I haven't tested those - but at least with current stock llama.cpp it makes a big difference for me!
I'm thrilled to see this free performance boost.
How did I discover this? I was watching nvtop recently, and noticed that during prompt processing, the majority of work was happening on GPU0 / CUDA0 - and I remembered that it's only using 4 lanes. I expected a modest change in performance, but doubling PP t/s was so unexpected that I've had to test it several times to make sure I'm not nuts, and have compared it against older benchmarks, and current benchmarks with and without the swap. Dang!
I'll try to update in a bit to note if there's as much of a difference on non-oversized models - I'll guess there's a marginal improvement in those circumstances. But, I bet I'm far from the only person here with a DDR4 x570 system and two GPUs - so I hope I can make someone else's day better!
Ummite69@reddit
This is my setup: a 5090 on PCIe 5.0 x16 plus a 3090 on a TB5 eGPU (so I think PCIe 4.0 x4 speed). I may not have the best setup, but it's pretty good. I think the option you're looking for is --main-gpu:
llama-server.exe --no-mmap -m "W:\text-generation-webui\user_data\models\Qwen3.5-27B-UD-Q8_K_XL.gguf" --alias "Qwen3.5-27B-UD-Q8_K_XL" --cache-type-k q8_0 --cache-type-v q8_0 --main-gpu 0 --split-mode layer --flash-attn on --batch-size 1024 --ubatch-size 512 --cache-ram 160000 --port 11434 --prio 3 --tensor-split 32,20 --kv-unified --parallel 3 -c 500000 -ngl 99 --host 0.0.0.0 --metrics --cont-batching --no-warmup --mmproj "W:\text-generation-webui\user_data\models\Qwen3.5-27B-GGUF-mmproj-BF16.gguf" --no-mmproj-offload --temp 0.65 --min-p 0.05 --top-k 30 --top-p 0.93 --defrag-thold 0.1
grumd@reddit
--cache-ram 160000 - lol wtf? 160GB of RAM prompt cache? Why?
grunt_monkey_@reddit
I thought it was almost mandatory to run ctk and ctv at bf16 for Qwen 3.5 - is this no longer a thing?
overand@reddit (OP)
Interestingly, my attempt with --main-gpu (or the equivalent in a --models-preset setup) didn't actually change the behavior when processing the prompt, but that may have been either a bug or operator error. It does seem like that's the right way to do it, though! (It just didn't actually work for me.) If you're using it, double check that it's doing what you'd expect, versus trying the environment-variable option, just to be on the safe side!
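For anyone comparing the two, a rough sketch of what I mean (model path and layer count are placeholders):

# Reordering via the environment variable - this is what actually worked for me
CUDA_VISIBLE_DEVICES="1,0" llama-server -m model.gguf -ngl 99
# Versus the built-in flag, which didn't change prompt-processing placement in my case
llama-server -m model.gguf -ngl 99 --main-gpu 1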
Nindaleth@reddit
The equivalent environment variables for Vulkan and ROCm should be GGML_VK_VISIBLE_DEVICES and HIP_VISIBLE_DEVICES, respectively.
CMDR_Mal_Reynolds@reddit
Makes sense: the x4 slot goes through the chipset, while the x16 slot connects directly to the CPU. Token generation likely cares less, even with layers split over the cards, since it needs less bandwidth.
Lemonzest2012@reddit
Thanks for this, my Gigabyte B550 Gaming X v2 does this also, but worse, x16/x2 lol. Will try some of the solutions in this thread, as my slower card seems favoured!
overand@reddit (OP)
I'd be tempted to get a splitter for that 16x lane slot! (But I'd want to make sure people have tested it beforehand - and, that turns into a real mounting headache too)
overand@reddit (OP)
And, if you use "nvtop" you'll see which card is which (and you can watch which is getting used most heavily).
Are you on Linux, or Windows?
Lemonzest2012@reddit
Linux of course :D
bitcoinbookmarks@reddit
This is a problem in llama.cpp that needs more attention. llama.cpp by default splits the model across all GPUs instead of fitting it by groups. See https://github.com/ggml-org/llama.cpp/pull/19608 and an older explanation: https://github.com/ggml-org/llama.cpp/issues/19607#issuecomment-4067855245
Ummite69@reddit
Isn't that --tensor-split 32,20 --kv-unified, with --main-gpu 0? My understanding was that with --kv-unified the KV cache lives only on the main GPU, while the model layers are spread across both GPUs in the --tensor-split proportion. But I may be completely mistaken.
bitcoinbookmarks@reddit
Complex, but thanks for the idea. I will try it. Things are already complicated because I'm already using CUDA_VISIBLE_DEVICES to hide some GPUs from llama.cpp so layers aren't split across them. Putting the KV cache on the main GPU is one thing, but the other problem is grouping layers and preventing them from being split across all available GPUs.
MelodicRecognition7@reddit
Make sure to export the CUDA_DEVICE_ORDER=PCI_BUS_ID environment variable, otherwise the ID numbers could be different from what you see in nvidia-smi.
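Roughly, that means the top of the launch script becomes something like this (the model path is a placeholder):

# Enumerate CUDA devices in PCI bus order, so the IDs match nvidia-smi
export CUDA_DEVICE_ORDER=PCI_BUS_ID
# Now "1,0" refers to the same devices nvidia-smi lists as 1 and 0
export CUDA_VISIBLE_DEVICES="1,0"
llama-server -m /path/to/model.gguf -ngl 99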
General_Arrival_9176@reddit
This is the kind of post that saves someone hours of frustration. I had no idea the CUDA_VISIBLE_DEVICES order could differ from the lspci order on asymmetric lane setups. Worth noting for anyone with x570 - those second M.2 slots often share lanes with the x4 slot, so it's not just GPU-to-GPU bandwidth that gets affected.
Business-Weekend-537@reddit
This might be a dumb question and possibly should be its own post but does anyone here know if llama.cpp supports multi gpu better than ollama? What about better than vllm?
overand@reddit (OP)
My understanding is that vLLM supports multi-GPU better than llama.cpp, but it's a fair bit harder to set up, and more "touchy" (easier to get out of memory errors?)
ik_llama.cpp has some multi-GPU improvements that llama.cpp doesn't have, but overall I prefer llama.cpp, and I find the... interpersonal conflict between the creators to be pretty depressing, given it's literally holding back the progress of AI worldwide.
droptableadventures@reddit
By "better", vLLM allows you to get much more speed gain out of multiple GPUs, but practically, you need 2x, 4x or 8x, they pretty much need to be the same GPU, and it doesn't really do mixed CPU inference well.
llama.cpp supports multiple GPUs, but more GPUs don't really speed things up like they do with vLLM, they just mean you can run a bigger model, or avoid slowdown by having less/none of the model in RAM. But you could run a llama.cpp setup where part of the model's on AMD, part of the model's on NVIDIA, and part is on the CPU.
Yorn2@reddit
If you are smart enough to work around the issues that come with a multi-GPU setup then you should generally move on from ollama to some of the other options that give you a wider scope of tweaking power, IMHO.
jikilan_@reddit
llama.cpp's default settings are friendlier and better suited to consumer boards.
PermanentLiminality@reddit
Llama.cpp has a command line argument where you can tell it which card to use as the primary. It is -mg I believe.
panchovix@reddit
Not OP, but I think that flag is bugged. I just reorder with CUDA_VISIBLE_DEVICES instead and it works.
overand@reddit (OP)
Probably we should submit bug reports, eh? Same experience here.
overand@reddit (OP)
Weirdly enough, I didn't get the expected benefit from this! I'm using a --models-preset ini file, and I set main-gpu = 1 but didn't see any change in terms of which GPU was doing the prompt processing. This may have been operator error - perhaps I'd selected the wrong preset with my client - but I think it's possible this doesn't work very well with the split modes. (It definitely worked when I used it with -sm none to select a single GPU, for running e.g. ComfyUI on one and llama.cpp on the other.)
Marksta@reddit
The device numbers get enumerated by compute performance, so with two 3090s it used some other metric to assign device 0. It's not random, since it usually doesn't swap each boot, so maybe it's based on the port address or whatever. So it makes sense - neither llama.cpp nor the Nvidia drivers care to do any of the leg work here for you.