best approach for Strix Halo distributed inference in llama.cpp?

Posted by blbd@reddit | LocalLLaMA | View on Reddit | 20 comments

I was curious to understand what people are doing for this use case to get the best trade-off of convenience and performance.

Private backhaul on the 10GbE? USB4? Something else?

I see conflicting information on whether parallelism is per-layer or if there's a way to do a smarter form of parallelism that can drive 100% CPU / GPU utilization across nodes.

Is it better to use it to run bigger models that are smarter and need more unified RAM so they think better? Or better to take smaller ones and try to make them faster for more token speed?

[-]

Look_0ver_There@reddit

I have 2 Strix Halo's connected via a USB4 cable.

After tweaking some kernel parameters I managed to get the latency down from ~65us to just ~8us using just Vulkan. Bandwidth is around 12.7GB/sec

Even if you splurge out on specialized NIC's and run RDMA/RCCL, you'll typically only get down to around 5us latency, and IMO, it's a lot of extra money and hassle for a tiny gain.

For example, I saw a ~10% distributed token generation gain when dropping from 65us to 8us. I can't imagine that dropping from 8 to 5 is going to bring any major wins.

Just my 2c. I'm not at my desk but if you're interested I can drop my kernel changes here to show how I managed that latency reduction with USB4NET

[-]

ProfessionalSpend589@reddit

I had better luck when I tested a 25Gbit card with RDMA support (max latency 3.19 usec). But I removed them to attach GPUs and now I’m back on Ethernet.

That said - I would like to look at your configuration and maybe return to USB 4 if latency is better then my Ethernet connection.

[-]

Look_0ver_There@reddit

https://www.reddit.com/r/LocalLLaMA/comments/1szn5ij/comment/oj40ztc/

[-]

I've just sold 2nd halo. I tried everything but couldn't drop latency below 25ns at P95 on USB4. Maybe it depends on hardware implementation. Specs: GMKTec Evo X2, ubuntu server 24.04 with kernel 6.19.4, some kernel tweaks and OdinFive driver. Connection via TB5 cable. Driver reports x2 pci lanes on USB4, only TB4 mode. Ring size only 1024.

[-]

RegularRecipe6175@reddit

I got lower latency with two Framework desktops by connecting them to a 10 Gbe switch (yes, they are only 5 Gbe ports). I gave up chasing ways to lower latency using IP over USB4. YMMV.

[-]

Look_0ver_There@reddit

I have GMKTEC Evo-X2 as well for both.

The primary cause is the C-states putting the USB4 ports into micro-sleep states between packets. The big win came from restricting the maximum sleep state with a grub kernel line.

The other big win came from putting the net interfaces into pool mode, for which if you were at 25us, I suspect you'd already done that since that was my first big win to get from 65 down to 23us.

The cstate grub config and ensuring that the cable connected to the back USB4 ports got me the rest of the way. I found that the front USB4 ports were a little slower on latency, but this wasn't always consistent. I think the signal strength must be a bit noisier when using the front ports?

I'll post up the grub config and other stuff tomorrow after I wake up. There's also a shell script I run that forces the iGPU into a full power state which boosts PP speed by around 5-10%, although that gain is unrelated to the USB4 latency issue.

[-]

Pretend_Engineer5951@reddit

Very interesting. My front ports perfomed a bit faster than back ones. Even with iperf3 I've got extra speed. If you haven't tried https://github.com/Geramy/OdinLink-Five I wonder how much impact it would do on your setup.

Please share your config when you can.

[-]

Look_0ver_There@reddit

Yeah, it will spike up to 170W. I found that I can't run it like that for extended periods on just the one machine as eventually the machine will turn off and I needed to unplug the power cable in order to recover it. I'd like to stick some PTM on the CPU and see if that fixes it. It works for networked inferencing as the CPU gets to cool down enough.

Strange regarding your port positions. I wonder why it varies?

[-]

Pretend_Engineer5951@reddit

To recover without reboot just set echo to auto.

One of my machines uses graphen pad, another had Coolmoon MX-10 which is theoretically better than PTM pad. Even with graphen pad (> liquid metal) temperatures went up to 90 degrees. At the time the another one showed up 100. So anyway this mode can't be handled with stock cooling system and may damage system.

[-]

FullstackSensei@reddit

Are you sure that's 12GB and not 12Gb?

[-]

Look_0ver_There@reddit

You're correct. It's small b. I fat fingered while sleepy.

[-]

FullstackSensei@reddit

Then there's quite a bit more bandwidth left on the table.

40 or 56Gb FDR Mellanox ConnectX-3 cards can be had for \~$10 on ebay. cables cost about the same. So, you're looking at \~$30 to link both machines. You won't be able to hit the full speed of those links, since the NICs are gen e X8, but that's still \~3x more bandwidth. BTW, those adapters can switch between Infiniband and IP mode via firmware, or you can just run them in IPoIB.

[-]

Look_0ver_There@reddit

IMO, it's the latency, and not the bandwidth that's the main limiting factor here. Watching the TB4 link during inferencing, and it never gets above 26Mb/sec, and most typically sits at 16 to 17Mb/sec when I'm inferencing with MiniMax M2.7 sharded across the two machines.

Sure, the added bandwidth is nice to have, but for the various Strix Halo MiniPCs that don't have a spare PCI slot to plug something into, you can get real close to full inferencing speeds with just a single cable and some tweaks.

I'm not here to tell anyone that a single USB4 cable is the be-all/end-all of solutions, just that it's a remarkably effective solution provided we tweak some kernel parameters.

[-]

FullstackSensei@reddit

If you're on llama.cpp, there's not much benefit to anything above maybe 5gb, because you're stuck with all the inefficiencies of RPC and the IP stack. MoE models also don't have much support for parallelism on llama.cpp. I have a machines with 8 P40s (192GB VRAM). Vanilla llama.cpp gets a bit under 7t/s on Minimax Q4_K_XL. Meanwhile, ik_llama.cpp gets literally double the performance at ~14t/s using -sm graph and enabling p2p. You can clearly see the difference in nvtop, where I can see up to 5GB/s PCIe bandwidth on each of the eight cards.

Might be worth trying to run vllm on the Strix Halo. That can distribute the work properly using Ray.

[-]

Look_0ver_There@reddit

I'm getting 20-22tg/s with MiniMax M2.7 at Q5_K_M on my pair of Strix Halo's with llama.cpp though.

[-]

FullstackSensei@reddit

22t/s at Q5 is ~137GB/s, which is ~60% of real world memory bandwidth for Strix Halo.

While latency does play a role, I suspect the biggest bottleneck here is actually llama.cpp's RPC. Have tried running vllm with any model on a single SH? I really think you can get a lot more performance with vllm and Ray.

[-]

PieBru@reddit

Yes, please.

[-]

Look_0ver_There@reddit

The "big one" is added these two fields to GRUB_CMDLINE_LINUX in /etc/default/grub. For Linux distros that don't use Grub, just modify as per your distro's mechanism:

GRUB_CMDLINE_LINUX="...  processor.max_cstate=2 pcie_aspm=off ..."

Either inside /ec/sysctl.conf, or as a separate file in /etc/systctl.d/ add these net.* lines to put the networking stack into busy_poll mode for enable busy-poll waiting for short bursts of time after packet transfers, and enable low-latency mode for various TCP based operations.

# Approximate time in us to busy loop waiting for packets on device queue
net.core.busy_read = 100
net.core.busy_poll = 100

# Instructs the TCP stack to prefer low latency over throughput
net.ipv4.tcp_low_latency = 1

# Enables TCP Fast Open on both client and server side
net.ipv4.tcp_fastopen = 3

I also have a script that I run that forces the iGPU into high power mode, as well as disabling auto-suspend for Thunderbolt (USB4). The last also forces the CPU into high-power mode in the event that you don't already have this enabled.

#!/bin/bash

# Force all gpus into high-power mode
for card in /sys/class/drm/card*/device/power_dpm_force_performance_level; do
   echo high | sudo tee $card;
done

# Find the PCI device for your Thunderbolt networking
# Then disable autosuspend for it.
for i in /sys/bus/thunderbolt/devices/*/power/control; do echo on | sudo tee $i; done

# Force CPU in high-power governance mode even if system doesn't have it set
echo "performance" | sudo tee /sys/devices/system/cpu/cpu*/cpufreq/energy_performance_preference

Those are the "big three" things, with the iGPU portion of the high-power mode script being optional, as it can result in over-heating, but it will boost your PP speeds by up to 10%.

Also, experiment with moving the Thunderbolt/USB4 cable (ensure that it's a properly rated cable <= 1 meter in length and not some random generic cable) between the USB4 ports on the machines. You may find that one particular pairing of ports works better than another, but this appears to be inconsistent, and I'm not sure if a real thing, or just electrical interference that can mess with speeds.

[-]

PieBru@reddit

Thanks, will try it tomorrow