I tested Strix Halo clustering w/ ~50Gig IB to see if networking is really the bottleneck
Posted by Hungry_Elk_3276@reddit | LocalLLaMA | 102 comments
**TLDR:** While InfiniBand is cool, 10 Gbps Thunderbolt is sufficient for llama.cpp.
Recently I got really fascinated by clustering with Strix Halo to get a potential 200 GB of VRAM without significant costs. I'm currently using a 4x4090 solution for research, but it's very loud and power-hungry (plus it doesn't make much sense for normal 1-2 user inference; this machine is primarily used for batch generation for research purposes). I wanted to look for a low-power but efficient way to run inference on \~230B models at Q4. And here we go.
I always had this question of how exactly networking would affect performance. So I got two modded Mellanox ConnectX-5 Ex 100 Gig NICs, which I have some experience with from NCCL work. These cards are very cool, reasonably priced, and quite capable. However, due to a Strix Halo platform limitation, I only got a PCIe 4.0 x4 link. I was still able to get around 6700 MB/s, or roughly 55 Gbps, between the nodes, which is far better than IP over Thunderbolt (10 Gbps).
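To put those link numbers side by side, here is a minimal sketch of the unit arithmetic. It assumes the 6700 MB/s figure is in decimal megabytes (the post does not say whether it is MB or MiB) and uses the standard PCIe 4.0 figures of 16 GT/s per lane with 128b/130b encoding; the function names are made up for illustration.

```python
# Back-of-the-envelope link arithmetic for the numbers quoted above.
PCIE4_GT_PER_LANE = 16.0    # PCIe 4.0 signalling rate per lane, in GT/s
PCIE_ENCODING = 128 / 130   # 128b/130b line encoding efficiency
LANES = 4                   # the x4 link available on Strix Halo per the post


def mb_per_s_to_gbps(mb_per_s: float) -> float:
    """Convert decimal megabytes per second to gigabits per second."""
    return mb_per_s * 8 / 1000


def mib_per_s_to_gbps(mib_per_s: float) -> float:
    """Same conversion, but treating the input as binary mebibytes per second."""
    return mib_per_s * 1024 * 1024 * 8 / 1e9


def pcie4_link_gbps(lanes: int) -> float:
    """Raw PCIe 4.0 payload rate after line encoding, before protocol overhead."""
    return PCIE4_GT_PER_LANE * PCIE_ENCODING * lanes


measured = mb_per_s_to_gbps(6700)
ceiling = pcie4_link_gbps(LANES)
print(f"measured (decimal MB/s): {measured:.1f} Gbps")
print(f"measured (if MiB/s):     {mib_per_s_to_gbps(6700):.1f} Gbps")
print(f"PCIe 4.0 x{LANES} ceiling:     {ceiling:.1f} Gbps")
print(f"fraction of ceiling:     {measured / ceiling:.0%}")
```

Either reading of the unit lands in the mid-50s Gbps, comfortably below what a x4 link can carry before protocol overhead, so the PCIe slot rather than the 100 Gig NIC is the plausible cap here.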
I tried using vLLM first and quickly found out that RCCL is not supported on Strix Halo. :( Then I tried using llama.cpp RPC mode with the `-c` flag to enable caching, and here are the results I got:
|Test (tokens/s)|Single machine, no RPC|2.5 Gbps|10 Gbps (TB)|50 Gbps (IB)|
|:-|:-|:-|:-|:-|
|**pp512**|653.74|603.00|654.03|663.70|
|**tg128**|49.73|30.98|36.44|35.73|
|**tg512**|47.54|29.13|35.07|34.30|
|**pp512 @ d512**|601.75|554.17|599.76|611.11|
|**tg128 @ d512**|45.81|27.78|33.88|32.67|
|**tg512 @ d512**|44.90|27.14|31.33|32.34|
|**pp512 @ d2048**|519.40|485.93|528.52|537.03|
|**tg128 @ d2048**|41.84|25.34|31.22|30.34|
|**tg512 @ d2048**|41.33|25.01|30.66|30.11|
As you can see, the Thunderbolt connection matches the 50 Gbps MLX5 link on token generation, and is even slightly ahead in most rows. Compared to non-RPC single-node inference, the gap is still quite substantial, about 12 to 14 tokens/s at zero depth, but as the context lengthens the token generation gap gets smaller (roughly 11 tokens/s at d2048). Another strange thing is that prompt processing is slightly faster over RPC at 50 Gbps than on the single machine. That's very interesting to see.
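The gaps are easier to compare when computed directly from the table. The sketch below copies the token generation rows verbatim and prints the single-machine minus RPC difference for the Thunderbolt and 50 Gbps columns; the dictionary and helper names are illustrative only.

```python
# Token generation results in tokens/s, copied from the table above.
# Column order: (single machine, 2.5 Gbps, 10 Gbps Thunderbolt, 50 Gbps IB)
TG = {
    "tg128": (49.73, 30.98, 36.44, 35.73),
    "tg512": (47.54, 29.13, 35.07, 34.30),
    "tg128 @ d512": (45.81, 27.78, 33.88, 32.67),
    "tg512 @ d512": (44.90, 27.14, 31.33, 32.34),
    "tg128 @ d2048": (41.84, 25.34, 31.22, 30.34),
    "tg512 @ d2048": (41.33, 25.01, 30.66, 30.11),
}


def gaps(row):
    """Return (single - Thunderbolt, single - 50 Gbps) in tokens/s."""
    single, _, tb, ib = row
    return single - tb, single - ib


def tb_wins(table):
    """Count rows where Thunderbolt is at least as fast as the 50 Gbps link."""
    return sum(1 for _, _, tb, ib in table.values() if tb >= ib)


for name, row in TG.items():
    gap_tb, gap_ib = gaps(row)
    print(f"{name:14s} single-TB = {gap_tb:5.2f}   single-IB = {gap_ib:5.2f}")

print(f"Thunderbolt >= 50 Gbps in {tb_wins(TG)} of {len(TG)} rows")
```

Since these are single benchmark runs, differences of a token per second or less between the two fast links are probably within run-to-run noise.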
During inference, I observed that network usage rarely went above \~100 Mbps (around 10 MB/s), which suggests the gain does not come from bandwidth. Maybe latency? But I don't have a way to prove what exactly drives the improvement from 2.5 Gbps to 10 Gbps IP over Thunderbolt.
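One rough way to sanity-check the latency hunch is to turn the tg128 throughput numbers into per-token time and compare the extra milliseconds against how long the observed traffic would take to cross each link. This is only a sketch under loud assumptions: it treats the \~10 MB/s observation as applying to every link (the post does not say which link it was measured on), and it assumes the data would move at full line rate. The helper names are hypothetical.

```python
# tg128 at zero depth, tokens/s, from the table above.
SINGLE, ETH_2G5, TB_10G, IB_50G = 49.73, 30.98, 36.44, 35.73

# ASSUMPTION: ~10 MB/s of traffic during generation on every link.
OBSERVED_MB_PER_S = 10.0


def ms_per_token(tokens_per_s: float) -> float:
    """Wall-clock milliseconds spent per generated token."""
    return 1000.0 / tokens_per_s


def rpc_overhead_ms(rpc_tps: float, local_tps: float = SINGLE) -> float:
    """Extra time per token relative to the single-machine run."""
    return ms_per_token(rpc_tps) - ms_per_token(local_tps)


def wire_time_ms(rpc_tps: float, link_gbps: float) -> float:
    """Time to push one token's share of the observed traffic at line rate."""
    bytes_per_token = OBSERVED_MB_PER_S * 1e6 / rpc_tps
    return bytes_per_token * 8 / (link_gbps * 1e9) * 1000.0


for label, tps, gbps in [("2.5 Gbps", ETH_2G5, 2.5),
                         ("10 Gbps TB", TB_10G, 10.0),
                         ("50 Gbps IB", IB_50G, 50.0)]:
    print(f"{label:11s} overhead = {rpc_overhead_ms(tps):5.2f} ms/token, "
          f"wire time = {wire_time_ms(tps, gbps):4.2f} ms/token")
```

Under these assumptions the RPC overhead is about 12 ms per token at 2.5 Gbps versus about 7 ms over Thunderbolt, while raw transfer time accounts for only around 1 ms and 0.2 ms respectively. That is consistent with something other than bandwidth, such as per-round-trip latency, explaining most of the difference, though it does not prove it.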
Here is the llama-bench command I'm using:
`./llama-bench -m ./gpt-oss-120b-mxfp4-00001-of-00003.gguf -d 0,512,2048 -n 128,512 -o md --rpc <IP:PORT>`
So the result is pretty clear: you don't need a fancy IB card to get usable results on llama.cpp with Strix Halo. At least until RCCL supports Strix Halo, I think.