I tested Strix Halo clustering w/ ~50Gig IB to see if networking is really the bottleneck

Posted by Hungry_Elk_3276@reddit | LocalLLaMA | 102 comments

**TLDR:** While InfiniBand is cool, 10 Gbps Thunderbolt is sufficient for llama.cpp.

Recently I got really fascinated by clustering with Strix Halo to get a potential 200 GB of VRAM without significant costs. I'm currently using a 4x4090 solution for research, but it's very loud and power-hungry (plus it doesn't make much sense for normal 1-2 user inference; this machine is primarily used for batch generation for research purposes). I wanted to look for a low-power but efficient way to run inference on ~230B models at Q4. And here we go.

I always had this question of how exactly networking would affect the performance. So I got two modded Mellanox ConnectX-5 Ex 100 Gig NICs, which I had some experience with on NCCL. These cards are very cool, reasonably priced, and quite capable. However, due to the Strix Halo platform limitation, I only got a PCIe 4.0 x4 link. But I was still able to get around 6700 MB/s, or roughly 55 Gbps, between the nodes, which is far better than using IP over Thunderbolt (10 Gbps).

I tried using vLLM first and quickly found out that RCCL is not supported on Strix Halo. :( Then I tried using llama.cpp RPC mode with the `-c` flag to enable caching, and here are the results I got (llama-bench output, in tokens/s):

|Test Type (t/s)|Single Machine w/o rpc|2.5 Gbps|10 Gbps (TB)|50 Gbps|
|:-|:-|:-|:-|:-|
|**pp512**|653.74|603.00|654.03|663.70|
|**tg128**|49.73|30.98|36.44|35.73|
|**tg512**|47.54|29.13|35.07|34.30|
|**pp512 @ d512**|601.75|554.17|599.76|611.11|
|**tg128 @ d512**|45.81|27.78|33.88|32.67|
|**tg512 @ d512**|44.90|27.14|31.33|32.34|
|**pp512 @ d2048**|519.40|485.93|528.52|537.03|
|**tg128 @ d2048**|41.84|25.34|31.22|30.34|
|**tg512 @ d2048**|41.33|25.01|30.66|30.11|

As you can see, the Thunderbolt connection matches the 50 Gbps MLX5 on token generation, and is slightly ahead in five of the six tg rows.
Compared to the non-RPC single-node inference, the performance difference is still quite substantial, with about a 15 token/s difference, but as the context lengthens, the text generation difference somehow gets smaller and smaller.

Another strange thing is that somehow the prompt processing is better on RPC over 50 Gbps, even better than the single machine. That's very interesting to see.

During inference, I observed that the network was never used at more than maybe ~100 Mbps, or roughly 10 MB/s, most of the time, suggesting the gain might not come from bandwidth. Maybe latency? But I don't have a way to prove what exactly is affecting the performance gain from 2.5 Gbps to 10 Gbps IP over Thunderbolt.

Here is the llama-bench command I'm using:

`./llama-bench -m ./gpt-oss-120b-mxfp4-00001-of-00003.gguf -d 0,512,2048 -n 128,512 -o md --rpc <IP:PORT>`

So the result is pretty clear: you don't need a fancy IB card to get usable results on llama.cpp with Strix Halo. At least until RCCL supports Strix Halo, I think.
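To make the RPC penalty easier to read, here is a minimal Python sketch that expresses each tg128 result at depth 0 as a percentage of the single-machine rate. The numbers are copied from the table in the post; the dictionary and function names are only illustrative.

```python
# tg128 throughput (tokens/s) at depth 0, copied from the table in the post.
TG128 = {
    "single machine": 49.73,
    "2.5 Gbps": 30.98,
    "10 Gbps (TB)": 36.44,
    "50 Gbps": 35.73,
}


def relative_to_baseline(results, baseline="single machine"):
    """Return each result as a percentage of the baseline throughput."""
    base = results[baseline]
    return {name: 100.0 * value / base for name, value in results.items()}


if __name__ == "__main__":
    for name, pct in relative_to_baseline(TG128).items():
        print(f"{name:>15}: {TG128[name]:6.2f} t/s ({pct:5.1f}% of single machine)")
```

The same function can be pointed at any other row of the table by swapping in that row's four values.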

102 Comments

Badger-Purple@reddit

u/Hungry_Elk_3276 I know I'm late to the party, but I am now wondering if an NVMe-slot-to-PCIe adapter would allow a NIC to be installed in a system without a slot... it does provide a passthrough for PCIe via OCuLink, so it should theoretically provide the same passthrough? Then we could network these machines with anything else that has NICs, like the Spark... Using IP over Thunderbolt, the best I get is 100ms latency; I'd love to get microsecond latency.

PreparationLow6188@reddit

The answer is yes. But if the board only provides a PCIe 3.0 x4 NVMe slot, that means 32 Gbps (PCIe 4.0 x4 = 64 Gbps), so it will not fully match a 100G NIC. It is a little weird that your ping for IP over TB4 shows 100ms latency; in our lab it is about 1.1ms, nearly the same as CX6.
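The link-speed figures above can be checked with a little arithmetic. A rough Python sketch, assuming the standard per-lane rates (8 GT/s for PCIe 3.0, 16 GT/s for PCIe 4.0) and 128b/130b encoding, and ignoring packet-level protocol overhead; the helper names are made up for illustration:

```python
# Per-lane transfer rate in GT/s and line-encoding efficiency per PCIe generation.
PCIE_GENERATIONS = {
    3: (8.0, 128 / 130),
    4: (16.0, 128 / 130),
}


def pcie_gbps(generation, lanes):
    """Approximate link bandwidth in Gbps after line encoding only."""
    rate_gt_s, efficiency = PCIE_GENERATIONS[generation]
    return rate_gt_s * efficiency * lanes


def mb_per_s_to_gbps(mb_per_s):
    """Convert MB/s (taken as 10^6 bytes per second) to Gbps."""
    return mb_per_s * 8 / 1000


if __name__ == "__main__":
    print(f"PCIe 3.0 x4: {pcie_gbps(3, 4):.1f} Gbps")
    print(f"PCIe 4.0 x4: {pcie_gbps(4, 4):.1f} Gbps")
    # The OP's measured 6700 MB/s between nodes, for comparison.
    print(f"6700 MB/s:   {mb_per_s_to_gbps(6700):.1f} Gbps")
```

Under these assumptions the OP's measured throughput sits below the PCIe 4.0 x4 ceiling, which is consistent with the x4 link (not the 100G NIC) being the limit.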

jc2375@reddit

The NVMe slot gives 4x4 (PCIe 4.0 x4); it does not need to match the NIC fully, as he mentioned. I'm using a dual 25GbE CX5-EX over OCuLink and it's working fine. TB4 is also limited on many STXH platforms, such as the Bosgame M5, as the board layout is not maximized and the USB4 ports are from the CPU controller. In fact, I can't account for all the PCIe lanes being used; the Minisforum is somehow able to deliver 2 USB4v2 ports and dual 10GbE Ethernet, all of which require PCIe bandwidth much larger than the 3 lanes that it loses on the second NVMe slot (the Minisforum's second M.2 slot is 4x1, not 4x4). Maybe someone can enlighten me on how the Bosgame is using the extra lanes, because the shitty USB 3.2 ports don't seem like justification enough.

Badger-Purple@reddit

It's 100 microseconds, not milliseconds. Amazing that you use Ethernet and get microsecond-level / IB-level latency. It is always 10x going via Ethernet, even with SFP.

Hungry_Elk_3276@reddit (OP)

I don't see any difference. AFAIK, this PCIe slot is exactly "from" an NVMe slot, because of the lack of PCIe lanes on 395 devices.

wishstudio@reddit

Could you test the network latency? I believe that's the only thing that matters once you get TP working. To my understanding, data exchange in TP is minimal, but TP will need a few syncs per layer. gpt-oss-120b has 36 layers and typical Ethernet latency is around 250 us, so the latency alone will make it abysmally slow. I heard IB can get latency into the single-digit microsecond range; I'm curious about real-world performance.
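The latency argument can be turned into a back-of-the-envelope estimate. A minimal Python sketch, assuming 2 syncs per layer (the comment only says "a few", so that number is a guess) and counting link latency only, with no compute or transfer time; the function names are illustrative:

```python
def sync_latency_per_token_ms(layers, syncs_per_layer, link_latency_us):
    """Total time per token spent waiting on sync round trips, in milliseconds."""
    return layers * syncs_per_layer * link_latency_us / 1000.0


def latency_bound_tokens_per_s(layers, syncs_per_layer, link_latency_us):
    """Upper bound on tokens/s if sync latency were the only cost."""
    return 1000.0 / sync_latency_per_token_ms(layers, syncs_per_layer, link_latency_us)


if __name__ == "__main__":
    LAYERS = 36           # gpt-oss-120b, as stated in the comment
    SYNCS_PER_LAYER = 2   # assumption, not a measured value
    for label, latency_us in [("Ethernet ~250 us", 250.0), ("IB ~1 us", 1.0)]:
        ms = sync_latency_per_token_ms(LAYERS, SYNCS_PER_LAYER, latency_us)
        print(f"{label}: {ms:.3f} ms of sync latency per token")
```

Changing `SYNCS_PER_LAYER` or the latency figures shows how quickly the per-token overhead scales with either one.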

Hungry_Elk_3276@reddit (OP)

Using `ib_send_lat` and `ib_write_lat` gives me the following results (all values in microseconds):

|Metric|ib_write_lat|ib_send_lat|
|:-|:-|:-|
|Average Latency|1.10|1.08|
|Minimum Latency|1.02|1.07|
|Maximum Latency|3.01|2.34|
|Typical Latency|1.09|1.08|
|Std Deviation|0.00|0.03|
|99th Percentile|1.23|1.24|
|99.9th Percentile|3.01|2.34|

Zyj@reddit

Can you provide the numbers for thunderbolt-net also?

Badger-Purple@reddit

That won't work; TB-net gives you like 90 microseconds at best.

pdrayton@reddit

Nah, thunderbolt-net can be optimized to give you ~17 μs, which is actually a little better than direct-connect Ethernet (~20 μs) and almost 2x faster than switched Ethernet (~31 μs). The key with TB is to pull **all** the optimization levers: massive MTU, IRQ assignment, core parking, CPU governor states, even idle=poll for zero sleep states in between processing data. I ran a [bunch of tests](https://forum.level1techs.com/t/benchmarking-usb4-performance-on-strix-halo/245299) on this as part of a larger test matrix including 100GbE and InfiniBand that I am working on. But the TB-net results were surprisingly good.

Badger-Purple@reddit

I suspect that it will depend on the strix halo “version” or partnered product, since some have the USB/TB ports managed by a controller, others are straight PCIE access. Right?

wishstudio@reddit

Wow that's really impressive. Once you get TP working there should be meaningful speedup.

gnomebodieshome@reddit

Does RPC mode use RDMA? If you are using IB or have RoCE set up, you could try building libvma and using it with `LD_PRELOAD=libvma.so`. I got soft-RoCE working with my experimental test nodes on my old ICX6610 with 10GbE, and saw a speedup of about 7% with a custom splitting of LLM model layers that I vibe coded. With *real* RDMA you should see a significant drop in latency.

Hungry_Elk_3276@reddit (OP)

Wish I had known this sooner; I already spent a bunch of time learning UCX to try to patch llama.cpp. There will be an updated result very, very soon.

Urlilas@reddit

I think with current news about AI Halo release at CES using the same platform there is maybe a possibility that AMD enables some sort of RDMA

gnomebodieshome@reddit

I hope RDMA through Thunderbolt ports becomes a thing. I've emailed [https://www.dolphinics.com](https://www.dolphinics.com) trying to seed the idea to them to make a PCIe switch targeted at small clusters with Thunderbolt that works with their "supersockets" RDMA stack.

GregoryfromtheHood@reddit

That's crazy that you can get that kind of speed over RPC. I've been trying to use RPC to combine my PC with a 5090 with my AI PC that has 2x3090 and 1x4090. After a lot of tweaking, I couldn't get anything near useful performance, and could definitely see that the network bandwidth wasn't the problem. I gave up and bought an eGPU dock and have been pulling the 5090 out of my gaming PC and throwing it on the dock to use it for AI. Looks like I need to look into RPC again, because I am worried about pulling and inserting the GPU so many times, especially the 12VHPWR connector.

Badger-Purple@reddit

From these tests and what I have seen by u/eugr in the NVIDIA Spark forum, It is the latency and not the bandwidth...especially as you get to bigger contexts and more key value cache needs to pass across the layers.

fallingdowndizzyvr@reddit

As expected. I don't find the difference to be substantial between 2.5 to 10 to 50. Sure, it gets a little faster but not nearly as much as the increase in network speed would suggest. Not enough for me to pay several times more for a 10GBE network versus 2.5GBE.

Freonr2@reddit

2.5 to 10 sure looks worth it. ??? There's no real cost difference, some of the 395s have dual 10gbe, some just have 1x2.5gbe. You should be able to set up a direct peer-to-peer network for the cost of a $6 Cat6 patch cable. You don't need a switch, though 10gbe switches are not that expensive these days.

fallingdowndizzyvr@reddit

> There's no real cost difference, some of the 395s have dual 10gbe, some just have 1x2.5gbe.

There is a big cost difference. The 395s with 10gbe cost hundreds more. For example, the cheapest dual 10GBE I know of is the Beelink. That's $2500, compared to $1700 for a 395 with 2.5GBE.

Zyj@reddit

There is no cost difference because you can use Thunderbolt with cheap cables to get 10+GBit/s networking.

fallingdowndizzyvr@reddit

There is a cost difference. What you just said doesn't change the fact that machines with 10GBE cost hundreds more than machines that don't. Also, USB4 daisy-chain networking leads to high latency, which is what's important for inference.

panchovix@reddit

IMO 10Gbps is worth it, but above that, nope.

WesternTall3929@reddit

Looks like someone is going to be forced to post a definitive performance guide to 100G/IB/RoCEv2/NCCL/vLLM operations.

TheOriginalG2@reddit

The bottleneck is most likely latency.

InfraScaler@reddit

Hey, this is great stuff, thanks for sharing and for putting in all the work and effort. Did you measure other stuff, like how busy CPU, disk, RAM and GPU were in every test? The gains could come from offloads to the MLX5, but this is just a wild guess. I am unfamiliar with these tests (I am a newb here), but I know a bit about infra and scaling, hence my curiosity! Does this traffic use TCP? Any chance you could use RDMA instead?

Hungry_Elk_3276@reddit (OP)

Yes, the llama.cpp rpc-server implementation is over TCP, I think. Using RDMA would require changing the current structure of the code base. At the very least we need an abstract transport layer to support kinds of connection other than TCP, and that is missing right now, so there is a lot of work to be done.
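To illustrate what an "abstract transport layer" could look like, here is a small sketch in Python. It is not llama.cpp code (which is C++), and every class and function name here is hypothetical; it only shows the idea of hiding TCP behind an interface so that another backend, such as an RDMA one, could be slotted in later.

```python
import socket
from abc import ABC, abstractmethod


class Transport(ABC):
    """Hypothetical interface the RPC layer would talk to instead of raw sockets."""

    @abstractmethod
    def send(self, payload: bytes) -> None: ...

    @abstractmethod
    def recv(self, size: int) -> bytes: ...


class SocketTransport(Transport):
    """Stream-socket backend; an RDMA backend would be another subclass."""

    def __init__(self, sock: socket.socket) -> None:
        self._sock = sock

    def send(self, payload: bytes) -> None:
        self._sock.sendall(payload)

    def recv(self, size: int) -> bytes:
        # Stream sockets may return short reads, so loop until `size` bytes arrive.
        chunks = []
        remaining = size
        while remaining:
            chunk = self._sock.recv(remaining)
            if not chunk:
                raise ConnectionError("peer closed the connection")
            chunks.append(chunk)
            remaining -= len(chunk)
        return b"".join(chunks)


def send_message(transport: Transport, payload: bytes) -> None:
    """Length-prefixed framing that only depends on the Transport interface."""
    transport.send(len(payload).to_bytes(8, "little") + payload)


def recv_message(transport: Transport) -> bytes:
    size = int.from_bytes(transport.recv(8), "little")
    return transport.recv(size)
```

Because `send_message` and `recv_message` only see the `Transport` interface, swapping the backend would not touch the message-level code, which is the property the comment says is missing today.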

TheAiDran@reddit

or try to write your own proxy TCP/IP over RDMA, but it is not trivial either. Maybe GPT7 will be able to handle this.

Hungry_Elk_3276@reddit (OP)

I think the libvma is similar to what you just said? It is providing speed ups though.

TheAiDran@reddit

Yes, libvma should have at least 2x lower latency than SMC-R, as it fully bypasses the kernel. If for some reason it is significantly higher than RDMA, e.g., > 10 us, I would test something else: `LD_PRELOAD=/usr/lib/libvma.so sockperf ping-pong`

InfraScaler@reddit

Yeah definitely not a trivial change, but should offload a lot of CPU cycles!

perelmanych@reddit

Thanks for the results! Given the amount of money 2x Strix Halo cost, I would go with an M3 Ultra 256GB with the 60-core GPU. [Here](https://x.com/ivanfioravanti/status/1954211534411813289?s=20) you can find results for the more expensive 80-core rig, but going down to 60 cores should affect only pp.

Intrepid_Rub_3566@reddit

Thank you very much u/Hungry_Elk_3276. I recently tried this as well with 5Gbps Ethernet, and then moved to 10Gbps without seeing any improvement (like you, I suspect latency is the real issue, and the 5G and 10G links likely have the same latency; I need to test). Performance is acceptable with MiniMax-M2 at the Q6_K_XL quant: [https://youtu.be/0cIcth224hk](https://youtu.be/0cIcth224hk)

After the video, I applied this PR, which gave me a 5.5% improvement in prompt processing for MiniMax-M2 (I added the benchmarks at the end of the PR comments): [https://github.com/ggml-org/llama.cpp/pull/15405](https://github.com/ggml-org/llama.cpp/pull/15405)

However, looking at the conversation on that PR, it doesn't seem likely to be merged for now, as it requires work and re-architecting.

Hungry_Elk_3276@reddit (OP)

Just saw your video, great stuff! Will check out that branch later.

Kos187@reddit

Why is it 10Gb instead of 40? Did you try nic aggregation?

Hungry_Elk_3276@reddit (OP)

Because the nature of Thunderbolt is 2x 10Gb (1 tx, 1 rx) or 2x 20Gb. There is never a 40Gb mode. And I can't get the 20Gb mode to work either.

bytepursuits@reddit

is rocm still not up to par with vulkan on strix halo? I only ever use vulkan with it: https://llm-tracker.info/_TOORG/Strix-Halo

ScaredProfessor9659@reddit

ROCm is faster on my sh

IAmBobC@reddit

I had been considering 2x DGX Spark (ASUS @ $3K each) just to have the NVLink interconnect. I hadn't considered direct TB connection between 2x 395 systems. Looks like TB DAC networking works on both Win & Lin! Some of my needs would be more easily met with a Zen CPU, so I'm **very** interested to see how this progresses. RemindMe! 7 days

Hungry_Elk_3276@reddit (OP)

I recommend just buying the Spark for mature software support. But buy dual Strix Halo to enjoy the tinkering (and pain lol).

Only_Situation_4713@reddit

llama.cpp doesn't use tensor parallel, so everything is done sequentially. This test was meaningless. You need to test it with TP on vLLM or SGLang.

Hungry_Elk_3276@reddit (OP)

As I state in the post, there is no RCCL support. Without RCCL support, frameworks like vLLM and PyTorch can't perform collective operations (all-reduce, all-gather, etc.) across multiple nodes. This is the fundamental blocker for tensor-parallel inference on Strix Halo—you literally can't split a model across nodes without these primitives. It's always the software support that's lacking on the AMD side. :(
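For readers unfamiliar with the collective operations named above, here is a toy, single-process Python sketch of what all-reduce and all-gather compute. It involves no networking and is not the RCCL or NCCL API; each inner list simply stands in for the data held by one node ("rank").

```python
def all_reduce_sum(per_rank):
    """Every rank ends up with the element-wise sum of all ranks' vectors."""
    total = [sum(values) for values in zip(*per_rank)]
    return [list(total) for _ in per_rank]


def all_gather(per_rank):
    """Every rank ends up with the concatenation of all ranks' shards, in rank order."""
    gathered = [x for shard in per_rank for x in shard]
    return [list(gathered) for _ in per_rank]


if __name__ == "__main__":
    # Two ranks, each holding a partial result of the same shape.
    partials = [[1.0, 2.0], [3.0, 4.0]]
    print("all-reduce:", all_reduce_sum(partials))
    print("all-gather:", all_gather(partials))
```

In tensor-parallel inference these exchanges happen inside every layer, which is why a missing collective library blocks the whole approach rather than just slowing it down.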

starkruzr@reddit

is there a timeline for RCCL support? it sounds like that could make a big difference (at least for dense models too big for a single machine's VRAM window, if I understand you correctly)?

BillDStrong@reddit

I thought RCCL was an NVIDIA CUDA API thing, so vLLM just has to implement the higher-level primitives? AMD would need to make a similar API? I admit to not knowing enough about this.

ElementII5@reddit

https://rocm.docs.amd.com/projects/rccl/en/latest/index.html

BillDStrong@reddit

Thanks.

Rich_Artist_8327@reddit

what about pipeline parallel =2 in vllm?

Hungry_Elk_3276@reddit (OP)

From my testing, it seems that vLLM still somehow requires NCCL/RCCL in order to get pp=2 to work, so it failed to start. Strix Halo platform support in vLLM is pretty much still in its early stages.

Rich_Artist_8327@reddit

it works, just use the latest versions

Hungry_Elk_3276@reddit (OP)

After some quick testing, it still does not work. Can you guide me on how to make it work? I first started Ray on both nodes. Verified they see each other and had 2 GPUs. Set up the NCCL, RCCL with the correct interface and vLLM host IP with the mlx5's IP, then started the qwen3-next. And it failed just like before. I am using the latest master branch with Triton branch 57c693b6 and a nightly build of torch with ROCm 7.0. I have a feeling that RCCL is still not supporting gfx1151. And I tried to use GLOO too; that did not work. I can post the logs, but they are too generic with no useful information I think. It is just NCCL complaining it is crashing.

Rich_Artist_8327@reddit

try gemma3

waiting_for_zban@reddit

> ROCm 7.0.

I know this is finicky, but vLLM had weird bugs with ROCm 7. Can you try with 6.4? Although I do think the main limitation is vLLM. However, this is still an amazing feat!

Hungry_Elk_3276@reddit (OP)

That will be great news! Pulling the source and trying now.

Mastershima@reddit

Now I’m invested.

CapoDoFrango@reddit

got calls?

Sorry_Ad191@reddit

lets go!

DistanceSolar1449@reddit

That’s basically llama.cpp then

LinkSea8324@reddit

When using PP2 you don't get two GPU at 50%, you get two gpus at 100%, unlike llama.cpp

lostdeveloper0sass@reddit

You can create a ticket on AMD ROCM GitHub and they usually answer quickly on it.

koushd@reddit

I believe you can use GLOO instead if NCCL is not available (I assume RCCL is the rocm version).

MoffKalast@reddit

Are you guys just making up four letter abbreviations now

-dysangel-@reddit

YESN

MitsotakiShogun@reddit

Why not distributed-llama?

BananaPeaches3@reddit

--sm row makes it tensor parallel

fallingdowndizzyvr@reddit

> This test was meaningless.

It is not meaningless at all. It's quite meaningful, since network speed is a topic that often comes up. You don't have to be doing TP for it to be of interest.

wishstudio@reddit

It's meaningless because:

1. Pipeline parallelism only helps you run models that you can't fit in a single node. It can't be faster than the single slowest node. So there is no sense testing it for performance, unless you want to test for performance bugs in the implementation.
2. With pipeline parallelism, the network transfer between nodes is minimal. Each token only has 2880 elements of embedding. Even if you use a 100Mbps network, it's only like 1ms per token.

So what are you trying to test?
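The "about 1ms per token" figure can be reproduced with simple arithmetic. A minimal Python sketch, assuming one 2880-element activation vector crosses the link per token and counting wire time only (no latency or protocol overhead); the element width is an assumption, so both 4-byte and 2-byte cases are shown:

```python
def transfer_time_ms(elements, bytes_per_element, link_mbps):
    """Wire time in milliseconds for one activation vector at the given link rate."""
    bits = elements * bytes_per_element * 8
    return bits / (link_mbps * 1e6) * 1000.0


if __name__ == "__main__":
    HIDDEN_SIZE = 2880  # elements per token, as stated in the comment
    for width, label in [(4, "32-bit"), (2, "16-bit")]:
        ms = transfer_time_ms(HIDDEN_SIZE, width, link_mbps=100)
        print(f"{label} elements over 100 Mbps: {ms:.4f} ms per token")
```

With 4-byte elements the result lands just under 1 ms, which matches the comment's estimate; 2-byte elements would halve it.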

ggerganov@reddit

> Pipeline parallelism only helps you run models that you can't fit in a single node.

This is not true: pipeline parallelism increases prompt processing (PP) performance nearly linearly with the number of devices [0]. There are many use cases in which PP speed is more important than TG speed. Atm, the RPC backend of llama.cpp specifically does not support pipeline parallelism, but it's something that can be added relatively easily if there is interest.

[0] [https://github.com/ggml-org/llama.cpp/pull/6017#issuecomment-1994819627](https://github.com/ggml-org/llama.cpp/pull/6017#issuecomment-1994819627)
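A simple timing model shows why pipelining can approach linear scaling for prompt processing. This is an idealized Python sketch, not a description of llama.cpp's scheduler: it assumes the prompt is split into equal micro-batches, the model is split into equally fast stages, and communication is free. All names are illustrative.

```python
def sequential_time(micro_batches, full_model_time):
    """One device runs the whole model on each micro-batch in turn."""
    return micro_batches * full_model_time


def pipelined_time(micro_batches, stages, full_model_time):
    """Each stage holds 1/stages of the model; micro-batches overlap across stages."""
    stage_time = full_model_time / stages
    # The first micro-batch needs `stages` steps; each later one adds one step.
    return (micro_batches + stages - 1) * stage_time


def speedup(micro_batches, stages, full_model_time=1.0):
    return sequential_time(micro_batches, full_model_time) / pipelined_time(
        micro_batches, stages, full_model_time
    )


if __name__ == "__main__":
    for m in (1, 4, 64):
        print(f"{m:>3} micro-batches on 2 stages: {speedup(m, 2):.2f}x")
```

Under these assumptions a single micro-batch gains nothing, which mirrors token generation, while many micro-batches approach a speedup equal to the number of stages.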

Sorry_Ad191@reddit

interested :)

wishstudio@reddit

But if you can fit the entire model in every single node, like in the OP case, why not simply load the full model in every single node and run them independently without all the hassles? Sure you can save memory for kv cache, etc. But the overall throughput won't be better.

fallingdowndizzyvr@reddit

+1 for interest.

fallingdowndizzyvr@reddit

> It can't be faster than the single slowest node.

That's not true. Sure, you have to wait for every node to finish, but it doesn't have to be the speed of the single slowest node, since the faster nodes will pull up the overall speed of the entire cluster. Now what can factor into the speed of the entire cluster is network speed. Speaking of which....

> So what are you trying to test?

Latency. It's not the bandwidth that's the issue. I've already gone on and on and on in this sub about how the amount of data transferred is KB, not GB or even MB. But the time it takes to transfer that little bit of data matters, since everything is waiting for that little bit of data to show up. Which makes latency important. And with current networking, latency is related to bandwidth. So that's what's being tested. Since it matters.
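One way to picture the latency argument is to split the time for a small message into a fixed latency term and a size-dependent wire time. The Python sketch below uses an 8 KB payload and a 100 us fixed latency purely as placeholder assumptions (neither is a measurement from this thread), and holds latency constant across links only to isolate the bandwidth term.

```python
def serialization_us(payload_bytes, link_gbps):
    """Time to put the payload on the wire, in microseconds."""
    return payload_bytes * 8 / (link_gbps * 1000.0)


def latency_share(payload_bytes, link_gbps, latency_us):
    """Fraction of the total message time that is fixed latency."""
    total = latency_us + serialization_us(payload_bytes, link_gbps)
    return latency_us / total


if __name__ == "__main__":
    PAYLOAD_BYTES = 8192  # assumed "KB-sized" message
    LATENCY_US = 100.0    # assumed fixed per-message latency
    for gbps in (2.5, 10.0, 50.0):
        wire = serialization_us(PAYLOAD_BYTES, gbps)
        share = latency_share(PAYLOAD_BYTES, gbps, LATENCY_US)
        print(f"{gbps:>4} Gbps: wire {wire:6.2f} us, latency share {share:.0%}")
```

With these placeholder numbers the wire time shrinks noticeably from 2.5 to 10 Gbps and barely at all beyond that, while the fixed latency dominates throughout.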

wishstudio@reddit

> That's not true. Sure you have to wait for every node to finish, but it doesn't have to be the speed of the single slowest node. Since the faster nodes will pull up the overall speed of the entire cluster. Now what can factor into the speed of the entire cluster is network speed. Speaking of which....

You are right. I just want to point out that OP's testing scenario does not make sense because the model can already fit in a single node.

> Latency. It's not the bandwidth that's the issue. I've already gone on and on and on in this sub about how the amount of data transferred is KB, not GB or even MB. But the time it takes to transfer that little bit of data matters. Since everything is waiting for that little bit of data to show up. Which makes latency important. And with current networking, latency is related to bandwidth. So that's what's being tested. Since it matters.

Totally agree with you. Latency is also what I'm curious about. But again, OP's test mainly focuses on bandwidth, which is irrelevant here.

Hungry_Elk_3276@reddit (OP)

I chose to test a model that fits in a single node because I really want to see what the penalty is for the RPC mode across two nodes. And frankly, I did not intentionally focus on bandwidth; it is just I really don't know if there is any specific way that I could test that is focused on the latency. Sorry about that.

wishstudio@reddit

Never mind. I'm sorry if anything I said sounded offensive to you! When I saw your title, I was imagining some speedups from distributed inference, and quickly realized that what you tested cannot result in a speedup. But as you are specifically testing for networking overhead, I want to say please ignore this thread, and thank you for the testing!

Stunning_Mast2001@reddit

Latency is definitely a huge factor but I wonder if the bandwidth is more important for training

pydehon1606@reddit

What is the model of your minipc? I don’t know any with pcie exposed to the back :o

griffin1987@reddit

Your single machine is still faster in some metrics, though. I would assume that your connection has way more protocol overhead and worse latency (you already hinted at that in your post) than InfiniBand; that's probably the rest of the difference. So, yes, it makes a difference, and the thing is, for a single machine it might not matter that much, but once you build a whole datacenter of these, every minuscule gain may make a huge difference.

marioarm@reddit

Which specific one do you have? I'm tempted by the Bosgame M5, but yours looks fairly different.

ortegaalfredo@reddit

Please test using vLLM. llama.cpp really is single-user software; it's useless for >1 request at a time, which basically wastes 99% of the hardware. Can you try vLLM or SGLang with pipeline parallel?

dionisioalcaraz@reddit

Does it mean that it doesn't matter to connect an external GPU to a mini PC using USB4 or OcuLink (2x BW) in terms of inference speed?

Ren-WuJun@reddit

When you were testing with the 2.5G connection, did you connect the two machines directly or via a network switch? Also, did you turn on jumbo frames?

Hungry_Elk_3276@reddit (OP)

I used a 2.5Gig switch, and the MTU is at the default 1600, so maybe it will give a better result if I manually set 9000? But I think the improvement won't be that huge, though.

Yorn2@reddit

As a sysadmin, the general rule of thumb with MTU and jumbo frames is not to set it manually unless you have to. As a sysadmin that put off changing the MTU for a particular issue (Oracle RAC) because he was stubborn about sticking to the rule and wasted 72 hours troubleshooting other shit before he finally went back to changing MTU manually which instantly fixed the problem, don't hesitate at least trying it (and remembering to switch back again after every other test). You'd be surprised at how dumb "smart" switches and networking sometimes operate. It's a huge pain in the butt to change everything manually, but it may need to be part of each troubleshooting step. There might be someone with more experience with this exact hardware that would know more, though.

JockY@reddit

Ahhh... Back in the day there was a certain DVR with a secure boot chain that I compromised because their bootloader's Broadcom Ethernet drivers assumed all Ethernet frames were 1500 bytes and just DMA'd them straight into RAM. Those extra 7500 bytes were very useful in landing a bootloader patch with a [www](https://cwe.mitre.org/data/definitions/123.html) primitive to disable the kernel integrity checks. Good times.

Ren-WuJun@reddit

I think cutting out the switch would help. Considering there are definitely more than 9 KB of data transmitted per token, why not try jumbo frames? Maybe not much of an improvement, but a free improvement nonetheless.
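To see what jumbo frames change, one can count how many frames a given payload needs at each MTU. A rough Python sketch, assuming TCP over IPv4 with no header options (40 bytes of headers per frame) and an illustrative payload of 11520 bytes (2880 elements at 4 bytes each, which is a guess and not a measured message size):

```python
import math

TCP_IPV4_HEADER_BYTES = 40  # 20-byte IPv4 header + 20-byte TCP header, no options


def frames_needed(payload_bytes, mtu):
    """Number of frames required to carry the payload at the given MTU."""
    mss = mtu - TCP_IPV4_HEADER_BYTES  # payload bytes that fit in one frame
    return math.ceil(payload_bytes / mss)


if __name__ == "__main__":
    PAYLOAD_BYTES = 11520  # assumed message size, for illustration only
    for mtu in (1500, 9000):
        print(f"MTU {mtu}: {frames_needed(PAYLOAD_BYTES, mtu)} frames")
```

Fewer frames means fewer per-packet costs on both hosts, though whether that shows up in tokens/s would have to be measured.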

aigemie@reddit

Thanks for testing and sharing! May I ask what machines (model, brand) you were using?

eleqtriq@reddit

Jeff Geerling just posted a video like this on his channel, and his results were abysmal. You should check it out. See what you can get versus what he got.

KillerQF@reddit

The video from Jeff Geerling was a bit confused wrt expectations. He's running a 400B dense model on strix halo and 'surprised' at the performance. plus he compares the results to machines running deepseek?

eleqtriq@reddit

I don’t think he set expectations. But I think a lot of people want to know about these use cases. Plus, it’s good to know what’s actually worked in regards to clustering.

geerlingguy@reddit

The main thing I was targeting was what use case you could hit with clustering on Strix Halo, and the answer so far is "running larger models more slowly than single node". If you're not using CUDA and 100+ Gbps, it's still much better to scale up one machine, either with multi-GPU or the biggest VRAM you can get, than to scale across nodes, at least with any current clustering tool outside of Nvidia-land.

ComplexityStudent@reddit

If only we could use a dGPU for prompt/context processing.

KillerQF@reddit

Shouldn't IP over Thunderbolt be able to go to 80 or 120 Gb/s using the USB4v2 ports?

Hungry_Elk_3276@reddit (OP)

No luck, and it seems like Thunderbolt 5 support is not working on Ubuntu Server 24.04 LTS; I was not able to get the TB5 driver working. The max speed I am able to get with TB4 is 10GB/s x 2, which can do 10 Gig send and receive at the same time, but I'm not able to get the full 20 Gig connection.

KillerQF@reddit

Did you mean 10 Gb/s x2? Are you on the 6.14 or 6.16 kernel?

Hungry_Elk_3276@reddit (OP)

Yes, sorry for the typo; I mean it is 10Gb full duplex. I am on the 6.8 kernel. The reason I did not upgrade the kernel is that newer kernels don't seem to be supported by the amdgpu-install script.

KillerQF@reddit

OK, the kernel may be the reason you can't get usb4v2.

getyourown12words@reddit

Funny, I was just thinking about this today while looking at ServeTheHome and my neighbors over at Level1Techs. Interesting results, I wonder if driver or applications improvements could make this work better.

WithoutReason1729@reddit

Your post is getting popular and we just featured it on our Discord! [Come check it out!](https://discord.gg/PgFhZ8cnWW) You've also been given a special flair for your contribution. We appreciate your post! *I am a bot and this action was performed automatically.*

RegularRecipe6175@reddit

This is exactly the kind of informative post I come here to read. I have a 4x3090 system and a new 395+ machine. Thank you, sir.

Aroochacha@reddit

Thank you for doing the work…for science!