Inference using exo on Mac + DGX cluster?

Posted by EternalOptimister@reddit | LocalLLaMA | 6 comments

I read on the EXO Labs blog that you can achieve “even higher” inference speeds by using a DGX Spark together with a cluster of M3 Ultras. However, I did not find any benchmarks. Has anyone tried this or run benchmarks themselves? Exo doesn’t only work on the Ultra but also on the M4 Pro and M4 Max, and likely on the upcoming M5s as well. I’m wondering what kind of inference speeds such clusters might achieve for large SOTA MoEs (Kimi, DeepSeek, …) that are currently practically impossible to run.

6 Comments

b4d6d5d9dcf1@reddit

There is an article from EXO Labs showing a picture with dual DGXs, but the benchmarks cover only one of them. I haven't found anybody in the community who has got this working; otherwise I'd have it right now. It is the best local AI option. With the Apple cluster, the TB5 links become saturated; this would solve that issue and let the M3s handle decode. Assuming the prefill can be handed off efficiently over 10GbE, I think this is the best build for your money. That said, the M5 will likely match the prefill, so if it takes EXO 6 months to get this working, then we should wait. If you find any information, let me know and I'll do the same.
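For a rough sense of what the prefill handoff over 10GbE would cost, here is a back-of-the-envelope sketch in Python. It assumes the handoff means shipping the full KV cache from the prefill node to the decode nodes in one go. The model shape below is hypothetical (not Kimi, DeepSeek, or any real model), and the helper names are made up for illustration. It ignores protocol overhead, compression, and any overlap of transfer with compute, so treat the result as an order-of-magnitude figure only.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, num_tokens, bytes_per_value=2):
    """Size of a plain (uncompressed) KV cache: one K and one V vector
    per layer, per KV head, per token. bytes_per_value=2 assumes fp16/bf16."""
    return 2 * num_layers * num_kv_heads * head_dim * num_tokens * bytes_per_value


def transfer_seconds(num_bytes, link_gbit_per_s, efficiency=1.0):
    """Time to push num_bytes over a link with the given nominal rate.
    efficiency < 1.0 models protocol overhead (an assumption, not a measurement)."""
    return num_bytes * 8 / (link_gbit_per_s * 1e9 * efficiency)


# Hypothetical model shape and prompt length, chosen only for illustration.
cache = kv_cache_bytes(num_layers=32, num_kv_heads=8, head_dim=128, num_tokens=8192)
seconds = transfer_seconds(cache, link_gbit_per_s=10)

print(f"KV cache: {cache / 2**30:.2f} GiB, handoff over 10GbE: {seconds:.2f} s")
```

Plugging in the real layer count, KV head count, and context length of the model in question shows whether the handoff is negligible next to prefill time or becomes the bottleneck.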

EternalOptimister@reddit (OP)

I’ll be waiting for the M5 Ultra either way, but would consider adding a DGX if it would speed the process up! I’ll let you know if I find anything.

b4d6d5d9dcf1@reddit

I spoke to EXO directly, and they do not support their test at scale. It was a proof of concept. Specifically, 1:1 works, but scaling would mean supporting and managing the I/O between two simultaneous sub-clusters: a prefill sub-cluster and a decode sub-cluster. I've landed on just clustering 4 or 8 GB10s; nowhere else to go? And investing in a cluster that maxes out the TB5 ports on day 1 doesn't seem responsible. It works, but I think all of us look at the TB5 mesh cabling and our gut says, "wrong direction". The rumor is that the M5 will have one less TB5 port, so I've moved off the Apple wagon despite the juicy memory bandwidth and unified RAM.

EternalOptimister@reddit (OP)

But clustering GB10s is not really supported either, as far as I understand?

b4d6d5d9dcf1@reddit

ChatGPT says it is "no problem". ;) Joking. It should be possible over IB; I don't know how they could technically restrict it. There is no architectural mechanism for NVIDIA to hard-limit this at the fabric level. Any “limit” is a matter of support, not physics. You avoid PFC/ECN/ECMP, L3 routing, and GID/TC tuning; the nodes would talk over a single flat lossless fabric across a switch that is designed specifically for that purpose. Home users are not sitting on a Mellanox HDR switch; it would be totally useless to their LAN. Corporate users that have IB HDR switches are going for H200 clusters. I think very few exist at the intersection of “small enough to experiment” and “serious enough to buy an IB EDR/HDR switch”. Just speculating here. Also, these are not cheap, but with the new data center explosion they are being replaced with 400Gb switches. As such, you can pick up a used one for less than $2000 (likely with some heavy mileage), or a new open-box one for about $4000.

No_Conversation9561@reddit

You're better off creating an issue on the exo GitHub about this: https://github.com/exo-explore/exo/issues. Not many people here experiment with heterogeneous setups.