M3 Ultra + DGX Spark = M5 Ultra-lite?
Posted by -dysangel-@reddit | LocalLLaMA | 36 comments
So I saw an article recently about exo's disaggregated prefill with a DGX Spark and M3 Ultra - prefill on one machine, decode on the other. The DGX Spark apparently has 4x the matmul performance of an M3 Ultra - the same advantage the M5 Ultra should have. So I got a Spark and have been playing around with it this weekend. Here are the results I've been getting with llama.cpp:
┌──────────────┬─────────────┬───────────────┬────────────┐
│ Model        │ Mac pp16384 │ Spark pp16384 │ Result     │
├──────────────┼─────────────┼───────────────┼────────────┤
│ Qwen 35B A3B │ 1574 t/s    │ 2198 t/s      │ Spark 1.4x │
├──────────────┼─────────────┼───────────────┼────────────┤
│ Qwen 27B     │ 340 t/s     │ 778 t/s       │ Spark 2.3x │
├──────────────┼─────────────┼───────────────┼────────────┤
│ Minimax M2.7 │ 372 t/s     │ 763 t/s       │ Spark 2.1x │
├──────────────┼─────────────┼───────────────┼────────────┤
│ Mistral 128B │ 72 t/s      │ 241 t/s       │ Spark 3.4x │
└──────────────┴─────────────┴───────────────┴────────────┘
In the end I found exo a little overkill for this simple use case, and so I've got Claude building a more focused and direct setup just using llama.cpp and some simple wrappers.
For anyone who's just got a Spark or is thinking of getting one: the most important thing I've found so far is to set mmap=0 for llama.cpp - leaving mmap on massively hurts both model loading time (many minutes vs like 20 seconds) and even prefill speeds.
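(That's `--no-mmap` on llama-server/llama-cli, or `--mmap 0` on llama-bench, iirc.) If you're scripting against llama-cpp-python rather than the CLI, the equivalent is the `use_mmap` constructor flag - a minimal sketch, with a placeholder model path:

```python
from llama_cpp import Llama

# use_mmap=False reads the weights straight into RAM instead of
# memory-mapping the file - the mmap=0 setting described above
llm = Llama(
    model_path="/models/qwen3-35b-a3b.gguf",  # placeholder path
    n_gpu_layers=-1,                          # offload all layers to the GPU
    n_ctx=16384,
    use_mmap=False,
)
```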
The Spark is tiny and low power. Good complement to the M3 Ultra for a neat, quiet package.
Of course the M3 Ultra only has ~66% of the bandwidth that the M5 Ultra will have, so decode speeds will be a bit lower - but I'm already pretty happy with my existing decode speeds, and the M5 Ultra won't be enough of a boost to make me want to upgrade. So my current setup sits somewhere between an M5 Max and an M5 Ultra, but with some level of CUDA capability.
If I upgraded anything just now, it would probably be adding a second Spark via the 200GbE!
I wonder if I can get even better performance with vLLM too, especially for batching. If anyone has good info on this, please post it here. I'll keep experimenting and keep you guys posted if people are interested.
tru3relativity@reddit
What would you do with a second spark? I have one and an m3 ultra 96.
-dysangel-@reddit (OP)
I'd cluster it together with the first Spark to get a larger RAM pool. I'd be able to squeeze GLM 5.1 @ IQ2_XXS onto that thing and then maybe have a real chance at using it agentically.
I just networked them with a single cable from one 10GbE ethernet port to the other on their own subnet.
tru3relativity@reddit
10GbE works okay? Just bought an enclosure and was about to buy a mellanox card.
-dysangel-@reddit (OP)
Yeah, the 10GbE works fine, especially if you just want to prove the concept to yourself first - it's a few $ vs hundreds.
If I were to buy another Mac, I'd wait for the M5 Ultra tbh to get the 4x matmul, extra bandwidth, and RDMA over thunderbolt.
I already have 512GB on my M3 Ultra, so it's not RAM I'm short of - it's extra compute. I was considering TinyGPU, but apparently the speed is awful just now (3 t/s!), so the Spark felt like the neatest way to add extra compute for now, plus it has CUDA support, which is useful for some projects/repos.
tru3relativity@reddit
Ah yeah, gotcha. I thought about getting another Spark as well, but then I'd need a switch, and I figured it would be better to wait for a 256GB M5 Ultra and get rid of the M3.
BackgroundCod3658@reddit
Watch Alex's video:
https://www.youtube.com/watch?v=D2oZHzC_M28
It isn't natively supported by ExoLabs yet, so probably easiest to wait until it is. ExoLabs CEO Alex Cheema is actually pretty active on X if you want to learn more.
Also, pro tip: if you are going to buy a GB10-type box to use in tandem with an M3 Ultra, buy the Asus GX10 1TB. Same chip and software as the DGX Spark, but about $1-$1.5k less expensive.
-dysangel-@reddit (OP)
Correct - you need to be running a specific branch to enable it.
I've just built my own solution. All you have to do is copy the KV cache from one machine to the other. Relatively simple.
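Roughly the shape of it, sketched with llama-cpp-python's low-level API (not my actual code - the hostname, port and model path are placeholders, and pickle over a raw socket is only sane on a trusted LAN):

```python
import pickle
import socket

from llama_cpp import Llama

MODEL = "/models/minimax-m2.gguf"  # placeholder; both machines load the same GGUF
PORT = 9999                        # arbitrary

def prefill_and_send(prompt: str, host: str = "mac.local") -> None:
    """Spark side: run prefill only, then ship the KV cache over the wire."""
    llm = Llama(model_path=MODEL, n_gpu_layers=-1, n_ctx=16384)
    llm.eval(llm.tokenize(prompt.encode()))   # fills the KV cache, no sampling
    state = llm.save_state()                  # KV cache + token history
    with socket.create_connection((host, PORT)) as s:
        s.sendall(pickle.dumps(state))

def receive_and_decode(max_tokens: int = 128) -> str:
    """Mac side: restore the KV cache, then decode as if prefill ran locally."""
    llm = Llama(model_path=MODEL, n_gpu_layers=-1, n_ctx=16384)
    with socket.create_server(("", PORT)) as srv:
        conn, _ = srv.accept()
        buf = bytearray()
        while chunk := conn.recv(1 << 20):
            buf.extend(chunk)
    llm.load_state(pickle.loads(bytes(buf)))  # the prompt is now "already processed"
    out = []
    for _ in range(max_tokens):
        tok = llm.sample()
        if tok == llm.token_eos():
            break
        out.append(tok)
        llm.eval([tok])                       # feed the token back for the next step
    return llm.detokenize(out).decode(errors="ignore")
```

The real version also needs to resend only the new tokens on each turn (the incremental updates mentioned further down) rather than reshipping the whole cache.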
And yeah I did actually go with the GX10 as it was £800 cheaper!
LowPlace8434@reddit
Are you using GGUF or MLX models?
1x DGX Spark with single-prompt, single-response work is relatively simple.
2x DGX Spark will be more complicated, and cheap incremental updates are also a problem. To fully utilize the M3 Ultra's RAM you'll need 4x DGX Spark, which is harder yet. On the flip side, perhaps 4x DGX Spark + M3 Ultra can be worth >50% more compute compared to 1x RTX 6000 Pro with 512GB DDR5 RAM and ktransformers - you gain from having disaggregation and enough VRAM to not bottleneck the prefill, and you get close to parity on M3 Ultra decode.
-dysangel-@reddit (OP)
GGUF so far, as the quantisation is way more reliable. I've found that some (but not all) larger models can still work well at IQ2_XXS UD. I mostly got the Spark in the hopes of accelerating Minimax M2.7, though unfortunately it seems way more bandwidth bound than compute bound.
I definitely don't want to go up to 4x Sparks as I'd be better off just putting that money toward an M5 Ultra. You make a really great point about 2x Sparks basically having 2x the bandwidth in some situations, so that is tempting.
I've already got the cheap incremental updates running btw.
ezyz@reddit
For this to work, does the model need to fit into the Spark's 128GB? Or is there still a speedup if you stream from the Spark's SSD?
DifferenceCute8951@reddit
I’m also really interested in this topic. I’ve watched a video on YouTube where it was tested with only marginal benefit over a single Spark.
I’ve got two Sparks and wonder if they’ll be just as fast as your setup with TP. Share your recipe, model and benchmark and I’ll run it to compare.
Side note: I’m running vLLM and am very happy with it.
-dysangel-@reddit (OP)
The main one I'm interested in is Minimax M2.7 IQ2_XXS UD - at pp16384, how does it compare in llama.cpp/vLLM/whatever else? I'd consider a second Spark and a 200GbE link if it outright halved prefill time again. That would be flat-out better than the M5 Ultra at that point, and would probably kick ass for video generation.
redragtop99@reddit
I personally don’t think they’re going to make an M5U with 512GB of RAM. Most of the RAM Apple is using right now is going towards things like the MacBook Neo, and they need RAM for the iPhone Fold or whatever they’re going to call it.
I think they really want to, but they’re going to put more RAM in their iPhones than in the past. This is just a guess on my part, but I’ve been keeping up with the news. I hope they come out with one, but I don’t see it.
Ok-Internal9317@reddit
My opinion is that 512GB at that rate of inferencing is not really usable for anything real; TTFT-wise it has nothing to compete even with an old retired machine with 8x V100s w/ NVLink. It does run quiet and draw less power, but when you're running something like Kimi it's still a super-quantised, slow version - more "wow this is fun" than agentic coding or running openclaw all day long. You could argue for a 27B, but then that wouldn't require 512GB.
-dysangel-@reddit (OP)
What you're saying is mostly true, but I do like running GLM 5.1 for chatting, and that would be a tough squeeze on 256GB. If someone comes out with a stable REAP prune then yeah, you'd be able to run it on 256GB. One thing about the 512GB model is that it also has more CPU and GPU cores. It also leaves a lot of RAM free for running a vector DB and all sorts.
-dysangel-@reddit (OP)
Yeah who knows. Either way I don't want to buy another 512GB atm - I feel like if DeepSeek's engram stuff plays out, we're going to be able to offload some general knowledge into engrams. Then more params can be dedicated to intelligence rather than knowledge. Engrams load plenty fast enough even from SSD, so 256GB or maybe even 128GB of VRAM may be enough for an extremely smart model.
worldburger@reddit
This is great. Were you able to implement sending the KV cache during the next layer's calculation, like they did in the Exo blog?
How complex would your solution be to get a 2x Spark, 2x Mac Studio cluster working?
-dysangel-@reddit (OP)
I haven't tried that yet, but it's definitely possible in principle to take the time down further.
For the 2x Spark, 2x Mac Studio case I'd probably just wire the Sparks up via their 200GbE ports, the 2 Macs via Thunderbolt, and then 10GbE or 25GbE between the two pairs. I *think* llama.cpp already supports clustering directly (via its RPC backend), so it would still just be a matter of copying the KV cache from one cluster to the other.
FukumuraMachine@reddit
Awesome! Is it as easy as buying a DGX, loading exo on both the DGX and the M3U, and letting exo figure it out?
-dysangel-@reddit (OP)
Yeah basically - you'll need to check out a custom branch though. This video shows things working: https://www.youtube.com/watch?v=D2oZHzC_M28
I'm aiming to open-source my setup once I get it running neatly - it will hopefully be more straightforward than exo, as this will be the tool's sole purpose.
forestryfowls@reddit
I bet I'm missing something, but does that video have a place where Alex talks about the overall speedups you get by combining the two? It's clearer from your table, but I actually can't tell from it what the baseline is for either machine on its own, and then what the aggregated gain is.
In the article they use Thunderbolt for connecting the two devices, while Alex uses super-fast ethernet. What is the rationale behind using your 10 Gig ethernet versus a direct Thunderbolt connection? Thanks for putting this together!
-dysangel-@reddit (OP)
Yeah, for some reason he uses a Mac Mini at the start of the video - but he uses a proper M3 Ultra at the end. He also includes model loading time and thinking tokens in "TTFT", which I don't agree with at all.
The rationale behind the 10G ethernet is that it's all I have to hand atm, as the Spark doesn't have Thunderbolt. It does have 200GbE ports, but I haven't got a switch for connecting the two yet. I think the best I could do bandwidth-wise is Thunderbolt -> 25GbE, then a switch that can do 25GbE to 200GbE. I'm not sure it would be worth the money, and the direct 10GbE is working well enough for now.
forestryfowls@reddit
Ooh thanks for clarifying. The last picture in that blog post tripped me up because I thought the usb c connection was thunderbolt to the Mac Studio. So you connected it presumably like:
Mac Studio 10G -> 10G Switch
Nvidia Spark QSFP56 -> 10G SFP -> 10G Switch
You mention this in another post, but I'd be curious about bonding multiple 10G connections too. I think the QSFP56 on the Spark is 2x 100G interfaces, so I'd hope you could bond 2 lower-speed interfaces on the Mac too. I looked, and dual 25G Thunderbolt adapters are still $900+, but AliExpress has a dual 10G Thunderbolt adapter for ~$90. I wonder if that's any good. I remember not long ago a single 10G Thunderbolt adapter was $200, so I'm glad cheaper chipsets are finally arriving, of the 10G variety at least.
On the performance: from your table you're just reporting the prefill speeds, right? It would be great to see what your overall speedups are - did they match the blog post, where they got 2.8x the speed with everything combined?
-dysangel-@reddit (OP)
That's super interesting about the dual Thunderbolt adapters. Possibly could get there with enough of them.
If I were processing longer contexts on a dense model it would absolutely get there, though so far I've just been testing with shorter contexts. IIRC the test I did earlier shaved a 6k prompt and 128 token response down from 15 seconds to 12 seconds, with 700ms of that being the cache transfer over 10G, so not too bad. The concept is proven; now I need to work on an OpenAI-compatible API and make sure tool calling etc. works.
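For anyone wanting to sanity-check that 700ms figure, here's a quick line-rate upper bound (illustrative numbers only - real TCP throughput over 10GbE will be a fair bit lower):

```python
# back-of-envelope: the most data that can move in 700 ms over 10GbE
link_bits_per_s = 10e9          # 10GbE at line rate
transfer_s = 0.7                # measured cache transfer time
max_gb = link_bits_per_s / 8 * transfer_s / 1e9
print(f"upper bound: {max_gb:.2f} GB")   # ~0.88 GB
```

~0.88GB is plausibly in the ballpark for a few thousand tokens of KV cache on a big model, which suggests the transfer is running close to saturating the 10G link.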
Only-An-Egg@reddit
When doing this, does the Spark need to only hold the model or does it also need to hold the KV cache too? I ask because I'm thinking about getting a M5 Max 128GB MBP to prefill for my 256GB M3U Studio over TB5.
-dysangel-@reddit (OP)
Yes it needs to hold both, because it's generating the KV cache to send to the Mac
I was wondering the same about the M5 Max, but the Spark is likely faster for prompt processing, and is also half the price of a specced out M5 Max. So definitely get some prompt processing benchmarks first.
The M5 Max is 2x our M3 Ultra's prompt processing, the M5 Ultra is 4x. Though with 2 Macs connected via Thunderbolt you get RDMA and more bandwidth, so that opens some other doors...
macarory@reddit
Sorry, I don't have a lot of karma so I can't post, but I want to add a question you may be able to help me with.
I'm a psychologist planning to purchase and use an Nvidia Spark to train models. I know it isn't great for inference, so I'm looking at a Mac M3 Ultra with 215GB RAM for the inference machine.
My dilemma is that I am moving out of the country, so this setup will be kept at home mostly. My brother will be the one to help manage the hardware if I ever need assistance with it, so I want to be reasonably set up to help him with it. I figured I could just ssh / remote-connect to the Mac when needed using my MacBook.
Does this sound sufficient?
forestryfowls@reddit
I just set up Tailscale and it is amazing at being able to connect to my homelab wherever. The iOS app is even really polished for allowing me to use something like Termius for remote tmux sessions.
Possible-Pirate9097@reddit
Use TEE/TDX.
-dysangel-@reddit (OP)
yep sounds fine to me - I connect into my Mac over ZeroTier when I'm away from home, and it works great
Not_HFM@reddit
Do you think this setup would work with an M2 Ultra (192 GB) so the two machines are matched?
-dysangel-@reddit (OP)
Oh wow I didn't know that the M2 Ultra basically has as much bandwidth as the M3 Ultra! Yeah I think that would be a big win.
chorl@reddit
Which kind of connections are you using? QSFP (50 Gbps?) or plain Ethernet 10Gb?
-dysangel-@reddit (OP)
Yeah, just the 10GbE atm. I've been wondering about Thunderbolt to 25GbE; I don't know of any other options yet. I was wondering if you could somehow bond a bunch of Thunderbolt connections to link up to the 200GbE on the Spark, but it feels like a lot of work for diminishing returns when prompt processing time already massively outweighs the transfer time.
tamerlanOne@reddit
Does this also work for Strix Halo?
-dysangel-@reddit (OP)
Sure - you can transfer the KV cache between any two machines as long as you use a shared format and have a fast enough connection. I'm only on 10GbE just now, but it's already enough to save a few seconds on a 4k prompt to Minimax M2.7.