M3 Ultra + DGX Spark = M5 Ultra-lite?

Posted by -dysangel- | LocalLLaMA

So I saw an article recently about exo's disaggregated prefill across a DGX Spark and an M3 Ultra: prefill runs on one machine and decode on the other. The DGX Spark apparently has about 4x the matmul performance of an M3 Ultra, which is roughly the uplift the M5 Ultra is expected to have. So I got a Spark and have been playing around with it this weekend. Here are the results I've been getting with llama.cpp:

┌──────────────┬─────────────┬───────────────┬────────────┐
│    Model     │ Mac pp16384 │ Spark pp16384 │   Result   │
├──────────────┼─────────────┼───────────────┼────────────┤
│ Qwen 35B A3B │    1574 t/s │      2198 t/s │ Spark 1.4x │
│ Qwen 27B     │     340 t/s │       778 t/s │ Spark 2.3x │
│ Minimax M2.7 │     372 t/s │       763 t/s │ Spark 2.1x │
│ Mistral 128B │      72 t/s │       241 t/s │ Spark 3.4x │
└──────────────┴─────────────┴───────────────┴────────────┘
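
For reference, the numbers above came out of llama-bench; a run along these lines (model path is a placeholder) reproduces the pp16384 column:

```bash
# model path is a placeholder; -p 16384 matches the pp16384 column,
# -n 0 skips the token-generation pass, -mmp 0 disables mmap
# (see the mmap note further down)
llama-bench -m ./qwen3-30b-a3b-q4_k_m.gguf -p 16384 -n 0 -mmp 0
```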

In the end I found exo a little overkill for this simple use case, so I've got Claude building a more focused and direct setup using just llama.cpp and some simple wrappers (rough sketch below).
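
To give an idea of what I mean by simple wrappers, here's a rough sketch of one piece: a tiny OpenAI-compatible proxy that sends prompt-heavy requests to the Spark and everything else to the Mac. The hostnames, port and threshold are placeholders, not my actual code, and unlike exo this doesn't hand the KV cache between machines - it just routes whole requests:

```python
# Hypothetical sketch, not my actual setup: route chat requests to the
# Spark when the prompt is large (prefill-bound) and to the Mac otherwise.
import json
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.request import Request, urlopen

SPARK = "http://spark.local:8080"  # assumed llama-server on the Spark
MAC = "http://mac.local:8080"      # assumed llama-server on the M3 Ultra
THRESHOLD = 8000                   # chars of prompt; crude proxy for tokens

class Router(BaseHTTPRequestHandler):
    def do_POST(self):
        body = self.rfile.read(int(self.headers.get("Content-Length", 0)))
        payload = json.loads(body or b"{}")
        # Rough prompt-size estimate from the chat messages.
        prompt_chars = sum(len(str(m.get("content", "")))
                           for m in payload.get("messages", []))
        upstream = SPARK if prompt_chars > THRESHOLD else MAC
        req = Request(upstream + self.path, data=body,
                      headers={"Content-Type": "application/json"})
        # Note: no streaming support - the reply is buffered whole.
        with urlopen(req) as resp:
            data = resp.read()
            status = resp.status
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(data)))
        self.end_headers()
        self.wfile.write(data)

if __name__ == "__main__":
    HTTPServer(("0.0.0.0", 9000), Router).serve_forever()
```

Point a client at http://localhost:9000/v1/chat/completions and it forwards to whichever box suits the request. A real version would need streaming and error handling, and true disaggregated prefill still needs the KV cache shipped across, which is the part exo handles.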

For anyone who's just got a Spark or is thinking of getting one: the most important thing I've found so far is to disable mmap (mmap=0) in llama.cpp, otherwise it massively hurts both model loading time (many minutes vs. around 20 seconds) and even prefill speed.
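
Concretely, for anyone looking for the flags (paths and ports here are just examples):

```bash
# llama-bench takes -mmp 0, as in the benchmark command above;
# llama-server and llama-cli use --no-mmap instead
llama-server -m ./model.gguf --no-mmap -c 32768 --port 8080
```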

The Spark is tiny and low power. Good complement to the M3 Ultra for a neat, quiet package.

Of course, the M3 Ultra only has ~66% of the bandwidth the M5 Ultra will have, so decode speeds will be a bit lower - but I'm already pretty happy with the decode speeds I'm getting, and the M5 Ultra wouldn't be enough of a boost to make me upgrade. So my current setup lands somewhere between an M5 Max and an M5 Ultra, but with some level of CUDA capability.

If I upgraded anything just now, it would probably be adding a second Spark via the 200GbE!

I wonder if I can get even better performance with vLLM too, especially for batching. If anyone has good info on this, please post it here. I'll keep experimenting and keep you guys posted if people are interested.
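
If it does work on the Spark, I'd guess the starting point is something like this (untested on my end; the model name and limits are just examples):

```bash
# untested: model name, context length and batch limit are only examples
vllm serve Qwen/Qwen3-30B-A3B --max-model-len 32768 --max-num-seqs 16
```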