ElementNumber6@reddit
Nice writeup. I hear Claud Code is now open source, and the original was full of analytics beacons. Any thought to compiling it yourself, and making improvements to address some of what you mentioned?
Leafytreedev@reddit
Don't forget to confirm your .plist file belongs to root and is read-only for everyone besides root :D
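For anyone unsure what that looks like in practice, here's a sketch. The plist path is hypothetical (substitute your actual LaunchDaemon), and the demo runs against a temp file so it doesn't need root; on the real file you'd also run `sudo chown root:wheel <plist>`:

```shell
# Demo of the permission bits on a scratch file; apply the same to your
# real plist (as root), plus: sudo chown root:wheel <plist>
plist=$(mktemp)
chmod 644 "$plist"   # rw for owner (root), read-only for everyone else
# Portable permission check (GNU stat first, BSD/macOS stat as fallback):
perms=$(stat -c %a "$plist" 2>/dev/null || stat -f %Lp "$plist")
echo "$perms"        # 644
rm -f "$plist"
```

launchd itself refuses to load LaunchDaemons that are group/world-writable, so 644 + root:wheel is the conventional target.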
zeferrum@reddit
Awesome article. I wonder if you have contemplated using Gemma 4 26b a4b with thinking off at fp8 somewhere fast to replace haiku? Your article made it sound to me you use a single thinking model for your local Claude. Those are my current thoughts if I take the plunge of one day buying an M3 ultra. Please keep sharing !!
ezyz@reddit (OP)
Actually, I just switched local_haiku from Qwen 3.5 35B to Gemma 4 24b. So far so good! It's small enough that concurrent requests don't seem to affect throughput on the main model in any noticeable way.
zeferrum@reddit
Nice!! Thanks for the update. Do you run it somewhere else, or also on the M3 Ultra? And at what quantization? I'm thinking something like dual 3090s at W8A16 for nice "snappiness" while keeping the big thinker on the M3 Ultra. One can research and dream…
xrvz@reddit
I'm not reading a shitty substack article. If you can't be assed to make your own website put it on wordpress or blogspot like a normal person.
__rtfm__@reddit
Really great write-up! I recently got an older M1 Ultra Studio with 128GB to delve in. I'm definitely not running such large models, but it's been interesting moving between Ollama, LM Studio, and now omlx and rapid-mlx. So I definitely understand that it's not plug and play, but it's been a lot of fun learning. At work we have Claude and Codex, so this is more for privacy use at home plus learning. Appreciate you sharing all this knowledge; it's quite helpful and intriguing!
whysee0@reddit
Thanks for this OP! Great read and got some tips out of it. Been meaning to write something similar about my own setup (M4 Max 128GB * 2) but never got to it 😆
averagepoetry@reddit
This is so good. Thank you so much!
sanmn19@reddit
Great article! In your case, since Kimi K2.5 at Q8 should be ~1 TB, or 512 GB at Q4, were only the active parameters loaded into unified memory, with the rest on disk?
Could you also please test with longer context lengths, and with later models like GLM 5.1 and the MiniMax 2.7 that's about to release?
ezyz@reddit (OP)
My MiniMax 2.7 quant trials are still running, but tokens/s on the M3 Ultra is roughly 740 prefill, 49 decode at short context.
sanmn19@reddit
Thank you!
ezyz@reddit (OP)
Thanks! K2.5 actually ships with its experts at 4-bit, so the "full" model is only 600 GB at full precision. It's also quantization-aware, so I was able to get it down to ~2.5 bit for ~360 GB, fully in memory: https://huggingface.co/spicyneuron/Kimi-K2.5-MLX-2.5bit At a 20k prompt, prefill drops ~20%, from 237 to 188.
GLM 5.1's best case is 194 prefill / 19.5 decode: https://huggingface.co/spicyneuron/GLM-5.1-MLX-2.9bit
Haven't run longer context benchmarks, but I'd expect a drop in the same 20-25% neighborhood.
muyuu@reddit
Minimax 2.7 looks very promising.
TCDH91@reddit
Great writeup, has everything I want to know. With the recent well-documented service degradation from Claude and subscription prices slowly climbing, running large models locally could go more mainstream. Qwen choosing not to open-source their latest large models is disappointing, but there seem to be enough other open models to choose from at the moment.
Just curious, do you have a rough estimate of how much the M5 Ultra is going to increase performance?
JinPing89@reddit
I did hear people say that for MLX models you need at least Q6, while GGUF models are good at q4_k_m, because the quantization methods are different.
ezyz@reddit (OP)
It's not that MLX quantization methods are bad, so much as the default quantization tool has limited settings.
I use a fork of mlx-lm to do per-module overrides: https://github.com/ml-explore/mlx-lm/pull/922
Most of my own MLX quants average between 3-5 bits but include select weights at 6, 8, and 16 bit to improve quality.
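The idea behind a per-module override can be sketched as a small policy function that maps a weight's module path to a target bit-width. This is an illustrative sketch only: the function name and the exact hook exposed by the mlx-lm fork in that PR are assumptions, and the path substrings are typical transformer naming, not a specific model's:

```python
def bits_for(path: str, default_bits: int = 4) -> int:
    """Return a target bit-width for a weight, keyed on its module path.

    Hypothetical policy: keep quality-sensitive modules at higher precision
    and push the bulky MoE expert weights lower.
    """
    if "embed" in path or "lm_head" in path:
        return 8   # embeddings/output head are quality-sensitive
    if "attn" in path:
        return 6   # attention projections a notch above the default
    if "experts" in path:
        return 3   # the bulk of MoE parameters; tolerates lower precision
    return default_bits

# Example: what the policy assigns across a few (made-up) module paths
plan = {p: bits_for(p) for p in [
    "model.embed_tokens",
    "model.layers.0.attn.q_proj",
    "model.layers.0.mlp.experts.0.up_proj",
    "model.layers.0.mlp.gate",
]}
print(plan)
```

A predicate like this is how you end up with a quant that "averages 3-5 bits" overall while select weights stay at 6, 8, or 16 bit.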
thrownawaymane@reddit
This is a good but frustrating article for me to read given the fork in the road I decided to walk down.
My DDR4 box didn't have enough memory/GPUs, and since I'm interested in photo/video generation, I went down the upgrade path instead of buying the 512GB Studio (I'd have had to sell a kidney to do it, but… I would have).
Now I have lots of memory, I can devote 512 to an LLM VM, and I'll put in the 5090 I have once I get the PSU I need. But I'm staring at TPS metrics ~10x slower than yours for the large models, which is discouraging. My box does a lot of other things, but man :/
ezyz@reddit (OP)
At current RAM prices, you might be able to sell half and buy a kidney! Or a M5 this summer.
thrownawaymane@reddit
I am slightly curious about doing that. I could sell 1tb at the very most.
colorblind_wolverine@reddit
What was your main motivation in using Claude Code? Wondering if you’ve tried Pi for a more light weight harness.
ezyz@reddit (OP)
Mostly the convenience of sharing the same harness between local and API subscription. Cloud Claude still has a big lead on fast / complex coding, though I've been impressed with GLM 5.1 so far.
One_Club_9555@reddit
Thanks for the write-up, it was great!
Trying to correlate to an M4 Max 128GB. What's the largest model and quant I could run, and how do you figure that out?
Thanks!!
ezyz@reddit (OP)
Largest is just a product of how much memory you can set aside for GPU. By default, that's 96GB but you could push it to ~120... if you're willing to run your laptop as a dedicated headless server. Which might not be realistic.
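For the "push it to ~120" part: on Apple Silicon the GPU wired-memory limit is controlled by the `iogpu.wired_limit_mb` sysctl (value in MB, resets on reboot). The block below just computes and prints the command rather than running it, since it needs root, and the 120 GB figure is an example, not a recommendation:

```shell
# Sketch: raise the GPU wired-memory limit on Apple Silicon macOS.
# iogpu.wired_limit_mb takes megabytes; the setting resets on reboot.
LIMIT_GB=120
LIMIT_MB=$((LIMIT_GB * 1024))
echo "sudo sysctl iogpu.wired_limit_mb=$LIMIT_MB"
# prints: sudo sysctl iogpu.wired_limit_mb=122880
```

Leave enough headroom for macOS itself; starving the OS of RAM is how you end up with a machine that only works as a headless server.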
You could easily run Qwen 3.5 122B at Q4 with plenty of room leftover. Or maybe a Minimax M2.7 at a 2 or 3 bit?
You can get a rough approximation of memory needs by just looking at the total download size of that quant. That'll undershoot, but it's a starting point.
One_Club_9555@reddit
Thanks! I’ll check them out!
rtgconde@reddit
Thank you for this OP! Great information in this article. I’m running two DGX Sparks in a cluster and multiple 128gb machines with different models. Just got my hands on the latest MacBook Pro M5 max with 128gb of RAM as well and this is really helpful even if I don’t have the same amount of memory as you.
ezyz@reddit (OP)
NP! Been a lurker here long enough, so this felt like something I needed to write.
I'm actually eagerly waiting for more details on Mac + Spark clusters. Exo launched a demo of this a couple months ago, but it hasn't moved since: https://github.com/exo-explore/exo/issues/1102
Longjumping_Crow_597@reddit
EXO maintainer here. This is coming very soon! In the issue you mentioned there's a link to a public preview in the most recent comment.
Heterogeneous hardware is coming local, just like in the data center.
ezyz@reddit (OP)
Amazing, thank you. Does NVIDIA prefill / Mac decode require the model to be fully loaded in both?
Either way, looking forward to this!