M5 Max 128GB, 17 models, 23 prompts: Qwen 3.5 122B is still a local king
Posted by tolitius@reddit | LocalLLaMA | View on Reddit | 108 comments
The last Llama (Scout/Maverick) was released a year ago. Since then US-based releases have been super rare: Granite 3.3, GPT-OSS 20B & 120B, Nemotron 3 Nano / Super and now Gemma 4. It can't even compare to the solid Chinese open model output: Qwens, DeepSeeks, Kimis, MiniMaxes, GLMs, MiMos, Seeds, etc..
Gemma 4 is like a breath of fresh air. Not just the model itself, but the rollout, the beauty, the innovation: K=V in global attention, Per-Layer Embeddings, tri-modal minis (E4B, E2B), etc.
Most of my local LLM usage used to be via rented GPUs: Google Cloud, AWS, etc. But about a month ago I decided to bring it all home, and bought a shiny M5 Max MacBook Pro 128GB. It is a beast of a laptop, and it also opens up the kind of models I can run locally: 128GB of unified RAM and all.
Besides the cost, the true benefit of running models locally is privacy. I never felt easy sending my data to "OpenRouter => Model A" or even hosting it in AWS on P4d/P4de instances (NVIDIA A100): it is still my data, and it is not at home, where I am.
But my laptop is.
When it comes to LLMs, unless it is research or coding, finding utility is difficult. But I have kids, and they have school, and if anything is super messy in terms of organization, disconnected systems where the kids' data lives, and communication inconsistencies, it is US public schools. But being a parent is fun, and this mess is a great fit for LLMs to make sense of. Local LLMs solve the last piece: my kids' data stays on my laptop at home.
So it began. I loaded all I could onto my 128GB friendly beast and started looking at which models are good for what. The flow is not difficult: go to many different school-affiliated websites; some have APIs, some I need to screen scrape with Playwright, some are a little of both plus funky captchas and logins, etc. Then, when on "a" website, some teachers have things inside a slide deck on "slide 13", some in obscure folders, others on different systems buried under many irrelevant links. LLMs need to scout all this ambiguity and come back to me with clear signals of what is due tomorrow and this week; what the grades are, why they are what they are, etc. Again, a great use case for LLMs, since it is lots of unorganized text with a clear goal to optimize for.
You may be thinking just about now: "OpenClaw". And you would be correct, this is what I started from, but then I realized that OpenClaw is only as good as the set of LLMs behind it. Also, if I schedule a vanilla OS cron that invokes a "school skill", the number of tokens sent to the LLM goes from 10K to about 600. And while I do have an OpenClaw running on VPS / OpenRouter, this was not (maybe yet) a good use for it.
In order to rank local models I scavenged a few problems over the years that I had to solve with the big boys: Claude, OpenAI, Grok and Gemini. They are nice enough to record everything we talk about, which is anything but local, but in this case it gave me a chance to collect a few problems and convert them to prompts with rubrics.
I then wrote a script to start making sense of what works for me vs. what is advertised and/or works for others. The script grew fast, and was missing look and feel, so I added UI to it: https://github.com/tolitius/cupel
Besides the usual general problems, I used a few specific prompts that had tool use and multi-turns (multiple steps composed via tool calling) focused specifically on school-related activities.
After a few nights of trial and error, I found that "Qwen 3.5 122B A10B Q4" is the best and the closest to solving most of the tasks. A pleasant surprise, by the way, was "NVIDIA Nemotron 3 Super 120B A12B 4bit". I really like this model, it is fast and unusually great. "Unusually" because previous Nemotrons did not stand out the way this one does.
[pre Gemma 4]()
And then Gemma 4 came around.
Interestingly, at least for my use case, "Qwen 3.5 122B A10B Q4" still performs better than "Gemma 4 26B A4B", and is about 50/50 accuracy-wise with "Gemma 4 31B", but Qwen wins hands down on speed. "Gemma 4 31B" full precision is about 7 tokens per second on the M5 Max MacBook Pro 128GB, whereas "Qwen 3.5 122B A10B Q4" is 50 to 65 tokens / second.
[(here tested Gemma 4 via OpenRouter to avoid any misconfiguration on my side + 2x faster)]()
But I suspect I still need to learn "The Way of Gemma" to make it work much better. It really is a giant leap forward given its size vs. quality. After all, at 31B, although dense, it stands side by side with 122B.
bwjxjelsbd@reddit
Can you please keep updating this series of posts? With Minimax M2.7 coming out this weekend it’s going to be a fun one
tolitius@reddit (OP)
need to address lots of very solid feedback in this thread; I agree, one follow-up post might not address all of it
while the M5 Max 128GB is a great laptop, it is still just a laptop, and can only run one LLM test at a time (to be fair to that LLM / resourcing). I'll try to address most over the weekend
cupel does help me, as I have a place that I can make better to move faster, given the pace of everything
what would be the best way to have the series here on r/LocalLLaMa to make sure they are helpful vs. noise?
bwjxjelsbd@reddit
Would be posting the leaderboard tbh.
BTW MiniMax M2.7 just released! https://huggingface.co/MiniMaxAI/MiniMax-M2.7
slypheed@reddit
Just post to this leaderboard using the Anubis app, then we can all have results across different machines in one place: https://devpadapp.com/leaderboard.html
https://github.com/uncSoft/anubis-oss
BestSeaworthiness283@reddit
I like qwen3.5:9b for speed
tolitius@reddit (OP)
at full precision Qwen 3.5 9B is a little on the slow side for M5 Max
for example Qwen 35B A3B (given that it only does 3B at a time vs. 9B) is much faster
and even at Q4, Qwen 35B A3B quality-wise is comparable to Qwen 3.5 9B, and it runs at 112 tokens a second (on M5 Max)
Imaginary-Unit-3267@reddit
What I find odd here is that on my RTX 3060, it's the other way around - 9B gives me ~45 t/s, and 35B-A3B gives me ~21 t/s. I still haven't figured out how it's possible that the one that's three times bigger in active parameters runs twice as fast.
tolitius@reddit (OP)
this is most likely because in 35B-A3B all 35B need to fit and sit in RAM. 9B comfortably fits into an RTX 3060 (12GB?), but I think the 35B spills to CPU / system RAM
Imaginary-Unit-3267@reddit
I thought the entire point of mixture of experts models was that you don't have to use the whole model all at once, just a portion of it? Why can't inactive experts be on RAM and that be fine?
gfghgfghg@reddit
If the entire model were to fit on your card, it would be faster. MoE still needs to load the entire 35b model into vram, it just only cycles through / uses ~3b when responding, which is quicker than needing to go through the entire 35b model, even when all of it is loaded into vram
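A rough sketch of that resident-memory vs. per-token-traffic distinction, with illustrative numbers only (assuming ~2 bytes/weight at fp16 and ~0.5 bytes/weight at Q4, ignoring KV cache and runtime overhead):

```python
# Illustrative only: why a 35B-A3B MoE can outrun a 9B dense model *when it fits*.
# Assumed sizes: ~2 bytes/weight at fp16, ~0.5 bytes/weight at Q4; KV cache ignored.

def resident_gb(total_params_b: float, bytes_per_weight: float) -> float:
    """All weights must sit in (V)RAM, including the inactive experts."""
    return total_params_b * bytes_per_weight  # billions of weights * bytes = GB

def traffic_gb_per_token(active_params_b: float, bytes_per_weight: float) -> float:
    """Per-token memory traffic: only the *active* weights get read."""
    return active_params_b * bytes_per_weight

print(resident_gb(9, 2.0), traffic_gb_per_token(9, 2.0))   # dense 9B fp16: 18.0 GB resident, 18.0 GB/token
print(resident_gb(35, 0.5), traffic_gb_per_token(3, 0.5))  # 35B-A3B Q4: 17.5 GB resident, 1.5 GB/token
```

Similar footprint, but roughly 12x less memory traffic per token for the MoE, which is why spilling it out of VRAM hurts so much more than its active size suggests.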
BestSeaworthiness283@reddit
thank you very much, also i have a newbie question, i still cant wrap my head around what that q4 means :(
tolitius@reddit (OP)
LLMs are trained and released at "full precision". it means that if Qwen 3.5 9B has 9 billion weights / numbers, each weight is 16 bits (hence, fp16 / bf16). 9 billion * 16 bits (2 bytes) = 18GB of GPU RAM that is needed. another important note is memory bandwidth: how many weights need to "fire" / be used for each token. Qwen 3.5 9B is a dense model, which means all 9 billion weights need to be processed for each token: i.e. 18GB that need to be read through per token.
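The sizing arithmetic above as a quick sanity check (real quantized files come out a bit bigger because of scales and other metadata):

```python
# bytes = params * bits / 8; with params in billions the result is directly in GB.
# Q4 metadata (scales, etc.) is ignored here, so real files run slightly larger.

def model_gb(params_billions: float, bits_per_weight: float) -> float:
    return params_billions * bits_per_weight / 8

print(model_gb(9, 16))  # dense 9B at fp16 -> 18.0 GB
print(model_gb(9, 4))   # same model at Q4 -> 4.5 GB
```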
this can get large and slow for given hardware / memory bandwidth limits, therefore there is a process called quantization, which takes each weight and rounds it to be represented in N bits vs 16. for example, it takes a 16-bit weight with the value "3.14159" and rounds it to "3.1", representable in just 4 bits vs. 16. this would be called Q4 (4 for 4 bits): the model needs a quarter of the original memory and bandwidth. "Q" stands for "Quantized".
BestSeaworthiness283@reddit
thank you very much
xraybies@reddit
BestSeaworthiness283@reddit
Ty very much guys, another thing i want to ask, do you know any online resource, like a website, which has these things explained in depth? i want to really get into this and understand, thank you again!
AdamDhahabi@reddit
For multi-GPU owners (Nvidia), my 'Frankenstein build' runs Qwen 3.5 122B IQ4_XS GGUF (Bartowski) at 50 t/s (first few thousands of tokens) which comes close to M5 Max in terms of speed.
Specs: 2x 5070 Ti + 3090 + 5060 Ti 16GB (mix of expensive Blackwells and a single 3090 to keep it affordable)
To compare with M5 platform, Nvidia multi-GPU should be faster for dense models and slower for MoE once you start offloading to slow system RAM. I have 72GB VRAM.
In case Qwen 3.5 122B can't hack a problem, or when doing research, I run a REAP Q2 version of its larger sibling 397B. It isn't as fast (especially because not fully in VRAM) but it's smarter and pretty usable at 2-bit. https://huggingface.co/OpenMOSE/Qwen3.5-REAP-262B-A17B-GGUF
tolitius@reddit (OP)
that's a really good point!
given close / similar RAM, Apple Silicon has an advantage for MoE models (that fit in that RAM, that is). I would suspect, besides offloading to CPU / system RAM, you can see a lot of PCIe communication traffic when the MoE router summons an expert that is split between the cards?
AdamDhahabi@reddit
For inference no high PCIe bandwidth is needed, I only have a consumer mainboard (very suboptimal) and still get good speeds with pipeline parallelism. I cannot even think about tensor parallelism (since yesterday experimental in llama.cpp). Here's my config: PCIe 4.0 x16 (CPU), PCIe 4.0 x4 (CPU), PCIe 4.0 x4 (Chipset), PCIe 3.0 x1 (Chipset). Especially the PCIe 3.0 x1 is not good, but no worries because of pipeline parallelism.
tolitius@reddit (OP)
I see. pipeline parallelism makes this not a problem. thanks for the distinction
since you use multiple GPUs, can you try https://github.com/tolitius/cupel (you can install and remove it if you don't need it)
after this conversation I realized that it did not support auto discovery for llama.cpp and multiple GPUs, I added it and just want to make sure it shows up (i.e. the multiple GPUs piece)
if it is a hassle, no problem: I'll stand up an EC2 instance with a few A100s or something later to test.
Accurate-Egg-6787@reddit
I reached the same conclusion for pretty much the same workload on my 128gb Strix Halo, though with less formal eval. Distilling school communication is a hero workload for parents! I set Gmail to auto-forward all the school emails to an api-only Gmail account I'd made years ago, and the agent accesses it via GWS skills [1] to create a daily breakdown of reminders, things needed at school the next day, and conversation starters based on curriculum and special school events. These get posted as events shared with my personal calendar.
Similar TG speed gap to you, but lower numbers on strix halo. Qwen-3.5-122B-A10B using bartowski Q6 on vulkan llama.cpp gets about 20 tk/s tg, and Gemma 27B Q8_0 hits about 6 tk/s tg. I found Gemma to be slightly better at improving my SKILL.md. They are really about the same when it comes to following the skill, with Qwen so much faster.
[1] https://github.com/googleworkspace/cli
tolitius@reddit (OP)
noice!
100%
unfortunately our public school sends a lot of email but not a lot useful context. and the context lives in many different places in many different formats.
do you really feel the difference with "Qwen-3.5-122B-A10B Q6" vs. Q4? I tried Q4, Q5 and Q6, and, at least for school skills, Q4 pulled ahead simply because of the speed to go over all the scraping data, HTML files, large JSON blobs, etc.
Accurate-Egg-6787@reddit
I never tried Q4, will give it a go.
Accurate-Egg-6787@reddit
Tried bartowski Q4_K_L (71G on disk) tonight vs Q6_K_L (99G on disk). Spouse and I both judged. We both felt Q6's output might be a bit better, but it was hard to tell. Both models made some mistakes, but inconsistent ones across runs. I probably need a more robust critique and correct step in my skill.
Performance: ~15% speed advantage for Q4 over Q6
- Q4_K_L averaged 241.0 tk/s pp and 23.8 tk/s tg
- Q6_K_L averaged 209.3 tk/s pp and 20.2 tk/s tg
Q6 fits comfortably on my headless server, so I'll probably stick with it. If it were on my laptop I'd probably go with Q4 to use the RAM for other things. I think I'd need a more objective/scalable eval to figure out if I really need Q6 or if I can take the speed win.
tolitius@reddit (OP)
added the feedback from this discussion to: https://github.com/tolitius/cupel/issues/1
will try to post the results next week
jeffwadsworth@reddit
King if you are using 128GB or less. GLM 5.1 is the master if you have the hardware. Too bad you can’t run your suite with it.
tolitius@reddit (OP)
yea, GLM 5.1 benched really well for me, I have it in a second screenshot in the post:
if you install https://github.com/tolitius/cupel and start it:
it will open with a set of sample (but real) run results, as you see above, if you want to look around. GLM 5.1 is there.
jeffwadsworth@reddit
But through an API. I run it locally and get better results via llama.cpp. The Q4 Unsloth version.
tolitius@reddit (OP)
GLM 5.1 local is great. which hardware do you run it on to make it perform locally?
jeffwadsworth@reddit
HP Z8 G4 with dual Xeons and 1.5 TB DDR4. Affordable a year ago.
tolitius@reddit (OP)
that's a beefy box
4 tokens per second? (my thinking: 200 GB/s with NUMA overhead)
jeffwadsworth@reddit
Sadly 1.7 t/s
slypheed@reddit
Try out the 122b mxfp4 version; fits in 128gb much easier.
tolitius@reddit (OP)
"mxfp4" is great advice
this one: nightmedia/Qwen3.5-122B-A10B-Text-mxfp4-mlx?
also do you see a difference in accuracy between that and mlx-community/Qwen3.5-122B-A10B-4bit?
slypheed@reddit
yep, that's the one.
It appears to be about the same quality-wise; honestly hard to tell; it def does dumb stuff, compared to say Opus or even Sonnet probably, but the original 4bit and even 6bit ones did as well.
tolitius@reddit (OP)
super, thanks
xraybies@reddit
My go-tos are mxfp8, or 4 if that doesn't fit. They're better in the sense of speed and lower heat generation... not sure if they're more or less accurate, but no obvious difference.
Gallardo994@reddit
M4 Max 128gb user here. For the love of god I just cannot understand how people get satisfactory results with Qwen3.5 122B. It keeps yapping and yapping at the easiest of tasks, making it honestly unusable for me, just as Qwq-32B was at launch. I use all the recommended sampling settings and I always update my llama-cpp in LM Studio. Qwen3.5 122B always takes much longer to reason before the final answer compared to both GPT-OSS-120B and Qwen3-Coder-Next. I tried both Unsloth Q4KM and nightmedia's mxfp4 text-only version. What am I doing wrong?
slypheed@reddit
Same. fwiw, it seems to be reasonably decent at actual coding.
But general workflow outside of coding (holy crap, I'd give my left kidney for local models to actually use git worktrees correctly!) is probably at least 10x slower than if i just did it manually.
tolitius@reddit (OP)
I found the local LLM universe to be very use case specific. I would love to get a few prompts (if you are ok to share them) that do not work well with Qwen 3.5 122B, but do work well with GPT-OSS-120B
understanding why it does not work could help to understand how to make it work
GPT-OSS-120B is a really good model, so is Nemotron 3 Super 120B A12B 4bit in the same class. But again when I say "good" I only mean these models have "been" good to me, for my use cases
Gallardo994@reddit
https://pastebin.com/2VBagFK2
Qwen3.5-122B Unsloth Q4_K_XL - 6 minutes 8 seconds (1.94s TTFT, 11201 tokens, 30.6 t/s)
GPT-OSS-120B MXFP4 - 18 seconds (1.41s TTFT, 1348 tokens, 80.42 t/s)
Time measured could be off by some seconds as I was using a stopwatch. Metal Llama.cpp v2.12.0 on LM Studio 0.4.9 build 1. Both models gave correct answers.
Qwen uses `Thinking mode for precise coding tasks (e.g. WebDev): temperature=0.6, top_p=0.95, top_k=20, min_p=0.0, presence_penalty=0.0, repetition_penalty=1.0` recommended settings from the model card. GPT uses its own respective recommended settings, Medium reasoning.
I consistently get such long yapping on most of my prompts, except for super-easy ones like "Give me an example of FNV1A64 in C# and C++".
tolitius@reddit (OP)
super, thanks for sharing! will add to the collection and give it a go
I would probably make another separate post to share how it works on "my" Qwen 3.5 122B A10B Q4, to address other people's questions as well
xraybies@reddit
U have 2 issues:
1. System prompt.
2. LM Studio.
Accurate-Egg-6787@reddit
There have been a few posts about how the model can overthink if you use it without tool definitions, like this one [1]. Even without a custom system prompt, I find that having tools in scope helps a lot.
[1] https://www.reddit.com/r/LocalLLaMA/comments/1rf5y13/qwen3535ba3b_thinks_less_if_tools_available/
PraxisOG@reddit
I’m glad to see nemotron 3 super right behind Qwen 122b, it’s still a very capable model and personally I like its talking style more
tolitius@reddit (OP)
it is a little unfair to Nemotron 3 Super 120B A12B since it is not multimodal: one of the tests has images, and Nemotron is unable to score there
otherwise it would be almost exactly where "Qwen 3.5 122B A10B" is
I do really like the model, and I am hopeful with the latest NVIDIA investment in local models + their access to hardware Nemotron would only get better
magikfly@reddit
It's a breath of fucking fresh air to see a human written post here.
adobo_cake@reddit
This is the first thing I noticed and I read the entire thing.
vick2djax@reddit
Would you say you are pleased with the M5 Max 128GB or do you still end up dipping into Opus?
monjodav@reddit
I’m still heavily into opus with my m5 max
tolitius@reddit (OP)
I use both. I can't of course use Anthropic for anything that I need to stay local, such as the use case with the kids I described.
While the gap is definitely closing between the local models and frontier it is still a gap.
I really like my M5 Max 128GB though, it is probably one of the best laptops today, and with 614 GB/s memory bandwidth, I really feel them tiny superpowers when models like Qwen 3.5 35B A3B Q4, GPT-OSS-20B, etc.. are flying at 115 tokens per second. For example, daily reading many research papers and summarizing them takes a lot of time with dense or larger models, but Qwen 3.5 35B A3B Q4 does it fast and well.
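Those numbers line up with a back-of-envelope bandwidth ceiling: tokens/sec can't exceed memory bandwidth divided by bytes read per token. This sketch assumes decode is purely bandwidth-bound and ignores KV-cache reads and compute, so real speeds land below these ceilings:

```python
# Decode-speed ceiling: tokens/sec <= bandwidth / bytes read per token.
# Bytes per token ~= active params (billions) * bytes per weight (Q4 ~ 0.5),
# which gives GB per token directly. Upper bounds only.

def decode_ceiling_tps(bandwidth_gbs: float, active_params_b: float, bytes_per_weight: float) -> float:
    return bandwidth_gbs / (active_params_b * bytes_per_weight)

BW = 614  # GB/s, the M5 Max unified memory bandwidth mentioned above

print(decode_ceiling_tps(BW, 3, 0.5))   # A3B at Q4: ~409 t/s ceiling (observed ~115)
print(decode_ceiling_tps(BW, 10, 0.5))  # A10B at Q4: ~123 t/s ceiling (observed 50-65)
```

The observed numbers sitting well under the ceilings is expected: attention, KV-cache traffic, and prompt processing all eat into the budget.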
Next would be to try Gemma 4 26B A4B Q4 for the same type of tasks
Excellent_Koala769@reddit
How many tps did you get for Gemma 4 31b 4-bit? I have the same laptop and I average about 26-28 tps running it on mlx.
tolitius@reddit (OP)
I did some oMLX bench for Gemma 4 31B 4-bit: depending on the context size it gets a little on the slow side
for example 2.3 tokens/second at 200K, while Qwen 3.5 122B A10B at 200K is 14.7 tokens/second
Choubix@reddit
Hi! Could you please share the size of the context window you can fit when using a 120B model on your 128GB of unified RAM?
tolitius@reddit (OP)
I did some benching with oMLX
Qwen3.5-122B-A10B-4bit with 1K and 200K context windows: +30.3GB for the 200K context (+20GB headroom remaining within my 117GB wired limit)
tokens per second slows down to 14.7/s at 200K, but it is not that bad given the size
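For a rough sense of where tens of GB of context memory go, here is a generic KV-cache estimator. The layer/head numbers below are placeholders for illustration, not Qwen's actual config, and real caches may be quantized:

```python
# KV-cache size ~= 2 (K and V) * layers * kv_heads * head_dim * bytes/elem * tokens.
# The architecture numbers used below are hypothetical, not Qwen's real config.

def kv_cache_gb(tokens: int, layers: int, kv_heads: int, head_dim: int,
                bytes_per_elem: int = 2) -> float:
    """fp16 cache by default (2 bytes per element)."""
    return 2 * layers * kv_heads * head_dim * bytes_per_elem * tokens / 1e9

# e.g. a hypothetical 48-layer model with 8 KV heads of dim 128, fp16 cache:
print(kv_cache_gb(200_000, 48, 8, 128))  # -> ~39.3 GB at 200K tokens
```

Grouped-query attention (few KV heads) is what keeps this tractable; with full multi-head attention the same context would cost several times more.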
No_Individual_8178@reddit
Running qwen 2.5-72b q4 on an m2 max 96GB and the privacy thing resonates hard, same reason I went all local. At 96GB I can't fit the 122b models so I've been stuck in the 72b tier, which is fine for most structured tasks but tool calling gets shaky. Curious whether you noticed a big jump from 72b to 122b specifically on multi-turn tool use, or if the main difference is more about general reasoning quality.
DeepOrangeSky@reddit
Have you tried Qwen3 Coder Next 80b? It is MoE instead of dense, but, people kept saying it was the strongest coding model around up till ~230b back before all the new Qwen models came out.
Qwen2.5 72b is dense, right? So, on the one hand that should make it a lot stronger than the Qwen 80b MoE if they had come out at the same time, but it is a lot older too, so I'm not sure where they shake out compared to each other, given the 80b is newer and benefits from those advancements.
Did you try Qwen3.5 27b at a higher quant, too (since that one is supposed to be super strong for its size, and also you'd have enough memory to run it at significantly higher than Q4)? And if so, how did it compare to Qwen2.5 72b at q4?
Imaginary-Unit-3267@reddit
Not the person you're asking, but I've tried 80B and found it's really good at simple web apps but gets confused at anything really complex. 35B-A3B does a bit better and (on my machine) is a few t/s faster, but sometimes gets stuck in reasoning loops for thousands of tokens rethinking the same things over and over before responding. I still haven't fully gotten a sense of which is better for which tasks; and now I have Gemma 26B-A4B to compare them to, and it seems more "stable" than either so far, in the sense that it is less likely to suggest something that just flat out doesn't work or to (as both Qwens LOVE to do) quietly "simplify" something without asking in order to escape work.
No_Individual_8178@reddit
Haven't tried the coder 80b, 96GB would be tight for it. I do have qwen3.5 27b on the machine and it's solid for reasoning but for tool calling specifically I ended up sticking with the 9b. The 27b didn't justify the extra vram for structured output tasks. Might be different for coding though.
tolitius@reddit (OP)
I did, Qwen 3.5 family, in my tests, performed better than Qwen 2.5 at multi turns / tool calling
what are you pushing your 96GB to (iogpu.wired_limit_mb)? you can probably push it to 80GB to test mlx-community/Qwen3.5-122B-A10B-4bit
it takes ~70GB + the K/V | context
10GB may not be enough for actual work, but should probably show you the difference specifically for your "business" use case
No_Individual_8178@reddit
I haven't touched wired_limit_mb, just running Ollama defaults so yeah probably leaving headroom. Thanks for the pointer, might try bumping it to test 122b even with tight context. I have moved to 3.5 since then, and been on qwen3.5:9b for most tool calling stuff, way better than 2.5 was.
Thrumpwart@reddit
If you like Qwen 3.5 122b at Q4, check out the Apex I-Quality quant of it. It’s smarter and faster on Apple Silicon in my experience. I’ve been using it for a few days and it’s now my favourite model to run on the Mac.
tolitius@reddit (OP)
could not find the MLX version of it, is this the GGUF that you use?
will give it a try, but since it is llama.cpp, would it not be slower on Apple Silicon vs. oMLX?
thinking:
Thrumpwart@reddit
AFAIK it has not yet been ported to MLX yet but I really hope it will be.
I’ve never tried oMLX. Worth it?
tolitius@reddit (OP)
oMLX is gold!
the author (u/cryingneko) is really helpful and super active
Thrumpwart@reddit
Jesus you weren’t kidding, this is amazing!
Thrumpwart@reddit
Thank you, will give it a shot!
qubridInc@reddit
Qwen 3.5 122B stays king locally because it hits the rare sweet spot of frontier-level usefulness, real speed, and actual privacy.
moneylab_ai@reddit
The comparison between full-weight dense models and MoEs at different sizes is always going to be a little apples-to-oranges, but that's kind of the point — when you're running local, you care about what fits in your VRAM and what gives you the best output at that memory budget. Qwen 3.5 122B being a dense model that you can actually run on consumer hardware (with enough RAM) is its real advantage.
What I've found practically useful is tracking tokens/second at the quantization level you'll actually use daily, not just benchmark scores. A model that scores 2% higher on MMLU but runs at half the speed in Q4 isn't actually better for most workflows. The M5 Max with 128GB unified memory is an interesting test bed because it removes the multi-GPU complexity — you're testing the model, not your parallelism setup.
Curious whether you tested any long-context performance. That's where I've seen the biggest quality divergence between quant levels — Q4 and Q6 can score identically on short prompts but fall apart very differently past 16K context.
tolitius@reddit (OP)
you probably meant MoE, and yes, definitely an advantage
yep, that is exactly what I found with Gemma 4 31B, and the choice I had to make before with Qwen 3.5 27B as they are really good, but Qwen 3.5 122B is just more practical because of the speed
might need to do better testing, but what I found with longer contexts is that Q4, at rare times, goes into repetition looping, whereas Q6 does not. I could not attribute it to the length of the context vs. a certain / particular sequence of tokens that causes it
moneylab_ai@reddit
Good catch — yeah, MoE, not dense. Got my wires crossed. The Q4 repetition looping at longer contexts is interesting — that tracks with what I've seen too. It feels like a quantization artifact that only surfaces when the model needs to maintain coherence over longer dependency chains. Have you noticed it more with certain types of content (e.g., code vs. prose)?
tolitius@reddit (OP)
that's a difficult one, I only saw it a few times. some were with code, some were not. the way I found to fix it (reacting after it already happened) was to reword the prompt a little.
which is not super deterministic, as while changing the token sequence does not affect the context length much, it does affect the mythical pathways of Ks, Qs and Vs. I suspect model temperature would affect it as well
PiaRedDragon@reddit
I want to try this one, but I don't have the kit. Can you test and let us know if it is any good?
If it is I will bite the bullet and get a 128GB Studio.
https://huggingface.co/baa-ai/Qwen3.5-122B-A10B-RAM-100GB-MLX
MasterKoolT@reddit
If you can, wait for the M5 studios. They added neural accelerators to the GPUs so prompt processing is 4x faster than M4
PiaRedDragon@reddit
Good call. I was expecting them in the Spring release and held off buying a 512GB version of the M3, then they stopped selling them and increased the prices of the 256GBs, lol.
tolitius@reddit (OP)
sure will do, later tonight. will let you know
it is 93GB, not a lot of room for KV Cache / Context
and Qwen DOES like to think
but anything goes
RSultanMD@reddit
Wish I could get this to work lol
tolitius@reddit (OP)
you can try to install and run cupel
if you have oMLX, Ollama, LM Studio, SGLang, etc. running, it will auto-discover them
(you might need to add an API key for some if they require it)
if your hardware does not allow you to install / run LLMs, you can add something like OpenRouter:
you then can run a sample set of benchmarks or use models that would be also self discovered to create your own set
on local as well as remote models
let me know if you hit any problems
_derpiii_@reddit
Have you run into any thermal throttling?
tolitius@reddit (OP)
I did not, but.. when running multiple tests:
checking:
would be a good idea
will do
_derpiii_@reddit
Thank you for that power trick!
Zc5Gwu@reddit
Missing stepfun 3.5 flash.
tolitius@reddit (OP)
looks like it is 111GB at 4 bits (https://huggingface.co/mlx-community/Step-3.5-Flash-4bit)
it would be a bit difficult, but I can try
slypheed@reddit
It should run fine for single prompts, but context quickly overloads it (from testing on an m4 max 128).
I'd possibly use it over qwen 122b if not for that.
rosstafarien@reddit
Need to see some Gemma4 quants before I get too excited.
tolitius@reddit (OP)
makes sense
will do Gemma quants difference
I am interested myself
catplusplusok@reddit
Try MiniMax M2.5, I find it hard to beat for coding on a 128GB unified memory device (with some quantization / light REAP to fit)
tolitius@reddit (OP)
you mean the 3bit version?
it is 100GB, would not coding need some context to slurp the repos / parts of repos?
do you use a different one / how do you manage the K/V cache / context shortage?
DinoAmino@reddit
Neat opinions you got there. Guess you totally missed the Granite 4 releases, but that's easy to miss considering all the shilling in this sub centers on non-Western models.
tolitius@reddit (OP)
ah, yes
thanks for the correction, it is a bit smaller for the use case, that's probably why I did not mention it in the timeline, but definitely worth mentioning
Negative-Thinking@reddit
Hah, that is exactly the model I use on my M4 Mac 128GB and I totally agree - qwen is good (not as good as sonnet, but passable in many scenarios) . I am using Claude Pro for planning and code review, but delegate implementation of the plan to qwen 3.5. Qwen running through omlx. Claude sonnet/opus for final code review
tolitius@reddit (OP)
do you see a pattern in improvements Claude finds when reviewing Qwen's work?
I wonder whether you have enough intel from these reviews to summarize a do/don't markdown for Qwen specifically
Negative-Thinking@reddit
Nah, it's random and not enough data to come to any conclusions yet
RevolutionaryGold325@reddit
Can you please add the Qwen-3.5-397b-UD-IQ2_XXS
I want to see if others can reproduce my results of getting better output than with the 122b-Q4
tolitius@reddit (OP)
do you know whether it is available in MLX?
in Q2 it would probably need around 109GB, which leaves not much room for the context. plus in GGUF it would pay for CPU to GPU coordination (llama.cpp)
I'll certainly give it a try, but would appreciate any intel on how to best run it on 128GB Apple Silicon hardware
RevolutionaryGold325@reddit
Yeah that is a bit close to the limits. I'm using dgx spark for running it. I have not looked into MLX.
CarelessOrdinary5480@reddit
Such a goofy benchmark. Pitting a full-weight model against a MoE, at wildly different sizes.
tolitius@reddit (OP)
yea, agree: not apples to apples
but I was coming from "the best" I can run on 128GB (in reality ~117GB of unified RAM)
the best, accuracy wise from Gemma is "`Gemma 4 31B`" full precision
the best from Qwen is "`122B A10B Q4`"
and as anyone here I would love if Google releases Gemma 124B
CarelessOrdinary5480@reddit
Yea, I get it, I've considered that benchmark too. The world of agentic stuff for us on the strix halo is still qwen 3 coder or 3.5 122b.
Happy-Register3367@reddit
cool benchmarks, but I’d love to see more real world comparison (coding, reasoning tasks, long context ..). sometimes the headline numbers don't tell the full story.
tolitius@reddit (OP)
fair point
I was contemplating how to make it generic enough to be useful for all, but not so generic that it is useful to no one. I do have this breakdown:
there are more in a README here: https://github.com/tolitius/cupel
did not want to overload on pictures in the post
pseudonerv@reddit
You can run 120b a10b at q6, which works far better than q4 for me
tolitius@reddit (OP)
I did try Q6, it is significantly slower and eats a lot of RAM
interestingly enough it performed exactly the same as Q4 (for my use case that is)
also tried Q5, and landed on Q4, since it is really good and fast
Expensive-Paint-9490@reddit
"Still" the king for a model published when, one month ago? By a lab that is consistently SOTA. I hope you weren't expecting a Gemma model less than half its size to outperform it.
Far-Low-4705@reddit
thats not really a fair point
122b is a MOE, you are comparing it to a 31b DENSE model. there is a good chance it will be close just given that alone
tolitius@reddit (OP)
good point
interestingly enough "Gemma 4 31B" full precision did outperform "Qwen 122B A10B Q4" on a few prompts
however the reason Qwen pulls forward in my case is speed. It is ~50 to 65 tokens per second on M5 Max vs. "Gemma 4 31B" full precision at 7 tokens per second
Expensive-Paint-9490@reddit
The old (like, Mixtral 8x7B old) rule of thumb is that a MoE model performs like a dense model with the geometric mean of its total and active parameters. So (122*10)^0.5 ≈ 35. So, Gemma-4 dense at 31B can compare.
I was referring instead to the 26B-A4B, which by this rule is equivalent to a 10B model. I didn't get that dense 31B was in the picture.
reneil1337@reddit
neeed ~120B Gemma4 MoE
jgulla@reddit
Very interesting. Appreciate the detailed post!