A few early (and somewhat vague) LLM benchmark comparisons between the M5 Max Macbook Pro and other laptops - Hardware Canucks
Posted by themixtergames@reddit | LocalLLaMA | View on Reddit | 59 comments
nicoloboschi@reddit
It's interesting to see the memory bandwidth becoming a key factor in LLM performance on different hardware. Memory is often a bottleneck for AI agents, and we built Hindsight to alleviate some of these limitations. https://github.com/vectorize-io/hindsight
Blues227@reddit
How does the M5 Pro do in those benchmarks against it?
StardockEngineer@reddit
Need to see the prefill. Only thing that matters. I can already guesstimate the rest.
Look_0ver_There@reddit
I have a work supplied M4 Max laptop. Using the same 4B model as OOP's images are referencing, here's what I'm seeing:
llama-bench operating on a regular GGUF: ~865 PP512
mlx_lm.benchmark operating on an MLX (Apple native) quant of the same model: ~890 PP512
This result seems curiously low for a Q4_K quant of a 4B model. On my personal 7900XTX, I see a PP512 of 2921 for the same model, which even seems low for this video card. Most 4B models would be pushing >6K
Running on an MLX 8-bit version of the Qwen-Coder-Next, which is an 80B MoE model, on the M4 Max laptop, I see PP512 of ~1013, and PP2048 of ~1261, which seems more appropriate/expected.
I guess he didn't want to post the PP scores cos they are admittedly fairly "sucky", but with so many models to choose from (Qwen 3.5 is all the rage now with its variety of model sizes) why choose an old model that doesn't seem to perform terribly well on, well, anything?
MiaBchDave@reddit
Different GPU cores on Apple M5 vs M4, with new Tensor units per GPU. One guess what they make less "sucky."
Few_Size_4798@reddit
The minimum Mac with this configuration has 48 GB of memory.
So what's stopping them from testing a 32 GB+ model, so that the 5090 chokes, the 395+ finally pulls ahead of it, and the M5 Max shows its undeniable advantages?
People are asking to test the larger models? We'll have to wait a long time.
__JockY__@reddit
Inference almost doesn’t matter at this point. It’s all about prompt processing speeds. It’s telling that those data are not shown.
alexp702@reddit
Pp speeds are much better with M5Max: https://youtu.be/XGe7ldwFLSE?si=AFTdqPV4Np0gsgj-
__JockY__@reddit
I stopped clicking Reddit YouTube links years ago! Why?
sixyearoldme@reddit
Can you please explain?
__JockY__@reddit
There are two basic speed metrics: prompt processing (how fast your prompt is read in) and inference, aka token generation (how fast the response is written out).
Both of these slow down with longer contexts; the longer your prompt, the slower things get.
Inference is basically a solved problem on unified RAM systems like the M5. It’s fast enough to be useable. Prompt processing, however, is another matter - it’s highly compute bound, which is where GPU tensor cores accelerate things.
On unified RAM systems… less accelerated. Much slower. Far less impressive when shown in fancy graphs and charts.
That’s why the charts only show inference speeds: it makes the M5 look good. The deliberate omission of prompt processing speeds tells us that either (a) they suck, or (b) the creator of the charts is clueless.
There’s a good deal of evidence for (b) because none of the charts actually specify at what context lengths the tests were done, which leads me to assume the creator used a tiny prompt to make the numbers look good. I’d wager good money that’s the case.
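The effect is easy to see with a back-of-the-envelope latency model. All rates below are hypothetical, picked only to illustrate why small-prompt charts flatter a machine with slow prefill:

```python
# Back-of-the-envelope latency model for the two metrics above.
# All rates are hypothetical, for illustration only.

def response_time(prompt_tokens, output_tokens, pp_rate, tg_rate):
    """Seconds until the full response finishes, ignoring model load time."""
    prefill = prompt_tokens / pp_rate   # prompt processing, compute-bound
    decode = output_tokens / tg_rate    # token generation, bandwidth-bound
    return prefill + decode

# Tiny prompt: prefill is negligible, so tok/s charts look great.
short = response_time(200, 500, pp_rate=900, tg_rate=60)
# 100k-token agent context: prefill dominates the whole turn.
long_ctx = response_time(100_000, 500, pp_rate=900, tg_rate=60)
print(f"short: {short:.1f}s, long: {long_ctx:.1f}s")
```

With these made-up rates, the long-context turn spends roughly 111 s in prefill versus 8 s generating, which a tok/s-only chart would never show.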
MrPecunius@reddit
M5 is about 3.5X as fast as M4 series for prefill, so the numbers should be decent.
jerieljan@reddit
For starters, just the addition of a plain timer from start to finish (Time to Complete) would be better than just tok/s. From cold start to response in each of these.
Tokens/sec is good, but it's not the complete picture, since a lot of time also goes to loading the model and actually processing the prompt before it even starts outputting a token.
That review isn't really great at showing AI performance either (if not outright misleading), since it looks like a single prompt, just reading off the tok/s LM Studio returns, and that's it. If that's all you do, it's great, but there's far more to AI nowadays than just text completions.
Sure, it's arguable that this video isn't really an AI benchmark test and is just one small portion of an overall review, but man, I think we need good benchmarks that are easy for normie reviewers like this to run. Something task-oriented (i.e., make it run opencode or something to accomplish a task), or an eval run or two for each machine, would be better.
rditorx@reddit
That would be mostly measuring SSD speed and time to load the model for short contexts and little or no reasoning effort, and mostly prompt processing and token generation for long contexts and high reasoning effort, so the values would vary wildly, depending on the use case.
More primitive values like prompt processing, token generation and SSD read speed make it easier to get a complete picture for all cases because you can calculate your own distribution.
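A sketch of that calculation from the primitive values (SSD read speed, prompt processing rate, token generation rate) — every input below is hypothetical:

```python
# Building an end-to-end time estimate from the primitive numbers:
# SSD read speed (cold model load), prompt processing, token generation.
# All inputs are hypothetical, for illustration only.

def end_to_end_seconds(model_gb, ssd_gb_per_s, prompt_tok, out_tok,
                       pp_rate, tg_rate, cold_start=True):
    load = model_gb / ssd_gb_per_s if cold_start else 0.0  # one-time load
    return load + prompt_tok / pp_rate + out_tok / tg_rate

# Short chat from a cold start: the model load is a large share of the time.
print(end_to_end_seconds(8, 4, 500, 300, pp_rate=900, tg_rate=60))
# Long agent turn with the model already resident: prefill dominates.
print(end_to_end_seconds(8, 4, 60_000, 300, pp_rate=900, tg_rate=60,
                         cold_start=False))
```

Plugging in your own numbers for each term is exactly the "calculate your own distribution" the comment describes.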
john0201@reddit
I think he’s intending to mean prefill, which is more compute limited.
AdventurousFly4909@reddit
I think because people are putting 100k+ tokens into their LLMs.
themixtergames@reddit (OP)
So, as we know, the real deal is actually prompt processing. You can see in the latest video by Alex Ziskind that the M5 Max got a 50% improvement in PP over the M3 Ultra.
aimark42@reddit
Gemma 3B Q4_K, which really doesn't tell us much with such a small model.
Can someone please test a decent size model like gpt-oss-120b
iMrParker@reddit
Or minimal, GLM 5, qwen 397? Unified memory is boasted about a lot, but filling all of it usually results in extremely long prompt processing times, i.e. 10-20 minutes per turn with an agent if coding.
aimark42@reddit
This is the M5 Max; we only have 128GB to play with. This isn't an M5 Ultra.
iMrParker@reddit
GPT 120b is still pretty small for 128gb of RAM even with high contexts. But yeah it would be better to see over Gemma 3b
misha1350@reddit
Not that it matters because GPT OSS 120B is already outdated with the existence of Qwen 3.5 122B A10B
alexp702@reddit
He does test that later on in the video??
New_Comfortable7240@reddit
Bro where is the AMD AI 395? It means AMD is on par or wins?
ImportancePitiful795@reddit
Depends. That's the low-power laptop version. And also LM Studio is being used 🤮
Anarchaotic@reddit
HP Zbook. We know what the results are going to be - it's not surprising. The AI Max 395+ are great for running MoE models with lots of context + large sizes (120B) - but are slow for dense models.
tiger_ace@reddit
I think these results are coherent.
Basically, if the model can fit on the 5090, the performance is on par with the M5 Max.
However, if the model CANNOT fit in the 5090's 24GB of VRAM (e.g., the 32B param model tested but not shown), then the inference speed is higher on the M5 Max due to its unified memory architecture.
This is why there is some hype over the M5 Ultra which could be double the M5 Max memory bandwidth since in the past they duct taped two Max cores together.
Lopsided_Employer_40@reddit
The 5090 has 32 GB of VRAM. Also, you can quantize models to NVFP4 with almost no loss.
Qwen 3.5 27B runs smoothly and leaves enough room for context or other small models.
Nicollier88@reddit
I think he’s referring to the mobile variant of the rtx 5090, which has 24GB of vram
indicava@reddit
Da fuq?!? TIL…
ImportancePitiful795@reddit
The mobile 5090 (10496 cores) is a cut-down 5080 (10752 cores). The normal 5090 has 21760 cores.
The same applies to bandwidth: 896/960/1800 GB/s respectively.
And that's just the rough maths, because the mobile 5090 is cut down further by power restrictions, making it effectively slower than the RTX 5070 Ti.
RTX 5090M performance sits between the 5060 Ti and the 5070.
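Those bandwidth figures translate fairly directly into a token-generation ceiling, since decode is roughly memory-bandwidth-bound: each generated token needs one pass over the active weights, so bandwidth divided by model size gives a loose upper bound on tok/s. The bandwidth numbers are the ones quoted above; the model size is a made-up ~4B Q4 quant:

```python
# Loose upper bound on decode (token generation) speed from memory bandwidth.
# Decode reads the active weights once per token, so tok/s <= bandwidth / size.
# Model size here is hypothetical (~4B params at Q4).

def tg_ceiling(bandwidth_gb_s, model_gb):
    return bandwidth_gb_s / model_gb

model_gb = 2.5
for name, bw in [("5090 mobile", 896), ("5080 desktop", 960),
                 ("5090 desktop", 1800)]:
    print(f"{name}: <= {tg_ceiling(bw, model_gb):.0f} tok/s")
```

Real numbers come in below the ceiling (KV cache reads, kernel overhead), but the ratio between machines tracks the bandwidth ratio.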
john0201@reddit
Now unplug both and rerun…
DistanceSolar1449@reddit
I doubt anyone’s running a 5090 mobile unplugged, ever. lol.
john0201@reddit
Seems like a slight negative in a laptop.
ANR2ME@reddit
Yeah, the 5090 mobile GPU has similar performance to a 5070 Ti desktop GPU.
grizwako@reddit
Yeah, people don't really get macs to run small models.
Even the power draw itself is not really that important.
If you already have a PC, you can get a GPU for cheaper.
But Apple looks to be really killing Nvidia in the enthusiast market and with smaller companies.
I wonder if they will make a move into datacenter market too.
tiger_ace@reddit
yep, m5 ultra mac studio with 512GB RAM @ 1.2TB/s memory bandwidth for $10K is actually a good deal depending on exactly what you're trying to do
i think they are well aware of the local LLM play but very unlikely on the datacenter play, they've openly stated lower CAPEX spend on their earnings calls and moving into that would require them to disclose massive spend
DistanceSolar1449@reddit
It won’t be $10k
If it was able to be sold at $10k Apple wouldn’t have dropped the M3 Ultra 512gb
krystof24@reddit
They won't because they can't. Their main niche is getting a lot of VRAM for relatively (to Nvidia) cheap, but I don't see how that could scale to datacenter size. NVIDIA is still number one on compute and ecosystem.
Position_Emergency@reddit
RTX 5090 has 1,792 GB/s memory bandwidth!
john0201@reddit
Mobile “5090” (which is a cut down 5080) has 896GB/s
droptableadventures@reddit
Not in the mobile version, as this is a comparison of laptops.
Creative-Signal6813@reddit
benchmark conditions never include sustained load. laptop 5090 at 155w will throttle under extended workloads. m5 max holds clock speed flat for hours.
if ur running one query at a time the peak numbers matter. if ur running an agent all day, ur buying the sustained number, not what's in the video.
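The peak-vs-sustained point can be sketched with made-up numbers: a GPU that bursts high and then throttles can lose an hour-long session to one with a lower peak that holds its clocks flat.

```python
# Peak vs sustained throughput over a long session (all numbers made up).
# Burst at peak for a short window, then drop to the sustained rate.

def tokens_over_session(seconds, peak_tps, sustained_tps, burst_seconds):
    burst = min(seconds, burst_seconds) * peak_tps
    rest = max(0, seconds - burst_seconds) * sustained_tps
    return burst + rest

hour = 3600
throttled = tokens_over_session(hour, peak_tps=120, sustained_tps=70,
                                burst_seconds=300)
flat = tokens_over_session(hour, peak_tps=90, sustained_tps=90,
                           burst_seconds=300)
print(throttled, flat)  # the machine with the lower peak wins the hour
```

For a single query inside the burst window the throttling machine still wins, which is the "one query at a time" caveat above.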
EvilGuy@reddit
Pretty impressive for a laptop I guess?
For comparison I get 130-ish tokens a sec with a 3090 in an old 3800x with 2400 Mhz DDR4 ram that I built from old spare parts I had sitting around and the 3090 was about $800.
No fair comparing these $5000 apple machines to real computers though I guess. ;)
Anarchaotic@reddit
I mean yeah, of course any of the higher end 3/4/5 series RTX GPUs are faster, look at their bandwidth speeds. But that's only for small models that fit entirely in VRAM.
Your 3090 will choke the second you load anything over 24GB into it, which is where the Macbook will start seeing real advantages.
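Whether a model "fits" is checkable with a rough estimate: weights plus KV cache versus VRAM. The architecture numbers below are hypothetical; the KV formula (2 bytes each for fp16 K and V, per layer, per KV head, per head dimension, per token) is the standard estimate.

```python
# Rough check of whether a model plus its KV cache fits in VRAM.
# Architecture numbers are hypothetical; the KV-cache formula is the
# standard fp16 estimate: 2 (K and V) * layers * kv_heads * head_dim
# * 2 bytes, per token of context.

def vram_needed_gb(weights_gb, layers, kv_heads, head_dim, ctx_tokens):
    kv_bytes = 2 * layers * kv_heads * head_dim * 2 * ctx_tokens
    return weights_gb + kv_bytes / 1e9

# Hypothetical 32B-class dense model at Q4 (~18 GB weights), 32k context.
need = vram_needed_gb(18, layers=64, kv_heads=8, head_dim=128,
                      ctx_tokens=32_768)
print(f"{need:.1f} GB needed; fits in 24 GB: {need <= 24}")
```

With these assumed numbers the KV cache alone adds ~8.6 GB, pushing the total past 24 GB even though the weights fit, which is exactly the spillover case where a unified-memory machine pulls ahead.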
Lorian0x7@reddit
The macbook will not choke with models over 24gb, but your wallet definitely will.
dtham@reddit
Running the Deepseek R1 Distill Qwen 8B?
Lorian0x7@reddit
Did you casually forget about prompt processing? Btw, a 5090 in a laptop is not really a 5090; performance-wise it's on par with a desktop 5070.
gkon7@reddit
Sick of these TG-only benchmarks. We can already guess this.
Euphoric_Emotion5397@reddit
Cost of the machine divided by number of tokens (cost per token) would be a better metric.
But why do Apple users like to test only 8B models? hehe
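A sketch of that cost-per-token metric, amortizing the hardware price over its usable life. Every number here (prices, throughput, lifetime, utilization) is made up, purely to show the shape of the calculation:

```python
# Amortized hardware cost per million generated tokens (all numbers made up).

def usd_per_million_tokens(machine_usd, tok_per_s, lifetime_hours,
                           utilization=0.25):
    # utilization = fraction of its life the machine spends generating
    total_tokens = tok_per_s * 3600 * lifetime_hours * utilization
    return machine_usd / total_tokens * 1e6

hours = 3 * 365 * 24  # three years, in hours
print(usd_per_million_tokens(5000, 60, hours))   # hypothetical $5k laptop
print(usd_per_million_tokens(800, 130, hours))   # hypothetical used 3090 box
```

Electricity would need to be added for a fair comparison, which is where the laptop's low power draw claws some of the gap back.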
AvailableMycologist2@reddit
the real question is prompt processing speed which they didn't show. for local LLM usage the bottleneck is usually PP not TG, especially with long context. that said the 614GB/s bandwidth on the M5 Max is impressive for a laptop. curious to see how the 128GB version handles larger MoE models
Individual-Source618@reddit
no
themixtergames@reddit (OP)
He also included this graph with incorrect labels in the spirit of LLM benchmarks
Eden1506@reddit
What is the prompt processing speed?
anonutter@reddit
would be cool to see token/s/usd
ohwut@reddit
The M5 Max is also going to be ~2x the cost of a 5080 Mobile equipped laptop in a lot of cases. But as a Mac user, for all the other benefits, the price is irrelevant; I don't have the option of buying a 5080 anyway.
Ill_Barber8709@reddit
I'm curious about big MOE (like GPT-OSS 120B) on the 128GB version (as well as Devstral-2 123B)
mattate@reddit
I think a better test would be running something that requires CPU offloading; that is where the M5 will really shine.