A few early (and somewhat vague) LLM benchmark comparisons between the M5 Max Macbook Pro and other laptops - Hardware Canucks
Posted by themixtergames@reddit | LocalLLaMA | View on Reddit | 59 comments
nicoloboschi@reddit
It's interesting to see the memory bandwidth becoming a key factor in LLM performance on different hardware. Memory is often a bottleneck for AI agents, and we built Hindsight to alleviate some of these limitations. https://github.com/vectorize-io/hindsight
Blues227@reddit
How does the M5 Pro do in those benchmarks against it?
StardockEngineer@reddit
Need to see the prefill. Only thing that matters. I can already guesstimate the rest.
Look_0ver_There@reddit
I have a work supplied M4 Max laptop. Using the same 4B model as OOP's images are referencing, here's what I'm seeing:
llama-bench operating on a regular GGUF: ~865 PP512
mlx_lm.benchmark operating on an MLX (Apple native) quant of the same model: ~890 PP512
This result seems curiously low for a Q4_K quant of a 4B model. On my personal 7900XTX, I see a PP512 of 2921 for the same model, which even seems low for this video card. Most 4B models would be pushing >6K
Running on an MLX 8-bit version of the Qwen-Coder-Next, which is an 80B MoE model, on the M4 Max laptop, I see PP512 of ~1013, and PP2048 of ~1261, which seems more appropriate/expected.
I guess he didn't want to post the PP scores cos they are admittedly fairly "sucky", but with so many models to choose from (Qwen 3.5 is all the rage now with its variety of model sizes) why choose an old model that doesn't seem to perform terribly well on, well, anything?
MiaBchDave@reddit
Different GPU cores on Apple M5 vs M4, with new Tensor units per GPU. One guess what they make less "sucky."
Few_Size_4798@reddit
The minimum Mac with this configuration has 48 GB of memory.
So what's stopping them from testing a 32 GB+ model, so that the 5090 chokes, the 395+ finally pulls ahead of it, and the M5 Max shows its undeniable advantages?
People are asking to test the larger models? We'll have to wait a long time.
__JockY__@reddit
Inference almost doesn’t matter at this point. It’s all about prompt processing speeds. It’s telling that those data are not shown.
alexp702@reddit
Pp speeds are much better with M5Max: https://youtu.be/XGe7ldwFLSE?si=AFTdqPV4Np0gsgj-
__JockY__@reddit
I stopped clicking Reddit YouTube links years ago! Why?
sixyearoldme@reddit
Can you please explain?
__JockY__@reddit
There are two basic speed metrics: prompt processing (how fast your prompt is read in) and inference, aka token generation (how fast the response is written out).
Both of these slow down with longer contexts; the longer your prompt, the slower things get.
Inference is basically a solved problem on unified RAM systems like the M5. It’s fast enough to be useable. Prompt processing, however, is another matter - it’s highly compute bound, which is where GPU tensor cores accelerate things.
On unified RAM systems… less accelerated. Much slower. Far less impressive when shown in fancy graphs and charts.
That’s why the charts only show inference speeds: it makes the M5 look good. The deliberate omission of prompt processing speeds tells us that either (a) they suck, or (b) the creator of the charts is clueless.
There’s a good deal of evidence for (b) because none of the charts actually specify at what context lengths the tests were done, which leads me to assume the creator used a tiny prompt to make the numbers look good. I’d wager good money that’s the case.
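The effect is easy to see with a back-of-the-envelope latency model. All rates below are hypothetical, picked only to illustrate why small-prompt charts flatter a machine with slow prefill:

```python
# Back-of-the-envelope latency model for the two metrics above.
# All rates are hypothetical, for illustration only.

def response_time(prompt_tokens, output_tokens, pp_rate, tg_rate):
    """Seconds until the full response finishes, ignoring model load time."""
    prefill = prompt_tokens / pp_rate   # prompt processing, compute-bound
    decode = output_tokens / tg_rate    # token generation, bandwidth-bound
    return prefill + decode

# Tiny prompt: prefill is negligible, so tok/s charts look great.
short = response_time(200, 500, pp_rate=900, tg_rate=60)
# 100k-token agent context: prefill dominates the whole turn.
long_ctx = response_time(100_000, 500, pp_rate=900, tg_rate=60)
print(f"short: {short:.1f}s, long: {long_ctx:.1f}s")
```

With these made-up rates, the long-context turn spends roughly 111 s in prefill versus 8 s generating, which a tok/s-only chart would never show.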
MrPecunius@reddit
M5 is about 3.5X as fast as M4 series for prefill, so the numbers should be decent.
jerieljan@reddit
For starters, just the addition of a plain timer from start to finish (Time to Complete) would be better than just tok/s. From cold start to response in each of these.
Tokens/sec is good, but it's not the complete picture, since a lot of time also goes to loading the model and actually processing the prompt before it even starts outputting a token.
That review isn't really great at showing AI performance either (if not outright misleading), since it looks like a single prompt, just reading off the tok/s LM Studio returns, and that's it. If that's all you do, it's great, but there's far more to AI nowadays than just text completions.
Sure, it's arguable that this video isn't really an AI benchmark test and is just one small portion of an overall review, but man, I think we need good benchmarks that are easy for normie reviewers like this to run. Something task-oriented (i.e., make it run opencode or something to accomplish a task), or an eval run or two for each machine, would be better.
rditorx@reddit
That would be mostly measuring SSD speed and time to load the model for short contexts and little or no reasoning effort, and mostly prompt processing and token generation for long contexts and high reasoning effort, so the values would vary wildly, depending on the use case.
More primitive values like prompt processing, token generation and SSD read speed make it easier to get a complete picture for all cases because you can calculate your own distribution.
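A sketch of that calculation from the primitive values (SSD read speed, prompt processing rate, token generation rate) — every input below is hypothetical:

```python
# Building an end-to-end time estimate from the primitive numbers:
# SSD read speed (cold model load), prompt processing, token generation.
# All inputs are hypothetical, for illustration only.

def end_to_end_seconds(model_gb, ssd_gb_per_s, prompt_tok, out_tok,
                       pp_rate, tg_rate, cold_start=True):
    load = model_gb / ssd_gb_per_s if cold_start else 0.0  # one-time load
    return load + prompt_tok / pp_rate + out_tok / tg_rate

# Short chat from a cold start: the model load is a large share of the time.
print(end_to_end_seconds(8, 4, 500, 300, pp_rate=900, tg_rate=60))
# Long agent turn with the model already resident: prefill dominates.
print(end_to_end_seconds(8, 4, 60_000, 300, pp_rate=900, tg_rate=60,
                         cold_start=False))
```

Plugging in your own numbers for each term is exactly the "calculate your own distribution" the comment describes.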
john0201@reddit
I think he’s intending to mean prefill, which is more compute limited.
AdventurousFly4909@reddit
I think because people are putting 100k+ tokens into their LLMs.
themixtergames@reddit (OP)
So, as we know, the real deal is actually prompt processing. You can see in the latest video by Alex Ziskind that the M5 Max got a 50% improvement in PP over the M3 Ultra.
aimark42@reddit
Gemma 3B Q4_K, which really doesn't tell us much with such a small model.
Can someone please test a decent size model like gpt-oss-120b
iMrParker@reddit
Or minimal, GLM 5, qwen 397? Unified memory is boasted about a lot, but filling all of it usually results in extremely long prompt processing times, i.e. 10-20 minutes per turn with an agent if coding.
aimark42@reddit
This is the M5 Max; we only have 128GB to play with. This isn't an M5 Ultra.
iMrParker@reddit
GPT 120b is still pretty small for 128gb of RAM even with high contexts. But yeah it would be better to see over Gemma 3b
misha1350@reddit
Not that it matters because GPT OSS 120B is already outdated with the existence of Qwen 3.5 122B A10B
alexp702@reddit
He does test that later on in the video??
New_Comfortable7240@reddit
Bro where is the AMD AI 395? It means AMD is on par or wins?
ImportancePitiful795@reddit
Depends. That's the low-power laptop version. And also LM Studio is being used 🤮
Anarchaotic@reddit
HP Zbook. We know what the results are going to be - it's not surprising. The AI Max 395+ are great for running MoE models with lots of context + large sizes (120B) - but are slow for dense models.
tiger_ace@reddit
I think these results are coherent.
Basically, if the model can fit on the 5090, the performance is on par with the M5 Max.
However, if the model CANNOT fit in the 5090's 24GB of VRAM (e.g., the 32B param model tested but not shown), then the inference speed is higher on the M5 Max due to its unified memory architecture.
This is why there is some hype over the M5 Ultra which could be double the M5 Max memory bandwidth since in the past they duct taped two Max cores together.
Lopsided_Employer_40@reddit
The 5090 has 32 GB of VRAM. Also, you can quantize models to NVFP4 with almost no loss.
Qwen 3.5 27B runs smoothly and leaves enough room for context or other small models.
Nicollier88@reddit
I think he’s referring to the mobile variant of the rtx 5090, which has 24GB of vram
indicava@reddit
Da fuq?!? TIL…
ImportancePitiful795@reddit
The mobile 5090 (10496 cores) is a cut-down 5080 (10752 cores). The normal 5090 has 21760 cores.
The same applies to bandwidth: 896/960/1800 GB/s respectively.
And that's just the rough maths, because the mobile 5090 is cut down further by power restrictions, making it effectively slower than the RTX 5070 Ti.
RTX 5090M performance sits between the 5060 Ti and the 5070.
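Those bandwidth figures translate fairly directly into a token-generation ceiling, since decode is roughly memory-bandwidth-bound: each generated token needs one pass over the active weights, so bandwidth divided by model size gives a loose upper bound on tok/s. The bandwidth numbers are the ones quoted above; the model size is a made-up ~4B Q4 quant:

```python
# Loose upper bound on decode (token generation) speed from memory bandwidth.
# Decode reads the active weights once per token, so tok/s <= bandwidth / size.
# Model size here is hypothetical (~4B params at Q4).

def tg_ceiling(bandwidth_gb_s, model_gb):
    return bandwidth_gb_s / model_gb

model_gb = 2.5
for name, bw in [("5090 mobile", 896), ("5080 desktop", 960),
                 ("5090 desktop", 1800)]:
    print(f"{name}: <= {tg_ceiling(bw, model_gb):.0f} tok/s")
```

Real numbers come in below the ceiling (KV cache reads, kernel overhead), but the ratio between machines tracks the bandwidth ratio.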
john0201@reddit
Now unplug both and rerun…
DistanceSolar1449@reddit
I doubt anyone’s running a 5090 mobile unplugged, ever. lol.
john0201@reddit
Seems like a slight negative in a laptop.
ANR2ME@reddit
Yeah, the 5090 mobile GPU has similar performance to a 5070 Ti desktop GPU.
grizwako@reddit
Yeah, people don't really get macs to run small models.
Even the power draw itself is not really that important.
If you already have a PC, you can get a GPU for cheaper.
But Apple looks to be really killing Nvidia in the enthusiast market and with smaller companies.
I wonder if they will make a move into datacenter market too.
tiger_ace@reddit
yep, m5 ultra mac studio with 512GB RAM @ 1.2TB/s memory bandwidth for $10K is actually a good deal depending on exactly what you're trying to do
i think they are well aware of the local LLM play but very unlikely on the datacenter play, they've openly stated lower CAPEX spend on their earnings calls and moving into that would require them to disclose massive spend
DistanceSolar1449@reddit
It won’t be $10k
If it was able to be sold at $10k Apple wouldn’t have dropped the M3 Ultra 512gb
krystof24@reddit
They won't because they can't. Their main niche is getting a lot of VRAM for relatively (to Nvidia) cheap, but I don't see how that could scale to datacenter size. NVIDIA is still number one on compute and ecosystem.
Position_Emergency@reddit
RTX 5090 has 1,792 GB/s memory bandwidth!
john0201@reddit
Mobile “5090” (which is a cut down 5080) has 896GB/s
droptableadventures@reddit
Not in the mobile version, as this is a comparison of laptops.
Creative-Signal6813@reddit
benchmark conditions never include sustained load. laptop 5090 at 155w will throttle under extended workloads. m5 max holds clock speed flat for hours.
if ur running one query at a time the peak numbers matter. if ur running an agent all day, ur buying the sustained number, not what's in the video.
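The peak-vs-sustained point can be sketched with made-up numbers: a GPU that bursts high and then throttles can lose an hour-long session to one with a lower peak that holds its clocks flat.

```python
# Peak vs sustained throughput over a long session (all numbers made up).
# Burst at peak for a short window, then drop to the sustained rate.

def tokens_over_session(seconds, peak_tps, sustained_tps, burst_seconds):
    burst = min(seconds, burst_seconds) * peak_tps
    rest = max(0, seconds - burst_seconds) * sustained_tps
    return burst + rest

hour = 3600
throttled = tokens_over_session(hour, peak_tps=120, sustained_tps=70,
                                burst_seconds=300)
flat = tokens_over_session(hour, peak_tps=90, sustained_tps=90,
                           burst_seconds=300)
print(throttled, flat)  # the machine with the lower peak wins the hour
```

For a single query inside the burst window the throttling machine still wins, which is the "one query at a time" caveat above.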
EvilGuy@reddit
Pretty impressive for a laptop I guess?
For comparison I get 130-ish tokens a sec with a 3090 in an old 3800x with 2400 Mhz DDR4 ram that I built from old spare parts I had sitting around and the 3090 was about $800.
No fair comparing these $5000 apple machines to real computers though I guess. ;)
Anarchaotic@reddit
I mean yeah, of course any of the higher end 3/4/5 series RTX GPUs are faster, look at their bandwidth speeds. But that's only for small models that fit entirely in VRAM.
Your 3090 will choke the second you load anything over 24GB into it, which is where the Macbook will start seeing real advantages.
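Whether a model "fits" is checkable with a rough estimate: weights plus KV cache versus VRAM. The architecture numbers below are hypothetical; the KV formula (2 bytes each for fp16 K and V, per layer, per KV head, per head dimension, per token) is the standard estimate.

```python
# Rough check of whether a model plus its KV cache fits in VRAM.
# Architecture numbers are hypothetical; the KV-cache formula is the
# standard fp16 estimate: 2 (K and V) * layers * kv_heads * head_dim
# * 2 bytes, per token of context.

def vram_needed_gb(weights_gb, layers, kv_heads, head_dim, ctx_tokens):
    kv_bytes = 2 * layers * kv_heads * head_dim * 2 * ctx_tokens
    return weights_gb + kv_bytes / 1e9

# Hypothetical 32B-class dense model at Q4 (~18 GB weights), 32k context.
need = vram_needed_gb(18, layers=64, kv_heads=8, head_dim=128,
                      ctx_tokens=32_768)
print(f"{need:.1f} GB needed; fits in 24 GB: {need <= 24}")
```

With these assumed numbers the KV cache alone adds ~8.6 GB, pushing the total past 24 GB even though the weights fit, which is exactly the spillover case where a unified-memory machine pulls ahead.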
Lorian0x7@reddit
The macbook will not choke with models over 24gb, but your wallet definitely will.
dtham@reddit
Running the Deepseek R1 Distill Qwen 8B?
Lorian0x7@reddit
Did you casually forget about prompt processing? Btw, a 5090 in a laptop is not really a 5090; performance-wise it's on par with a desktop 5070.
gkon7@reddit
Sick of these TG-only benchmarks. We can already guess this.
Euphoric_Emotion5397@reddit
Cost of the machine divided by number of tokens (cost per token) would be a better metric.
But why do Apple users like to test only 8B models? hehe
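A sketch of that cost-per-token metric, amortizing the hardware price over its usable life. Every number here (prices, throughput, lifetime, utilization) is made up, purely to show the shape of the calculation:

```python
# Amortized hardware cost per million generated tokens (all numbers made up).

def usd_per_million_tokens(machine_usd, tok_per_s, lifetime_hours,
                           utilization=0.25):
    # utilization = fraction of its life the machine spends generating
    total_tokens = tok_per_s * 3600 * lifetime_hours * utilization
    return machine_usd / total_tokens * 1e6

hours = 3 * 365 * 24  # three years, in hours
print(usd_per_million_tokens(5000, 60, hours))   # hypothetical $5k laptop
print(usd_per_million_tokens(800, 130, hours))   # hypothetical used 3090 box
```

Electricity would need to be added for a fair comparison, which is where the laptop's low power draw claws some of the gap back.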
AvailableMycologist2@reddit
the real question is prompt processing speed which they didn't show. for local LLM usage the bottleneck is usually PP not TG, especially with long context. that said the 614GB/s bandwidth on the M5 Max is impressive for a laptop. curious to see how the 128GB version handles larger MoE models
Individual-Source618@reddit
no
themixtergames@reddit (OP)
He also included this graph with incorrect labels in the spirit of LLM benchmarks
Eden1506@reddit
What is the prompt processing speed?
anonutter@reddit
would be cool to see token/s/usd
ohwut@reddit
The M5 Max is also going to be ~2x the cost of a 5080 Mobile equipped laptop in a lot of cases. But as a Mac user, for all the other benefits, the price is irrelevant; I don't have the option of buying a 5080 anyway.
Ill_Barber8709@reddit
I'm curious about big MOE (like GPT-OSS 120B) on the 128GB version (as well as Devstral-2 123B)
mattate@reddit
I think a better test would be running something that requires CPU offloading; that is where the M5 will really shine.