Local LLMs on Refurb M4 Max vs new M5 Max

Posted by roguefunction@reddit | LocalLLaMA | 42 comments

Hoping the community can guide me on this one. I'm on the fence about the following purchase:

Refurbished 16-inch MacBook Pro Apple M4 Max Chip with 16‑Core CPU and 40‑Core GPU, 64gb ram for $3,479.00

The new 16-inch MacBook Pro Apple M5 Max Chip with 18‑core CPU, 40‑core GPU, 64gb ram for $4,599.00

I'm drawn to the refurb due to price.

I'm going to be using it for work (data scientist & intelligence analyst), but I also want to run models like Gemma 4 31B at Q8, and Qwen3.6-27B Q8. Mainly data work (derivation and data element extraction etc). I've been using local models for a while, but hitting my head on the resource ceiling of 24gb shared ram.
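For a rough sense of why 24GB is the ceiling here, a minimal sketch of the weight-size arithmetic. It assumes GGUF Q8_0 (32 int8 weights plus one fp16 scale per block, so about 8.5 bits per weight) and the nominal 31B and 27B parameter counts. The function name is made up for illustration, and KV cache and OS overhead come on top of these numbers.

```python
def weight_gib(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the quantized weights in GiB (weights only)."""
    total_bytes = params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 2**30


# Q8_0 block: 34 bytes for 32 weights -> 8.5 bits per weight.
Q8_0_BITS = 34 * 8 / 32

# Nominal parameter counts taken from the model names (assumption).
for name, params in [("31B dense", 31.0), ("27B dense", 27.0)]:
    print(f"{name}: ~{weight_gib(params, Q8_0_BITS):.1f} GiB of weights at Q8_0")
```

Both land well above 24GB and comfortably below 64GB, which leaves the rest of a 64GB machine for context and everything else running on it.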

There's a huge price difference ($1,120). Just wanted to check myself. Is the difference in pre-fill worth it for the m5, and any other enhancements? The reviews seem to indicate the M4 Max can run hot.

Thanks in advance.

[-]

OmegaNetRob@reddit

I have similar use cases and was going through the same debate a few weeks ago. I ultimately decided on the m5 max, and decided to boost the ram to 128gb to allow a larger context window. If you qualify for an education discount you can save about $400, perhaps that helps.

[-]

roguefunction@reddit (OP)

Thanks Omega! I appreciate the advice and the info on the edu discount. Sadly I'm not studying anymore. I thought about the 128GB but it's so expensive. The M5 is at the top of my budget; with the refurb I could also have a vacation at a resort. But I'm definitely considering the M5. The other dynamic is that the M5 Max 64GB comes with 2TB as standard instead of the 1TB on the M4 Max.

[-]

dago_mcj@reddit

The edu discount isn't all that strict. I used it for my Mac mini in November 2024 and I don't think there was even any attempt to validate beyond checking a box.

[-]

OmegaNetRob@reddit

They changed it recently and you have to verify through Unidays. All it asked me to do was verify an email through my university email address. (I am an active student FWIW)

[-]

roguefunction@reddit (OP)

Good to know, thanks for the tip!

[-]

looctonmi@reddit

Can you do a used m4 max with 128gb then? The extra ram makes a difference for cache and I find myself hovering around 110gb usage with only qwen3.6 35b loaded

[-]

Big_Wave9732@reddit

And that's why your experiment is going to fail... you're going cheap on the RAM.

[-]

PixelSage-001@reddit

For local LLMs on Apple Silicon, memory bandwidth is the absolute king. The M4 Max is already an absolute beast for this.

If both machines have 64GB of RAM, the token generation difference for running Gemma 31B (Q8) will be minimal. The $1,120 price difference is massive; you could use that saved cash to upgrade to a 128GB refurbished Studio or laptop later. Go with the refurb M4 Max; the value for the money is significantly better.
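A back-of-the-envelope sketch of the bandwidth argument, assuming a dense model where every generated token has to stream all the weights from memory once. The 546 GB/s figure is Apple's published spec for the 40-core-GPU M4 Max as far as I know (worth double-checking), the 33 GB weight size is an assumed value for a ~31B model at Q8, and the function names are hypothetical.

```python
def decode_ceiling_tok_s(bandwidth_gb_s: float, weights_gb: float) -> float:
    """Upper bound on generation speed when decode is memory-bandwidth bound."""
    return bandwidth_gb_s / weights_gb


def decode_speedup(bw_new_gb_s: float, bw_old_gb_s: float) -> float:
    """Best-case decode speedup between two machines running the same model."""
    return bw_new_gb_s / bw_old_gb_s


M4_MAX_BW = 546.0   # GB/s, assumed from Apple's spec sheet
WEIGHTS_GB = 33.0   # assumed: ~31B dense model at Q8

ceiling = decode_ceiling_tok_s(M4_MAX_BW, WEIGHTS_GB)
print(f"M4 Max decode ceiling: ~{ceiling:.1f} tok/s")
```

Real numbers come in below the ceiling. Note this only covers token generation; prompt processing is compute-bound, so it does not follow the same ratio.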

[-]

roguefunction@reddit (OP)

Thanks Pixel! I might just do that. I’m seeing now from all the commentary that the M5 helps more the larger the model, and in the 128GB RAM range. For the models I listed I’m hearing there won’t be too much of a difference. I appreciate your advice.

[-]

UnhingedBench@reddit

If that helps, here's the performance I can get on an M4 Max.

Speed will be 30% better on an M5 Max, but it only has a decisive impact on larger models.

[-]

roguefunction@reddit (OP)

Amazing, thanks for that!

[-]

TimmyIT@reddit

Is this system something that will be your daily driver for work? If so, have you considered having a separate system only running local LLMs?

The reason I ask is that, in my own experience, offloading it to a dedicated system that's not my main workstation has been beneficial in many respects.

[-]

roguefunction@reddit (OP)

Thanks Timmy, I was thinking of getting a dedicated machine and running it headless to an old laptop, but I really wanted to have one point of contact for everything. I can appreciate the benefit of doing that though. Cheers

[-]

power97992@reddit

M5 Ultra should be coming out soon. The M5 Max is much better for prefill than the M4 Max.

[-]

HealthyCommunicat@reddit

Memory bandwidth isn’t all that matters. There is a literal, measurable 4x prompt processing speedup using the same model and same engine if you put them side by side.

That's the difference between waiting 10 minutes and 2-3 minutes for your LLM to read your massive codebase and start working.

That adds up fast. 10 hours of processing? Or 2.5 hrs?

It doesn’t matter how fast you can write if it takes you too long to even read the instructions.
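To put rough numbers on that, a small sketch of the time-to-first-token arithmetic. The prompt size and the baseline prefill rate are made-up values picked so the baseline works out to 10 minutes; only the 4x factor comes from the comparison above.

```python
def prefill_seconds(prompt_tokens: int, prefill_tok_s: float) -> float:
    """Time spent reading the prompt before the first output token."""
    return prompt_tokens / prefill_tok_s


PROMPT_TOKENS = 120_000   # hypothetical: a large codebase dump
BASE_RATE = 200.0         # hypothetical prefill rate in tok/s
SPEEDUP = 4.0             # the 4x prompt processing factor

slow = prefill_seconds(PROMPT_TOKENS, BASE_RATE)            # 600 s, i.e. 10 minutes
fast = prefill_seconds(PROMPT_TOKENS, BASE_RATE * SPEEDUP)  # 150 s, i.e. 2.5 minutes

print(f"baseline: {slow / 60:.1f} min, with 4x prefill: {fast / 60:.1f} min")
```

In an agentic loop the prompt gets re-read many times, so this cost is paid per step unless the prompt cache is reused.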

[-]

roguefunction@reddit (OP)

Thanks for clarifying. That’s really good to know actually. Makes the additional cost seem not so bad.

[-]

MrPecunius@reddit

M5 all the way, prefill is 3X+ faster than M4. (Have owned M4 & M4 Pro, now M5 Pro).

Both will run hot. 16 inch chassis should help, I would not get a Max in the 14 inch.

This should help you sort things out:
https://omlx.ai/compare

My conclusion was that the main benefit of the Max is having 128GB to run larger MoE models. The Max is not twice as fast as the Pro with smaller models; it's more like 1.5X to maybe 1.75X. Given the excellent performance of ~30B models and my strong preference for a 14 inch chassis, an M5 Pro/64GB made more sense. If I want a Max, I'll get a forthcoming Studio or whatever. I'm quite happy with the improvement over the M4 Pro I had for almost a year and a half.

[-]

roguefunction@reddit (OP)

Thanks mate. Glad to hear these facts. They'll save me big time. It's a big spend out of pocket so glad to hear the negative too.

[-]

MrPecunius@reddit

The oMLX benchmark data is a godsend and took a lot of guesswork out of my decision.

[-]

returnity@reddit

Anyone got stats on the speed diff b/w M4 Max, M5 Pro, and M5 Max? I was also considering a similar choice but M5 Pro vs M4 Max at same RAM and similar price.

[-]

EmotionalFan5429@reddit

Gemini gave me that:

Llama 3.3 70B (Q4_K_M Quantization)

M5 Max: 18 – 28 tokens per second
M4 Max: 15 – 25 tokens per second
M5 Pro: 10 – 15 tokens per second

[-]

tmvr@reddit

Those numbers are nonsense.
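One way to sanity-check figures like these is to compare them against the memory-bandwidth ceiling for a dense model. This is a sketch under assumptions: Q4_K_M averages roughly 4.85 bits per weight, the parameter count is rounded to 70B, and 546 GB/s is taken as the 40-core M4 Max bandwidth. The function name is hypothetical.

```python
def max_decode_tok_s(bandwidth_gb_s: float, params_billion: float,
                     bits_per_weight: float) -> float:
    """Bandwidth-bound ceiling: each token streams all weights once."""
    weights_gb = params_billion * bits_per_weight / 8  # billions of bytes = GB
    return bandwidth_gb_s / weights_gb


ceiling = max_decode_tok_s(546.0, 70.0, 4.85)  # assumed M4 Max bandwidth
claimed_low, claimed_high = 15, 25             # the quoted M4 Max range

print(f"ceiling ~{ceiling:.1f} tok/s, quoted range {claimed_low}-{claimed_high}")
print("quoted range is above the ceiling:", claimed_low > ceiling)
```

Under these assumptions even the low end of the quoted M4 Max range is above what the memory bandwidth allows for a dense 70B model.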

[-]

BawbbySmith@reddit

I went through this recently, I even bought both devices to try. Ultimately I ended up returning both and buying a 5090. In hindsight I’m regretting not buying a rtx pro 5000, but I’m still happy with the 5090.

Mind you, my main use case is agentic programming, so a lot of the model looping and chatting with itself. I’d give it a mostly defined ticket, have it do an attempt, and then review the work and refine it. Currently I’m working on having a more thorough planning step first so that I don’t have to do as much cleanup, but anyway.

I’d give it a decently sized ticket - create CRUD endpoints for new feature X, implement business logic, add database, add unit tests and e2e tests, then review the work and fix any issues. The M4 Max may take an hour, sometimes an hour and a half. The M5 Max would take 30-45 minutes. The 5090 does it in 7-10 minutes, and this is after I limited it to 400W. The difference is huge.

But the other problem, and the second main reason I returned both laptops: the fans spin up the whole time, and the laptop quickly throttles. Imagine hearing the fan blasting for an hour and a half while it does its thing. The M5 Max throttled less, but there was no difference in fan noise. This also destroys the portability aspect of it. Sure, I can take it out of the house, but the battery will die in a couple of hours while it’s running. For the 5090 I threw it in an old junker I had lying around, just with an upgraded PSU to support the power requirement, and it lives in the basement, where I connect to it from my laptop over the internet.

Now, with the 5090 I’m able to get Qwen 3.6 27B, Q6_K_XL running, with vision, q8_0 KV and 150k context. Ideally I’d like to bump this up to Q8 and FP16 KV, but that ain’t happening with the 5090, hence my regret with not getting the rtx pro 5000. I’ve not had any issues, but I also don’t know what I’m missing out on. The MBPs would’ve been able to run at this spec, but now it’s even slower because it’s bigger than Q6, and honestly at those speeds it’s not very useful.
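For anyone wondering why the KV cache type matters so much at 150k context, here's a rough sketch of the usual size formula for standard attention. The layer count, KV head count, and head dimension below are placeholder values, not the real architecture of the 27B, and models with sliding-window or linear-attention layers need less than this.

```python
def kv_cache_gib(layers: int, kv_heads: int, head_dim: int,
                 context_tokens: int, bytes_per_elem: float) -> float:
    """KV cache size in GiB for plain (full) attention layers."""
    # Factor of 2 covers keys and values.
    total = 2 * layers * kv_heads * head_dim * context_tokens * bytes_per_elem
    return total / 2**30


# Hypothetical architecture, for illustration only.
LAYERS, KV_HEADS, HEAD_DIM = 64, 8, 128
CTX = 150_000

fp16 = kv_cache_gib(LAYERS, KV_HEADS, HEAD_DIM, CTX, 2.0)
q8_0 = kv_cache_gib(LAYERS, KV_HEADS, HEAD_DIM, CTX, 34 / 32)  # q8_0: 34 bytes per 32 values

print(f"FP16 KV: ~{fp16:.1f} GiB, q8_0 KV: ~{q8_0:.1f} GiB")
```

With numbers in that ballpark, the FP16 cache alone would already be larger than the 5090's 32GB of VRAM before loading any weights, which is the squeeze described above.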

One caveat is that I didn’t really bother trying MTP, and while that helps with token generation, it still doesn’t help with prefill. Another is that we don’t know if a MoE model will suddenly drop that somehow beats the 27B, in which case the MBP would actually be able to run it and get much faster speeds than with a dense model. I know the 35B only has 3B active parameters, but it was quite usable on the MBPs; it just lacks the reasoning of the 27B.

I still think the ideal is the future M5 Ultra, but who knows what will happen there. We may be waiting for a Nov reveal with shipping starting next year, prices being way higher than the M3 Ultras with less RAM, and being as impossible to get as the M3 Ultras today. I didn’t want to wait that long, and plus worst case I could sell the 5090 if the M5 Ultra ends up being way more reasonable than I expect.

Sorry that ended up being more of a wall of text than anything helpful, but I hope that at least the small datapoint of a real-world use case helped.

[-]

roguefunction@reddit (OP)

Thank you so much for all the time you put into that. You've given me a ton to think about and it is great to hear some real world comparisons and experiences. I just looked at the rtx pro 5000 and it's not that much of a jump. Just need to figure out the angles. I appreciate the effort!!! Cheers.

[-]

DreamingInManhattan@reddit

Given the use case, the M5 over the M4. PP is everything for you, unless your time means nothing.

However, the right answer is neither, unless you need a mac for other reasons. Instead, I'd get a gaming pc and 2 3090s. I have that, and a M4 128gb mac, and would *never* use the mac to ingest anything data related. Unless I needed to cook some bacon and didn't have a stove handy.

[-]

roguefunction@reddit (OP)

Thanks Dreaming. That's a new perspective for me. I've been lucky with my work as they paid for the top frontier models for me to use anytime even at home to experiment. That's over now due to cost cutting and I want independence from the big 5 providers. Prefer local. Can't seem to find many **90's round for a good price. Maybe when the glut finishes.

[-]

EmotionalFan5429@reddit

There are cheap Chinese alternatives like DeepSeek.

[-]

EmotionalFan5429@reddit

Buy a PC: AMD Ryzen with an RTX 4060 Ti (16 GB VRAM). You will save a lot of money.

[-]

Mameiro@reddit

I’d lean refurb M4 Max unless local inference is your main daily workload. The M5 Max memory bandwidth bump is useful, especially for prefill, but with the same 64GB unified memory and 40-core GPU, I doubt it’s worth $1,120 more for occasional Qwen/Gemma use. For local LLMs, 64GB is the key upgrade. The M5 may be faster, but the M4 Max should already be very capable. I’d rather save the money unless you’ll be running long-context inference all day.

[-]

roguefunction@reddit (OP)

Thanks Mameiro! I appreciate your advice.

[-]

Beamsters@reddit

If you do a lot of local inference, stay away from the M4 Max at almost full price (OK at ~50% of the price or something). M5 Max has Apple Neural Engine, which can speed up prefill a lot with Metal 4, and you don't want to miss that.

[-]

slavetothesound@reddit

The Neural Engine has been around for a while. It’s the new matrix multiplication (matmul) cores on the GPU that bring the gains.

[-]

roguefunction@reddit (OP)

Thanks Beamsters. Will do. Cheers.

[-]

saqneo@reddit

worth imo if price isn't a complete dealbreaker

[-]

Ok_Warning2146@reddit

Both are 64gb

[-]

slavetothesound@reddit

I wouldn’t give up the prompt processing speed gains of the m5, but I’m content with the speeds on my m5 pro 64gb. Couldn’t justify the extra $1500 for max. I chat with ~30b dense models at 8 bit and have lots of room for context. MoE models feel very snappy.

I’m not doing agentic workflows though.

[-]

boxtlandfickerel@reddit

You can choose the M4 Max

[-]

Last_Mastod0n@reddit

Since you're just starting, I would avoid spending so much on the M5, although I have heard the performance jump is big. I would focus on RAM capacity over anything else.

[-]

roguefunction@reddit (OP)

Thanks mate.

[-]

Ok_Warning2146@reddit

M5 Max no brainer

[-]

roguefunction@reddit (OP)

Thanks!