M4 Max 128GB running Qwen 72B Q4 MLX at 11 tokens/second.
Posted by tony__Y@reddit | LocalLLaMA | View on Reddit | 69 comments
SandboChang@reddit
That’s pretty amazing. What’s the prompt processing time, if you have a chance to check?
tony__Y@reddit (OP)
1-2 seconds to first token, 10-15 s at 9k tokens of chat context.
Apple is being cheeky: in high power mode the power usage can shoot up to 190W, then it quickly drops to 90-130W, which is around when it starts streaming tokens. By then I’m less impatient about speed, as I can start reading as it generates.
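(Rough implied prefill rate, assuming the 10-15 s window is dominated by processing the ~9k-token prompt:)

```
# 9000 tokens / 15 s ≈ 600 tokens/s   (slow end quoted above)
# 9000 tokens / 10 s ≈ 900 tokens/s   (fast end quoted above)
```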
CH1997H@reddit
What software are you using? I imagine llama.cpp should be faster than this with the optimal settings, also on this M4 hardware
Also make sure to use flash attention etc.
tony__Y@reddit (OP)
🤔 I’m using LM Studio, and it uses MLX and llama.cpp as backends, but I can’t pass custom arguments. Maybe I should try that, hmmm.
CH1997H@reddit
Yeah the optimal custom commands can be a bit tricky to figure out
Try these: -fa -ctk q4_0 -ctv q4_0
There are some other flags you can also try; you can find them in the llama.cpp GitHub documentation.
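For reference, those flags map onto a llama.cpp invocation roughly like the one below; the model filename, context size, and prompt are placeholders, not something from this thread.

```
# Hypothetical llama.cpp run with flash attention and a q4_0-quantized KV cache
./llama-cli -m qwen2.5-72b-instruct-q4_k_m.gguf \
    -ngl 99 -c 8192 \
    -fa -ctk q4_0 -ctv q4_0 \
    -p "Summarize the trade-offs of KV-cache quantization in two sentences."
```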
SandboChang@reddit
15s for 9k is totally acceptable! This really makes a wonderful mobile inference platform. I guess a 32B coder model might be an even better fit.
Yes_but_I_think@reddit
How is this mobile inference, at 170W? Maybe for a few minutes.
SandboChang@reddit
At least you can bring it somewhere with a socket, so you can code with a local model in a café or on a flight.
Power consumption is one thing, but it's hardly continuous consumption either.
SixZer0@reddit
Hehe, the critical question of prompt processing speed. It is important, so yeah, I'm happy you asked about this.
jacek2023@reddit
Now compare the price to 3090s.
mizhgun@reddit
Now compare the power consumption of M4 Max and at least 4x 3090.
RikuDesu@reddit
5-6 3090s at $1k each ish
Plus you'd need a server board (the board alone is about $1000) and a Threadripper to handle the PCIe lanes to hit the same amount of VRAM.
I guess you could do it cheaper using weird bifurcated PCIe extensions, but it'd be more janky, not to mention needing two giant PSUs.
MrMisterShin@reddit
More like 3-4 RTX 3090s in this instance tbh. The reason is the default RAM allocation on the M4 Max MBP - it reserves around 25% of the 128GB for the OS etc. Additionally, OP said they were running other background tasks.
timschwartz@reddit
A 3090 on ebay is about $800 and you'll need 5 of them to match the VRAM in the M4. So $4000 in video cards, plus the computer / power supplies to use them.
The M4 with 128GB of RAM is $4700.
Sounds like the M4 is the winner.
noneabove1182@reddit
Kinda expected more, but in a laptop that's still quite impressive
Does that say 163 watts though..? Am I reading it wrong?
tony__Y@reddit (OP)
No, you’re reading it correctly; that’s total system power. The highest I saw was 190W 😬, while powermetrics reports the GPU at 70W. Very dodgy, Apple. I hope they don’t create another i9 situation in the next few years. 🤞
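For anyone wanting to reproduce those readings, the GPU figure quoted here presumably comes from macOS's built-in sampler, along these lines (the 1000 ms interval is an arbitrary example):

```
# Sample CPU/GPU package power once per second on Apple Silicon (requires sudo)
sudo powermetrics --samplers cpu_power,gpu_power -i 1000
```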
cheesecantalk@reddit
Holy shit. Allowing that in a 14 inch chassis is crazy.
Maybe it wasn't made for AI models after all
Can you check thermals after running an AI model for a few minutes (say 5 to 10)? Just throw question after question at it.
otterquestions@reddit
Dumb question, but does that include everything, including the monitor?
tony__Y@reddit (OP)
I’m using clamshell mode with docks, so if I used the built-in mini-LED display that would be another 10-30W, and connecting some external drives is easily another 10-30W 🫠
tony__Y@reddit (OP)
During inference, the GPU temp rises to around 110C, then it throttles to hold 110C; the fan starts to get loud and it just uses whatever GPU frequency can maintain 110C. I guess high power mode sets a more aggressive fan curve.
After inference, usually before I can finish reading and send the prompt again (1-3 min), the fan drops back to minimum speed.
I'm testing Qwen coder autocomplete right now, and with the 3B model, generated code basically appears in less than a second; then I have to pause and read what it generated, so there isn't much sustained load, and the fan is still at minimum speed... quite impressive.
xXDennisXx3000@reddit
110°C??! Bro your GPU will not last longer than a year with that temp, if it even lasts that long.
Estrava@reddit
We don't really know how Apple silicon will handle heat. Chips are designed differently and there are no universal rules. Take AMD, for example:
"The user asked Hallock if "we have to change our understanding of what is 'good' and 'desirable' when it comes to CPU temps for Zen 3." In short, the answer is yes, sort of. But Hallock provided a longer answer, explaining that 90C is normal for a Ryzen 9 5950X (16C/32T, up to 4.9GHz), Ryzen 9 5900X (12C/24T, up to 4.8GHz), and Ryzen 7 5800X (8C/16T, up to 4.7GHz) at full load, and 95C is normal for the Ryzen 5 5600X (6C/12T, up to 4.6GHz) when spinning its wheels as fast as they will go.
"Yes. I want to be clear with everyone that AMD views temps up to 90C (5800X/5900X/5950X) and 95C (5600X) as typical and by design for full load conditions. Having a higher maximum temperature supported by the silicon and firmware allows the CPU to pursue higher and longer boost performance before the algorithm pulls back for thermal reasons," Hallock said."
xXDennisXx3000@reddit
What execs say mostly benefits the corpos, not the consumer. I have been using Zen 3 with the Ryzen 9 5950X on my main PC and the Ryzen 7 5800X on my LAN PC for years now.
It's true that it's designed to boost to those temps, but even when it is designed for higher boosts and higher temps, you need to pay attention. It will still degrade faster than usual. Since they all use silicon and not some other material, the temps that degrade your hardware are the same as for silicon from 2010 or 2015. It's all still silicon.
Apple is the worst when it comes to telling the truth about their hardware, and they will say absolutely anything if it benefits them. If your GPU dies, they will not replace shit; they'll try to squeeze every little penny out of your pocket and sell you new overpriced things.
Try to reduce your temps, or your GPU will die fast. It's your overpriced hardware, not mine, but I care about my hardware and that's why I do it for my Ryzens lol.
Estrava@reddit
Apple hasn't said anything about their hardware.
And what, silicon is silicon? Did you know the max temp of a Pentium 4 was 70C? What changed in the past few decades? Did silicon get better, if we shouldn't have gone above 70C back then?
Have you looked at server CPUs? I guess they're not made out of silicon but some magic, because they can sit at 90C+ for years.
Dell PowerEdge notes on running CPUs at high temperatures for extended periods and the impact on lifespan: https://www.dell.com/support/kbdoc/en-us/000212668/customer-s-concern-running-the-cpu-at-high-temperatures-for-extended-periods-of-time-may-impact-its-quality-and-lifespan
Intel documentation https://www.intel.com/content/www/us/en/support/articles/000005597/processors.html
"It's unlikely that a processor would get damaged from overheating, due to the operational safeguards in place. Processors have two modes of thermal protection, throttling and automatic shutdown. When a core exceeds the set throttle temperature, it will reduce power to maintain a safe temperature level. The throttle temperature can vary by processor and BIOS settings. If the processor is unable to maintain a safe operating temperature through throttling actions, it will automatically shut down to prevent permanent damage. "
"The leading processor manufacturers intentionally design their components to function at high temperatures throughout their lifespan. They do so based on their understanding of the dependency on system fan power and cooling capabilities. For instance, if Intel or AMD specifies a maximum CPU temperature of 95°C (203°F), it means that the processor can operate at that temperature limit without negatively affecting its lifespan. This is provided the CPU does not exceed that temperature threshold."
tony__Y@reddit (OP)
Thank you for these doc links, that's comforting to know.
What I'm more curious about is the frequent cycling between high and low temps, between inference and idle. But I guess Apple would have thought about that and addressed it, since they're putting these chips in iPhones and iPads too.
Estrava@reddit
The only cause for concern I can think of is that the thermal paste could dry out quicker, so you may have to replace it in a few years to keep the same performance. But that assumes Apple hasn’t adjusted their technology for that either.
Anywho, every concern is speculation unless we know the hardware limits of Apple silicon. Enjoy your device and use it to its fullest, imo.
Capable-Reaction8155@reddit
It's probably okay, but man 110C is hot.
cheesecantalk@reddit
Good to know!
So it starts throttling within a minute when running the 72B, but doesn't break a sweat with smaller models. Good to know.
boissez@reddit
Other laptops go far above that though. This 14-incher goes beyond 230 watts. https://www.notebookcheck.net/Razer-Blade-14-2024-laptop-review-Futureproofing-with-Ryzen-AI.799687.0.html
fairydreaming@reddit
So this is the famous Apple power-efficiency? Funny that it couldn't get enough power from the power adapter and had to use the battery. Thanks for getting us some real values.
I guess it's still half of the power that my Epyc workstation draws from the socket under load.
Daemonix00@reddit
Impressive, but like my old i9… it will turn off if you run it for long… eats the battery too.
boissez@reddit
It's not uncommon. There are several thin-and-light laptops running an Intel Core i7 with an RTX 4070 pegged at 100+ watts that get up there.
For instance, this fellow reaches 208 watts under load: https://www.notebookcheck.net/Asus-Zenbook-Pro-14-OLED-laptop-review-MacBook-Pro-rival-with-120-Hz-OLED-display.725187.0.html
Zeddi2892@reddit
Thanks for sharing!
So basically it would work like this on a 64 GB MBP M4 Max as well?
How about larger models that only fit on the 128 GB MBP?
tony__Y@reddit (OP)
I think even this 72B at Q4 is not usable on a 64GB MBP. You might need to use Q2, quit all other apps, allocate more VRAM and use a small context length. Whereas on 128GB I didn't need to quit any of my work apps; I can just work with the 72B on the side.
Zeddi2892@reddit
So basically you're arguing that models larger than 72B won't fit on a 128 GB MBP either?
tony__Y@reddit (OP)
If you really want to, you can get it to run, but I would argue that for productivity-assistant purposes ~72B is the limit on a 128GB MBP.
For example, if I want to run Mistral Large 2411 (123B), I either have to use Q2, or use Q4 but quit all my other apps, and I think it would be even slower; the returns feel very diminishing beyond ~70B models. Not to mention that at Q8 that model is 130GB in download size. At that point, I'd get impatient and use a cloud model instead.
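(A rough back-of-envelope on why 64 GB is tight for 72B at Q4; the ~4.5 bits per weight figure is an assumption for typical 4-bit quants with overhead, not a measured value:)

```
# 72e9 weights * ~4.5 bits / 8 bits-per-byte ≈ 40 GB just for the weights
# + KV cache for a long context + macOS + everyday apps
# -> cramped on 64 GB of unified memory, comfortable on 128 GB
```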
Zeddi2892@reddit
Have you tried, just for benchmarking, how well the 123B runs at Q4?
I‘m kinda considering buying an MBP and I‘m torn between the 64 and 128 GB versions. 800€ is quite a sum and I‘m not sure that's what I want to pay extra for slightly bigger models.
At home I have a 4090, which is awesome but limited to ~20-30B models (bigger models won't fit, and bigger quants usually aren't that helpful).
If I do buy an MBP, I want it to be worth it for local LLMs. If I just end up using 20B models, I can stick with my existing setup.
tony__Y@reddit (OP)
However, at larger context lengths (a 5.4k-token chat), it takes two minutes to process. Memory usage is still manageable-ish; I can still keep some light apps open.
tony__Y@reddit (OP)
Running Mistral Large 2411 (123B) at Q4: 4-5 t/s.
Daemonix00@reddit
I ran a test last night on a 128GB M1 Ultra. Ollama, Q4, 123B: 7.5 t/s.
tony__Y@reddit (OP)
The Ultra chip's higher memory bandwidth still wins; I got 4-5 t/s on the M4 Max.
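(For context, Apple's published peak unified-memory bandwidth figures, which matter because token generation is largely bandwidth-bound:)

```
# M1 Ultra:                800 GB/s
# M4 Max (128 GB config):  546 GB/s
```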
un_passant@reddit
What is the memory bandwidth of the M4 Max, and how does it compare to a dual Epyc with 16 memory channels of DDR5?
tony__Y@reddit (OP)
Can I carry a dual Epyc with 16 channels of DDR5 on the go? Especially on intercontinental flights.
mibelashri@reddit
This post made me realize what a poor academic I am. But it's nice to see this in action. Let's wait and see how open models take off. Fingers crossed for Qwen 3.0.
J-na1han@reddit
Does MLX still only allow 4- and 8-bit quantization? I feel 8-bit is way too much/too slow, so I use 6-bit in GGUF format with koboldcpp.
tony__Y@reddit (OP)
I'm not sure, but from a quick search on Hugging Face it seems like that's the case.
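If a pre-quantized mlx-community upload in the bit-width you want doesn't exist, mlx-lm can quantize a Hugging Face checkpoint locally; a sketch is below. The output path is a placeholder, and the supported --q-bits values depend on your mlx-lm version, so check mlx_lm.convert --help.

```
# Convert + quantize a HF checkpoint into MLX format (example values only)
mlx_lm.convert --hf-path Qwen/Qwen2.5-72B-Instruct \
    -q --q-bits 4 --q-group-size 64 \
    --mlx-path ./qwen2.5-72b-instruct-mlx-q4
```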
LoadingALIAS@reddit
I expected more. I’m on an M1 16GB and ran the 32b int4 with MLX at about the same.
mahiatlinux@reddit
This is 72B pal. More than double the params.
Durian881@reddit
Wondering what results you'd get running in low power mode? On my M3 Max, I get ~50% of the token generation speed vs high power mode.
tony__Y@reddit (OP)
With the 72B, it spent a minute processing in low power mode, so I decided to cancel it; it wouldn't be usable anyway.
With Llama 3.2 3B Q4 MLX, I get 158 t/s in high power mode and 43 t/s in low power mode.
With Qwen2.5 7B Q4 MLX, I get 90 t/s in high power mode and 27 t/s in low power mode.
Low power mode seems to work by capping total power consumption under 40W, and I have some persistent background CPU tasks going on right now (the system uses 30W without doing inference), which I guess hurts the speed even more in low power mode.
Low power mode also makes the entire system stutter during inference, to the point of typing lag, whereas in high power mode I still get smooth animations during inference.
Durian881@reddit
Thanks for sharing. It seems the M4 Max's low power mode caps performance a lot more than the M3 Max's. I still get smooth animations during inference on the M3 Max in low power mode.
tony__Y@reddit (OP)
Oh, emmm, maybe I forgot to mention I'm connected to three 4K monitors at 6K scaled resolution... so that probably didn't help... 😅
Subjectobserver@reddit
A fellow Zoteran
tony__Y@reddit (OP)
Haha yeah, I'm abusing my 128GB of RAM by opening 20+ Zotero PDF windows.
Mrleibniz@reddit
What's the context size? Can you use it as a local GitHub copilot on VSCode?
tony__Y@reddit (OP)
Currently testing VSCode + Continue + Qwen 2.5 3B Q4 with a 32k context length, and it still autocompletes in less than a second. This thing is amazing; I'm going to download larger coder models and try them.
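For context on the Copilot question: LM Studio exposes an OpenAI-compatible server on localhost (port 1234 by default), which is what Continue and similar plugins point at. A quick smoke test might look like this; the model identifier is a placeholder for whatever model you have loaded.

```
# Hit LM Studio's local OpenAI-compatible endpoint (model name is a placeholder)
curl http://localhost:1234/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "qwen2.5-coder-3b-instruct",
        "messages": [{"role": "user", "content": "Write a function that reverses a string."}],
        "max_tokens": 128
      }'
```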
swiftninja_@reddit
can you share a screen recording of a demo?
tony__Y@reddit (OP)
I don't think Reddit supports video upload? (And I don't have any video hosting service.) Anyway, you can also go to any Apple Store and try LM Studio on their demo units.
synth_mania@reddit
You can use any LLM as a GitHub Copilot, essentially. 72B is probably going to run slower than you would like, though. I run Qwen 2.5 32B on my PC for stuff like VS Code.
alphaQ314@reddit
What's the use case for running a local LLM like this?
tony__Y@reddit (OP)
Highly censored topics, where any cloud AI will just refuse to say anything; with a local LLM, at least I can beat it into saying something useful.
prumf@reddit
You can also use them on sensitive information. I mostly use Copilot and OpenAI models, but when the data can't be leaked at any cost, I use local models + Continue.
Overall it works really well.
RikuDesu@reddit
Thanks for posting this. I'm still on the fence about whether or not it's worth it to get a maxed-out M4 Max.
randomfoo2@reddit
Curious: does your MLX script let you emulate what llama-bench does, e.g. give you numbers for prefill (pp512) as well as token generation (tg128)? Then you could do a 1:1 comparison with llama.cpp's speed, and also get an idea of how long it takes before token generation starts for longer conversations.
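On the llama.cpp side, the reference numbers come from llama-bench, which reports prefill (pp) and generation (tg) throughput separately; a typical run looks something like this (the model path is a placeholder).

```
# pp512 / tg128 style benchmark with flash attention enabled, all layers on GPU
./llama-bench -m qwen2.5-72b-instruct-q4_k_m.gguf -p 512 -n 128 -fa 1 -ngl 99
```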
rava-dosa@reddit
That’s a solid setup for running Qwen 72B at 11 tokens/sec.
I’ve been exploring similar configurations for large-scale model testing.
I worked with a group called Origins AI (a deep-tech dev studio) for a custom deep-learning project.
Might be worth checking out if you’re pushing the limits of what your setup can do!
shaman-warrior@reddit
m1 max, 64gb, qwen 72b Q4, I get 6.17 tokens/s.
From a total generation of 1m 38s.
without using MLX, just using ollama.
SandboChang@reddit
That’s a massive difference; I'd have estimated it should give something like 8-9 tokens per second based on earlier Apple Silicon.
b3081a@reddit
MLX provides around a 20% perf advantage, so the gap is smaller than it looks.
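(So the back-of-envelope reconciliation would be roughly:)

```
# 6.17 tok/s (M1 Max, ollama/GGUF) * ~1.2 claimed MLX advantage ≈ 7.4 tok/s
# vs. ~11 tok/s reported on the M4 Max with MLX
```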