M4 Max 128GB running Qwen 72B Q4 MLX at 11 tokens/second.
Posted by tony__Y@reddit | LocalLLaMA | View on Reddit | 69 comments
SandboChang@reddit
That’s pretty amazing. What’s the prompt processing time, if you have a chance to check?
tony__Y@reddit (OP)
1-2 seconds to first token, 10-15 s at 9k tokens of chat context.
Apple is being cheeky: in high power mode the power usage can shoot up to 190W, then it quickly drops to 90-130W, which is around when it starts streaming tokens. By then I’m less impatient about speed, as I can start reading as it generates.
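(Rough implied prefill rate, assuming the 10-15 s window is dominated by processing the ~9k-token prompt:)

```
# 9000 tokens / 15 s ≈ 600 tokens/s   (slow end quoted above)
# 9000 tokens / 10 s ≈ 900 tokens/s   (fast end quoted above)
```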
CH1997H@reddit
What software are you using? I imagine llama.cpp should be faster than this with the optimal settings, also on this M4 hardware
Also make sure to use flash attention etc.
tony__Y@reddit (OP)
🤔 I’m using LM Studio, and it uses MLX and llama.cpp as backends, but I can’t pass custom arguments. Maybe I should try that, hmmm.
CH1997H@reddit
Yeah the optimal custom commands can be a bit tricky to figure out
Try these: -fa -ctk q4_0 -ctv q4_0
There are some other flags you can also try; you can find them in the llama.cpp GitHub documentation.
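For reference, those flags map onto a llama.cpp invocation roughly like the one below; the model filename, context size, and prompt are placeholders, not something from this thread.

```
# Hypothetical llama.cpp run with flash attention and a q4_0-quantized KV cache
./llama-cli -m qwen2.5-72b-instruct-q4_k_m.gguf \
    -ngl 99 -c 8192 \
    -fa -ctk q4_0 -ctv q4_0 \
    -p "Summarize the trade-offs of KV-cache quantization in two sentences."
```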
SandboChang@reddit
15s for 9k is totally acceptable! This really makes a wonderful mobile inference platform. I guess a 32B coder model might be an even better fit.
Yes_but_I_think@reddit
How is this mobile inference, at 170W? Maybe for a few minutes.
SandboChang@reddit
At least you can bring it somewhere with a socket, so you can code with a local model in a café or on a flight.
Power consumption is one thing, but it's hardly continuous consumption either.
SixZer0@reddit
Hehe, the critical question of prompt processing speed. It is important, so yeah, I'm happy you asked about this.
jacek2023@reddit
Now compare the price to 3090s.
mizhgun@reddit
Now compare the power consumption of M4 Max and at least 4x 3090.
RikuDesu@reddit
5-6 3090s at $1k each ish
Plus you'd need a server board (the board alone is about $1000) and a Threadripper to handle the PCIe lanes to hit the same amount of VRAM.
I guess you could do it cheaper using weird bifurcated PCIe extensions, but it'd be more janky, not to mention needing two giant PSUs.
MrMisterShin@reddit
More like 3-4 RTX 3090s in this instance tbh. The reason is the default RAM allocation on the M4 Max MBP - it reserves around 25% of the 128GB for the OS etc. Additionally, OP said they were running other background tasks.
timschwartz@reddit
A 3090 on ebay is about $800 and you'll need 5 of them to match the VRAM in the M4. So $4000 in video cards, plus the computer / power supplies to use them.
The M4 with 128GB of RAM is $4700.
Sounds like the M4 is the winner.
noneabove1182@reddit
Kinda expected more, but in a laptop that's still quite impressive
Does that say 163 watts though..? Am I reading it wrong?
tony__Y@reddit (OP)
No, you’re reading it correctly; that’s total system power. The highest I saw was 190W 😬, while powermetrics reports the GPU at 70W. Very dodgy, Apple. I hope they don’t create another i9 situation in the next few years. 🤞
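For anyone wanting to reproduce those readings, the GPU figure quoted here presumably comes from macOS's built-in sampler, along these lines (the 1000 ms interval is an arbitrary example):

```
# Sample CPU/GPU package power once per second on Apple Silicon (requires sudo)
sudo powermetrics --samplers cpu_power,gpu_power -i 1000
```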
cheesecantalk@reddit
Holy shit. Allowing that in a 14 inch chassis is crazy.
Maybe it wasn't made for AI models after all
Can you check thermals after running an AI model for a few minutes (say 5 to 10)? Just throw question after question at it.
otterquestions@reddit
Dumb question, but does that include everything, including the monitor?
tony__Y@reddit (OP)
I’m using clamshell mode with docks, so if I used the built-in mini-LED display that would be another 10-30W, and connecting some external drives is easily another 10-30W 🫠
tony__Y@reddit (OP)
During inference, the GPU temp rises to around 110C, then it throttles to hold 110C; the fan starts to get loud and it just uses whatever GPU frequency can maintain 110C. I guess high power mode sets a more aggressive fan curve.
After inference, usually before I can finish reading and send the prompt again (1-3 min), the fan drops back to minimum speed.
I'm testing Qwen coder autocomplete right now, and with the 3B model, generated code basically appears in less than a second; then I have to pause and read what it generated, so there isn't much sustained load, and the fan is still at minimum speed... quite impressive.
xXDennisXx3000@reddit
110°C??! Bro your GPU will not last longer than a year with that temp, if it even lasts that long.
Estrava@reddit
We don't really know how Apple silicon will handle heat. Chips are designed differently and there are no universal rules. Take AMD, for example:
"The user asked Hallock if "we have to change our understanding of what is 'good' and 'desirable' when it comes to CPU temps for Zen 3." In short, the answer is yes, sort of. But Hallock provided a longer answer, explaining that 90C is normal for a Ryzen 9 5950X (16C/32T, up to 4.9GHz), Ryzen 9 5900X (12C/24T, up to 4.8GHz), and Ryzen 7 5800X (8C/16T, up to 4.7GHz) at full load, and 95C is normal for the Ryzen 5 5600X (6C/12T, up to 4.6GHz) when spinning its wheels as fast as they will go.
"Yes. I want to be clear with everyone that AMD views temps up to 90C (5800X/5900X/5950X) and 95C (5600X) as typical and by design for full load conditions. Having a higher maximum temperature supported by the silicon and firmware allows the CPU to pursue higher and longer boost performance before the algorithm pulls back for thermal reasons," Hallock said."
xXDennisXx3000@reddit
What execs say mostly benefits the corpos, not the consumer. I have been using Zen 3 with the Ryzen 9 5950X on my main PC and the Ryzen 7 5800X on my LAN PC for years now.
It's true that it's designed to boost to those temps, but even when it is designed for higher boosts and higher temps, you need to pay attention. It will still degrade faster than usual. Since they all use silicon and not some other material, the temps that degrade your hardware are the same as for silicon from 2010 or 2015. It's all still silicon.
Apple is the worst when it comes to telling the truth about their hardware, and they will say absolutely anything if it benefits them. If your GPU dies, they will not replace shit; they'll try to squeeze every little penny out of your pocket and sell you new overpriced things.
Try to reduce your temps, or your GPU will die fast. It's your overpriced hardware, not mine, but I care about my hardware and that's why I do it for my Ryzens lol.
Estrava@reddit
Apple hasn't said anything about their hardware.
And what, silicon is silicon? Did you know the max temp of a Pentium 4 was 70C? What changed in the past few decades? Did silicon get better, if we shouldn't have gone above 70C back then?
Have you looked at server CPUs? I guess they're not made out of silicon but some magic, because they can sit at 90C+ for years.
Dell PowerEdge notes on running CPUs at high temperatures for extended periods and the impact on lifespan: https://www.dell.com/support/kbdoc/en-us/000212668/customer-s-concern-running-the-cpu-at-high-temperatures-for-extended-periods-of-time-may-impact-its-quality-and-lifespan
Intel documentation https://www.intel.com/content/www/us/en/support/articles/000005597/processors.html
"It's unlikely that a processor would get damaged from overheating, due to the operational safeguards in place. Processors have two modes of thermal protection, throttling and automatic shutdown. When a core exceeds the set throttle temperature, it will reduce power to maintain a safe temperature level. The throttle temperature can vary by processor and BIOS settings. If the processor is unable to maintain a safe operating temperature through throttling actions, it will automatically shut down to prevent permanent damage. "
"The leading processor manufacturers intentionally design their components to function at high temperatures throughout their lifespan. They do so based on their understanding of the dependency on system fan power and cooling capabilities. For instance, if Intel or AMD specifies a maximum CPU temperature of 95°C (203°F), it means that the processor can operate at that temperature limit without negatively affecting its lifespan. This is provided the CPU does not exceed that temperature threshold."
tony__Y@reddit (OP)
Thank you for these doc links, that's comforting to know.
What I'm more curious about is the frequent cycling between high and low temps, between inference and idle. But I guess Apple would have thought about that and addressed it, since they're putting these chips in iPhones and iPads too.
Estrava@reddit
The only cause for concern I can think of is that the thermal paste could dry out quicker, so you may have to replace it in a few years to keep the same performance. But that assumes Apple hasn’t adjusted their technology for that either.
Anywho, every concern is speculation unless we know the hardware limits of Apple silicon. Enjoy your device and use it to its fullest, imo.
Capable-Reaction8155@reddit
It's probably okay, but man 110C is hot.
cheesecantalk@reddit
Good to know!
So it starts throttling within a minute when running the 72B, but doesn't break a sweat with smaller models. Good to know.
boissez@reddit
Other laptops go far above that though. This 14-incher goes beyond 230 watts. https://www.notebookcheck.net/Razer-Blade-14-2024-laptop-review-Futureproofing-with-Ryzen-AI.799687.0.html
fairydreaming@reddit
So this is the famous Apple power-efficiency? Funny that it couldn't get enough power from the power adapter and had to use the battery. Thanks for getting us some real values.
I guess it's still half of the power that my Epyc workstation draws from the socket under load.
Daemonix00@reddit
Impressive, but like my old i9… it will turn off if you run it for long… eats the battery too.
boissez@reddit
It's not uncommon. There are several thin-and-light laptops running an Intel Core i7 with an RTX 4070 pegged at 100+ watts that get up there.
For instance, this fellow reaches 208 watts under load: https://www.notebookcheck.net/Asus-Zenbook-Pro-14-OLED-laptop-review-MacBook-Pro-rival-with-120-Hz-OLED-display.725187.0.html
Zeddi2892@reddit
Thanks for sharing!
So basically it would work like this on a 64 GB MBP M4 Max as well?
How about larger models that only fit on the 128 GB MBP?
tony__Y@reddit (OP)
I think even this 72B at Q4 is not usable on a 64GB MBP. You might need to use Q2, quit all other apps, allocate more VRAM and use a small context length. Whereas on 128GB I didn't need to quit any of my work apps; I can just work with the 72B on the side.
Zeddi2892@reddit
So basically you're arguing that models larger than 72B won't fit on a 128 GB MBP either?
tony__Y@reddit (OP)
If you really want to, you can get it to run, but I would argue that for productivity-assistant purposes ~72B is the limit on a 128GB MBP.
For example, if I want to run Mistral Large 2411 (123B), I either have to use Q2, or use Q4 but quit all my other apps, and I think it would be even slower; the returns feel very diminishing beyond ~70B models. Not to mention that at Q8 that model is 130GB in download size. At that point, I'd get impatient and use a cloud model instead.
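(A rough back-of-envelope on why 64 GB is tight for 72B at Q4; the ~4.5 bits per weight figure is an assumption for typical 4-bit quants with overhead, not a measured value:)

```
# 72e9 weights * ~4.5 bits / 8 bits-per-byte ≈ 40 GB just for the weights
# + KV cache for a long context + macOS + everyday apps
# -> cramped on 64 GB of unified memory, comfortable on 128 GB
```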
Zeddi2892@reddit
Have you tried, just for benchmarking, how well the 123B runs at Q4?
I‘m kinda considering buying an MBP and I‘m torn between the 64 and 128 GB versions. 800€ is quite a sum and I‘m not sure that's what I want to pay extra for slightly bigger models.
At home I have a 4090, which is awesome but limited to ~20-30B models (bigger models won't fit, and bigger quants usually aren't that helpful).
If I do buy an MBP, I want it to be worth it for local LLMs. If I just end up using 20B models, I can stick with my existing setup.
tony__Y@reddit (OP)
However, at larger context lengths (a 5.4k-token chat), it takes two minutes to process. Memory usage is still manageable-ish; I can still keep some light apps open.
tony__Y@reddit (OP)
Running Mistral Large 2411 (123B) at Q4: 4-5 t/s.
Daemonix00@reddit
I ran a test last night on a 128GB M1 Ultra. Ollama, Q4, 123B: 7.5 t/s.
tony__Y@reddit (OP)
The Ultra chip's higher memory bandwidth still wins; I got 4-5 t/s on the M4 Max.
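(For context, Apple's published peak unified-memory bandwidth figures, which matter because token generation is largely bandwidth-bound:)

```
# M1 Ultra:                800 GB/s
# M4 Max (128 GB config):  546 GB/s
```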
un_passant@reddit
What is the memory bandwidth of the M4 Max, and how does it compare to a dual Epyc with 16 memory channels of DDR5?
tony__Y@reddit (OP)
Can I carry a dual Epyc with 16 channels of DDR5 on the go? Especially on intercontinental flights.
mibelashri@reddit
This post made me realize what a poor academic I am. But it's nice to see this in action. Let's wait and see how open models take off. Fingers crossed for Qwen 3.0.
J-na1han@reddit
Does MLX still only allow 4- and 8-bit quantization? I feel 8-bit is way too much/too slow, so I use 6-bit in GGUF format with koboldcpp.
tony__Y@reddit (OP)
I'm not sure, but from a quick search on Hugging Face it seems like that's the case.
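If a pre-quantized mlx-community upload in the bit-width you want doesn't exist, mlx-lm can quantize a Hugging Face checkpoint locally; a sketch is below. The output path is a placeholder, and the supported --q-bits values depend on your mlx-lm version, so check mlx_lm.convert --help.

```
# Convert + quantize a HF checkpoint into MLX format (example values only)
mlx_lm.convert --hf-path Qwen/Qwen2.5-72B-Instruct \
    -q --q-bits 4 --q-group-size 64 \
    --mlx-path ./qwen2.5-72b-instruct-mlx-q4
```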
LoadingALIAS@reddit
I expected more. I’m on an M1 16GB and ran the 32b int4 with MLX at about the same.
mahiatlinux@reddit
This is 72B pal. More than double the params.
Durian881@reddit
Wondering what results you'd get running in low power mode? On my M3 Max, I get ~50% of the token generation speed vs high power mode.
tony__Y@reddit (OP)
With the 72B, it spent a minute processing in low power mode, so I decided to cancel it; it wouldn't be usable anyway.
With Llama 3.2 3B Q4 MLX, I get 158 t/s in high power mode and 43 t/s in low power mode.
With Qwen2.5 7B Q4 MLX, I get 90 t/s in high power mode and 27 t/s in low power mode.
Low power mode seems to work by capping total power consumption under 40W, and I have some persistent background CPU tasks going on right now (the system uses 30W without doing inference), which I guess hurts the speed even more in low power mode.
Low power mode also makes the entire system stutter during inference, to the point of typing lag, whereas in high power mode I still get smooth animations during inference.
Durian881@reddit
Thanks for sharing. It seems the M4 Max's low power mode caps performance a lot more than the M3 Max's. I still get smooth animations during inference on the M3 Max in low power mode.
tony__Y@reddit (OP)
Oh, emmm, maybe I forgot to mention I'm connected to three 4K monitors at 6K scaled resolution... so that probably didn't help... 😅
Subjectobserver@reddit
A fellow Zoteran
tony__Y@reddit (OP)
Haha yeah, I'm abusing my 128GB of RAM by opening 20+ Zotero PDF windows.
Mrleibniz@reddit
What's the context size? Can you use it as a local GitHub copilot on VSCode?
tony__Y@reddit (OP)
Currently testing VSCode + Continue + Qwen 2.5 3B Q4 with a 32k context length, and it still autocompletes in less than a second. This thing is amazing; I'm going to download larger coder models and try them.
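For context on the Copilot question: LM Studio exposes an OpenAI-compatible server on localhost (port 1234 by default), which is what Continue and similar plugins point at. A quick smoke test might look like this; the model identifier is a placeholder for whatever model you have loaded.

```
# Hit LM Studio's local OpenAI-compatible endpoint (model name is a placeholder)
curl http://localhost:1234/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "qwen2.5-coder-3b-instruct",
        "messages": [{"role": "user", "content": "Write a function that reverses a string."}],
        "max_tokens": 128
      }'
```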
swiftninja_@reddit
can you share a screen recording of a demo?
tony__Y@reddit (OP)
I don't think Reddit supports video upload? (And I don't have any video hosting service.) Anyway, you can also go to any Apple Store and try LM Studio on their demo units.
synth_mania@reddit
You can use any LLM as a GitHub Copilot, essentially. 72B is probably going to run slower than you would like, though. I run Qwen 2.5 32B on my PC for stuff like VS Code.
alphaQ314@reddit
What's the use case for running a local LLM like this?
tony__Y@reddit (OP)
Highly censored topics, where any cloud AI will just refuse to say anything; with a local LLM, at least I can beat it into saying something useful.
prumf@reddit
You can also use them on sensitive information. I mostly use Copilot and OpenAI models, but when the data can't be leaked at any cost, I use local models + Continue.
Overall it works really well.
RikuDesu@reddit
Thanks for posting this. I'm still on the fence about whether or not it's worth it to get a maxed-out M4 Max.
randomfoo2@reddit
Curious: does your MLX script let you emulate what llama-bench does, e.g. give you numbers for prefill (pp512) as well as token generation (tg128)? Then you could do a 1:1 comparison with llama.cpp's speed, and also get an idea of how long it takes before token generation starts for longer conversations.
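On the llama.cpp side, the reference numbers come from llama-bench, which reports prefill (pp) and generation (tg) throughput separately; a typical run looks something like this (the model path is a placeholder).

```
# pp512 / tg128 style benchmark with flash attention enabled, all layers on GPU
./llama-bench -m qwen2.5-72b-instruct-q4_k_m.gguf -p 512 -n 128 -fa 1 -ngl 99
```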
rava-dosa@reddit
That’s a solid setup for running Qwen 72B at 11 tokens/sec.
I’ve been exploring similar configurations for large-scale model testing.
I worked with a group called Origins AI (a deep-tech dev studio) for a custom deep-learning project.
Might be worth checking out if you’re pushing the limits of what your setup can do!
shaman-warrior@reddit
m1 max, 64gb, qwen 72b Q4, I get 6.17 tokens/s.
From a total generation of 1m 38s.
without using MLX, just using ollama.
SandboChang@reddit
That’s a massive difference; I'd have estimated it should give something like 8-9 tokens per second based on earlier Apple Silicon.
b3081a@reddit
MLX provides around a 20% perf advantage, so the gap is smaller than it looks.
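(So the back-of-envelope reconciliation would be roughly:)

```
# 6.17 tok/s (M1 Max, ollama/GGUF) * ~1.2 claimed MLX advantage ≈ 7.4 tok/s
# vs. ~11 tok/s reported on the M4 Max with MLX
```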