Geekerwan benchmarked Qwen2.5 7B to 72B on new M4 Pro and M4 Max chips using Ollama

[-]

Balance-@reddit (OP)

Summary: M4 Max is about 15-20% faster than M3 Max on most models. M4 Pro is about 55-60% of M4 Max or around two-thirds of M3 Max.

All slower than a 4090, as long as the models fits within memory. Only the Max models can run the 72B model at reasonable speed, around 9 tokens per second for M4 Max.

[-]

mrskeptical00@reddit

These aren’t purpose build to run LLMs - these are general purpose computers that also happen to run LLMs reasonably well. Mac Mini M4 is much faster than my Core i5-13000 with DDR5.

I upgraded my Windows PC to a 3090 a few months ago just to be able to run bigger models. It’s great for gaming too but I don’t do much of that. Had the option been available I’d have just bought a new Mac Mini.

Is it as fast as a 3090? No. Is it good enough for most people to run a model they pull from HuggingFace? Absolutely. I set my Windows PC to sleep after an hour of non-use and I need to wake it remotely if I want to access the model from my phone. With a Mac Mini, it’s efficient enough (and good enough) that I would just always leave it on and available.

[-]

Dalethedefiler00769@reddit

These aren’t purpose build to run LLMs - these are general purpose computers

Pretty much every computer is a "general purpose computer" including a non-apple PC with a GPU, not sure what your point is.

[-]

mrskeptical00@reddit

Point is they’ll always lose to an RTX GPU but you can do a lot more with a $1000 Mac Mini than a $1000 GPU.

[-]

xxPoLyGLoTxx@reddit

Only losing in terms of speed, but they can still run larger models at useable speeds.

[-]

mrskeptical00@reddit

$1000 gets you at least 24GB of VRAM. I still think you're better off spending that money on a video card (I got my 3090 for $600).

I look at the base model Mini as a great PC that can also run LLMs and idle at 3W so I never have to turn it off like I do my Windows PC.

[-]

xxPoLyGLoTxx@reddit

But sadly 24gb vram isn't that great. If you want to run the 70b model, an m4 max 128gb RAM will give you double the performance compared to a 5090. So you'd need two 5090s to match the performance of a single portable laptop. And you can't do anything but game on the 5090s and sadly two of them still doesn't improve gaming.

It's just too niche of a pathway imo.

What I'm excited for is the Max m5. If it can go beyond 128gb ram, it'll be massive.

[-]

mrskeptical00@reddit

That’s more than $1000.

[-]

xxPoLyGLoTxx@reddit

Solely to run a 70b llm? That's poppycock. You can literally do anything you want with it. It's a supercharged laptop. Video editing will be a snap, you could run multiple VMs, coding, etc.

You can't do anything other than game on a GPU. Granted a MacBook is more money, but you'd need multiple GPUs to even come close. Not to mention the power draw. Oh and you literally can't buy any new GPUs because nvidia sucks.

Project digits will be an interesting case. Let's see how that goes. But again, also pretty niche.

[-]

mrskeptical00@reddit

I didn’t have poppycock on my bingo card.

The only video editing I do is on my phone and I run multiple VMs on an old Dell Core i5-7500, I don’t need a $4700 MacBook Pro for that :)

But hey, it’s your money - whatever floats your boat!

[-]

xxPoLyGLoTxx@reddit

I'm not rushing to purchase. But my next CPU purchase will have the 70b LLM as a major factor. If I was buying right now, I'd go the macbook route. But I'm just chilling and waiting for now, rocking the 14b model lol.

[-]

ShengrenR@reddit

Ha, that's a fun party trick - I don't trust my net skills to not screw it up and host free inference for a group of amused kids half way around the world, or I'd do this too.

[-]

mrskeptical00@reddit

Look into Tailscale, it's a local VPN for all your devices. It's on my phones, iPad, PCs, servers. Each device has it's own name (magic DNS) so any of your online devices connected to Tailscale can reach any other - no need to open ports on your firewall or anything like that. Works great.

When I want to use Open WebUI on my mobile from inside the house or while out, I just open https://pc-home:3000 and I'm connected to my home PC.

[-]

NEEDMOREVRAM@reddit

Can you remotely start your AI server from your phone when away from home?

And does it show your entire desktop on your phone?

[-]

mrskeptical00@reddit

You can get an app like Remote Desktop or VNC to view your computer remotely - I’ve used those on my iPad on occasion but I don’t need to use them on my phone.

I use WakeOnLan to send a packet to the Mac of my PC. When out of the house I trigger a script on a raspberry pi that wakes my PC.

For the LLM I just use Open WebUI to access it via a browser.

[-]

randomusername44125@reddit

“As long as models fits within memory”. This is the whole value proposition of these macs. 4090 is always going to be faster. But what’s the size of the model that I can economically fit in it vs what can I fit in a 128G m4 max. I am actually planning to buy something by end of this year and so trying to do the research.

[-]

akashocx17@reddit

Does this mean m4 max (128) is better choice if we want to run larger models ? As compared yo 4080 or 4090 whose vram could be only 16 - 24 gb?

[-]

xxPoLyGLoTxx@reddit

I personally think so. A 128gb RAM macbook could devote 90gb-100gb of RAM to VRAM. That's like 3x what the 4090 has. It just won't be quite as fast, but at a certain point speed is irrelevant.

[-]

Neither_Quit4930@reddit

Looks like M4 pro should be able to generate 4-5 t/s on 64gb.

[-]

Prestigious_Elk1253@reddit

m4 pro 14+20 64G running Qwen 2.5 72b 4bit (mlx version), 2 minute test around 5.88-6.12 tokens/s, pretty good for me with 1000-2000 fans, quiet and comfy

[-]

businesskitteh@reddit

Curious what impact connecting 2 M4 Pro Mac Minis via Thunderbolt 5 would have on inference

[-]

EFG@reddit

Have two new unused m4 minis. Will try today.

[-]

krisirk@reddit

That's awesome, I look forward to the results.

[-]

EFG@reddit

Terrible shot but was very easy to get then set up. Have 4 other unused m3/m4 iMacs I’m gonna toss together with the minis for a prototype office RAG. Benchmarks don’t really matter for me if I can remotely send it instructions to execute during downtime. Actually lots of a possibilities when you can get compute cost down so much with old gear lying around.

[-]

krisirk@reddit

What does the performance look like?

[-]

EFG@reddit

No benchmarks yet, and just realizing it didn’t even attach my screenshot

https://imgur.com/a/DP0iCe7

If you have a preferred benchmarking tool, let me know, and I’ll try sometime this morning after getting the iMacs connected. If it goes well will definitely look into getting something far beefier for my server.

[-]

krisirk@reddit

The rest of the Mac M series lineup is benchmarked here, using Llama 2 7B at f16, Q8_0, and Q4_0. It would be neat to see what the prompt processing for a 512 token input and the text speed of a 128 token output are.

https://github.com/ggerganov/llama.cpp/discussions/4167

[-]

roshanpr@reddit

Is it worth it?

[-]

Neither_Quit4930@reddit

It depends on whether running 72B is important to you. I think given the price differences between m4 pro and max, the inference speed is negligible at that scale: 4-5t/s vs 8-9t/s.

[-]

roshanpr@reddit

Thank you.

[-]

NEEDMOREVRAM@reddit

You can't get an M4 Pro with 64Gb of RAM. Apple forces you to upgrade to a Max for 64GB.

[-]

AngleFun1664@reddit

Sure you can, Mac Mini, M4 Pro, 64 GB RAM for $2k

[-]

NEEDMOREVRAM@reddit

Are you talking about the Mac Mini Pro?

Because I have been unable to get 64GB of RAM added to a M4 Macbook Pro. It tops out at 48GB. I need to go to the Max to get the option of 64GB.

[-]

redundantly@reddit

The M4 tops off at 32 GB
The M4 Pro goes up to 48 GB
The M4 Max can have up to 128 GB

[-]

AngleFun1664@reddit

M4 Pro goes to 64 GB in the Mac Mini

[-]

redundantly@reddit

The person I was replying to was talking about trying to get a 64 GB MBP

Because I have been unable to get 64GB of RAM added to a M4 Macbook Pro. It tops out at 48GB.

[-]

Mochilongo@reddit

He may refer to the mac mini with m4 pro and 64gb ram

[-]

ShengrenR@reddit

They should have just run a slightly smaller quant honestly and they could have had a sane number instead of 0.02, though of course more context and precision is always nice

[-]

LanguageLoose157@reddit

What about just M4? I'm not a fan to shell out another $500 for Pro

[-]

Balance-@reddit (OP)

Probably half M4 Pro performance (about 30% of M4 Max). 14B models will be fine, 32B maybe with 32GB memory.

[-]

LanguageLoose157@reddit

Do we have BMK comparison with the new intel chips?

Intel is expected to release their 200 series H processor which are geared towards high performance. Also Intel chips got NPU.

[-]

Maxxim69@reddit

According to unofficial info, Intel’s upcoming 200 series H processors will support up to 64GB of dual-channel DDR5-5600 memory. While that's a relatively fast RAM by PC standards, it's still very slow compared to Apple's Unified RAM.

To explain in very simple terms, running an LLM on a CPU, completely or partially, you have two major bottlenecks: 1) Prompt processing speed, mostly dependent on how fast your GPU is (if any), and 2) Inference speed, dependent on how fast your RAM is. GPU and RAM in Apple's M-series systems are a lot faster than anything Intel has to offer.

And forget about NPUs (at least for a couple generations). That hardware is designed for very simple tasks (like 1/10 complexity compared to running an 8B LLM) and it may take years for it to get decent software support.

[-]

prefusernametaken@reddit

Even the new jetson?

[-]

Maxxim69@reddit

See this discussion, for instance.

[-]

LanguageLoose157@reddit

I see. I had a hunch the NPU won't be much for LLM model. The last time I saw the TLOPS, the IGPU is faster than the NPU. Nonetheless, it would awesome to see, say a 7b or 13b running off the NPU in the background all the time and leave the CPU and GPU for other tasks.

I find it absurd how useless desktop RAM is compared to VRAM in LLM space. RAM is so cheap and easily upgradable for a faster one. I don't understand this over reliance on VRAM.

[-]

Maxxim69@reddit

to see, say a 7b or 13b running off the NPU

Who knows, maybe in a few years. But then again, by that time we might be running quantized 400B models on CPUs.

I find it absurd how useless desktop RAM is compared to VRAM in LLM space.

When you get even a very basic idea of how LLMs work (matrix multiplication), GPUs vs CPUs (architecture), and the difference between RAM and VRAM (speed), it will make total sense.

[-]

ShengrenR@reddit

The cpu itself with ram won't do the trick, but it may open up more acceptable edge cases.. let you offload 1 more layer type deal, or step up one tier in quant.. if the 5090 is 32GB, fast cpu/ram could get the high iq3 types for 72B into a bearable range, at least in small context windows.

[-]

hey_listin@reddit

interesting that the 4090 would beat M4 Max on lower models but then Max outperforms on the higher model? as someone new to this, I'm wondering does anyone have an explanation?

[-]

roshanpr@reddit

Bigger models don’t fit In the 4090 so if it offloads to ram/cpu it will be super super slow

[-]

a_beautiful_rhind@reddit

This test is misleading. What quant is he running? What context?

Did 72b not fit in the memory of the A6000 ada? My 2x3090s smoke a better card? I highly doubt it. With GPUs the model has to fit in memory so why not pick a model that fits in both for an accurate comparison?

All it will tell you is that you can expect P40+20% generation speeds on M4 max and P40 speeds on M3 max. The thing you really want to know is how long context processing takes and if that is reasonable too. Nobody talks to their model for just one message.

[-]

mpasila@reddit

If you look at GGUF files at around 4bpw it's possible that A6000 also ran out of memory which would explain why it's slower. (Q4_K_M is 47.42GB which appears to be what Ollama uses)

[-]

a_beautiful_rhind@reddit

With the KVcache full context, that might do it.

[-]

mrskeptical00@reddit

It’s a silly comparison. Once the 4090 runs out of memory it’s game over. There’s no point in testing these against models that are too big to fit in memory.

[-]

MaycombBlume@reddit

4090 beats everything else in the consumer space in terms of raw compute.

But that doesn't go very far with only 24GB of VRAM. As soon as you go past the memory capacity of the 4090, you might as well just run a potato.

The rumor is that the 5090 will have a modest upgrade to 32GB. Not enough to change the game.

[-]

Historical-Pen-1296@reddit

Can't offload all 72b-Int4 weights into 24GB VRAM, half of layers offload to CPU RAM.

[-]

Ok_Warning2146@reddit

I suppose for 72B model, 4090 doesn't have enough VRAM, so quite many layers were offloaded to CPU.

In general, I think we can expect M4 Ultra to perform similarly to 4090 for inference as its RAM speed is on par with 4090. Of course, the 256GB RAM can also open you up to run llama 3.1 405B Q4_4_8.

[-]

NEEDMOREVRAM@reddit

I'm considering getting a M4 Macbook Pro 48GB of RAM.

Do you think I will be able to run a Q8 quant of a 33B model? The next option is going for the 64GB Pro Max.

[-]

thezachlandes@reddit

48 is viable. Do you also want to run a 7B code completion model or is this just for chat? Do you have any big docker images to run at the same time?

[-]

sasik520@reddit

64 is pretty affordable, its just +$200 compared to 48.

I wonder if it is worth to upgrade to 128 GB which is additional $800.

[-]

thezachlandes@reddit

In my opinion, 64GB is not a tough argument to make. I think returns begin diminishing after that, especially if you don't know how you are going to use all that RAM. Regardless, it's a personal decision. I went for it.

[-]

Round_Handle6657@reddit

thanks for the input! I’m also considering the 64GB option along with exo cluster approach for running bigger model. But the m4 pro is limited by its bandwidth, GPU cores, and its tflops relative to m4 max, making it difficult for me personally to go beyond 48GB to 64, as you said the returns begin diminishing. Yesterday i tried configuring the Mac Studio to test for the price, turns out M2 Max chips with 64GB of ram is only $400 ish dollars more, compared to m4 pro 20 gpu cores + 64GB

[-]

NEEDMOREVRAM@reddit

I'd prefer something along the lines of Gemma 27B. This is for me needing an answer quickly about anything general knowledge and not wanting to load up Claude3 or fire up my 4x3090 server (which requires me walking over to the keyboard to enter my password in Pop-OS).

For example, I spent about an hour last night talking to ChatGPT about some electronic devices I found on Aliexpress. Ideally, I would prefer to do that locally. The reason I didn't is because my motherboard is RMA and I'm stuck using ChatGPT until I get the new one sent.

Trying to get off the Anthropic and ChatGPT teet.

Outside of that...unsure what is considered "big" as far as docker goes...but I may have a SearXNG instance up and running or something else. Nothing mission critical and nothing I can't close to load up LM Studio (or OpenWeb UI).

[-]

Durian881@reddit

Seemed a bit slower than expected for both M3 Max and M4 Max. My M2 Max can generate 7-8 t/s for Qwen2.5-72B Q4 at 7-8 t/s. I had expected ~10 t/s for M4 Max and 8t/s for M3 Max.

[-]

sassydodo@reddit

can't you offload part of layers so the speed is actually better?

[-]

SandboChang@reddit

9 token isn’t bad at all, what about the prompt processing time for small context?

[-]

-6h0st-@reddit

M4 Ultra will rock the space. For 4-5k have performance of 4090 with much more vram available

[-]

HairPara@reddit

Does any know if he’s testing the 12/16 core M4 Pro or the 14/20?

[-]

GrosPoulet33@reddit

Does ollama use MLX or the CPU?

[-]

vigg_1991@reddit

How does it happen then? What’s is it using! Sorry! But I was under the assumption that if we get the m4 max or ultra , assuming all of these models are built using may be one of the known frameworks PyTorch or tensorflow for training . The stored model is probably also supporting the MLX ! Don’t mind for not understanding the better but please do help me understand this concept better. Thanks!

[-]

GrosPoulet33@reddit

An ML engineer needs to implement the calculations using the MLX framework. It would be quite a bit of work to support everything, but it's do-able.

M4 will still be great since the CPU supports vectorized operations, but with MLX you'd get a big speedup since it can use the neural engine.

You might be able to just run your model using this https://pypi.org/project/mlx-lm/ . Adding the web part serving would be trivial.

[-]

Nepherpitu@reddit

RTX4090 + RTX3090 using exllamav2 with speculative model 0.5B, everything at q4 and cache q6, gives about 30 t/s!

[-]

Beautiful_Car8681@reddit

Beginner question: can a ryzen with integrated graphics processor do something interesting like Mac when adding more ram?

[-]

Anotheeeeeeant@reddit

You will have to run it via cpu but like the other guy says it is just slow compared to mac where things are 1000000000000x more optimised. Not to mention amd hasn't invested a lot in ai stuff so it does hurt a bit.

[-]

roshanpr@reddit

Super slow

[-]

DawgZter@reddit

Looks like m4 ultra may beat 4090 on t/s

[-]

cybran3@reddit

X (Doubt)

[-]

roshanpr@reddit

Why doubt? The 4090 can’t even load 70b models

[-]

infiniteContrast@reddit

The best value for money is still the 4090 or the 3090 which is basically the same card.

[-]

roshanpr@reddit

You drunk by claiming the 4090 and the 3090 are the same card . Please 🙏 don’t spread misinformation.

[-]

mizhgun@reddit

Are you considering power consumption?

[-]

smith7018@reddit

Apple’s not able to dethrone it yet but the pricing pf the M-series is getting competitive with the 4090. The M4 Ultra next year will either tie or beat it. At that point, it’s probably only a little more expensive than a complete 4090 set up but with more VRAM. The M5 or M6 will be really interesting to watch

[-]

emprahsFury@reddit

running llama-bench on a medium sized model should be what the review outlets do when testing these new fangled ai machines. It's repeatable, it's quantifiable, it's scriptable. You can build it on the machine.

[-]

SomeOddCodeGuy@reddit

Unless I'm mistaken, the problem with Llamabench is that it doesn't represent processing speed well.

Late last year there was a big rush of folks wanting to buy Mac Studios, because the tokens per second on benchmarking tools were looking fantastic. "11 tokens per second for a 70b!" and things like that; folks loved it.

The issue is that a lot of these benching tools were using 100 token context, maybe 1000 at most. So I posted this post, showing folks the ACTUAL real world speeds at higher contexts... and you can see the comments section. A lot of folks found it simply unusable for their desires.

So unless llamabench has changed and now runs 8,000+ context tests and shows ms per token prompt processing speeds, I definitely have to disagree on using it. The numbers it presents are pretty useless for the mac otherwise.

[-]

stefan_evm@reddit

you can benchmark prompt processing speed and token generation seperately with llama-bench. my use case is mainly about prompt processing (i.e. processing large contexts / prompts) on m1 / m2 ultras and llama-bench is my favourite test

[-]

sipjca@reddit

What size model are you thinking about?

I am doing some work on tweaking llama-bench currently to add other features like sampling efficiency (by querying the accelerators driver for power information) and adding this to llamafile so it can be distributed and accelerated by default in a single file. In addition to building a open source website to host this data

[-]

jacek2023@reddit

Summary: it's bad, buy 3090 instead

[-]

roshanpr@reddit

3090 can't run 72b models

[-]

limitless_11111@reddit

2 3090 can which is still cheaper than 64gb mac

[-]

Mochilongo@reddit

The setup maybe cheaper but the electricity bill will be reducing that difference every month.

A killer benefit for the macs is that they can provide a decent performance and are portable, if portability is secondary then you can also buy an M2 Ultra for almost the same price.

[-]

btoad@reddit

My M1 Max 64GB can run Qwen2.5 72B instruct Q5_K_M (54GB~) using < 60W of power, meanwhile when I run a smaller quant of the safe model on my 2 x 3090 desktop, I have each 3090 power limited to 250W just to keep things smooth.

It's nice having the choice of slower, higher quality, and more power efficient inference versus faster, lower quality, and less power efficiency

[-]

Beneficial_Win_492@reddit

how many "token/s" do you get when running your m1 max 64gb on the Qwen2.5 72B?

[-]

roshanpr@reddit

I dont support global warming.

[-]

Covid-Plannedemic_@reddit

So you always bike to the grocery store right? And restaurants in your neighborhood? You never ever ever ever ever drive a car locally, right? And you don't eat meat either, right? Because all of those things matter 1000x more than virtue signalling about how running a consumer graphics card at half its power limit takes too much energy

[-]

roshanpr@reddit

Your woke mindset for sure can’t take a joke. My point is that the amount of heat/power required to rune the 3090 SLI setup is way more significant and that is worth considering given the footprint and efficiency of apples carbon neutral ARM devices. Shut it!

[-]

slavchungus@reddit

call that room heating and a high power bill

[-]

a_beautiful_rhind@reddit

I really wished that worked. You'd have to run them 24/7 to notice any heating.

When I tried the meme last winter, all the plants in my garage frosted and died.

[-]

slavchungus@reddit

damn well maybe its closer to a hand warmer and only if u have an i9

[-]

Rhypnic@reddit

Hmm delicious pricey dual motherboard and giant psu.

[-]

slavchungus@reddit

dont forget giant case and fans

[-]

--mrperx--@reddit

don't buy computers then.

Buy carbon offsets.

[-]

Glebun@reddit

Less memory

[-]

roshanpr@reddit

cheaper, but limited

[-]

CarretillaRoja@reddit

How does the MacBook Pro m4 pro/max compare with Windows laptops running on battery?

[-]

panthereal@reddit

Interesting how the actual performance gains here are much less significant than tools like Geekbench show.

Geekbench lists the M4 Max at an 80% improvement over the M3

Meanwhile this is at best 22%

[-]

zeeb0t@reddit

It should basically be illegal to benchmark models at tiny context. Most of the models tested would not even fit on the cards with higher context, even quantized.

[-]

InvestigatorHefty799@reddit

I'm primarily interested in long context large models, how does apple silicon perform? I'm think about the M4 Max 128gb but I don't want to commit to it if it's going to be extremely slow.

[-]

ortegaalfredo@reddit

I would like to see the batching speed of a M4 max. For heavy-duty use, batching is the speed that counts.

With 4x3090 I get 80 tok/s max using tensor-parallel and vllm. I don't know if a M4 can even do tensor-parallel.

[-]

AaronFeng47@reddit

The 72B model speed of M4 Max & M3 Max shows M4 max is still compute constrained, the tk/s speed improvements doesn't align with the ram speed improvements

[-]

Mochilongo@reddit

You can actually see the improvement from RAM speed (up to 20%), both machines have 40 GPU cores on the other hand nvidia cards have much more cores and almost double the memory bandwidth.

[-]

fallingdowndizzyvr@reddit

both machines have 40 GPU cores on the other hand nvidia cards have much more cores

Comparing the number of cores across different architectures is meaningless.

[-]

sonterklas@reddit

I used llama3 70b to make summary of the transcript especially about this topic.

Here is a concise summary of the transcript, focusing on the M4 Pro and M4 Max performance on Ollama and large language models:

M4 Pro and M4 Max Performance on Ollama and Large Language Models

The M4 Max performed well on Ollama, with an inference speed similar to the RTX 4090 (around 50-60% of its performance) at scales of 7b-32b.
The M4 Pro's performance was significantly lower, likely due to its limited GPU capabilities.
When running a 72b large language model, the M4 Max and M3 Max were the only platforms that could run smoothly without exhausting the graphics memory.
The M4 Max's unified memory of 128G allowed it to maintain a stable performance, while the RTX 4090 and RTX 6000 Ada experienced significant performance drops due to memory limitations.

Overall, the M4 Max demonstrated impressive performance on large language models, thanks to its high-capacity unified memory and powerful GPU. The M4 Pro, on the other hand, was limited by its less powerful GPU.

[-]

s20_p@reddit

No info on first token latencies?

[-]

Ulterior-Motive_@reddit

That's not bad actually. M4 Max has almost the same performance as my dual MI100s but twice the memory. And significantly less power usage, presumably. If I had an extra 5k, I'd probably go for it.