Geekerwan benchmarked Qwen2.5 7B to 72B on new M4 Pro and M4 Max chips using Ollama
Posted by Balance-@reddit | LocalLLaMA | View on Reddit | 101 comments
Source: https://youtu.be/2jEdpCMD5E8?t=796
Balance-@reddit (OP)
Summary: M4 Max is about 15-20% faster than M3 Max on most models. M4 Pro is about 55-60% of M4 Max or around two-thirds of M3 Max.
All are slower than a 4090, as long as the model fits in its memory. Only the Max models can run the 72B model at a reasonable speed, around 9 tokens per second on the M4 Max.
Neither_Quit4930@reddit
Looks like the M4 Pro should be able to generate 4-5 t/s with 64GB.
businesskitteh@reddit
Curious what impact connecting 2 M4 Pro Mac Minis via Thunderbolt 5 would have on inference
EFG@reddit
Have two new unused m4 minis. Will try today.
krisirk@reddit
That's awesome, I look forward to the results.
EFG@reddit
Terrible shot, but it was very easy to get them set up. I have 4 other unused M3/M4 iMacs I'm gonna toss together with the minis for a prototype office RAG. Benchmarks don't really matter for me if I can remotely send it instructions to execute during downtime. Actually, lots of possibilities when you can get compute costs down so much with old gear lying around.
krisirk@reddit
What does the performance look like?
EFG@reddit
No benchmarks yet, and just realizing it didn’t even attach my screenshot
https://imgur.com/a/DP0iCe7
If you have a preferred benchmarking tool, let me know and I'll try it sometime this morning after getting the iMacs connected. If it goes well, I'll definitely look into getting something far beefier for my server.
krisirk@reddit
The rest of the Mac M series lineup is benchmarked here, using Llama 2 7B at f16, Q8_0, and Q4_0. It would be neat to see the prompt processing speed for a 512-token input and the text generation speed for a 128-token output.
https://github.com/ggerganov/llama.cpp/discussions/4167
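If it helps, here is a rough sketch of how those numbers get produced: a thin Python wrapper around llama-bench using the same 512-token prompt / 128-token generation settings. The binary and model paths are placeholders; adjust them to your build.

```python
import subprocess

# Placeholder paths; point these at your llama.cpp build and GGUF file.
LLAMA_BENCH = "./llama.cpp/llama-bench"
MODEL = "models/llama-2-7b.Q4_0.gguf"

# -p 512 : prompt-processing test with a 512-token input
# -n 128 : text-generation test producing 128 tokens
# -ngl 99: offload all layers to the GPU (Metal on Apple Silicon)
result = subprocess.run(
    [LLAMA_BENCH, "-m", MODEL, "-p", "512", "-n", "128", "-ngl", "99"],
    capture_output=True, text=True, check=True,
)
print(result.stdout)  # a small table with separate pp512 and tg128 t/s rows
```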
roshanpr@reddit
Is it worth it?
Neither_Quit4930@reddit
It depends on whether running 72B is important to you. I think given the price differences between m4 pro and max, the inference speed is negligible at that scale: 4-5t/s vs 8-9t/s.
roshanpr@reddit
Thank you.
NEEDMOREVRAM@reddit
You can't get an M4 Pro with 64GB of RAM. Apple forces you to upgrade to a Max for 64GB.
AngleFun1664@reddit
Sure you can, Mac Mini, M4 Pro, 64 GB RAM for $2k
NEEDMOREVRAM@reddit
Are you talking about the Mac Mini Pro?
Because I have been unable to get 64GB of RAM added to an M4 MacBook Pro. It tops out at 48GB. I need to go to the Max to get the option of 64GB.
redundantly@reddit
The M4 tops off at 32 GB
The M4 Pro goes up to 48 GB
The M4 Max can have up to 128 GB
AngleFun1664@reddit
M4 Pro goes to 64 GB in the Mac Mini
redundantly@reddit
The person I was replying to was talking about trying to get a 64 GB MBP
Mochilongo@reddit
He may refer to the mac mini with m4 pro and 64gb ram
ShengrenR@reddit
Honestly, they should have just run a slightly smaller quant and they could have had a sane number instead of 0.02, though of course more context and precision are always nice.
randomusername44125@reddit
“As long as the model fits within memory.” This is the whole value proposition of these Macs. A 4090 is always going to be faster. But what matters is the size of the model I can economically fit in it versus what I can fit in a 128GB M4 Max. I'm actually planning to buy something by the end of this year, so I'm trying to do the research.
akashocx17@reddit
Does this mean the M4 Max (128GB) is the better choice if we want to run larger models? As compared to a 4080 or 4090, whose VRAM is only 16-24GB?
hey_listin@reddit
Interesting that the 4090 beats the M4 Max on the smaller models, but then the Max outperforms it on the bigger model. As someone new to this, I'm wondering: does anyone have an explanation?
roshanpr@reddit
Bigger models don't fit in the 4090, so it offloads to RAM/CPU and gets super, super slow.
a_beautiful_rhind@reddit
This test is misleading. What quant is he running? What context?
Did the 72B not fit in the memory of the A6000 Ada? My 2x3090s smoke a better card? I highly doubt it. With GPUs the model has to fit in memory, so why not pick a model that fits in both for an accurate comparison?
All it will tell you is that you can expect P40+20% generation speeds on M4 max and P40 speeds on M3 max. The thing you really want to know is how long context processing takes and if that is reasonable too. Nobody talks to their model for just one message.
mpasila@reddit
If you look at GGUF files at around 4bpw, it's possible the A6000 also ran out of memory, which would explain why it's slower. (Q4_K_M is 47.42GB, which appears to be what Ollama uses.)
a_beautiful_rhind@reddit
With the KV cache at full context, that might do it.
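For a rough sense of the numbers, here is a back-of-the-envelope sketch, assuming Qwen2.5-72B's published config (80 layers, 8 KV heads of dim 128 under GQA) and an fp16 KV cache at the full 32k context. Treat the architecture constants as assumptions rather than verified values.

```python
# Rough memory estimate for Qwen2.5-72B Q4_K_M on a 48 GB card.
weights_gb = 47.42              # Q4_K_M GGUF size quoted above

# Assumed model config: 80 layers, 8 KV heads (GQA), head dim 128, 32k context.
n_layers, n_kv_heads, head_dim = 80, 8, 128
ctx_tokens = 32_768
bytes_per_elem = 2              # fp16 K and V entries

# K and V caches: 2 tensors * layers * kv_heads * head_dim * context * 2 bytes
kv_cache_gb = 2 * n_layers * n_kv_heads * head_dim * ctx_tokens * bytes_per_elem / 1e9

print(f"KV cache ~= {kv_cache_gb:.1f} GB, total ~= {weights_gb + kv_cache_gb:.1f} GB")
# Roughly 10.7 GB of KV cache on top of 47.4 GB of weights: about 58 GB before
# compute buffers, so a 48 GB card would have to spill layers to system RAM.
```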
mrskeptical00@reddit
It’s a silly comparison. Once the 4090 runs out of memory it’s game over. There’s no point in testing these against models that are too big to fit in memory.
MaycombBlume@reddit
4090 beats everything else in the consumer space in terms of raw compute.
But that doesn't go very far with only 24GB of VRAM. As soon as you go past the memory capacity of the 4090, you might as well just run a potato.
The rumor is that the 5090 will have a modest upgrade to 32GB. Not enough to change the game.
Historical-Pen-1296@reddit
You can't fit all the 72B Int4 weights into 24GB of VRAM, so half the layers get offloaded to CPU RAM.
Ok_Warning2146@reddit
I suppose for the 72B model the 4090 doesn't have enough VRAM, so quite a few layers were offloaded to the CPU.
In general, I think we can expect the M4 Ultra to perform similarly to the 4090 for inference, as its RAM speed is on par with the 4090's. Of course, 256GB of RAM also opens you up to running Llama 3.1 405B Q4_4_8.
NEEDMOREVRAM@reddit
I'm considering getting an M4 MacBook Pro with 48GB of RAM.
Do you think I'll be able to run a Q8 quant of a 33B model? The next option is going up to the 64GB M4 Max.
thezachlandes@reddit
48 is viable. Do you also want to run a 7B code completion model or is this just for chat? Do you have any big docker images to run at the same time?
sasik520@reddit
64 is pretty affordable, it's just +$200 compared to 48.
I wonder if it's worth upgrading to 128GB, which is an additional $800.
thezachlandes@reddit
In my opinion, 64GB is not a tough argument to make. I think returns begin diminishing after that, especially if you don't know how you are going to use all that RAM. Regardless, it's a personal decision. I went for it.
Round_Handle6657@reddit
Thanks for the input! I'm also considering the 64GB option, along with an exo cluster approach for running bigger models. But the M4 Pro is limited by its bandwidth, GPU cores, and TFLOPS relative to the M4 Max, making it difficult for me personally to go from 48GB to 64GB; as you said, the returns begin diminishing. Yesterday I tried configuring a Mac Studio to check the price, and it turns out an M2 Max with 64GB of RAM is only about $400 more than an M4 Pro with 20 GPU cores and 64GB.
NEEDMOREVRAM@reddit
I'd prefer something along the lines of Gemma 27B. This is for when I need a quick answer about general knowledge and don't want to load up Claude 3 or fire up my 4x3090 server (which requires walking over to the keyboard to enter my password in Pop!_OS).
For example, I spent about an hour last night talking to ChatGPT about some electronic devices I found on AliExpress. Ideally, I would prefer to do that locally. The reason I didn't is that my motherboard is out for RMA and I'm stuck using ChatGPT until the new one arrives.
Trying to get off the Anthropic and ChatGPT teat.
Outside of that... unsure what's considered "big" as far as Docker goes... but I may have a SearXNG instance running or something else. Nothing mission critical, and nothing I can't close to load up LM Studio (or Open WebUI).
LanguageLoose157@reddit
What about the plain M4? I'm not keen to shell out another $500 for the Pro.
Balance-@reddit (OP)
Probably half the M4 Pro's performance (about 30% of the M4 Max). 14B models will be fine; 32B maybe, with 32GB of memory.
LanguageLoose157@reddit
Do we have benchmark comparisons with the new Intel chips?
Intel is expected to release their 200 series H processors, which are geared towards high performance. Also, Intel chips have an NPU.
Maxxim69@reddit
According to unofficial info, Intel's upcoming 200 series H processors will support up to 64GB of dual-channel DDR5-5600 memory. While that's relatively fast RAM by PC standards, it's still very slow compared to Apple's unified memory.
To explain in very simple terms: when running an LLM on a CPU, completely or partially, you have two major bottlenecks: 1) prompt processing speed, which mostly depends on how fast your GPU is (if any), and 2) inference speed, which depends on how fast your RAM is. The GPU and RAM in Apple's M-series systems are a lot faster than anything Intel has to offer (rough numbers in the sketch below).
And forget about NPUs (at least for a couple of generations). That hardware is designed for very simple tasks (like 1/10 the complexity of running an 8B LLM), and it may take years for it to get decent software support.
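To put very rough numbers on the bandwidth point: a minimal sketch, where the bandwidth figures are approximate spec-sheet values and the model size assumes a ~47 GB Q4 quant of a 72B model.

```python
# Token generation is roughly memory-bandwidth bound: each new token streams
# (nearly) all of the weights through the memory bus once, so an upper bound
# on generation speed is bandwidth / model size.
model_gb = 47.0  # ~Q4 quant of a 72B model

bandwidth_gb_s = {
    "DDR5-5600 dual channel (Intel laptop)": 90,    # ~89.6 GB/s
    "M4 Pro":                                 273,
    "M4 Max (40-core GPU)":                   546,
    "RTX 4090":                              1008,  # hypothetical: 47 GB won't fit in 24 GB VRAM
}

for name, bw in bandwidth_gb_s.items():
    print(f"{name:40s} ~{bw / model_gb:5.1f} t/s ceiling")
# The M4 Max ceiling (~11.6 t/s) lines up with the ~9 t/s measured in the video,
# while dual-channel DDR5 tops out around 2 t/s for the same model.
```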
LanguageLoose157@reddit
I see. I had a hunch the NPU wouldn't be much use for LLMs. The last time I looked at the TFLOPS, the iGPU was faster than the NPU. Nonetheless, it would be awesome to see, say, a 7B or 13B running off the NPU in the background all the time, leaving the CPU and GPU free for other tasks.
I find it absurd how useless desktop RAM is compared to VRAM in the LLM space. RAM is so cheap and easy to upgrade to something faster. I don't understand this over-reliance on VRAM.
Maxxim69@reddit
Who knows, maybe in a few years. But then again, by that time we might be running quantized 400B models on CPUs.
When you get even a very basic idea of how LLMs work (matrix multiplication), GPUs vs CPUs (architecture), and the difference between RAM and VRAM (speed), it will make total sense.
ShengrenR@reddit
The CPU itself with RAM won't do the trick, but it may open up more acceptable edge cases: a "let you offload one more layer" type of deal, or stepping up one tier in quant. If the 5090 is 32GB, fast CPU/RAM could get the higher IQ3 quants of a 72B into a bearable range, at least at small context windows.
mrskeptical00@reddit
These aren't purpose-built to run LLMs; they're general-purpose computers that also happen to run LLMs reasonably well. The Mac Mini M4 is much faster than my Core i5-13000 with DDR5.
I upgraded my Windows PC to a 3090 a few months ago just to be able to run bigger models. It’s great for gaming too but I don’t do much of that. Had the option been available I’d have just bought a new Mac Mini.
Is it as fast as a 3090? No. Is it good enough for most people to run a model they pull from HuggingFace? Absolutely. I set my Windows PC to sleep after an hour of non-use and I need to wake it remotely if I want to access the model from my phone. With a Mac Mini, it’s efficient enough (and good enough) that I would just always leave it on and available.
ShengrenR@reddit
Ha, that's a fun party trick - I don't trust my net skills to not screw it up and host free inference for a group of amused kids half way around the world, or I'd do this too.
mrskeptical00@reddit
Look into Tailscale, it's a local VPN for all your devices. It's on my phones, iPad, PCs, servers. Each device has its own name (MagicDNS), so any of your online devices connected to Tailscale can reach any other. No need to open ports on your firewall or anything like that. Works great.
When I want to use Open WebUI on my mobile from inside the house or while out, I just open https://pc-home:3000 and I'm connected to my home PC.
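The same trick works if you want to script against the model instead of using the web UI. A minimal sketch, assuming a hypothetical MagicDNS hostname of pc-home, Ollama on its default port, and OLLAMA_HOST configured so it listens beyond localhost.

```python
import requests

# Ollama's HTTP API, reached over Tailscale via the machine's MagicDNS name.
url = "http://pc-home:11434/api/generate"
payload = {
    "model": "qwen2.5:72b",
    "prompt": "Summarize the trade-offs between an M4 Max and a 4090 for local LLMs.",
    "stream": False,
}

r = requests.post(url, json=payload, timeout=300)
r.raise_for_status()
data = r.json()

print(data["response"])
# The response includes nanosecond timing counters, so tokens/sec falls out directly:
print(data["eval_count"] / (data["eval_duration"] / 1e9), "t/s generation")
```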
NEEDMOREVRAM@reddit
Can you remotely start your AI server from your phone when away from home?
And does it show your entire desktop on your phone?
mrskeptical00@reddit
You can get an app like Remote Desktop or VNC to view your computer remotely - I’ve used those on my iPad on occasion but I don’t need to use them on my phone.
I use WakeOnLAN to send a magic packet to the MAC address of my PC. When out of the house, I trigger a script on a Raspberry Pi that wakes my PC.
For the LLM I just use Open WebUI to access it via a browser.
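For anyone who wants to skip a dedicated app, the wake-on-LAN packet itself is easy to send from any always-on box like that Pi. A minimal sketch; the MAC address below is a placeholder.

```python
import socket

def wake(mac: str, broadcast: str = "255.255.255.255", port: int = 9) -> None:
    """Send a wake-on-LAN magic packet: 6 bytes of 0xFF, then the MAC repeated 16 times."""
    mac_bytes = bytes.fromhex(mac.replace(":", "").replace("-", ""))
    packet = b"\xff" * 6 + mac_bytes * 16
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
        s.setsockopt(socket.SOL_SOCKET, socket.SO_BROADCAST, 1)
        s.sendto(packet, (broadcast, port))

wake("AA:BB:CC:DD:EE:FF")  # placeholder MAC of the PC to wake
```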
Dalethedefiler00769@reddit
Pretty much every computer is a "general purpose computer" including a non-apple PC with a GPU, not sure what your point is.
mrskeptical00@reddit
Point is they’ll always lose to an RTX GPU but you can do a lot more with a $1000 Mac Mini than a $1000 GPU.
Durian881@reddit
Seemed a bit slower than expected for both the M3 Max and M4 Max. My M2 Max can generate 7-8 t/s for Qwen2.5-72B Q4. I had expected ~10 t/s for the M4 Max and ~8 t/s for the M3 Max.
sassydodo@reddit
Can't you offload part of the layers so the speed is actually better?
SandboChang@reddit
9 tokens/s isn't bad at all. What about the prompt processing time for a small context?
Nepherpitu@reddit
RTX 4090 + RTX 3090 using exllamav2 with a 0.5B speculative model, everything at Q4 and cache at Q6, gives about 30 t/s!
Beautiful_Car8681@reddit
Beginner question: can a Ryzen with an integrated GPU do something interesting like the Macs if you add more RAM?
Anotheeeeeeant@reddit
You'll have to run it on the CPU, but like the other guy says, it's just slow compared to a Mac, where things are far more optimised. Not to mention AMD hasn't invested a lot in AI stuff, so that hurts a bit.
roshanpr@reddit
Super slow
DawgZter@reddit
Looks like m4 ultra may beat 4090 on t/s
cybran3@reddit
X (Doubt)
roshanpr@reddit
Why doubt? The 4090 can’t even load 70b models
infiniteContrast@reddit
The best value for money is still the 4090 or the 3090 which is basically the same card.
roshanpr@reddit
You must be drunk to claim the 4090 and the 3090 are the same card. Please 🙏 don't spread misinformation.
mizhgun@reddit
Are you considering power consumption?
smith7018@reddit
Apple's not able to dethrone it yet, but the pricing of the M-series is getting competitive with the 4090. The M4 Ultra next year will either tie or beat it. At that point, it's probably only a little more expensive than a complete 4090 setup but with more VRAM. The M5 or M6 will be really interesting to watch.
emprahsFury@reddit
Running llama-bench on a medium-sized model should be what the review outlets do when testing these newfangled AI machines. It's repeatable, it's quantifiable, it's scriptable. You can build it on the machine itself.
SomeOddCodeGuy@reddit
Unless I'm mistaken, the problem with llama-bench is that it doesn't represent processing speed well.
Late last year there was a big rush of folks wanting to buy Mac Studios, because the tokens per second on benchmarking tools were looking fantastic. "11 tokens per second for a 70B!" and things like that; folks loved it.
The issue is that a lot of these benchmarking tools were using 100 tokens of context, maybe 1,000 at most. So I posted this post, showing folks the ACTUAL real-world speeds at higher contexts... and you can see the comments section. A lot of folks found it simply unusable for their purposes.
So unless llama-bench has changed and now runs 8,000+ context tests and shows ms-per-token prompt processing speeds, I definitely have to disagree on using it. The numbers it presents are pretty useless for the Mac otherwise.
stefan_evm@reddit
you can benchmark prompt processing speed and token generation separately with llama-bench. my use case is mainly about prompt processing (i.e. processing large contexts / prompts) on m1 / m2 ultras and llama-bench is my favourite test
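To SomeOddCodeGuy's point, one way to surface the long-context behaviour is to sweep the prompt size rather than run a single pp512 pass. A rough sketch reusing the placeholder paths from the earlier llama-bench example:

```python
import subprocess

LLAMA_BENCH = "./llama.cpp/llama-bench"       # placeholder path, as before
MODEL = "models/qwen2.5-72b-q4_k_m.gguf"      # placeholder path

# llama-bench already reports prompt processing (pp) and generation (tg)
# separately; sweeping -p exposes how prompt speed scales with context size.
for n_prompt in (512, 2048, 8192):
    out = subprocess.run(
        [LLAMA_BENCH, "-m", MODEL, "-p", str(n_prompt), "-n", "128", "-ngl", "99"],
        capture_output=True, text=True, check=True,
    ).stdout
    print(out)  # pp<N> row = prompt t/s at that context, tg128 row = generation t/s
```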
sipjca@reddit
What size model are you thinking about?
I'm currently doing some work on tweaking llama-bench to add other features, like sampling efficiency (by querying the accelerator's driver for power information), and adding this to llamafile so it can be distributed and accelerated by default in a single file, in addition to building an open-source website to host this data.
jacek2023@reddit
Summary: it's bad, buy 3090 instead
roshanpr@reddit
3090 can't run 72b models
limitless_11111@reddit
Two 3090s can, which is still cheaper than a 64GB Mac.
Mochilongo@reddit
The setup may be cheaper, but the electricity bill will be eating into that difference every month.
A killer benefit of the Macs is that they provide decent performance and are portable; if portability is secondary, then you can also buy an M2 Ultra for almost the same price.
btoad@reddit
My M1 Max 64GB can run Qwen2.5 72B Instruct Q5_K_M (~54GB) using under 60W of power; meanwhile, when I run a smaller quant of the same model on my 2x3090 desktop, I have each 3090 power-limited to 250W just to keep things smooth.
It's nice having the choice between slower, higher-quality, more power-efficient inference and faster, lower-quality, less power-efficient inference.
Beneficial_Win_492@reddit
how many "token/s" do you get when running your m1 max 64gb on the Qwen2.5 72B?
roshanpr@reddit
I dont support global warming.
Covid-Plannedemic_@reddit
So you always bike to the grocery store right? And restaurants in your neighborhood? You never ever ever ever ever drive a car locally, right? And you don't eat meat either, right? Because all of those things matter 1000x more than virtue signalling about how running a consumer graphics card at half its power limit takes too much energy
roshanpr@reddit
Your woke mindset for sure can't take a joke. My point is that the amount of heat/power required to run the 3090 SLI setup is way more significant, and that's worth considering given the footprint and efficiency of Apple's carbon-neutral ARM devices. Shut it!
slavchungus@reddit
call that room heating and a high power bill
a_beautiful_rhind@reddit
I really wish that worked. You'd have to run them 24/7 to notice any heating.
When I tried the meme last winter, all the plants in my garage frosted and died.
slavchungus@reddit
damn well maybe its closer to a hand warmer and only if u have an i9
Rhypnic@reddit
Hmm delicious pricey dual motherboard and giant psu.
slavchungus@reddit
dont forget giant case and fans
--mrperx--@reddit
don't buy computers then.
Buy carbon offsets.
Glebun@reddit
Less memory
roshanpr@reddit
cheaper, but limited
CarretillaRoja@reddit
How does the MacBook Pro m4 pro/max compare with Windows laptops running on battery?
panthereal@reddit
Interesting how the actual performance gains here are much less significant than tools like Geekbench show.
Geekbench lists the M4 Max at an 80% improvement over the M3
Meanwhile this is at best 22%
GrosPoulet33@reddit
Does ollama use MLX or the CPU?
zeeb0t@reddit
It should basically be illegal to benchmark models at tiny context. Most of the models tested would not even fit on the cards with higher context, even quantized.
InvestigatorHefty799@reddit
I'm primarily interested in long-context large models; how does Apple silicon perform? I'm thinking about the M4 Max 128GB, but I don't want to commit to it if it's going to be extremely slow.
ortegaalfredo@reddit
I would like to see the batching speed of an M4 Max. For heavy-duty use, batching is the speed that counts.
With 4x3090 I get 80 tok/s max using tensor parallel and vLLM. I don't know if an M4 can even do tensor parallel.
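For reference, the batching setup described above boils down to something like this in vLLM. The model ID and settings are my guesses (an AWQ 4-bit checkpoint so the 72B fits in 4x24 GB), not the commenter's exact config.

```python
from vllm import LLM, SamplingParams

# Tensor parallelism splits every layer across the 4 GPUs; the AWQ 4-bit
# checkpoint is assumed so the 72B weights fit in 4x24 GB of VRAM.
llm = LLM(
    model="Qwen/Qwen2.5-72B-Instruct-AWQ",  # assumed quantized checkpoint
    tensor_parallel_size=4,
    max_model_len=8192,
)

params = SamplingParams(temperature=0.7, max_tokens=256)
prompts = [f"Write a haiku about GPU number {i}." for i in range(32)]

# Aggregate throughput numbers like 80 t/s come from batching many requests
# at once, not from any single stream being that fast.
outputs = llm.generate(prompts, params)
for out in outputs[:2]:
    print(out.outputs[0].text)
```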
AaronFeng47@reddit
The 72B model speeds of the M4 Max and M3 Max show the M4 Max is still compute-constrained; the t/s improvements don't align with the RAM speed improvements.
Mochilongo@reddit
You can actually see the improvement from RAM speed (up to 20%); both machines have 40 GPU cores. Nvidia cards, on the other hand, have far more cores and almost double the memory bandwidth.
fallingdowndizzyvr@reddit
Comparing the number of cores across different architectures is meaningless.
sonterklas@reddit
I used Llama 3 70B to make a summary of the transcript, specifically about this topic.
Here is a concise summary of the transcript, focusing on the M4 Pro and M4 Max performance on Ollama and large language models:
M4 Pro and M4 Max Performance on Ollama and Large Language Models
Overall, the M4 Max demonstrated impressive performance on large language models, thanks to its high-capacity unified memory and powerful GPU. The M4 Pro, on the other hand, was limited by its less powerful GPU.
thezachlandes@reddit
It's looking more and more like the (two-weeks-away) 32B Qwen2.5 Coder is going to be the sweet spot for local development on the new M4 Max. And 72B will work fast enough for general-purpose chat!
bharattrader@reddit
I am planning for a Mac Mini M4 with 32 GB.
Dead_Internet_Theory@reddit
Honestly, Apple is doing what neither AMD nor Intel could do.
Compete with Nvidia.
s20_p@reddit
No info on first token latencies?
Ulterior-Motive_@reddit
That's not bad, actually. The M4 Max has almost the same performance as my dual MI100s but twice the memory. And significantly less power usage, presumably. If I had an extra $5k, I'd probably go for it.