Choosing a Mac Mini for local LLMs — what would YOU actually buy?
Posted by Kindly_Sky_1165@reddit | LocalLLaMA | View on Reddit | 52 comments
Got three options on my radar and genuinely can't decide. Not looking for spec sheets — want to hear from people actually running this stuff daily:
M4 (32GB) — newest but apparently the slowest of the three for inference?
M2 Pro (32GB) — solid used option, heard it actually beats the base M4 on tok/s
M1 Max (64GB) — oldest chip but highest memory bandwidth, seems like the hidden gem
Running Ollama, coding assistants (Qwen/Kimi), maybe some RAG pipelines. Budget is $2–3k so I'm not totally screwed on options. And yeah obv openclaw to stop spending on closed models.
The big thing holding me back: there are strong rumours that Apple is dropping an M5 Mac Mini and M5 Mac Studio around WWDC 2026. Apparently stock on current models is already drying up (4–5 month wait times in some configs). So do I pull the trigger now or sit tight a few more months?
What are you using? And if you were buying today, would you wait for the M5 or just grab the M4 Pro 48GB and get to work?
Kindly_Sky_1165@reddit (OP)
thanks everyone, learned a ton:
appreciate all the input 🙏
LieStandard5398@reddit
do you have any idea when Apple will release the M5 Studio? thx!
Kindly_Sky_1165@reddit (OP)
no idea, but all we hear are rumours saying somewhere around July
LieStandard5398@reddit
thx
_derpiii_@reddit
could you explain the TB5 clustering? That’s news to me.
Kindly_Sky_1165@reddit (OP)
basically with TB5 you can connect multiple macs together and they act as one machine for running models. so you could start with one M4 Pro Mini and add another later to double your RAM/compute. older chips don't have TB5 so whatever you buy is all you get forever
_derpiii_@reddit
Ohhh! I thought that was a proprietary connection, didn't realize it was TB5!!
I hope it applies to macbooks too :)
Mean-Elk-8379@reddit
If your use case is agentic coding or anything tool-heavy, prioritize unified memory over raw cores. 64GB is the practical floor; 32GB gets you a Q4 of a 30B and not much headroom for context + OS + IDE. A Mac mini with 64GB is the best price/perf for staying local on 30-35B class models with decent ctx. If you can wait for the next M5 refresh the bandwidth jump is the real upgrade.
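Rough napkin math for the fit question (all my own assumptions: quantized weights ≈ params × bits-per-weight / 8, a hand-wavy KV-cache figure, and ~8GB reserved for macOS + IDE):

```python
# Napkin math: will a quantized model + context fit in unified memory?
# Every number here is a rough assumption, not a measurement.

def headroom_gb(ram_gb, params_b, bits_per_weight=4.5, ctx_tokens=32_768,
                kv_mb_per_token=0.25, reserve_gb=8):
    weights_gb = params_b * bits_per_weight / 8        # ~30B @ Q4 -> ~17 GB
    kv_gb = ctx_tokens * kv_mb_per_token / 1024        # KV cache grows with context
    return ram_gb - (weights_gb + kv_gb + reserve_gb)  # reserve = macOS + IDE + apps

for ram in (32, 64):
    print(f"{ram}GB Mac, 30B @ ~Q4, 32k ctx: ~{headroom_gb(ram, 30):.0f}GB headroom")
```

With those assumptions a 32GB machine is already at or below zero headroom at long context, which is why 64GB is the practical floor.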
Kindly_Sky_1165@reddit (OP)
Would you buy a Studio or a Mini, if you had the option?
Miserable-Dare5090@reddit
studio, hands down. Right now with the prices you have very little option at 2-3k. Most people buying a Mac Mini for lobsters don't run it with a local model; you'll wait all day for a single task on the Mini's bandwidth. But TB5 clusters work for Macs, so later when you get the itch (and you will, for sure) you can use an M4 Pro Mini with beefier machines as a cluster. The older chips/models do not have TB5, so whatever they have in terms of RAM and compute, that's going to be it for their functionality.
Kindly_Sky_1165@reddit (OP)
Oh, didn't know TB5 lets you cluster Macs together. The older chips are a hard pass then.
Miserable-Dare5090@reddit
m3 ultras have TB5.
Lhurgoyf069@reddit
You sure about all M5? Macbook Air M5 has TB4
Miserable-Dare5090@reddit
Yes, you can do a distributed/MLX ring over TB4 but not RDMA. The Air usually has one level lower tech on it. The M4 Mini didn't have TB5 until you get to the Pro chip. Ultra chips carry 4-6 TB ports and were TB4 on M2, TB5 on M3, probably on the M5 Ultra.
Kindly_Sky_1165@reddit (OP)
I hadn't thought about how much context + OS + IDE overhead eats into what's actually available for the model. Thanks for this and more to consider.
Miserable-Dare5090@reddit
On a Mac, plan to reserve ~8 GB for the system, so a 36GB machine is effectively 27-ish GB of VRAM, etc.
alphatrad@reddit
Unified memory is not what you should prioritize if you're doing tool-heavy anything. You want speed - so stuff gets done in a timely manner. Not in hours.
Durian881@reddit
Memory and context window are still important though, especially when you are running subagents concurrently.
alphatrad@reddit
While true - are you running 8 agents in parallel at 20tps ? How's that working out?
Durian881@reddit
Subagents given focused context often achieve better outcome vs a single main agent trying to do everything.
So far, I limit to one main and 3 subagents on my 64GB M2 Max and Qwen3.6-35B-3A (with 262k context) was able to handle it pretty well.
Would your use cases benefit from subagents or a bigger context window? If not, saving on memory and focusing on cores can make sense.
instant_king@reddit
32GB is not enough in 2026 if you are interested in LLMs
Kindly_Sky_1165@reddit (OP)
agreed and this is where I landed
peppeg@reddit
It’s a tough balancing act between VRAM capacity and memory bandwidth. Sure, GPUs are incredibly fast, but in today’s market, an RTX 5090 costs around €4,000 and still leaves you with only 32GB of VRAM.
If you’re aiming for 27B dense models or 30B MoEs, you need more room. If you can’t fit the entire model (weights plus KV cache) into VRAM, your performance will tank immediately. Of course, you could rig up four 5090s and go pro... :D but then you're looking at insane power draw and heat.
That’s why I found the M4 Pro Mac Mini with 64GB RAM to be the ultimate sweet spot. While its 273 GB/s bandwidth isn't on par with a top-tier discrete GPU, it's plenty for smooth inference. You can comfortably load larger models with a decent context window while drawing a ridiculous 40W. Even at 15-20 t/s, you can just leave it running 24/7 without worrying about the electric bill.
This is the conclusion I've reached after weighing the options. I’m currently holding out for the M5 Pro pricing, but the M4 Pro is already a beast for this.
Regarding the M1/M2 models mentioned in the thread: keep in mind that the base/Pro versions of those chips have significantly lower bandwidth. Even with more RAM, you’d likely see a much lower token generation speed compared to the M4 Pro architecture.
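To put that 40W in perspective, here's the back-of-envelope running cost I used (the €/kWh rate and the GPU-box wattage are assumptions, plug in your own):

```python
# Rough 24/7 running cost at the wall; wattages and price are assumptions.
price_per_kwh = 0.30                      # assumed rate, adjust for your region
for name, watts in [("M4 Pro Mini (inference)", 40),
                    ("hypothetical ~450W GPU box", 450)]:
    kwh_year = watts / 1000 * 24 * 365
    print(f"{name}: ~{kwh_year:.0f} kWh/year, ~€{kwh_year * price_per_kwh:.0f}/year")
```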
ketosoy@reddit
Did you look at 64GB M1 Max chips? On paper the 400GB/s bandwidth suggests it would punch way above its weight class on inference.
SolitaryShark@reddit
considering how much cheaper an M1 Max 64GB is compared to an M5 Pro 64GB (literally half the price), plus the memory bandwidth being higher, is the M1 Max a good choice?
alphatrad@reddit
Not a Mac Mini. Ever.
Look, I'm a Mac guy. But I just wouldn't run them. Can you? Sure... but they're not fast, because of their low memory bandwidth. This is the problem with the Mac hype - they've given people totally bad information.
People ask why I insist on GPUs and not Mac Studios/Mac Minis? Yes, you can buy an M3 Mac Studio Ultra with 512GB of unified memory and load a massive model and have it spit out tokens at 2 per second.
Super not useful unless you want to wait forever.
Ever notice how the Mac grifters are always talking about running local models overnight? Yeah, because they're slow. No one is going to wait 8hrs for a component to be updated by their agent.
A Mac Mini vs a 3090:
- RTX 3090 is noticeably faster, like 20-40% higher tps
- Nemotron-3-Nano 4B: RTX 3090 = 187 tok/s vs. Mac Mini M4 = 25 tok/s
- General 7B–13B or small 33B Q4/Q5: 3090 build wins by 20-40%.
- Qwen3-30B (older M3 Ultra vs 3090): 3090 edged out on token generation in most tests.
- Mac Studio M4 Max = 65 tps vs. much faster RTX 5090 at 240 tps!!!
TLDR:
This is repeatedly called out as the core limiter for Apple Silicon in inference:
- Mac Mini M4 (base): 120 GB/s <--- slow as poo!
- Mac Mini M4 Pro: 273 GB/s <--- still slower than a 3090 !!!
- Mac Mini M4 Max / Studio: up to 546 GB/s
- RTX 3090: 936 GB/s (GDDR6X)
- For context, newer RTX 5090 hits 1,792 GB/s.
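Napkin math on why that list matters (this assumes decode is purely bandwidth-bound: each new token reads the active weights once, so tok/s ≈ bandwidth / model size; real numbers land below this ceiling):

```python
# Theoretical decode ceiling from memory bandwidth alone.
# model_gb assumes a ~30B dense model at ~Q4 (~17 GB); smaller/MoE models scale accordingly.
model_gb = 17
for name, bw in [("Mac Mini M4 base", 120), ("Mac Mini M4 Pro", 273),
                 ("M4 Max / Studio", 546), ("RTX 3090", 936), ("RTX 5090", 1792)]:
    print(f"{name:18s} ~{bw / model_gb:5.1f} tok/s ceiling")
```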
SkyFeistyLlama8@reddit
You're missing the prompt processing speed. Anything before M5 will have glacial prefill/PP speeds compared to a 40xx or 50xx GPU.
Then again, I've also got a unified RAM machine and I stick to MOEs mainly. Qwen 35B and Gemma 26B, or Qwen Next 80B if I can spare the RAM.
Kindly_Sky_1165@reddit (OP)
good point on prefill, hadn't thought about that. do you notice the slow PP much in day to day use or only with really long contexts? and what is your setup? still trying to figure out the ideal setup that doesn't lag, so I don't need to spend $$ again and again to fix mistakes
SkyFeistyLlama8@reddit
I notice it every single time. I run a simpler coding harness and it can take a minute or more to get a reply if I feed it a large module or a few long functions. Then again, I spent barely any money other than getting a laptop so I'm happy with that tradeoff.
If you want minimal lag as in high PP speeds and fast time to first token, then a discrete GPU is the way to go. The limited VRAM means you're stuck with smaller models unless you spend a lot of money on a 5090 or two.
Macs and other unified RAM machines like Strix Halo allow you to run larger models slowly.
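The time-to-first-token math is simple enough to eyeball (the prefill rates below are illustrative guesses, not benchmarks of any particular machine):

```python
# TTFT ≈ prompt tokens / prompt-processing (prefill) speed. Rates are assumptions.
prompt_tokens = 20_000                    # e.g. a big module plus a few long functions
for name, pp_tps in [("unified-memory box, assumed ~300 t/s prefill", 300),
                     ("discrete GPU, assumed ~4000 t/s prefill", 4000)]:
    print(f"{name}: ~{prompt_tokens / pp_tps:.0f}s to first token")
```

That's where the "a minute or more for a reply" experience comes from on long inputs.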
alphatrad@reddit
32gb for $1k top and 24gb for $750 bottom.
You don't need a 5090.
Kindly_Sky_1165@reddit (OP)
bro can you explain your setup and speed of inference ?
alphatrad@reddit
It's in the middle of an "upgrade," if you will.
But my previous configuration here was dual ASRock AMD RX 7900 XTX Phantom Gaming cards - 24GB each - two of them giving me 48GB of VRAM. Both bought used on eBay.
They have served me well. And they perform well because they have high memory bandwidth. They're just FREAKING HUGE cards.
I've done a lot of benchmarking with these, running dense models and MoE and using them day to day.
Posted a lot on X - like Vulkan vs ROCm using llama.cpp dual vs single card
https://x.com/1337hero/status/2027512570199085431/photo/1
Speeds always vary by model and density and quant. Like I have Trinity Nano, which is the fastest thing on one of these cards at absurd TPS. But it's also the stupidest model I've ever used outside of basic basic chat. Can't do tool calls or anything. So who cares if it's lightning fast.
My plan right now - I ordered ONE ASRock AMD Pro AI R9700 with 32GB of RAM to see how it would perform. Started running some head-to-head tests because it's RDNA4, so it has some hardware acceleration the RDNA3 XTX doesn't. But its memory bandwidth is just slightly slower.
On dense models the XTX inches ahead, but barely. And the breathing room of the 32GB card makes up for it. They're head to head on MoE, and then with MXFP4 there is just a serious bump that makes the R9700 trump the XTX in speed.
I have 3 PCIe slots on my board that allow full x16 - and I couldn't fit a 3rd card with those XTXs.
So my plan, now that the new card is much smaller: sell the two XTXs on eBay.
Pick up two more of the R9700s.
Should give me a total of 96GB of VRAM - for WAY LESS MONEY than an RTX 5090, which again, I can't fit 3 of.
alphatrad@reddit
Qwen3.5-35B-A3B Q3_K_M (MoE, 15.6 GiB, Vulkan, FA)
Qwopus 27B Q6_K (dense, 20.56 GiB) — bandwidth crossover
Kindly_Sky_1165@reddit (OP)
thanks, yeah it's pretty big $$$, that's why I'm collecting as much info as possible before I jump in, to get a proper ROI
SkyFeistyLlama8@reddit
ROI is gonna suck unless you can recalibrate your expectations.
Cloud models are fast even with huge contexts. You can buy a lot of cloud usage for $5k or $10k.
A midrange M5 laptop can get you 3/4 of the way there. Use a recent MOE like Qwen 3.6 35B or Gemma 4 26B as your main models and reserve the heavy lifting for cloud.
Kindly_Sky_1165@reddit (OP)
true and the prompt processing lag on local is the real killer for heavy coding workflows.
ai_guy_nerd@reddit
RAM is definitely the priority here. If you can swing the 64GB M1 Max, that's the move for larger models, though the M4 efficiency is tempting. For RAG pipelines and coding assistants, you'll hit the memory wall way before the chip speed.
Memory bandwidth on the Max chips makes a huge difference for tokens per second. Since you're already using OpenClaw to dodge the API tax, you'll appreciate the speed. Regarding the M5 rumors, they're always floating around. The M4s are already beasts. Grab the best RAM you can afford now and get to work.
Responsible_Buy_7999@reddit
I’d wait until wwdc
alexwh68@reddit
These are my devices
M3 Max 96GB, M4 Mini 24GB, M5 Air 16GB
The only one that can sensibly run local models that are actually productive is the M3; RAM is the biggest factor.
A mini with 64GB of RAM is a starting point, but it's limited in what can be run effectively.
Single-core speeds have improved a lot: M3 2724, M4 3432, M5 4167.
Also disk speeds have improved a lot.
I would consider clustering mac mini’s in the future, it’s one way to gradually ramp up things.
Kindly_Sky_1165@reddit (OP)
thanks, so realistically the M3 Max 96GB is the only one of the three actually pulling its weight for inference then
alexwh68@reddit
Memory is the most important thing. The lower the bits of the quantization, the worse things get. For me Q6 is the sweet spot: it works well but uses a lot of memory.
Memory bandwidth is important as is disk speed to some degree.
Kindly_Sky_1165@reddit (OP)
makes sense, so more RAM basically means you don't have to sacrifice quality by dropping to Q4 or lower. Q6 on a 70B on 96GB+. thanks man
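Quick sanity check on that, assuming weights ≈ params × bits-per-weight / 8 (the bits-per-weight figures are approximations for the usual GGUF quants):

```python
# Approximate weight sizes for a 70B model at different quants (bits/weight are rough).
params_b = 70
for quant, bits in [("Q4_K", 4.8), ("Q6_K", 6.6), ("Q8_0", 8.5)]:
    print(f"70B {quant}: ~{params_b * bits / 8:.0f} GB of weights, before KV cache and OS")
```

Q6 on a 70B comes out around 58 GB of weights alone, so 96GB+ checks out.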
roaringpup31@reddit
M1 Max with how many GPU cores? This makes a big difference (~80% on inference). Regardless, would go for the 64GB
FilterJoe@reddit
I own a Mac Mini m2 Pro 16GB RAM and I love it but what everyone else is saying is true. You can play with little models (I have) and get used to how it all works for sure. But if you want to go beyond cute demos, you'll need 64GB RAM minimum, and 128GB RAM preferable.
With 128GB RAM you can have one sizeable model with a large context and even run a couple of smaller models as well (the bigger model delegates simple tasks to the little ones).
I can only dream about doing such things until I get a 128GB Mac. I'm holding out for the m5 Studio which will have a significant advantage over prior generations thanks to the GPU-integrated Neural Accelerators (matrix multiplication built into the hardware) which speeds up prompt processing.
You can absolutely use a Mac Mini m2 pro for learning. But eventually you'll want 64GB as an absolute minimum, if not 128GB.
Kindly_Sky_1165@reddit (OP)
thanks, yeah I wanted to get into local models but sounds like I'd just be hitting a wall pretty quick with anything under 64GB. guess I might need to wait for the M5 then. So confusing with a lot of $$ on the line.
QuchchenEbrithin2day@reddit
If I am not mistaken, the 128GB RAM automatically maxes you out on the CPU/GPU front as well, no? With an M4, it'd mean an M4 Max (since Ultra is not an option)
No_Mango7658@reddit
You will find memory speed is extremely important. M5: 156 GB/s, Strix Halo: 256-275 GB/s, M5 Pro: 307 GB/s, M5 Max: 460-614 GB/s.
I would never choose an M5 for inference. Good luck
El_Danger_Badger@reddit
2020 M1 Mac Mini, 16GB RAM. You're limited to mid-tier models, but honestly, just starting out, you just need a model that you can stand up.
By the time you get to the point where you have reached the machine's limits, the M5s will already be a generation back. Plus they're cheap.
Gesha24@reddit
IMO it's not worth buying a 32GB Mac, especially if you want to code on it. A PC with a 32GB dedicated card + 16GB of RAM will be able to comfortably run your local IDE and have a solid context (I am running a qwen3.6 4-bit quant with 260K context and there's still a little headroom). But at 64GB of RAM things change.
gh0stwriter1234@reddit
Yeah get a clunker with enough ram 32gb+ (lots of old workstations like that) and throw 2 R9700s in it...
kkcheong@reddit
If you buy something purposely for LLMs, then it's either 64GB or 128GB. There's no other way
WorldlinessTime634@reddit
Hi. How much VRAM do these things have on board?