Lowkey disappointed with 128gb MacBook Pro
Posted by F1Drivatar@reddit | LocalLLaMA | View on Reddit | 127 comments
How are you guys using your M5 Max 128GB Pros? I have a 14-inch, and I doubt the size is the issue, but I can't seem to find any coding models that make sense to run locally. The "auto" model on Cursor outperforms any of the Qwens and GLMs I've downloaded. I haven't tried the new Gemma yet. Mainly I'm hoping someone can share their setup, because I'm getting like 50 tok/s at first and then it gets unbelievably slow. I'm super new to this so please go easy on me 🙏
fabkosta@reddit
Local models simply don't perform as well as the commercial beasts. You will inevitably be disappointed when you try to compare your local models to something running on an H100 or similar GPU. I would guess the minimum is probably a Mac Studio with 512 GB of memory, but probably even then you'd not reach the impressive qualities of Anthropic Claude Code or OpenAI Codex.
Is that a problem? Well, that depends on your expectations. If that's what you hoped for, then you may be disappointed. If, however, what you want is an impressive tool running fully locally at a reasonable price, then you simply don't have many alternatives.
Torodaddy@reddit
I think this is a very good pragmatic answer. You could downgrade the laptop and use that money for an api key
Blackdragon1400@reddit
That money might go quick — I'm watching $1 per question go to my openclaw stack after Anthropic's rug pull this week on their subscription service, and that's just the one single agent that uses a cloud model.
Elegant_Tech@reddit
Something to be said about token anxiety as well: locally you are free to do whatever you want, as much as you want. Goes for all text, image, and video generation. Also, the capability of models that fit on a 128GB unified-memory system has skyrocketed in the last 12 months. By the end of the year I can't imagine how capable local models will become.
wakIII@reddit
What about kwh anxiety
Albedo101@reddit
On a MacBook Pro that idles at 5W and comes with a 96W power brick? That's less than 0.1 kWh per hour even at constant max power. Your TV probably uses more power than that.
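For the curious, the worst-case math — a quick sketch assuming the full 96 W brick rating sustained for an hour, and a hypothetical $0.15/kWh electricity rate:

```python
# Worst case: MacBook pinned at its 96 W brick rating for a full hour.
watts = 96
hours = 1.0
price_per_kwh = 0.15   # hypothetical electricity rate; varies by region

kwh = watts * hours / 1000    # energy used in that hour
cost = kwh * price_per_kwh    # dollars per hour of max-power inference
print(f"{kwh:.3f} kWh -> ~${cost:.4f} per hour")  # 0.096 kWh -> ~$0.0144 per hour
```

Even running flat-out around the clock, that's pennies a day — real inference loads draw well below the brick rating most of the time.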
Antoniethebandit@reddit
We already have plenty of solar panels, and batteries are on the way next month (with some government support).
Crawlerzero@reddit
This is why I set a personal challenge to never pay for any AI service for personal work. I spend all day in my company-provided Claude Code account, and then switch over to my stack of old hardware running 4B models for various tinkering tasks just to help mentally reset expectations about local.
wakIII@reddit
Yeah. I use it to experiment with token utilization because we have no limits at work and I spend billions of tokens some days.
crazylikeajellyfish@reddit
What use case gets you up to billions of tokens in a day? I've got jobs that burn like 20M over a few hours, albeit on a set of like 50 documents. Are you just consuming absolutely huge datasets?
Antoniethebandit@reddit
Or the PC alternative I'm using: dual RTX 3090s.
carrotsquawk@reddit
thats even worse. OP has one unified 128gb. you have a split 2x24gb. OP definitely beats you there
Antoniethebandit@reddit
To be fair, my build is around 3000 USD and power hungry, but it's in a nice fully water-cooled fish-tank case that weighs around 30 kg.
carrotsquawk@reddit
you totally do
Antoniethebandit@reddit
?
ubrtnk@reddit
I have a 6-GPU rig with 2x 4090, 2x 3090, a 4080, and a 5060 Ti. Even after a 45k-token input, Qwen 3.5 122B stays above 50 tok/s with the full 256k context. Qwen Coder running at full 256k stays above 100 t/s even after a 15k-token output. Though I'm not a dev, it works for my scripts and little side projects.
Antoniethebandit@reddit
NVIDIA’s Ampere architecture and GDDR6X memory bandwidth crush Apple's M5 in raw throughput. I'm using the Qwen 80B model, which uses experts (MoE). I am satisfied with this build. I was looking at the M4/M5 but I'm still not convinced.
carrotsquawk@reddit
‘course they do, sweetheart, course they do
piedamon@reddit
Minor clarification: the M5 Max’s 128GB unified memory is a real advantage for inference, but two 3090s are not irrelevant. They still tend to be stronger for training and many fine-tuning workloads because of higher raw CUDA/Tensor throughput and the CUDA software stack.
So high parallelization tasks like fine-tuning video models are actually faster on the 3090s.
carrotsquawk@reddit
doh, like, of course… nobody said the 3090 is good for nothing.
the 3090 surely is better than the Mac at some task nobody asked about in this thread, like „running on Windows", but that was not the question here.
The question is coding.
and for coding inference you need memory.
Pleasant-Shallot-707@reddit
It definitely depends on what you're doing.
Antoniethebandit@reddit
This is the way
Pleasant-Shallot-707@reddit
Which still isn’t doing what data center models are able to do
Antoniethebandit@reddit
Yes Captain Obvious
Pleasant-Shallot-707@reddit
Which has ZERO context in what you said, dipshit
TheOnlyBen2@reddit
How is your experience with dual 3090 ?
TheThoccnessMonster@reddit
It’s fine. Even with lots of ram it’s “not the same” as mega beast paid apis
Ihunk@reddit
I have an M3 Pro with 128GB, and I try local models without expecting the same performance — you know the big AI companies spend a lot of money training their models and have massive infrastructure behind them; it's simply not possible to replicate that on my machine. It was more about seeing how far off and how slow it is. Is it decent if the code generation is 3–4x slower? Maybe for other stuff that's OK, but for coding at a good velocity it's a no in my case.
pantalooniedoon@reddit
That's unfortunate for me since I just bought one. I'd look at it a different way: models have an insane development trajectory. What doesn't work today might work in a year's time. While you can't match the SOTA coders, you can maybe match them in financial planning with the right workflow, or go through your insurance documents with full privacy. These machines are basically investments at this point, and the M5 is an extremely capable chip.
Pleasant-Shallot-707@reddit
1-bit models are going to rip
kyr0x0@reddit
Try bonsai and look for bonsai garden 1 bit MLX inference server on GitHub
sgt102@reddit
Yeah, give it 6 months to a year, for efforts like TurboQuant to propagate into the open-source models. There is a real drive toward efficiency at the moment, because demand is so high that anything that can run a model can get occupancy.
FullOf_Bad_Ideas@reddit
The TurboQuant PRs in llama.cpp are almost dead (one hanging with a low chance of making it in). And the vLLM/SGLang PRs don't look great either. TurboQuant just doesn't perform well enough quality-wise to offset the speed loss.
It's barely better than the standard q4_0 llama.cpp KV cache on master (with Hadamard transform) — you go from 3.5x (q4_0) to 3.95x (tq4) KV-cache compression at similar quality, while also being much slower.
kyr0x0@reddit
And it's worse quality-wise when QJL is used, as softmax() amplifies small errors too. I implemented TurboQuant for KV and even weights (an adaptation), and after many ablation studies I ended up choosing an APEX-style quantization pattern using classic affine quantization. The paper is overhyped and/or some of the approach is proprietary and not well described.
chisleu@reddit
You need to be running smaller models. Qwen 3 Coder Next is a great start — you want the fp8 MLX version.
Zen-Ism99@reddit
New learner here. Would MLX models be better on the HW?
Pleasant-Shallot-707@reddit
Yes
Zen-Ism99@reddit
What could the OP do to take advantage of MLX?
Pleasant-Shallot-707@reddit
Run a model converted to MLX format on a platform that runs MLX formatted models
Zen-Ism99@reddit
Perhaps he can learn to convert GGUF models to MLX…
Thank you for your assistance…
Pleasant-Shallot-707@reddit
Or just download an MLX version from hugging face
Zen-Ism99@reddit
HF doesn’t have MLX versions of many models.
Hence learning how to convert them…
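For reference, the conversion itself is a short command with Apple's `mlx-lm` tooling — a sketch, and note that it converts from Hugging Face safetensors weights rather than directly from GGUF files (the model name here is just an example):

```shell
# Install Apple's MLX LM tooling (assumes Python + pip on macOS).
pip install mlx-lm

# Convert a Hugging Face model to MLX format with 4-bit quantization.
# Note: mlx-lm converts from HF safetensors, not from GGUF files.
mlx_lm.convert --hf-path Qwen/Qwen2.5-Coder-7B-Instruct -q --mlx-path ./qwen-mlx

# Try it out.
mlx_lm.generate --model ./qwen-mlx --prompt "Write a binary search in Python."
```

In practice the mlx-community organization on Hugging Face already hosts pre-converted versions of most popular models, so conversion is only needed for the gaps.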
_hephaestus@reddit
What are you using to run the models? Caching works wonders for speed; oMLX is the best option I've tried for this, and models like Qwen-Coder-Next do feel usable on my M3 Ultra. I feel like 128GB should be able to handle that.
carloselieser@reddit
Feel free to send it my way if you don't want it 😁
superSmitty9999@reddit
Dude, before you buy a machine, go on OpenRouter and try out the models that will fit in your RAM!! And expect them to be slower locally!
I1lII1l@reddit
I will go easy on you. You don’t need to buy hardware to test models. There are providers out there, like openrouter. Test the models thoroughly first before spending thousands of dollars.
Don’t thank me for the tip, just send me your MBP (dm for address).
meccaleccahimeccahi@reddit
If it starts fast but then gets really slow, it's because you're running out of VRAM, so it falls back to CPU.
srigi@reddit
Don't tell me you're running local models using Ollama.
boutell@reddit
Is this really the issue though? Ollama's come a long way. Fussing over which runtime to put it in might just be noise at this point? I pivoted my local chatbot app from Ollama to llama.cpp and honestly I didn't notice much of a difference. Even if historically there has been.
SkyFeistyLlama8@reddit
Ollama is way behind when it comes to feature parity with llama.cpp. The Ollama developers integrate bits and pieces from llama.cpp but support for new models always lags behind.
ElementNumber6@reddit
They also take ages to merge fixes, even when they're open and waiting. It's just generally better to go somewhere that cares.
Awwtifishal@reddit
Ollama is slower, sometimes has confusing model names, and lacks some features I consider important. It's also a symptom of relying too much on LLMs for information: they love to recommend Ollama for some reason.
Equivalent_Job_2257@reddit
The auto model on cursor is most probably kimi2.5 - 1T model - you can hardly beat it with less than 128 GB memory on Mac. I think your best bet is Qwen3.5 model family with Qwen Code, not cursor, if you really want to go local.
putrasherni@reddit
I second this, your best bet is using qwen 3 coder - cerebras variant
Equivalent_Job_2257@reddit
I meant Qwen3.5 for local. Also, Qwen3 Coder is non-reasoning and fails at anything complex — although it fails fast.
putrasherni@reddit
my qwen3-coder-next experience is far better than qwen3.5
can you share your set up so I can give it a shot ?
Equivalent_Job_2257@reddit
I use Qwen Code with the Qwen3.5-27B Q8_0 model on 2x RTX 3090. Glad to hear that Qwen Coder works well for you. Maybe our projects are different enough that mine requires Qwen3.5 (Qwen Coder didn't go well for me).
Keep-Darwin-Going@reddit
Cursor auto is Composer 2, right? And for certain more difficult tasks they route to Opus or GPT.
Equivalent_Job_2257@reddit
Not sure about routing to opus.
Keep-Darwin-Going@reddit
I am quite sure they did in the past, but considering how expensive it is, and how they're getting chummy with OpenAI, I would not be surprised if they nudge it towards GPT unless it's UI work lol. Sadly OpenAI is so bad at UI that it will never be picked for UI by auto routing.
Goldstein1997@reddit
Bro is comparing a 120B model to a 1T model and blaming it on the laptop.
BlankProcessor@reddit
Don’t be afraid to return it if it’s still in the window. I have purchased hardware setups that didn’t fit my use case, and have never regretted a return.
Heavy-Focus-1964@reddit
mail it to me and i will dispose of it appropriately
TechExpert2910@reddit
for free!
msitarzewski@reddit
You're getting a lot of heat here, but I think most people are missing the "I'm super new to this" line and judging your decision to go big out of the gate. Expectation setting is real though. Yes, you can do all of the things locally — but you still (even with an M5 Max) pay in terms of speed. Take a little time to understand how context windows work and why they impact local models so heavily. Watch a few videos from Alex https://www.youtube.com/@AZisk to see what SotA looks like on your hardware. This is early, early days in local inference — raw speed will always live in the datacenter with frontier models, but local is becoming more and more capable. Just learn now and be patient, so when the time comes you'll understand the whole picture!
sparkandstatic@reddit
You should just return your Mac if you maxed out the specs thinking you could beat the NVIDIA GPUs in data centers. Dream on. Do proper research next time.
createthiscom@reddit
Listen, 14 inches is a lot. Don’t let people talk down to you about it.
unlucky_fig_@reddit
Agreed, it’s not the size it’s how you use it. 14 is very adequate
Mundane-Mortgage-624@reddit
Guys, why is my Fiat Panda slower than a Ferrari?
Fun_Nebula_9682@reddit
the slowdown is almost certainly kv cache growing as your context gets longer. totally normal for local inference, not a hardware issue. you can try shorter conversations or clear context more often.
also heads up — cursor's "auto" mode is hitting cloud APIs (claude/gpt-4), not running locally. so you're comparing a quantized 70B against frontier models in a datacenter lol. for coding specifically, cloud models are still miles ahead of anything you can run locally; that gap hasn't closed yet.
128gb mac is genuinely great for other stuff tho — embeddings, rag, local chat where you want privacy/offline. just wouldn't expect it to compete with cloud for code generation right now
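For a sense of scale on that KV-cache point: the cache grows linearly with context length. A rough sketch with hypothetical model dimensions (a 48-layer model with 8 KV heads of dim 128 and an fp16 cache — real models vary, and GQA/quantized caches shrink these numbers):

```python
def kv_cache_bytes(tokens, layers=48, kv_heads=8, head_dim=128, bytes_per_elem=2):
    """KV cache size: 2 tensors (K and V) per layer, per token."""
    return 2 * layers * kv_heads * head_dim * bytes_per_elem * tokens

for ctx in (4_096, 32_768, 131_072):
    gib = kv_cache_bytes(ctx) / 2**30
    print(f"{ctx:>7} tokens -> {gib:5.1f} GiB of KV cache")
```

At these (assumed) dimensions a 131k context alone eats 24 GiB on top of the weights — which is why long agentic sessions slow down and eventually stop fitting.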
obanite@reddit
Yeah, this was the answer I ended up on after doing some research recently, too. If you want to be able to do agentic coding, then you just need to wait -- no matter how high end your Apple hardware is, there just aren't any models you can run locally on it that will compare to frontier models. That gap is still quite wide.
My plan is to wait until at least the M5 iMac comes out, then re-evaluate. Things *are* moving fast.
Commercial_Sweet5486@reddit
are you sure you'll even get your hands on an M5 iMac? You do know shipments of the M3 Ultra Studio are like 5 months out for purchases today. The M5 Studio will be sold out before you know it.
TheThoccnessMonster@reddit
Not THAT fast they’re not.
lippoper@reddit
Don’t know why the downvotes. Seriously, check again when the m6 drops
leetcode_knight@reddit
You don't need a 128GB MacBook to do this, though. It's faster but not required. I do the same with a less powerful MacBook.
FusionCow@reddit
128GB of vram is a lot, but anything local will be very slow. Your best bet honestly is running something like Gemma 4 31B at bf16 with a lot of context, or the Qwen 3.5 122B. In theory you could pull off the 397B with a lot of quantization, but I don't know if I'd recommend that. Regardless of what you do, API models will always be a bit ahead; the difference is you don't have to pay anything other than electricity for these models.
asfbrz96@reddit
That's why I got a Strix Halo — half the price — and I'm using it as a basic replacement for ChatGPT on the web. But for coding, I'm keeping my 20-bucks subscription, or coding by hand.
Theverybest92@reddit
Your mistake was paying so much to run a local model when you can use a better model for free online with 32 gb of ram. Just saying. Pretty dumb. How much did it cost?
robvert@reddit
You sound pretty dumb. Just saying. But I’m sure you already know it
Theverybest92@reddit
Okay kupo, have fun with your unused 100 GB of RAM.
idiotiesystemique@reddit
"my computer is bad because the 100gb models I run on it at low power perform worse than 1500gb models running in data centres"
MrKBC@reddit
… be thankful at least for the fact that you all have jobs with enough steady income to even be able to afford these devices. Coming from someone who’s been applying for two years and is stuck with an M3 Pro that has been through so much in the year and a half that I’ve owned it, I’d give anything to replace it. I highly doubt this one will last a decade and then some like my 2012 MBP has.
Euphoric_Emotion5397@reddit
For coding, just use a frontier model and save yourself the trouble.
Local models are useful for analysis, scraping websites, and having your own memory system.
alexwh68@reddit
My M3 Max has 96GB of RAM. I run a local LLM plus API subscriptions for frontier models; I use the local LLM for work I want to kick off overnight etc. For quick stuff, frontier models are hard to beat.
GCoderDCoder@reddit
Auto on Cursor is either limited or costly if you use it heavily. As providers raise prices, Cursor will become less of a value, as it has already been trending. Good quants of Qwen 3.5 122B and 27B get basic isht done for me. It's not just push-button though.
I have spent months integrating local AI into my local lab. I have local services to provide tools to the models so they can safely work with internet, safely work with my email, safely work with my pm tools, etc. I built a second brain where on any given task it loads my lab info securely because it is isolated in my lab and can't reach out of my lab without going through the controlled channels I made for it.
Cloud models can do the same but these tools become equalizers where if you enable the model to do a task it just either gets done or it doesnt and local models can get things done so on defined scoped tasks I actually get better output locally than the cloud on many things.
Size-to-performance on my M5 Max, Qwen3.5 122B q6 MLX is my go-to right now (>40 t/s). GPT-OSS-120B is still really good for its size (>65 t/s). Technically Qwen 3.5 27B and Gemma 4 31B both beat these models in coding and essentially tie in intelligence, but they are slow unless you have high-bandwidth hardware like a 5090, so I don't love those models on Apple silicon.
The good thing about large Apple silicon, though, is that Gemma 4 31B is scoring higher than MiniMax M2.5 and GLM 4.7 in coding, so a q8 of Gemma 4 31B gets like 10-15 t/s starting out for me in GGUF. I haven't gotten MLX working yet, but I imagine it will be even faster. That's the best coding-ability-to-size ratio of any self-hostable model. Qwen 3.5 27B/122B are generally better at agentic tasks, supposedly, per benchmarks. I haven't had an issue with Gemma 4 yet, but I'd plan on 122B as an agent and Gemma 4 31B as a coder, as long as I don't have to stare at the screen the whole time. Put it in Roo Code, then start it and walk away until it's finished lol. That is my plan right now, since the larger, more quantized models lose coding ability with more aggressive quantization.
Understand your hardware strengths and limitations and play to those is my suggestion.
Human_Information561@reddit
Wow, sounds like you know your stuff and have gone deep. Would love to hear more / see an architecture diagram. Great job!!
GCoderDCoder@reddit
It's purely driven by anxiety, since the big tech company I work for keeps laying people off. I'm trying to set myself up to be able to extend my value if I suddenly can't find a job. Instead of focusing on doing my job better, I'm focusing on surviving without my job...
I'm sure I'm not the only one responding to the culture like this.
mp3m4k3r@reddit
On the t/s figures given, is this generation speed or prompt-processing speed? I see these quoted often, but people hit on both, so it's been difficult to separate them. For example, I get 2k+ t/s prompt processing but just 135 t/s generation on my Qwen3.5-35B-A3B, so I'm not sure whether people are quoting the higher of the two.
GCoderDCoder@reddit
I normally focus on token generation, because I know models think and take time for prompt processing, so I compartmentalize that lol. That said, I swapped my M4 Pro for an M5 Max because prompt processing has gotten much better! I only see generation speed and TTFT in LM Studio, so I focus on generation, since TTFT to me is missing the context that PP gives — but you feel the difference. High context feels more conversational on my M5 Max, closer to what my CUDA box does.
Technical_Split_6315@reddit
It's just too early. You can't really compete with SOTA models; you will pay more money for less performance.
Until we are able to run something like Opus 4.6 locally, you are just spending more money because you value running locally, at the expense of performance.
If you just want the best performance per dollar, pay for a subscription.
sagiroth@reddit
Expects sota performance on local 128gb macbook smh
boutell@reddit
I hear you, but I think the current market situation is counterintuitive because it is so breathtakingly subsidized. It feels like you ought to be able to do at home for $2,000 or $3,000 what you can do for $20 or $100 a month online, because that's more than true for other things, like VPS servers in the cloud. It just brings home how ludicrous the GPU inference subsidy really is.
So I'm sympathetic to people who find it hard to believe their 128 GB M5 can't get it done. In any other context, that is a baller machine.
Even without a subsidy, though, there is a logic to not owning this stuff personally. Since many of us don't leave high-end AI models grinding literally 24/7, the cloud servers are not quite as ludicrously subsidized as they seem. It is the classic time-sharing business model.
Still doesn't add up though!
Infninfn@reddit
It doesn't add up because people don't realise the amount of compute/GPU resources that goes into hosting LLMs, competent or SOTA, for the general public.
People also don't seem to understand that a large amount of VRAM isn't the only requirement for LLMs — it's GPU capability, the other thing that separates an RTX 5090 from an RTX 5060, and both from the GPU in a MacBook Pro.
High-unified-memory Macs and Strix Halos were never meant to be fast at PP and TG. They're just capable of running larger open-source models for dev purposes.
boutell@reddit
All of that is true. It's also true that the prices are made up, for now, and the real price would be prohibitive for most users.
Practical-Collar3063@reddit
I don't know what you are using for inference, but if you are new you might be using Ollama. Please don't.
On a Mac you should be using MLX models; the easiest way is to download LM Studio and pick MLX models instead of GGUF.
LikeSaw@reddit
I know I will get hated for this, but you spent probably $5,000+ on a 14-inch M5 Max with 128GB and didn't research beforehand what it can and can't do?
For reference, your M5 Max has 614 GB/s of memory bandwidth. A used RTX 3090 has 936 GB/s. An RTX 5090 has 1,792 GB/s. You paid multiple times more for less throughput. The Mac's advantage is loading models that don't fit on a GPU, but those run really slowly. That's not a coding workflow, that's a waiting simulator.
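Those bandwidth numbers translate almost directly into a decode-speed ceiling, since each generated token has to stream the active weights from memory once. A rough sketch — the 70B dense model at ~4-bit is a hypothetical example, and real speeds land below this bound because of KV-cache reads and other overhead:

```python
def max_decode_tps(bandwidth_gbs, active_params_billions, bytes_per_param):
    """Ceiling on tokens/s: each token streams the active weights once."""
    bytes_per_token = active_params_billions * 1e9 * bytes_per_param
    return bandwidth_gbs * 1e9 / bytes_per_token

# Hypothetical dense 70B model at ~4-bit (0.5 bytes/param):
for name, bw in [("M5 Max   (614 GB/s)", 614),
                 ("RTX 3090 (936 GB/s)", 936),
                 ("RTX 5090 (1792 GB/s)", 1792)]:
    print(f"{name}: <= {max_decode_tps(bw, 70, 0.5):.0f} tok/s")
```

By this estimate the M5 Max tops out under ~18 tok/s on such a model while the 5090 could in principle reach ~51 — assuming the model fit in the 5090's VRAM at all, which is the Mac's whole counter-argument.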
getmevodka@reddit
The M5 line is very capable, since it has AI capability baked directly into the GPU cores, unlike the older M generations, but you are right regarding bandwidth. My M3 Ultra has 819 GB/s, and its best-performing big model is the Q4_K_XL of the Qwen 235B, a 22B-active MoE. So I'd suggest not running more than a 32B model locally, even on an M5 Max. I think the Qwen 3.5 27B model, either at full precision or Q8_K_XL, is the best option there — but that would only need 64GB xD
peligroso@reddit
This means literally nothing.
getmevodka@reddit
No, it means that context throughput is several times faster than before, since it can be computed directly on the GPU cores... so it gets to first-token generation about 4x faster.
It may have other capabilities I don't know about. But it definitely doesn't mean nothing.
peligroso@reddit
How is this in any way "baked in for AI"? You're just describing microcontroller architecture.
People made and designed this stuff years ago. Nobody was like "hey guys! like, ai. right?"
getmevodka@reddit
The Apple M5 chip is significantly better for AI tasks than the M4, featuring dedicated neural accelerators in every GPU core for much higher peak GPU compute performance on AI workloads. The M5 offers roughly 3.6x faster LLM token generation, faster AI image generation, and significantly faster local AI inference, making it a superior choice for on-device AI.
All I said is that it is more capable than its predecessors — not that it is a good solution, or even an intelligent move, to sink $5k into a 128GB MBP. I really don't get where you guys read that in my answers.
LikeSaw@reddit
It doesn't matter how fancily you want to describe the "AI capability". Even a 3090 is ~2x faster at PP than the M5 chip. For local coding it's really the worst decision ever to invest so much money.
getmevodka@reddit
It's not like I denied that. All I said is that the M5 is more capable than its predecessors, and I didn't even disagree that it's not a good idea to have $5-6k invested in an MBP?!? Really don't get where you guys got that impression from haha
emreloperr@reddit
After about 8k-16k of context, TPS will decrease significantly, and any coding agent will fill that pretty quickly. Nothing on your Mac will match those remote models served in data centers.
It is related to memory bandwidth, and there is not much to do about it atm. The less data you move around, the faster it will be — MoE architectures and quantization help with that. TurboQuant is hyped because of that: smaller KV size without accuracy loss? Big.
If you need speed, choose a MoE model like Qwen3.5, keep the context size low, and prefer CLIs instead of MCPs.
Definitely use the recommended settings from Unsloth, but use MLX instead of GGUF.
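The MoE advice can be made concrete: only the active experts' weights stream per token, so a small-active-parameter MoE raises the bandwidth-imposed decode ceiling enormously. A back-of-envelope sketch with hypothetical sizes (~4-bit weights, 614 GB/s from this thread):

```python
# Bandwidth-bound decode: bytes streamed per token = active params x bytes/param.
bandwidth = 614e9        # M5 Max memory bandwidth, bytes/s
bytes_per_param = 0.5    # ~4-bit quantization

dense_bytes = 70e9 * bytes_per_param   # dense 70B: all weights read every token
moe_bytes   = 3e9  * bytes_per_param   # hypothetical MoE with ~3B active params

print(f"dense 70B ceiling : {bandwidth / dense_bytes:6.1f} tok/s")
print(f"3B-active MoE     : {bandwidth / moe_bytes:6.1f} tok/s")
```

Same total memory footprint, wildly different per-token traffic — which is why A3B-style models are the ones that feel usable on unified-memory Macs.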
Pleasant-Shallot-707@reddit
An MCP like Serena helps too as it reduces the token requirements of agentic coding
if420sixtynined420@reddit
Tools perform best when you know what you’re doing
MachineZer0@reddit
I'm working with a NeoCloud to mimic Cursor's Composer 2, which is an RL fine-tuned Kimi K2.5. The Auto model most likely uses Composer the majority of the time, if Fireworks has the compute cycles. The model has 1T-A30B parameters, and their hosting partners probably run custom enterprise inference servers that are more optimized than open-source llama.cpp, vLLM, SGLang, etc. The most cost-efficient setup to mimic Composer 2 was quad NVIDIA B300s running MiniMax M2.5 (230B-A10B params) on vLLM. They said a team can expect about 3B tokens daily from that setup.
The NVIDIA B300 is capable of 4.1 TB/s bandwidth, with 288GB of VRAM and 144 PFLOPS of FP4 on sparse models.
Sacrifices are already being made with about $300k in hardware (fewer params, no RL, OSS inference). Although a very good setup, it will not match Cursor's offering. Your setup is <$6k for 128GB of unified memory at 614 GB/s bandwidth. You'd need to make even greater sacrifices: weights with fewer parameters, quantized, and maybe running inference with stock settings. You'll need tensor parallelism to get more consistent prefill and decode speeds at higher context.
According to this post Qwen 3.5 35b-a3b on Metal seems to be your best bet on M5 Max https://www.reddit.com/r/LocalLLaMA/s/tDBvDxlMVM
Don’t expect Cursor Auto/Composer2 level performance, but should be totally usable.
Pleasant-Shallot-707@reddit
You are probably running a poorly optimized setup.
Negative_Dark_7008@reddit
Qwen is not great, I found... I'm a DeepSeek or Kimi guy, but I would look on Hugging Face for a model that fits your use case.
putrasherni@reddit
Both PP and TTFT get far worse as context grows on MacBooks.
If you can still return it, I would recommend you do.
narrowbuys@reddit
Yikes, and a laptop. I recall seeing a chart of how long a task would take to run; the Mac Studio was an order of magnitude faster. Laptops just can't dissipate the heat from their own chips anymore.
Frontier models are much better. I use them to tune scripts and break down repeatable tasks that can then be run against my local LLM. For general-purpose reasoning, a local LLM can't do much.
Zeeplankton@reddit
a mbp running on like 50 watts isn't going to even be in the same world as an h100 running a frontier model in a server farm
Pogsquog@reddit
For decent speed on a Mac, you need to stick to mixture-of-experts models with a relatively low number of active parameters — i.e. Qwen 3.5 35B-A3B, 8-bit quantized. You can try larger models, but it will be painfully slow when the context is large. Qwen isn't optimized in all agentic harnesses; Roo Code seems to be OK, apparently it works OK in Claude Code or the Qwen CLI of course, maybe Zed also. GPT-OSS-120B should also give decent performance. Be sure to update the max context window size — a newbie error is to leave it at the default of like 32k.
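On llama.cpp, that context setting is the `-c` flag. A sketch with a hypothetical model path — and remember that KV-cache memory grows linearly with `-c`, so set it to what you actually need:

```shell
# Serve a local GGUF model with an explicit context window.
# -c 65536  : max context in tokens (the "default 32k" trap mentioned above)
# -ngl 99   : offload all layers to the GPU (Metal on Apple silicon)
./llama-server \
  -m ./models/qwen3.5-35b-a3b-q8_0.gguf \
  -c 65536 \
  -ngl 99 \
  --port 8080
```

LM Studio exposes the same knob in its model-load settings as "context length"; either way, agents that see a too-small window silently truncate and get dumber.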
g_rich@reddit
Qwen3.5 and Qwen3 Coder Next are going to be your best models to run locally, but even with an impressive 128GB of RAM you're not going to be able to match the larger models available from cloud providers, which can easily require 800GB to 1TB+ of RAM to run. Those are the kinds of models people run by clustering multiple M3 Ultra Mac Studios together via RDMA over Thunderbolt 5.
WeUsedToBeACountry@reddit
I use local models for document classification, OCR extraction, and personal agents.
Coding is a higher-order task that requires more juice than 128 GB provides.
Think of a model as an employee. Your 128GB can afford to hire lots and lots of employees, but probably not a world-class engineer.
Try Qwen3.6 though and a harness other than Cursor like OpenCode. Or the new gemma.
Shot-Buffalo-2603@reddit
Yup, this model is awesome on my Mac! I only have 32GB of RAM so I struggle with context size when running it, but this would be my go-to on 128GB. With a local setup for coding you need to be a lot more methodical about how you use it; you can't just tell it to complete your whole project and make no mistakes like Claude. Still a huge productivity boost compared to manual coding.
TrustIsAVuln@reddit
I had an M4 Max with 64GB of RAM and everything I threw at it ran awesome. Maybe it's some configuration issue?
robberviet@reddit
You've hit the catch: non-local models really are that good. Theoretically Kimi K2.5 or GLM are local models, but running them here is impossible. MiniMax is the smallest possible one. And don't get me started on speed — the speed is terrible.
leetcode_knight@reddit
There will always be better commercial hardware, and models that require commercial hardware. Instead of doing this, try running open-source models on RunPod and connecting them to OpenCode. You'll still need some compromises, but this is the best choice at the moment, rather than very expensive hardware. And it is very easy to set up, stop, and run.
minmaxhero@reddit
IIRC, anything over 48GB of unified memory doesn't pick up more compute units; the RAM doesn't become faster, the thermal limits don't change, etc.
The main benefit is that you can fit more in memory: you CAN run larger models and bigger graphics scenes.
s101c@reddit
Try Minimax M2.5 or M2.1. This is the max size model that will fit into your Mac.
tmvr@reddit
If your expectation was to have frontier online-model performance at home (especially with 128GB RAM), then that's a misunderstanding you picked up somewhere — not from here, though: every time someone comes here and asks "what model to run to have Claude at home", they are clearly told it is not going to happen.
Now, that said, the models you can run at home with your M5 128GB are pretty capable. Use the recommended settings for the model (you can usually find them in the Unsloth blogs) and you should get acceptable results from something like Qwen3.5 122B, Qwen3.5 27B, or Qwen3 Coder Next, for example.
Keep-Darwin-Going@reddit
Local right now is just nowhere near SOTA level, and the version you run is probably quantized. To be honest, I tried a lot of quantized versions back in the early days, and most of them dropped quality significantly for coding and translation tasks. For regular chatting they might be OK.
Equivalent-Win-1294@reddit
I fell for this dream before as well, when I got my M3 Max 128GB. It just wouldn't compare to Claude Code. This time, I'll wait out the model progression before committing to hardware again. Models and frameworks for tool calling need to get better first.
bajaenergy@reddit
Sorry, but 128GB of RAM won't replace top-tier commercial tools. You're confusing memory size with model intelligence. Even the best M5 Max can't match the best cloud models on complex coding tasks — those models have better training data and fine-tuning. 128GB is great for local dev and mid-size models, but beating top commercial tools on your own hardware requires a much beefier setup (a Mac Studio with 192GB+ or a dedicated server).
MidAirRunner@reddit
Which qwens and glms have you downloaded? Qwen3.5 122b is pretty good for me.