M5 Max 128GB Owners - What's your honest take?
Posted by _derpiii_@reddit | LocalLLaMA | View on Reddit | 141 comments
What models are you running and favoring?
Any honest disappointments or surprises?
I'm very tempted to pick one up, but I think my expectations are going to be a bit naive.
And yes, I understand local models can't compete with frontier models with trillions of parameters.
So I'm wondering what use cases are you 100% happy you got the M5 Max 128GB?
Something something pineapple pancakes to prove this is not AI writing.
Varmez@reddit
I bought one, it hasn't arrived yet. I figured with how much belt tightening the online models are doing, in conjunction with the expectation it'll last me ~5 years, the extra to go from 64GB to 128GB is "cheap insurance".
My hope is that I can effectively get by on a $20 plan or two, use something like Codex as the "planner" in something like Cline, then have a local model, likely Qwen or Gemma, do the actual implementation. I've been trying this on my M1 Max in a roundabout way by using some of the offline-available models on OpenRouter to get a feel for it, and it works pretty well for my uses.
shuwatto@reddit
Would you care to share your use cases and workflows?
I've tried the same thing and miserably failed. :/
Varmez@reddit
My use is probably considered pretty basic: crafting n8n workflows that interact with the Shopify and Zoho APIs, plus a fair bit of data parsing from scraping vendor sites / controlling Browserless. Along with a bit of Liquid code related stuff on Shopify themes / templates.
I have docs / scaffolding that I've refined continuously over time that probably help a fair bit though.
cobquecura@reddit
I have one and I have found that while it is not incredibly fast, with Qwen3-Coder-Next in conjunction with OpenCode and OpenSpec I am able to consistently get features added with only occasional intervention. Something around 500 t/s prompt processing and 50 t/s generation up to ~200k context.
I also make heavy use of kubernetes locally and having a ton of memory is a huge help for that too.
SkyFeistyLlama8@reddit
At 200k context and PP 500 t/s you're waiting 7 minutes for the first token. I hope there's a way of caching that huge context.
MiaBchDave@reddit
No, you don’t wait 7 mins. Unless you’re deleting KV every time (or using LM Studio instead of oMLX).
SkyFeistyLlama8@reddit
You are waiting 7 mins if you kill the inference process to save RAM for another program or if you switch models without using a model router.
Sometimes I deal with large documents that are stuffed completely into context. There could be different documents used for each chat run so it's a fresh context load at 100k for each new chat.
MiaBchDave@reddit
oMLX has hot/cold KV cache on SSD. Is that what you’re looking for? Check their settings. LM Studio does not handle cache well at all through any harness using MLX (or batching for that matter).
SkyFeistyLlama8@reddit
I'm not on Apple, I use llama-server from llama.cpp as my main inference runner.
MiaBchDave@reddit
Ahhhh… it was an Apple topic so I assumed. I think there's swap disk cache or something through vLLM possibly, but I have no experience there.
_derpiii_@reddit (OP)
ohhh, I never made that connection. You could just straight math it 🤣
tbf, that's the entire context window.
I guess realistically, for light coding (prototyping, no massive codebase loading), if you restart session at 20% that's around 90 seconds.
And that's just token processing/intake. Inference/thinking is another step right?
SkyFeistyLlama8@reddit
To be fair, I rarely use that much fresh context. You gotta calculate TTFT (time to first token) to see how responsive a rig would be, depending on model and context size.
I don't have an M Apple beast machine so I'm getting pathetic numbers like PP 150 t/s at smaller contexts like 20k tokens. At that speed and context size, I'm waiting about 2 minutes for prompt processing to finish.
Inference can be with reasoning or non-thinking. With reasoning, I could be waiting another minute before the final output tokens appear.
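The waiting-time math in this subthread can be captured in a tiny back-of-envelope helper (illustrative only; real TTFT also depends on cache hits and batching):

```python
def ttft_seconds(context_tokens: int, pp_tps: float,
                 reasoning_tokens: int = 0, gen_tps: float = 1.0) -> float:
    """Rough time-to-first-visible-token: prompt processing time,
    plus any reasoning tokens generated before the final answer."""
    wait = context_tokens / pp_tps
    if reasoning_tokens:
        wait += reasoning_tokens / gen_tps
    return wait

# 200k fresh context at PP 500 t/s -> 400 s, i.e. the ~7 minutes above
print(ttft_seconds(200_000, 500))           # 400.0
# 20k context at PP 150 t/s -> ~133 s, roughly the "2 minutes" quoted
print(round(ttft_seconds(20_000, 150)))     # 133
```

With a reasoning model, add the thinking tokens divided by the generation speed, e.g. 3,000 reasoning tokens at 50 t/s is another minute on top.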
_derpiii_@reddit (OP)
Hmm. Are you implying there's prompt caching optimizations?
No_Afternoon_4260@reddit
Ofc there is
skilesare@reddit
Ok..fine ..I'll ask...link? I run llama.cpp and could certainly benefit from caching as I run a ton of agentic stuff with different personalities and emphasis.
SkyFeistyLlama8@reddit
Based on my own usage of llama-server, it caches prompts in memory up to a certain size. I think it looks at the first few thousand tokens to find a cache match, then it runs prompt processing only on the difference between that cache hit and the new prompt.
It works if you run a long chat session against a fixed document corpus or you run coding agents against a fixed code base. If your context is always changing, prompt caching doesn't work.
I wish there was a way to save a huge prompt cache to disk and then reload it using the same llama-server web interface as regular chat history.
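A rough sketch of the prefix-matching behavior described above (a simplification: llama-server matches tokenized prefixes per slot, not raw strings):

```python
def tokens_to_reprocess(cached_prompt: list, new_prompt: list) -> int:
    """Only the suffix after the longest shared prefix needs prompt
    processing; the shared prefix reuses the existing KV cache."""
    shared = 0
    for a, b in zip(cached_prompt, new_prompt):
        if a != b:
            break
        shared += 1
    return len(new_prompt) - shared

# Same system prompt + document corpus, new question appended:
# only the new tokens get reprocessed.
corpus = list(range(100_000))  # stand-in for 100k cached tokens
print(tokens_to_reprocess(corpus, corpus + [1, 2, 3]))         # 3
# Completely different context: full reprocess, cache is useless.
print(tokens_to_reprocess(corpus, list(range(500, 600))))      # 100
```

(For persisting processed context to disk, recent llama-server builds expose slot save/restore via `--slot-save-path`, though I haven't checked how well that fits a regular chat-history workflow.)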
DifficultyFit1895@reddit
There are ways to cache context. LM Studio has it for most GGUF models and oMLX has it for MLX.
somerussianbear@reddit
I’m getting good cache hits on oMLX but can’t see GGUF caching working in LMS. Mind pointing to how to set it up / how to get it to work? Or just which HF models work for you. oMLX makes it super clear to see the cache and its impact, while LMS and llama.cpp have no visuals (that I’m aware of), so it's hard to tell.
SkyFeistyLlama8@reddit
Caching works on llama.cpp or llama-server but only if the coding harness or chat UI doesn't add a prefix to chat history.
I don't have much use for long context caching because I'm switching models, agents and conversations a lot. Killing a llama-server process kills any cached prompts. Can you save and reload processed context from disk?
somerussianbear@reddit
oMLX’s hot/cold cache would make that almost instant
jonydevidson@reddit
Are you hosting in oMLX?
cobquecura@reddit
I tried, but couldn’t get oMLX working as well as LM Studio. I ran into looping issues, and I can add a system prompt in LM Studio that I can’t seem to do in oMLX. I’d like to try again with oMLX in the near future because it seems the processing speed would improve even more.
MartiniCommander@reddit
I'm using oMLX without a hitch and I'm pretty illiterate. Start from scratch.
Narrow-Belt-5030@reddit
That still feels surprisingly good though - 200K context (claude code) + 50t/s is fast enough to work with (imho anything below 10 is a nightmare, and above 20 is ok)
rm-rf-rm@reddit
Honestly, for most tasks that most devs have, it's more amenable to just get the agent running and only revisit after 1-2 hrs - the dev can/should move on to other work. If it's done in minutes, you're in that no man's land where you can't switch to doing something else, because by the time you start, the agent is awaiting input. And even if you are getting something else done (could even be other agents), the constant context switching is unproductive and taxing.
AlwaysLateToThaParty@reddit
Yeah, that's pretty much where I'm at. On an edge device, it doesn't matter if it's 5-10, as they generally aren't time-critical. But if you're doing back and forth inference, whether it's coding or RP, it's just not usable below that.
Gipetto@reddit
This actually sounds great for me. I'm heavy handed with my monitoring of AI coding, so I have it go in steps and I monitor the diffs. I think I'm gonna be just fine when mine finally arrives.
_derpiii_@reddit (OP)
I know this isn't a fair comparison but, how would you rate it against Claude Code w/ Opus? I'm not expecting parity of course :)
I hear of hybrid workflows of using Opus to plan, and local to implement.
Broad_Stuff_943@reddit
Opus to plan and then local or a cheap model to implement is exactly what I do. It works well. The key is to have Opus outline an implementation plan so the "lesser" model has very little thinking to do.
_derpiii_@reddit (OP)
That sounds like the best of all worlds :)
Are there any clever prompting tricks you give to make Opus aware it's handing off to a local agent model? Or does it just figure it out from the config?
Broad_Stuff_943@reddit
I tell it to create a plan and write it as though a junior is going to pick it up. It adds a lot more information that way.
Whenever I've told Opus that I'm planning to hand off to a different model the output isn't as good. By telling it that a junior is picking up the work, it seems to add a lot more context and examples.
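A minimal sketch of that "write it for a junior" handoff; the wording and function name here are just illustrative, not a fixed recipe:

```python
def planner_prompt(task: str) -> str:
    """Ask the frontier model for a plan framed for a 'junior',
    which nudges it to spell out context, paths, and examples
    instead of assuming a strong model will fill the gaps."""
    return (
        "Create a step-by-step implementation plan for the task below. "
        "Write it as though a junior developer will pick it up: include "
        "relevant file paths, function signatures, edge cases, and a "
        "short example for each step.\n\n"
        "Task: " + task
    )

prompt = planner_prompt("add rate limiting to the /api/search endpoint")
print("junior developer" in prompt)  # True
```

The resulting plan then goes to the local model as its system/task prompt, so the "lesser" model mostly executes rather than designs.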
_derpiii_@reddit (OP)
That's a great idea. Thank you!
Broad_Stuff_943@reddit
You're welcome!
Last_Mastod0n@reddit
This is the way
Randomdotmath@reddit
Yeah, this is classic Cline-style workflow from the early days. Those tools were built around the idea that even if the LLM isn’t that smart overall, if you break the task down into tiny, well-defined pieces, it can perform surprisingly well.
The meta hasn’t really changed — except now your local Qwen is probably stronger than GPT-4o. So using frontiers for planning and a strong local model for implementation is actually excellent.
Last_Mastod0n@reddit
Its okay if you give it explicit instructions. Claude code with Opus is going to fill in all the gaps and design the implementation MUCH better than any open model at the moment.
_derpiii_@reddit (OP)
How does Kubernetes fit in with local LLM workflows? Is each pod running its own LLM?
cobquecura@reddit
I just mean that it’s another valuable use of the substantial RAM that goes beyond LLMs, not that I use kubernetes with LLMs (in containers or something).
_derpiii_@reddit (OP)
ah gotcha! I was envisioning some sort of elastic LLM 🤣. Kubernetes is so cool.
wouldntyaliktono@reddit
I stepped up to M5 128gb from M1 64gb and it's a night and day difference, mostly because of the prompt processing speed. It's made local Claude Code a realistic option for offline development. Qwen Coder Next 70b with the 8-bit quant has been my go-to, but I've also had some success running the 4-bit quant plus a smaller secondary model for sub-agent tasks. Here's a quick comparison I just did of my new machine vs. the M1 I was using previously: https://www.youtube.com/watch?v=k8YCLZ-OAuk
_derpiii_@reddit (OP)
Ok wow, that's QUITE the difference.
What's your main use case?
Taking note your favorite model is: Qwen Coder Next 70b 8-bit
shansoft@reddit
Very happy about it. I am able to run Qwen 3.5 122B with 200k+ context while maintaining 40-60 tok/s generation and 500-3000 tok/s prompt processing. Upgraded from an M3 Max and the speed difference is wild. The local model is now on Sonnet level and can easily do what SOTA cloud models can do these days, as long as you prompt / plan it right. It also fixed a secret-keys problem I was having where Opus 4.6 / GPT 5.4 both just put in a hack patch instead of fixing the root cause.
_derpiii_@reddit (OP)
Woah, sounds like your configuration is set up well! I'm seeing so many varying experiences. Yours seems the most positive so far.
eaz135@reddit
I have a Max 64, and a PC with a 5090 (with 192GB RAM). I find my hands automatically wanting to work with the PC. I run local qwopus 27b v3 and get very good results with it.
I treat the Mac as more of a beastly machine for working with cloud inference (cursor, codex, cc, etc). Good specs to be running many agents simultaneously on ghostty, building multiple things at once, etc.
I get more done with the Mac in the setup above; I treat the PC local AI setup mostly as entertainment / hobby. Don't get me wrong, I'm very productive on it and get a lot done, and it's very fun at the same time running it locally - but it's not the same as having 10 terminals open simultaneously with Opus / Codex cranking away in each of them.
_derpiii_@reddit (OP)
Appreciate the insight into your workflow. I'm new and still plebbing out with iterm2 🤣
I keep on seeing Ghostty mentioned. Could you go more into your setup/tooling (ghostty with tmux?)?
Also surprised you're using all 3 (cursor, codex, CC). Are they collaborating together or are they individual projects?
eaz135@reddit
I'm subscribed and I have access to pretty much everything, deciding which tools I want to use today on certain projects is kind of like deciding what clothes I want to wear on the day.
I've been working in software (big tech, investment banks, scale-ups, etc) since 2010 so I have a good sense of tech. My style with AI is generally giving very direct, bite-sized tasks for execution. With this approach a lot of the tools are good enough - because I'm driving a lot of the direction myself, so it almost doesn't really matter what I pick, the end outcome is going to be very similar. I just like to get my hands dirty with all the tools, I find it fun
zorgis@reddit
Interested too.
I still can't decide if the max 128gb is worth it over the pro 64gb.
JuliaMakesIt@reddit
I have both: a Mac mini (M4 Pro, 64GB) and a MacBook Pro (M5 Max, 128GB).
It’s night and day both in speed and the scale of model / context length I can run. I don’t regret paying the extra for the Max with more RAM. It was 100% worth it.
zorgis@reddit
What do you run with the 128gb?
What do you use the model for? Coding, local agent, chat?
JuliaMakesIt@reddit
I don’t do much chat or writing with AI.
I automate a lot of workflows, mostly with smaller local models (20-35B with large but managed context)
I use the Qwen-Agent Python module for a lot of things, as it has great tool calling, RAG, and code execution all built in. I have built maybe 6 of these agents since Qwen3 came out. Think of them as mini OpenClaws, but built for specific missions.
They can use any model, but I’ve been mostly running the Qwen3.5 series. I use 4B as a super fast automation support model, then based on the task upscale to Qwen3.5 27B, 35B-A3B or the 122B-A10B model.
It’s really all about the tooling. A smaller model with a good RAG document set and tools will do a lot more than a big model without.
The other big models I sometimes use are: NVIDIA-Nemotron-3-Super-120B-A12B and GPT-OSS-120B. The Gemma4 models are also quite nice.
The M5 Max will reduce your time to first token quite a bit.
The 128GB will let you have some things running in Docker, and a few very solid models loaded. You can also jettison the smaller models and load in a Q4_K_M 120B model if you need.
I use the Colima engine to run Docker with full Metal/MPS access.
I use either MLX or Llama.cpp in router mode to dynamically pick and load models on the fly.
Yup, wall of text. Sorry just wanted to share my setup. It’s been super useful for me. Also, I’m human and like pineapple on pizza.
rm-rf-rm@reddit
excellent write up! not a wall of text. This is giving me the idea of doing a megathread for folk to share their setups from hardware through applications.
Never heard of it before actually. And as an OpenClaw skeptic, it looks interesting. What are you using it for and how much effort does it take to set up for reliable use?
txdv@reddit
$800 for 64GB of RAM sounds like a good deal nowadays
_derpiii_@reddit (OP)
jesus. Checking my email, you know how much I sold my G.SKILL TridentZ RGB 64GB for?
$50 + $10 shipping 🥲
Jan 4, 2025. Just a year ago. wtf
txdv@reddit
time to let it go, i looked at BTC at $100 and decided it had hit the roof
rm-rf-rm@reddit
Still not too late to buy. It's either going to be bigger than gold's market cap (so something like a 30x multiplier left) or will tend to 0 as time goes to infinity. So it's only a question of your worldview and risk appetite.
thelebaron@reddit
Just imagine some guy is posting somewhere about the score he got in light of today’s prices and how happy he is though
Consumerbot37427@reddit
Right? It shows how out of hand things have gotten. The "Apple Tax" for RAM or disk upgrades has always been significant.
_derpiii_@reddit (OP)
Same! I'm playing around with the m2 ultra 64GB and quickly finding how disappointing it is. And I don't see how having double the memory would help much.
So if I'm going to continue local inference with Apple, might as well wait for M5 512GB.
SkyFeistyLlama8@reddit
Prompt processing is much improved on the M5 Max. Anything before that can run large models slowly and with glacial prompt processing, which makes agentic coding or long workflows excruciatingly slow.
_derpiii_@reddit (OP)
Could you share some of your workflows you're using local inference for? I'm trying to get an idea of what's pleasantly feasible.
xXy4bb4d4bb4d00Xx@reddit
it’s great
somerussianbear@reddit
Solid argument
xXy4bb4d4bb4d00Xx@reddit
thanks
abnormal_human@reddit
If you want a mid-competent chat assistant with predictable latency and privacy they're great, but it's no RTX 6000 when it comes to running large models quickly with long context and it's too compute poor for significant parallel work.
Consumerbot37427@reddit
To elaborate on that point: my observation is that parallel sub-agents basically freeze entirely whenever there is any prompt processing to be done. I can only assume that a machine with multiple graphics cards wouldn't behave this way.
_derpiii_@reddit (OP)
Are the subagents sharing the same model or does each have its own LLM? If they're sharing the same model, I can understand it pausing.
But if they each have their own LLM, I wonder what the bottleneck is (not memory, nor bandwidth - maybe some queue switching issue?).
Consumerbot37427@reddit
I was using the "parallel slots" feature in LM Studio.
somerussianbear@reddit
Same. Noticed that we definitely have to disable background agents.
victor_lowther@reddit
It is good stuff. Opencode + oMLX (0.3.4) + unsloth-Qwen3-Coder-Next-mlx-8bit is a local sweet spot -- I average around 50tok/s generation, and oMLX's prompt cache makes prompt processing a total non-issue especially compared to lm studio. Currently experimenting with pi + oh-pi, but the ant colony agent style is driving the system into swap -- currently getting 1k tok/s prompt, 20 tok/s gen. Haven't experimented with turboquant yet -- it and Gemma are next on the list once oMLX support stabilizes.
_derpiii_@reddit (OP)
Awesome. Saving this for later, thank you
synn89@reddit
Depends on what you want to do with it. I have a M1 Ultra 128GB and it's been wonderful for chat models. It's low enough power I can just leave it on, all the time, and 128GB of RAM is a lot of breathing room for 120B and down models. Even though right now I'm running a Drummer Skyfall-31B, which doesn't need all the RAM, it's nice to have when I want to run a 120/122B and I can squeeze in a 235B if I really want to.
It's quiet, sips power and is very flexible.
_derpiii_@reddit (OP)
I'm actually setting up a RAG on M2 Ultra 64 for a friend. What do you like to use for chat models (chat is generation right?)?
habachilles@reddit
Get the ultra if you can
jkcoxson@reddit
I’ve personally settled on qwen3.5-122b. I get roughly 40 t/s using oMLX, which is faster than any other program I tried. I use OpenCode, and generally leave it while doing other things like sleeping or socializing. I give it clear specs of what I want done and how I think it should be done, as well as a way for it to test its own work. Usually it’ll iterate for a few hours and eventually get done the list of tasks I have for it.
Eventually I want to write my own harness, since I feel like OpenCode is too loose. I know how to write code, I know how I want things implemented, I know how to test for success, so I need more structure for the LLM.
Basically it’s not super fast, but can get things done in time that is otherwise occupied by my life.
JuliaMakesIt@reddit
QwenLM has a Python module for doing agentic workflows that supports tool calling, RAG, and code execution.
It’s model agnostic so you don’t have to use Qwen models. I use it with Llama.cpp or MLX in router mode loading models as needed.
If you’re going to write your own harness for agentic workflows and are comfortable with writing Python, it handles all the boring stuff for you. They have a lot of examples in their GitHub.
https://github.com/QwenLM/Qwen-Agent
(Disclosure: I’m not a Qwen employee and have nothing to do with this code. I just use it in projects.)
PinkySwearNotABot@reddit
i used to use MLX variants for the speed boost, but then i started reading about them not performing as well as their GGUF counterparts. So now I'm strictly going for high-quality GGUF quants. What's your experience with MLX?
jkcoxson@reddit
GGUF is abysmally slow on Apple Silicon, roughly half the performance for me. I haven’t heard anything saying that MLX is lower quality, just another format to store parameters. If you have any source, I’d be interested in reading it.
jkcoxson@reddit
Because others have mentioned smaller models, I find 27b and 31b too stupid to write Rust and C well. They loop over work they’ve already tried even with a 250k context window.
PrinceOfLeon@reddit
I have a M3 Max 128 GB and have been happy since picking it up the week it came out.
I keep Qwen3-Coder-Next @ Q8 w/256k context running at all times, with Qwen2.5-VL-7B-Instruct (for occasional vision) alongside.
There is enough memory left over that I have felt no impact with dozens of browser windows, my IDE, Docker, email client, and so on. As in looking right now there is still 8 GB of memory free.
I still use Claude Code for primary development work, but with a custom hooks-based AI monitor leveraging the local LLMs (via llama.cpp server) to watch what Claude is doing and analyzing risky tool calls and other operations (reading is green, writing or network transfers are orange, and delete is red), as well as evaluating "drift" if it looks like CC is doing things which are not aligned with user prompts or CLAUDE.md instructions. This results in periodic, brief bursts of GPU usage which don't have any perceivable effect on my workflow. I'm not actively waiting for replies and performance-wise I wouldn't know the LLMs were kicking in if I didn't have a CPU/GPU/RAM/Network monitor going in the taskbar.
I've tried pointing CC at Qwen3-Coder-Next for development, and it can get the job done, but I've actually had better results using Mistral Vibe (currently) or OpenCode (previously) as the harness. "Better" as in I get responses back quicker, and sometimes CC seems to just get "lost" and will still appear to be processing files an hour later with no clear end in sight. I only tend to go entirely local for routine sysadmin tasks or for editing things like Home Assistant and Frigate configurations (things I don't want to leave the private network).
In short, having that level of headroom for memory means not only being able to run "large-ish" models locally, but being able to run useful models while still using the system to get actual work done, without compromises.
JonSwift2023@reddit
How's it compare directly to the M4 Max 128GB? Anybody do the upgrade and have real numbers?
Particular-Pumpkin42@reddit
I bought this one as a successor to my MacBook M1 64GB and don't regret it. I run Qwen3.5 122B 6-bit with full context as my daily driver and it's a huge help for my professional work as a software developer. The M1's prompt processing was becoming a blocker for the more recent capable models.
One note: money-wise it didn't hurt me much. I started with local inference on my MacBook, never had hands-on experience with large models in a GPU cluster, and I don't use cloud inference at all for professional work due to privacy concerns. That's why I never felt that the M5 Max is not intelligent or fast enough, as I am not comparing :)
Only you can decide what's right for you, but to me, it feels magical having an LLM strong enough for professional work running locally in an all-in-one portable machine with keyboard, touchpad and display :)
PinkySwearNotABot@reddit
how much ram does that model take up on your 128GB?
_derpiii_@reddit (OP)
That's awesome! How's the thermals? I'm sold as long as it has no thermal limitations over the upcoming M5 Ultra.
Taking note, thank you.
lolwutdo@reddit
how fast is your PP with 122b?
xraybies@reddit
Apart from macOS being a cluster of interlinked junk consuming >5GB on load and being hard to debloat, the hardware is not bad. I have an M5 Max 128GB; if I let the agents do their thing, the fans kick on in 10s and you can watch the battery go down 1% every 20s with any model above an active 3B. MLX is pretty good, but realistically you only get 118GB (54GB on the 64GB model) for models, so you still cannot run a ~120B Q8, at best Q6. https://omlx.ai/benchmarks will give you a good idea of what you can run. I ordered both the 64GB and 128GB versions and, apart from the SSDs (SanDisk vs. Toshiba), they performed identically. The keyboard also felt very slightly different, just a tad firmer on the 64GB.
Think of it as an RTX 5060 with 110GB VRAM + an i7-12700K.
Image gen is ~1/3 the speed (DrawThings) vs. an RTX 4090 (ComfyUI).
As for workloads: heretic Qwen 122B, Nemotron 120B, and GPT-OSS-120B at Q6 or MXFP4.
Overall 6.5/10; if it weren't for macOS being such a bloated PoS it would be 8/10.
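That RAM arithmetic can be sanity-checked with a rough weights-only estimate (it ignores KV cache and runtime overhead, which only make things tighter):

```python
def model_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate weights-only footprint in GB (1 GB = 1e9 bytes)."""
    return params_billions * 1e9 * bits_per_weight / 8 / 1e9

# 120B at Q8 (~8 bits/weight) -> ~120 GB, over the ~118GB usable.
print(round(model_gb(120, 8)))    # 120
# 120B at a Q6-ish quant (~6.5 bits/weight) -> ~98 GB, squeezes in.
print(round(model_gb(120, 6.5)))  # 98
```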
PinkySwearNotABot@reddit
i started with MLX and now am moving more towards high-quality GGUF quants. MLX is a bit faster, but I hear it can be "dumber" -- which makes sense. Your experience?
_derpiii_@reddit (OP)
Have you run into any thermal throttling issues? Was curious if thermals would be an issue (vs let's say a mac studio).
Oh wow, that's 10GB taken up. I was expecting 5GB max. That's really good to know.
I am very surprised. I would think 2x RAM would yield nonlinear gain, but sounds like... exponential decay in diminishing returns?
What about people that connect like 4x512GB for 2TB of ram? Are they seeing any benefit at all, or is just for the memes?
Southern_Sun_2106@reddit
Loving it. Had an M3 128GB before. PP is the real deal with this generation. My fav models are GLM 4.5 Air and the new dense 27b Qwen. This M5 makes running those two plus smaller omnicoder 8b instances all together (as little agents) very nice. I recommend taking it for a drive for the 14 days Apple allows, if you have the opportunity, and deciding for yourself by trying it on your own uses. PLUS it is an awesome laptop, with a magnificent screen and super-nice sound, thin and slick.
Pleasant-Shallot-707@reddit
This makes me happy to hear as someone who’s picking up their new m5 max 128gb soon.
drewbiez@reddit
Running Gemma4 moe, it flies, does well for my use case and I’m done paying for ai plans for a while :-)
_derpiii_@reddit (OP)
What's your use cases and workflows like?
drewbiez@reddit
I'm also experimenting with the lower-end models, but I figured: my laptop has the RAM, why not use the big baddie.
drewbiez@reddit
Light automation and just general use really... Web scrapers with n8n, some automation with n8n, some data ETL workflows for prepping data for clients, general questions, light scripting/coding, nothing too wild.
monjodav@reddit
OK-ish, but honestly not that fast; you'll need GPUs to achieve anything Opus-ready at more than 40 t/s.
Been using Qwen 122b and it's incredibly slow, but it does the job.
Let's see which models are coming next.
_derpiii_@reddit (OP)
What's your use case and applications?
And even if it's slow, how's the quality of the output?
If it's 1/10th the speed of opus with 90% the results, I would just toss it in a ralph loop and sleep in a toasty room.
New_Public_2828@reddit
No no. Hold up 3 fingers in front of your face. Only way to know for sure this isn't AI.
Commenting because I'm also curious
Hey_Gonzo@reddit
I almost died reading this. That was the dumbest interviewee.
_derpiii_@reddit (OP)
it's an interview? I thought it was from the Indian filter scammer?
_derpiii_@reddit (OP)
Instructions unclear, face stuck in 3 dicks
lolexecs@reddit
3 dicks? Shouldn’t this be base 2 - like four dicks - you know going tip to tip?
-Crash_Override-@reddit
Ok grok
StandardKey7566@reddit
Common problem, wait it out, it'll either get better or a lot worse!
MiaBchDave@reddit
The M5 Max is the first system that can sorta work locally with large enough code bases. I currently am trying a few things. Qwen3.5 122B is “ok” for getting one-shots done. Will be trying new Gemma4 26B MoE as well.
Harness stack: OpenCode > oMLX > https://huggingface.co/andrzejmontano/Qwen3.5-122B-A10B-Vision-MLX-Mixed-5bit
If you pull up oMLX’s website, it has a great amount of uploaded model benchmarks (which the UI can run) to get an idea of PP speeds… just filter by M5 Max (40 core GPU): https://omlx.ai/benchmarks
I find the context cache in oMLX makes relatively quick work with 100-200k context sizes.
BidWestern1056@reddit
i use some of the 120s but they aren't enough of a jump in intelligence over the 30b class to justify the drop in speed usually
somerussianbear@reddit
But 120s MoE are way faster than 30s dense right?
Themash360@reddit
Depends on active parameter count, but yeah, recent models go for like 10b active, so if all 120b fits in fast memory, speeds will be really good.
I do unfortunately think qwen 27b is better quality wise, but if you’re low on compute it will be a tough model to run.
BidWestern1056@reddit
yeea but like qwen 35b is also moe so its way faster lol
_derpiii_@reddit (OP)
Oh wow, that's a nuance I would not have expected.
So it's diminishing returns.
So it's more beneficial from the standpoint of having multiple models in memory at once (3x 30b models at each point in pipeline)?
BidWestern1056@reddit
yeah i'd say so, one of the things i'm working on with npcpy and npc ecosystem is to make it possible to achieve greater quality from ensembling responses from 1b-10b class models so you could likewise get 10 responses in parallel and then a smarter (30b-100b) model at the end synthesizes .
https://github.com/npc-worldwide/npcpy
https://github.com/npc-worldwide/npcsh
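The ensemble-then-synthesize idea can be sketched with stubbed model calls; the real npcpy API will differ, this just shows the shape:

```python
from collections import Counter

def ensemble(small_models, synthesize, prompt):
    """Fan a prompt out to several small models in parallel-ish,
    then let a stronger model (here a stub) reconcile the drafts."""
    drafts = [model(prompt) for model in small_models]
    return synthesize(prompt, drafts)

# Stubs standing in for 1b-10b class models and a 30b+ synthesizer.
smalls = [lambda p: "yes", lambda p: "yes", lambda p: "no"]
majority = lambda p, drafts: Counter(drafts).most_common(1)[0][0]
print(ensemble(smalls, majority, "is this approach sound?"))  # yes
```

In practice the synthesizer would be another LLM call that reads all drafts and writes a merged answer, rather than a majority vote.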
GymRatNowCovidFat@reddit
I think the qwen 3 next coder 8bit seems like a really good model so far. I almost find myself wishing they gave us the option for 256 GB for the macbook pro. I think I could have been OK with 96 GB if it existed. I don't think 64GB would have worked for me because I'm constantly seeing how far I can push large local models.
wgg_3@reddit
It’s ok
rorowhat@reddit
Strix Halo 💯, more versatile and cheaper.
_derpiii_@reddit (OP)
ahaha, I acknowledge the PCmasterrace :)
At the moment, I get the impression, local LLMs are best suited for brute force coding (vs orchestration). And linux + GPU wins by a landslide.
Apple's unified memory only has the advantage in orchestration, but sounds like the models aren't good enough yet.
rorowhat@reddit
Macs are great for casual use, but for real work the flexibility of the PC always wins.
Look_0ver_There@reddit
Sadly nowhere near as cheap as they used to be. Started off at $1800 for the 128GB models. Now it's rare to find them for less than $3000. Above the $3000 mark it all gets uncomfortably close to being better off to just throw a bunch of R9700Pro's at a PC instead.
DoorStuckSickDuck@reddit
Eh, the Bosgame M5 used to be $1800 at release, I got mine at $2000 (\~2 months ago), and now it's $2400 for the 128GB model. Will likely grow even higher in the future.
They're very good machines for specific use cases. People here parrot throwing together rigs with 3090's until they look at the power consumption, noise, etc etc. It's not the fastest, but it's very efficient for what it does, and it's great for an always-on server.
Look_0ver_There@reddit
Yeah, Bosgame are the only remaining vendor below $2500. Last I looked two days ago, everyone else is now at the $3000+ mark. It will be surprising if Bosgame don't go to $3000 within a week.
Don't get me wrong. I have two Strix Halo machines myself linked together via USB4, and I can run 400B models on them using rpc-server at 20tg/sec. Fantastic machines. They unfortunately struggle with dense models. I also have a pair of R9700 Pro's in my PC for handling dense models at an acceptable speed.
When I got the Strix Halo's, they were $2000 each (I got them just as the prices started to go up). Today though when I look at what the R9700 Pro's can do, I find myself asking the question if Strix Halo's make sense any more at the $3000+ mark.
That's where I'm coming from. I ain't no parrot. This is based on direct experience.
_derpiii_@reddit (OP)
Wait the M5 Max has gone up? Oh sheesh, I thought Apple would keep the old retail pricing
Look_0ver_There@reddit
I was referring to the Strix Halo's, since the comment I responded to explicitly mentioned the Strix Halo machines.
SwordsAndElectrons@reddit
Read again what this person replied to.
IsThisStillAIIs2@reddit
if your expectation is “near cloud model performance locally,” you’ll be disappointed, but if it’s “fast, private, always-on inference,” it’s actually great. The sweet spot tends to be ~20B-70B quantized models for coding, assistants, and structured tasks; anything bigger starts to feel slow or memory-constrained even with 128GB.
droning-on@reddit
Are you using OpenClaw?
Curious if it's able to handle a decently complex context and still perform some coding tasks.
Ie code and follow a workflow at the same time.
Velocita84@reddit
I'm getting really tired of these "honest take"s
_derpiii_@reddit (OP)
wdym? I did a search and didn't find anything.
Velocita84@reddit
It's an overused slop phrase, like "curious what you guys/the community thinks". You might've picked it up from other AI-generated posts if you didn't ask an LLM to make you a title.
brendanl79@reddit
as someone looking to buy a fat Mac in the next few months this question and thread interested me. go sulk in a corner
_derpiii_@reddit (OP)
are you waiting for the M5 Ultra announcement too? 😆
brendanl79@reddit
hahaha exactly
Velocita84@reddit
My problem isn't with the subject matter, it's the literal string "honest take"
_derpiii_@reddit (OP)
idk what's worse: my writing style triggering you (a small minority), or you commenting something that's so negative (which is going to negatively impact more people).
If you don't like it, please go complain on another thread that's actual slop. I wrote this myself, and yes I even use - dashes, long before reddit even existed.
Your problem with me is just you.
_derpiii_@reddit (OP)
Or maybe LLMs are trained on old school redditors with a penchant for this kind of writing style.
Velocita84@reddit
I have NEVER seen it overused this much before LLMs, and it's only in this sub so you know it's because of LLMs
FastDecode1@reddit
Yeah, like wtf are people expecting by saying that? That someone with a dishonest take won't give it to you anyway?
Kinda like combating illegal guns by making it more difficult to obtain a gun legally. I'm sure the criminals will care about the new laws buddy, great job.
_derpiii_@reddit (OP)
hah, fair enough. I agree logically it makes no sense. It's more of a colloquial rhetoric I find helpful in... natural sounding discussions?
Either way, not sure why you're so tilted.
That_Country_7682@reddit
got one last month. 70b quants run surprisingly well, the unified memory is no joke.
nickludlam@reddit
Which 70B models? My impression was that the 70B size had fallen out of fashion, and we're seeing more of a cluster around 30B and 120B.
_derpiii_@reddit (OP)
What applications and workflows do you find it best for? And which models and quantization model would you recommend?
New_Public_2828@reddit
I heard it's ok for large models but if you need speed you need gpus
_derpiii_@reddit (OP)
That's my impression as well, would be curious to hear some concrete examples in this thread
Its_Powerful_Bonus@reddit
MoE ~120B works really well. Prompt processing improved dramatically vs the M3 Max. In token generation there are some improvements, but I expected more of a difference - maybe I'll see it once software adapts. For sure it is not RTX 6000 Pro 96GB speed, but for a device I can take traveling, it's wonderful.