Is a 128 GB MacBook Pro M5 Max actually too slow for large-context local LLM coding workflows?
Posted by bajis12870@reddit | LocalLLaMA
People are warning me about the prompt-processing speed of a MacBook Pro M5 Max with 128 GB RAM.
My main concern is prompt ingestion / prefill latency and large-context handling — not raw token generation speed (which I think is OK).
I only plan to use Qwen 3.5 / 3.6 / 3.7 models or similar, mostly coding-focused MoE or dense variants, with MTP (Multi-Token Prediction) and TurboQuant (or similar) for agentic coding workflows:
- OpenCode
- Claude Code–style agents
- custom tooling
No image/video generation.
I'm especially interested in real-world performance on:
- large Rust / Go / Python / TypeScript repos
- ~300k LOC projects
- long-running agent sessions
- heavy tool usage
- RAG/codebase indexing
- multi-file edits
- context windows in the 32k–256k+ range
What I'm trying to understand is:
- What are the actual prompt-processing / prefill speeds (tokens/sec)?
- How does TTFT feel in practice once contexts become large?
- Does performance collapse at larger context sizes?
- How much does the choice of MLX vs llama.cpp matter?
- How usable is it for real coding-agent workflows compared to cloud models?
- Does prompt caching materially improve the experience?
- At what repo/context size does the experience become frustrating?
If possible, can you please include the following?
- exact model + quantization
- runtime (MLX, llama.cpp, Ollama, LM Studio, etc.)
- context size
- prompt-processing speed
- generation speed
- RAM usage
- real workflow examples
- whether the bottleneck was compute, memory bandwidth, or context compaction
- M3/M4/M5 comparisons if available
THAAAANKS!
Toastti@reddit
The main thing is to use oMLX, and you must have caching enabled. Your first message will take a while to process, but once it starts using the cache it's pretty decent in things like OpenCode or, even better, Pi.
looctonmi@reddit
oMLX is not a serious project
JLeonsarmiento@reddit
Why do you say that? I find it quite good tbh.
looctonmi@reddit
I frequently get system popups asking for access for “H___ Kim” because the author signed the cert with his own name. One of the last updates broke the app and prevented it from loading certain models. If you just look at the repo, there are tons of valid PRs open that aren’t being looked at while the project is actively getting vibe coded. There are no CI tests at all. For me it’s spooky and I don’t really want this software getting updates on my Mac.
dametsumari@reddit
I have not seen a single popup, and I have used it almost from the start. I am using the app version, with occasional manually triggered updates. Not sure if brew is different or not.
looctonmi@reddit
https://www.reddit.com/r/LocalLLaMA/s/dNIfLkiVqN
Hydroskeletal@reddit
This. You may need to tune things to get your cache hit rate up, but it makes a huge difference.
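To make the cache hit rate point concrete, here is a toy sketch of prefix caching. It assumes the runtime can only reuse the leading run of tokens that is identical to the previous request, which is a simplification; real runtimes such as oMLX or LM Studio may cache differently. The function names and the integer token IDs are made up for illustration.

```python
from typing import Sequence


def shared_prefix_len(cached: Sequence[int], new: Sequence[int]) -> int:
    """Count the leading tokens that are identical in both sequences."""
    n = 0
    for a, b in zip(cached, new):
        if a != b:
            break
        n += 1
    return n


def tokens_to_prefill(cached: Sequence[int], new: Sequence[int]) -> int:
    """Tokens that still need prompt processing after reusing the cached prefix."""
    return len(new) - shared_prefix_len(cached, new)


if __name__ == "__main__":
    # Hypothetical token IDs: a 1000-token conversation so far.
    previous_turn = list(range(1000))

    # Next turn appends 50 new tokens: only those 50 need prefill.
    appended = previous_turn + [5000 + i for i in range(50)]
    print(tokens_to_prefill(previous_turn, appended))

    # Same turn, but the very first token changed (for example a timestamp
    # at the top of the system prompt): nothing can be reused.
    changed_start = [9999] + appended[1:]
    print(tokens_to_prefill(previous_turn, changed_start))
```

Under this assumption, anything that changes early in the prompt wipes out the reusable prefix, so keeping the system prompt and tool definitions stable between turns is one way to raise the hit rate.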
MrPecunius@reddit
LM Studio has prompt caching for MLX too.
Embarrassed-Rich3397@reddit
MoE would definitely be the better option for you, unless you want to sit through very long wait times with a dense model running on unified memory. Maybe try Qwen3.5 122B or Qwen3 Coder Next at higher precision.
Hyiazakite@reddit
Agreed. You have to use MoE for speed, with the trade-off that you'll have to run a larger model for equivalent quality (122B ~ 27B dense, at least according to benchmarks; I'd argue 27B dense is better from my experience, though). However, the larger the routing model is, the slower it gets, and I think 10B is the top limit on my M3 Ultra. From what I can tell from the oMLX benchmarks, M5 Max should be OK at 10B. Qwen3 Coder 80BA3B or Qwen 3.6 35A3B if you want the most speed.
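A rough way to see why MoE helps prefill: the forward pass costs roughly 2 FLOPs per active parameter per token, so prompt-processing work scales with active parameters, while memory footprint scales with total parameters. The sketch below uses that rule of thumb and ignores attention cost (which grows with context length), the KV cache, and runtime overhead. Reading "35A3B" as 35B total with about 3B active is an assumption based on the usual naming convention.

```python
def prefill_flops(active_params: float, prompt_tokens: int) -> float:
    """Rough forward-pass cost: ~2 FLOPs per active parameter per token.

    Attention cost is ignored, so this understates long-context prefill.
    """
    return 2.0 * active_params * prompt_tokens


def weight_gib(total_params: float, bits_per_weight: float) -> float:
    """Approximate weight memory in GiB (no KV cache, no runtime overhead)."""
    return total_params * bits_per_weight / 8 / 2**30


if __name__ == "__main__":
    prompt = 64_000  # tokens, an arbitrary example size

    dense = prefill_flops(27e9, prompt)  # 27B dense: all parameters active
    moe = prefill_flops(3e9, prompt)     # MoE with ~3B active (assumed)
    print(f"dense / MoE prefill compute: {dense / moe:.1f}x")

    # Weight footprint depends on total parameters, not active ones.
    for name, total in (("27B dense", 27e9), ("35B MoE", 35e9), ("122B MoE", 122e9)):
        print(f"{name} @ 8-bit: {weight_gib(total, 8):.1f} GiB")
```

This is only an estimate of relative compute; measured prompt-processing speed also depends on the runtime and kernels, so the public benchmarks mentioned in this thread are the better source for actual numbers.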
catplusplusok@reddit
These days modern Mac hardware is faster and prompt caching relieves long-context concerns, so it should do fine. Also look into Gemma 4 31B; it has efficient MTP.
JLeonsarmiento@reddit
Check the oMLX public benchmarks; the system configuration you’re asking about very likely has data there already.
I’m rocking an M4 Pro with 48 GB RAM, running Qwen3.6-MoE-6bit and Gemma4-Moe-8bit with 64K context windows in Pi/Opencode/Vibe and even Cline, and I’m more than happy.
M5 max should be a beast.
havnar-@reddit
With oMLX it’s pretty fast, one of the new things in M5.
Fit_Concept5220@reddit
PP (prompt processing) on the dense models you listed will be 50-400 t/s; there's no way to use that in agentic workflows. MoE would be fine.
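For a sense of scale, the arithmetic behind that claim can be sketched as below. It assumes time to first token is dominated by prefill, that the prefill rate stays constant across the whole prompt (in practice it tends to drop as context grows), and that nothing is served from cache. The function name is just for illustration.

```python
def ttft_seconds(prompt_tokens: int, prefill_tps: float) -> float:
    """Time to first token if prefill is the only cost and nothing is cached."""
    if prefill_tps <= 0:
        raise ValueError("prefill_tps must be positive")
    return prompt_tokens / prefill_tps


if __name__ == "__main__":
    # Context sizes from the original question, speeds from the 50-400 t/s range.
    for ctx in (32_000, 128_000, 256_000):
        for tps in (50, 400):
            minutes = ttft_seconds(ctx, tps) / 60
            print(f"{ctx:>7} tokens @ {tps:>3} t/s -> {minutes:.1f} min")
```

At 400 t/s a cold 32k-token prompt takes 80 seconds, and at 50 t/s a cold 256k-token prompt takes 5120 seconds (about 85 minutes), which is why uncached dense-model prefill is hard to live with in an agent loop.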
MrPecunius@reddit
You can explore Apple Silicon performance here:
https://omlx.ai/compare
I'm pretty happy with my M5 Pro/64GB with Qwen3.6 models (27b/35b a3b) @ 8-bit.
brycesub@reddit
If you're getting a 128GB MacBook Pro M5, you should check out https://github.com/antirez/ds4 . IMO this is the state of the art for agentic workflows on your hardware. Run DeepSeek 4 Flash, which is going to crush Qwen.
Electrical_Rub_6009@reddit
It’s probably not great for workflows where you need to sit there and interact with it, but if you need local workers that run on a box you can ssh into, it’s amazing.
For my projects, I really need offline work that can run 24/7 in the background and then send the results back to my main box. Token generation speed is irrelevant.
So the M5 lineup for me is really far, far better than CUDA workstations in terms of value per dollar spent.