Is a 128 GB MacBook Pro M5 Max actually too slow for large-context local LLM coding workflows?

Posted by bajis12870@reddit | LocalLLaMA | 17 comments

People are warning me about the prompt-processing speed of a MacBook Pro M5 Max with 128 GB RAM.

My main concern is prompt ingestion / prefill latency and large-context handling — not raw token generation speed (which I think is OK).

I only plan to use Qwen 3.5 / 3.6 / 3.7 models or similar, mostly coding-focused MoE or dense variants, with MTP (Multi-Token Prediction) and TurboQuant (or similar), for agentic coding workflows:

OpenCode
Claude Code–style agents
custom tooling

No image/video generation.

I'm especially interested in real-world performance on:

large Rust / Go / Python / TypeScript repos
~300k LOC projects
long-running agent sessions
heavy tool usage
RAG/codebase indexing
multi-file edits
context windows in the 32k–256k+ range

What I'm trying to understand is:

What are the actual prompt-processing / prefill speeds (tokens/sec)?
How does TTFT feel in practice once contexts become large?
Does performance collapse at larger context sizes?
How much does MLX vs llama.cpp matter?
How usable is it for real coding-agent workflows compared to cloud models?
Does prompt caching materially improve the experience?
At what repo/context size does the experience become frustrating?

If possible, can you please include the following?

exact model + quantization
runtime (MLX, llama.cpp, Ollama, LM Studio, etc.)
context size
prompt-processing speed
generation speed
RAM usage
real workflow examples
whether the bottleneck was compute, memory bandwidth, or context compaction
M3/M4/M5 comparisons if available

THAAAANKS!

Toastti@reddit

The main thing is to use oMLX, and you must have caching enabled. Your first message will take a bit to process, but once it starts using the cache it's pretty decent in things like opencode or, even better, pi.

looctonmi@reddit

oMLX is not a serious project

JLeonsarmiento@reddit

Why do you say that? I find it quite good tbh.

I frequently get system popups asking for access for “H___ Kim” because the author signed the cert with his own name. One of the last updates broke the app and prevented it from loading certain models. If you just look at the repo, there are tons of valid PRs open that aren’t being looked at while the project is actively getting vibe coded. There are no CI tests at all. For me it’s spooky, and I don’t really want this software getting updates on my Mac.

dametsumari@reddit

I have not seen a single pop-up, and I have used it almost from the start. I am using the app version, with occasional manually triggered updates. Not sure if brew is different or not.

https://www.reddit.com/r/LocalLLaMA/s/dNIfLkiVqN

Hydroskeletal@reddit

This. You may need to tune things to get your cache hit rate up, but it makes a huge difference.
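To make the cache-hit-rate point concrete, here is a rough back-of-the-envelope sketch. It assumes that only the uncached part of the prompt has to be prefilled and that prefill speed stays constant. The function name `effective_prefill_seconds` and the numbers (100k-token prompt, 500 tokens/sec) are made up for illustration and are not measurements from any Mac.

```python
def effective_prefill_seconds(prompt_tokens: int,
                              cache_hit_rate: float,
                              prefill_tps: float) -> float:
    """Estimate prefill time when a fraction of the prompt is served from cache.

    Simplifying assumptions (not from the thread): cached tokens cost nothing,
    and prefill speed does not degrade as the context grows.
    """
    if not 0.0 <= cache_hit_rate <= 1.0:
        raise ValueError("cache_hit_rate must be between 0 and 1")
    if prefill_tps <= 0:
        raise ValueError("prefill_tps must be positive")
    uncached_tokens = prompt_tokens * (1.0 - cache_hit_rate)
    return uncached_tokens / prefill_tps


# Illustrative numbers only: a 100k-token agent context at 500 tokens/sec.
for hit_rate in (0.0, 0.5, 0.9, 0.99):
    seconds = effective_prefill_seconds(100_000, hit_rate, 500.0)
    print(f"hit rate {hit_rate:.0%}: about {seconds:.0f} s before the first token")
```

Under these assumptions the wait shrinks in direct proportion to the uncached fraction, which is why a cache miss (for example, an agent rewriting the start of its prompt) hurts so much more than a long but stable context.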

MrPecunius@reddit

LM Studio has prompt caching for MLX too.

Embarrassed-Rich3397@reddit

MoE would definitely be the better option for you, unless you want to put up with very long wait times on a dense model running on unified memory. Maybe try 122b qwen3.5 or qwen3 coder next at higher precision.

Hyiazakite@reddit

Agreed. You have to use MoE for speed, with the trade-off that you'll have to run a larger model for equivalent quality (122B ~ 27B dense, at least according to benchmarks; I'd argue 27B dense is better in my experience, though). However, the larger the routing model is, the slower it gets, and I think 10B is the upper limit on my M3 Ultra. From what I can tell from the oMLX benchmarks, the M5 Max should be OK at 10B. Qwen3 coder 80BA3B or Qwen 3.6 35A3B if you want the most speed.

catplusplusok@reddit

These days modern Mac hardware is faster and prompt caching relieves long-context concerns, so it should do fine. Also look into Gemma 4 31B, it has efficient MTP.

Check the oMLX public benchmarks; very likely the system configuration you’re asking about has data there already.

I’m rocking M4Pro 48gb ram with Qwen3.6-MoE-6bit and Gemma4-Moe-8bit with context windows of 64K with Pi/Opencode/Vibe and even Cline and I’m more than happy.

M5 max should be a beast.

havnar-@reddit

With omlx it’s pretty fast, one of the new things in m5

Fit_Concept5220@reddit

PP on the dense models you listed will be 50-400 t/s; there's no way to use that in agentic workflows. MoE would be fine.
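For a sense of scale, here is a quick sketch of what the 50-400 tokens/sec range quoted above would mean for time to first token on an uncached prompt. It assumes TTFT is roughly prompt length divided by prefill speed and that the speed stays flat as context grows, which is probably optimistic. The helper name `prefill_seconds` is just for illustration.

```python
def prefill_seconds(prompt_tokens: int, prefill_tps: float) -> float:
    """Approximate time to first token as prompt length / prefill speed.

    Assumption: prefill speed is constant regardless of context size.
    """
    if prefill_tps <= 0:
        raise ValueError("prefill_tps must be positive")
    return prompt_tokens / prefill_tps


# Context sizes from the 32k-256k range in the original question,
# at the low and high ends of the quoted dense-model prefill speeds.
for context_tokens in (32_000, 128_000, 256_000):
    best_case = prefill_seconds(context_tokens, 400.0)
    worst_case = prefill_seconds(context_tokens, 50.0)
    print(f"{context_tokens:>7} tokens: "
          f"{best_case / 60:.1f} to {worst_case / 60:.1f} minutes")
```

Even at the fast end, a cold 128k-token prompt works out to several minutes of waiting under these assumptions, which is the reason the replies keep pointing at MoE models and prompt caching.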

You can explore Apple Silicon performance here:

https://omlx.ai/compare

I'm pretty happy with my M5 Pro/64GB with Qwen3.6 models (27b/35b a3b) @ 8-bit.

brycesub@reddit

If you're getting a 128GB Macbook Pro M5 you should check out https://github.com/antirez/ds4 . IMO this is the state of the art for agentic workflows on your hardware. Run DeepSeek 4 Flash which is going to crush Qwen.

Electrical_Rub_6009@reddit

It’s probably not great if you need to sit there and interact with it, but if you need local workers running on a box you can ssh into, it’s amazing.

For my projects, I really need offline work that can run 24/7 in the background and then send the results back to my main box. Token generation speed is irrelevant.

So the M5 lineup for me is really far, far better than CUDA workstations in terms of value per dollar spent.