Is a 128 GB MacBook Pro M5 Max actually too slow for large-context local LLM coding workflows?

Posted by bajis12870@reddit | LocalLLaMA | 17 comments

People are warning me about the prompt-processing speed of a MacBook Pro M5 Max with 128 GB RAM.

My main concern is prompt ingestion (prefill) latency and large-context handling, not raw token-generation speed, which I think is fine.
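A rough back-of-envelope model shows why prefill matters more than decode speed once prompts get big: time to first token is roughly the number of uncached prompt tokens divided by prefill throughput. The Python sketch below is a first-order estimate only. The function name `estimate_latency` is hypothetical, the speeds are assumed constant (in practice prefill throughput typically drops as context grows, so this is optimistic), and all numbers in the usage lines are made-up placeholders, not measurements of an M5 Max or any other machine.

```python
def estimate_latency(prompt_tokens: int, cached_tokens: int, prefill_tps: float,
                     output_tokens: int, decode_tps: float) -> tuple[float, float]:
    """Hypothetical first-order latency estimate (seconds).

    Assumes constant prefill/decode throughput and ignores model-load and
    scheduling overhead. Tokens already in a prompt cache skip prefill.
    """
    uncached = max(prompt_tokens - cached_tokens, 0)  # tokens that must be prefilled
    ttft = uncached / prefill_tps                      # time to first token
    total = ttft + output_tokens / decode_tps          # plus generation time
    return ttft, total


# Placeholder numbers for illustration only, not benchmarks.
cold = estimate_latency(100_000, 0, 500, 1_000, 50)       # nothing cached
warm = estimate_latency(100_000, 95_000, 500, 1_000, 50)  # most of the prefix cached
print(f"cold: TTFT {cold[0]:.0f} s, total {cold[1]:.0f} s")  # cold: TTFT 200 s, total 220 s
print(f"warm: TTFT {warm[0]:.0f} s, total {warm[1]:.0f} s")  # warm: TTFT 10 s, total 30 s
```

With these placeholder rates, the 1,000 generated tokens cost 20 s either way, while prefill swings between 200 s and 10 s depending on how much of the prompt is cached. That is why prefill speed and prompt caching dominate agentic workflows that resend large repo context.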

I only plan to use Qwen 3.5 / 3.6 / 3.7 models or similar (mostly coding-focused MoE or dense variants) with MTP (Multi-Token Prediction) and TurboQuant (or similar) for agentic coding workflows.

No image/video generation.

I'm especially interested in real-world performance on:

What I'm trying to understand is:

  1. What are the actual prompt-processing / prefill speeds (tokens/sec)?
  2. How does time to first token (TTFT) feel in practice once contexts become large?
  3. Does performance collapse at larger context sizes?
  4. How much difference does MLX vs. llama.cpp make?
  5. How usable is it for real coding-agent workflows compared to cloud models?
  6. Does prompt caching materially improve the experience?
  7. At what repo/context size does the experience become frustrating?

If possible, can you please include the following?

THAAAANKS!