Best local LLM for Mac Mini M4 (16GB) with 128k+ Context? Gemma 4 runs well but context is too tight

Posted by pepediaz130@reddit | LocalLLaMA

Hi everyone,

I’m currently running an OpenClaw setup on a Mac Mini M4 with 16GB of RAM, and I’m looking for recommendations for a local model that can handle large context windows (ideally 100k-128k+ tokens) without crashing or becoming painfully slow.
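To make the question concrete, this is roughly the shape of configuration I mean (a minimal llama-cpp-python sketch, not my actual OpenClaw setup; the model path is a placeholder, and the `flash_attn` / `type_k` / `type_v` parameter names are my assumption from the bindings I've seen, so check your version):

```python
# Minimal sketch (not my real setup): llama-cpp-python on Apple Silicon with a
# long context and a quantized KV cache. Parameter names assume a recent
# llama-cpp-python build; the GGUF path is a placeholder.
import llama_cpp
from llama_cpp import Llama

llm = Llama(
    model_path="models/some-12b-instruct-q4_k_m.gguf",  # placeholder GGUF
    n_ctx=131072,                # 128k context -- the part that blows up memory
    n_gpu_layers=-1,             # offload all layers to Metal
    flash_attn=True,             # V-cache quantization needs flash attention on
    type_k=llama_cpp.GGML_TYPE_Q8_0,  # quantize K cache (assumed constant name)
    type_v=llama_cpp.GGML_TYPE_Q8_0,  # quantize V cache
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Summarize this log chunk: ..."}],
    max_tokens=256,
)
print(out["choices"][0]["message"]["content"])
```

The flags are easy enough to set; what I can't figure out is which model actually stays usable once the context gets this large on 16GB.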

What I’ve tried:

  * Gemma 4: it runs well, but the context window I can actually fit on 16GB is too tight for what I need.

The Goal:
I need a model that I can "talk to" about large codebases or system logs locally.
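The workflow I'm after looks something like this (a sketch, assuming an OpenAI-compatible local endpoint such as llama-server or LM Studio on localhost:8080; the URL and log path are placeholders):

```python
# Sketch of the goal: push a big chunk of a local log file at a locally served
# model. Assumes an OpenAI-compatible /v1/chat/completions endpoint on
# localhost:8080; URL and file path are placeholders.
import json
import urllib.request

# Read the tail of a log file (placeholder path, last ~200k characters).
log_text = open("/var/log/system.log", errors="replace").read()[-200_000:]

payload = {
    "model": "local",  # most local servers ignore or loosely match this field
    "messages": [
        {"role": "system", "content": "You answer questions about the provided logs."},
        {"role": "user", "content": f"Here are my logs:\n{log_text}\n\nWhich errors repeat most often?"},
    ],
    "max_tokens": 512,
}

req = urllib.request.Request(
    "http://localhost:8080/v1/chat/completions",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.loads(resp.read())["choices"][0]["message"]["content"])
```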

My Questions:

  1. Is it even realistic to aim for 128k context on 16GB of unified memory with a 20B+ model? (My rough KV cache math is right after this list.)
  2. Are there specific "Small Language Models" (SLMs) like Phi-4 or Mistral 7B variants that excel at long-context retrieval on Apple Silicon?
  3. Should I be looking into further optimizations beyond Flash Attention (already enabled), such as more aggressive KV cache quantization?
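
Here is my back-of-envelope reasoning on question 1. The architecture numbers are purely illustrative (a hypothetical 7B-class model with grouped-query attention), not any specific model:

```python
# Back-of-envelope KV cache size at 128k context. Layer/head/dim numbers are
# illustrative for a hypothetical 7B-class GQA model, not a real config.
n_layers   = 32        # transformer blocks
n_kv_heads = 8         # KV heads (GQA)
head_dim   = 128       # per-head dimension
ctx        = 131_072   # 128k tokens

# K + V elements stored per token across all layers.
elems_per_token = 2 * n_layers * n_kv_heads * head_dim

for name, elem_bytes in [("f16", 2.0), ("q8_0 (~1 B/elem)", 1.0), ("q4_0 (~0.5 B/elem)", 0.5)]:
    gib = elems_per_token * elem_bytes * ctx / 2**30
    print(f"{name:>18}: ~{gib:.1f} GiB of KV cache")
# f16 comes out around 16 GiB for these numbers, i.e. the cache alone eats the
# whole machine before model weights; q8_0 / q4_0 roughly halve / quarter that.
```

If that math is roughly right, an unquantized KV cache at 128k is hopeless on this machine even for a mid-size model, which is why question 3 matters so much to me.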

Any advice on model choice or configuration for this specific hardware would be greatly appreciated!