Best local LLM for Mac Mini M4 (16GB) with 128k+ Context? Gemma 4 runs well but context is too tight
Posted by pepediaz130@reddit | LocalLLaMA | 9 comments
Hi everyone,
I’m currently running an OpenClaw setup on a Mac Mini M4 with 16GB of RAM, and I’m looking for recommendations for a local model that can handle large context windows (ideally 100k-128k+) without crashing or becoming painfully slow.
What I’ve tried:
- Gemma 4 (26B) via Unsloth/llama.cpp: I’m using the IQ3_XXS quantization with Q4_1 KV cache. The performance is surprisingly smooth for its size, but I’m hitting a hard wall with the context window. After just a few messages, the context fills up, and the model loses track or fails.
- Qwen 3.5 (27B) via Ollama: Better context handling (32k), but still not enough for my technical workflows which involve long logs and code documentation.
The Goal:
I need a model that I can "talk to" about large codebases or system logs locally.
My Questions:
- Is it even realistic to aim for 128k context on 16GB of Unified Memory with a 20B+ model?
- Are there specific "Small Language Models" (SLMs) like Phi-4 or Mistral 7B variants that excel at long-context retrieval on Apple Silicon?
- Should I be looking into specific optimizations like Flash Attention (already enabled) or more aggressive KV Cache quantization?
Any advice on model choice or configuration for this specific hardware would be greatly appreciated!
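On question 1, a back-of-the-envelope KV-cache estimate makes the memory pressure concrete: the cache holds K and V tensors for every layer and token, so its size is roughly 2 × layers × KV heads × head dim × tokens × bytes per element. A minimal sketch, using illustrative architecture numbers for a 27B-class transformer (46 layers, 16 KV heads, head dim 128 — assumed values, not any specific model's config):

```python
# Rough KV-cache size estimate for a hypothetical 27B-class model.
# Architecture numbers (layers, KV heads, head dim) are illustrative
# assumptions, not the actual config of the models discussed above.

def kv_cache_bytes(n_layers, n_kv_heads, head_dim, ctx_len, bytes_per_elem):
    # 2x for the K and V tensors, one entry per layer per token
    return 2 * n_layers * n_kv_heads * head_dim * ctx_len * bytes_per_elem

ctx = 128 * 1024
fp16 = kv_cache_bytes(46, 16, 128, ctx, 2.0)   # unquantized 16-bit cache
q4   = kv_cache_bytes(46, 16, 128, ctx, 0.5)   # ~4-bit quantized cache

print(f"FP16 KV cache @128k: {fp16 / 2**30:.1f} GiB")  # -> 46.0 GiB
print(f"~Q4  KV cache @128k: {q4 / 2**30:.1f} GiB")    # -> 11.5 GiB
```

Under these assumptions, even an aggressively quantized 4-bit cache alone would eat most of 16GB of unified memory before the model weights are loaded, which is why 128k context with a 20B+ model on this hardware is likely out of reach.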
kickerua@reddit
I couldn't properly run a good-enough model on an 8GB GPU + 32GB RAM with context higher than 32K tokens.
Locally you can properly run something like gemma-4-E4B-it, which is not what you're looking for.
Blackdragon1400@reddit
You need much much more RAM for this.
draconisx4@reddit
For larger contexts on your Mac Mini M4 setup, try models with efficient attention like those supporting paged attention in llama.cpp; in my runs, bumping to Q5 quantization has handled 128k smoothly without much slowdown.
Mediocre_Paramedic22@reddit
You don’t have enough RAM to do it effectively. Look at OpenRouter or Ollama cloud options for free models that can do the work.
Your only realistic local option is a smaller model, like an 8B, to get that much context, and whether that works for your use case or not is something you can decide.
Or get a second Mac with more RAM so you can run stuff locally. Personally I run Linux on a system with 128GB unified RAM.
tayarndt@reddit
Honestly, you are in the same position I am currently in. I am using the gemma 4 e2 or 4b models. Not the best for large codebases, but they can help with small tasks as well as visual reasoning. I would use Ollama, or you can use Hugging Face through MLX and use the CLI.
SexyAlienHotTubWater@reddit
The method of KV cache quantization you're using right now will destroy performance - TurboQuant is *way* better for the same compression level. But it'll still struggle to perform at 4 bits (if you read the numbers, they say 6.5 bits equivalent is the max you can really push it to without massive degradation).
Why don't you try Bonsai? 8b is something like 1.1gb, and with an aggressive TurboQuant (I don't know if it's implemented in any Bonsai runners yet) the KV cache only gets to 14gb after a lot of context.
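The bit-width claim above is easy to sanity-check numerically. A minimal sketch of KV-cache size versus quantization bit width at 128k context, using assumed numbers for a hypothetical 8B-class model (32 layers, 8 KV heads, head dim 128 — not any specific model's config):

```python
# KV-cache size at different quantization bit widths, for a
# hypothetical 8B-class model (32 layers, 8 KV heads, head dim 128
# -- assumed numbers, not any specific model's config).

def kv_cache_gib(bits_per_elem, ctx_len=128 * 1024,
                 n_layers=32, n_kv_heads=8, head_dim=128):
    elems = 2 * n_layers * n_kv_heads * head_dim * ctx_len  # K and V tensors
    return elems * bits_per_elem / 8 / 2**30

for bits in (16, 8, 6.5, 4):
    print(f"{bits:>4} bits/elem -> {kv_cache_gib(bits):.1f} GiB at 128k ctx")
```

With these assumed dimensions the cache works out to 16.0, 8.0, 6.5, and 4.0 GiB respectively, so the jump from a 16-bit cache down to the ~6.5-bit-equivalent regime mentioned above is what makes long context on 16GB even worth discussing.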
Pitpeaches@reddit
Are you using turboquant? Other than that not much else you can do