Gemma 4 31B on M5 Max — Ollama or raw MLX?
Posted by Excellent_Koala769@reddit | LocalLLaMA | View on Reddit | 11 comments
Hey Guys,
Running Gemma 4 31B 4-bit on a MacBook Pro M5 Max (128GB) as a local inference server. Currently using mlx_lm.server (raw MLX) and it works well for text + tool calling at ~25 tok/s.
Now I need to add vision/image input. Gemma 4 is multimodal but mlx_lm.server only supports text — returns "Only text content type supported" for image inputs. Tried mlx-vlm.generate() with the same model and got garbage output (known vision tower overflow bug).
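For context, this is the OpenAI-style multimodal request shape that trips the error — a minimal sketch, assuming the standard `image_url` content format and a hypothetical model name (`gemma-4-31b-4bit` is a placeholder, not necessarily what the server registers):

```python
import base64
import json

def image_chat_payload(image_bytes: bytes, question: str) -> dict:
    """Build an OpenAI-style chat payload mixing text and image content.

    mlx_lm.server only accepts plain-string or text-type content, so a
    message shaped like this is what gets rejected with
    "Only text content type supported".
    """
    b64 = base64.b64encode(image_bytes).decode()
    return {
        "model": "gemma-4-31b-4bit",  # placeholder: use your server's model id
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": question},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
    }

payload = image_chat_payload(b"\x89PNG...", "What is in this image?")
print(json.dumps(payload)[:80])
```

A backend that does support vision (or a proxy in front of one) would accept this payload unchanged, which is what makes the framework choice the deciding factor here.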
So I'm at a crossroads: do I stick with raw MLX and keep troubleshooting, or switch to Ollama which handles updates and model compatibility for me?
What I care about:
- Vision + text + tool calling on the same model
- Stable, maintained, don't want to fight framework bugs
- Concurrent request support
- Some control over memory/cache (128GB is shared across multiple services)
For those running Gemma 4 31B locally on Apple Silicon — are you using Ollama or raw MLX? Is Ollama's Apple Silicon performance comparable? Do you get vision and tool calling working reliably through Ollama?
Desperate_Device_908@reddit
I have the exact same setup. What do you use it for? Do you keep your M5 Max on 24/7 when you use it as a local AI inference server?
Excellent_Koala769@reddit (OP)
I basically run my full openclaw setup locally on my laptop. Embeddings, chat UI, LLM, STT, TTS, DB, etc. Trying to get off of other providers' servers. It is crazy how powerful this laptop is... you should be running everything locally. DM me if you want specifics. Happy to share more.
Desperate_Device_908@reddit
Dmed you!
MattOnePointO@reddit
I use oMLX.
Shoulon@reddit
oMLX all the way.
Excellent_Koala769@reddit (OP)
Great advice. I am going to test this out now. Is this a pretty popular option? How long have you been using oMLX?
Excellent_Koala769@reddit (OP)
Fuck yea dude! oMLX works with no performance drop and has solved the vision problem. Going to use this instead of raw MLX. Thanks!
MattOnePointO@reddit
Glad it worked for you too! Enjoy.
Danfhoto@reddit
I don't think MLX is as stable as llama.cpp on Gemma 4 yet. Google itself is still updating the prompt template and working through support issues.
330d@reddit
Why just these two options? Try the latest build of llama.cpp; you don't need Ollama for anything.
Excellent_Koala769@reddit (OP)
I don't want to give up the native speed advantage of MLX.