Gemma 4 31B on M5 Max — Ollama or raw MLX?

Posted by Excellent_Koala769@reddit | LocalLLaMA | 11 comments

Hey Guys,

Running Gemma 4 31B 4-bit on a MacBook Pro M5 Max (128GB) as a local inference server. Currently using mlx_lm.server (raw MLX) and it works well for text + tool calling at ~25 tok/s.
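For context, the server is launched roughly like this (a sketch; the model path and port are placeholders, check `mlx_lm.server --help` for the exact flags on your version):

```shell
# Serve a local MLX model over an OpenAI-compatible HTTP API.
# The model repo/path and port below are illustrative placeholders.
mlx_lm.server \
  --model mlx-community/my-gemma-4bit \
  --port 8080
```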

Now I need to add vision/image input. Gemma 4 is multimodal, but mlx_lm.server only supports text: it returns "Only text content type supported" for image inputs. I tried mlx_vlm's generate() with the same model and got garbage output (a known vision-tower overflow bug).
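For reference, this is the OpenAI-style multimodal message shape that triggers the rejection, built as a plain payload (a sketch; the model name and image bytes are placeholders):

```python
import base64
import json

def build_vision_request(image_b64: str, question: str) -> dict:
    """Build an OpenAI-compatible /v1/chat/completions payload with an
    image content part -- the shape mlx_lm.server currently rejects with
    "Only text content type supported"."""
    return {
        "model": "gemma",  # model name is illustrative
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": question},
                    {
                        "type": "image_url",
                        "image_url": {
                            "url": f"data:image/png;base64,{image_b64}"
                        },
                    },
                ],
            }
        ],
    }

# Placeholder bytes stand in for a real PNG file.
fake_png = base64.b64encode(b"\x89PNG").decode()
req = build_vision_request(fake_png, "What is in this image?")
print(json.dumps(req, indent=2))
```

A text-only server accepts `content` as a plain string; it's the list-of-parts form with an `image_url` entry that gets refused.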

So I'm at a crossroads: do I stick with raw MLX and keep troubleshooting, or switch to Ollama which handles updates and model compatibility for me?

What I care about:

- Vision/image input actually working
- Tool calling staying reliable
- Throughput staying close to the ~25 tok/s I get now

For those running Gemma 4 31B locally on Apple Silicon: are you using Ollama or raw MLX? Is Ollama's Apple Silicon performance comparable? And do you get vision and tool calling working reliably through Ollama?