Gemma 4 31B on M5 Max — Ollama or raw MLX?

Posted by Excellent_Koala769@reddit | LocalLLaMA | 11 comments

Hey Guys,

Running Gemma 4 31B 4-bit on a MacBook Pro M5 Max (128GB) as a local inference server. Currently using mlx_lm.server (raw MLX) and it works well for text + tool calling at ~25 tok/s.
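For context, the server is launched roughly like this (a sketch; the model path and port are placeholders, check `mlx_lm.server --help` for the exact flags on your version):

```shell
# Serve a local MLX model over an OpenAI-compatible HTTP API.
# The model repo/path and port below are illustrative placeholders.
mlx_lm.server \
  --model mlx-community/my-gemma-4bit \
  --port 8080
```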

Now I need to add vision/image input. Gemma 4 is multimodal, but mlx_lm.server only supports text: it returns "Only text content type supported" for image inputs. I tried mlx_vlm's generate() with the same model and got garbage output (a known vision-tower overflow bug).
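For reference, this is the OpenAI-style multimodal message shape that triggers the rejection, built as a plain payload (a sketch; the model name and image bytes are placeholders):

```python
import base64
import json

def build_vision_request(image_b64: str, question: str) -> dict:
    """Build an OpenAI-compatible /v1/chat/completions payload with an
    image content part -- the shape mlx_lm.server currently rejects with
    "Only text content type supported"."""
    return {
        "model": "gemma",  # model name is illustrative
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": question},
                    {
                        "type": "image_url",
                        "image_url": {
                            "url": f"data:image/png;base64,{image_b64}"
                        },
                    },
                ],
            }
        ],
    }

# Placeholder bytes stand in for a real PNG file.
fake_png = base64.b64encode(b"\x89PNG").decode()
req = build_vision_request(fake_png, "What is in this image?")
print(json.dumps(req, indent=2))
```

A text-only server accepts `content` as a plain string; it's the list-of-parts form with an `image_url` entry that gets refused.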

So I'm at a crossroads: do I stick with raw MLX and keep troubleshooting, or switch to Ollama which handles updates and model compatibility for me?

What I care about:

- Vision/image input actually working
- Tool calling staying reliable
- Throughput staying close to the ~25 tok/s I get now

For those running Gemma 4 31B locally on Apple Silicon: are you using Ollama or raw MLX? Is Ollama's Apple Silicon performance comparable? And do you get vision and tool calling working reliably through Ollama?