Gemma4:31b-coding-mtp-bf16 - slow on MacBook M5 128GB
Posted by chimph@reddit | LocalLLaMA | View on Reddit | 15 comments
Very quick initial test of the new Gemma 4 MTP model via Ollama (llama.cpp doesn't support it yet).
https://blog.google/innovation-and-ai/technology/developers-tools/multi-token-prediction-gemma-4/
Running it in Open WebUI to view token/s output, I get 10-12 tok/s.
Will have to wait for benchmarks to see if this is worth running instead of Qwen3.6 27b or Qwen3 Coder Next for tasks that don't need babysitting.
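If anyone wants to sanity-check the numbers outside Open WebUI, here's a rough sketch that hits Ollama's local /api/generate endpoint and computes PP/TG rates from the stats it returns (the model tag is the one above; the prompt is just a placeholder):

```python
import json
import urllib.request

# Non-streaming generate call; Ollama returns token counts and durations.
MODEL = "gemma4:31b-coding-mtp-bf16"

req = urllib.request.Request(
    "http://localhost:11434/api/generate",
    data=json.dumps({
        "model": MODEL,
        "prompt": "Write a binary search in Python.",
        "stream": False,
    }).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    stats = json.load(resp)

# Ollama reports durations in nanoseconds.
pp = stats["prompt_eval_count"] / stats["prompt_eval_duration"] * 1e9
tg = stats["eval_count"] / stats["eval_duration"] * 1e9
print(f"PP: {pp:.0f} tok/s  TG: {tg:.2f} tok/s")
```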

ConversationNice3225@reddit
Ollama uses llama.cpp under the hood, and as you already noted, they haven't implemented MTP.
To run the new MTP model you probably have to run MTPLX? Unfortunately I don't follow the Mac ecosystem, so I don't know more.
pkief@reddit
It seems Ollama already has support; check this PR: https://github.com/ollama/ollama/pull/15980
pkief@reddit
Just noticed, it's actually also mentioned in their release notes: https://github.com/ollama/ollama/releases/tag/v0.23.1
chimph@reddit (OP)
Read the release article I linked; it specifically links to Ollama and the model.
redmctrashface@reddit
That's barely usable. It's quite disappointing for the M5 Max, but I guess memory bandwidth is the culprit and it hits hard.
DragonfruitIll660@reddit
12 is not bad; the BF16 weights are like 60ish GB, right? Not too bad overall.
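(Back of the envelope: 31B parameters × 2 bytes per parameter in BF16 ≈ 62 GB of weights, so 60ish checks out. And since dense TG has to stream all of that for every generated token, the rate is roughly memory bandwidth ÷ 62 GB.)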
chimph@reddit (OP)
I was under the impression (perhaps wrongly) that MTP would give a boost to dense models.
Accomplished_Ad9530@reddit
Looking at your PP rate it definitely should. What TG rate do you get without MTP?
chimph@reddit (OP)
Ah, so I pulled gemma4:31b-mlx-bf16 (3 weeks old), which is clearly the same base model, since the pull resolved instantly. And generation is actually a lot faster with the MTP version: for the same test I only got 7 tok/s with the non-MTP one.
Accomplished_Ad9530@reddit
Huh, I wouldn’t expect MTP to increase PP speed by 4x. I wonder what’s going on there.
chimph@reddit (OP)
tested again properly in a new chat within open webui:
MTP: PP 402 tok/s, TG 13.64 tok/s
non-MTP: PP 436 tok/s, TG 7.24 tok/s
So a decent improvement in TG but no difference for PP.
Accomplished_Ad9530@reddit
Ah, cool, that matches my expectations well. Thanks for rerunning your tests.
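For intuition: MTP is essentially self-speculative decoding. Cheap extra heads draft a few tokens ahead, and one full forward pass verifies them all at once, so TG gets multiple tokens per expensive pass, while PP is already a single batched pass over the prompt and is untouched. A toy sketch of the loop, with made-up stand-ins rather than anything from Ollama or Gemma:

```python
import random

random.seed(0)
VOCAB = list("abcdefgh")

def big_model(ctx):
    # Stand-in for one expensive full forward pass: deterministic next token.
    return VOCAB[(len(ctx) * 3) % len(VOCAB)]

def draft_head(ctx):
    # Stand-in for a cheap MTP draft head that agrees ~80% of the time.
    return big_model(ctx) if random.random() < 0.8 else random.choice(VOCAB)

def generate(prompt, n_new, k=4):
    ctx, big_passes = list(prompt), 0
    while len(ctx) < len(prompt) + n_new:
        drafts = []
        for _ in range(k):                    # draft k tokens cheaply
            drafts.append(draft_head(ctx + drafts))
        big_passes += 1                       # one big pass verifies all k drafts
        for d in drafts:
            if d == big_model(ctx):
                ctx.append(d)                 # accept the matching prefix...
            else:
                ctx.append(big_model(ctx))    # ...on mismatch, keep the verified token
                break
    return "".join(ctx[:len(prompt) + n_new]), big_passes

_, passes = generate("abc", 40)
print(f"40 tokens in {passes} big passes (~{40 / passes:.1f} tok/pass)")
```

The accepted-tokens-per-pass ratio (minus draft overhead) is roughly the TG speedup, which lines up with the ~1.9x you measured, and the prompt gets one batched pass either way.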
chimph@reddit (OP)
Oh, probably my bad: I ran the new test in the same context. Let me test properly in a bit.
FrozenFishEnjoyer@reddit
This must be an M5 Max, right?
Also what's the quant here?
chimph@reddit (OP)
Yes, M5 Max. The model is unquantised. I've edited the post with new findings.