Gemma4:31b-coding-mtp-bf16 - slow on MacBook M5 128GB
Posted by chimph@reddit | LocalLLaMA | View on Reddit | 15 comments
Very quick initial test of the new Gemma 4 MTP model via Ollama (llama.cpp doesn't support it yet).
https://blog.google/innovation-and-ai/technology/developers-tools/multi-token-prediction-gemma-4/
Running it in Open WebUI to view token/s output, I get 10-12 tok/s.
Will have to wait for benchmarks to see if this is worth running instead of Qwen3.6 27b or Qwen3 Coder Next for tasks that don't need babysitting.
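If anyone wants to sanity-check the numbers outside Open WebUI, here's a rough sketch that hits Ollama's local /api/generate endpoint and computes PP/TG rates from the stats it returns (the model tag is the one above; the prompt is just a placeholder):

```python
import json
import urllib.request

# Non-streaming generate call; Ollama returns token counts and durations.
MODEL = "gemma4:31b-coding-mtp-bf16"

req = urllib.request.Request(
    "http://localhost:11434/api/generate",
    data=json.dumps({
        "model": MODEL,
        "prompt": "Write a binary search in Python.",
        "stream": False,
    }).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    stats = json.load(resp)

# Ollama reports durations in nanoseconds.
pp = stats["prompt_eval_count"] / stats["prompt_eval_duration"] * 1e9
tg = stats["eval_count"] / stats["eval_duration"] * 1e9
print(f"PP: {pp:.0f} tok/s  TG: {tg:.2f} tok/s")
```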

ConversationNice3225@reddit
Ollama uses llama.cpp under the hood, and as you already noted, they haven't implemented MTP.
To run the new MTP model you probably have to run MTPLX? Unfortunately I don't follow the Mac ecosystem, so I don't know more.
pkief@reddit
It seems Ollama already has support; check this PR: https://github.com/ollama/ollama/pull/15980
pkief@reddit
Just noticed, it's actually also mentioned in their release notes: https://github.com/ollama/ollama/releases/tag/v0.23.1
chimph@reddit (OP)
Read the release article I linked; it specifically links to Ollama and the model.
redmctrashface@reddit
That's barely usable. It's quite disappointing for the M5 Max, but I guess memory bandwidth is the culprit and it hits hard.
DragonfruitIll660@reddit
12 is not bad; the BF16 weights are like 60ish GB, right? Not too bad overall.
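(Back of the envelope: 31B parameters × 2 bytes per parameter in BF16 ≈ 62 GB of weights, so 60ish checks out. And since dense TG has to stream all of that for every generated token, the rate is roughly memory bandwidth ÷ 62 GB.)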
chimph@reddit (OP)
I was under the impression (perhaps wrongly) that MTP would give a boost to dense models.
Accomplished_Ad9530@reddit
Looking at your PP rate it definitely should. What TG rate do you get without MTP?
chimph@reddit (OP)
Ah, so I pulled gemma4:31b-mlx-bf16 (3 weeks old), which is clearly the same base model, since the pull resolved instantly. And generation is actually a lot faster with the MTP version: for the same test I only got 7 tok/s with the non-MTP one.
Accomplished_Ad9530@reddit
Huh, I wouldn’t expect MTP to increase PP speed by 4x. I wonder what’s going on there.
chimph@reddit (OP)
tested again properly in a new chat within open webui:
MTP: PP 402 tok/s, TG 13.64 tok/s
non-MTP: PP 436 tok/s, TG 7.24 tok/s
So a decent improvement in TG but no difference for PP.
Accomplished_Ad9530@reddit
Ah, cool, that matches my expectations well. Thanks for rerunning your tests.
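For intuition: MTP is essentially self-speculative decoding. Cheap extra heads draft a few tokens ahead, and one full forward pass verifies them all at once, so TG gets multiple tokens per expensive pass, while PP is already a single batched pass over the prompt and is untouched. A toy sketch of the loop, with made-up stand-ins rather than anything from Ollama or Gemma:

```python
import random

random.seed(0)
VOCAB = list("abcdefgh")

def big_model(ctx):
    # Stand-in for one expensive full forward pass: deterministic next token.
    return VOCAB[(len(ctx) * 3) % len(VOCAB)]

def draft_head(ctx):
    # Stand-in for a cheap MTP draft head that agrees ~80% of the time.
    return big_model(ctx) if random.random() < 0.8 else random.choice(VOCAB)

def generate(prompt, n_new, k=4):
    ctx, big_passes = list(prompt), 0
    while len(ctx) < len(prompt) + n_new:
        drafts = []
        for _ in range(k):                    # draft k tokens cheaply
            drafts.append(draft_head(ctx + drafts))
        big_passes += 1                       # one big pass verifies all k drafts
        for d in drafts:
            if d == big_model(ctx):
                ctx.append(d)                 # accept the matching prefix...
            else:
                ctx.append(big_model(ctx))    # ...on mismatch, keep the verified token
                break
    return "".join(ctx[:len(prompt) + n_new]), big_passes

_, passes = generate("abc", 40)
print(f"40 tokens in {passes} big passes (~{40 / passes:.1f} tok/pass)")
```

The accepted-tokens-per-pass ratio (minus draft overhead) is roughly the TG speedup, which lines up with the ~1.9x you measured, and the prompt gets one batched pass either way.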
chimph@reddit (OP)
Oh, probably my bad: I ran the new test in the same context. Let me test properly in a bit.
FrozenFishEnjoyer@reddit
This must be an M5 Max, right?
Also what's the quant here?
chimph@reddit (OP)
Yes, M5 Max. The model is unquantised. I've edited the post with new findings.