As MTP prepares to land in llama.cpp, models that support MTP
Posted by segmond@reddit | LocalLLaMA | 28 comments
DeepSeekv3 OG
DeepSeekv3.2/4
Qwen3.5
GLM4.5+
MiniMax2.5+
Step3.5Flash
Mimo v2+
Until we get GGUFs with MTP weights, you need to download the HF weights and convert them to GGUF yourself. I think I'm going to try either Qwen3.5-122B or GLM4.5-Air first.
oxygen_addiction@reddit
Real shame that StepFun seems to have turned into a closed lab. Their updated Step3.5 and Step Image Edit 2 have not been open-weighted and they do not reply to any messages asking about these, so it's clear they've pivoted.
mintybadgerme@reddit
When will Qwen3.6 27B GGUFs with MTP be available? Or is that not a thing?
_wOvAN_@reddit
Would be better to have a stable tensor split mode.
330d@reddit
Gemma4 no?
kuhunaxeyive@reddit
If I remember correctly what I read elsewhere:
Google trained MTP into Gemma-4-31, but removed it before publishing. Google doesn't want us to have the speed-up for their Gemma-4 model.
GrungeWerX@reddit
How long before it comes to LM Studio? And do we have to re-download our quants? Or do they have to be re-quanted in case MTP was removed? Not sure how the Unsloth UD quants handled that...
suprjami@reddit
All pre-existing quants stripped MTP, so a re-quant and re-download will be necessary.
GrungeWerX@reddit
Sounds good. Can't wait. 😄
El_90@reddit
But do we need to wait for Vulkan support?
Beginning-Window-115@reddit
And Metal support.
One-Replacement-37@reddit
Qwen3.6 … ? 😂
segmond@reddit (OP)
I obviously missed adding the + after 3.5
With that said, MiniMax says they support MTP:
https://www.minimax.io/news/forge-scalable-agent-rl-framework-and-algorithm
"MTP-based Speculative Decoding: Instead of static draft models, we use Multi-Token Prediction (MTP) heads continuously fine-tuned via Top-K KL loss. This ensures alignment with the evolving RL policy, sustaining high acceptance rates and significant speedup by mitigating distribution shifts."
One-Replacement-37@reddit
Just because their closed model may have MTP doesn't mean it's available for anyone to use.
MiniMax themselves were very clear on their M2.5 model that MTP was not available, and very clear in reply to my own question that they are NOT open-sourcing any MTP layers.
DeltaSqueezer@reddit
Withholding MTP weights could be one way to differentiate their own offerings from other hosted providers, I guess.
MarkoMarjamaa@reddit
From what I understand (without fully grasping the details), you can post-train (LoRA) an LLM to achieve a ~2x speed-up?
https://arxiv.org/html/2603.23911v1
Ok_Warning2146@reddit
Well, this beta is only for Qwen3.5/6. Each architecture has its own MTP implementation, so it is not a once-and-for-all thing.
rerri@reddit
Are you sure these are supported yet?
Initially the PR only supported Qwen 3.5/3.6 27B and 35B MoE support was added later. So I'm thinking maybe support for the models you mention would also need to be added separately. Not sure.
Moscato359@reddit
What does this even mean?
streppelchen@reddit
Multi-token prediction: models take an educated guess at the next 1-n tokens based on their training, instead of executing the full chain for each one. With high acceptance rates, it can increase your decode (token generation) speed with no change required beyond having a compatible model.
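A toy sketch of the draft-and-verify idea (illustrative only, not llama.cpp's actual implementation; `speculative_step` and the toy verifier are made-up names for this example):

```python
# Toy sketch of MTP-style speculative decoding. The MTP head guesses
# several future tokens; the full model then checks each guess and the
# longest matching prefix is kept, so the final output is identical to
# normal decoding -- matching guesses just arrive faster.

def speculative_step(draft_tokens, verify_token):
    """Accept draft tokens until the first mismatch with the verifier.

    draft_tokens: list of token ids guessed by the draft/MTP head.
    verify_token: function(prefix) -> token id the full model would emit.
    Returns the accepted tokens; at the first mismatch the verifier's own
    token is used, so at least one token is always produced.
    """
    accepted = []
    for guess in draft_tokens:
        real = verify_token(accepted)
        if guess == real:
            accepted.append(guess)   # "free" token: guess was correct
        else:
            accepted.append(real)    # fall back to the verifier's token
            break
    return accepted

# Example: the verifier always emits the next even number; the draft
# guesses correctly twice, then diverges.
verifier = lambda prefix: 2 * len(prefix)
print(speculative_step([0, 2, 5, 6], verifier))  # -> [0, 2, 4]
```

A real implementation verifies all drafted positions in one batched forward pass, which is where the speed-up comes from; the loop above only shows the accept/reject logic.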
Noiselexer@reddit
Ow neat
Moscato359@reddit
Does that reduce intelligence in any way?
streppelchen@reddit
No, it uses the same quantization and verification pipeline
Formal-Exam-8767@reddit
A good analogy would be number factorization: finding factors is laborious, but verifying them is quite fast.
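To make the analogy concrete, a small sketch (illustrative only; `find_factors` and `verify_factors` are hypothetical names for this example):

```python
# Finding factors requires a search; verifying a proposed factorization
# is a single multiplication. Speculative decoding exploits the same
# asymmetry: generating a token is expensive, checking a guess is cheap.

def find_factors(n):
    """Slow: search for a nontrivial factor by trial division."""
    for d in range(2, int(n ** 0.5) + 1):
        if n % d == 0:
            return d, n // d
    return None  # n is prime

def verify_factors(n, d, q):
    """Fast: one multiplication confirms the proposed factors."""
    return d * q == n

n = 1009 * 1013                  # product of two primes
d, q = find_factors(n)           # laborious: ~1000 division tests
print(verify_factors(n, d, q))   # -> True, via a single multiply
```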
segmond@reddit (OP)
"Multi-token prediction (MTP) is a training objective for Large Language Models (LLMs) that predicts several future tokens simultaneously at each step, rather than just the single next token (NTP). This approach increases inference speed, boosts coding/math performance, and improves training efficiency." -
Oh yeah, let's say thanks to Meta for introducing this to us.
https://venturebeat.com/ai/metas-new-multi-token-prediction-makes-ai-models-up-to-3x-faster (non tech description)
https://github.com/ggml-org/llama.cpp/pull/22673 (pr change)
https://arxiv.org/pdf/2404.19737 (first article I know of from Meta researchers)
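A heavily simplified sketch of that training objective (hypothetical toy code, not Meta's actual implementation): instead of one next-token loss per position, each position gets one cross-entropy loss per future offset, each from its own prediction head.

```python
import math

# Toy MTP loss: sum cross-entropy over the next n_future tokens, where
# head k predicts the token k+1 steps ahead of the current position.

def mtp_loss(head_probs, tokens, pos, n_future):
    """head_probs[k][token] = probability head k assigns to the token
    k+1 steps ahead. Returns the summed cross-entropy at position pos."""
    loss = 0.0
    for k in range(n_future):
        target = tokens[pos + 1 + k]
        loss += -math.log(head_probs[k].get(target, 1e-9))
    return loss

# Two heads, both confident in the correct future tokens -> small loss.
heads = [{"cat": 0.9, "dog": 0.1}, {"sat": 0.8, "ran": 0.2}]
tokens = ["the", "cat", "sat"]
print(round(mtp_loss(heads, tokens, 0, 2), 3))  # -> 0.329
```

Standard next-token training is the `n_future = 1` special case; the extra heads are what get stripped from quants that don't carry MTP weights.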
Powerful_Ad8150@reddit
OCR use cases - is there any specialized model that supports it?
doradus_novae@reddit
Fire
GrungeWerX@reddit
Doesn't Qwen 3.6 support it as well?
One-Replacement-37@reddit
It does.