As MTP prepares to land in llama.cpp, models that support MTP
Posted by segmond@reddit | LocalLLaMA | 28 comments
DeepSeekv3 OG
DeepSeekv3.2/4
Qwen3.5
GLM4.5+
MiniMax2.5+
Step3.5Flash
Mimo v2+
Until we get GGUFs with MTP weights, you need to download the HF weights and convert them to GGUF yourself. I think I'm going to try either Qwen3.5-122B or GLM4.5-Air first.
oxygen_addiction@reddit
Real shame that StepFun seems to have turned into a closed lab. Their updated Step3.5 and Step Image Edit 2 have not been open-weighted and they do not reply to any messages asking about these, so it's clear they've pivoted.
mintybadgerme@reddit
When will Qwen3.6 27B GGUFs with MTP be available? Or is that not a thing?
_wOvAN_@reddit
Would be better to have a stable tensor split mode.
330d@reddit
Gemma4 no?
kuhunaxeyive@reddit
If I remember correctly what I read elsewhere:
Google trained MTP into Gemma-4-31, but removed it before publishing. Google doesn't want us to have the speed-up for their Gemma-4 model.
GrungeWerX@reddit
How long before it comes to LM Studio? And do we have to re-download our quants? Or do they have to be re-quanted in case MTP was removed? Not sure how the Unsloth UD quants handled that...
suprjami@reddit
All pre-existing quants stripped MTP, so a re-quant and re-download will be necessary.
GrungeWerX@reddit
Sounds good. Can't wait. 😄
El_90@reddit
But do we need to wait for Vulkan support?
Beginning-Window-115@reddit
And Metal support.
One-Replacement-37@reddit
Qwen3.6 … ? 😂
segmond@reddit (OP)
I obviously missed adding the + after 3.5
With that said, MiniMax says they support MTP:
https://www.minimax.io/news/forge-scalable-agent-rl-framework-and-algorithm
"MTP-based Speculative Decoding: Instead of static draft models, we use Multi-Token Prediction (MTP) heads continuously fine-tuned via Top-K KL loss. This ensures alignment with the evolving RL policy, sustaining high acceptance rates and significant speedup by mitigating distribution shifts."
One-Replacement-37@reddit
Just because their closed model may have MTP doesn't mean it's available for anyone to use.
MiniMax themselves were very clear on their M2.5 model that MTP was not available, and very clear in reply to my own question that they are NOT open-sourcing any MTP layers.
DeltaSqueezer@reddit
Withholding MTP weights could be one way to differentiate their own offerings from other hosted providers, I guess.
MarkoMarjamaa@reddit
From what I understand (without fully grasping the details), you can post-train (LoRA) an LLM to achieve a ~2x speed-up?
https://arxiv.org/html/2603.23911v1
Ok_Warning2146@reddit
Well, this beta is only for Qwen3.5/6. Each architecture has its own MTP implementation, so it is not a once-and-for-all thing.
rerri@reddit
Are you sure these are supported yet?
Initially the PR only supported Qwen 3.5/3.6 27B and 35B MoE support was added later. So I'm thinking maybe support for the models you mention would also need to be added separately. Not sure.
Moscato359@reddit
What does this even mean?
streppelchen@reddit
Multi-token prediction: models take an educated guess at the next 1-n tokens based on their training, instead of executing the full chain for each one. With high acceptance rates, it can increase your decode (token generation) speed with no change required beyond having a compatible model.
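A toy sketch of the draft-and-verify idea (illustrative only, not llama.cpp's actual implementation; `speculative_step` and the toy verifier are made-up names for this example):

```python
# Toy sketch of MTP-style speculative decoding. The MTP head guesses
# several future tokens; the full model then checks each guess and the
# longest matching prefix is kept, so the final output is identical to
# normal decoding -- matching guesses just arrive faster.

def speculative_step(draft_tokens, verify_token):
    """Accept draft tokens until the first mismatch with the verifier.

    draft_tokens: list of token ids guessed by the draft/MTP head.
    verify_token: function(prefix) -> token id the full model would emit.
    Returns the accepted tokens; at the first mismatch the verifier's own
    token is used, so at least one token is always produced.
    """
    accepted = []
    for guess in draft_tokens:
        real = verify_token(accepted)
        if guess == real:
            accepted.append(guess)   # "free" token: guess was correct
        else:
            accepted.append(real)    # fall back to the verifier's token
            break
    return accepted

# Example: the verifier always emits the next even number; the draft
# guesses correctly twice, then diverges.
verifier = lambda prefix: 2 * len(prefix)
print(speculative_step([0, 2, 5, 6], verifier))  # -> [0, 2, 4]
```

A real implementation verifies all drafted positions in one batched forward pass, which is where the speed-up comes from; the loop above only shows the accept/reject logic.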
Noiselexer@reddit
Ow neat
Moscato359@reddit
Does that reduce intelligence in any way?
streppelchen@reddit
No, it uses the same quantization and verification pipeline
Formal-Exam-8767@reddit
A good analogy would be number factorization: finding factors is laborious, but verifying them is quite fast.
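To make the analogy concrete, a small sketch (illustrative only; `find_factors` and `verify_factors` are hypothetical names for this example):

```python
# Finding factors requires a search; verifying a proposed factorization
# is a single multiplication. Speculative decoding exploits the same
# asymmetry: generating a token is expensive, checking a guess is cheap.

def find_factors(n):
    """Slow: search for a nontrivial factor by trial division."""
    for d in range(2, int(n ** 0.5) + 1):
        if n % d == 0:
            return d, n // d
    return None  # n is prime

def verify_factors(n, d, q):
    """Fast: one multiplication confirms the proposed factors."""
    return d * q == n

n = 1009 * 1013                  # product of two primes
d, q = find_factors(n)           # laborious: ~1000 division tests
print(verify_factors(n, d, q))   # -> True, via a single multiply
```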
segmond@reddit (OP)
"Multi-token prediction (MTP) is a training objective for Large Language Models (LLMs) that predicts several future tokens simultaneously at each step, rather than just the single next token (NTP). This approach increases inference speed, boosts coding/math performance, and improves training efficiency." -
Oh yeah, let's say thanks to Meta for introducing this to us.
https://venturebeat.com/ai/metas-new-multi-token-prediction-makes-ai-models-up-to-3x-faster (non tech description)
https://github.com/ggml-org/llama.cpp/pull/22673 (pr change)
https://arxiv.org/pdf/2404.19737 (first article I know of from Meta researchers)
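A heavily simplified sketch of that training objective (hypothetical toy code, not Meta's actual implementation): instead of one next-token loss per position, each position gets one cross-entropy loss per future offset, each from its own prediction head.

```python
import math

# Toy MTP loss: sum cross-entropy over the next n_future tokens, where
# head k predicts the token k+1 steps ahead of the current position.

def mtp_loss(head_probs, tokens, pos, n_future):
    """head_probs[k][token] = probability head k assigns to the token
    k+1 steps ahead. Returns the summed cross-entropy at position pos."""
    loss = 0.0
    for k in range(n_future):
        target = tokens[pos + 1 + k]
        loss += -math.log(head_probs[k].get(target, 1e-9))
    return loss

# Two heads, both confident in the correct future tokens -> small loss.
heads = [{"cat": 0.9, "dog": 0.1}, {"sat": 0.8, "ran": 0.2}]
tokens = ["the", "cat", "sat"]
print(round(mtp_loss(heads, tokens, 0, 2), 3))  # -> 0.329
```

Standard next-token training is the `n_future = 1` special case; the extra heads are what get stripped from quants that don't carry MTP weights.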
Powerful_Ad8150@reddit
OCR use cases - is there any specialized model that supports it?
doradus_novae@reddit
Fire
GrungeWerX@reddit
Doesn't Qwen 3.6 support it as well?
One-Replacement-37@reddit
It does.