I still see a slight decrease in prefill (pp) on an RTX5090, but it's not terrible.
For 30k tokens prefill + 5k token generation I'm getting:
Average TPS: 98 (vs 52 with no MTP)
Average prefill: 2150 (vs 2600 with no MTP)
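For what it's worth, those figures can be plugged into a rough end-to-end estimate. This is only a sketch: it assumes total time is just prefill time plus generation time with no other overhead, and the function name is made up for illustration.

```python
def total_seconds(prompt_tokens, gen_tokens, prefill_tps, gen_tps):
    """Rough wall-clock estimate: prefill time plus generation time."""
    return prompt_tokens / prefill_tps + gen_tokens / gen_tps

# Figures quoted above: 30k prompt tokens, 5k generated tokens.
with_mtp = total_seconds(30_000, 5_000, prefill_tps=2150, gen_tps=98)
without_mtp = total_seconds(30_000, 5_000, prefill_tps=2600, gen_tps=52)

print(f"with MTP:    {with_mtp:.1f} s")
print(f"without MTP: {without_mtp:.1f} s")
print(f"speedup:     {without_mtp / with_mtp:.2f}x")
```

Under that simple model the slower prefill costs about 2.4 s, while the faster generation saves about 45 s on this workload.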
Seeing the same on a Strix Halo. Went from 40 t/s to ~53 t/s with the qwen3.6 35b-a3b q8.
Weirdly I get just a very mild improvement with the Q6... From ~53 t/s to ~57 t/s.
I saw a graph somewhere on this subreddit: it doesn't work so well for MoEs and is much better for dense models.
Your situation was what was shown on the graph as well.
I am on AMD + Vulkan too (9070 XT). My TG has dropped from 60+ to the 45-52 range. But PP no longer takes a hit and is noticeably faster.
(Could be the slight variances in my workflow 😅)
Yes it is, but claims of a 2x improvement appear to be exaggerated. Also, adding ngram mod with the MTP draft decreased TG for me. The best speed, 13 to 17 tps, is achieved with an n max value of 2. BTW I used the unsloth iq4_nl quant.
If your model has MTP layers, this lets llama.cpp use them for speculative decoding. You could expect a speedup of around 50 to 80% in tokens generated per second. This is probably the biggest speedup we'll see in llama.cpp for token generation until Eagle3 or DFlash become available.
I don't know about others' experience, but in my case, I tried vllm dflash and at first it looked great, but once real work has to be done, it is just slow as shit.
My experience with MTP in llama.cpp has been awesome tbh
> This particular implementation originally made prompt processing slower, but hopefully they've since fixed that issue.
I'm seeing around a halving of PP speed, from 1100 to 550 on 3090 and 250 to 170 on P40.
Also seeing context drop from ~150k to ~110k (still finding out where the limits are exactly, it oom's now and then).
At least vision works now.
I was thinking the same. Using Kilo in Code mode, the average initial prompt is 14k tokens. If token generation is 2x (e.g. 30 to 60 t/s) but prefill goes from 1500 to 700 t/s, I can't see it as an advantage.
The strange thing about the whole LLM community is the constant search for faster token generation, but in my opinion prompt processing speed is as important as token generation (if not more), especially for coding agents, since the system prompt is usually huge.
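To put a number on that trade-off, here is a small break-even sketch using the example speeds above. It assumes total time is prompt/pp plus output/tg and that nothing is cached; the function name is hypothetical.

```python
def break_even_output_tokens(prompt_tokens, pp_base, pp_mtp, tg_base, tg_mtp):
    """Output length at which faster generation pays back slower prefill.

    Assumes total time = prompt / pp + output / tg, with no prompt caching.
    """
    extra_prefill = prompt_tokens / pp_mtp - prompt_tokens / pp_base
    saved_per_token = 1 / tg_base - 1 / tg_mtp
    return extra_prefill / saved_per_token

# Example figures from the comment: 14k prompt, 1500 -> 700 t/s PP, 30 -> 60 t/s TG.
n = break_even_output_tokens(14_000, pp_base=1500, pp_mtp=700, tg_base=30, tg_mtp=60)
print(f"break-even at about {n:.0f} output tokens")
```

With those example numbers, a reply shorter than roughly 640 tokens comes out slower overall and a longer one comes out faster.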
I think it helps that the system prompt can often be cached across sessions, so if you are just using one agent harness throughout the day with the same system prompt and tools, you might only pay that prompt processing once, whereas you always pay the token generation cost for every single turn.
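A rough way to see the caching effect over a whole session, as a sketch only: the turn count and per-turn token counts below are made-up illustration values, the speeds are the example figures from the parent comment, and the model assumes earlier turns stay in the KV cache.

```python
def session_seconds(system_tokens, turns, new_prompt_per_turn, output_per_turn,
                    pp_tps, tg_tps, cache_system_prompt=True):
    """Total time over a multi-turn session.

    With caching, the system prompt is processed once; without it, every turn.
    Only the new prompt tokens of each turn are prefilled on top of that.
    """
    system_passes = 1 if cache_system_prompt else turns
    prefill = (system_passes * system_tokens + turns * new_prompt_per_turn) / pp_tps
    generation = turns * output_per_turn / tg_tps
    return prefill + generation

# Hypothetical session: 20 turns, 500 new prompt tokens and 400 output tokens each.
base = session_seconds(14_000, 20, 500, 400, pp_tps=1500, tg_tps=30)
mtp = session_seconds(14_000, 20, 500, 400, pp_tps=700, tg_tps=60)
print(f"no MTP: {base:.0f} s, MTP: {mtp:.0f} s")
```

In this made-up session generation dominates the total, so halving prefill speed matters much less than doubling generation speed. A workload with huge uncached prompts and short replies would tip the other way.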
It should give you a speedup in any case where the decoding is memory bandwidth constrained, but I have not tested CPU decoding with it myself.
I would expect on some slower machines that CPU decoding might be CPU compute bound.
Partial offloading makes that even messier to talk about abstractly. I am sure token gen benchmarks will be flooding the usual channels soon.
It's been put on hold pending the merge of this PR. It will now be merged on top of the standard MTP support. Georgi has been keeping the branch up to date with the changes to facilitate quick adoption.
I'm hoping it might have been waiting for the same speculative parallel refactoring work to merge as this MTP one was. Hopefully we'll see it soon-ish. The DFlash one didn't seem nearly as close as the Eagle3 work.
> Has there been a decline in quality?
If there was, it would be considered a bug. MTP doesn't require sacrificing any inference quality.
MTP models are marginally larger than non MTP models and will use more VRAM at runtime.
I tested the new MTP feature on Qwen 3.6 35B and 27B. Generation speed is definitely faster, but prompt processing speed dropped by about 2.5x in my case (from 6500 t/s down to 2000 t/s). Also, the `-fit` argument seems to have stopped working — it looks like it doesn’t recognize MTP at all. On longer contexts, I also ran into a “CUDA error: out of memory.” Hopefully these are all things that can be fixed.
Similar to my experience. Drop in PP (albeit not quite as dramatic), -fit not understanding anything, and oom crashes on longer contexts. That's despite cutting context length by one third.
Tokens are typically generated one at a time, which involves lots of reading from memory, hence slow.
MTP tries to generate multiple tokens at a time by "guessing" the next few tokens with draft layers. If guesses are correct, massive speed up; otherwise, the compute spent on guessing is wasted.
If your next tokens often vary a lot (like in creative writing), speed up is then small. But if previously generated tokens are likely to appear again (like code refactoring, for example), then speed up is bigger.
To me, this feels **a bit** like how branch prediction works in microchips.
Hope this helps!
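The "guessing" explanation above can be made a bit more concrete with a toy model. This is a simplified sketch, not how llama.cpp computes anything: it assumes each drafted token is accepted independently with the same probability and that drafting stops at the first wrong guess. The function and parameter names are made up.

```python
def expected_tokens_per_step(accept_prob, draft_len):
    """Expected tokens produced per verification pass.

    Toy model: each drafted token is accepted independently with probability
    accept_prob, acceptance stops at the first rejection, and the verification
    pass always contributes one token of its own.
    Equals 1 + p + p^2 + ... + p^draft_len.
    """
    if accept_prob >= 1.0:
        return float(draft_len + 1)
    return (1 - accept_prob ** (draft_len + 1)) / (1 - accept_prob)

# Low acceptance (varied text) vs high acceptance (repetitive text like refactoring).
for p in (0.3, 0.6, 0.9):
    print(f"accept={p}: {expected_tokens_per_step(p, draft_len=3):.2f} tokens per pass")
```

The actual speedup is lower than tokens per pass suggests, since running the draft layers and verifying the guesses is not free.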
There's a speedup in generation but a drop in prompt prefill, probably due to the fact that another small model is loaded (it's bundled in the GGUF) and it has to be loaded into VRAM, same as the KV cache.
Can I use this with LM Studio yet? I had the newest llama.cpp runtimes available to download in LM Studio and got them, but now I'm not sure if there is a compatible MTP GGUF available yet. Anyone got it working?
LM Studio isn't bleeding edge; in the past I've waited days to weeks for the newest thing to be supported. You can check the runtime drivers to see what version LM Studio is using.
LM Studio beta has that new feature you can enable in developer settings to always use the latest llama.cpp. I don't know how to test it as it seems counter-intuitive and I cba, but just FYI
The pull request has a checked box next to "Support separate GGUF for `mtp`", so I'd say you can download the MTP layers as their own GGUF. I'm going to have to try that, because I don't want two copies of almost the same file, one for agentic coding and one for batch processing.
[https://github.com/ggml-org/llama.cpp/pull/22673#issuecomment-4456979078](https://github.com/ggml-org/llama.cpp/pull/22673#issuecomment-4456979078)
> yes it can be loaded separately using `--spec-draft-model`. The `convert_hf_to_gguf.py` changes have an option of `--mtp` which just outputs the MTP gguf. Using the "grafted" on MTP is more VRAM efficient though. Another thing is that `-hf` option will try to look for the MTP gguf like it does for `mmproj` in case `spec-draft-type draft-mtp` is mentioned.
But it doesn't work with [https://huggingface.co/IHaveNoClueAndIMustPost/Qwen3.6-27b-MTP-TENSORS-ONLY](https://huggingface.co/IHaveNoClueAndIMustPost/Qwen3.6-27b-MTP-TENSORS-ONLY), probably because that's missing some GGUF metadata. So I tried `convert_hf_to_gguf.py --remote --mtp --outtype q8_0 Qwen/Qwen3.6-27B`, and that required admin privileges to make a symlink on Windows. The file it produced was 2.94 GB, compared to 430 MB for the above, and yeah, it uses a few GB extra VRAM.
llmfan already has uncensored MTP gguf versions of
* Qwen3.6-35B-A3B
* Qwen3.6-27B
...here if that's what you're asking for...
https://huggingface.co/llmfan46/models?search=gguf
I haven't tested them yet so I don't know if they need tweaking to work with this official release of MTP support in llama.cpp
Moar tokens? Why yes please!!
Thanks to all the hard-working developers on the llama.cpp team and of course the 1000's of researchers that keep finding new ways of improving things!!
I tested it with chain of speculators ngram-mod just before the merge. 75 tok/s with Q5_K_M Qwen3.6 27B on a 61k-token input and 5000-token output on an RTX 5090. vLLM still wins with 105 tok/s, sadly.
I'll retest now after the merge
101 Comments
rm-rf-rm@reddit
Outside_Reindeer_713@reddit
luckyj@reddit
GlobalLadder9461@reddit
EugenePopcorn@reddit
RnRau@reddit
Combinatorilliance@reddit
Valuable_Touch5670@reddit (OP)
RnRau@reddit
Valuable_Touch5670@reddit (OP)
taking_bullet@reddit
GlobalLadder9461@reddit
DeSibyl@reddit
wllmsaccnt@reddit
robertpro01@reddit
adssidhu86@reddit
robertpro01@reddit
TheTerrasque@reddit
kamtar@reddit
Pleasant-Shallot-707@reddit
Material_Tone_6855@reddit
wektor420@reddit
wllmsaccnt@reddit
Material_Tone_6855@reddit
wllmsaccnt@reddit
dbzunicorn@reddit
Caffdy@reddit
wllmsaccnt@reddit
ilintar@reddit
spaceman_@reddit
ilintar@reddit
wllmsaccnt@reddit
Antique_Dot_5513@reddit
wllmsaccnt@reddit
maximus_reborn@reddit
tempedbyfate@reddit
LagOps91@reddit
soldture@reddit
Shoddy_Bed3240@reddit
TheTerrasque@reddit
pjdonovan@reddit
urarthur@reddit
DrAlexander@reddit
Material_Tone_6855@reddit
Odd-Environment-7193@reddit
Valuable_Touch5670@reddit (OP)
ceo_of_banana@reddit
Valuable_Touch5670@reddit (OP)
ceo_of_banana@reddit
Material_Tone_6855@reddit
Fringolicious@reddit
lemondrops9@reddit
Dany0@reddit
Fringolicious@reddit
No_Algae1753@reddit
CalligrapherFar7833@reddit
MotokoAGI@reddit
SkyFeistyLlama8@reddit
DoorStuckSickDuck@reddit
314kabinet@reddit
Material_Tone_6855@reddit
No_Algae1753@reddit
van-dame@reddit
Miserable-Dare5090@reddit
No_Algae1753@reddit
rngesius@reddit
Address-Street@reddit
Consumerbot37427@reddit
crapaud_dindon@reddit
TheWaffleKingg@reddit
SkyFeistyLlama8@reddit
Odd-Ordinary-5922@reddit
DeProgrammer99@reddit
SkyFeistyLlama8@reddit
ilintar@reddit
Ambitious_Fold_2874@reddit
ilintar@reddit
coder543@reddit
Ambitious_Fold_2874@reddit
freehuntx@reddit
Zc5Gwu@reddit
Goldandsilverape99@reddit
SmoothCCriminal@reddit
shapic@reddit
UmpireBorn3719@reddit
oxygen_addiction@reddit
fragment_me@reddit
Odd-Ordinary-5922@reddit
Force88@reddit
Antop90@reddit
anykeyh@reddit
TurnOffAutoCorrect@reddit
RnRau@reddit
TurnOffAutoCorrect@reddit
Dany0@reddit
imp_12189@reddit
Xonzo@reddit
wgaca2@reddit
ghulamalchik@reddit
1FNn4@reddit
LosEagle@reddit