MTP PR Merged!!!

Posted by Valuable_Touch5670@reddit | LocalLLaMA | View on Reddit | 101 comments

Llamas, LFG!!! 🎉🎉🎉


rm-rf-rm@reddit

2 threads on the same topic, locking this smaller one. Please use this: https://old.reddit.com/r/LocalLLaMA/comments/1teqnf2/thats_a_good_news/
View on Reddit #86183324

Outside_Reindeer_713@reddit

Suddenly the RTX 3060 is a beast.
View on Reddit #86182724

luckyj@reddit

I still see a slight decrease in prefill (pp) on an RTX 5090, but it's not terrible. For 30k tokens of prefill + 5k tokens of generation I'm getting an average TPS of 98 (vs 52 with no MTP) and an average prefill of 2150 (vs 2600 with no MTP).
View on Reddit #86182061
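
A quick back-of-the-envelope check of what those numbers mean end to end. This is only a sketch: it assumes "Average TPS" is generation speed in tokens per second, that prefill and generation simply add up, and that the speeds stay constant over the whole run. The function name is made up for illustration.

```python
def total_seconds(prompt_tokens, gen_tokens, prefill_tps, gen_tps):
    """Rough wall-clock estimate: prefill time plus generation time."""
    return prompt_tokens / prefill_tps + gen_tokens / gen_tps


# Figures quoted in the comment above: 30k prefill + 5k generation.
baseline = total_seconds(30_000, 5_000, prefill_tps=2600, gen_tps=52)
with_mtp = total_seconds(30_000, 5_000, prefill_tps=2150, gen_tps=98)

print(f"no MTP: {baseline:.1f} s, MTP: {with_mtp:.1f} s, "
      f"speedup: {baseline / with_mtp:.2f}x")
```

Under those assumptions the slower prefill costs a couple of seconds while the faster generation saves tens of seconds, so this particular workload still comes out well ahead with MTP.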

GlobalLadder9461@reddit

On the Vulkan backend on an AMD APU, I am observing a maximum 30% increase. What are the results from other Vulkan folks?
View on Reddit #86164318

EugenePopcorn@reddit

27B on MI60 goes from 25 to 45 tok/s with MTP, but only with simpler quants like Q4_1. Q4_K_XL runs closer to 35 tok/s.
View on Reddit #86181467

RnRau@reddit

Seeing the same on a Strix Halo. Went from 40 t/s to ~53 t/s with the qwen3.6 35b-a3b q8. Weirdly I get just a very mild improvement with the Q6... From ~53 t/s to ~57 t/s.
View on Reddit #86167161

Combinatorilliance@reddit

I saw a graph somewhere on this subreddit: it doesn't work so well for MoEs and is much better for dense models. Your situation matches what the graph showed.
View on Reddit #86174914

Valuable_Touch5670@reddit (OP)

I am on AMD + Vulkan too (9070 XT). My TG has dropped from 60+ to the 45-52 range. But PP no longer takes a hit and is noticeably faster. (Could be the slight variances in my workflow 😅)
View on Reddit #86165225

RnRau@reddit

I had the same when I put --spec-draft-n-max to values higher than 2.
View on Reddit #86167262

Valuable_Touch5670@reddit (OP)

Sadly I was already setting it to 2 :(
View on Reddit #86168485

taking_bullet@reddit

I'm also using Vulkan, but with RTX 5070 TI & RX 9070 combined. Gonna test it later, but even 30% speed increase is a nice thing. 
View on Reddit #86165180

GlobalLadder9461@reddit

Yes it is, but claims of a 2x improvement appear to be exaggerated. Also, adding ngram mod on top of the MTP draft decreased TG for me. The best speed (13 to 17 tps) is achieved with an n-max value of 2. BTW I used the unsloth iq4_nl quant.
View on Reddit #86168198
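
Why a small `--spec-draft-n-max` can beat a large one is easier to see with a toy model. The sketch below uses the standard expected-accepted-tokens formula from the speculative decoding literature, assuming each drafted token is accepted independently with the same probability. The acceptance rate and draft cost are made-up numbers, not measurements from llama.cpp, and the model ignores the extra cost of verifying a longer batch.

```python
def expected_tokens_per_pass(accept_rate, n_draft):
    """Expected tokens emitted per verification pass.

    Assumes every drafted token is accepted independently with probability
    accept_rate and that acceptance stops at the first rejection. Each pass
    always yields one token from the main model, hence n_draft + 1.
    """
    if accept_rate >= 1.0:
        return n_draft + 1
    return (1 - accept_rate ** (n_draft + 1)) / (1 - accept_rate)


def estimated_speedup(accept_rate, n_draft, draft_cost):
    """Estimated speedup over plain decoding.

    draft_cost is the cost of one draft step relative to one main-model
    pass (a hypothetical knob for illustration).
    """
    tokens = expected_tokens_per_pass(accept_rate, n_draft)
    return tokens / (1 + n_draft * draft_cost)


# Made-up parameters: 60% acceptance, each draft step costs 25% of a pass.
for n in (1, 2, 3, 4):
    print(n, round(estimated_speedup(accept_rate=0.6, n_draft=n, draft_cost=0.25), 3))
```

With these illustrative parameters the estimate peaks at a draft length of 2 and falls off after that, because later guesses are rarely reached but always paid for. Higher acceptance rates or cheaper drafts push the sweet spot further out.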

DeSibyl@reddit

Does this have the fix to allow vision to work with it?
View on Reddit #86181390

wllmsaccnt@reddit

If your model has MTP layers, this lets llama.cpp use them for speculative decoding. You could expect a speedup of around 50 to 80% in tokens generated per second. This is probably the biggest speedup we'll see in llama.cpp for token generation until Eagle3 or DFlash become available.
View on Reddit #86162106

robertpro01@reddit

I don't know about others' experience, but in my case I tried vLLM DFlash and at first it looked great; once real work had to be done, it was just slow as shit. My experience with MTP in llama.cpp has been awesome tbh.
View on Reddit #86164179

adssidhu86@reddit

Still curious: where is it super useful? Refactoring projects?
View on Reddit #86180264

robertpro01@reddit

I have no idea, but I'm sticking to llama.cpp now
View on Reddit #86180907

TheTerrasque@reddit

> This particular implementation originally made prompt processing slower, but hopefully they've since fixed that issue.

I'm seeing around a halving of PP speed, from 1100 to 550 on a 3090 and 250 to 170 on a P40. Also seeing context drop from ~150k to ~110k (still finding out where the limits are exactly; it OOMs now and then). At least vision works now.
View on Reddit #86175526

kamtar@reddit

Is there something in sight for prompt processing? 
View on Reddit #86171822

Pleasant-Shallot-707@reddit

Watch what the effects are on PP for your setup. MTP might not be helpful for total time in all workloads.
View on Reddit #86162304

Material_Tone_6855@reddit

I was thinking the same. Using Kilo in Code mode, the average initial prompt is 14k tokens. If token generation is 2x (e.g. 30 to 60 t/s) but prefill drops from 1500 to 700 t/s, I can't see it as an advantage. The strange thing about the whole LLM community is the constant search for faster token generation, but in my opinion prompt processing speed is just as important as token generation (if not more so), especially for coding agents, since the system prompt is usually huge.
View on Reddit #86164106
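
For anyone weighing that trade-off, the break-even point can be worked out directly. A minimal sketch, assuming total time is just prefill plus generation at constant speeds and nothing is cached; the function name is invented for illustration and the numbers are the hypothetical ones from the comment above.

```python
def break_even_gen_tokens(prompt_tokens, pp_base, pp_mtp, tg_base, tg_mtp):
    """Generated tokens at which MTP and non-MTP runs take equal total time.

    Solves P/pp_base + N/tg_base == P/pp_mtp + N/tg_mtp for N.
    """
    extra_prefill = prompt_tokens / pp_mtp - prompt_tokens / pp_base
    saved_per_token = 1 / tg_base - 1 / tg_mtp
    if saved_per_token <= 0:
        raise ValueError("MTP must generate faster than the baseline")
    return extra_prefill / saved_per_token


# 14k-token prompt, prefill 1500 -> 700 t/s, generation 30 -> 60 t/s.
n = break_even_gen_tokens(14_000, pp_base=1500, pp_mtp=700, tg_base=30, tg_mtp=60)
print(f"MTP wins once a turn generates more than about {n:.0f} tokens")
```

So with those example speeds, short replies lose out to the slower prefill while longer generations come out ahead. Plug in your own measured speeds before drawing conclusions.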

wektor420@reddit

If you have a large static system prompt, then cache it permanently...
View on Reddit #86170545

wllmsaccnt@reddit

I think it helps that the system prompt can often be cached across sessions, so if you are just using one agent harness throughout the day with the same system prompt and tools, you might only pay that prompt processing once, whereas you always pay the token generation cost for every single turn.
View on Reddit #86165158

Material_Tone_6855@reddit

That's true. In fact, I'm still trying to understand how the total workload is split between prompt processing and token generation.
View on Reddit #86165279

wllmsaccnt@reddit

Good reminder. I added a blurb with a warning to my comment.
View on Reddit #86162781

dbzunicorn@reddit

prompt processing is the biggest gap between apple silicon and nvidia
View on Reddit #86170415

Caffdy@reddit

This only gets you faster if the model fits in VRAM, right? What about 122B+ sized models with CPU offloading?
View on Reddit #86168759

wllmsaccnt@reddit

It should give you a speedup in any case where the decoding is memory bandwidth constrained, but I have not tested CPU decoding with it myself. I would expect on some slower machines that CPU decoding might be CPU compute bound. Partial offloading makes that even messier to talk about abstractly. I am sure token gen benchmarks will be flooding the usual channels soon.
View on Reddit #86170102

ilintar@reddit

No, you can actually use it with CPU offloading and people have reported decent gains there.
View on Reddit #86169720

spaceman_@reddit

What's going on with the Eagle3 PR these days?
View on Reddit #86167229

ilintar@reddit

It's been put on hold pending the merge of this PR. It will now be merged on top of the standard MTP support. Georgi has been keeping the branch up to date with the changes to facilitate quick adoption.
View on Reddit #86169686

wllmsaccnt@reddit

I'm hoping it might have been waiting for the same speculative parallel refactoring work to merge as this MTP one was. Hopefully we'll see it soon-ish. The DFlash one didn't seem nearly as close as the Eagle3 work.
View on Reddit #86167593

Antique_Dot_5513@reddit

In terms of quality, is there any drop?
View on Reddit #86164474

wllmsaccnt@reddit

> Has there been a decline in quality?

If there was, it would be considered a bug. MTP doesn't require sacrificing any inference quality. MTP models are marginally larger than non-MTP models and will use more VRAM at runtime.
View on Reddit #86165043

maximus_reborn@reddit

has anyone tried it with omlx? any gains as compared to mlx models with no mtp?
View on Reddit #86177124

tempedbyfate@reddit

There's like 5 posts on r/LocalLLaMA for MTP branch being merged, never seen so much enthusiasm over a PR.
View on Reddit #86164721

LagOps91@reddit

I was promised this PR 3000 years ago!
View on Reddit #86176282

soldture@reddit

It's a great achievement!
View on Reddit #86166299

Shoddy_Bed3240@reddit

I tested the new MTP feature on Qwen 3.6 35B and 27B. Generation speed is definitely faster, but prompt processing speed dropped by about 2.5x in my case (from 6500 t/s down to 2000 t/s). Also, the `-fit` argument seems to have stopped working — it looks like it doesn’t recognize MTP at all. On longer contexts, I also ran into a “CUDA error: out of memory.” Hopefully these are all things that can be fixed.
View on Reddit #86173399

TheTerrasque@reddit

Similar to my experience. Drop in PP (albeit not quite as dramatic), -fit not understanding anything, and oom crashes on longer contexts. That's despite cutting context length by one third.
View on Reddit #86176009

pjdonovan@reddit

this speeds up token generation, right?
View on Reddit #86160392

urarthur@reddit

Yes, for those GGUFs that support it.
View on Reddit #86160676

DrAlexander@reddit

So new GGUFs need to be released with MTP support? Are they larger than the non-MTP ones?
View on Reddit #86161287

Material_Tone_6855@reddit

They're already released; search for unsloth qwen 3.6 MTP. The MTP model is bundled into the GGUF. I'm using the Q4_K_XL and the size is 22 GB.
View on Reddit #86163906

Odd-Environment-7193@reddit

How’s the performance?
View on Reddit #86175459

Valuable_Touch5670@reddit (OP)

Yes, but depends on your work type. It works best for coding.
View on Reddit #86160603

ceo_of_banana@reddit

Could you elaborate? Does it enable token prediction for multiple instances or multiple tokens of the same prompt?
View on Reddit #86172852

Valuable_Touch5670@reddit (OP)

Tokens are typically generated one at a time, which involves lots of reading from memory, hence slow. MTP tries to generate multiple tokens at a time by "guessing" the next few tokens with draft layers. If guesses are correct, massive speed up; otherwise, the compute spent on guessing is wasted. If your next tokens often vary a lot (like in creative writing), speed up is then small. But if previously generated tokens are likely to appear again (like code refactoring, for example), then speed up is bigger. To me, this feels **a bit** like how branch prediction works in microchips. Hope this helps!
View on Reddit #86174471
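
The guess-then-check idea described above can be shown with a toy loop. This is a simplified sketch of greedy draft-and-verify, not llama.cpp's implementation: the two "models" are stand-in Python functions invented for illustration, and a real engine verifies all drafted positions in a single batched pass instead of calling the main model in a loop.

```python
def speculative_step(context, draft_next, target_next, n_draft):
    """One greedy draft-and-verify step; returns the tokens to append."""
    # 1. The cheap draft head guesses n_draft tokens ahead.
    drafted = []
    for _ in range(n_draft):
        drafted.append(draft_next(context + drafted))

    # 2. The main model checks the guesses and keeps them while they match.
    accepted = []
    for guess in drafted:
        correct = target_next(context + accepted)
        if guess != correct:
            accepted.append(correct)  # replace the first wrong guess and stop
            return accepted
        accepted.append(guess)

    # 3. Every guess matched, so the main model's next token is a bonus.
    accepted.append(target_next(context + accepted))
    return accepted


# Toy "models": the target repeats a fixed pattern; the draft is wrong
# whenever the context length is a multiple of 4.
PATTERN = ["def", "f", "(", "x", ")", ":"]


def target_next(tokens):
    return PATTERN[len(tokens) % len(PATTERN)]


def draft_next(tokens):
    return "?" if len(tokens) % 4 == 0 else target_next(tokens)


out, passes = [], 0
while len(out) < 12:
    out += speculative_step(out, draft_next, target_next, n_draft=2)
    passes += 1

# Plain one-token-at-a-time decoding for comparison.
plain = []
while len(plain) < len(out):
    plain.append(target_next(plain))

print(f"{len(out)} tokens in {passes} verification passes")
print("identical to plain decoding:", out == plain)
```

The point of the comparison at the end is that a wrong guess only costs time: the main model always has the final say on every token, so the output matches plain decoding and only the number of main-model passes changes.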

ceo_of_banana@reddit

thanks a lot!
View on Reddit #86175284

Material_Tone_6855@reddit

There's a speedup in generation but a drop in prompt prefill, probably due to the fact that another small model is loaded (it's bundled in the GGUF) and it needs to be loaded into VRAM, as does its KV cache.
View on Reddit #86161170

Fringolicious@reddit

Can I use this with lmstudio yet? I had the newest llama cpp runtimes available to download in lmstudio, got them, but now I'm not sure if there is a compatible mtp gguf available yet, anyone got it working yet?
View on Reddit #86163683

lemondrops9@reddit

LM Studio isn't bleeding edge; in the past I've waited days to weeks for the newest thing to be supported. You can check the runtime drivers to see what version LM Studio is using.
View on Reddit #86174803

Dany0@reddit

LM Studio beta has that new feature you can enable in developer settings to always use the latest llama.cpp. I don't know how to test it as it seems counter-intuitive and I cba, but just FYI.
View on Reddit #86164011

Fringolicious@reddit

Thanks, I'll check it out
View on Reddit #86164551

No_Algae1753@reddit

Have they fixed slow pp ?
View on Reddit #86160863

CalligrapherFar7833@reddit

Small pp
View on Reddit #86172421

MotokoAGI@reddit

900 tk/s for PP on 3090s. Is that slow or fast for you? For Qwen3.6-35b
View on Reddit #86163600

SkyFeistyLlama8@reddit

I'm happy jumping from 15 t/s to 28 t/s on that model and you're past lightspeed, you're on ludicrous speed!
View on Reddit #86169497

DoorStuckSickDuck@reddit

That's what I get right now on strix halo without it, which is worrying.
View on Reddit #86165130

314kabinet@reddit

Slow. That’s over 30 seconds to first token in my coding setup.
View on Reddit #86164159

Material_Tone_6855@reddit

You always have to check it on your own machine; you can't compare it with other people's hardware.
View on Reddit #86163946

No_Algae1753@reddit

Depends if it was previously slower or faster
View on Reddit #86163838

van-dame@reddit

That sounds like a medical issue
View on Reddit #86160982

Miserable-Dare5090@reddit

peyronie’s disease… bentcarrot.con
View on Reddit #86162225

No_Algae1753@reddit

😭😭😭😭😭
View on Reddit #86161516

rngesius@reddit

Yes
View on Reddit #86160933

Address-Street@reddit

Hope they’ll add support for Gemma soon.
View on Reddit #86171944

Consumerbot37427@reddit

Does anybody know if we need to download special MTP-enabled GGUFs?
View on Reddit #86162399

crapaud_dindon@reddit

Yes, previous GGUFs had the MTP layers removed.
View on Reddit #86162893

TheWaffleKingg@reddit

And the MTP GGUFs are larger; iirc 27b q8 is like 3 gigs bigger.
View on Reddit #86171613

SkyFeistyLlama8@reddit

So I need to download all Qwen 3.6 MTP GGUFs from Unsloth, what about Gemma 4's MTP layers that are in separate files?
View on Reddit #86165667

Odd-Ordinary-5922@reddit

You need a GGUF model that has MTP applied to it, and if Gemma 4's MTP layers are in separate files it won't work until you merge them.
View on Reddit #86165750

DeProgrammer99@reddit

The pull request has a checked box next to "Support separate GGUF for `mtp`", so I'd say you can download the MTP layers as their own GGUF. I'm going to have to try that, because I don't want two copies of almost the same file, one for agentic coding and one for batch processing.

https://github.com/ggml-org/llama.cpp/pull/22673#issuecomment-4456979078

> yes it can be loaded separately using `--spec-draft-model`. The `convert_hf_to_gguf.py` changes have an option of `--mtp` which just outputs the MTP gguf. Using the "grafted" on MTP is more VRAM efficient though. Another thing is that `-hf` option will try to look for the MTP gguf like it does for `mmproj` in case `spec-draft-type draft-mtp` is mentioned.

But it doesn't work with https://huggingface.co/IHaveNoClueAndIMustPost/Qwen3.6-27b-MTP-TENSORS-ONLY, probably because that's missing some GGUF metadata. So I tried `convert_hf_to_gguf.py --remote --mtp --outtype q8_0 Qwen/Qwen3.6-27B`, and that required admin privileges to make a symlink on Windows. The file it produced was 2.94 GB, compared to 430 MB for the above, and yeah, it uses a few GB extra VRAM.
View on Reddit #86169021
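
To keep the two setups straight (MTP layers bundled in the main GGUF versus a separate MTP GGUF), here is a small helper that assembles the command line for each. It is a sketch only: the flag names are copied from the commands and the PR quote in this thread and have not been checked against any particular llama.cpp build, and the file names are placeholders.

```python
import shlex


def mtp_server_args(model_path, n_draft=2, draft_model_path=None):
    """Build a llama-server argument list for MTP speculative decoding.

    Flags are taken as quoted in this thread (--spec-type draft-mtp,
    --spec-draft-n-max, --spec-draft-model); treat them as assumptions.
    """
    args = [
        "llama-server",
        "-m", model_path,
        "--spec-type", "draft-mtp",
        "--spec-draft-n-max", str(n_draft),
    ]
    if draft_model_path is not None:
        # Separate MTP GGUF, per the PR comment quoted above.
        args += ["--spec-draft-model", draft_model_path]
    return args


# Placeholder file names, not real releases.
bundled = mtp_server_args("model-with-mtp.gguf")
split = mtp_server_args("model.gguf", draft_model_path="model-mtp.gguf")

print(shlex.join(bundled))
print(shlex.join(split))
```

Per the quoted PR comment, the bundled ("grafted") variant is the more VRAM-efficient of the two, so the separate file mainly helps if you want to avoid keeping two near-identical copies of the main model.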

SkyFeistyLlama8@reddit

I gotta figure out how to get it working with Gemma 4 Assistants MTP models. Like this: https://huggingface.co/google/gemma-4-26B-A4B-it-assistant
View on Reddit #86169611

ilintar@reddit

I told you guys it was the real beta, but noooo, skeptics gonna whine 😛
View on Reddit #86169776

Ambitious_Fold_2874@reddit

Vision capabilities working with MTP?
View on Reddit #86161241

ilintar@reddit

Yes, it's been fixed since the beta.
View on Reddit #86169743

coder543@reddit

The PR description says it is.

> MTP is compatible with Vision input and Tensor/Pipeline Parallelism
View on Reddit #86163087

Ambitious_Fold_2874@reddit

Sick
View on Reddit #86163141

freehuntx@reddit

nope
View on Reddit #86162220

Zc5Gwu@reddit

But you can use the MTP gguf, right? You'd just have to disable it I assume if you wanted vision...?
View on Reddit #86162315

Goldandsilverape99@reddit

For me, adding vision makes the model stupid (it fails a puzzle question consistently).

Works:

`path\llama-server.exe -m path\unsloth\Qwen3.6-35B-A3B-MTP-GGUF\Qwen3.6-27B-UD-Q6_K_XL.gguf --flash-attn on --ctx-size 32768 --threads 12 --temp 1.0 --top-p 0.95 --top-k 20 --min-p 0.00 --presence_penalty 1.5 --repeat_penalty 1.0 --jinja --no-mmap -np 1 -ctk q8_0 -ctv q8_0 --chat-template-kwargs "{\"preserve_thinking\":true}" --spec-type draft-mtp --spec-draft-n-max 2 --kv-unified`

Does not work:

`path\llama-server.exe -m path\unsloth\Qwen3.6-35B-A3B-MTP-GGUF\Qwen3.6-27B-UD-Q6_K_XL.gguf --mmproj path\unsloth\Qwen3.6-35B-A3B-MTP-GGUF\mmproj-BF16.gguf --flash-attn on --ctx-size 32768 --threads 12 --temp 1.0 --top-p 0.95 --top-k 20 --min-p 0.00 --presence_penalty 1.5 --repeat_penalty 1.0 --jinja --no-mmap -np 1 -ctk q8_0 -ctv q8_0 --chat-template-kwargs "{\"preserve_thinking\":true}" --spec-type draft-mtp --spec-draft-n-max 2 --kv-unified --image-min-tokens 1024`

Can anyone reproduce the problem?
View on Reddit #86161384

SmoothCCriminal@reddit

Does this have any benefit to RAM poor folks running 9b models (omnicoder) on mac ?
View on Reddit #86161421

shapic@reddit

The model should have MTP in the first place.
View on Reddit #86168182

UmpireBorn3719@reddit

It consumes more vram
View on Reddit #86163745

oxygen_addiction@reddit

It should help with dense models. Prompt prefill might be a bit worse though.
View on Reddit #86162096

fragment_me@reddit

One thing I noticed is I could drop each 3090 down to 200 Watts and still get a speed up in token generation compared to no MTP.
View on Reddit #86167936

Odd-Ordinary-5922@reddit

anyone know how to use ngram for both mtp and the normal version at the same time?
View on Reddit #86165915

Force88@reddit

Can it work with my iGPU (780M)? I'm tinkering with this mini PC and surprisingly getting 17 t/s for qwen 3.6 35b a3b.
View on Reddit #86162536

Antop90@reddit

Yes
View on Reddit #86165445

anykeyh@reddit

Is MTP kept enabled in quantized and uncensored models, or should we wait for a new release?
View on Reddit #86161468

TurnOffAutoCorrect@reddit

llmfan already has uncensored MTP GGUF versions of

* Qwen3.6-35B-A3B
* Qwen3.6-27B

...here, if that's what you're asking for: https://huggingface.co/llmfan46/models?search=gguf

I haven't tested them yet, so I don't know if they need tweaking to work with this official release of MTP support in llama.cpp.
View on Reddit #86164773

RnRau@reddit

Moar tokens? Why yes please!! Thanks to all the hard-working developers on the llama.cpp team and of course the 1000s of researchers that keep finding new ways of improving things!!
View on Reddit #86160610

TurnOffAutoCorrect@reddit

> Moar tokens? Why yes please!!

MTP stands for **M**ore **T**okens **P**lease!
View on Reddit #86164591

Dany0@reddit

I tested it with chain of speculators ngram-mod just before the merge: 75 tok/s with Q5_K_M Qwen3.6 27B on a 61k input and 5000 tok output on an RTX 5090. vLLM still wins with 105 tok/s, sadly. I'll retest now after the merge.
View on Reddit #86163394

imp_12189@reddit

I have to wait for docker image.. Is it in 15h or so?..
View on Reddit #86161848

Xonzo@reddit

You could always build the docker image? Not trying to be snarky or anything but Claude / ChatGPT can help you do this in about 10 minutes.
View on Reddit #86162126

wgaca2@reddit

True but instead of 1000 people rebuilding it would be nice if they could just download it
View on Reddit #86163272

ghulamalchik@reddit

Heck yeah
View on Reddit #86161053

1FNn4@reddit

Anyone know how to use Gemma 4 with GGUF?
View on Reddit #86160368

LosEagle@reddit

Beat me to it. But love this!
View on Reddit #86160161