MTP PR Merged!!! | TheaterFire

[-]

rm-rf-rm@reddit

2 threads on the same topic, locking this smaller one. Please use this: https://old.reddit.com/r/LocalLLaMA/comments/1teqnf2/thats_a_good_news/

Reply

[-]

Outside_Reindeer_713@reddit

suddenly the RTX 3060 is a beast.

Reply

[-]

I still see a slight decrease in prefill (pp) on an RTX5090, but it's not terrible. For 30k tokens prefill + 5k token generation I'm getting: Average TPS: 98 (Vs 52 with no MTP) Average prefill: 2150 (vs 2600 with no MTP)
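To put those numbers into wall-clock terms, here is a minimal sketch. It assumes prefill and generation simply run back to back at the average rates quoted above, with no caching or other overhead; the function name is made up for illustration.

```python
def total_seconds(prompt_tokens, output_tokens, prefill_tps, gen_tps):
    """Wall-clock estimate: prefill time plus generation time."""
    return prompt_tokens / prefill_tps + output_tokens / gen_tps


# Figures from the comment above: 30k-token prefill, 5k-token generation.
no_mtp = total_seconds(30_000, 5_000, prefill_tps=2600, gen_tps=52)
with_mtp = total_seconds(30_000, 5_000, prefill_tps=2150, gen_tps=98)

print(f"no MTP:   {no_mtp:.1f} s")
print(f"with MTP: {with_mtp:.1f} s")
print(f"speedup:  {no_mtp / with_mtp:.2f}x")
```

For this workload the slower prefill costs a couple of seconds while the faster generation saves far more, so the end-to-end time still drops.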

Reply

[-]

GlobalLadder9461@reddit

On the Vulkan backend on an AMD APU, I am observing a maximum 30% increase. What are the results from other Vulkan folks?

Reply

[-]

EugenePopcorn@reddit

27B on MI60 goes from 25 to 45 tok/s with MTP, but only with simpler quants like Q4_1. Q4_K_XL runs closer to 35 tok/s.

Reply

[-]

RnRau@reddit

Seeing the same on a Strix Halo. Went from 40 t/s to ~53 t/s with the qwen3.6 35b-a3b q8. Weirdly I get just a very mild improvement with the Q6... From ~53 t/s to ~57 t/s.

Reply

[-]

Combinatorilliance@reddit

I saw a graph somewhere on this subreddit, it doesn't work so well for MoEs, it's much better for dense models. Your situation was what was shown on the graph as well.

Reply

[-]

Valuable_Touch5670@reddit (OP)

I am on AMD + Vulkan too (9070 XT). My TG has dropped from 60+ to the 45-52 range. But PP no longer takes a hit and is noticeably faster. (Could be slight variances in my workflow 😅)

Reply

[-]

RnRau@reddit

I had the same when I put --spec-draft-n-max to values higher than 2.

Reply

[-]

Valuable_Touch5670@reddit (OP)

Sadly I was already setting to 2 :(

Reply

[-]

taking_bullet@reddit

I'm also using Vulkan, but with an RTX 5070 Ti & RX 9070 combined. Gonna test it later, but even a 30% speed increase is a nice thing.

Reply

[-]

GlobalLadder9461@reddit

Yes, it is, but claims of a 2x improvement appear to be exaggerated. Also, adding ngram mod with the MTP draft decreased TG for me. Best speed (13 to 17 tps) is achieved with an n-max value of 2. BTW, I used the unsloth iq4_nl quant.

Reply

[-]

DeSibyl@reddit

Does this have the fix to allow vision to work with it?

Reply

[-]

wllmsaccnt@reddit

If your model has MTP layers, this lets llama.cpp use them for speculative decoding. You could expect a speedup of around 50 to 80% in tokens generated per second. This is probably the biggest speedup we'll see in llama.cpp for token generation until Eagle3 or DFlash become available.

Reply

[-]

robertpro01@reddit

I don't know about others' experience, but in my case, I tried vLLM DFlash and at first it looked great; once real work had to be done, it was just slow as shit. My experience with MTP in llama.cpp has been awesome tbh.

Reply

[-]

adssidhu86@reddit

Still curious: where is it super useful? Refactoring projects?

Reply

[-]

robertpro01@reddit

I have no idea, but I'm sticking to llama.cpp now

Reply

[-]

TheTerrasque@reddit

> This particular implementation originally made prompt processing slower, but hopefully they've since fixed that issue.

I'm seeing around a halving of PP speed, from 1100 to 550 on 3090 and 250 to 170 on P40. Also seeing context drop from ~150k to ~110k (still finding out where the limits are exactly, it OOMs now and then). At least vision works now.

Reply

[-]

kamtar@reddit

Is there something in sight for prompt processing?

Reply

[-]

Pleasant-Shallot-707@reddit

Watch what the effects are on PP for your setup. MTP might not be helpful for total time in all workloads.

Reply

[-]

Material_Tone_6855@reddit

I was thinking the same. Using Kilo in Code mode, the average initial prompt is 14k tokens. If token generation is 2x (e.g. 30 to 60 t/s) but prefill goes from 1500 to 700 t/s, I can't see it as an advantage. The strange thing about the whole LLM community is the constant search for faster token generation, but in my opinion prompt processing speed is just as important as token generation (if not more), especially for coding agents, since the system prompt is usually huge.
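A quick way to check that trade-off is to solve for the output length at which the faster generation repays the slower prefill. The sketch below uses the example rates from this comment and assumes the whole prompt is processed from scratch (no cache hit); the function name is hypothetical.

```python
def break_even_output_tokens(prompt_tokens, pp_base, pp_mtp, tg_base, tg_mtp):
    """Output length at which MTP's faster generation repays its slower prefill."""
    extra_prefill_s = prompt_tokens / pp_mtp - prompt_tokens / pp_base
    saved_per_token_s = 1 / tg_base - 1 / tg_mtp
    if saved_per_token_s <= 0:
        return float("inf")  # generation is not faster, so MTP never catches up
    return max(0.0, extra_prefill_s / saved_per_token_s)


# 14k-token prompt, prefill 1500 -> 700 t/s, generation 30 -> 60 t/s.
n = break_even_output_tokens(14_000, pp_base=1500, pp_mtp=700, tg_base=30, tg_mtp=60)
print(f"break-even at about {n:.0f} output tokens")
```

With these example rates the break-even lands at roughly 640 output tokens: shorter replies lose time overall, longer ones gain. If part of the prompt is served from cache, the extra prefill cost shrinks and the break-even moves lower.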

Reply

[-]

wektor420@reddit

If you have a large static system prompt, then cache it permanently...

Reply

[-]

wllmsaccnt@reddit

I think it helps that the system prompt can often be cached across sessions, so if you are just using one agent harness throughout the day with the same system prompt and tools, you might only pay that prompt processing once, whereas you always pay the token generation cost for every single turn.

Reply

[-]

Material_Tone_6855@reddit

That's true. In fact, I'm still trying to understand how the total workload is divided between prompt processing and token generation.

Reply

[-]

wllmsaccnt@reddit

Good reminder. I added a blurb with a warning to my comment.

Reply

[-]

dbzunicorn@reddit

prompt processing is the biggest gap between apple silicon and nvidia

Reply

[-]

Caffdy@reddit

This only gets you a speedup if the model fits in VRAM, right? What about 122B+ sized models with CPU offloading?

Reply

[-]

wllmsaccnt@reddit

It should give you a speedup in any case where the decoding is memory bandwidth constrained, but I have not tested CPU decoding with it myself. I would expect on some slower machines that CPU decoding might be CPU compute bound. Partial offloading makes that even messier to talk about abstractly. I am sure token gen benchmarks will be flooding the usual channels soon.

Reply

[-]

ilintar@reddit

No, you can actually use it with CPU offloading and people have reported decent gains there.

Reply

[-]

spaceman_@reddit

What's going on with the Eagle3 PR these days?

Reply

[-]

ilintar@reddit

It's been put on hold pending the merge of this PR. It will now be merged on top of the standard MTP support. Georgi has been keeping the branch up to date with the changes to facilitate quick adoption.

Reply

[-]

wllmsaccnt@reddit

I'm hoping it might have been waiting for the same speculative parallel refactoring work to merge as this MTP one was. Hopefully we'll see it soon-ish. The DFlash one didn't seem nearly as close as the Eagle3 work.

Reply

[-]

Antique_Dot_5513@reddit

In terms of quality, is there any drop?

Reply

[-]

wllmsaccnt@reddit

> Has there been a decline in quality?

If there was, it would be considered a bug. MTP doesn't require sacrificing any inference quality. MTP models are marginally larger than non-MTP models and will use more VRAM at runtime.

Reply

[-]

maximus_reborn@reddit

has anyone tried it with omlx? any gains as compared to mlx models with no mtp?

Reply

[-]

tempedbyfate@reddit

There's like 5 posts on r/LocalLLaMA for MTP branch being merged, never seen so much enthusiasm over a PR.

Reply

[-]

LagOps91@reddit

I was promised this PR 3000 years ago!

Reply

[-]

soldture@reddit

It's a great achievement!

Reply

[-]

Shoddy_Bed3240@reddit

I tested the new MTP feature on Qwen 3.6 35B and 27B. Generation speed is definitely faster, but prompt processing speed dropped by about 2.5x in my case (from 6500 t/s down to 2000 t/s). Also, the `-fit` argument seems to have stopped working — it looks like it doesn’t recognize MTP at all. On longer contexts, I also ran into a “CUDA error: out of memory.” Hopefully these are all things that can be fixed.

Reply

[-]

TheTerrasque@reddit

Similar to my experience. Drop in PP (albeit not quite as dramatic), -fit not understanding anything, and oom crashes on longer contexts. That's despite cutting context length by one third.

Reply

[-]

pjdonovan@reddit

this speeds up token generation, right?

Reply

[-]

urarthur@reddit

yes for those gguf's that support it

Reply

[-]

DrAlexander@reddit

So new GGUFs need to be released with MTP support? Are they larger than the non-MTP ones?

Reply

[-]

Material_Tone_6855@reddit

They're already released; search for unsloth Qwen 3.6 MTP. The MTP model is bundled into the GGUF. I'm using the Q4\_K\_XL and the size is 22 GB.

Reply

[-]

Odd-Environment-7193@reddit

How’s the performance?

Reply

[-]

Valuable_Touch5670@reddit (OP)

Yes, but depends on your work type. It works best for coding.

Reply

[-]

ceo_of_banana@reddit

Could you elaborate? Does it enable token prediction for multiple instances or multiple tokens of the same prompt?

Reply

[-]

Valuable_Touch5670@reddit (OP)

Tokens are typically generated one at a time, which involves lots of reading from memory, hence slow. MTP tries to generate multiple tokens at a time by "guessing" the next few tokens with draft layers. If guesses are correct, massive speed up; otherwise, the compute spent on guessing is wasted. If your next tokens often vary a lot (like in creative writing), speed up is then small. But if previously generated tokens are likely to appear again (like code refactoring, for example), then speed up is bigger. To me, this feels **a bit** like how branch prediction works in microchips. Hope this helps!
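A rough way to see why the acceptance rate matters: suppose each drafted token is accepted independently with probability p (a simplification, since real acceptance depends on context) and checking stops at the first rejected token. Then a pass with n drafted tokens yields 1 + p + ... + p^n tokens on average. The sketch below only illustrates that formula; it is not llama.cpp's implementation.

```python
def expected_tokens_per_pass(accept_prob, n_draft):
    """Expected tokens emitted per verification pass.

    Assumes each drafted token is accepted independently with probability
    accept_prob, drafts are checked in order, and the main model always
    contributes one token of its own (the correction or the extra token).
    """
    return sum(accept_prob ** k for k in range(n_draft + 1))


for p in (0.3, 0.6, 0.9):
    row = ", ".join(f"n={n}: {expected_tokens_per_pass(p, n):.2f}" for n in (1, 2, 4))
    print(f"p={p}: {row}")
```

This counts tokens, not time. Every pass also pays for the drafting itself, so a longer draft only helps when p is high, which fits the reports in this thread of slower generation with `--spec-draft-n-max` above 2.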

Reply

[-]

ceo_of_banana@reddit

thanks a lot!

Reply

[-]

Material_Tone_6855@reddit

There's a speedup in generation but a drop in prompt prefill, probably due to the fact that another small model is loaded (it's bundled in the GGUF) and it needs to be loaded into VRAM, just like the KV cache.

Reply

[-]

Fringolicious@reddit

Can I use this with lmstudio yet? I had the newest llama cpp runtimes available to download in lmstudio, got them, but now I'm not sure if there is a compatible mtp gguf available yet, anyone got it working yet?

Reply

[-]

lemondrops9@reddit

LM Studio isn't bleeding edge; in the past I've waited days to weeks for the newest thing to be supported. You can check the runtime drivers to see what version LM Studio is using.

Reply

[-]

Dany0@reddit

LM Studio beta has that new feature you can enable in developer settings to always use the latest llama.cpp. I don't know how to test it as it seems counter-intuitive and I cba, but just FYI.

Reply

[-]

Fringolicious@reddit

Thanks, I'll check it out

Reply

[-]

No_Algae1753@reddit

Have they fixed slow pp ?

Reply

[-]

CalligrapherFar7833@reddit

Small pp

Reply

[-]

MotokoAGI@reddit

900 tk/s for PP on 3090s. Is that slow or fast for you? For Qwen3.6-35b

Reply

[-]

SkyFeistyLlama8@reddit

I'm happy jumping from 15 t/s to 28 t/s on that model and you're past lightspeed, you're on ludicrous speed!

Reply

[-]

DoorStuckSickDuck@reddit

That's what I get right now on strix halo without it, which is worrying.

Reply

[-]

314kabinet@reddit

Slow. That’s over 30 seconds to first token in my coding setup.

Reply

[-]

Material_Tone_6855@reddit

You always have to check it on your own machine; you can't compare it with other people's hardware.

Reply

[-]

No_Algae1753@reddit

Depends if it was previously slower or faster

Reply

[-]

van-dame@reddit

That sounds like a medical issue

Reply

[-]

Miserable-Dare5090@reddit

peyronie’s disease… bentcarrot.con

Reply

[-]

No_Algae1753@reddit

😭😭😭😭😭

Reply

[-]

rngesius@reddit

Yes

Reply

[-]

Address-Street@reddit

Hope they’ll add support for Gemma soon.

Reply

[-]

Consumerbot37427@reddit

Does anybody know if we need to download special MTP-enabled GGUFs?

Reply

[-]

crapaud_dindon@reddit

Yes, previous gguf had the mtp layers removed

Reply

[-]

TheWaffleKingg@reddit

And the MTP GGUFs are larger; iirc 27b q8 is like 3 gigs bigger.

Reply

[-]

SkyFeistyLlama8@reddit

So I need to download all Qwen 3.6 MTP GGUFs from Unsloth, what about Gemma 4's MTP layers that are in separate files?

Reply

[-]

Odd-Ordinary-5922@reddit

You need a GGUF model that has MTP applied to it, and if Gemma 4's MTP layers are in separate files, it won't work until you merge them.

Reply

[-]

DeProgrammer99@reddit

The pull request has a checked box next to "Support separate GGUF for `mtp`", so I'd say you can download the MTP layers as their own GGUF. I'm going to have to try that, because I don't want two copies of almost the same file, one for agentic coding and one for batch processing.

[https://github.com/ggml-org/llama.cpp/pull/22673#issuecomment-4456979078](https://github.com/ggml-org/llama.cpp/pull/22673#issuecomment-4456979078)

> yes it can be loaded separately using `--spec-draft-model`. The `convert_hf_to_gguf.py` changes have an option of `--mtp` which just outputs the MTP gguf. Using the "grafted" on MTP is more VRAM efficient though. Another thing is that `-hf` option will try to look for the MTP gguf like it does for `mmproj` in case `spec-draft-type draft-mtp` is mentioned.

But it doesn't work with [https://huggingface.co/IHaveNoClueAndIMustPost/Qwen3.6-27b-MTP-TENSORS-ONLY](https://huggingface.co/IHaveNoClueAndIMustPost/Qwen3.6-27b-MTP-TENSORS-ONLY), probably because that's missing some GGUF metadata. So I tried `convert_hf_to_gguf.py --remote --mtp --outtype q8_0 Qwen/Qwen3.6-27B`, and that required admin privileges to make a symlink on Windows. The file it produced was 2.94 GB, compared to 430 MB for the above, and yeah, it uses a few GB extra VRAM.

Reply

[-]

SkyFeistyLlama8@reddit

I gotta figure out how to get it working with Gemma 4 Assistants MTP models. Like this: https://huggingface.co/google/gemma-4-26B-A4B-it-assistant

Reply

[-]

ilintar@reddit

I told you guys it was the real beta, but noooo, skeptics gonna whine 😛

Reply

[-]

Ambitious_Fold_2874@reddit

Vision capabilities working with MTP?

Reply

[-]

ilintar@reddit

Yes, it's been fixed since the beta.

Reply

[-]

coder543@reddit

The PR description says it is.

> MTP is compatible with Vision input and Tensor/Pipeline Parallelism

Reply

[-]

Ambitious_Fold_2874@reddit

Sick

Reply

[-]

freehuntx@reddit

nope

Reply

[-]

Zc5Gwu@reddit

But you can use the MTP gguf, right? You'd just have to disable it I assume if you wanted vision...?

Reply

[-]

Goldandsilverape99@reddit

For me, adding vision makes the model stupid. (It fails a puzzle question consistently.)

Works:

path\\llama-server.exe -m path\\unsloth\\Qwen3.6-35B-A3B-MTP-GGUF\\Qwen3.6-27B-UD-Q6\_K\_XL.gguf --flash-attn on --ctx-size 32768 --threads 12 --temp 1.0 --top-p 0.95 --top-k 20 --min-p 0.00 --presence\_penalty 1.5 --repeat\_penalty 1.0 --jinja --no-mmap -np 1 -ctk q8\_0 -ctv q8\_0 --chat-template-kwargs "{\\"preserve\_thinking\\":true}" --spec-type draft-mtp --spec-draft-n-max 2 --kv-unified

Does not work:

path\\llama-server.exe -m path\\unsloth\\Qwen3.6-35B-A3B-MTP-GGUF\\Qwen3.6-27B-UD-Q6\_K\_XL.gguf --mmproj path\\unsloth\\Qwen3.6-35B-A3B-MTP-GGUF\\mmproj-BF16.gguf --flash-attn on --ctx-size 32768 --threads 12 --temp 1.0 --top-p 0.95 --top-k 20 --min-p 0.00 --presence\_penalty 1.5 --repeat\_penalty 1.0 --jinja --no-mmap -np 1 -ctk q8\_0 -ctv q8\_0 --chat-template-kwargs "{\\"preserve\_thinking\\":true}" --spec-type draft-mtp --spec-draft-n-max 2 --kv-unified --image-min-tokens 1024

Can anyone reproduce the problem?
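One way to narrow this down: the failing command differs from the working one by two flags (`--mmproj` and `--image-min-tokens`), so it may help to test them one at a time. The sketch below just prints three command variants to run by hand. The paths are placeholders, and the flags are a subset copied from the two commands above, not independently verified here.

```python
import shlex

# Placeholder paths: substitute your own files.
MODEL = "path/to/model-with-mtp.gguf"
MMPROJ = "path/to/mmproj-BF16.gguf"

# Subset of the flags from the commands above; not independently verified.
BASE = [
    "llama-server", "-m", MODEL,
    "--flash-attn", "on", "--ctx-size", "32768",
    "--spec-type", "draft-mtp", "--spec-draft-n-max", "2",
    "--kv-unified", "-np", "1",
]

# Each variant adds exactly one change on top of the previous one.
VARIANTS = {
    "text only": BASE,
    "with mmproj": BASE + ["--mmproj", MMPROJ],
    "with mmproj + min tokens": BASE + ["--mmproj", MMPROJ, "--image-min-tokens", "1024"],
}

for name, argv in VARIANTS.items():
    print(f"# {name}")
    print(shlex.join(argv))
```

Running the same puzzle prompt against each variant would show whether the regression comes from loading the projector at all or only from the image token setting.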

Reply

[-]

SmoothCCriminal@reddit

Does this have any benefit to RAM poor folks running 9b models (omnicoder) on mac ?

Reply

[-]

shapic@reddit

The model has to have MTP layers in the first place.

Reply

[-]

UmpireBorn3719@reddit

It consumes more vram

Reply

[-]

oxygen_addiction@reddit

It should help with dense models. Prompt prefill might be a bit worse though.

Reply

[-]

fragment_me@reddit

One thing I noticed is I could drop each 3090 down to 200 Watts and still get a speed up in token generation compared to no MTP.

Reply

[-]

Odd-Ordinary-5922@reddit

anyone know how to use ngram for both mtp and the normal version at the same time?

Reply

[-]

Force88@reddit

Can it work with my iGPU (780M)? I'm tinkering with this mini PC and surprisingly get 17 t/s for qwen 3.6 35b a3b.

Reply

[-]

Antop90@reddit

Yes

Reply

[-]

anykeyh@reddit

Is MTP kept enabled in quantized and uncensored models, or should we wait for a new release?

Reply

[-]

TurnOffAutoCorrect@reddit

llmfan already has uncensored MTP gguf versions of

* Qwen3.6-35B-A3B
* Qwen3.6-27B

...here if that's what you're asking for... https://huggingface.co/llmfan46/models?search=gguf

I haven't tested them yet so I don't know if they need tweaking to work with this official release of MTP support in llama.cpp

Reply

[-]

RnRau@reddit

Moar tokens? Why yes please!! Thanks to all the hard working developers on the llama.cpp team and of course the 1000s of researchers that keep finding new ways of improving things!!

Reply

[-]

TurnOffAutoCorrect@reddit

> Moar tokens? Why yes please!!

MTP stands for **M**ore **T**okens **P**lease!

Reply

[-]

Dany0@reddit

I tested it with chain of speculators ngram-mod just before the merge: 75 tok/s with q5 k m qwen3.6 27b on a 61k input, 5000 tok output, on an rtx 5090. vLLM still wins with 105 tok/s, sadly. I'll retest now after the merge.

Reply

[-]