Llama.cpp MTP support now in beta!
Posted by ilintar@reddit | LocalLLaMA | View on Reddit | 231 comments
Happy to report that llama.cpp MTP support is now in beta, thanks to Aman (and all the others that have pushed the various issues in the meantime). This has the potential to actually get merged soon-ish. Currently contains support for Qwen3.5 MTP, but other models are likely to follow suit.
Between this and the maturing tensor-parallel support, expect most performance gaps between llama.cpp and vLLM, at least when it comes to token generation speeds, to be erased.
Synor@reddit
Not beta at all as it's not even in the main code yet.
And for some testers it does not work at all.
MisticRain69@reddit
Tried the PR with Vulkan but no luck, just errors. Hope they get it stable soon
Thomasedv@reddit
I'd love a breakdown of the speculative methods and which to choose, and the pros/cons of each. It's quite hard to find out.
MTP (multi token prediction), Eagle-3, DFlash, DTree, ngram? Some needs extra draft models, some do not, some are better suited for "reusing" context like ngram I think.
Anyone got a comparison somewhere or willing to create one?
pkmxtw@reddit
All of them work on the principle of generating draft tokens cheaply and then verifying with the full model. The main difference is how those draft tokens are generated.
N-gram: Match strings already present in the context. Pros: extremely fast to compute, works on any model. Cons: only good for applications where the data is repeated verbatim, like coding.
Draft model: Use a small model of the same family to quickly generate tokens. Pros: easy to implement (just run two models concurrently). Cons: requires a matching model, acceptable rate depends on how well they match.
MTP: The full model itself is pre-trained to output draft tokens on auxiliary heads. Pros: potentially the best. Cons: requires the model to be trained for it.
Eagle3: This is kinda like MTP except that it is bolted on to a pre-trained model. Pros: good speed-up and likely the widely-used SOTA technique. Cons: you need to spend $$$ to train the eagle3 model.
DFlash: Use a block diffusion model for prediction. Pros: speed goes brrr (if you have the compute). Cons: same issue with eagle3, still new and experimental.
Basically it comes down to what your engine / model supports, and how much leftover compute you have. My pick would be:
Silver-Champion-4846@reddit
Can you combine Eagle3 with MTP?
Chromix_@reddit
That might not make a lot of sense, unless you want to make both training and inference even more expensive, for the chance of getting a correct prediction when either model guesses correctly.
Combining with N-gram can make sense though, to skip the speculation model in specific situations and gain some added speed.
Silver-Champion-4846@reddit
Has it been done before?
Chromix_@reddit
It apparently works just fine:
llama-server -m large.gguf -fa on -ngl 99 -md small.gguf -ngld 99 -c 20000 -np 1 --spec-type ngram-mod

This was for a case where I asked to make a small change in an 8k token code file.
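If you're curious what the ngram part actually does, here's a toy sketch of the idea (not llama.cpp's actual implementation, just the principle): match the tail of the context against earlier text and propose whatever followed it last time.

```python
def ngram_draft(context: list[int], n: int = 3, max_draft: int = 8) -> list[int]:
    """Propose draft tokens by finding an earlier occurrence of the last n tokens
    and copying whatever followed them. Toy illustration only."""
    if len(context) <= n:
        return []
    key = tuple(context[-n:])
    # Scan backwards for an earlier occurrence of the same n-gram.
    for i in range(len(context) - n - 1, -1, -1):
        if tuple(context[i:i + n]) == key:
            start = i + n
            return context[start:start + max_draft]
    return []

# The trigram (1, 2, 3) appeared earlier, so we propose its old continuation.
print(ngram_draft([5, 1, 2, 3, 9, 9, 7, 1, 2, 3]))  # -> [9, 9, 7, 1, 2, 3]
```

The proposed tokens are then verified by the big model like any other speculative draft, which is why this shines on "edit this file" type prompts where most of the output already exists in the context.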
Silver-Champion-4846@reddit
I had to spell that letter by letter because screen reader reads the whole thing like a spell from the dark ages. Forget the dashes and that little pause you think of when you see a new arg. It's all llama server M large dot gguf fa (fa as in father) on ngl ninety-nine c twenty thousand md small dot gguf ngld ninety-nine etc etc
perkia@reddit
Interesting. I'd have thought a short inlined
markup (correctly used in the parent comment) would automagically make the screen reader pay more attention to syntax, not less, but I guess that's accessibility for you... is there a special aria attribute to add to HTML tags that would help consider the content a function or command call?
Thomasedv@reddit
Thanks, that's a good summary. Now to research what works, seems like most of these aren't that good for MoE models. Qwen 3.6 35B A3B is blazing fast already but I'd be very amazed if I could make it even faster. But so far, tests imply most speculative decoding slows it down.
Look forward to trying these on 27B when I get time.
pkmxtw@reddit
Yeah, MoE models usually have small enough active parameters that you won't gain as much.
--spec-type ngram-mod does still provide some speed-up if you have repeating text, and it is close to a free lunch.
unjustifiably_angry@reddit
dflash, from what I've read, can be anything from a huge speedup to a huge speed loss depending on the model and the hardware it's running on.
Anbeeld@reddit
I mean the main problem is that you don't have much choice, not how to make the choice. There's only a handful of inference implementations for e.g. Qwen 3.6 that support prediction without breaking everything else, vLLM being the main one probably with MTP?
I'm working on adding one more option to the list right now by the way, stay tuned.
oxygen_addiction@reddit
Lay off the cyber psychosis, buddy.
Anbeeld@reddit
What's wrong, exactly?
No_Weather8173@reddit
MTP: Self explanatory really. The model is trained to predict several future tokens, not just the next one. At inference those extra predictions can be used as built-in draft tokens. You don’t need a separate draft model, but you do need a base model trained with MTP heads, so you can’t just enable it for every model.
EAGLE-3: This is speculative decoding with a learned draft model/head. EAGLE uses hidden-state information from the target model to draft likely future tokens more accurately than a tiny standalone draft model. EAGLE-3 improves that by using multiple features from the target model and directly predicting tokens. It needs a compatible trained EAGLE checkpoint and backend support.
DFlash is also a learned drafter, but instead of autoregressively drafting token by token, it uses a block diffusion-style drafter to propose a whole block of tokens more in parallel. The goal is to make drafting cheaper for longer candidate blocks. Very promising, but more specialized and newer than EAGLE methods.
DTree: Basically an extension of DFlash. DFlash gives you distributions over possible block tokens. DDTree uses those distributions to build a tree of likely continuations, then verifies the tree with the target model. This is useful because if the single best draft path fails early, another branch may still be accepted. IMO this one has huge potential.
N-gram: This one doesn't need a draft model; it just looks for repeated patterns in the prompt or generated context and proposes the continuation that followed the same n-gram earlier. It’s great for code editing, summarization, RAG, etc. It’s weak for novel generation because it has no semantic understanding, it’s mostly exploiting repetition.
Ideally you'd try cheap draftless methods first, like n-gram when there’s a strong context match, then fall back to a learned method like DDTree or EAGLE-3. Even better would be to merge them, put an n-gram continuation as a high-priority branch in the draft tree, and fill the rest of the tree with your DDTree candidates.
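To make the tree part concrete, here's a hand-wavy sketch of turning per-position draft distributions into a handful of candidate paths (made-up toy data, not any particular implementation):

```python
import heapq

def build_draft_paths(draft_dists, top_k=2, max_paths=4):
    """Expand the top_k tokens at each draft position into candidate paths through
    a token tree, keeping only the max_paths most probable continuations.
    draft_dists: one dict of token -> probability per draft position."""
    paths = [(1.0, [])]
    for dist in draft_dists:
        best = heapq.nlargest(top_k, dist.items(), key=lambda kv: kv[1])
        paths = [(p * prob, path + [tok]) for p, path in paths for tok, prob in best]
        paths = heapq.nlargest(max_paths, paths, key=lambda x: x[0])
    return sorted(paths, key=lambda x: x[0], reverse=True)

# Toy distributions for two draft positions.
dists = [{"the": 0.6, "a": 0.3, "an": 0.1}, {"cat": 0.5, "dog": 0.4, "car": 0.1}]
for p, path in build_draft_paths(dists):
    print(round(p, 2), path)   # 0.3 ['the', 'cat'], 0.24 ['the', 'dog'], ...
```

The target model then verifies all of these branches in one batched pass, so even if the top branch gets rejected at its second token, a sibling branch might still be accepted. Slotting an n-gram continuation in as one more high-priority branch is exactly the kind of merge I meant above.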
No_Algae1753@reddit
https://gist.github.com/am17an/228edfb84ed082aa88e3865d6fa27090
phhusson@reddit
This, plus have an up-to-date list of which inference framework supports which speculation
It sounds like a very hard task tbh since it is moving continuously
radlinsky@reddit
Can someone ELI5 what MTP is and what this means?
Baldur-Norddahl@reddit
Models predict the next token. To do so, every weight needs to be accessed once (for a dense model). Therefore the maximum rate of tokens generated is equal to the number of times per second the total of the model weights can be read from RAM. For example if RAM bandwidth is 500 GB/s and the model is 50 GB, we can never generate more than 10 tokens per second. Usually it is even slower, but that would be the theoretical max.
Now let's say we generate tokens for multiple unrelated prompts. We can read the weights once and do all the prompts in parallel. Each time the total of the weights gets processed, we would generate X tokens instead of just one. Instead of 10 tokens per second, we could do 100 by processing 10 users in parallel. The limit becomes compute instead of bandwidth.
That is all good, but it doesn't help a single user/prompt. But what if we get a guess on the next token and then process the current context in parallel with the context + the guess. Then we check if the guess was correct. If it was, then we already calculated the next next token and we got two for the price of one. If the guess is wrong, then the calculated next next token is also wrong and we need to discard it.
To make the guess we can use a smaller model. Usually 10 times smaller, because it must be much faster than the main model. MTP is usually a term used for main models that have built in guess generators. It has a few layers that will produce the guess alongside the actual next token.
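If it helps, here's a toy end-to-end version of the guess-and-check loop. Everything is fake (the "big model" just memorises a target string and only looks at position, the "small model" deliberately flubs one spot), but the accept/reject logic is the real idea:

```python
TARGET = list("hello world!")            # what the "big model" wants to write

def big_next_at_each_pos(seq):
    # One expensive forward pass: for every prefix of seq, the big model's next token.
    # (This toy ignores the actual tokens and only looks at position.)
    return [TARGET[i + 1] if i + 1 < len(TARGET) else "." for i in range(len(seq))]

def small_guess_next(seq):
    # Cheap guesser: agrees with the big model except it flubs position 6.
    i = len(seq)
    if i == 6:
        return "?"
    return TARGET[i] if i < len(TARGET) else "."

def speculative_step(context, n_draft=4):
    draft = []
    for _ in range(n_draft):                             # small model drafts cheaply
        draft.append(small_guess_next(context + draft))
    preferred = big_next_at_each_pos(context + draft)    # one big-model pass
    accepted = []
    for i, tok in enumerate(draft):                      # keep drafts up to first mismatch
        if tok == preferred[len(context) - 1 + i]:
            accepted.append(tok)
        else:
            break
    bonus = preferred[len(context) - 1 + len(accepted)]  # big model's own token: free
    return context + accepted + [bonus]

seq = list("hell")
while len(seq) < len(TARGET):
    seq = speculative_step(seq)
    print("".join(seq))   # "hello w", then "hello world!": 8 new tokens in 2 big passes
```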
ilintar@reddit (OP)
Not exactly ELI5 but a technically very good explanation :)
superdariom@reddit
Big wise bear can find his way through the woods faster when helped by little bear, who is quicker and more nimble but sometimes makes mistakes when leading big bear. But together they make a better team than either one on their own.
vick2djax@reddit
Isn’t this basically MoE except with MoE, the little bear tells the big bear where to go in the woods and big bear doesn’t check little bear’s direction?
No_Afternoon_4260@reddit
Not at all.
MoE is like you slice each layer. When you start a layer, a router decides which slice to activate. Thus MoE models come with an indication of the number of active parameters.
A model like deepseekv4 flash comes with 284B total params but only activates 13B of these for each token.
Large ram foot print for large knowledge and capabilities, but small compute footprint for runtime efficiency.
MTP is more like speculative decoding. Not sure how it's different besides having the smaller weights embedded in the big model?
Cast-Iron_Nephilim@reddit
Sooo, many little bears, and the group goes with whichever bear feels the most confident about the current bit of forest?
sergeant113@reddit
Yes, many little bears, but an elder bear decides which little bear gets to dictate the next step. Every step could potentially be decided by a different little bear.
Sometimes the elder bear gets lazy or plays favoritism and keeps choosing a particular little bear, but i digress.
z_latent@reddit
It's a stretch, MoE does not tell big bear where to go, just how to decide where to go, in a more internal way. Like little bear guiding big bear's attention so big bear can think only about what's important now.
darwinanim8or@reddit
grug thank bear man
pab_guy@reddit
So… speculative decoding, but in parallel.
Baldur-Norddahl@reddit
Speculative decoding is the exact same. Only difference is that you have to supply an external prediction model.
More_Feature8687@reddit
Is this same as speculative decoding?
Baldur-Norddahl@reddit
Yes, with a built-in prediction model.
ROS_SDN@reddit
The model has a built in "sub-model" for speculative decoding?
How does that architecture look on the qwen3.5+? How big is this segment?
GergelyKiss@reddit
This sounds very similar to branch predictors in CPUs... Thanks for the explanation!
4onen@reddit
And it's even called speculative decoding, so yeah, spot on. We speculate these guesses through one means or another. MTP being one specific means. If we happen to guess right, then we save time, otherwise the extra work is kinda negligible if we tune everything right.
ProfessionalJackals@reddit
So this explains why something like a 5090 is not running circles around a 3090 in token generation, and people ended up running models in parallel to get the most out of it?
Polite_Jello_377@reddit
Sounds kinda like CPU branch prediction
Obvious_Equivalent_1@reddit
Wanted to convey a quick message of gratitude. It’s good to see people taking time to make their private knowledge public, it’s maybe small but these messages make it a joy to continue reading these open source subs!
radlinsky@reddit
Thank you, this is a nice high level overview I can understand :)
Eyelbee@reddit
The important part is that they train the model with MTP considerations; it makes them smarter. Other than that I don't care about the MTP inference honestly.
ilintar@reddit (OP)
big model make tokens slow, small model make tokens fast, big model has small model inside, small model make tokens for big model, big model checks, big model make tokens faster
tf2ftw@reddit
How is babby formed?
hesperaux@reddit
They are taking the babbies bank to new York to lady to rest. My pary are with the father who lost his chrilden.
lolwutdo@reddit
Gregnant
Baul@reddit
Lots of comments asking about Speculative Decoding. This is just like "draft" speculative decoding, but without the need to allocate more VRAM to a smaller model.
BitGreen1270@reddit
So are there models already that support MTP?
OsmanthusBloom@reddit
Qwen3.5 / 3.6 do support it
ShengrenR@reddit
https://www.reddit.com/r/LocalLLaMA/comments/1seqblr/turns_out_gemma_4_had_mtp_multi_token_prediction/
Gemma 4 apparently 'did'..? but not in current release.
4onen@reddit
That's the trick. The only version of the model that Google has released with multi-token prediction (MTP) is the version to run on the liteRT engine that they use for running on phones. Their explanation for why it's not in the other format releases... was that it might confuse runtimes. The problem is, every runtime ignores tensors when it doesn't know what to do with them, so it wouldn't confuse any runtimes.
My speculation is that they are holding the MTP tensors back to make their stuff look better.
BitGreen1270@reddit
That's awesome! So if I load up qwen3.6-27B and use MTP it will run much faster and use the same amount of memory?
OsmanthusBloom@reddit
See the PR linked by OP for some benchmarks. Yes, it will be a lot faster for TG, maybe twice as fast. VRAM usage will increase by around 3GB according to other commenters who have tried it.
DOAMOD@reddit
I haven't tried Llamacpp MTP yet, but I did try MTP in VLLM on Windows on my 5090, and it was a bit disappointing. The memory consumption when exposing the small model doesn't compensate at all for the significant loss of context window. Perhaps in some specific cases for MoEs it could be useful; I think that's the interesting point. But for Dense, I don't see a benefit in my use case. I'll try Llamacpp, though.
SnooPaintings8639@reddit
This has actually been true for quite a while for those who use vLLM. MTP + tensor parallel make Qwen 3.6 much faster there than in llama.cpp.
audioen@reddit
MTP has been a thing for like a year at least. Some older GLM already shipped with MTP head. People have had the habit of stripping the MTP heads off from the GGUF files because llama.cpp has had no ability to use them for such a long time. We can expect a round of updates to Qwen3.6 due to this -- currently downloading the q8_0 with MTP head in it, though no doubt within the week unsloth will have a new release, and then I'm downloading it one more time...
spaceman_@reddit
Almost everyone.
It's important to note that MTP works differently in all architectures, so while the PR adds support to Qwen3.5 models & a lot of the shared stuff required for MTP, it does not enable MTP for all models.
BitGreen1270@reddit
Qwen is the only one I could probably run dense so that's fine by me!
droptableadventures@reddit
To put that a different way: Speculative decoding has an entirely separate small model that works only on the output tokens of the big model.
For MTP, the small model gets the internal state of the big model as an input, so it can "peek inside" and make more accurate guesses as to what's coming.
Anbeeld@reddit
You still allocate a fuckton of VRAM for MTP to work.
Baul@reddit
TIL it does take more VRAM, but a fuckton is probably an overstatement:
https://github.com/ggml-org/llama.cpp/pull/22673#issuecomment-4371483712
Anbeeld@reddit
Fuckton because you have to keep the MTP layer at BF16 or so for good results, which combined with everything else bloats VRAM hard if you're on Q4 or so.
letsgoiowa@reddit
RIP so I can't even use it on a q6 4b model? Damn
Anbeeld@reddit
No, it's not like that, peeps are producing quants where e.g. the entire model is Q4 but MTP is BF16 and everything works. It just gets tight quickly if you are on a single 3090 for example.
Glazedoats@reddit
I really appreciate you mentioning this because I also have very small VRAM.
letsgoiowa@reddit
Oh I have a 3070 so only 8 GB
ForsookComparison@reddit
Is it useless in Q8? (~28GB for Qwen3.6 27B) ?
If I have to use some 56GB just to load the model then suddenly 27B doesn't feel as exciting.
Anbeeld@reddit
No, it's not like that, peeps are producing quants where e.g. the entire model is Q4 but MTP is BF16 and everything works. It just gets tight quickly if you are on a single 3090 for example.
GrungeWerX@reddit
Am on a 3090TI. So, you're saying just skip this and keep it moving?
Anbeeld@reddit
It depends if you are on Windows or Linux. If on Linux, you can try it right now using vLLM + MTP. I tried it via Windows 11 + WSL2 which wasted just enough VRAM to make it all unviable. YMMV, might be skill issue.
I'm working on a decent alternative option right now, driven by existing ones not working well for me. :P
GrungeWerX@reddit
Great, I'll wait then. On Windows 10.
ForsookComparison@reddit
Ohhh that makes sense, thanks
SKirby00@reddit
That kinda glosses over the main thing that tripped me up for a long time: how "big model checks" is faster than just "big model make token". Here, let me try to clear up that tricky part:
Big model slow because takes long time to read all weights for make next token. Big model can make next token and other next tokens with only read weights one time, but each next token only good if token before is also good.
With small model guesses, big model has filler it needs for:
- make token n
- make token n + 1 (only good if token n = small model guess for token n)
- make token n + 2 (only good if token n + 1 = small model guess for token n + 1)
- (and so on ...)
... with only one time doing slow boring job reading model weights.
Big model can't make token n + 1 at same time as make token n if model doesn't already have guess value for token n. Small model pretty good at guessing, but not perfect. If small model make right guess for token n, big model can use the token n + 1 that it made at the same time. If small model make right guesses for token n and n + 1, big model can use new tokens it made for n + 1 and n + 2.
I know this isn't quite as super duper simple as the comment I'm replying to, but if you were confused (like me) by the part about it being faster for the big model to check than to make the next token, then hopefully this helps.
Key insight (not ELI5): it's not really faster in the sense that it's less computation to check than to produce, but rather in that it can do multiple checks in the same number of weight reads (the slowest part) as it would take to do just a single prediction.
But that's for speculative decoding. I think MTP is more like "big model guess next few tokens each cycle instead of just guess next token". I only just tried to learn this stuff today so I'm definitely not expert, but I think with MTP, the key difference is that "big model not need filler guess from small model to make token n + 1 at same time as big model make token n".
Someone who knows this stuff better can feel free to correct me if I'm wrong.
ParaboloidalCrest@reddit
Ok then please ELI5 what spec decoding was again? XD. Sounds similar.
DeepOrangeSky@reddit
I think that's the idea. It is basically like doing speculative decoding, except, instead of having to use a whole literal separate small model that you run in tandem with your main model, the main model just uses a small portion of itself to perform the function of what that separate small model would've done, to do the speculative decoding for itself.
So, an advancement/evolution of traditional speculative decoding, basically.
ParaboloidalCrest@reddit
But speculative decoding does work already without a draft model.
ilintar@reddit (OP)
Yes, but that's ngram based speculative decoding, which is a slightly different beast; it's basically a lookup cache for common token combinations :)
Silver-Champion-4846@reddit
Can they be combined for faster fastness?
ilintar@reddit (OP)
Possibly, there's a PR out there for chained specdec support.
Silver-Champion-4846@reddit
Specspecdec?
radlinsky@reddit
Lol! That is indeed a 5 year old level explanation
cibernox@reddit
Isn’t that just speculative decoding?
Orolol@reddit
It is, but baked inside the model during the whole training, so you have higher acceptance
cibernox@reddit
That sounds very interesting. Does it require new models or new ggufs of existing models?
audioen@reddit
The GGUF files have had the MTP heads typically stripped to save disk space (and to avoid llama.cpp warning that it isn't going to load the layer) so they will probably get updated for this.
I am going to run this PR right now, this is the most anticipated feature of llama.cpp of all time, at least for me. Ever since GLM-4.5 or such had it, and it was known to approximately double the generation rate... Probably becomes easily the biggest single performance improvement llama.cpp has ever had.
Orolol@reddit
You can't add it on existing models, but some models already have it, like Qwen 3.6 / 3.5
pmttyji@reddit
Do we have a list of models that come with this feature somewhere? It would be nice to be able to filter for this on HuggingFace.
stddealer@reddit
It's self-speculative.
Instead of having a whole smaller LLM to predict the next sequence of tokens sequentially, the model has multiple output heads for the final layer, trying to predict probabilities for the next few tokens in one shot, without accessing the last few tokens before it since they haven't been sampled yet.
Meaning in one forward pass, the model can:
- predict and sample the next token (like a normal autoregressive LLM)
- check if the drafted tokens from last pass match and can be accepted already (like speculative decoding)
- draft the sequence of next tokens to be checked in the subsequent pass
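A bare-bones picture of the "multiple output heads" part, in toy PyTorch. Real MTP modules (DeepSeek/Qwen style) are small extra transformer blocks rather than plain linear heads, so treat this as the interface idea only: one final hidden state in, logits for the next few tokens out.

```python
import torch
import torch.nn as nn

class ToyMTPHeads(nn.Module):
    def __init__(self, hidden_size=64, vocab_size=1000, n_future=3):
        super().__init__()
        self.main_head = nn.Linear(hidden_size, vocab_size)   # normal next-token head
        self.mtp_heads = nn.ModuleList(
            [nn.Linear(hidden_size, vocab_size) for _ in range(n_future - 1)]
        )                                                      # heads for t+2, t+3, ...

    def forward(self, last_hidden):                            # (batch, hidden_size)
        logits = [self.main_head(last_hidden)]
        logits += [head(last_hidden) for head in self.mtp_heads]
        return torch.stack(logits, dim=1)                      # (batch, n_future, vocab)

heads = ToyMTPHeads()
h = torch.randn(2, 64)             # pretend final hidden states for 2 sequences
print(heads(h).shape)              # torch.Size([2, 3, 1000])
```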
_bones__@reddit
So speculative decoding?
Silver-Champion-4846@reddit
Same spirit, different body
Silver-Champion-4846@reddit
All explainers should be like this. Caveman talk most efficient
AlarmingProtection71@reddit
This is more like an "Explain like I'm a caveman" but me like it. Me understand now.
Intelligent-Baker448@reddit
ooga booga, zoom zoom zoom
ilintar@reddit (OP)
C'mon, request was for ELI5 not ELI2 :P
Intelligent-Baker448@reddit
It's more like ELI 15,000 B.C.
Everyone is talking like cavemen to save tokens, I thought.
Ariquitaun@reddit
What size small model. Small like child or small like cabbage
ilintar@reddit (OP)
Big model like big human. Small model like small human.
simracerman@reddit
Or, “speculative decoding”. Unless MTP has some amazing leg up over traditional draft.
reery7@reddit
It can make dense models run 1.5-2.0x faster. It makes most sense for a single user local model. I think it’s not that big of a jump for concurrency.
Silver-Champion-4846@reddit
Can it improve qwen3.5-4b on cpu?
Unlucky-Message8866@reddit
now in beta? it's not even merged, still a draft
Beginning-Window-115@reddit
you can still beta test a draft
HiddenMushroom11@reddit
I converted the Q8 (https://huggingface.co/am17an/Qwen3.6-27B-MTP-GGUF) MTP to IQ4_XS w/ MTP, and it's super fast on dual 3060s. Thanks for the post OP!
HiddenMushroom11@reddit
One thing to note is I couldn't get vision (mmproj) working.
Beginning-Window-115@reddit
could you upload it to huggingface please
coder543@reddit
This seriously has the potential to be the biggest game changer llama.cpp has ever seen.
I think MTP will make the biggest difference for dense models, maybe not so much for MoEs, but it will still be exciting.
Orolol@reddit
Yeah on vllm Qwen 27b goes from 55 to 105 tok/s.
apeapebanana@reddit
teach me sensei!! so far getting 30~40 tok/s...
slow but still great to work with!
Orolol@reddit
uv run vllm serve Lorbus/Qwen3.6-27B-int4-AutoRound --max-model-len "131728" --gpu-memory-utilization "0.93" --attention-backend flashinfer --language-model-only --kv-cache-dtype "fp8_e4m3" --max-num-seqs "16" --skip-mm-profiling --quantization auto_round --reasoning-parser qwen3 --enable-auto-tool-choice --enable-prefix-caching --enable-chunked-prefill --tool-call-parser qwen3_coder --speculative-config '{"method":"mtp","num_speculative_tokens":3}'
On a 5090
Ok_Brain_2376@reddit
Can I make the assumption that this is from the base vLLM? No need to find some random PR's build? (Been struggling to run Qwen dense models at 100+ tps for a while.)
DominusIniquitatis@reddit
(Did you mean to use 131072, by the way?)
Silver-Champion-4846@reddit
This be sorcery. Lol
apeapebanana@reddit
so much this. I was pretty proud of my llamacpp setup until spotting the whispering of high tok/s and vllm scripts.
Silver-Champion-4846@reddit
So many args
takoulseum@reddit
Do you use it for parallel requests?
Orolol@reddit
Yeah I use subagents for coding
rerri@reddit
I am seeing very similar numbers on llama.cpp with this PR on a 5090.
Orolol@reddit
Great! It still lacks prefix caching though.
coder543@reddit
What do you mean by this? llama-server has supported checkpointing for these Qwen3.x models for weeks now, which is the way that prefix caching works for these hybrid attention models?
Orolol@reddit
I didn't check for weeks, but last time the checkpointing was quite fuzzy. I have a long context reasoning benchmark (https://github.com/Orolol/familyBench) that reuses a very long context, and llama.cpp was giving me horrible performance while vLLM could handle 16 concurrent requests with 0 prefill and 2k toks/s
Maybe it has improved since, i'll retest
StorageHungry8380@reddit
Just on the off chance you missed it, did you bump the cache size? It's quite small by default, 8GB, so will get trashed if you have multiple long context prompts. I bumped mine up to 48GB and it was a significant improvement for my use-case.
ego100trique@reddit
Doesn't it duplicate the size of the model in vram though?
coder543@reddit
The KV Cache might be twice the size, but not the model.
ego100trique@reddit
Oh ok ok thank you
Orolol@reddit
A small overhead. The MTP part of the model is quite small.
LagOps91@reddit
should also be a big difference for MoE models hopefully. could make hybrid inference much more viable.
oxygen_addiction@reddit
On a 12GB 4070RTX Super, Qwen3.6-35B-A3B-Q4_K_XL went from 49tk/s to 55tk/s with MTP (despite the MTP model being 900mb bigger)
am17an@reddit
I just tested MoE models out, on my DGX spark Qwen35A3B went from 53 toks/second to ~70-75 toks/sec. So you're right, not as much for MoEs as dense
coder543@reddit
What kind of task? I find that specdec is more effective at tasks like "write a react typescript example" than at tasks like "what is the LHC?".
am17an@reddit
Here is my super comprehensive benchmark
https://gist.github.com/am17an/228edfb84ed082aa88e3865d6fa27090
coder543@reddit
Isn't that showing MTP losing to the external draft model? That seems odd.
am17an@reddit
It may lose, but it won't be super consistent, because the draft model is more powerful but requires more VRAM. And I did `--spec-draft-n-max` 16 there which requires a lot of memory for the partial rollback. If you're VRAM rich then the draft model is pretty good already.
lolwutdo@reddit
That's still a decent increase
InuRyu@reddit
I just learned a little bit about MTP. So from what I know, this is only useful if the acceptance rate is high, for predictable tasks like coding, it is good, but what about tasks such as RL or writing a novel? The smaller model would behave very differently than the bigger model expects for creative tasks, so the acceptance rate will be really low.
Pro-Row-335@reddit
I go from 45 to 80 tk/s with this on Japanese->English translation, and it was consistently at 80. I pasted a 150k token paper and asked it to evaluate it and it went from 80 to 71. I believe you are confounding MTP with ngram
InuRyu@reddit
oh lol, yeah you're right. After I read a bit more in the comments, I saw someone explain all the concepts (MTP, ngram, DFlash, etc.) and it seems like I confused MTP with speculative decoding
No_Block8640@reddit
Can someone explain to me how to run models in llama.cpp? I've tried to install this branch and offloaded expert layers to CPU just like in LM Studio, but I usually get 25-30 t/s in LM Studio vs 14 t/s using this branch of llama.cpp. Maybe my flags are wrong? (Qwen 3.6 35ba3b, rtx3080 and cpu)
Pro-Row-335@reddit
make sure you have a gguf with MTP in it, then --spec-type mtp --spec-draft-n-max 3
No_Block8640@reddit
I suppose unsloth doesn’t have mtp layers? How to find a gguf with mtp layers?
Pro-Row-335@reddit
It doesn't, I just downloaded this and it worked, you can find others by searching the name on huggingface with MTP
https://huggingface.co/brittlewis12/Qwen3.6-27B-MTP-GGUF/tree/main
rerri@reddit
Nice! This seems to be way faster than ik_llama.cpp implementation. Been playing with that the past couple of days.
AzerbaijanNyan@reddit
Very nice, 55-60 t/s at around 80K context on two Mi50s and the 35B-A3B in Q4.1. Not the smartest but works rather well if given a good plan to follow.
shuwatto@reddit
This is huge, thanks a lot!
oxygen_addiction@reddit
Thanks a lot. I used it to make Qwen3.5-4B-Q6_K_L-MTP and it works great.
thoquz@reddit
Brilliant! What's the memory requirement for the MTP layer?
rerri@reddit
I'm seeing ~3.1 GB more VRAM used when comparing MTP to no-MTP and using 128K ctx length, kv q8_0.
At 16K ctx length, the difference is still pretty big at ~2.7 GB.
coder543@reddit
I wonder if it is allocating a separate, draft KV cache for the MTP heads? I didn't think that was needed for MTP.
rerri@reddit
am17an writes in the PR: "it has it's own context/kv-cache etc."
StupidScaredSquirrel@reddit
I have to say it's hard to complain about prices going up when my same hardware becomes so much more capable every month for free.
Travnewmatic@reddit
I've had this same thought so many times over the past few months
ketosoy@reddit
You’re a better man than me, I still manage to be upset by the prices.
StupidScaredSquirrel@reddit
Not better, not a man, but I do shake my fists whenever I see the price of new hardware lol. It's just that every time I see stuff like this post or run qwen3.6, I remember how lucky I am and that I didn't really expect this of my pc 2 years ago
Silver-Champion-4846@reddit
Lucky you, but I can't even train a 1 million param tts model. Not even 1 million! Theoretically useless, but my cpu and 8gb ram says "nerrrp"
StupidScaredSquirrel@reddit
You can still run some sub 4b models to do plenty of stuff, local audio transcription, tts, fill-in-the middle for coding, boilerplate email assistant, etc
Silver-Champion-4846@reddit
4b models using Jan (runs on llama.cpp) gives me 3-4tps. Not very usable unless someone invents a small agent harness (small lm friendly)
StupidScaredSquirrel@reddit
What quant are u using? Are you compute bound or memory bandwidth bound?
Silver-Champion-4846@reddit
I'm bothbound. Cpu is Core I5 8th gen U type (lo-power mobile cpu, 4 cores 8 threads), ram 8gb single channel ddr4 ram, storage 256gb. My latitude5590 is upgradable to max 32gb ram 1tb storage. That means potential for a lot more context, a model that uses Gemma4 PerLayer embeddings more extensively, or suffering with Qwen3.5 9B, as cpu stays the same.
StupidScaredSquirrel@reddit
Try marco nano or even mini if u can fit, I edited my previous comment
pmttyji@reddit
Hope we get everything from the below thread (and its comments) soon, or by end of this quarter.
Compilation of recent findings which could save some memory or increase performance
No_Conversation9561@reddit
Does it improve prefill speed too or only decode?
TheTerrasque@reddit
One reported halving prefill speed when this was active, from ~1200 to ~600
lolwutdo@reddit
Damn, I'd rather have faster PP than TG
ilintar@reddit (OP)
Only decode. For prefill you need matmul kernel optimizations.
Apart_Boat9666@reddit
Just tried the 35B model with MTP on my current setup: a 12GB RX 6700 XT. With my old config it was already offloading some layers to RAM. After enabling MTP it needed around 3 GB extra VRAM, so my "--n-cpu-moe", "23" dropped to "--n-cpu-moe", "36", and it was slower than before. So if your setup already needs offloading, MTP is probably not worth it.
Ok_Warning2146@reddit
Thanks for your info. I think MTP is mainly for llama.cpp to catch up with sglang and vllm on pure nvidia platform.
tarruda@reddit
Is this only for 3.x dense models or does it work with MoEs too?
oxygen_addiction@reddit
Works with MoE models that still retain the MTP head. I transplanted one over to an unsloth quant and it works fine.
unjustifiably_angry@reddit
MoE tends to see lesser or even negative benefit; if you're already operating with a very small active parameter count, going any lower gets exponentially dumber outputs and accordingly worse acceptance rates.
tarruda@reddit
So maybe it will be worth it for the 122B and 397B (if 3.6 for those are released)
unjustifiably_angry@reddit
Likely for 122b, but most certainly for 27b.
ilintar@reddit (OP)
Should work with MoE but I guess it'll need the MoE MTP model support as well.
Ok_Warning2146@reddit
Wow. That's big news. Finally the last piece of puzzle that puts it on par with sglang and vllm
Charming-Author4877@reddit
A draft is not a beta. Can't wait to have this implemented.
ilintar@reddit (OP)
I'm saying this is a beta because my gut feeling tells me that this is close to the production version :)
itsappleseason@reddit
y'all are really downvoting king deltanet
ilintar@reddit (OP)
I'm just the messenger here ;)
Top-Rub-4670@reddit
A messenger conveys the message as is. Here, you've made up the message "It's now in beta" when it's just a PR, and a draft one at that.
Pyrolistical@reddit
Doesn’t work on vulkan yet
feckdespez@reddit
It's still not a beta though. Just a draft and PR for it.
EveningIncrease7579@reddit
Really awesome! Any results on a single 3090? I'll extract layers from the original GGUF (from author in PR) to a quantized one and try it in the new llama.cpp. I'll try it at home soon...
EveningIncrease7579@reddit
Made some tests, good results but not very promising....
Prompt: Make a flappy bird in html css and js.
With draft-max 2 (starts slow (30 tk/s), but increases to 44~45 after some seconds)
./build/bin/llama-server \
  -m /mnt/hd_geral_1tb/lucebox-hub/dflash/models/qwen3.6-27b-unsloth-Q4_K_M-MTP.gguf \
  --host 0.0.0.0 \
  --port 9090 \
  -ngl 999 \
  -np 1 \
  --no-mmap \
  --no-cache-prompt \
  -fa on \
  --cache-type-k q8_0 \
  --cache-type-v q8_0 \
  --ctx-size 32768 \
  --webui \
  --spec-type mtp \
  --draft-max 2 \
  --chat-template-kwargs '{"preserve_thinking":true}'
prompt eval time = 145.91 ms / 21 tokens ( 6.95 ms per token, 143.93 tokens per second)
eval time = 214175.96 ms / 8041 tokens ( 26.64 ms per token, 37.54 tokens per second)
total time = 214321.87 ms / 8062 tokens
draft acceptance rate = 0.34294 ( 3271 accepted / 9538 generated)
statistics mtp: #calls(b,g,a) = 1 4769 2177, #gen drafts = 4769, #acc drafts = 2177, #gen tokens = 9538, #acc tokens = 3271, dur(b,g,a) = 0.001, 24194.822, 0.311 ms
Without, using original unsloth q4_km
./build/bin/llama-server \
  -m /mnt/hd_geral_1tb/lucebox-hub/dflash/models/Qwen3.6-27B-Q4_K_M.gguf \
  --host 0.0.0.0 \
  --port 9090 \
  -ngl 999 \
  -np 1 \
  --no-mmap \
  --no-cache-prompt \
  -fa on \
  --cache-type-k q8_0 \
  --cache-type-v q8_0 \
  --ctx-size 32768 \
  --webui
prompt eval time = 153.83 ms / 21 tokens ( 7.33 ms per token, 136.52 tokens per second)
eval time = 154049.44 ms / 6051 tokens ( 25.46 ms per token, 39.28 tokens per second)
total time = 154203.27 ms / 6072 tokens
Using qwen 0.8b as a draft (using 27B Q4 mtp gguf converted)
./build/bin/llama-server \
  -m /mnt/hd_geral_1tb/lucebox-hub/dflash/models/qwen3.6-27b-unsloth-Q4_K_M-MTP.gguf \
  -hfd unsloth/Qwen3.5-0.8B-GGUF:Q8_0 \
  --host 0.0.0.0 \
  --port 9090 \
  -ngl 999 \
  -ngld 999 \
  -np 1 \
  --no-mmap \
  --no-cache-prompt \
  -fa on \
  --ctx-size 32768 \
  --ctx-size-draft 32768 \
  --webui \
  --draft-max 16 \
  --chat-template-kwargs '{"preserve_thinking":true}'
prompt eval time = 121.30 ms / 21 tokens ( 5.78 ms per token, 173.12 tokens per second)
eval time = 231306.58 ms / 6907 tokens ( 33.49 ms per token, 29.86 tokens per second)
total time = 231427.88 ms / 6928 tokens
draft acceptance rate = 0.81802 ( 7898 accepted / 9655 generated)
Using qwen 0.8b as a draft (using 27B gguf unsloth original K_M)
./build/bin/llama-server \
  -m /mnt/hd_geral_1tb/lucebox-hub/dflash/models/Qwen3.6-27B-Q4_K_M.gguf \
  -md /mnt/external/models/unsloth/QWEN3.6-27B/Qwen3.5-0.8B-Q8_0.gguf \
  --host 0.0.0.0 \
  --port 9090 \
  -ngl 999 \
  -ngld 999 \
  -np 1 \
  --no-mmap \
  --no-cache-prompt \
  -fa on \
  --ctx-size 32768 \
  --ctx-size-draft 32768 \
  --webui \
  --draft-max 16 \
  --chat-template-kwargs '{"preserve_thinking":true}'
prompt eval time = 225.99 ms / 21 tokens ( 10.76 ms per token, 92.92 tokens per second)
eval time = 192988.51 ms / 6243 tokens ( 30.91 ms per token, 32.35 tokens per second)
total time = 193214.51 ms / 6264 tokens
draft acceptance rate = 0.84039 ( 7145 accepted / 8502 generated)
EveningIncrease7579@reddit
In the first scenario, using draft-max 1 (sometimes it gets 50 tk/s)
prompt eval time = 220.85 ms / 21 tokens ( 10.52 ms per token, 95.09 tokens per second)
eval time = 138005.83 ms / 6149 tokens ( 22.44 ms per token, 44.56 tokens per second)
total time = 138226.67 ms / 6170 tokens
draft acceptance rate = 0.59151 ( 2285 accepted / 3863 generated)
In the first scenario, using draft-max 3 gives a median of 30 tk/s
zenmagnets@reddit
Except for high concurrency output.
TheTerrasque@reddit
I was thinking about this a long time ago, that gguf should have generic support for multiple models. At that time I was thinking especially draft models, but also vision encoders and possibly other encoders / decoders / model types at some point. And image diffusion models with llm's and vae's included as another example.
natermer@reddit
Doesn't appear to be supported on Vulkan or CUDA yet. Which is too bad. Hopefully that will come along eventually as well.
The feature report points to: https://github.com/ggml-org/llama.cpp/pull/22400
ilintar@reddit (OP)
It's supported on CUDA, Vulkan support needs a patched GDN kernel.
natermer@reddit
Yeah sorry, I meant ROCM.
ilintar@reddit (OP)
(Actually, CUDA support ALSO needs the patched GDN kernel, which is in another PR - you have to read the thread for details)
waywardspooky@reddit
does this mean this improvement will trickle its way down into unsloth studio as well?
Fedor_Doc@reddit
It uses llama.cpp as a backend, so yes, it will trickle down.
nok01101011a@reddit
You think talkie-1930 can also be patched to be working on official llama?
dampflokfreund@reddit
So is this only useful for dense models? If so, does it help with partial offloading?
StupidScaredSquirrel@reddit
Eli5 why is this only useful for dense models? Doesn't it work for a3b just to a much lesser degree?
dry3ss@reddit
From what I've read here, MTP is only really useful with MoE if you have a lot of parallel execution, because it relies on most of the experts being available, so you come back to a "dense" model that uses all its weights.
That explanation does seem weird with qwen3.6 35-a3b, which is supposed to have dedicated MTP heads, so if anyone is more knowledgeable don't hesitate to share!
petuman@reddit
To verify speculated tokens the engine has to schedule parallel/batched completions for each speculated token (e.g. 4 completions for 3 speculated tokens). Those completions would be routed to different experts, activating more weights than a single completion.
If you're serving 100 users/batched requests, then ~all experts are being activated anyway. If you're a single user, then that's more experts activated per request (=> more memory bandwidth usage => slower generation)
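Rough back-of-envelope, assuming uniform routing (which real routers definitely aren't) and made-up numbers (128 experts, 8 routed per token), just to show how quickly the active-expert count grows when several positions are verified in one pass:

```python
def expected_unique_experts(n_experts, experts_per_token, n_tokens):
    """Expected number of distinct experts touched when each of n_tokens positions
    independently picks experts_per_token experts uniformly at random."""
    p_untouched = (1 - experts_per_token / n_experts) ** n_tokens
    return n_experts * (1 - p_untouched)

for n in (1, 2, 4, 8):
    print(n, round(expected_unique_experts(128, 8, n), 1))
# 1 -> 8.0, 2 -> 15.5, 4 -> 29.1, 8 -> 51.6: more tokens per pass, more weights to stream
```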
dry3ss@reddit
Ahhh yes, it's for verification, not for computing the MTP, that the experts are required. Thanks, that makes sense!
Farmadupe@reddit
I think that's right? If I understand it right, the main model still has to confirm all of the predicted tokens by doing exactly the same forward passes it was going to do anyway.
Let's say a 10 token sequence. Without mtp and with mtp (100% accept rate):
With an MoE model, the maths is slightly different. Each token only loads 3 billion params, but you don't know which ones they are:
So in this hypothetical situation, the dense model gets a massive speedup from mtp, but the moe gets almost none. You would actually get some speedup if some of the same experts were pulled in, but nowhere near as much.
coder543@reddit
Because MTP helps during training, and because anyone serving a model in production will be batching large numbers of user requests together, activating all experts with every forward pass anyways.
am17an@reddit
I just updated the PR to also use Qwen3.6 MoE. It results in a 30-40% speed-up in my tests.
StupidScaredSquirrel@reddit
No way! Link?
am17an@reddit
https://huggingface.co/am17an/Qwen3.6-35BA3B-MTP-GGUF
StupidScaredSquirrel@reddit
Wait, I'm stupid, I thought it was a change in llama.cpp, not in the gguf file. Don't the gguf files already have the mtp layers, just not leveraged by llama.cpp before the merge request?
rerri@reddit
It is a change in llama.cpp, the PR (link in OP) was updated. Old GGUF models of Qwen 3.5/3.6 do not include the MTP layer.
StupidScaredSquirrel@reddit
Thx. I'm memory poor so I guess I'm gonna have to make my own gguf with heavy quantisation but keeping mtp at 16 bits. We'll see how that goes.
rerri@reddit
This script might offer a shortcut if you are planning to use the 27B or 35B models: https://gist.github.com/buzz/1c439684d5e3f36492ae9f64ef7e3f67
It allows you to transplant the MTP from am17an's GGUF's onto whatever old GGUF of those models you already have.
Someone made it for ik_llama.cpp originally, but it seems to work fine with llama.cpp too.
StupidScaredSquirrel@reddit
Thank you so much!
Ueberlord@reddit
When doing inference with a3b you are already only using 3b active parameters, so to see any benefit you'd probably need to go down to a 0.6b draft model, which will most likely have bad acceptance rates; and since the difference to 3b is not big at all, the speed up is limited.
When using a 2b or 0.6b model as drafter for 27b the difference in active parameters is huge and we should see meaningful speed up, especially for tasks with higher acceptance rates like coding or structured outputs.
So in essence it works to a lesser degree but I think it is hardly meaningful for moe (unless something like 397b a27b).
StupidScaredSquirrel@reddit
Mtp in question doesn't rely on an external draft model though hence my question
gcavalcante8808@reddit
I've been using the original MTP since the first qwen 3.5 models were released, since they are a bit slower than the older qwen3 models, and they are really good! I also discovered that qwen3-coder-next also supports MTP and it is flying on my machine, even with the vulkan backend.
I'm very fond of MTP as a speculative and simplified method! Really nice to see the support becoming official.
ea_man@reddit
I take it this would be an opt-in with a flag like --mtp, so that those of us with small VRAM who won't be able to run MTP anyway (also single-user prompting) don't have to load an extra heavy MTP layer?
Due_Net_3342@reddit
does this also work with step 3.5 flash? or only qwen models?
OsmanthusBloom@reddit
Cool! But will enabling MTP increase VRAM usage for, say, Qwen3.6-27B? Does it still fit into 16GB VRAM if you squeeze hard enough?
rerri@reddit
The MTP layer of am17an's model is ~440MB. Can maybe be quantized further, dunno.
OsmanthusBloom@reddit
Thanks. The PR says "it has it's own context/kv-cache etc" so I assume that some VRAM will be needed for that as well.
rerri@reddit
Yes, I'm seeing ~3.1 GB more VRAM used when comparing MTP to no-MTP and using 128K ctx length.
At 16K ctx length, the difference is still pretty big at ~2.7 GB.
Not very favorable for 16 GB VRAM :/
pmttyji@reddit
Oops, I thought of trying on my 8GB VRAM 😄
OsmanthusBloom@reddit
Thanks a lot for this. Well, there goes my dream. Let's hope for a few more miracles to happen.
Dany0@reddit
Quantising the MTP layer has so far always turned out to be a very, very bad idea
pmttyji@reddit
Nice. Sorry for the dumb question. So this requires the GGUFs mentioned in the PR? Regular GGUFs won't work?
ilintar@reddit (OP)
I think any GGUFs without stripped MTP layers should work.
TheGlobinKing@reddit
Noob question: how do I know if a GGUF I downloaded has MTP layers?
Anbeeld@reddit
There's this complex issue that if MTP is quantized it all goes to shit, which is why people use that specific Lorbus quant with vLLM.
michaelsoft__binbows@reddit
Ok but can you like, elaborate on how that impacts gguf's...
ilintar@reddit (OP)
Standard quantization schemes will quantize certain tensors based on their role in the graph, so e.g. all ffn_up tensors get quantized to Q4_K. However, since an MTP layer is small, but you want as few rejections from it as possible, you want it at higher quality than the other layers. Existing GGUFs probably don't have that.
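If you want to check what a given GGUF actually contains and how its tensors are quantized, here's one quick way, assuming the `gguf` Python package that ships with llama.cpp's conversion scripts. The filename is just a placeholder, and MTP tensor names vary by architecture, so just eyeball anything that doesn't look like a regular layer:

```python
from gguf import GGUFReader   # pip install gguf

reader = GGUFReader("Qwen3.6-27B-MTP-Q4_K_M.gguf")   # placeholder filename
for t in reader.tensors:
    # t.tensor_type is the per-tensor quantization (e.g. Q4_K, F16, BF16)
    print(f"{t.name:50s} {t.tensor_type.name:8s} {list(t.shape)}")
```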
michaelsoft__binbows@reddit
Thanks for clarifying. I've already got the lorbus quant working in vllm, regularly hitting 120tok/s on my 5090. It does sound very reasonable that new quants with unquantized MTP layers will be needed for MTP to deliver the goods on llamacpp. It will be very nice for closing the gap and hopefully we will have a proliferation of quants to choose from.
The thing about vllm is I don't think it lets you squeeze models into limited vram anywhere near as well as llamacpp does. So MTP in llamacpp could be a massive game changer for hosting the 27B class on a 3090 for example, though there seem to already be some ways to squish it in with vllm as well. But maybe it could do it with higher quality!
ilintar@reddit (OP)
Yeah, the MTP layer should probably be left as BF16.
lolwutdo@reddit
Nice, I wonder how much speed up 27b would get with partial cpu offloads.
LagOps91@reddit
damn you guys are fast, was just about to make a post for this
autonomousdev_@reddit
yo so i spent like a week messing with mtp and chained agents n stuff. batch stuff went way faster like 40% but latency got real weird after 4 tokens lol. had to dial it back to 2 for production. but for basic rag stuff it just works no complaints
bonobomaster@reddit
Holy smokes... this year keeps on giving!
ilintar@reddit (OP)
Same arch.