Llama.cpp MTP support now in beta!
Posted by ilintar@reddit | LocalLLaMA | View on Reddit | 231 comments
Happy to report that llama.cpp MTP support is now in beta, thanks to Aman (and all the others that have pushed the various issues in the meantime). This has the potential to actually get merged soon-ish. Currently contains support for Qwen3.5 MTP, but other models are likely to follow suit.
Between this and the maturing tensor-parallel support, expect most performance gaps between llama.cpp and vLLM, at least when it comes to token generation speeds, to be erased.
Synor@reddit
Not beta at all as it's not even in the main code yet.
And for some testers it does not work at all.
MisticRain69@reddit
Tried the PR with Vulkan but no luck, just errors. Hope they get it stable soon
Thomasedv@reddit
I'd love a breakdown of the speculative methods and which to choose, and the pros/cons of each. It's quite hard to find out.
MTP (multi token prediction), Eagle-3, DFlash, DTree, ngram? Some needs extra draft models, some do not, some are better suited for "reusing" context like ngram I think.
Anyone got a comparison somewhere or willing to create one?
pkmxtw@reddit
All of them work on the principle of generating draft tokens cheaply and then verifying with the full model. The main difference is how those draft tokens are generated.
N-gram: Match strings already present in the context. Pros: extremely fast to compute, works on any model. Cons: only good for applications where the data is repeated verbatim, like coding.
Draft model: Use a small model of the same family to quickly generate tokens. Pros: easy to implement (just run two models concurrently). Cons: requires a matching model, acceptable rate depends on how well they match.
MTP: The full model itself is pre-trained to output draft tokens on auxiliary heads. Pros: potentially the best. Cons: requires the model to be trained for it.
Eagle3: This is kinda like MTP except that it is bolted on to a pre-trained model. Pros: good speed-up and likely the widely-used SOTA technique. Cons: you need to spend $$$ to train the eagle3 model.
DFlash: Use a block diffusion model for prediction. Pros: speed goes brrr (if you have the compute). Cons: same issue with eagle3, still new and experimental.
Basically it comes down to what your engine / model supports, and how much leftover compute you have. My pick would be:
Silver-Champion-4846@reddit
Can you combine Eagle3 with MTP?
Chromix_@reddit
That might not make a lot of sense, unless you want to make both training and inference even more expensive, for the chance of getting a correct prediction when either model guesses correctly.
Combining with N-gram can make sense though, to skip the speculation model in specific situations and gain some added speed.
Silver-Champion-4846@reddit
Has it been done before?
Chromix_@reddit
It apparently works just fine:
llama-server -m large.gguf -fa on -ngl 99 -md small.gguf -ngld 99 -c 20000 -np 1 --spec-type ngram-mod

This was for a case where I asked to make a small change in an 8k token code file.
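If you're curious what the ngram part actually does, here's a toy sketch of the idea (not llama.cpp's actual implementation, just the principle): match the tail of the context against earlier text and propose whatever followed it last time.

```python
def ngram_draft(context: list[int], n: int = 3, max_draft: int = 8) -> list[int]:
    """Propose draft tokens by finding an earlier occurrence of the last n tokens
    and copying whatever followed them. Toy illustration only."""
    if len(context) <= n:
        return []
    key = tuple(context[-n:])
    # Scan backwards for an earlier occurrence of the same n-gram.
    for i in range(len(context) - n - 1, -1, -1):
        if tuple(context[i:i + n]) == key:
            start = i + n
            return context[start:start + max_draft]
    return []

# The trigram (1, 2, 3) appeared earlier, so we propose its old continuation.
print(ngram_draft([5, 1, 2, 3, 9, 9, 7, 1, 2, 3]))  # -> [9, 9, 7, 1, 2, 3]
```

The proposed tokens are then verified by the big model like any other speculative draft, which is why this shines on "edit this file" type prompts where most of the output already exists in the context.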
Silver-Champion-4846@reddit
I had to spell that letter by letter because screen reader reads the whole thing like a spell from the dark ages. Forget the dashes and that little pause you think of when you see a new arg. It's all llama server M large dot gguf fa (fa as in father) on ngl ninety-nine c twenty thousand md small dot gguf ngld ninety-nine etc etc
perkia@reddit
Interesting. I'd have thought a short inlined
markup (correctly used in the parent comment) would automagically make the screen reader pay more attention to syntax, not less, but I guess that's accessibility for you... is there a special aria attribute to add to HTML tags that would help consider the content a function or command call?
Thomasedv@reddit
Thanks, that's a good summary. Now to research what works, seems like most of these aren't that good for MoE models. Qwen 3.6 35B A3B is blazing fast already but I'd be very amazed if I could make it even faster. But so far, tests imply most speculative decoding slows it down.
Look forward to trying these on 27B when I get time.
pkmxtw@reddit
Yeah, MoE models usually have small enough active parameters that you won't gain as much.
--spec-type ngram-mod does still provide some speed-up if you have repeating text, and it is close to a free lunch.
unjustifiably_angry@reddit
dflash, from what I've read, can be anything from a huge speedup to a huge speed loss depending on the model and the hardware it's running on.
Anbeeld@reddit
I mean the main problem is that you don't have much choice, not how to make the choice. There's only a handful of inference implementations for e.g. Qwen 3.6 that support prediction without breaking everything else, vLLM being the main one probably with MTP?
I'm working on adding one more option to the list right now by the way, stay tuned.
oxygen_addiction@reddit
Lay off the cyber psychosis, buddy.
Anbeeld@reddit
What's wrong, exactly?
No_Weather8173@reddit
MTP: Self explanatory really. The model is trained to predict several future tokens, not just the next one. At inference those extra predictions can be used as built-in draft tokens. You don’t need a separate draft model, but you do need a base model trained with MTP heads, so you can’t just enable it for every model.
EAGLE-3: This is speculative decoding with a learned draft model/head. EAGLE uses hidden-state information from the target model to draft likely future tokens more accurately than a tiny standalone draft model. EAGLE-3 improves that by using multiple features from the target model and directly predicting tokens. It needs a compatible trained EAGLE checkpoint and backend support.
DFlash is also a learned drafter, but instead of autoregressively drafting token by token, it uses a block diffusion-style drafter to propose a whole block of tokens more in parallel. The goal is to make drafting cheaper for longer candidate blocks. Very promising, but more specialized and newer than EAGLE methods.
DTree: Basically an extension of DFlash. DFlash gives you distributions over possible block tokens. DDTree uses those distributions to build a tree of likely continuations, then verifies the tree with the target model. This is useful because if the single best draft path fails early, another branch may still be accepted. IMO this one has huge potential.
N-gram: This one doesn't need a draft model; it just looks for repeated patterns in the prompt or generated context and proposes the continuation that followed the same n-gram earlier. It’s great for code editing, summarization, RAG, etc. It’s weak for novel generation because it has no semantic understanding, it’s mostly exploiting repetition.
Ideally you'd try cheap draftless methods first, like n-gram when there’s a strong context match, then fall back to a learned method like DDTree or EAGLE-3. Even better would be to merge them, put an n-gram continuation as a high-priority branch in the draft tree, and fill the rest of the tree with your DDTree candidates.
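To make the tree part concrete, here's a hand-wavy sketch of turning per-position draft distributions into a handful of candidate paths (made-up toy data, not any particular implementation):

```python
import heapq

def build_draft_paths(draft_dists, top_k=2, max_paths=4):
    """Expand the top_k tokens at each draft position into candidate paths through
    a token tree, keeping only the max_paths most probable continuations.
    draft_dists: one dict of token -> probability per draft position."""
    paths = [(1.0, [])]
    for dist in draft_dists:
        best = heapq.nlargest(top_k, dist.items(), key=lambda kv: kv[1])
        paths = [(p * prob, path + [tok]) for p, path in paths for tok, prob in best]
        paths = heapq.nlargest(max_paths, paths, key=lambda x: x[0])
    return sorted(paths, key=lambda x: x[0], reverse=True)

# Toy distributions for two draft positions.
dists = [{"the": 0.6, "a": 0.3, "an": 0.1}, {"cat": 0.5, "dog": 0.4, "car": 0.1}]
for p, path in build_draft_paths(dists):
    print(round(p, 2), path)   # 0.3 ['the', 'cat'], 0.24 ['the', 'dog'], ...
```

The target model then verifies all of these branches in one batched pass, so even if the top branch gets rejected at its second token, a sibling branch might still be accepted. Slotting an n-gram continuation in as one more high-priority branch is exactly the kind of merge I meant above.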
No_Algae1753@reddit
https://gist.github.com/am17an/228edfb84ed082aa88e3865d6fa27090
phhusson@reddit
This, plus have an up-to-date list of which inference framework supports which speculation
It sounds like a very hard task tbh since it is moving continuously
radlinsky@reddit
Can someone ELI5 what MTP is and what this means?
Baldur-Norddahl@reddit
Models predict the next token. To do so, every weight needs to be accessed once (for a dense model). Therefore the maximum rate of tokens generated is equal to the number of times per second the total of the model weights can be read from RAM. For example if RAM bandwidth is 500 GB/s and the model is 50 GB, we can never generate more than 10 tokens per second. Usually it is even slower, but that would be the theoretical max.
Now let's say we generate tokens for multiple unrelated prompts. We can read the weights once and do all the prompts in parallel. Each time the total of the weights gets processed, we would generate X tokens instead of just one. Instead of 10 tokens per second, we could do 100 by processing 10 users in parallel. The limit becomes compute instead of bandwidth.
That is all good, but it doesn't help a single user/prompt. But what if we get a guess on the next token and then process the current context in parallel with the context + the guess. Then we check if the guess was correct. If it was, then we already calculated the next next token and we got two for the price of one. If the guess is wrong, then the calculated next next token is also wrong and we need to discard it.
To make the guess we can use a smaller model. Usually 10 times smaller, because it must be much faster than the main model. MTP is usually a term used for main models that have built in guess generators. It has a few layers that will produce the guess alongside the actual next token.
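If it helps, here's a toy end-to-end version of the guess-and-check loop. Everything is fake (the "big model" just memorises a target string and only looks at position, the "small model" deliberately flubs one spot), but the accept/reject logic is the real idea:

```python
TARGET = list("hello world!")            # what the "big model" wants to write

def big_next_at_each_pos(seq):
    # One expensive forward pass: for every prefix of seq, the big model's next token.
    # (This toy ignores the actual tokens and only looks at position.)
    return [TARGET[i + 1] if i + 1 < len(TARGET) else "." for i in range(len(seq))]

def small_guess_next(seq):
    # Cheap guesser: agrees with the big model except it flubs position 6.
    i = len(seq)
    if i == 6:
        return "?"
    return TARGET[i] if i < len(TARGET) else "."

def speculative_step(context, n_draft=4):
    draft = []
    for _ in range(n_draft):                             # small model drafts cheaply
        draft.append(small_guess_next(context + draft))
    preferred = big_next_at_each_pos(context + draft)    # one big-model pass
    accepted = []
    for i, tok in enumerate(draft):                      # keep drafts up to first mismatch
        if tok == preferred[len(context) - 1 + i]:
            accepted.append(tok)
        else:
            break
    bonus = preferred[len(context) - 1 + len(accepted)]  # big model's own token: free
    return context + accepted + [bonus]

seq = list("hell")
while len(seq) < len(TARGET):
    seq = speculative_step(seq)
    print("".join(seq))   # "hello w", then "hello world!": 8 new tokens in 2 big passes
```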
ilintar@reddit (OP)
Not exactly ELI5 but a technically very good explanation :)
superdariom@reddit
Big wise bear can find his way through the woods faster when helped by little bear, who is quicker and more nimble but sometimes makes mistakes when leading big bear. But together they make a better team than either one on their own.
vick2djax@reddit
Isn’t this basically MoE except with MoE, the little bear tells the big bear where to go in the woods and big bear doesn’t check little bear’s direction?
No_Afternoon_4260@reddit
Not at all.
MoE is like you slice each layer. When you start a layer, a router decides which slice to activate. Thus MoE models come with an indication of the number of active parameters.
A model like deepseekv4 flash comes with 284B total params but only activates 13B of these for each token.
Large ram foot print for large knowledge and capabilities, but small compute footprint for runtime efficiency.
MTP is more like speculative decoding. Not sure how it's different besides having the smaller weights embedded in the big model?
Cast-Iron_Nephilim@reddit
Sooo, many little bears, and the group goes with whichever bear feels the most confident about the current bit of forest?
sergeant113@reddit
Yes, many little bears, but an elder bear decides which little bear gets to dictate the next step. Every step could potentially be decided by a different little bear.
Sometimes the elder bear gets lazy or plays favoritism and keeps choosing a particular little bear, but i digress.
z_latent@reddit
It's a stretch, MoE does not tell big bear where to go, just how to decide where to go, in a more internal way. Like little bear guiding big bear's attention so big bear can think only about what's important now.
darwinanim8or@reddit
grug thank bear man
pab_guy@reddit
So… speculative decoding, but in parallel.
Baldur-Norddahl@reddit
Speculative decoding is the exact same. Only difference is that you have to supply an external prediction model.
More_Feature8687@reddit
Is this same as speculative decoding?
Baldur-Norddahl@reddit
Yes, with a built-in prediction model.
ROS_SDN@reddit
The model has a built in "sub-model" for speculative decoding?
How does that architecture look on the qwen3.5+? How big is this segment?
GergelyKiss@reddit
This sounds very similar to branch predictors in CPUs... Thanks for the explanation!
4onen@reddit
And it's even called speculative decoding, so yeah, spot on. We speculate these guesses through one means or another. MTP being one specific means. If we happen to guess right, then we save time, otherwise the extra work is kinda negligible if we tune everything right.
ProfessionalJackals@reddit
So this explains why something like a 5090 is not running circles around a 3090 in token generation, and people ended up running models in parallel to get the most out of it?
Polite_Jello_377@reddit
Sounds kinda like CPU branch prediction
Obvious_Equivalent_1@reddit
Wanted to convey a quick message of gratitude. It’s good to see people taking time to make their private knowledge public, it’s maybe small but these messages make it a joy to continue reading these open source subs!
radlinsky@reddit
Thank you, this is a nice high level overview I can understand :)
Eyelbee@reddit
The important part is that they train the model with MTP considerations; it makes them smarter. Other than that I don't care about the MTP inference honestly.
ilintar@reddit (OP)
big model make tokens slow, small model make tokens fast, big model has small model inside, small model make tokens for big model, big model checks, big model make tokens faster
tf2ftw@reddit
How is babby formed?
hesperaux@reddit
They are taking the babbies bank to new York to lady to rest. My pary are with the father who lost his chrilden.
lolwutdo@reddit
Gregnant
Baul@reddit
Lots of comments asking about Speculative Decoding. This is just like "draft" speculative decoding, but without the need to allocate more VRAM to a smaller model.
BitGreen1270@reddit
So are there models already that support MTP?
OsmanthusBloom@reddit
Qwen3.5 / 3.6 do support it
ShengrenR@reddit
https://www.reddit.com/r/LocalLLaMA/comments/1seqblr/turns_out_gemma_4_had_mtp_multi_token_prediction/
Gemma 4 apparently 'did'..? but not in current release.
4onen@reddit
That's the trick. The only version of the model that Google has released with multi-token prediction (MTP) is the version to run on the liteRT engine that they use for running on phones. Their explanation for why it's not in the other format releases... was that it might confuse runtimes. The problem is, every runtime ignores tensors when it doesn't know what to do with them, so it wouldn't confuse any runtimes.
My speculation is that they are holding the MTP tensors back to make their stuff look better.
BitGreen1270@reddit
That's awesome! So if I load up qwen3.6-27B and use MTP it will run much faster and use the same amount of memory?
OsmanthusBloom@reddit
See the PR linked by OP for some benchmarks. Yes, it will be a lot faster for TG, maybe twice as fast. VRAM usage will increase by around 3GB according to other commenters who have tried it.
DOAMOD@reddit
I haven't tried Llamacpp MTP yet, but I did try MTP in VLLM on Windows on my 5090, and it was a bit disappointing. The memory consumption when exposing the small model doesn't compensate at all for the significant loss of context window. Perhaps in some specific cases for MoEs it could be useful; I think that's the interesting point. But for Dense, I don't see a benefit in my use case. I'll try Llamacpp, though.
SnooPaintings8639@reddit
This has actually been true for quite a while for those who use vLLM. MTP + tensor parallel make Qwen 3.6 much faster there than in llama.cpp.
audioen@reddit
MTP has been a thing for like a year at least. Some older GLM already shipped with MTP head. People have had the habit of stripping the MTP heads off from the GGUF files because llama.cpp has had no ability to use them for such a long time. We can expect a round of updates to Qwen3.6 due to this -- currently downloading the q8_0 with MTP head in it, though no doubt within the week unsloth will have a new release, and then I'm downloading it one more time...
spaceman_@reddit
Almost everyone.
It's important to note that MTP works differently in all architectures, so while the PR adds support to Qwen3.5 models & a lot of the shared stuff required for MTP, it does not enable MTP for all models.
BitGreen1270@reddit
Qwen is the only one I could probably run dense so that's fine by me!
droptableadventures@reddit
To put that a different way: Speculative decoding has an entirely separate small model that works only on the output tokens of the big model.
For MTP, the small model gets the internal state of the big model as an input, so it can "peek inside" and make more accurate guesses as to what's coming.
Anbeeld@reddit
You still allocate a fuckton of VRAM for MTP to work.
Baul@reddit
TIL it does take more VRAM, but a fuckton is probably an overstatement:
https://github.com/ggml-org/llama.cpp/pull/22673#issuecomment-4371483712
Anbeeld@reddit
Fuckton because you have to keep the MTP layer at BF16 or so for good results, which combined with everything else bloats VRAM hard if you're on Q4 or so.
letsgoiowa@reddit
RIP so I can't even use it on a q6 4b model? Damn
Anbeeld@reddit
No, it's not like that, peeps are producing quants where e.g. the entire model is Q4 but MTP is BF16 and everything works. It just gets tight quickly if you are on a single 3090 for example.
Glazedoats@reddit
I really appreciate you mentioning this because I also have very small VRAM.
letsgoiowa@reddit
Oh I have a 3070 so only 8 GB
ForsookComparison@reddit
Is it useless in Q8? (~28GB for Qwen3.6 27B) ?
If I have to use some 56GB just to load the model then suddenly 27B doesn't feel as exciting.
Anbeeld@reddit
No, it's not like that, peeps are producing quants where e.g. the entire model is Q4 but MTP is BF16 and everything works. It just gets tight quickly if you are on a single 3090 for example.
GrungeWerX@reddit
Am on a 3090TI. So, you're saying just skip this and keep it moving?
Anbeeld@reddit
It depends if you are on Windows or Linux. If on Linux, you can try it right now using vLLM + MTP. I tried it via Windows 11 + WSL2 which wasted just enough VRAM to make it all unviable. YMMV, might be skill issue.
I'm working on a decent alternative option right now, driven by existing ones not working well for me. :P
GrungeWerX@reddit
Great, I'll wait then. On Windows 10.
ForsookComparison@reddit
Ohhh that makes sense, thanks
SKirby00@reddit
That kinda glosses over the main thing that tripped me up for a long time: how "big model checks" is faster than just "big model make token". Here, let me try to clear up that tricky part:
Big model slow because takes long time to read all weights for make next token. Big model can make next token and other next tokens with only read weights one time, but each next token only good if token before is also good.
With small model guesses, big model has filler it needs for:
- make token n
- make token n + 1 (only good if token n = small model guess for token n)
- make token n + 2 (only good if token n + 1 = small model guess for token n + 1)
- (and so on ...)
... with only one time doing slow boring job reading model weights.
Big model can't make token n + 1 at same time as make token n if model doesn't already have guess value for token n. Small model pretty good at guessing, but not perfect. If small model make right guess for token n, big model can use the token n + 1 that it made at the same time. If small model make right guesses for token n and n + 1, big model can use new tokens it made for n + 1 and n + 2.
I know this isn't quite as super duper simple as the comment I'm replying to, but if you were confused (like me) by the part about it being faster for the big model to check than to make the next token, then hopefully this helps.
Key insight (not ELI5): it's not really faster in the sense that it's less computation to check than to produce, but rather in that it can do multiple checks in the same number of weight reads (the slowest part) as it would take to do just a single prediction.
But that's for speculative decoding. I think MTP is more like "big model guess next few tokens each cycle instead of just guess next token". I only just tried to learn this stuff today so I'm definitely not expert, but I think with MTP, the key difference is that "big model not need filler guess from small model to make token n + 1 at same time as big model make token n".
Someone who knows this stuff better can feel free to correct me if I'm wrong.
ParaboloidalCrest@reddit
Ok then please ELI5 what spec decoding was again? XD. Sounds similar.
DeepOrangeSky@reddit
I think that's the idea. It is basically like doing speculative decoding, except, instead of having to use a whole literal separate small model that you run in tandem with your main model, the main model just uses a small portion of itself to perform the function of what that separate small model would've done, to do the speculative decoding for itself.
So, an advancement/evolution of traditional speculative decoding, basically.
ParaboloidalCrest@reddit
But speculative decoding does work already without a draft model.
ilintar@reddit (OP)
Yes, but that's ngram based speculative decoding, which is a slightly different beast; it's basically a lookup cache for common token combinations :)
Silver-Champion-4846@reddit
Can they be combined for faster fastness?
ilintar@reddit (OP)
Possibly, there's a PR out there for chained specdec support.
Silver-Champion-4846@reddit
Specspecdec?
radlinsky@reddit
Lol! That is indeed a 5 year old level explanation
cibernox@reddit
Isn’t that just speculative decoding?
Orolol@reddit
It is, but baked inside the model during the whole training, so you have higher acceptance
cibernox@reddit
That sounds very interesting. Does it require new models or new ggufs of existing models?
audioen@reddit
The GGUF files have had the MTP heads typically stripped to save disk space (and to avoid llama.cpp warning that it isn't going to load the layer) so they will probably get updated for this.
I am going to run this PR right now, this is the most anticipated feature of llama.cpp of all time, at least for me. Ever since GLM-4.5 or such had it, and it was known to approximately double the generation rate... Probably becomes easily the biggest single performance improvement llama.cpp has ever had.
Orolol@reddit
You can't add it on existing models, but some models already have it, like Qwen 3.6 / 3.5
pmttyji@reddit
Do we have a list of models that come with this feature somewhere? It would be nice to be able to filter for this on HuggingFace.
stddealer@reddit
It's self-speculative.
Instead of having a whole smaller LLM to predict the next sequence of tokens sequentially, the model has multiple output heads for the final layer, trying to predict probabilities for the next few tokens in one shot, without accessing the last few tokens before it since they haven't been sampled yet.
Meaning in one forward pass, the model can:
- predict and sample the next token (like a normal autoregressive LLM)
- check if the drafted tokens from last pass match and can be accepted already (like speculative decoding)
- draft the sequence of next tokens to be checked in the subsequent pass
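A bare-bones picture of the "multiple output heads" part, in toy PyTorch. Real MTP modules (DeepSeek/Qwen style) are small extra transformer blocks rather than plain linear heads, so treat this as the interface idea only: one final hidden state in, logits for the next few tokens out.

```python
import torch
import torch.nn as nn

class ToyMTPHeads(nn.Module):
    def __init__(self, hidden_size=64, vocab_size=1000, n_future=3):
        super().__init__()
        self.main_head = nn.Linear(hidden_size, vocab_size)   # normal next-token head
        self.mtp_heads = nn.ModuleList(
            [nn.Linear(hidden_size, vocab_size) for _ in range(n_future - 1)]
        )                                                      # heads for t+2, t+3, ...

    def forward(self, last_hidden):                            # (batch, hidden_size)
        logits = [self.main_head(last_hidden)]
        logits += [head(last_hidden) for head in self.mtp_heads]
        return torch.stack(logits, dim=1)                      # (batch, n_future, vocab)

heads = ToyMTPHeads()
h = torch.randn(2, 64)             # pretend final hidden states for 2 sequences
print(heads(h).shape)              # torch.Size([2, 3, 1000])
```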
_bones__@reddit
So speculative decoding?
Silver-Champion-4846@reddit
Same spirit, different body
Silver-Champion-4846@reddit
All explainers should be like this. Caveman talk most efficient
AlarmingProtection71@reddit
This is more like an "Explain like I'm a caveman" but me like it. Me understand now.
Intelligent-Baker448@reddit
ooga booga, zoom zoom zoom
ilintar@reddit (OP)
C'mon, request was for ELI5 not ELI2 :P
Intelligent-Baker448@reddit
It's more like ELI 15,000 B.C.
Everyone is talking like cavemen to save tokens, I thought.
Ariquitaun@reddit
What size small model. Small like child or small like cabbage
ilintar@reddit (OP)
Big model like big human. Small model like small human.
simracerman@reddit
Or, “speculative decoding”. Unless MTP has some amazing leg up over traditional draft.
reery7@reddit
It can make dense models run 1.5-2.0x faster. It makes most sense for a single user local model. I think it’s not that big of a jump for concurrency.
Silver-Champion-4846@reddit
Can it improve qwen3.5-4b on cpu?
Unlucky-Message8866@reddit
now in beta? it's not even merged, still a draft
Beginning-Window-115@reddit
you can still beta test a draft
HiddenMushroom11@reddit
I converted the Q8 (https://huggingface.co/am17an/Qwen3.6-27B-MTP-GGUF) MTP to IQ4_XS w/ MTP, and it's super fast on dual 3060s. Thanks for the post OP!
HiddenMushroom11@reddit
One thing to note is I couldn't get vision (mmproj) working.
Beginning-Window-115@reddit
could you upload it to huggingface please
coder543@reddit
This seriously has the potential to be the biggest game changer llama.cpp has ever seen.
I think MTP will make the biggest difference for dense models, maybe not so much for MoEs, but it will still be exciting.
Orolol@reddit
Yeah on vllm Qwen 27b goes from 55 to 105 tok/s.
apeapebanana@reddit
teach me sensei!! so far getting 30~40 tok/s...
slow but still great to work with!
Orolol@reddit
uv run vllm serve Lorbus/Qwen3.6-27B-int4-AutoRound --max-model-len "131728" --gpu-memory-utilization "0.93" --attention-backend flashinfer --language-model-only --kv-cache-dtype "fp8_e4m3" --max-num-seqs "16" --skip-mm-profiling --quantization auto_round --reasoning-parser qwen3 --enable-auto-tool-choice --enable-prefix-caching --enable-chunked-prefill --tool-call-parser qwen3_coder --speculative-config '{"method":"mtp","num_speculative_tokens":3}'
On a 5090
Ok_Brain_2376@reddit
Can I make the assumption that this is from the base vLLM? No need to find some random PR's build? (Been struggling to run Qwen dense models at 100+ tps for a while.)
DominusIniquitatis@reddit
(Did you mean to use 131072, by the way?)
Silver-Champion-4846@reddit
This be sorcery. Lol
apeapebanana@reddit
so much this. I was pretty proud of my llamacpp setup until spotting the whispering of high tok/s and vllm scripts.
Silver-Champion-4846@reddit
So many args
takoulseum@reddit
Do you use it for parallel requests?
Orolol@reddit
Yeah I use subagents for coding
rerri@reddit
I am seeing very similar numbers on llama.cpp with this PR on a 5090.
Orolol@reddit
Great! It still lacks prefix caching though.
coder543@reddit
What do you mean by this? llama-server has supported checkpointing for these Qwen3.x models for weeks now, which is the way that prefix caching works for these hybrid attention models?
Orolol@reddit
I didn't check for weeks, but last time the checkpointing was quite fuzzy. I have a long context reasoning benchmark (https://github.com/Orolol/familyBench) that reuses a very long context, and llama.cpp was giving me horrible performance while vLLM could handle 16 concurrent requests with 0 prefill and 2k toks/s
Maybe it has improved since, i'll retest
StorageHungry8380@reddit
Just on the off chance you missed it, did you bump the cache size? It's quite small by default, 8GB, so will get trashed if you have multiple long context prompts. I bumped mine up to 48GB and it was a significant improvement for my use-case.
ego100trique@reddit
Doesn't it duplicate the size of the model in vram though?
coder543@reddit
The KV Cache might be twice the size, but not the model.
ego100trique@reddit
Oh ok ok thank you
Orolol@reddit
A small overhead. The MTP part of the model is quite small.
LagOps91@reddit
should also be a big difference for MoE models hopefully. could make hybrid inference much more viable.
oxygen_addiction@reddit
On a 12GB 4070RTX Super, Qwen3.6-35B-A3B-Q4_K_XL went from 49tk/s to 55tk/s with MTP (despite the MTP model being 900mb bigger)
am17an@reddit
I just tested MoE models out, on my DGX spark Qwen35A3B went from 53 toks/second to ~70-75 toks/sec. So you're right, not as much for MoEs as dense
coder543@reddit
What kind of task? I find that specdec is more effective at tasks like "write a react typescript example" than at tasks like "what is the LHC?".
am17an@reddit
Here is my super comprehensive benchmark
https://gist.github.com/am17an/228edfb84ed082aa88e3865d6fa27090
coder543@reddit
Isn't that showing MTP losing to the external draft model? That seems odd.
am17an@reddit
It may lose, but it won't be super consistent, because the draft model is more powerful but requires more VRAM. And I did `--spec-draft-n-max` 16 there which requires a lot of memory for the partial rollback. If you're VRAM rich then the draft model is pretty good already.
lolwutdo@reddit
That's still a decent increase
InuRyu@reddit
I just learned a little bit about MTP. So from what I know, this is only useful if the acceptance rate is high, for predictable tasks like coding, it is good, but what about tasks such as RL or writing a novel? The smaller model would behave very differently than the bigger model expects for creative tasks, so the acceptance rate will be really low.
Pro-Row-335@reddit
I go from 45 to 80 tk/s with this on Japanese->English translation, and it was consistently at 80. I pasted a 150k token paper and asked it to evaluate it and it went from 80 to 71. I believe you are confounding MTP with ngram
InuRyu@reddit
oh lol, yeah you're right. After I read a bit more in the comments, I saw someone explain all the concepts (MTP, ngram, DFlash, etc.) and it seems like I confused MTP with speculative decoding
No_Block8640@reddit
Can someone explain to me how to run models in llama.cpp? I've tried to install this branch and offloaded expert layers to CPU just like in LM Studio, but I usually get 25-30 t/s in LM Studio vs 14 t/s using this branch of llama.cpp. Maybe my flags are wrong? (Qwen 3.6 35ba3b, rtx3080 and cpu)
Pro-Row-335@reddit
make sure you have a gguf with MTP in it, then --spec-type mtp --spec-draft-n-max 3
No_Block8640@reddit
I suppose unsloth doesn’t have mtp layers? How to find a gguf with mtp layers?
Pro-Row-335@reddit
It doesn't, I just downloaded this and it worked, you can find others by searching the name on huggingface with MTP
https://huggingface.co/brittlewis12/Qwen3.6-27B-MTP-GGUF/tree/main
rerri@reddit
Nice! This seems to be way faster than ik_llama.cpp implementation. Been playing with that the past couple of days.
AzerbaijanNyan@reddit
Very nice, 55-60 t/s at around 80K context on two Mi50s and the 35B-A3B in Q4.1. Not the smartest but works rather well if given a good plan to follow.
shuwatto@reddit
This is huge, thanks a lot!
oxygen_addiction@reddit
Thanks a lot. I used it to make Qwen3.5-4B-Q6_K_L-MTP and it works great.
thoquz@reddit
Brilliant! What's the memory requirement for the MTP layer?
rerri@reddit
I'm seeing ~3.1 GB more VRAM used when comparing MTP to no-MTP and using 128K ctx length, kv q8_0.
At 16K ctx length, the difference is still pretty big at ~2.7 GB.
coder543@reddit
I wonder if it is allocating a separate, draft KV cache for the MTP heads? I didn't think that was needed for MTP.
rerri@reddit
am17an writes in the PR: "it has it's own context/kv-cache etc."
StupidScaredSquirrel@reddit
I have to say it's hard to complain about prices going up when my same hardware becomes so much more capable every month for free.
Travnewmatic@reddit
I've had this same thought so many times over the past few months
ketosoy@reddit
You’re a better man than me, I still manage to be upset by the prices.
StupidScaredSquirrel@reddit
Not better, not a man, but I do shake my fists whenever I see the price of new hardware lol. It's just that every time I see stuff like this post or run qwen3.6, I remember how lucky I am and that I didn't really expect this of my pc 2 years ago
Silver-Champion-4846@reddit
Lucky you, but I can't even train a 1 million param tts model. Not even 1 million! Theoretically useless, but my cpu and 8gb ram says "nerrrp"
StupidScaredSquirrel@reddit
You can still run some sub 4b models to do plenty of stuff, local audio transcription, tts, fill-in-the middle for coding, boilerplate email assistant, etc
Silver-Champion-4846@reddit
4b models using Jan (runs on llama.cpp) gives me 3-4tps. Not very usable unless someone invents a small agent harness (small lm friendly)
StupidScaredSquirrel@reddit
What quant are u using? Are you compute bound or memory bandwidth bound?
Silver-Champion-4846@reddit
I'm bothbound. Cpu is Core I5 8th gen U type (lo-power mobile cpu, 4 cores 8 threads), ram 8gb single channel ddr4 ram, storage 256gb. My latitude5590 is upgradable to max 32gb ram 1tb storage. That means potential for a lot more context, a model that uses Gemma4 PerLayer embeddings more extensively, or suffering with Qwen3.5 9B, as cpu stays the same.
StupidScaredSquirrel@reddit
Try marco nano or even mini if u can fit, I edited my previous comment
pmttyji@reddit
Hope we get everything from the below thread (and its comments) soon, or by end of this quarter.
Compilation of recent findings which could save some memory or increase performance
No_Conversation9561@reddit
Does it improve prefill speed too or only decode?
TheTerrasque@reddit
One reported halving prefill speed when this was active, from ~1200 to ~600
lolwutdo@reddit
Damn, I'd rather have faster PP than TG
ilintar@reddit (OP)
Only decode. For prefill you need matmul kernel optimizations.
Apart_Boat9666@reddit
Just tried the 35B model with MTP on my current setup: a 12GB RX 6700 XT. With my old config it was already offloading some layers to RAM. After enabling MTP it needed around 3 GB extra VRAM, so my "--n-cpu-moe", "23" dropped to "--n-cpu-moe", "36", and it was slower than before. So if your setup already needs offloading, MTP is probably not worth it.
Ok_Warning2146@reddit
Thanks for your info. I think MTP is mainly for llama.cpp to catch up with sglang and vllm on pure nvidia platform.
tarruda@reddit
Is this only for 3.x dense models or does it work with MoEs too?
oxygen_addiction@reddit
Works with MoE models that still retain the MTP head. I transplanted one over to an unsloth quant and it works fine.
unjustifiably_angry@reddit
MoE tends to see lesser or even negative benefit; if you're already operating with a very small active parameter count, going any lower gets exponentially dumber outputs and accordingly worse acceptance rates.
tarruda@reddit
So maybe it will be worth it for the 122B and 397B (if 3.6 for those are released)
unjustifiably_angry@reddit
Likely for 122b, but most certainly for 27b.
ilintar@reddit (OP)
Should work with MoE but I guess it'll need the MoE MTP model support as well.
Ok_Warning2146@reddit
Wow. That's big news. Finally the last piece of puzzle that puts it on par with sglang and vllm
Charming-Author4877@reddit
A draft is not a beta. Can't wait to have this implemented.
ilintar@reddit (OP)
I'm saying this is a beta because my gut feeling tells me that this is close to the production version :)
itsappleseason@reddit
y'all are really downvoting king deltanet
ilintar@reddit (OP)
I'm just the messenger here ;)
Top-Rub-4670@reddit
A messenger conveys the message as is. Here, you've made up the message "It's now in beta" when it's just a PR, and a draft one at that.
Pyrolistical@reddit
Doesn’t work on vulkan yet
feckdespez@reddit
It's still not a beta though. Just a draft and PR for it.
EveningIncrease7579@reddit
Really awesome! Any results on a single 3090? I'll extract layers from the original GGUF (from author in PR) to a quantized one and try it in the new llama.cpp. I'll try it at home soon...
EveningIncrease7579@reddit
Made some tests, good results but not very promising....
Prompt: Make a flappy bird in html css and js.
With draft-max 2 (starts slow (30 tk/s), but increases to 44~45 after some seconds)
./build/bin/llama-server \
  -m /mnt/hd_geral_1tb/lucebox-hub/dflash/models/qwen3.6-27b-unsloth-Q4_K_M-MTP.gguf \
  --host 0.0.0.0 \
  --port 9090 \
  -ngl 999 \
  -np 1 \
  --no-mmap \
  --no-cache-prompt \
  -fa on \
  --cache-type-k q8_0 \
  --cache-type-v q8_0 \
  --ctx-size 32768 \
  --webui \
  --spec-type mtp \
  --draft-max 2 \
  --chat-template-kwargs '{"preserve_thinking":true}'
prompt eval time = 145.91 ms / 21 tokens ( 6.95 ms per token, 143.93 tokens per second)
eval time = 214175.96 ms / 8041 tokens ( 26.64 ms per token, 37.54 tokens per second)
total time = 214321.87 ms / 8062 tokens
draft acceptance rate = 0.34294 ( 3271 accepted / 9538 generated)
statistics mtp: #calls(b,g,a) = 1 4769 2177, #gen drafts = 4769, #acc drafts = 2177, #gen tokens = 9538, #acc tokens = 3271, dur(b,g,a) = 0.001, 24194.822, 0.311 ms
Without, using original unsloth q4_km
./build/bin/llama-server \
  -m /mnt/hd_geral_1tb/lucebox-hub/dflash/models/Qwen3.6-27B-Q4_K_M.gguf \
  --host 0.0.0.0 \
  --port 9090 \
  -ngl 999 \
  -np 1 \
  --no-mmap \
  --no-cache-prompt \
  -fa on \
  --cache-type-k q8_0 \
  --cache-type-v q8_0 \
  --ctx-size 32768 \
  --webui
prompt eval time = 153.83 ms / 21 tokens ( 7.33 ms per token, 136.52 tokens per second)
eval time = 154049.44 ms / 6051 tokens ( 25.46 ms per token, 39.28 tokens per second)
total time = 154203.27 ms / 6072 tokens
Using qwen 0.8b as a draft (using 27B Q4 mtp gguf converted)
./build/bin/llama-server \
  -m /mnt/hd_geral_1tb/lucebox-hub/dflash/models/qwen3.6-27b-unsloth-Q4_K_M-MTP.gguf \
  -hfd unsloth/Qwen3.5-0.8B-GGUF:Q8_0 \
  --host 0.0.0.0 \
  --port 9090 \
  -ngl 999 \
  -ngld 999 \
  -np 1 \
  --no-mmap \
  --no-cache-prompt \
  -fa on \
  --ctx-size 32768 \
  --ctx-size-draft 32768 \
  --webui \
  --draft-max 16 \
  --chat-template-kwargs '{"preserve_thinking":true}'
prompt eval time = 121.30 ms / 21 tokens ( 5.78 ms per token, 173.12 tokens per second)
eval time = 231306.58 ms / 6907 tokens ( 33.49 ms per token, 29.86 tokens per second)
total time = 231427.88 ms / 6928 tokens
draft acceptance rate = 0.81802 ( 7898 accepted / 9655 generated)
Using qwen 0.8b as a draft (using 27B gguf unsloth original K_M)
./build/bin/llama-server \
  -m /mnt/hd_geral_1tb/lucebox-hub/dflash/models/Qwen3.6-27B-Q4_K_M.gguf \
  -md /mnt/external/models/unsloth/QWEN3.6-27B/Qwen3.5-0.8B-Q8_0.gguf \
  --host 0.0.0.0 \
  --port 9090 \
  -ngl 999 \
  -ngld 999 \
  -np 1 \
  --no-mmap \
  --no-cache-prompt \
  -fa on \
  --ctx-size 32768 \
  --ctx-size-draft 32768 \
  --webui \
  --draft-max 16 \
  --chat-template-kwargs '{"preserve_thinking":true}'
prompt eval time = 225.99 ms / 21 tokens ( 10.76 ms per token, 92.92 tokens per second)
eval time = 192988.51 ms / 6243 tokens ( 30.91 ms per token, 32.35 tokens per second)
total time = 193214.51 ms / 6264 tokens
draft acceptance rate = 0.84039 ( 7145 accepted / 8502 generated)
EveningIncrease7579@reddit
In the first scenario, using draft-max 1 (sometimes it gets 50 tk/s)
prompt eval time = 220.85 ms / 21 tokens ( 10.52 ms per token, 95.09 tokens per second)
eval time = 138005.83 ms / 6149 tokens ( 22.44 ms per token, 44.56 tokens per second)
total time = 138226.67 ms / 6170 tokens
draft acceptance rate = 0.59151 ( 2285 accepted / 3863 generated)
In the first scenario, using draft-max 3 gives a median of 30 tk/s
zenmagnets@reddit
Except for high concurrency output.
TheTerrasque@reddit
I was thinking about this a long time ago, that gguf should have generic support for multiple models. At that time I was thinking especially draft models, but also vision encoders and possibly other encoders / decoders / model types at some point. And image diffusion models with llm's and vae's included as another example.
natermer@reddit
Doesn't appear to be supported on Vulkan or CUDA yet. Which is too bad. Hopefully that will come along eventually as well.
The feature report points to: https://github.com/ggml-org/llama.cpp/pull/22400
ilintar@reddit (OP)
It's supported on CUDA, Vulkan support needs a patched GDN kernel.
natermer@reddit
Yeah sorry, I meant ROCM.
ilintar@reddit (OP)
(Actually, CUDA support ALSO needs the patched GDN kernel, which is in another PR - you have to read the thread for details)
waywardspooky@reddit
does this mean this improvement will trickle its way down into unsloth studio as well?
Fedor_Doc@reddit
It uses llama.cpp as a backend, so yes, it will trickle down.
nok01101011a@reddit
You think talkie-1930 can also be patched to be working on official llama?
dampflokfreund@reddit
So is this only useful for dense models? If so, does it help with partial offloading?
StupidScaredSquirrel@reddit
Eli5 why is this only useful for dense models? Doesn't it work for a3b just to a much lesser degree?
dry3ss@reddit
From what I've read here, MTP is only really useful with MoE if you have a lot of parallel execution, because it relies on most of the experts being available, so you come back to a "dense" model that uses all its weights.
That explanation does seem weird with qwen3.6 35-a3b, which is supposed to have dedicated MTP heads, so if anyone is more knowledgeable don't hesitate to share!
petuman@reddit
To verify speculated tokens the engine has to schedule parallel/batched completions for each speculated token (e.g. 4 completions for 3 speculated tokens). Those completions would be routed to different experts, activating more weights than a single completion.
If you're serving 100 users/batched requests, then ~all experts are being activated anyway. If you're a single user, then that's more experts activated per request (=> more memory bandwidth usage => slower generation)
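Rough back-of-envelope, assuming uniform routing (which real routers definitely aren't) and made-up numbers (128 experts, 8 routed per token), just to show how quickly the active-expert count grows when several positions are verified in one pass:

```python
def expected_unique_experts(n_experts, experts_per_token, n_tokens):
    """Expected number of distinct experts touched when each of n_tokens positions
    independently picks experts_per_token experts uniformly at random."""
    p_untouched = (1 - experts_per_token / n_experts) ** n_tokens
    return n_experts * (1 - p_untouched)

for n in (1, 2, 4, 8):
    print(n, round(expected_unique_experts(128, 8, n), 1))
# 1 -> 8.0, 2 -> 15.5, 4 -> 29.1, 8 -> 51.6: more tokens per pass, more weights to stream
```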
dry3ss@reddit
Ahhh yes, it's for verification, not for computing the MTP, that the experts are required. Thanks, that makes sense!
Farmadupe@reddit
I think that's right? If I understand it right, the main model still has to confirm all of the predicted tokens by doing exactly the same forward passes it was going to do anyway.
Let's say a 10 token sequence. Without mtp and with mtp (100% accept rate):
With an MoE model, the maths is slightly different. Each token only loads 3 billion params, but you don't know which ones they are:
So in this hypothetical situation, the dense model gets a massive speedup from mtp, but the moe gets almost none. You would actually get some speedup if some of the same experts were pulled in, but nowhere near as much.
coder543@reddit
Because MTP helps during training, and because anyone serving a model in production will be batching large numbers of user requests together, activating all experts with every forward pass anyways.
am17an@reddit
I just updated the PR to also use Qwen3.6 MoE. It results in a 30-40% speed-up in my tests.
StupidScaredSquirrel@reddit
No way! Link?
am17an@reddit
https://huggingface.co/am17an/Qwen3.6-35BA3B-MTP-GGUF
StupidScaredSquirrel@reddit
Wait, I'm stupid, I thought it was a change in llama.cpp, not in the gguf file. Don't the gguf files already have the mtp layers, just not leveraged by llama.cpp before the merge request?
rerri@reddit
It is a change in llama.cpp, the PR (link in OP) was updated. Old GGUF models of Qwen 3.5/3.6 do not include the MTP layer.
StupidScaredSquirrel@reddit
Thx. I'm memory poor so I guess I'm gonna have to make my own gguf with heavy quantisation but keeping mtp at 16 bits. We'll see how that goes.
rerri@reddit
This script might offer a shortcut if you are planning to use the 27B or 35B models: https://gist.github.com/buzz/1c439684d5e3f36492ae9f64ef7e3f67
It allows you to transplant the MTP from am17an's GGUF's onto whatever old GGUF of those models you already have.
Someone made it for ik_llama.cpp originally, but it seems to work fine with llama.cpp too.
StupidScaredSquirrel@reddit
Thank you so much!
Ueberlord@reddit
When doing inference with a3b you are already only using 3b active parameters, so to see any benefit you'd probably need to go down to a 0.6b draft model, which will most likely have bad acceptance rates; and since the difference to 3b is not big at all, the speed up is limited.
When using a 2b or 0.6b model as drafter for 27b the difference in active parameters is huge and we should see meaningful speed up, especially for tasks with higher acceptance rates like coding or structured outputs.
So in essence it works to a lesser degree but I think it is hardly meaningful for moe (unless something like 397b a27b).
StupidScaredSquirrel@reddit
Mtp in question doesn't rely on an external draft model though hence my question
gcavalcante8808@reddit
I've been using the original MTP since the first qwen 3.5 models were released, since they are a bit slower than the older qwen3 models, and they are really good! I also discovered that qwen3-coder-next also supports MTP and it is flying on my machine, even with the vulkan backend.
I'm very fond of MTP as a speculative and simplified method! Really nice to see the support becoming official.
ea_man@reddit
I take it this would be an opt-in with a flag like --mtp, so that those of us with small VRAM who won't be able to run MTP anyway (also single-user prompting) don't have to load an extra heavy MTP layer?
Due_Net_3342@reddit
does this also work with step 3.5 flash? or only qwen models?
OsmanthusBloom@reddit
Cool! But will enabling MTP increase VRAM usage for, say, Qwen3.6-27B? Does it still fit into 16GB VRAM if you squeeze hard enough?
rerri@reddit
The MTP layer of am17an's model is ~440MB. Can maybe be quantized further, dunno.
OsmanthusBloom@reddit
Thanks. The PR says "it has it's own context/kv-cache etc" so I assume that some VRAM will be needed for that as well.
rerri@reddit
Yes, I'm seeing ~3.1 GB more VRAM used when comparing MTP to no-MTP and using 128K ctx length.
At 16K ctx length, the difference is still pretty big at ~2.7 GB.
Not very favorable for 16 GB VRAM :/
pmttyji@reddit
Oops, I thought of trying on my 8GB VRAM 😄
OsmanthusBloom@reddit
Thanks a lot for this. Well, there goes my dream. Let's hope for a few more miracles to happen.
Dany0@reddit
Quantising the MTP layer has so far always turned out to be a very, very bad idea
pmttyji@reddit
Nice. Sorry for the dumb question. So this requires the GGUFs mentioned in the PR? Regular GGUFs won't work?
ilintar@reddit (OP)
I think any GGUFs without stripped MTP layers should work.
TheGlobinKing@reddit
Noob question: how do I know if a GGUF I downloaded has MTP layers?
Anbeeld@reddit
There's this complex issue that if MTP is quantized it all goes to shit, which is why people use that specific Lorbus quant with vLLM.
michaelsoft__binbows@reddit
Ok but can you like, elaborate on how that impacts gguf's...
ilintar@reddit (OP)
Standard quantization schemes will quantize certain tensors based on their role in the graph, so e.g. all ffn_up tensors get quantized to Q4_K. However, since an MTP layer is small, but you want as few rejections from it as possible, you want it at higher quality than the other layers. Existing GGUFs probably don't have that.
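If you want to check what a given GGUF actually contains and how its tensors are quantized, here's one quick way, assuming the `gguf` Python package that ships with llama.cpp's conversion scripts. The filename is just a placeholder, and MTP tensor names vary by architecture, so just eyeball anything that doesn't look like a regular layer:

```python
from gguf import GGUFReader   # pip install gguf

reader = GGUFReader("Qwen3.6-27B-MTP-Q4_K_M.gguf")   # placeholder filename
for t in reader.tensors:
    # t.tensor_type is the per-tensor quantization (e.g. Q4_K, F16, BF16)
    print(f"{t.name:50s} {t.tensor_type.name:8s} {list(t.shape)}")
```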
michaelsoft__binbows@reddit
Thanks for clarifying. I've already got the lorbus quant working in vllm, regularly hitting 120tok/s on my 5090. It does sound very reasonable that new quants with unquantized MTP layers will be needed for MTP to deliver the goods on llamacpp. It will be very nice for closing the gap and hopefully we will have a proliferation of quants to choose from.
The thing about vllm is I don't think it lets you squeeze models into limited vram anywhere near as well as llamacpp does. So MTP in llamacpp could be a massive game changer for hosting the 27B class on a 3090 for example, though there seem to already be some ways to squish it in with vllm as well. But maybe it could do it with higher quality!
ilintar@reddit (OP)
Yeah, the MTP layer should probably be left as BF16.
lolwutdo@reddit
Nice, I wonder how much speed up 27b would get with partial cpu offloads.
LagOps91@reddit
damn you guys are fast, was just about to make a post for this
autonomousdev_@reddit
yo so i spent like a week messing with mtp and chained agents n stuff. batch stuff went way faster like 40% but latency got real weird after 4 tokens lol. had to dial it back to 2 for production. but for basic rag stuff it just works no complaints
bonobomaster@reddit
Holy smokes... this year keeps on giving!
ilintar@reddit (OP)
Same arch.