Why does GLM on llama.cpp have no MTP?

Posted by Expensive-Paint-9490@reddit | LocalLLaMA | 9 comments

I have searched through the repo discussions and PRs but I can't find any references. GLM models have embedded layers for multi-token prediction (MTP) and speculative decoding. They can be used with vLLM, if you have hundreds of GB of VRAM, of course. Does anybody know why llama.cpp chose not to support this feature?


[-]

coder543@reddit

If it makes you feel better, MTP won't help you in batch size 1... MoEs just don't work that way. MTP is only useful for people that are batching large numbers of requests on very large servers.
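One possible reading of this claim, sketched below: at batch size 1 a MoE only reads the weights of the experts one token routes to, but verifying several drafted tokens in one pass touches more distinct experts, so the memory traffic per pass grows. The function name, the uniform independent routing assumption, and the 128 experts / 8 per token figures are illustrative assumptions, not taken from any GLM config or from llama.cpp.

```python
def expected_active_experts(num_experts: int, experts_per_token: int, tokens: int) -> float:
    """Expected number of distinct experts touched in one forward pass.

    Assumes (hypothetically) that every token picks `experts_per_token`
    experts uniformly at random, independently of the other tokens.
    Real routers are not uniform, so treat this as a rough estimate.
    """
    # Probability that one given expert is skipped by a single token.
    p_skip_one_token = 1.0 - experts_per_token / num_experts
    # Probability that it is skipped by all tokens in the pass.
    p_untouched = p_skip_one_token ** tokens
    return num_experts * (1.0 - p_untouched)


if __name__ == "__main__":
    # Illustrative numbers only: 128 routed experts, 8 active per token.
    for n_tokens in (1, 2, 4, 8):
        active = expected_active_experts(128, 8, n_tokens)
        print(f"{n_tokens} token(s) per pass -> ~{active:.1f} experts read")
```

Under these assumptions, a pass that checks 4 tokens reads several times as many expert weights as a normal single-token pass, which eats into whatever the drafted tokens save. With many batched requests most experts are being read anyway, so the extra cost is much smaller.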

[-]

Karyo_Ten@reddit

I can use them in SGLang but never managed to make them work in vLLM, for what it's worth.

[-]

-dysangel-@reddit

I spent a while last summer reverse engineering GLM 4.5's MTP layer with Claude. I was able to poke around in these layers and see the model thinking ahead about concepts, and get some level of next-token prediction. But the extra computation needed to validate the tokens meant that I needed to consistently predict 2 tokens ahead to beat baseline speeds: the validation forward pass would be predicting the next token anyway, so you need an extra token to check for it to end up being faster. So normal speculative decoding with a smaller base model is more effective, really.

I considered setting up a system to be able to do speculative decoding with different architectures, but then I got distracted building a KV caching system, since that saves you *minutes* of processing the system prompt every time you boot up a local coding agent or switch mode, and GLM Air was already giving me ~60 tps on inference.

Of course I'm not an ML expert so maybe I was also doing something wrong, but on the other hand, maybe that's why they've never bothered to enable MTP on release - for inference, it's much more architectural complexity, without consistent benefit.

My feeling after playing around with it for a week is that the real strength of MTP is that it likely speeds up training by being able to process a few tokens at a time, and the model is probably smarter because it has a richer internal representation per token. This might be why GLM 4 series models basically seem to perform as well as models almost twice their size? And again this is just me spitballing, I haven't thought about this for a while.
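The break-even argument above can be sketched with a small cost model. This is a back-of-the-envelope estimate, not a measurement: the function names, the per-token acceptance probability, and the relative costs (one normal forward pass = 1.0) are all assumptions made for illustration.

```python
def expected_tokens_per_cycle(accept_prob: float, draft_len: int) -> float:
    """Expected tokens produced by one draft-then-verify cycle.

    The verification pass always yields one token by itself (the i = 0 term);
    drafted token i only counts if it and all earlier drafts were accepted,
    which happens with probability accept_prob ** i under an assumed
    independent per-token acceptance rate.
    """
    return sum(accept_prob ** i for i in range(draft_len + 1))


def speedup(accept_prob: float, draft_len: int,
            draft_cost: float, verify_cost: float) -> float:
    """Estimated speedup over plain decoding (one token per pass of cost 1.0).

    draft_cost:  assumed cost of drafting one token (e.g. one MTP head pass).
    verify_cost: assumed cost of the pass that checks all drafted tokens.
    """
    cycle_cost = draft_len * draft_cost + verify_cost
    return expected_tokens_per_cycle(accept_prob, draft_len) / cycle_cost


if __name__ == "__main__":
    # Hypothetical costs: drafting is cheap, verifying is 30% dearer than a normal pass.
    for p in (0.3, 0.6, 0.9):
        for n in (1, 2, 3):
            print(f"accept={p:.1f} draft_len={n} -> x{speedup(p, n, 0.1, 1.3):.2f}")
```

In this model a value below 1.0 means the cycle is slower than plain decoding. That happens whenever the acceptance rate is too low to pay for the drafting and the more expensive verification pass, which matches the experience described above.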

[-]

DistanceSolar1449@reddit

DeepSeek R1 has an MTP layer as well. And Qwen3-Next. DeepSeek has 2 unembedding matrices for some reason. I don’t think MTP is a magic way to increase quality.

[-]

reditzer@reddit

There's a long-running [PR](https://github.com/ggml-org/llama.cpp/pull/15225).

[-]

Time_Reaper@reddit

Recently ngxson also started their own PR. The original one you linked seems abandoned. https://github.com/ggml-org/llama.cpp/pull/18886

[-]

jacek2023@reddit

MTP is in development IIRC

[-]

cantgetthistowork@reddit

For years