llama: avoid copying logits during prompt decode in MTP by am17an · Pull Request #23198 · ggml-org/llama.cpp
Posted by jacek2023@reddit | LocalLLaMA | 48 comments
time to update your llama.cpp -> improved pp
MaruluVR@reddit
Off-topic, but has Gemma 4 MTP been merged yet? Do I need to merge the MTP into the main model?
If not can someone link me the pull request?
SkyFeistyLlama8@reddit
I'm still waiting on this too, MTP in llama.cpp doesn't seem to support separate MTP models like Gemma 4's assistants.
pmttyji@reddit
Not yet. Aman is working on it. I couldn't find the PR.
OsmanthusBloom@reddit
Oh great, just as I finished my benchmarks. Well, I'm not going to redo them because in my case (6GB VRAM, 35B-A3B) the TG boost was minimal, as expected.
DunderSunder@reddit
why did you expect minimal boost?
OsmanthusBloom@reddit
MTP works well for dense models (around 2x boost in TG in many cases), but brings much less benefit for MoE models like the Qwen3.6 35B-A3B when the experts are in CPU RAM. In my case, I was already extremely VRAM starved, and having to cram the MTP extras into VRAM as well did not exactly help with the performance. But I decided to try anyway, because why not.
Agreeable-Market-692@reddit
You are overcomplicating it.
Does it fit in vram? Then expect an MTP advantage.
Latency is what kills the benefit of MTP, if you are offloading you are not going to have a reason to use MTP.
It is just that simple.
I am getting up to 3x boost in TG with Qwen 3.6 35B-A3B because it fits in vram.
Dense or sparse it matters not. Does it fit in vram is the only question you should ask.
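To put rough numbers on the latency argument, here is a minimal sketch of the usual speculative-decoding estimate. It assumes each drafted token is accepted independently with the same probability, which real models only approximate. The names `accept_prob`, `n_draft`, and `step_cost_ratio` are hypothetical, and the example values are illustrative, not measurements from llama.cpp.

```python
def expected_tokens_per_step(accept_prob: float, n_draft: int) -> float:
    """Expected tokens emitted per draft+verify step.

    Assumption: each of the n_draft drafted tokens is accepted independently
    with probability accept_prob, and the main model always contributes one
    token of its own, so the result lies between 1 and n_draft + 1.
    """
    if not 0.0 <= accept_prob <= 1.0:
        raise ValueError("accept_prob must be in [0, 1]")
    if accept_prob == 1.0:
        return float(n_draft + 1)
    # Geometric series: 1 + p + p^2 + ... + p^n_draft
    return (1.0 - accept_prob ** (n_draft + 1)) / (1.0 - accept_prob)


def estimated_speedup(accept_prob: float, n_draft: int, step_cost_ratio: float) -> float:
    """Estimated TG speedup over plain decoding.

    step_cost_ratio is the cost of one draft+verify step divided by the cost
    of one ordinary decode step. Offloading to CPU RAM would show up here as
    a larger ratio.
    """
    return expected_tokens_per_step(accept_prob, n_draft) / step_cost_ratio


if __name__ == "__main__":
    # Illustrative values only: same acceptance rate, different step cost.
    print("cheap step  :", round(estimated_speedup(0.8, 3, 1.2), 2))
    print("costly step :", round(estimated_speedup(0.8, 3, 2.5), 2))
```

With the same acceptance rate, the estimate drops toward 1x as the per-step cost grows, which is one way to read the "does it fit in VRAM" rule of thumb.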
SkyFeistyLlama8@reddit
It works fine on unified RAM machines where VRAM and system RAM are from the same pool of memory.
DunderSunder@reddit
at least you can use n-gram
am17an@reddit
k
pmttyji@reddit
Few questions Aman.
Thanks again.
am17an@reddit
pmttyji@reddit
Thanks for the response. I'll check CPU-only tomorrow & let you know.
It's good to have more models. Both GLM-5.1 & DeepseekV3 are so big. At least StepFun-3.5 is nice to have as Q4 comes around 100GB size so it's possible to run that one with VRAM + RAM. So will StepFun-3.5 work with current llama.cpp version? If yes, then we could ask some quanters to create StepFun-3.5-MTP GGUF. Same with GLM's previous version models like GLM-4.5-Air & GLM-4.7-Flash.
Agreeable-Market-692@reddit
latency kills the MTP advantage, offloading is a no-go
fallingdowndizzyvr@reddit
Am I the only one that updates llama.cpp daily?
10F1@reddit
I run yay (CachyOS) multiple times a day, so sometimes it updates multiple times a day.
cleversmoke@reddit
Once a week here
StardockEngineer@reddit
Every few days here
pmttyji@reddit
I update when support for new models or new features (e.g. MTP) lands.
I'll probably update more frequently after getting a new rig this month.
LumbarJam@reddit
No. Twice a day here.
Thrumpwart@reddit
Make sure you've installed ccache.
Amazing_Athlete_2265@reddit
No
Defiant_Storm3233@reddit
I update a few times a day sometimes.
IvGranite@reddit
Everyone posts their first rush of benchmarks and they’re all outdated within 24 hours, I’m tired boss lol
cleversmoke@reddit
I did! 🙋🏻
jacek2023@reddit (OP)
You should always wait for the dust to settle
PaceZealousideal6091@reddit
One thing I have learnt: dust never settles in llama.cpp. They are always eking out just a little bit more. So it's best to get to things only when you absolutely need them or you have time to kill. 😉
IvGranite@reddit
Yeah I know, but it’s all just so exciting lol
jtjstock@reddit
Ok, so skipping logits during decode is pretty obvious. What about skipping MTP entirely until it matters? Do the MTP heads really need the full context? I doubt it. I’ve been using a custom (albeit slop) fork that does that: it gets to about 85% of non-MTP prefill speed when the prefill is large, and lets things go through MTP for smaller incremental prompts.
andreasntr@reddit
If you have data and feel confident, you can open an issue or a discussion
jtjstock@reddit
Well, per what I’ve read online, they do not; they only look at a small number of preceding tokens. But I may be missing something, and I have little more than a slop fork lol (well, and some numbers from casual testing, nothing in depth). I have no interest in adding to the bombardment of ill-formed ideas that the developers, who actually have familiarity with the code and know the math involved, are dealing with…
So for now, I am assuming that either I am missing something, or they know, and need to figure out the maintainable way to do so.
StorageHungry8380@reddit
Has anyone picked up why MTP negatively affects prompt processing? Is it just code issues like what this PR fixes, or is there something fundamental? As I understood it, the extra output token prediction layers were just tacked on at the end of the model; if so, I don't see a fundamental reason why it should affect processing speed. Did I miss anything?
marking89@reddit
IIRC the big model needs to send the pp data to the MTP model too, so it stays in sync and predicts correctly. Right now it’s going VRAM->RAM->VRAM. They wanted the MTP PR merged before addressing this. They work fast!
StorageHungry8380@reddit
Then it's implemented very differently from what I understood. I understood it as just another (few) layers attached to the final layer of the model, before the final softmax. Anyway, cheers.
lolwutdo@reddit
Decent improvement, but MTP still cuts my prompt processing speed in half compared to running without it.
Losing out on nearly 1000t/s PP just for an extra +10t/s TG, not really worth it.
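Whether that trade is worth it depends on how long the prompt is relative to the generated output. A minimal sketch of the break-even arithmetic follows, assuming total time is simply prompt tokens over PP speed plus generated tokens over TG speed. The PP figures (2000 vs 1000 t/s) are inferred from "halved" and "nearly 1000 t/s lost"; the TG baseline of 30 t/s is a made-up assumption, since only the +10 t/s difference is given above.

```python
def total_seconds(prompt_tokens: float, gen_tokens: float,
                  pp_tps: float, tg_tps: float) -> float:
    """End-to-end time: prefill time plus generation time."""
    return prompt_tokens / pp_tps + gen_tokens / tg_tps


def break_even_gen_tokens(prompt_tokens: float,
                          pp_base: float, pp_mtp: float,
                          tg_base: float, tg_mtp: float) -> float:
    """Generated tokens needed before MTP's faster TG repays its slower PP.

    Returns infinity if MTP does not generate faster than the baseline.
    """
    extra_prefill = prompt_tokens * (1.0 / pp_mtp - 1.0 / pp_base)
    saved_per_token = 1.0 / tg_base - 1.0 / tg_mtp
    if saved_per_token <= 0.0:
        return float("inf")
    return max(0.0, extra_prefill / saved_per_token)


if __name__ == "__main__":
    # Hypothetical workload: 10,000-token prompt.
    # PP 2000 -> 1000 t/s (inferred), TG 30 -> 40 t/s (assumed baseline).
    n = break_even_gen_tokens(10_000, 2000.0, 1000.0, 30.0, 40.0)
    print(f"break-even at about {n:.0f} generated tokens")
```

Under these assumed numbers, short replies to long prompts come out slower with MTP, while long generations come out faster. Plugging in measured PP and TG values gives the threshold for a specific setup.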
MaruluVR@reddit
This might be off topic but has Gemma 4 MTP been merged yet? If not what is the pull request?
pmttyji@reddit
Not yet. Aman is currently working on it.
TuskNaPrezydenta2020@reddit
there are no PRs for Gemma yet on the topic of mtp, right?
jacek2023@reddit (OP)
There is one draft (in progress)
donomo@reddit
where? I can't find it
Ok-Ask1962@reddit
Already on the latest version, the prompt processing speed boost is real. This is why I always recommend staying current with ggml releases.
No_Lingonberry1201@reddit
I freaking love llama.cpp, such an awesome piece of software.
No_Swimming6548@reddit
Fr, llama.cpp, LLMs and agent harnesses are the closest thing I have seen in my life to magic
Silver-Champion-4846@reddit
I wish I had good hardware so I could use non-dumb models
No_Swimming6548@reddit
Qwen3.6 is pretty good. Like how tf are you that good. You can always go API too. Deepseek v4 flash is pretty affordable.
Silver-Champion-4846@reddit
I can't even pay 1c, the online payment system of my country is only local as of now. So no deepseek either. Openrouter's free models are ficcal.
soyalemujica@reddit
Using this, and for some reason, I said "hi" and for the first time EVER, Qwen replied in just CHINESE. wtf?
xignaceh@reddit
Qwen 2.5 vibes haha