llama: avoid copying logits during prompt decode in MTP by am17an · Pull Request #23198 · ggml-org/llama.cpp

[-]

MaruluVR@reddit

Off-topic, but has Gemma 4 MTP been merged yet? Do I need to merge the MTP into the main model?

If not can someone link me the pull request?

[-]

SkyFeistyLlama8@reddit

I'm still waiting on this too. MTP in llama.cpp doesn't seem to support separate MTP models like Gemma 4's assistants.

[-]

pmttyji@reddit

Not yet. Aman is working on it. I couldn't find the PR.

[-]

OsmanthusBloom@reddit

Oh great, just as I finished my benchmarks. Well, I'm not going to redo them because in my case (6GB VRAM, 35B-A3B) the TG boost was minimal, as expected.

[-]

MTP works well for dense models (around ~2x boost in TG in many cases), but brings much less benefit for MoE models like the Qwen3.6 35B-A3B when the experts are in CPU RAM. In my case, I was already extremely VRAM starved, and having to cram the MTP extras into VRAM as well did not exactly help with the performance. But I decided to try anyway, because why not.
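A rough way to reason about when a draft-and-verify scheme like MTP pays off is the standard speculative-decoding estimate: expected tokens produced per verify step, divided by how much more expensive that step is than a plain single-token step. The sketch below is only a toy model, not how llama.cpp measures anything. It assumes each drafted token is accepted independently with the same probability, and `accept_prob`, `draft_len`, and `step_cost_ratio` are hypothetical inputs chosen for illustration.

```python
def expected_tokens_per_step(accept_prob: float, draft_len: int) -> float:
    """Expected tokens emitted per verify step.

    Assumes (simplification) that each drafted token is accepted
    independently with probability accept_prob, and that one extra
    token always comes from the main model's own prediction.
    """
    if accept_prob >= 1.0:
        return float(draft_len + 1)
    # Geometric series: 1 + a + a^2 + ... + a^draft_len
    return (1.0 - accept_prob ** (draft_len + 1)) / (1.0 - accept_prob)


def estimated_speedup(accept_prob: float, draft_len: int, step_cost_ratio: float) -> float:
    """Estimated TG speedup over plain decoding.

    step_cost_ratio is a hypothetical number: how many times slower one
    draft-plus-verify step is than one ordinary single-token step.
    """
    return expected_tokens_per_step(accept_prob, draft_len) / step_cost_ratio


if __name__ == "__main__":
    # Illustrative values only, not measurements.
    for ratio in (1.2, 2.5):
        print(ratio, round(estimated_speedup(0.8, 3, ratio), 2))
```

Under this model, the same acceptance rate gives a large gain when the verify step is cheap and almost none when it is expensive, which is one plausible reading of why a VRAM-starved setup with experts in CPU RAM sees little TG boost.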

[-]

Agreeable-Market-692@reddit

You are overcomplicating it.

Does it fit in vram? Then expect an MTP advantage.

Latency is what kills the benefit of MTP. If you are offloading, you have no reason to use MTP.

It is just that simple.

I am getting up to 3x boost in TG with Qwen 3.6 35B-A3B because it fits in vram.

Dense or sparse it matters not. Does it fit in vram is the only question you should ask.

[-]

SkyFeistyLlama8@reddit

It works fine on unified RAM machines where VRAM and system RAM are from the same pool of memory.

[-]

DunderSunder@reddit

at least you can use n-gram

[-]

am17an@reddit

k

[-]

pmttyji@reddit

Few questions Aman.

I've never seen anyone try MTP on CPU-only. Will it work? (I plan to use small models on my old laptop.)
So far I see that the CUDA, Vulkan, and ROCm backends are up with MTP. Any other backends? Or are all backends up already?
I see that currently only Qwen 3.5/3.6 models work with MTP. Big thanks for your work on that. And you're currently working on Gemma 4. What other models can work with the current MTP? And here comes the dumbest question (from a non-coder POV): isn't it possible to write common MTP code that supports all future models? To rephrase, do you (and/or other devs) need to work on MTP changes every time to support future models?

Thanks again.

[-]

am17an@reddit

No idea. It should probably work?
CUDA (and by extension ROCm), Vulkan, and Metal are supported. The rest should follow soon.
Not all models have MTP layers. The current architecture is suitable for adding MTP for GLM 5.1, DeepseekV3, StepFun3.5 etc. but not Gemma 4 at the moment

[-]

pmttyji@reddit

Thanks for the response. I'll check CPU-only tomorrow & let you know.

It's good to have more models. Both GLM-5.1 & DeepseekV3 are so big. At least StepFun-3.5 is nice to have, as Q4 comes to around 100GB, so it's possible to run that one with VRAM + RAM. So will StepFun-3.5 work with the current llama.cpp version? If yes, then we could ask some quanters to create a StepFun-3.5-MTP GGUF. Same with GLM's previous models like GLM-4.5-Air & GLM-4.7-Flash.

[-]

Agreeable-Market-692@reddit

latency kills the MTP advantage, offloading is a no-go

[-]

fallingdowndizzyvr@reddit

time to update your llama.cpp

Am I the only one that updates llama.cpp daily?

[-]

10F1@reddit

I run yay (CachyOS) multiple times a day, so sometimes it updates multiple times a day.

[-]

cleversmoke@reddit

Once a week here

[-]

StardockEngineer@reddit

Every few days here

[-]

pmttyji@reddit

I update when new model support or new feature support (e.g. MTP) lands.

I'll probably update more frequently after getting a new rig this month.

[-]

LumbarJam@reddit

No. Twice a day here.

[-]

Thrumpwart@reddit

Make sure you've installed ccache.

[-]

Amazing_Athlete_2265@reddit

No

[-]

Defiant_Storm3233@reddit

I sometimes update a few times a day.

[-]

IvGranite@reddit

Everyone posts their first rush of benchmarks and they’re all outdated within 24 hours, I’m tired boss lol

[-]

cleversmoke@reddit

I did! 🙋🏻

[-]

jacek2023@reddit (OP)

You should always wait for the dust to settle

[-]

PaceZealousideal6091@reddit

One thing I have learnt: dust never settles in llama.cpp. They are always eking out just a little bit more. So it's best to get to things only when you absolutely need them or you have time to kill. 😉

[-]

IvGranite@reddit

Yeah I know, but it’s all just so exciting lol

[-]

jtjstock@reddit

Ok, so skipping logits during decode is pretty obvious. What about skipping MTP entirely until it matters? Do the MTP heads really need the full context? I doubt it. I've been using a custom (albeit slop) fork that does that: it gets to about 85% of non-MTP prefill speed when the prefill is large, and lets things go through MTP for smaller incremental prompts.

[-]

andreasntr@reddit

If you have data and feel confident, you can open an issue or a discussion

[-]

jtjstock@reddit

Well, per what I've read online, they do not; they only look at a small number of preceding tokens. But I may be missing something, and I have little more than a slop fork lol (well, and some numbers from casual testing, nothing in depth). I have no interest in adding to the bombardment of ill-formed ideas that the developers, who actually are familiar with the code and know the math involved, are dealing with…

So for now, I am assuming that either I am missing something, or they know, and need to figure out the maintainable way to do so.

[-]

StorageHungry8380@reddit

Has anyone picked up why MTP negatively affects prompt processing? Is it just code issues like what this PR fixes, or is there something fundamental? As I understood it, the extra output token prediction layers were just tacked on at the end of the model; if so, I don't see a fundamental reason why it should affect processing speed. Did I miss anything?

[-]

marking89@reddit

IIRC the big model needs to send the PP data to the MTP model too, to keep it in sync and predict correctly. Right now it's going VRAM->RAM->VRAM. They wanted the MTP PR merged before addressing this. They work fast!

[-]

StorageHungry8380@reddit

Then it's implemented very differently from what I understood. I understood it as just another (few) layers attached to the final layer of the model, before the final softmax. Anyway, cheers.

[-]

lolwutdo@reddit

Decent improvement, but MTP still cuts my prompt processing speed in half compared to running without it.

Losing out on nearly 1000t/s PP just for an extra +10t/s TG, not really worth it.
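Whether that trade is worth it depends on how long the prompt is relative to the reply. As a back-of-the-envelope sketch, total time is prefill time plus generation time, so there is a break-even reply length above which the faster TG makes up for the slower PP. The function names and all speeds below are hypothetical; the comment above gives only the differences (about 1000 t/s less PP, about 10 t/s more TG), so the baselines here are assumptions picked to match those differences.

```python
def total_seconds(prompt_tokens: int, gen_tokens: float, pp_tps: float, tg_tps: float) -> float:
    """End-to-end time: prefill time plus generation time."""
    return prompt_tokens / pp_tps + gen_tokens / tg_tps


def break_even_gen_tokens(prompt_tokens: int, pp_off: float, pp_on: float,
                          tg_off: float, tg_on: float) -> float:
    """Reply length (in tokens) above which MTP is faster overall.

    pp_off/tg_off are speeds without MTP, pp_on/tg_on with MTP,
    all in tokens per second. Assumes tg_on > tg_off.
    """
    extra_prefill = prompt_tokens / pp_on - prompt_tokens / pp_off
    saved_per_token = 1.0 / tg_off - 1.0 / tg_on
    return extra_prefill / saved_per_token


if __name__ == "__main__":
    # Assumed baselines (not from the thread): PP 2000 -> 1000 t/s, TG 40 -> 50 t/s.
    n = break_even_gen_tokens(prompt_tokens=8000, pp_off=2000, pp_on=1000,
                              tg_off=40, tg_on=50)
    print(f"MTP wins only if the reply is longer than about {n:.0f} tokens")
```

With these assumed numbers, an 8000-token prompt needs a reply of roughly 800 tokens before MTP comes out ahead, so long-prompt, short-answer workloads lose and short-prompt, long-answer workloads win.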

[-]

MaruluVR@reddit

This might be off topic but has Gemma 4 MTP been merged yet? If not what is the pull request?

[-]

pmttyji@reddit

Not yet. Aman is currently working on it.

[-]

TuskNaPrezydenta2020@reddit

there are no PRs for Gemma yet on the topic of mtp, right?

[-]

jacek2023@reddit (OP)

There is one draft (in progress)
