Why might MTP be net negative for tool-heavy agentic flows?

Posted by Substantial_Step_351@reddit | LocalLLaMA | 2 comments

The Qwen3.6-27B MTP benchmarks that have been circulating put acceptance at 62-70% for factual tasks vs 79-89% for code. Tool calls probably sit in that factual range or lower: the output is structured and format-constrained, but the arguments are less predictable than pure code generation. For agents doing dense tool-calling sequences, the PP overhead on each prefill pass might consistently eat the TG benefit. Not obvious MTP is net positive there tbh.
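To put rough numbers on the acceptance argument, here is a minimal sketch using the standard speculative-decoding estimate for tokens emitted per verification pass. It assumes the reported acceptance rate can be treated as an independent per-token probability, which is a simplification, and `draft_len` and `draft_cost` are hypothetical values picked for illustration, not measurements from any benchmark.

```python
def expected_tokens_per_step(accept_rate: float, draft_len: int) -> float:
    """Expected tokens emitted per verification pass.

    Assumes each drafted token is accepted independently with
    probability accept_rate (a simplifying assumption).
    """
    if accept_rate >= 1.0:
        return float(draft_len + 1)
    return (1.0 - accept_rate ** (draft_len + 1)) / (1.0 - accept_rate)


def tg_speedup(accept_rate: float, draft_len: int, draft_cost: float) -> float:
    """Rough TG speedup over plain decoding.

    draft_cost is the cost of drafting one token relative to one full
    forward pass of the main model (hypothetical, not measured).
    """
    tokens = expected_tokens_per_step(accept_rate, draft_len)
    step_cost = 1.0 + draft_cost * draft_len
    return tokens / step_cost


# Sweep the acceptance range quoted in the post; draft_len and draft_cost are made up.
for rate in (0.62, 0.70, 0.79, 0.89):
    print(f"accept={rate:.2f} speedup={tg_speedup(rate, draft_len=2, draft_cost=0.1):.2f}x")
```

Under these assumptions the gap between the factual and code ranges shrinks the TG gain but does not flip it negative on its own, so any net loss would have to come from overhead elsewhere in the turn.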

Anyone actually running it on agentic pipelines seeing a different result?

sisyphus-cycle@reddit

It’s hard to tell, since you only really get the total draft summary after an agent turn. So even if you see 90% acceptance, the rejected 10% might be exactly the tool calls with variable/dynamic params. But most of my tool calls are write/edit/read/web search, so I’d assume MTP can predict the first few tokens of the function call and its arguments pretty consistently. Overall I see a benefit for TG and no change to PP when using MTP with Qwen.

Not sure what you mean about PP overhead for tool calls? I might be interpreting it wrong, but MTP just predicts for token gen, right? Once the tokens are generated they should never go through PP again; they should just get appended to the existing KV cache.

DeProgrammer99@reddit

MTP has some impact on PP, I believe because the MTP layers require their own KV cache: https://github.com/ggml-org/llama.cpp/pull/23198#pullrequestreview-4305586947
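Whether a PP hit actually cancels the TG gain depends on how prefill-heavy each agent turn is. A back-of-the-envelope sketch, assuming a simple two-phase turn (prefill the new tokens, then generate); all speeds, token counts, and the `pp_penalty` fraction below are hypothetical placeholders, not figures from the linked PR:

```python
def turn_time(prompt_tokens: float, gen_tokens: float,
              pp_speed: float, tg_speed: float) -> float:
    """Seconds for one agent turn without MTP (speeds in tokens/s)."""
    return prompt_tokens / pp_speed + gen_tokens / tg_speed


def turn_time_mtp(prompt_tokens: float, gen_tokens: float,
                  pp_speed: float, tg_speed: float,
                  pp_penalty: float, tg_speedup: float) -> float:
    """Seconds for one turn with MTP.

    pp_penalty is the fractional PP slowdown (0.05 means 5% slower),
    tg_speedup is the TG multiplier. Both are assumed inputs.
    """
    prefill = prompt_tokens / (pp_speed * (1.0 - pp_penalty))
    decode = gen_tokens / (tg_speed * tg_speedup)
    return prefill + decode


def break_even_tg_speedup(prompt_tokens: float, gen_tokens: float,
                          pp_speed: float, tg_speed: float,
                          pp_penalty: float) -> float:
    """Smallest TG multiplier at which MTP matches the baseline turn time."""
    base = turn_time(prompt_tokens, gen_tokens, pp_speed, tg_speed)
    slower_prefill = prompt_tokens / (pp_speed * (1.0 - pp_penalty))
    decode_budget = base - slower_prefill
    if decode_budget <= 0:
        return float("inf")  # the prefill penalty alone exceeds the baseline
    return (gen_tokens / tg_speed) / decode_budget


# Hypothetical tool-heavy turn: large tool result prefilled, short tool call generated.
needed = break_even_tg_speedup(prompt_tokens=8000, gen_tokens=60,
                               pp_speed=1000, tg_speed=30, pp_penalty=0.05)
print(f"TG speedup needed to break even: {needed:.2f}x")
```

The point of the sketch is the ratio: the more new tokens a turn prefills (large tool results) relative to what it generates (short tool calls), the higher the TG speedup has to be before MTP pays for itself.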