Llama.cpp MTP with Qwen3.6 27B on Headless RTX 3090

Posted by cleversmoke@reddit | LocalLLaMA

I saw some posts saying prompt processing (PP) is slower with MTP, which made people cautious about trying it.

Here's a real-world datapoint.

Settings:

Use Cases:

Without MTP (llama.cpp:server-cuda13-b9174):

With MTP (latest master fork):

A 41% time saving is huge, so unless your workload is PP-heavy, I'd recommend giving MTP a try on your own use cases! Note that I run a dual-agent setup in which a second critic agent checks the main agent's work, so your total processing times may be better than mine.
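For anyone comparing their own runs, here is a minimal Python sketch of the arithmetic behind a "percent time saved" figure and the speedup factor it implies. The function names are hypothetical, and the timings in the usage lines are placeholders chosen for illustration, not measurements from this post.

```python
def time_savings(baseline_s: float, mtp_s: float) -> float:
    """Fraction of wall-clock time saved relative to the baseline run."""
    if baseline_s <= 0:
        raise ValueError("baseline_s must be positive")
    return (baseline_s - mtp_s) / baseline_s


def speedup_from_savings(savings: float) -> float:
    """Convert a fractional time saving into the equivalent speedup factor."""
    if not 0 <= savings < 1:
        raise ValueError("savings must be in [0, 1)")
    return 1.0 / (1.0 - savings)


# Placeholder timings (seconds), NOT real measurements: 100 s without MTP, 59 s with.
saved = time_savings(100.0, 59.0)
print(f"time saved: {saved:.0%}")
print(f"equivalent speedup: {speedup_from_savings(saved):.2f}x")
```

Under these assumptions, a 41% time saving works out to roughly a 1.69x end-to-end speedup, which is the number to compare against any PP slowdown you see on your own workload.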