Testing llama.cpp MTP support on Qwen3.6 - RTX 5090
Posted by 3VITAERC@reddit | LocalLLaMA | 30 comments
Setup:
- RTX 5090, 32 GB, Linux
- Built llama.cpp from 4f13cb7 (the official ghcr.io/ggml-org/llama.cpp:server-cuda image hasn't picked up the merge yet as of writing — had to docker build from source with CUDA_DOCKER_ARCH=120)
- Unsloth's Qwen3.6-27B-MTP-GGUF Q5_K_M and Qwen3.6-35B-A3B-MTP-GGUF UD-Q4_K_M
- 128k context, flash-attn, q8_0 KV cache, temp 0.8, --parallel 1 (required for MTP)
- Same GGUF for "MTP on" and "MTP off" — only the --spec-type draft-mtp --spec-draft-n-max 3 flag toggled. This isolates MTP from quant differences.
- 2 prompts: "short story about a cat" (~400 tokens) and "Flappy Bird clone as a single HTML file" (~3000 tokens)
- 3 seeds per config, averaged
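For anyone wanting to reproduce the on/off toggle, here is a rough sketch of the two llama-server invocations, built in Python so the difference is explicit. The --spec-type and --spec-draft-n-max flags come from the post; the model path, the 131072 value for "128k", and the spelling of the other flags (--flash-attn on, --cache-type-k, --cache-type-v) are assumptions about a recent llama.cpp build, so check `llama-server --help` on your commit.

```python
import shlex


def llama_server_args(model_path: str, mtp: bool) -> list[str]:
    """Build a llama-server argv for the MTP on/off comparison.

    Everything except the two --spec-* flags is identical between runs,
    so the same GGUF and sampling settings are used in both cases.
    """
    args = [
        "llama-server",
        "-m", model_path,
        "-c", "131072",            # assumption: "128k context" means 131072 tokens
        "--flash-attn", "on",      # assumption: flag spelling on recent builds
        "--cache-type-k", "q8_0",  # q8_0 KV cache (keys)
        "--cache-type-v", "q8_0",  # q8_0 KV cache (values)
        "--temp", "0.8",
        "--parallel", "1",
    ]
    if mtp:
        # The only flags toggled in the post.
        args += ["--spec-type", "draft-mtp", "--spec-draft-n-max", "3"]
    return args


MODEL = "/models/Qwen3.6-27B-MTP-Q5_K_M.gguf"  # hypothetical path
mtp_off = llama_server_args(MODEL, mtp=False)
mtp_on = llama_server_args(MODEL, mtp=True)

print(shlex.join(mtp_off))
print(shlex.join(mtp_on))
```

Keeping the shared arguments in one place makes it harder to accidentally change something other than MTP between the two runs.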
OsmanthusBloom@reddit
I'd be interested in your prompt processing speeds. Is there a big difference between MTP and non-MTP chugging through a prompt of, say, 10k tokens?
No-Dot-6573@reddit
I'd second that. The prompt processing speed seems to be affected as well. So if it is slower, MTP might be better turned off for, e.g., short stories with 27B.
tomByrer@reddit
I wonder if shorter runs just see little benefit?
legit_split_@reddit
The difference is much lower now:
https://github.com/ggml-org/llama.cpp/pull/23198
Bulky-Priority6824@reddit
180 t/s on dual 5060 Ti with --parallel 2. Have you tried that?
330d@reddit
LLM inference is memory-throughput bound at np1, so compute is not fully utilized. Parallelism, batching, and MTP all rely on this fact to use that spare compute. Going beyond parallel 1 with MTP would in theory have two mechanisms competing for the same (compute) resource.
unjustifiably_angry@reddit
I wonder if it's possible to run the draft on a second GPU.
tomByrer@reddit
maybe
https://github.com/Luce-Org/lucebox-hub/issues/102
Bulky-Priority6824@reddit
Theory vs measured results. The more I test the more I find.
Plasmx@reddit
Most likely p2 is better for you because of the dual GPU setup. I don’t think there is a benefit for single GPU setups.
kitanokikori@reddit
parallel=1 was required for initial versions of this PR but that isn't true any more
StardockEngineer@reddit
Good to know, thanks.
dco44@reddit
unjustifiably_angry@reddit
Same seed used? Made a huge difference in my test.
fck__spz@reddit
What's your estimate for when I can use Gemma 4 26b with MTP in llama.cpp?
nickm_27@reddit
Hopefully within the next day or two. Aman said he's working on it.
wilo108@reddit
Am I right in thinking llama-bench hasn't been updated to allow testing --spec-draft yet? Will it be?
go0og@reddit
There's an issue open to add these, and it seems to be in progress.
go0og@reddit
Another parameter that affects generation t/s is --spec-draft-p-min. Start with 0.75; I ended up dropping it all the way down to 0.2.
Shoulon@reddit
What's a good guide for learning how to build llama.cpp, what the parameters do, and how to run it in a Docker container? I can ask Claude, for example, but all this specific, custom information seems far too fragmented.
FiLo420blazeit@reddit
Really useful breakdown, thanks for running this. The accept% column is doing all the work here: wherever it hits ~90% (the code prompts) you get a real speedup, and wherever it drops to ~40% (prose) MTP basically just adds overhead. That's expected behavior but it's nice to see it this cleanly on actual hardware.
The spicy result is the 35B MoE on short story going backwards (0.81×). That's the worst case for speculative-style decoding: the base model is already fast (227 tok/s), so a low-acceptance draft can't earn back its own cost and you net negative. The dense model never goes below 1.0× because its baseline is slow enough that even a bad draft is roughly free.
Practical takeaway seems to be: enable MTP for structured/code-ish workloads, leave it off for creative/open-ended generation, especially on the MoE.
Curious what your draft settings were (number of speculative tokens, any threshold on acceptance)? Wonder if tuning those pulls the prose numbers back above 1.0× or if low accept% just kills it regardless.
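The accept-rate argument above can be put into a back-of-envelope model. This is only a sketch under simplifying assumptions: each drafted token is accepted independently with the same probability, drafting stops at the first rejection, and the verification pass always yields one token itself. Under those assumptions the expected tokens per step is the standard speculative-decoding geometric sum (1 - a^(n+1)) / (1 - a). The `step_cost` knob is hypothetical (the cost of one draft-plus-verify step relative to a plain decode step), and the reported accept% may not map exactly onto this per-token rate.

```python
def expected_tokens_per_step(accept_rate: float, n_draft: int) -> float:
    """Expected tokens emitted per draft+verify step.

    Assumes each of the n_draft drafted tokens is accepted independently
    with probability accept_rate, acceptance stops at the first rejection,
    and the verify pass always contributes one token of its own.
    """
    if accept_rate >= 1.0:
        return float(n_draft + 1)
    return (1.0 - accept_rate ** (n_draft + 1)) / (1.0 - accept_rate)


def estimated_speedup(accept_rate: float, n_draft: int, step_cost: float) -> float:
    """Tokens per step divided by the relative cost of a step.

    step_cost is a hypothetical knob: 1.0 means a draft+verify step costs
    the same as one plain decode step; larger values mean more overhead.
    """
    return expected_tokens_per_step(accept_rate, n_draft) / step_cost


# Illustrative values only, not measurements from the post.
for accept_rate in (0.4, 0.9):
    tokens = expected_tokens_per_step(accept_rate, n_draft=3)
    print(f"accept={accept_rate:.0%}: {tokens:.2f} tokens/step, "
          f"break-even step cost = {tokens:.2f}x a plain decode step")
```

In this model MTP only wins when the step cost stays below the expected tokens per step, which is why a low accept rate combined with an expensive verify pass can net out below 1.0×.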
AmoebaDue6638@reddit
Solid benchmarking methodology, especially isolating MTP by toggling only the spec flags on the same GGUF. Curious how much the gains scale with longer context windows.
kitanokikori@reddit
This is an interesting result, I consistently get ~10-15 tok/sec on Qwen3.6-35B-A3B on Strix Halo. Params I'm running for an OpenClaw type app (not an expert! don't take these params as gospel):
Forever_Playful@reddit
What’s the performance at 8bit weights and also 8bit kv cache?
ambient_temp_xeno@reddit
Accepting that it will be bad anyway, the more speedup the short story gets, the worse it will be.
330d@reddit
Any chance you could drop your compose file? Thanks
DepictWeb@reddit
You could also try testing prose at temperature 0.2. I’d expect noticeably higher token acceptance there since the sampling becomes much more deterministic.
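A small sketch of why lower temperature should help here, assuming the draft proposes the single most likely token and that the chance of accepting it roughly tracks how much probability the main model puts on that token. The logits are made up for illustration.

```python
import math


def softmax_with_temperature(logits: list[float], temperature: float) -> list[float]:
    """Softmax over logits / temperature, computed in a numerically stable way."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.5, 1.0, 0.5]  # made-up logits for four candidate tokens

probs_warm = softmax_with_temperature(logits, 0.8)
probs_cold = softmax_with_temperature(logits, 0.2)

# Lower temperature concentrates probability on the top token, so a greedy
# draft guess matches the sampled token more often.
print(f"top-token probability at temp 0.8: {probs_warm[0]:.2f}")
print(f"top-token probability at temp 0.2: {probs_cold[0]:.2f}")
```

With a flat distribution like open-ended prose, the effect of temperature on the top token's share is large, which is consistent with expecting higher acceptance at 0.2.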
youcloudsofdoom@reddit
Thanks for this, it roughly aligns with my experience across both models. On the 35B I really couldn't find any scenarios that seemed to get a speedup. Looks like I should have some patience there...
DepictWeb@reddit
The base MoE throughput is already so high that the MTP verification/rollback overhead can end up costing more than the speculative gains, especially when accept rates are low. Dense models seem to benefit much more consistently from MTP right now.