llama.cpp MTP support landed - Qwen3.6 27B at 2.44× on a Strix Halo, 2.17× on an RTX 3090 rig

Posted by C_Coffie@reddit | LocalLLaMA

PR #22673 (commit 4f13cb7) landed MTP (multi-token prediction) speculative decoding in mainline llama.cpp on May 16. I tested it on two separate rigs.

Qwen3.6 27B, single-stream chat, temperature 0, median of 5 runs:

Strix Halo (Framework Desktop, ROCm 7.0.2):

Single RTX 3090 @ 450W (CUDA 12.9, driver 590.26):

Dual RTX 3090, layer-split:

Qwen3.6 35B-A3B (MoE):

Enable with --spec-type draft-mtp --spec-draft-n-max N. Output is byte-identical to baseline at the same seed and temperature.
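For anyone who wants to script a sweep over N, here is a minimal Python sketch that builds the command line and compares outputs. Only the two `--spec-*` flags come from the post above. The `llama-server` binary with `-m`, the `model.gguf` path, and the helper names are assumptions for illustration, so adjust them to your own build.

```python
import hashlib
from typing import List, Optional


def mtp_flags(n_max: int) -> List[str]:
    """Return the MTP flags quoted in the post for a given draft length N."""
    if n_max < 1:
        raise ValueError("n_max must be at least 1")
    return ["--spec-type", "draft-mtp", "--spec-draft-n-max", str(n_max)]


def server_command(model_path: str, n_max: Optional[int] = None) -> List[str]:
    """Build an argv list; n_max=None gives the baseline without MTP.

    Assumption: the binary is `llama-server` and takes the model via `-m`.
    """
    cmd = ["llama-server", "-m", model_path]
    if n_max is not None:
        cmd += mtp_flags(n_max)
    return cmd


def same_output(baseline: bytes, candidate: bytes) -> bool:
    """Check byte-identity of two captured completions via SHA-256."""
    return hashlib.sha256(baseline).digest() == hashlib.sha256(candidate).digest()


if __name__ == "__main__":
    # Hypothetical model path; prints one command per draft length to try.
    for n in (None, 1, 2, 3, 4):
        print(" ".join(server_command("model.gguf", n)))
```

The argv lists can be passed straight to `subprocess.run`. `same_output` is only meaningful when both runs use the same prompt, seed, and temperature.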

MTP helps the MoE model less because only ~3B of its 35B params run per token, so each forward pass is already cheap and skipping N-1 of them is a smaller win. The sweet-spot N also depends on the rig: the uncapped 3090 prefers n=2 at Q4, while the capped 3090 and Strix Halo prefer n=3.
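A rough way to reason about why the sweet spot moves between rigs is the textbook speculative-decoding estimate. This is a sketch and not something measured in these runs. It assumes each drafted token is accepted independently with probability `accept_rate`, and that every drafted token adds a fixed extra cost `cost_per_draft` (drafting plus verifying) relative to one baseline forward pass. Both parameters are hypothetical knobs, not values taken from the benchmarks.

```python
def expected_tokens_per_step(accept_rate: float, n_draft: int) -> float:
    """Expected tokens emitted per verify step with n_draft drafted tokens.

    Geometric-series result: (1 - a^(n+1)) / (1 - a), under the assumption
    that acceptances are independent with probability a.
    """
    if not 0.0 <= accept_rate <= 1.0:
        raise ValueError("accept_rate must be in [0, 1]")
    if n_draft < 0:
        raise ValueError("n_draft must be non-negative")
    if accept_rate == 1.0:
        return float(n_draft + 1)
    return (1.0 - accept_rate ** (n_draft + 1)) / (1.0 - accept_rate)


def modeled_speedup(accept_rate: float, n_draft: int, cost_per_draft: float) -> float:
    """Modeled throughput ratio versus plain decoding (1.0 means no gain)."""
    step_cost = 1.0 + n_draft * cost_per_draft  # in units of one baseline pass
    return expected_tokens_per_step(accept_rate, n_draft) / step_cost


def best_n(accept_rate: float, cost_per_draft: float, max_n: int = 8) -> int:
    """Draft length in 1..max_n that maximizes the modeled speedup."""
    return max(range(1, max_n + 1),
               key=lambda n: modeled_speedup(accept_rate, n, cost_per_draft))
```

In this model a higher per-draft cost pushes the best N down, which is at least consistent with different rigs preferring different N. It says nothing about the actual acceptance rates or costs on the hardware above.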

Couple of follow-ups from the last thread:

MTP writeup with both rigs side-by-side, build commands, and per-shape tables: https://calebcoffie.com/blog/benchmarking-llama-cpp-mtp-on-strix-halo

Raw YAML per run: https://github.com/CCoffie/CalebCoffie.com/tree/main/content/benchmarks/runs
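If you want to recompute a headline ratio from per-run numbers, a minimal sketch of the "median of 5 runs" speedup is below. The function name is hypothetical and the tokens/s values in the example are made-up placeholders, not results from these rigs. Loading the YAML is left out because the schema isn't described here.

```python
from statistics import median
from typing import Sequence


def median_speedup(baseline_tps: Sequence[float], mtp_tps: Sequence[float]) -> float:
    """Ratio of median MTP throughput to median baseline throughput."""
    if not baseline_tps or not mtp_tps:
        raise ValueError("need at least one run for each configuration")
    return median(mtp_tps) / median(baseline_tps)


if __name__ == "__main__":
    # Placeholder tokens/s values for illustration only.
    baseline = [10.0, 10.2, 9.9, 10.1, 10.0]
    with_mtp = [20.0, 20.4, 19.8, 20.2, 20.0]
    print(f"{median_speedup(baseline, with_mtp):.2f}x")
```

Taking the median of each configuration before dividing keeps a single slow or fast outlier run from skewing the ratio.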