llama.cpp MTP support landed - Qwen3.6 27B at 2.44× on a Strix Halo, 2.17× on an RTX 3090 rig
Posted by C_Coffie@reddit | LocalLLaMA | 27 comments
PR #22673 (commit 4f13cb7) landed MTP speculative decoding in mainline llama.cpp on May 16. I tested it on two separate rigs.
Qwen3.6 27B, single-stream chat, temperature 0, median of 5 runs:
Strix Halo (Framework Desktop, ROCm 7.0.2):
- Q4_K_M: 11.7 → 21.2 tok/s (1.81×)
- Q8_0: 7.4 → 18.1 tok/s (2.44×)
Single RTX 3090 @ 450W (CUDA 12.9, driver 590.26):
- Q4_K_M: 38.7 → 59.5 tok/s (1.54×, n=2)
Dual RTX 3090, layer-split:
- Q8_0: 25.7 → 55.9 tok/s (2.17×, n=3)
Qwen3.6 35B-A3B (MoE):
- Strix Halo: 49.5 → 69.4 tok/s (1.40×)
- 3090: 120.0 → 148.3 tok/s (1.24×)
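To put the ratios above in wall-clock terms, here is a small sketch that recomputes each speedup from the rounded tok/s figures and estimates the time saved on a 1000-token reply. The 1000-token reply length and the row labels are my own assumptions for illustration. Ratios recomputed from rounded throughputs can differ from the quoted ones in the last digit.

```python
# (baseline tok/s, MTP tok/s) pairs copied from the results listed above.
RESULTS = {
    "Strix Halo 27B Q4_K_M": (11.7, 21.2),
    "Strix Halo 27B Q8_0": (7.4, 18.1),
    "1x 3090 27B Q4_K_M": (38.7, 59.5),
    "2x 3090 27B Q8_0": (25.7, 55.9),
    "Strix Halo 35B-A3B": (49.5, 69.4),
    "3090 35B-A3B": (120.0, 148.3),
}


def speedup(baseline, mtp):
    """Throughput ratio: MTP tok/s divided by baseline tok/s."""
    return mtp / baseline


def saved_seconds(baseline, mtp, tokens=1000):
    """Generation time saved for a reply of `tokens` tokens (assumed length)."""
    return tokens / baseline - tokens / mtp


for name, (base, mtp) in RESULTS.items():
    print(f"{name}: {speedup(base, mtp):.2f}x, "
          f"saves {saved_seconds(base, mtp):.1f} s per 1000 tokens")
```

The absolute saving is largest where the baseline was slowest, which is why the dense Q8_0 runs feel like the biggest improvement even when the ratios look similar.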
Enable with --spec-type draft-mtp --spec-draft-n-max N. Output is byte-identical to baseline at the same seed and temperature.
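As a rough sketch of how those flags fit into a launch command, the snippet below assembles an argument list for `llama-server`. Only `--spec-type draft-mtp` and `--spec-draft-n-max` come from the post; the model path and context size are placeholders I made up, and this is not the exact command used for the benchmarks.

```python
import shlex


def mtp_server_args(model_path, draft_n_max, ctx_size=8192):
    """Build a llama-server argv with the MTP flags quoted in the post."""
    if draft_n_max < 1:
        raise ValueError("draft_n_max must be at least 1")
    return [
        "llama-server",
        "-m", model_path,                        # hypothetical model path
        "-c", str(ctx_size),                     # assumed context size
        "--spec-type", "draft-mtp",              # flag from the post
        "--spec-draft-n-max", str(draft_n_max),  # flag from the post
    ]


# Example: N=3, the value reported as best on Strix Halo and the capped 3090.
args = mtp_server_args("models/qwen3.6-27b-q8_0.gguf", draft_n_max=3)
print(shlex.join(args))
```

Building the list in code makes it easy to sweep N per rig instead of editing the command by hand.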
MTP helps MoE less because only ~3B of 35B params run per token: each forward pass is already cheap, so saving N-1 of them is a smaller win. The sweet-spot N also depends on the rig: the uncapped 3090 prefers n=2 at Q4, while the capped 3090 and Strix Halo prefer n=3.
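One way to see why the best N shifts is a toy cost model, sketched below. It assumes each drafted token is accepted independently with probability `accept`, and that every drafted token adds a fixed overhead `overhead` measured as a fraction of one baseline forward pass. Both parameters and all values are illustrative assumptions, not measurements from these rigs, and this is not how llama.cpp picks N.

```python
def expected_tokens(n_draft, accept):
    """Expected tokens emitted per verification pass with n_draft drafted tokens.

    Standard geometric-series result under independent acceptance:
    1 + a + a^2 + ... + a^n.
    """
    if accept >= 1.0:
        return n_draft + 1
    return (1 - accept ** (n_draft + 1)) / (1 - accept)


def toy_speedup(n_draft, accept, overhead):
    """Speedup over plain decoding if each drafted token costs `overhead` passes."""
    return expected_tokens(n_draft, accept) / (1 + n_draft * overhead)


def best_n(accept, overhead, max_n=8):
    """Draft length in 1..max_n with the highest toy speedup."""
    return max(range(1, max_n + 1), key=lambda n: toy_speedup(n, accept, overhead))


# Same acceptance rate, two made-up overhead levels.
for overhead in (0.3, 0.5):
    n = best_n(0.8, overhead)
    print(f"overhead={overhead}: best N={n}, speedup={toy_speedup(n, 0.8, overhead):.2f}x")
```

In this model, a higher relative overhead per drafted token (as you would expect when the base forward pass is already cheap) lowers both the peak speedup and the N at which it occurs.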
Couple of follow-ups from the last thread:
- The 3090 numbers in my earlier post were undercut by an undisclosed 200W cap (breaker-popping issue with 4 cards on one circuit). I re-benched 26 of the 3090 models at 350W and 450W; dense 27-32B models gained +70 to +113%. Writeup with the curve and full table: https://calebcoffie.com/blog/benchmarking-llama-cpp-mtp-on-strix-halo
- Prompt-processing tok/s and prompt-token columns are now on every row of the benchmarks page.
MTP writeup with both rigs side-by-side, build commands, and per-shape tables: https://calebcoffie.com/blog/benchmarking-llama-cpp-mtp-on-strix-halo
Raw YAML per run: https://github.com/CCoffie/CalebCoffie.com/tree/main/content/benchmarks/runs
FormalAd7367@reddit
Amazing work. Need to save this and have a thorough look after work.
leiqixin@reddit
Were these tests run with 128k or longer context?
wgaca2@reddit
I get 70 t/s on dual 3090 with 196k context 27b q8
tryunite@reddit
holyyyy I gotta set up my second 3090
wizoneway@reddit
At 132k context on a 5090 I don't see much gain in tg on Q6; now at 96k it's fast.
cafedude@reddit
full llama command line would be nice to see here.
yes_i_tried_google@reddit
Nice but you need to tune your 3090. I get 60 tok/s running at 350w / +130 clock / +500 mem
tryunite@reddit
TIL how to tune my 3090 on Linux, thanks for bringing this to my attention!
Anbeeld@reddit
Temperature 0? Yeah, how very much reflective of real usage.
TypicalPudding6190@reddit
What you mentioned about Qwen 3.6 35B MoE matches what I saw on my 5060 Ti. The jump in tps is not as significant as with dense models.
I was able to get ~65 tps with 131k context, using it with Cline for coding.
I have written it down here for my reference: https://www.compiledthoughts.dev/blog/compiledthoughts_blog_03_kv.html
Thee_Depression@reddit
does this model work on lmstudio?
zkkzkk32312@reddit
No, because they will need to update to work with the new llama.cpp build.
digitalfreshair@reddit
Does it work on CPU only too?
C_Coffie@reddit (OP)
Yeah it should help with CPU only as well but I haven't tested that yet.
gh0stwriter1234@reddit
Probably you want to run the ik_llama fork for CPU though.
overand@reddit
That may or may not have MTP merged in - the two don't have feature parity, going both directions.
gh0stwriter1234@reddit
They had it first: https://www.reddit.com/r/LocalLLaMA/comments/1sz0aaj/ik_llama_now_supports_qwen35_mtp_support_o/
IK is more of an experimental fork anyway... but yeah it has also iterated alongside mainline.
MrMisterShin@reddit
What context length did you use?
s0uldrag0n@reddit
I don't see much difference with my single 3090. Does MTP work with router mode?
Non-Technical@reddit
Yes in router mode.
Non-Technical@reddit
I didn’t get quite as nice numbers as you.
With the MTP Unsloth version of Qwen 3.6 27B at Q6 on an Evo X2 128GB Strix Halo, my tps went from 9 to 20.
ArtisticHamster@reddit
Is there a PR for this with Gemma?
C_Coffie@reddit (OP)
I know the Gemma MTP support was in a llama.cpp fork and I attempted to benchmark that but I couldn't get consistent results so I excluded them.
Shoddy_Bed3240@reddit
Could you also mention any changes in prompt processing speed?
Stock_Ad9641@reddit
Of course it reduces prompt processing speed. Also costs a gigabyte of VRAM.
overand@reddit
Well, if you were using CUDA and llama-server in router mode, you'll have recently gotten back about 500 MB of VRAM from a fix for a bug in CUDA context handling.
gh0stwriter1234@reddit
There are later changes that mitigate some of the prompt-processing speed loss.