llama.cpp MTP support landed - Qwen3.6 27B at 2.44× on a Strix Halo, 2.17× on a RTX 3090 rig

Posted by C_Coffie@reddit | LocalLLaMA | 27 comments

PR #22673 (commit 4f13cb7) landed MTP speculative decoding in mainline llama.cpp on May 16. I tested it on two separate rigs.

Qwen3.6 27B, single-stream chat, temperature 0, median of 5 runs:

Strix Halo (Framework Desktop, ROCm 7.0.2):

Q4_K_M: 11.7 → 21.2 tok/s (1.81×)
Q8_0: 7.4 → 18.1 tok/s (2.44×)

Single RTX 3090 @ 450W (CUDA 12.9, driver 590.26):

Q4_K_M: 38.7 → 59.5 tok/s (1.54×, n=2)

Dual RTX 3090, layer-split:

Q8_0: 25.7 → 55.9 tok/s (2.17×, n=3)

Qwen3.6 35B-A3B (MoE):

Strix Halo: 49.5 → 69.4 tok/s (1.40×)
3090: 120.0 → 148.3 tok/s (1.24×)
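
Each multiplier above is the MTP rate divided by the baseline rate. A minimal Python sketch that recomputes them from the rounded tok/s values in this post (the row labels are informal names made up for the sketch). Recomputed ratios can land about 0.01 off the posted ones, presumably because the posted ones come from unrounded medians:

```python
# Baseline tok/s, MTP tok/s, and posted speedup, copied from the results above.
# The row labels are informal names chosen for this sketch.
results = {
    "Strix Halo 27B Q4_K_M": (11.7, 21.2, 1.81),
    "Strix Halo 27B Q8_0": (7.4, 18.1, 2.44),
    "1x3090 27B Q4_K_M": (38.7, 59.5, 1.54),
    "2x3090 27B Q8_0": (25.7, 55.9, 2.17),
    "Strix Halo 35B-A3B": (49.5, 69.4, 1.40),
    "3090 35B-A3B": (120.0, 148.3, 1.24),
}


def speedup(baseline, mtp):
    """Decode speedup: MTP tok/s divided by baseline tok/s."""
    return mtp / baseline


for name, (base, mtp, posted) in results.items():
    print(f"{name}: {speedup(base, mtp):.3f}x (posted {posted:.2f}x)")
```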

Enable with --spec-type draft-mtp --spec-draft-n-max N. Output is byte-identical to baseline at the same seed and temperature.
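
For anyone scripting runs, here is a minimal Python sketch that assembles such a command line. Only the two `--spec-*` flags come from this post; the `llama-server` binary name, the `-m` model flag, and the model path are assumptions for illustration, and the real benchmark commands are in the linked writeup:

```python
import shlex


def mtp_command(model_path, draft_n_max, binary="llama-server"):
    """Build an argv list that enables MTP speculative decoding.

    Only --spec-type and --spec-draft-n-max are taken from the post;
    the binary name and -m are assumptions of this sketch.
    """
    if draft_n_max < 1:
        raise ValueError("draft_n_max must be at least 1")
    return [
        binary,
        "-m", model_path,
        "--spec-type", "draft-mtp",
        "--spec-draft-n-max", str(draft_n_max),
    ]


# Hypothetical model path; substitute your own GGUF file.
argv = mtp_command("models/qwen3.6-27b-q8_0.gguf", 3)
print(shlex.join(argv))
```

Returning a list rather than a string means it can be passed directly to `subprocess.run` without shell quoting problems.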

MTP helps MoE less because only ~3B of the 35B params run per token: each forward pass is already cheap, so saving N-1 of them is a smaller win. The sweet-spot N also depends on the rig: the uncapped 3090 prefers n=2 at Q4, while the capped 3090 and Strix Halo prefer n=3.
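
To see why the best N can shift, here is a back-of-the-envelope model in Python. It is the textbook speculative-decoding estimate, not something measured here: it assumes each drafted token is accepted independently with probability `alpha`, and that each drafted token adds `extra_cost` (a fraction of one plain decode pass) for drafting and verifying it. Both parameters and their values are hypothetical:

```python
def expected_tokens(alpha, n):
    """Expected tokens emitted per verification pass with n drafted tokens,
    assuming each draft is accepted independently with probability alpha."""
    if alpha == 1.0:
        return n + 1
    return (1 - alpha ** (n + 1)) / (1 - alpha)


def est_speedup(alpha, n, extra_cost):
    """Estimated speedup over plain decoding.

    extra_cost is the assumed cost of each drafted token (drafting plus
    verifying it), as a fraction of one plain decode pass.
    """
    return expected_tokens(alpha, n) / (1 + n * extra_cost)


def best_n(alpha, extra_cost, candidates=(1, 2, 3, 4)):
    """Draft length with the highest estimated speedup among the candidates."""
    return max(candidates, key=lambda n: est_speedup(alpha, n, extra_cost))


# Made-up inputs: same acceptance rate, cheap vs. expensive drafted tokens.
for cost in (0.10, 0.30):
    print(f"extra_cost={cost}: best n = {best_n(0.8, cost)}")
```

With these made-up inputs, a higher per-token overhead pushes the best N lower, which is the same direction as the rig-dependent sweet spots above.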

Couple of follow-ups from the last thread:

The 3090 numbers in my earlier post were undercut by an undisclosed 200W cap (breaker-popping issue with 4 cards on one circuit). I re-benched 26 of the 3090 models at 350W and 450W; dense 27-32B models gained +70 to +113%. Writeup with the curve and full table: https://calebcoffie.com/blog/benchmarking-llama-cpp-mtp-on-strix-halo
Prompt-processing tok/s and prompt-token columns are now on every row of the benchmarks page.

MTP writeup with both rigs side-by-side, build commands, and per-shape tables: https://calebcoffie.com/blog/benchmarking-llama-cpp-mtp-on-strix-halo

Raw YAML per run: https://github.com/CCoffie/CalebCoffie.com/tree/main/content/benchmarks/runs

FormalAd7367@reddit

Amazing work. Need to save this and have a thorough look after work.

leiqixin@reddit

Were these tests run with 128k or longer context?

wgaca2@reddit

I get 70 t/s on dual 3090 with 196k context 27b q8

tryunite@reddit

holyyyy I gotta set up my second 3090

wizoneway@reddit

At 132k context on a 5090 I don't see much gain in tg on Q6; now at 96k it's fast.

cafedude@reddit

The full llama.cpp command line would be nice to see here.

yes_i_tried_google@reddit

Nice but you need to tune your 3090. I get 60 tok/s running at 350w / +130 clock / +500 mem

tryunite@reddit

TIL how to tune my 3090 on Linux, thanks for bringing this to my attention!

# enable coolbits first
sudo nvidia-xconfig --cool-bits=8

# +130 core
nvidia-settings -a "[gpu:0]/GPUGraphicsClockOffsetAllPerformanceLevels=130"

# +500 memory clock, entered here as a transfer-rate offset of 1000 (500+500)
nvidia-settings -a "[gpu:0]/GPUMemoryTransferRateOffsetAllPerformanceLevels=1000"

Anbeeld@reddit

Temperature 0? Yeah, how very much reflective of real usage.

TypicalPudding6190@reddit

What you mentioned about Qwen 3.6 35B MoE matches what I saw on my 5060 Ti. The jump in tps is not as significant as with dense models.

I was able to get ~65 tps with 131k context using it with Cline for coding.

I have written it down here for my reference: https://www.compiledthoughts.dev/blog/compiledthoughts_blog_03_kv.html

Thee_Depression@reddit

Does this model work in LM Studio?

zkkzkk32312@reddit

No, because they will need to update it to work with the new llama.cpp build.

digitalfreshair@reddit

Does it work on CPU only too?

C_Coffie@reddit (OP)

Yeah it should help with CPU only as well but I haven't tested that yet.

gh0stwriter1234@reddit

Probably you want to run the ik_llama fork for CPU though.

overand@reddit

That may or may not have MTP merged in - the two don't have feature parity, going both directions.

gh0stwriter1234@reddit

They had it first: https://www.reddit.com/r/LocalLLaMA/comments/1sz0aaj/ik_llama_now_supports_qwen35_mtp_support_o/

IK is more of an experimental fork anyway... but yeah it has also iterated alongside mainline.

MrMisterShin@reddit

What context length did you use?

s0uldrag0n@reddit

I don't see much difference with my single 3090. Does MTP work with router mode?

Non-Technical@reddit

Yes in router mode.

Non-Technical@reddit

I didn’t get quite as nice numbers as you.

With the Unsloth MTP version of Qwen 3.6 27B at Q6 on an Evo X2 128GB Strix Halo, my tps went from 9 to 20.

ArtisticHamster@reddit

Is there a PR for this with Gemma?

C_Coffie@reddit (OP)

I know Gemma MTP support was in a llama.cpp fork. I attempted to benchmark that, but I couldn't get consistent results, so I excluded them.

Shoddy_Bed3240@reddit

Could you also mention any changes in prompt processing speed?

Stock_Ad9641@reddit

Of course it reduces prompt processing speed. Also costs a gigabyte of VRAM.

overand@reddit

Well, if you were using CUDA and llama-server in router mode, you'll have recently gotten back about 500 MB of VRAM from the fix for a bug with CUDA contexts.

gh0stwriter1234@reddit

There are later changes that mitigate some of the prompt processing speed loss.