MTP on Strix Halo with llama.cpp (PR #22673)
Posted by Edenar@reddit | LocalLLaMA | 27 comments
I saw a post about incoming MTP support in llama.cpp, so I tried it out on an AI Max 395 with 128GB DDR5-8000:
I rebuilt the radv container from https://github.com/kyuz0/amd-strix-halo-toolboxes with this PR: https://github.com/ggml-org/llama.cpp/pull/22673
I ran this GGUF: https://huggingface.co/am17an/Qwen3.6-35BA3B-MTP-GGUF/tree/main and added --spec-type mtp --spec-draft-n-max 3
Result: between 60 and 80 tokens/s, up from ~40 tokens/s without MTP (the screenshot shows ROCm, but it's more like 40-45 tokens/s with Vulkan), depending on the subject (some common math stuff seems to be the fastest). PP seems unchanged. The two GGUFs in the screenshot are almost the same size: around 36GB each.
I have yet to try it on Qwen 3.5 122B, and there will be some tweaks to make to the launch parameters, but it's really impressive!
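For reference, a minimal sketch of the same setup done by building llama.cpp directly instead of rebuilding the toolbox container. The checkout and build steps are standard llama.cpp; the GGUF filename is a placeholder, and the --spec-* flags are the ones from the PR as used above:

```bash
# Grab llama.cpp and check out the MTP PR (GitHub exposes PRs as pull/<id>/head)
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
git fetch origin pull/22673/head:mtp
git checkout mtp

# Build with the Vulkan backend (the radv container runs on Mesa's RADV Vulkan driver)
cmake -B build -DGGML_VULKAN=ON
cmake --build build --config Release -j

# Serve the MTP-enabled GGUF with the flags from the post
./build/bin/llama-server \
  -m Qwen3.6-35BA3B-MTP-Q8_0.gguf \
  --spec-type mtp --spec-draft-n-max 3
```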
oShievy@reddit
What does performance look like after 100,000 tokens? Wondering about long-term performance.
Edenar@reddit (OP)
With Qwen 35B-A3B it was around 40 tok/s above 100k and around 30 tok/s above 200k, so token gen isn't an issue at large context. PP, on the other hand... (check my other comments in this thread; basically it drops to 200-300 tok/s PP speed above 200k context)
q-admin007@reddit
Tested it with Proxmox 9, an LXC container with ROCm, and Q8 27B, full context, f16 for KV.
From 7.5 to 17 t/s. Awesome!
ayylmaonade@reddit
This is looking brilliant. Appreciate someone posting some tests on AMD hardware for once! I'm excited to see how the 35B-A3B fares with MTP on my 7900 XTX.
Due_Net_3342@reddit
Did anyone try the 122B?
Jawnnypoo@reddit
On 27B Q8_0, I went from 7.8 t/s generation to 17.28 t/s in my raw llama.cpp test. This is great! Thanks for writing this up.
FullstackSensei@reddit
How does 3.6 27B fare?
Edenar@reddit (OP)
It goes from sluggish to half-decent (same model, same question, MTP on/off):
my_name_isnt_clever@reddit
I run models at Q6; have you tried that as a compromise?
fallingdowndizzyvr@reddit
Where did you get the 27B MTP? Or did you just hack the MTP block into an existing file?
Edenar@reddit (OP)
I used this one: https://huggingface.co/am17an/Qwen3.6-27B-MTP-GGUF
(I think someone linked it in the PR thread on GitHub.)
fallingdowndizzyvr@reddit
Sweet.
FullstackSensei@reddit
20t/s at Q8 is great IMO. I get ~32t/s on two 3090s using Q8_K_XL. On my Jetson AGX Xavier it runs at 2.7t/s.
27B is my workhorse now, and it can do a ton of stuff unattended. I wouldn't mind letting it run at 7 t/s overnight. It won't get as much done as the 3090s, but it also uses a small fraction of the power. If it can get to 10 t/s with 2 concurrent (batched) requests, that would be amazing.
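(For the 2-slot idea, llama-server already supports parallel slots that share the context; a minimal sketch with a placeholder model path, and no guarantee MTP interacts well with batching in this PR:)

```bash
# 2 parallel slots; the -c context is split evenly between them
./build/bin/llama-server \
  -m Qwen3.6-27B-MTP-Q8_0.gguf \
  -c 32768 --parallel 2
```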
CalligrapherFar7833@reddit
That's almost 3x the performance.
Edenar@reddit (OP)
Yeah!
Here it is at Q4 (I wouldn't use it, quality is too low compared to Q8):
kant12@reddit
I'm seeing similar results on mine. This is really great so far.
clintonium119@reddit
I just tested this on my ROG Flow Z13, and it gave me a big boost. All models are Q6, except the 'normal' 27B, which was a Q5. I just used a local bench script that runs a simple test 3 times each.
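(The script itself wasn't posted; a minimal sketch of that kind of run-it-3-times test using llama.cpp's bundled llama-bench, with a placeholder model path:)

```bash
# 3 repetitions of a 512-token prompt pass and a 128-token generation pass,
# reported as mean t/s with stddev
./build/bin/llama-bench \
  -m qwen3.6-27b-q6_k.gguf \
  -p 512 -n 128 -r 3
```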
overand@reddit
What's your prompt processing speed like?
Edenar@reddit (OP)
With Vulkan + RADV and the Qwen 3.6 Q8 from the post (MTP on, but it shouldn't change PP much):
- 700 tok/s at 20k context
- 240 tok/s at 215k context (14min47s PP)
With ROCm (MTP, same model):
- 850 tok/s at 10k context
- 261 tok/s at 215k context (13min40s PP)
overand@reddit
Not a speed demon, but competent. Thanks for sharing! (And yeah, I'd assumed MTP wouldn't affect PP, but I'm just a dumdum, so who knows!)
silverud@reddit
I tested that PR out, and the two Qwen models from am17an's repo.
Performance was not great on a MacBook M3 Max w/ 128GB of unified memory. I managed to hit 61 t/s on 35B-A3B, but I had to set --spec-draft-n-max to 1 to do that. Values of 2 or higher got me the same or worse performance than I get from a stock Q8_0 GGUF of 35B-A3B.
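(If anyone wants to reproduce that sweep, a rough sketch, assuming the PR exposes the same --spec-* flags on llama-cli, with a placeholder model path:)

```bash
# Compare generation speed across draft lengths on an identical prompt
for n in 1 2 3 4; do
  echo "--- spec-draft-n-max=$n ---"
  ./build/bin/llama-cli \
    -m Qwen3.6-35BA3B-MTP-Q8_0.gguf \
    --spec-type mtp --spec-draft-n-max "$n" \
    -p "Summarize the plot of Hamlet." -n 256 --no-display-prompt
done
```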
metigue@reddit
Prompt ingestion t/s? Just curious about Strix Halo.
Edenar@reddit (OP)
Qwen 3.6 35B is around 1000-1200 tok/s PP at low context (with ROCm! Right now I'm testing RADV since token gen is higher with it).
Just did a stupid test with Qwen 3.6 35B (the Q8 MTP quant from the post, MTP on): I dumped almost the entire first Pandora book (Peter F. Hamilton) and asked it to summarize it.
It took 14min47s to process 214,824 tokens (242 tok/s on average, but it was far faster early on). Then it generated 3,352 tokens at 28.76 tok/s (1min56s). I will patch the ROCm image with the MTP PR to see if it's better.
Rattling33@reddit
Wow niiice!
EarAdministrative742@reddit
Is the quality the same?
Edenar@reddit (OP)
I believe so, yes.
Everlier@reddit
That's pretty nice, looking forward to trying it out!