MTP on Strix Halo with llama.cpp (PR #22673)
Posted by Edenar@reddit | LocalLLaMA | 27 comments
I saw a post about incoming MTP support in llama.cpp, so I tried it out on an AI Max 395 with 128GB DDR5-8000:
I rebuilt the radv container from https://github.com/kyuz0/amd-strix-halo-toolboxes with this PR: https://github.com/ggml-org/llama.cpp/pull/22673
I ran this GGUF: https://huggingface.co/am17an/Qwen3.6-35BA3B-MTP-GGUF/tree/main and added --spec-type mtp --spec-draft-n-max 3
Result: between 60 and 80 tokens/s, up from ~40 tokens/s without MTP (the screenshot shows ROCm, but it's more like 40-45 tokens/s with Vulkan), depending on the subject (some common math stuff seems to be the fastest). PP seems unchanged. The two GGUFs in the screenshot are almost the same size: around 36GB each.
I have yet to try it on Qwen 3.5 122B, and there will be some tweaks to make to the launch parameters, but it's really impressive!
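For reference, a minimal sketch of the same setup done by building llama.cpp directly instead of rebuilding the toolbox container. The checkout and build steps are standard llama.cpp; the GGUF filename is a placeholder, and the --spec-* flags are the ones from the PR as used above:

```bash
# Grab llama.cpp and check out the MTP PR (GitHub exposes PRs as pull/<id>/head)
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
git fetch origin pull/22673/head:mtp
git checkout mtp

# Build with the Vulkan backend (the radv container runs on Mesa's RADV Vulkan driver)
cmake -B build -DGGML_VULKAN=ON
cmake --build build --config Release -j

# Serve the MTP-enabled GGUF with the flags from the post
./build/bin/llama-server \
  -m Qwen3.6-35BA3B-MTP-Q8_0.gguf \
  --spec-type mtp --spec-draft-n-max 3
```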
oShievy@reddit
What does performance look like after 100,000 tokens? Wondering about long-term performance.
Edenar@reddit (OP)
With Qwen 35B-A3B it was around 40 tok/s above 100k and around 30 tok/s above 200k, so token gen isn't an issue at large context. PP, on the other hand... (check my other comments in this thread; basically it drops to 200-300 tok/s PP speed above 200k context)
q-admin007@reddit
Tested it with Proxmox 9, an LXC container with ROCm, and Q8 27B, full context, f16 for KV.
From 7.5 to 17 t/s. Awesome!
ayylmaonade@reddit
This is looking brilliant. Appreciate someone posting some tests on AMD hardware for once! I'm excited to see how the 35B-A3B fares with MTP on my 7900 XTX.
Due_Net_3342@reddit
Did anyone try the 122B?
Jawnnypoo@reddit
On 27B Q8_0, I went from 7.8 t/s generation to 17.28 t/s in my raw llama.cpp test. This is great! Thanks for writing this up.
FullstackSensei@reddit
How does 3.6 27B fare?
Edenar@reddit (OP)
It goes from sluggish to half-decent (same model, same question, MTP on/off):
my_name_isnt_clever@reddit
I run models at Q6; have you tried that as a compromise?
fallingdowndizzyvr@reddit
Where did you get the 27B MTP? Or did you just hack the MTP block into an existing file?
Edenar@reddit (OP)
I used this one: https://huggingface.co/am17an/Qwen3.6-27B-MTP-GGUF
(I think someone linked it in the PR thread on GitHub.)
fallingdowndizzyvr@reddit
Sweet.
FullstackSensei@reddit
20t/s at Q8 is great IMO. I get ~32t/s on two 3090s using Q8_K_XL. On my Jetson AGX Xavier it runs at 2.7t/s.
27B is my workhorse now, and it can do a ton of stuff unattended. I wouldn't mind letting it run at 7 t/s overnight. It won't get as much done as the 3090s, but it also uses a small fraction of the power. If it can get to 10 t/s with 2 concurrent (batched) requests, that would be amazing.
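(For the 2-slot idea, llama-server already supports parallel slots that share the context; a minimal sketch with a placeholder model path, and no guarantee MTP interacts well with batching in this PR:)

```bash
# 2 parallel slots; the -c context is split evenly between them
./build/bin/llama-server \
  -m Qwen3.6-27B-MTP-Q8_0.gguf \
  -c 32768 --parallel 2
```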
CalligrapherFar7833@reddit
That's almost 3x the performance.
Edenar@reddit (OP)
Yeah!
Here it is at Q4 (I wouldn't use it, quality is too low compared to Q8):
kant12@reddit
I'm seeing similar results on mine. This is really great so far.
clintonium119@reddit
I just tested this on my ROG Flow Z13, and it gave me a big boost. All models are Q6, except the 'normal' 27B, which was a Q5. I just used a local bench script that runs a simple test 3 times each.
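(The script itself wasn't posted; a minimal sketch of that kind of run-it-3-times test using llama.cpp's bundled llama-bench, with a placeholder model path:)

```bash
# 3 repetitions of a 512-token prompt pass and a 128-token generation pass,
# reported as mean t/s with stddev
./build/bin/llama-bench \
  -m qwen3.6-27b-q6_k.gguf \
  -p 512 -n 128 -r 3
```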
overand@reddit
What's your prompt processing speed like?
Edenar@reddit (OP)
With Vulkan + RADV and the Qwen 3.6 Q8 from the post (MTP on, but it shouldn't change PP much):
- 700 tok/s at 20k context
- 240 tok/s at 215k context (14min47s PP)
With ROCm (MTP, same model):
- 850 tok/s at 10k context
- 261 tok/s at 215k context (13min40s PP)
overand@reddit
Not a speed demon, but competent. Thanks for sharing! (And yeah, I'd assumed MTP wouldn't affect PP, but I'm just a dumdum, so who knows!)
silverud@reddit
I tested that PR out, and the two Qwen models from am17an's repo.
Performance was not great on a MacBook M3 Max w/ 128GB of unified memory. I managed to hit 61 t/s on 35B-A3B, but I had to set --spec-draft-n-max to 1 to do that. Values of 2 or higher got me the same or worse performance than I get from a stock Q8_0 GGUF of 35B-A3B.
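(If anyone wants to reproduce that sweep, a rough sketch, assuming the PR exposes the same --spec-* flags on llama-cli, with a placeholder model path:)

```bash
# Compare generation speed across draft lengths on an identical prompt
for n in 1 2 3 4; do
  echo "--- spec-draft-n-max=$n ---"
  ./build/bin/llama-cli \
    -m Qwen3.6-35BA3B-MTP-Q8_0.gguf \
    --spec-type mtp --spec-draft-n-max "$n" \
    -p "Summarize the plot of Hamlet." -n 256 --no-display-prompt
done
```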
metigue@reddit
Prompt ingestion t/s? Just curious about Strix Halo.
Edenar@reddit (OP)
Qwen 3.6 35B is around 1000-1200 tok/s PP at low context (with ROCm! Right now I'm testing RADV since token gen is higher with it).
Just did a stupid test with Qwen 3.6 35B (the Q8 MTP quant from the post, MTP on): I dumped almost the entire first Pandora book (Peter F. Hamilton) and asked it to summarize it.
It took 14min47s to process 214,824 tokens (242 tok/s on average, but it was far faster early on). Then it generated 3,352 tokens at 28.76 tok/s (1min56s). I will patch the ROCm image with the MTP PR to see if it's better.
Rattling33@reddit
Wow niiice!
EarAdministrative742@reddit
Is the quality the same?
Edenar@reddit (OP)
I believe so, yes.
Everlier@reddit
That's pretty nice, looking forward to trying it out!