MTP on strix halo with llama.cpp (PR #22673)

Posted by Edenar@reddit | LocalLLaMA | 27 comments


I saw a post about incoming MTP support in llama.cpp, so I tried it out on an AI Max 395 with 128 GB DDR5-8000:
I rebuilt the radv container from https://github.com/kyuz0/amd-strix-halo-toolboxes with that PR: https://github.com/ggml-org/llama.cpp/pull/22673
I ran this GGUF: https://huggingface.co/am17an/Qwen3.6-35BA3B-MTP-GGUF/tree/main and added `--spec-type mtp --spec-draft-n-max 3`
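For anyone wanting to reproduce this, here is a sketch of what the launch command could look like. Only the two `--spec-*` flags above come from my run; the binary name (`llama-server`), the quant filename, and the other flags are assumptions you'd adapt to your setup.

```shell
# Hedged sketch, not the exact command I used.
# Model filename below is a placeholder -- pick whichever quant you
# downloaded from the am17an/Qwen3.6-35BA3B-MTP-GGUF repo.
llama-server \
  -m Qwen3.6-35BA3B-MTP-Q8_0.gguf \
  -ngl 99 \
  --spec-type mtp \
  --spec-draft-n-max 3
# --spec-type mtp        -> enables the MTP speculative decoding from PR #22673
# --spec-draft-n-max 3   -> drafts up to 3 tokens per decoding step
```

Raising or lowering `--spec-draft-n-max` trades draft depth against acceptance rate, so it's worth sweeping a few values on your own prompts.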

Result: between 60 and 80 token/s, up from roughly 40 token/s without MTP, depending on the subject (some common math prompts seem to be the fastest). In the screenshot I was trying ROCm; with Vulkan it's more like 40-45 token/s without MTP. Prompt processing seems unchanged. The two GGUFs in the screen capture are almost the same size: around 36 GB each.

I have yet to try it on Qwen 3.5 122B, and the launch parameters will need some tweaking, but it's really impressive!