How do I use MTP?
Posted by WhatererBlah555@reddit | LocalLLaMA | 20 comments
Hi,
I'm trying to use MTP with llama.cpp. I built the mtp-pr branch from source and downloaded an MTP model from Hugging Face: https://huggingface.co/unsloth/Qwen3.6-27B-GGUF-MTP/resolve/main/Qwen3.6-27B-Q6_K.gguf
But when I run the model I get this error:
error while handling argument "--spec-type": unknown speculative decoding type without draft model
Can someone tell me what I'm doing wrong?
GodComplecs@reddit
OP, what commands did you run in the end? Also, use 2 instead of 3 for the draft tokens; it's the fastest for MTP according to Unsloth.
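Something like this, assuming the flags used elsewhere in this thread and the default CMake build dir (the model path is just a placeholder):
$ ./build/bin/llama-server -m /path/to/Qwen3.6-27B-Q6_K.gguf \
    --spec-type mtp --spec-draft-n-max 2   # MTP speculative decoding, 2 draft tokens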
WhatererBlah555@reddit (OP)
I use this script to build for ROCm and CUDA:
Might contain bugs.
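Roughly it just does two separate CMake builds, one per backend; the flag names below are the current GGML ones and may differ on older llama.cpp trees:
# CUDA build
$ cmake -B build-cuda -DGGML_CUDA=ON
$ cmake --build build-cuda --config Release -j
# ROCm build
$ cmake -B build-rocm -DGGML_HIP=ON
$ cmake --build build-rocm --config Release -j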
afd8856@reddit
Make sure to check out / switch to the right branch before building.
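A quick sanity check before building (plain git, nothing MTP-specific):
$ git branch --show-current   # should print the MTP PR branch, not master
$ git log -1 --oneline        # top commit should match the PR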
WhatererBlah555@reddit (OP)
I'm on mtp-pr branch:
The AI is suggesting that I should have two versions of the model, one "standard" and one MTP, and use the standard version as the draft model. Is that correct?
grumd@reddit
AI doesn't know about MTP support that was only recently added.
Post the full list of commands starting from cloning the llama.cpp repo and ending at running the llama-server. You are most likely on the wrong branch.
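Roughly the shape that list should have, with the PR number taken from fizzy1242's output below and everything else a placeholder (adjust the backend flag and paths for your setup):
$ git clone https://github.com/ggml-org/llama.cpp
$ cd llama.cpp
$ git fetch origin pull/22673/head:pr-22673-mtp   # fetch the MTP PR as a local branch
$ git switch pr-22673-mtp
$ cmake -B build -DGGML_CUDA=ON && cmake --build build --config Release -j
$ ./build/bin/llama-server -m /path/to/model-mtp.gguf --spec-type mtp --spec-draft-n-max 2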
HomsarWasRight@reddit
So many people get this stuff wrong, which is why using LLMs well is actually a skill. This is doubly true when running local models.
Having a sense of what it could and couldn’t understand is important. But if you specifically instructed it to, say, research the PR, then do a broader search on what MTP is and how it works, it would do a better job of directing you.
relmny@reddit
Except when using web search, isn't it?
HomsarWasRight@reddit
The point is, LLMs don’t often know when to use a web search. So knowing when to actually prompt them to do it is often essential. They’ll often happily spout blatantly false information.
fizzy1242@reddit
Might be the wrong PR checkout. Here's what mine says:
$ git status
On branch pr-22673-mtp
nothing to commit, working tree clean
Evening_Barracuda_20@reddit
I get this error:
error while handling argument "--spec-type": unknown speculative type: mtp
Latest Qwen3.6-35B-A3B MTP, launched with --spec-type mtp --spec-draft-n-max 2
Compiled following the instructions here: https://unsloth.ai/docs/models/qwen3.6#mtp-guide
git status
On branch mtp-clean
Your branch is up to date with 'origin/mtp-clean'.
nothing to commit, working tree clean
Any advice?
WhatererBlah555@reddit (OP)
Make sure you're running the right binary:
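For example (assuming the default CMake build dir):
$ which llama-server                    # catch a stale system-wide install
$ ./build/bin/llama-server --version    # commit should match the MTP branch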
Evening_Barracuda_20@reddit
./llama-server --version
ggml_cuda_init: found 1 CUDA devices (Total VRAM: 24118 MiB):
Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes, VRAM: 24118 MiB
version: 105 (e7b4848)
built with GNU 11.4.0 for Linux x86_64
It seems to work when using --spec-type draft-mtp instead of --spec-type mtp.
llama-server messages:
[46013] srv load_model: creating MTP draft context against the target model '/llm/Qwen3.6-35B-A3B-UD-IQ4_XS_MTP.gguf'
[46013] common_speculative_init: adding speculative implementation 'draft-mtp'
[46013] slot print_timing: id 0 | task 2663 |
[46013] prompt eval time = 418.42 ms / 143 tokens ( 2.93 ms per token, 341.76 tokens per second)
[46013] eval time = 829.00 ms / 116 tokens ( 7.15 ms per token, 139.93 tokens per second)
[46013] total time = 1247.43 ms / 259 tokens
[46013] draft acceptance rate = 1.00000 ( 76 accepted / 76 generated)
[46013] statistics draft-mtp: #calls(b,g,a) = 24 2494 1997, #gen drafts = 1997, #acc drafts = 1997, #gen tokens = 3584, #acc tokens = 3554, dur(b,g,a) = 0.020, 9289.926, 0.287 ms
Is it OK?
ImNotAMan@reddit
I'm curious if anyone else has managed to get MTP working with multimodality, because I had to pull the atomic chat turboquant llama.cpp fork and modify it just to get close.
Currently I have it set up so that it keeps both the MTP draft model and the mmproj loaded and switches between the two based on whether the mmproj is used or not.
VoidAlchemy@reddit
At least on ik_llama.cpp, which has MTP support in main, it wasn't working with mmproj for a while, until this PR got merged yesterday: https://github.com/ikawrakow/ik_llama.cpp/pull/1758
I'm not sure whether the mainline MTP draft PR works with mmproj or not; I still need to test that.
ImNotAMan@reddit
Thank you, this seems like what I'm looking for.
Organic_Scarcity_495@reddit
Make sure you're using the MTP build of llama.cpp; the regular build doesn't have the MTP flag. --spec-draft-n-max 3 is a good start. Also check that your model has the MTP head (the Unsloth ones are fine). The main mistake I see is running MTP on a regular GGUF without the extra head layer.
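If you want to double-check that the head is actually in your GGUF, the gguf-dump tool from llama.cpp's gguf-py package can list the metadata and tensors; grepping for mtp is only a rough heuristic, since exact tensor names vary by model:
$ pip install gguf                               # llama.cpp's gguf-py tooling
$ gguf-dump /path/to/model.gguf | grep -i mtp    # look for MTP head tensors/metadata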
houchenglin@reddit
It depends on your patch; some patches use "-mtp" for this feature.
See the help for details:
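For example (binary name and flag spelling depend on your build):
$ ./llama-server --help | grep -i -A2 mtp   # shows whichever MTP option your patch exposes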
houchenglin@reddit
jacek2023@reddit
maybe you should show your command?
bonobomaster@reddit
Let's see your llama-server config, especially the full --spec-type line, and post it immediately next time. Nobody can diagnose shit without the right info.