How do I use MTP?
Posted by WhatererBlah555@reddit | LocalLLaMA | 20 comments
Hi,
I'm trying to use MTP with llama.cpp. I built the mtp-pr branch from source and downloaded an MTP model from Hugging Face: https://huggingface.co/unsloth/Qwen3.6-27B-GGUF-MTP/resolve/main/Qwen3.6-27B-Q6_K.gguf
But when I run the model I get this error:
error while handling argument "--spec-type": unknown speculative decoding type without draft model
Can someone tell me what I'm doing wrong?
GodComplecs@reddit
OP, what commands did you run in the end? Also, use 2 instead of 3 for the draft tokens; it's the fastest for MTP according to Unsloth.
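Something like this, assuming the flags used elsewhere in this thread and the default CMake build dir (the model path is just a placeholder):
$ ./build/bin/llama-server -m /path/to/Qwen3.6-27B-Q6_K.gguf \
    --spec-type mtp --spec-draft-n-max 2   # MTP speculative decoding, 2 draft tokens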
WhatererBlah555@reddit (OP)
I use this script to build for ROCm and CUDA:
Might contain bugs.
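Roughly it just does two separate CMake builds, one per backend; the flag names below are the current GGML ones and may differ on older llama.cpp trees:
# CUDA build
$ cmake -B build-cuda -DGGML_CUDA=ON
$ cmake --build build-cuda --config Release -j
# ROCm build
$ cmake -B build-rocm -DGGML_HIP=ON
$ cmake --build build-rocm --config Release -j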
afd8856@reddit
Make sure to check out / switch to the right branch before building.
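A quick sanity check before building (plain git, nothing MTP-specific):
$ git branch --show-current   # should print the MTP PR branch, not master
$ git log -1 --oneline        # top commit should match the PR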
WhatererBlah555@reddit (OP)
I'm on mtp-pr branch:
The AI is suggesting that I should have two versions of the model, one "standard" and one MTP, and use the standard version as the draft model. Is that correct?
grumd@reddit
AI doesn't know about MTP support that was only recently added.
Post the full list of commands starting from cloning the llama.cpp repo and ending at running the llama-server. You are most likely on the wrong branch.
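Roughly the shape that list should have, with the PR number taken from fizzy1242's output below and everything else a placeholder (adjust the backend flag and paths for your setup):
$ git clone https://github.com/ggml-org/llama.cpp
$ cd llama.cpp
$ git fetch origin pull/22673/head:pr-22673-mtp   # fetch the MTP PR as a local branch
$ git switch pr-22673-mtp
$ cmake -B build -DGGML_CUDA=ON && cmake --build build --config Release -j
$ ./build/bin/llama-server -m /path/to/model-mtp.gguf --spec-type mtp --spec-draft-n-max 2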
HomsarWasRight@reddit
So many people get this stuff wrong, which is why using LLMs well is actually a skill. This is doubly true when running local models.
Having a sense of what it could and couldn’t understand is important. But if you specifically instructed it to, say, research the PR, then do a broader search on what MTP is and how it works, it would do a better job of directing you.
relmny@reddit
Except when using web search, isn't it?
HomsarWasRight@reddit
The point is, LLMs don’t often know when to use a web search. So knowing when to actually prompt them to do it is often essential. They’ll often happily spout blatantly false information.
fizzy1242@reddit
Might be the wrong PR checkout. Here's what mine says:
$ git status
On branch pr-22673-mtp
nothing to commit, working tree clean
Evening_Barracuda_20@reddit
I get this error:
error while handling argument "--spec-type": unknown speculative type: mtp
Latest Qwen3.6-35B-A3B MTP, launched with --spec-type mtp --spec-draft-n-max 2
Compiled following the instructions here: https://unsloth.ai/docs/models/qwen3.6#mtp-guide
git status
On branch mtp-clean
Your branch is up to date with 'origin/mtp-clean'.
nothing to commit, working tree clean
Any advice?
WhatererBlah555@reddit (OP)
Make sure you're running the right binary:
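For example (assuming the default CMake build dir):
$ which llama-server                    # catch a stale system-wide install
$ ./build/bin/llama-server --version    # commit should match the MTP branch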
Evening_Barracuda_20@reddit
./llama-server --version
ggml_cuda_init: found 1 CUDA devices (Total VRAM: 24118 MiB):
Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes, VRAM: 24118 MiB
version: 105 (e7b4848)
built with GNU 11.4.0 for Linux x86_64
It seems to work when using --spec-type draft-mtp instead of --spec-type mtp.
llama-server messages:
[46013] srv load_model: creating MTP draft context against the target model '/llm/Qwen3.6-35B-A3B-UD-IQ4_XS_MTP.gguf'
[46013] common_speculative_init: adding speculative implementation 'draft-mtp'
[46013] slot print_timing: id 0 | task 2663 |
[46013] prompt eval time = 418.42 ms / 143 tokens ( 2.93 ms per token, 341.76 tokens per second)
[46013] eval time = 829.00 ms / 116 tokens ( 7.15 ms per token, 139.93 tokens per second)
[46013] total time = 1247.43 ms / 259 tokens
[46013] draft acceptance rate = 1.00000 ( 76 accepted / 76 generated)
[46013] statistics draft-mtp: #calls(b,g,a) = 24 2494 1997, #gen drafts = 1997, #acc drafts = 1997, #gen tokens = 3584, #acc tokens = 3554, dur(b,g,a) = 0.020, 9289.926, 0.287 ms
Is it OK?
ImNotAMan@reddit
I'm curious if anyone else has managed to get MTP working with multimodality, because I had to pull the atomic chat turboquant llama.cpp fork and modify it just to get close.
Currently I have it set up so that it keeps both the MTP draft model and the mmproj loaded and switches between the two based on whether the mmproj is used or not.
VoidAlchemy@reddit
At least on ik_llama.cpp, which has MTP support in main, it wasn't working with mmproj for a while, until this PR got merged yesterday: https://github.com/ikawrakow/ik_llama.cpp/pull/1758
I'm not sure whether the mainline MTP draft PR works with mmproj or not; I still need to test that.
ImNotAMan@reddit
Thank you, this seems like what I'm looking for.
Organic_Scarcity_495@reddit
Make sure you're using the MTP build of llama.cpp; the regular build doesn't have the MTP flag. --spec-draft-n-max 3 is a good start. Also check that your model has the MTP head (the Unsloth ones are fine). The main mistake I see is running MTP on a regular GGUF without the extra head layer.
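If you want to double-check that the head is actually in your GGUF, the gguf-dump tool from llama.cpp's gguf-py package can list the metadata and tensors; grepping for mtp is only a rough heuristic, since exact tensor names vary by model:
$ pip install gguf                               # llama.cpp's gguf-py tooling
$ gguf-dump /path/to/model.gguf | grep -i mtp    # look for MTP head tensors/metadata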
houchenglin@reddit
It depends on your patch; some patches use "-mtp" for this feature.
See the help for details:
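For example (binary name and flag spelling depend on your build):
$ ./llama-server --help | grep -i -A2 mtp   # shows whichever MTP option your patch exposes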
houchenglin@reddit
jacek2023@reddit
maybe you should show your command?
bonobomaster@reddit
Let's see your llama-server config, especially the full --spec-type line, and post it immediately next time. Nobody can diagnose shit without the right info.