llama.cpp docker images to run MTP models
Posted by havenoammo@reddit | LocalLLaMA
There have been many improvements to the MTP pull request and to the llama.cpp main branch, such as image support and various bug fixes. I recently made a new build for my local machine, but keeping guides up to date is an issue, so I built Docker images to make running MTP models easier. If you are already using llama.cpp Docker images, switching over should be straightforward, at least until official builds support MTP.
Here, pick your flavour:
havenoammo/llama:cuda13-server
havenoammo/llama:cuda12-server
havenoammo/llama:vulkan-server
havenoammo/llama:intel-server
havenoammo/llama:rocm-server
I have not been able to test all of them, as I only run cuda13 for now. Feel free to give it a test and see if it works for your hardware.
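If you are coming from the official images, switching is usually just a matter of swapping the image name in your existing docker run or compose setup. For a quick smoke test before pulling a model (this assumes the image's entrypoint is llama-server, as the full command below suggests):

docker run --rm havenoammo/llama:cuda13-server --version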
Also, Unsloth released MTP models for Qwen 3.6, which makes my previous grafted models obsolete. You can find them here if you missed them:
https://huggingface.co/unsloth/Qwen3.6-27B-MTP-GGUF
https://huggingface.co/unsloth/Qwen3.6-35B-A3B-MTP-GGUF
I believe they quantize some of the MTP layers. I kept mine at Q8 for better draft prediction. It is possible that keeping the MTP layers at higher precision makes their predictions more accurate, giving you more speed at the cost of more VRAM usage. I will keep my versions up for now, until I finish some benchmarks and am sure they are fully obsolete.
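If you want to check for yourself what precision the MTP tensors ended up at, the gguf Python package that ships with llama.cpp includes a dump tool. The grep pattern below is just my guess at the tensor naming; adjust it to whatever the MTP tensors are actually called in the file:

pip install gguf
gguf-dump /models/Qwen3.6-27B-MTP-UD-Q8_K_XL.gguf | grep -i mtp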
Finally, here is how I use it:
docker run --gpus all --rm \
-p 8080:8080 \
-v ./models:/models \
havenoammo/llama:cuda13-server \
-m /models/Qwen3.6-27B-MTP-UD-Q8_K_XL.gguf \
--port 8080 \
--host 0.0.0.0 \
-n -1 \
--parallel 1 \
--ctx-size 262144 \
--fit-target 844 \
--mmap \
-ngl -1 \
--flash-attn on \
--metrics \
--temp 1.0 \
--top-p 0.95 \
--top-k 20 \
--min-p 0.0 \
--jinja \
--chat-template-kwargs '{"preserve_thinking":true}' \
--ubatch-size 512 \
--batch-size 2048 \
--cache-type-k q8_0 \
--cache-type-v q8_0 \
--spec-type mtp \
--spec-draft-n-max 3
Adjust as you see fit. What matters most for MTP is --spec-type mtp and --spec-draft-n-max 3.
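Once the model has finished loading, a couple of quick sanity checks against the standard llama-server endpoints (port as configured above):

# readiness probe
curl -s http://localhost:8080/health

# short generation through the OpenAI-compatible API
curl -s http://localhost:8080/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{"messages":[{"role":"user","content":"hello"}],"max_tokens":32}'

# Prometheus counters, exposed because of --metrics above
curl -s http://localhost:8080/metrics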
relmny@reddit
what about the build (patches)? are those edits on your previous post still valid? which is the preferred way?
havenoammo@reddit (OP)
I updated the build numbers in the Hugging Face READMEs but not the Reddit posts. Let me update them really quick!
relmny@reddit
thanks,
btw, is it "mtp" or "draft-mtp"? Because with "mtp" I get a --spec-type error, and the help doesn't show "mtp", it only shows:
--spec-type none,draft-simple,draft-eagle3,draft-mtp,ngram-simple,ngram-map-k,ngram-map-k4v,ngram-mod,ngram-cache
havenoammo@reddit (OP)
It is possible they renamed mtp to draft-mtp, give it a try and let us know! My local build still works with mtp.
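A quick way to see exactly which values your image accepts, assuming the entrypoint passes arguments straight to llama-server:

docker run --rm havenoammo/llama:cuda13-server --help 2>&1 | grep -A1 spec-type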
relmny@reddit
thanks, yeah, I did, and I don't see a difference, except I lose PP speed.
Tried Qwen3.6 35B and 27B on a 4080 Super (16GB VRAM), both vanilla and MTP (both Unsloth), with max 2 and max 3, in a single turn.
cleversmoke@reddit
Awesome work! I had a whole guide written up on getting Docker set up with MTP PR 22673, but the Reddit autobot flagged it and the mods wouldn't reply about approving it, so it never saw the light of day. Your guide will work splendidly for the community though!
I also tried out your MTP quants, thank you!
fragment_me@reddit
I get that your point is to show that Unsloth does quant the MTP layers at smaller quants, but at first glance it's weird and distracting when you show all of the other layers as well. It looks more like a comparison of your quant vs Unsloth's lower/smaller quant.
havenoammo@reddit (OP)
Yeah, it was a quick edit. I just ran some benchmarks and figured out the Unsloth MTP models are just as good. Edited the post now!
fragment_me@reddit
nice
Prudence-0@reddit
We don't care about Unsloth; what interests us is performance and quality.
If there are better MTP models, I'll take them.
CircularSeasoning@reddit
I find it mildly amusing how we're, in essence, speculatively drafting llama.cpp itself with draft PRs, trying to get access to faster speculative decoding inference sooner.
I like the energy. Keep on! And will somebody please give this kindly person more ammo.
havenoammo@reddit (OP)
Haha, thank you very much, I appreciate it! ❤️
Solidified4ever@reddit
Gemma 4 needed. Thanks for your work.
havenoammo@reddit (OP)
I will look into it and see if it works, will let you know!
suprjami@reddit
Thanks for doing these quants and builds. It's a big time saver and makes this awesome feature accessible to more people.
Not every day someone gets to say they make a better quant than Unsloth!
havenoammo@reddit (OP)
Haha, thanks! ❤️ Though it is just Unsloth + MTP grafted at Q8. I did not do much, it is their great work. I love Unsloth's work and just wanted to make their models usable before the PR was merged. Theoretically, having Q8 MTP layers should increase precision and hence speed. I will benchmark it and let you all know the results! If the difference turns out to be only 1-2% accuracy, then saving 200MB might be the better choice for people with less VRAM.
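For anyone who wants to reproduce the comparison, one crude A/B is to start the server twice, once with --spec-type none and once with --spec-type mtp, and time an identical deterministic request against each. This is just a sketch; the timings in the server log are more precise:

time curl -s http://localhost:8080/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{"messages":[{"role":"user","content":"Write a 500-word story."}],"max_tokens":512,"temperature":0}' \
  > /dev/null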
soyalemujica@reddit
I have issues with MTP on Vulkan: with it I only get as high as 27 t/s, while without it I get 42 t/s on my 7900 XTX. It's strange. I'm on Ubuntu 26.04. Has anyone else encountered this issue?
BringMeTheBoreWorms@reddit
I think it's because MTP adds VRAM overhead that, on a 24GB card, forces the cache into regular RAM.
I was playing with this last night and noticed it doing that. I had much better throughput splitting over two XTX cards, but it'd be better if I could get it on one.
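One way to confirm whether the cache is spilling is to watch VRAM while the model loads. On AMD, something like this should work (double-check the rocm-smi flags on your system):

watch -n 1 rocm-smi --showmeminfo vram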
Mountain_Patience231@reddit
I used to get 2x under MTP with the old MTP build, but after the latest update I'm seeing the same as you: the acceptance rate is ridiculously high (0.9), but the TG is meh...
havenoammo@reddit (OP)
I see some people reporting no gain with Vulkan in the original PR https://github.com/ggml-org/llama.cpp/pull/22673, while others do report a gain. There was some work in progress, so it could be a speed regression. By the way, is your report based on the Docker images I built? Maybe give the ROCm one a try. It is possible I messed something up during the build, since I couldn't test it without the hardware. Also, feel free to share how you run this; maybe someone else will catch something about your configuration.
soyalemujica@reddit
Not related to your build images; it's with that draft PR specifically. It was actually working very nicely at first, but after some changes its performance dropped.
Prudence-0@reddit
A big thank you, I gained +34% perf on my RTX 3090
Prudence-0@reddit
Does it also work for gemma-4?
grumd@reddit
Thanks havenoammo! You've done a lot to push MTP with Qwen recently!
I'd recommend adding --min-p 0.0 to the command; the default is 0.1.
havenoammo@reddit (OP)
❤️ Thanks! Good call, just added that.
metmelo@reddit
The hero we don't deserve.