Qwen3.6-27B with MTP grafted on Unsloth UD XL: 2.5x throughput via unmerged llama.cpp PR
Posted by havenoammo@reddit | LocalLLaMA | 41 comments
Hey everyone, I've been working on getting Multi-Token Prediction (MTP) running with quantized GGUFs for Qwen3.6-27B, and the results are pretty impressive. Here's what I put together: https://huggingface.co/havenoammo/Qwen3.6-27B-MTP-UD-GGUF
These are Unsloth's UD XL quantizations of Qwen3.6-27B with the MTP draft heads grafted on top in Q8_0. The base model stays in its usual low-bit quantization, while the 3 MTP layers stay at Q8 to preserve speculative accuracy.
I'm sharing the grafted GGUF files (UD XL base + Q8 MTP), the raw MTP layers I extracted (MTP_Q8_0.gguf), and convert.py, the grafting script I adapted from this gist, in case anyone wants to do this for other models. Full build instructions for the custom llama.cpp are also included.
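If you just want to grab the files, any of the quants can be pulled with the Hugging Face CLI (the filename below is one of the grafted files in the repo):

```bash
# huggingface-cli ships with the huggingface_hub pip package.
pip install -U huggingface_hub
huggingface-cli download havenoammo/Qwen3.6-27B-MTP-UD-GGUF \
  Qwen3.6-27B-MTP-UD-Q4_K_XL.gguf --local-dir .
```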
Qwen3.6 was trained with 3 MTP steps, meaning each forward pass can produce up to 4 tokens (the regular next token plus 3 drafts). llama.cpp's main branch doesn't support MTP yet, so I pulled in the speculative decoding support from the still-open PR #22673, merged it on top of master, and built llama-server from that. Run it with: --spec-type mtp --spec-draft-n-max 3
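For example, a minimal run with one of the grafted quants looks something like this (the MTP flag names are from the PR build and could still change before it's merged):

```bash
# llama-server built from the PR branch; -m points at one of the grafted GGUFs.
./build/bin/llama-server \
  -m Qwen3.6-27B-MTP-UD-Q4_K_XL.gguf \
  --spec-type mtp \
  --spec-draft-n-max 3
```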
The results: roughly 2.5x token throughput compared to running the same UD XL GGUF without MTP, with a solid acceptance rate where most draft tokens are kept, meaning the MTP heads are genuinely useful and not just burning compute. The Q8 MTP layers also add very little VRAM overhead since they're a tiny fraction of the full model.
MTP is one of the biggest efficiency wins available for speculative decoding, but it's basically unsupported outside of official Qwen3 deployments on SGLang and vLLM. This brings it to GGUF and llama.cpp, meaning you can run it locally with the same tooling you already use. PR #22673 will hopefully land soon and this will all just work out of the box. In the meantime, the merge process is straightforward (3 git commands).
Happy to answer questions or help anyone get it running. Let me know if you try it and what speeds you see!
Pineapple_King@reddit
OK, how did you do it?
havenoammo@reddit (OP)
Explained step by step in the repo, but here's the short version: clone llama.cpp, patch in PR #22673, build, and run with the MTP flags.
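Roughly, assuming the upstream repo and a CUDA build (the exact commands are in the repo's build instructions):

```bash
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
git fetch origin pull/22673/head:mtp   # pull the MTP PR into a local branch
git merge mtp                          # merge it on top of master
cmake -B build -DGGML_CUDA=ON          # drop -DGGML_CUDA=ON for a CPU-only build
cmake --build build --config Release -j
```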
jumpingcross@reddit
For posterity, which commit was master at when you did this?
tempedbyfate@reddit
This is awesome, thank you for the detailed instructions!
Any chance you could provide instructions for grafting MTP_Q8_0.gguf onto another model using your script? I'd like to try this on the Heretic Qwen3.6 27B model. Thanks.
havenoammo@reddit (OP)
Pretty simple, and the script isn't mine either; I found it buried in a HuggingFace community post. The original link is provided in the repo. You'll need the uv package installed, which handles Python virtual environments. Then:
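Something along these lines; treat the argument names as placeholders, since the actual convert.py usage is documented in the repo:

```bash
# Illustrative sketch only: the argument names are placeholders, check the repo for
# the real convert.py invocation. "uv run" sets up the virtual environment for you.
# Inputs: the base GGUF to graft onto, the extracted MTP layers, and an output path.
uv run convert.py \
  --base Heretic-Qwen3.6-27B-UD-Q4_K_XL.gguf \
  --mtp MTP_Q8_0.gguf \
  --out Heretic-Qwen3.6-27B-MTP-UD-Q4_K_XL.gguf
```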
tempedbyfate@reddit
❤️❤️❤️
havenoammo@reddit (OP)
Awesome!
Prestigious-Chair282@reddit
Thx for git setup
Prestigious-Chair282@reddit
Didn't the post say it?
tempedbyfate@reddit
Just did a quick test using your instructions on an RTX Pro 6000.
Qwen3.6 27B Q8_K_XL = 41 tokens per second
Qwen3.6 27B Q8_K_XL (MTP) = 100 tokens per second
Wow! This is mind-blowing. I hope all the issues get ironed out on that PR and the MTP changes get merged soon!
NickCanCode@reddit
If you have an RTX Pro 6000, have you tried lucebox-hub? Their numbers actually look more impressive with DFlash, DDtree, and PFlash, but it doesn't support multi-GPU very well, so I don't have enough VRAM to run it.
tempedbyfate@reddit
Looks very interesting, will check it out. Thank you!
External_Dentist1928@reddit
But also at the same quality?
Awwtifishal@reddit
Speculative decoding doesn't alter quality. It just batches multiple tokens under the assumption that the draft tokens are correct, and anything produced after an incorrect draft token is thrown away. The speedup comes from the fact that LLM inference is mostly bound by memory bandwidth, and verifying a batch of several tokens uses about the same bandwidth as generating a single one.
ambient_temp_xeno@reddit
Yes. It doesn't drop quality.
srigi@reddit
We demand pelican on bicycle test!
gordi555@reddit
On an RTX Pro 6000 Max-Q I get:
Qwen3.6 27B Q8_K_XL = 36 tokens per second
Qwen3.6 27B Q8_K_XL (MTP) = 78 tokens per second
I've lost about 20% prompt processing speed, but these generation speeds are massively worth it.
tempedbyfate@reddit
Based on the comments on that PR, I think the prompt processing slowdown is a known issue, and it sounds like it could be fixed before the PR is merged in.
havenoammo@reddit (OP)
Amazing, I also use Q8! I have a 5090 + 3090 and was getting 25-30 t/s before, now I'm in the 60-75 t/s range. Been using it for a few hours for coding and no issues at all.
Dazzling_Equipment_9@reddit
This is really good news, thank you for your contribution! Also, has anyone tested it on Strix Halo?
lolwutdo@reddit
I wonder if this gives the 27B usable speeds for people doing partial CPU offload.
AppealSame4367@reddit
In ik_llama, another 27B GGUF reached up to 1.7 t/s instead of 1 t/s on the first token. Lol.
6GB of VRAM..
LetsGoBrandon4256@reddit
Well that's at least a 70% speed increase.
2Norn@reddit
get another cheap gpu, way better than cpu offloading
No_Algae1753@reddit
You should definitely at least get 10
redonculous@reddit
Will this run on a 12GB or 24GB card like a 3060, or on a pair of them?
EmotionalLock6844@reddit
No parallel agents possible?
ethereal_intellect@reddit
Any chance of a speed comparison vs the A3B, with and without MTP? It's probably a lot of work, and I've heard MTP helps dense models more, but it sounded interesting to know.
Beginning-Window-115@reddit
Thanks dude, the 8-bit versions that were released in the PR draft are way too big, so this is absolutely perfect for me.
Legitimate-Dog5690@reddit
Running 2x12GB cards, it's not pretty. Using mod spec decoding I can get 20 t/s; using MTP I'm struggling to get 15. It feels like it's loading the model into the GPU and then squeezing the MTP into CPU memory at the end.
Has anyone with a 32GB R9700 tried this yet? Really intrigued whether it plays to its strengths.
WoodCreakSeagull@reddit
If you're getting squeezed like that, setting -ub to 256 might do the trick. If you really want to be sure, I'd suggest turning the context down or testing a lower quant to see whether it's really the VRAM limit or something else.
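For example (flag names per mainline llama-server; the values are just a starting point to isolate the problem):

```bash
# Smaller micro-batch (-ub) and reduced context (-c) to ease VRAM pressure;
# MTP flags as in the original post. Tune both values to your cards.
./build/bin/llama-server \
  -m Qwen3.6-27B-MTP-UD-Q4_K_XL.gguf \
  --spec-type mtp --spec-draft-n-max 3 \
  -ub 256 -c 32768
```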
Legitimate-Dog5690@reddit
Yeah, lowering the context helps, as does -ub 256, but the speed seems to taper off rapidly as the context fills. It feels very much like more spills over into CPU than with standard models. I also think you end up with a bit more fragmentation over 2x12GB compared to a single 24GB buffer.
I normally run Unsloth Q4_K_M with 90k context and q8_0 KV cache, or 120k with turbo4. Both sit around 20 t/s; the q8_0 setup doesn't seem to bog down at bigger contexts and prompt processing is quicker, so I usually use that.
CapsAdmin@reddit
Not sure if this matters much, but I struggled to get this working because I was using a model.ini file while ALSO pointing the llama.cpp server to a model directory, so I kept loading Qwen3.6-27B-MTP-UD-Q4_K_XL.gguf without the required flags (I assume; not sure if [*] flags in model.ini apply to --models-dir), making llama.cpp segfault and/or get stuck in infinite loops trying to prompt the model.
No_Swimming6548@reddit
MFW I get 6 tokens/s instead of 3 tokens/s.
dinerburgeryum@reddit
Hey, thanks, I used your isolated MTP GGUF and your conversion script to graft it into my own quant. Saved me some time, appreciate it.
VoidAlchemy@reddit
Nice job testing out the PR! I have a rough 3-way benchmark between mainline llama.cpp, ik_llama, and vLLM running on a single 24GB VRAM GPU here: https://github.com/noonghunna/club-3090/pull/64#issuecomment-4383699676
Thanks again for sharing your full build and run commands!
iportnov@reddit
This really does 2x tokens per second for me.
The only problem is that llama-server segfaults when I press Ctrl-C to stop it.
Also, it says it doesn't support a --parallel value greater than 1, but that doesn't matter to me personally.
GrungeWerX@reddit
Sorry, I'm not 100% following. I have LM Studio, no llama.cpp. Since these are GGUFs, should they work out of the box, or is there something else I need to do?
havenoammo@reddit (OP)
This is experimental and not available in the official llama.cpp release yet. What we do here is patch in some work-in-progress code and build llama.cpp from source to enable Multi-Token Prediction (MTP), which gives a nice speed boost. Think of it as early access. It should land in the main release sooner or later, and Unsloth will probably ship official MTP models too.
As for LM Studio, I'm not sure since I haven't tested it. I believe it uses llama.cpp under the hood, so it might work once MTP support lands in the official release, but I can't say for certain.
Rattling33@reddit
Great! I will try it. Quick question: do your Q4 and Q8 GGUFs mean Unsloth's corresponding UD Q4 + Q8_0 MTP layered on, and UD Q8 + Q8_0 MTP layered on?
havenoammo@reddit (OP)
Yes! I grafted the Q8_0 MTP onto the Q4, Q5, Q6, and Q8 Unsloth UD models, so all of them have the MTP layers in Q8 quantization.