Multi-Token Prediction (MTP) for Qwen on LLaMA.cpp + TurboQuant
Posted by gladkos@reddit | LocalLLaMA | View on Reddit | 38 comments
Implemented Multi-Token Prediction for Qwen on LLaMA.cpp
+40% performance! 90% acceptance rate. TurboQuant enabled
Running locally on a MacBook Pro M5 Max 64GB RAM
Patched LLaMA.cpp with MTP and TurboQuant: https://github.com/AtomicBot-ai/atomic-llama-cpp-turboquant
Quantized Qwen 3.6 27B (and 35B) into GGUF with MTP: https://huggingface.co/collections/AtomicChat/qwen-36-udt-mtp
Local AI Models App: Atomic.Chat
havenoammo@reddit
There was a TurboQuant pull request to llama.cpp itself, but it was not accepted because llama.cpp already has rotations for the Q4 KV quantization levels: there was not much gain, and Q4 quantization was already faster. I think it was only useful at Q3, but then quality suffers anyway. That is why I did not add it to my llama.cpp Docker builds.
FatheredPuma81@reddit
Do you have the PR on hand by chance? I didn't see it and search is terrible at the best of times.
havenoammo@reddit
It was this PR and the comment helped me make the decision: https://github.com/ggml-org/llama.cpp/pull/21089#issuecomment-4184785523 Also, there are tons of TurboQuant PRs that have been declined.
FatheredPuma81@reddit
I think this is what you're looking for, actually: https://github.com/ggml-org/llama.cpp/pull/21089#issuecomment-4187227563, or one of CISC's comments, since Mushoz isn't even a contributor.
And AFAIK the vast majority of the TurboQuant PRs were closed because they were too large and/or slop in the maintainer's eyes. This one wasn't closed for that reason, but it hasn't been merged because it shows no gains.
It's pretty sad that people are still going nuts over TurboQuant when it's basically on par with Q4_0 while being slower. I think the bare minimum would be at least Q4_1 quality, and preferably at least Q5_0. Maybe the 10 posts with their own forks actually improved on TurboQuant and that PR is outdated? Doubtful though.
superdariom@reddit
To be fair, I thought the timeline was that the Q4 rotations were added after the PR was submitted, but your logic still stands.
gladkos@reddit (OP)
Fair enough
shenglong@reddit
Do the implementations work well with AMD hardware, specifically using ROCm?
Alternative-Way-7894@reddit
Anyone know if MTP models work in LM Studio?
nickm_27@reddit
Why do people keep posting these with TurboQuant as if it is faster, when it is in fact slower than f16 or even Q8/Q4?
gladkos@reddit (OP)
Right, it depends on the task. With a large context, TurboQuant is more effective, especially for agentic tasks. For smaller prompts, it's slower.
Automatic-Arm8153@reddit
No it’s not.
It’s a downgrade. A serious one.
Your AI becomes stupid, while not being any faster than running the KV cache quantized at Q8/Q4.
But in all honesty, the Qwen models already have such an efficient KV cache. I wouldn't run anything less than BF16.
RnRau@reddit
Depends on hardware. I benched f16 vs bf16 on the Strix Halo and there is a dramatic slowdown with bf16.
Model was Qwen3.6-35B-A3B-UD-Q6_K_XL
Picard12832@reddit
bf16 FA is not yet implemented in Vulkan, but should be added soon-ish. In your test it's falling back to CPU.
RnRau@reddit
Ahh cool, I look forward to the release! Thanks for the heads-up.
JustANerd420@reddit
Right, because everyone can afford to run the BF16 /s
rpkarma@reddit
For KV cache? Yeah most can. And if you can’t you’re paying a massive intelligence cost
llitz@reddit
Even at bf16, when the cache gets to 200k, Qwen starts to... misremember things. I am surprised if at these lower quants it has any accuracy left at all.
rpkarma@reddit
I can say FP8 KV is dogshit lol. Like noticeably worse. Gets stuck in loops and randomly emits stop tokens and shit
llitz@reddit
Yeah, I am getting some of those in bf16, but likely the issue lies with the drafting/MTP and breaking thinking tags.
Such a pain -_-
gladkos@reddit (OP)
TurboQuant significantly compresses context, at least for the Gemma models. We ran tests. For Qwen it's less effective, agreed.
Automatic-Arm8153@reddit
Yes, it will most definitely compress context for the Gemma models, at the cost of intelligence.
Even a Q8 KV cache for Gemma is a serious intelligence loss.
FatheredPuma81@reddit
I more want to know why people keep posting these. This is literally like the 8th post where someone has got this working in llama.cpp, and more if you include vLLM. It's becoming slop at this point imo.
Jeidoz@reddit
I have a feeling that those posts are promos or from bots. Each similar post had a link to some service website and sounded alike. In the last 5-7 days I have already seen 3+ such posts.
gladkos@reddit (OP)
i'm not a bot)
Still-Wafer1384@reddit
That's what they all say
SailIntelligent2633@reddit
So then this is a promo?
Daniel_H212@reddit
Yeah, and Qwen is already pretty damn memory efficient with KV cache compared to traditional full attention. And TurboQuant is better than Q4 but still way worse than full precision. MTP is the only exciting part here.
tomz17@reddit
TurboQuant can indeed be faster in situations where you are memory-bandwidth limited vs. compute-limited. So it 100% depends on the hardware, and on whether you are interested in prefill vs. decode, single user vs. multi-user (e.g. in multi-user setups prefill can tank your decode), etc. etc. etc.
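To put rough numbers on it, here's a back-of-envelope sketch of how much KV cache a single decode step has to read at f16 vs a ~4-bit format. The layer/head/dim values are made-up placeholders, not Qwen's actual config; when decode is bandwidth-bound, that read volume is roughly what sets your tokens/s ceiling.

```python
# Back-of-envelope: KV-cache bytes read per generated token, f16 vs ~4-bit.
# All model dimensions below are illustrative placeholders, not Qwen's real config.
def kv_bytes_per_token(n_layers, n_kv_heads, head_dim, ctx_len, bytes_per_elem):
    # Each decode step attends over the whole K and V cache once (hence the 2x).
    return 2 * n_layers * n_kv_heads * head_dim * ctx_len * bytes_per_elem

ctx = 128_000
f16 = kv_bytes_per_token(48, 8, 128, ctx, 2.0)     # 16-bit cache
q4  = kv_bytes_per_token(48, 8, 128, ctx, 0.5625)  # ~4.5 bits/element incl. scales

print(f"f16 KV read per token: {f16 / 1e9:.1f} GB")
print(f"~q4 KV read per token: {q4 / 1e9:.1f} GB")
# At, say, ~500 GB/s of memory bandwidth the smaller cache directly raises the
# attainable tokens/s; if you are compute-bound instead, it buys you nothing.
```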
Dazzling_Equipment_9@reddit
It looks fast, but what about the quality?
Sabin_Stargem@reddit
TurboQuant4 offers greater compression than Q4, but higher quality. Mind, it has to be asymmetric KV for that: Q8 K and Q4 V. Doing stuff to the K severely reduces the quality.
You can find assorted benches at TheTom's repository. I am going by the M5 Max's stat sheet for the quality & compression comparisons, but there are many other hardware tests.
https://github.com/TheTom/turboquant_plus
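To make the asymmetry concrete, here's a toy numpy sketch (not TurboQuant's or llama.cpp's actual block formats) that round-trips random K and V slices through 8-bit and 4-bit per-block quantizers. It only shows the precision gap; the reason K is the more sensitive side (errors there get amplified through the softmax over attention scores) isn't captured here.

```python
import numpy as np

def quantize_blockwise(x, bits, block=32):
    """Toy symmetric per-block round-trip quantizer (stand-in for q8_0/q4_0)."""
    x = x.reshape(-1, block)
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(x).max(axis=1, keepdims=True) / qmax
    scale[scale == 0] = 1.0
    q = np.clip(np.round(x / scale), -qmax - 1, qmax)
    return (q * scale).reshape(-1)

rng = np.random.default_rng(0)
k = rng.standard_normal(4096 * 128).astype(np.float32)  # toy K cache slice
v = rng.standard_normal(4096 * 128).astype(np.float32)  # toy V cache slice

for name, x, bits in [("K @ 8-bit", k, 8), ("V @ 4-bit", v, 4), ("K @ 4-bit", k, 4)]:
    err = np.abs(quantize_blockwise(x, bits) - x).mean()
    print(f"{name}: mean abs round-trip error {err:.4f}")
```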
Dazzling_Equipment_9@reddit
Thanks for the explanation.
However, I previously saw someone claim that any kind of quantization on the key-value cache for Qwen 3.5 (and probably 3.6 as well?) causes a noticeable drop in quality. They say the Qwen series is particularly sensitive to KV cache quantization.
I’m not a professional, so I don’t really understand why this happens or what the actual reason is. This rumor has been holding me back from using any KV cache quantization — including TurboQuant. I’m not sure if anyone can clarify what’s really going on here.
Sabin_Stargem@reddit
If you need absolute quality, definitely don't use KV quantization. That said, it depends on what sort of memory constraints you are working with. If I am using 100B+ models, TQ+ would give me more wiggle room, and the larger parameter count helps limit the damage of KV quanting.
In your case, I recommend just trying out KV quantization and seeing how the AI feels to you. If it ain't great, just switch back to a higher KV precision.
In any case, I hope TQ+ is implemented in Kobold, TextGen, and other llama.cpp-based applications. I want to try out TQ+ myself, rather than trusting naysayers and advocates.
Automatic-Arm8153@reddit
Qwen models quantise considerably better than the Gemma models.
But still, keep the KV at bf16 if you can, or at worst Q8.
You shouldn't be trying to reach the model's full context anyway.
If you're able to get a 100k context window with bf16, that's ideal; anything beyond that is a bonus.
Automatic-Arm8153@reddit
No, that's not what those results are saying at all.
Go down to the heading that reads "KL Divergence vs f16".
Q4 is better than TurboQuant. That's the only metric that matters on that page.
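For anyone wondering what that number actually measures: it's the average KL divergence between the next-token distributions of an f16-cache run and a quantized-cache run over the same text, lower being better. A minimal sketch of the computation (the .npy file names are hypothetical; llama.cpp's perplexity tooling does this properly):

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def mean_kl_vs_f16(logits_f16, logits_quant, eps=1e-10):
    """Mean per-token KL(P_f16 || P_quant) over a shared evaluation text."""
    p = softmax(logits_f16)
    q = softmax(logits_quant)
    kl = np.sum(p * (np.log(p + eps) - np.log(q + eps)), axis=-1)
    return float(kl.mean())

# Hypothetical usage with (n_tokens, vocab_size) logit dumps from two runs:
# print(mean_kl_vs_f16(np.load("logits_f16.npy"), np.load("logits_turboquant.npy")))
```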
gladkos@reddit (OP)
Similar quality, with a 90% acceptance rate.
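For context on why a 90% acceptance rate translates into roughly +40% rather than +90%: here's a rough model, assuming each drafted token is accepted independently and ignoring the cost of the extra MTP head and verification, so real numbers land lower.

```python
# Expected tokens produced per forward pass for MTP/speculative-style decoding.
# Assumes each of the k drafted tokens is accepted independently with prob a,
# and one token always comes from the verification pass itself.
def expected_tokens_per_step(a, k):
    accepted_drafts = sum(a ** i for i in range(1, k + 1))  # a + a^2 + ... + a^k
    return 1 + accepted_drafts

for a in (0.7, 0.9):
    for k in (1, 2):
        print(f"acceptance={a:.0%}, draft length={k}: "
              f"~{expected_tokens_per_step(a, k):.2f} tokens per pass")
# The extra head and the verification pass add compute, so the end-to-end
# speedup is smaller than the raw tokens-per-pass ratio.
```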
Distinct_Lion7157@reddit
Don't use MTP, use dflash; it's 30-40% faster than the built-in MTP.
There was also already a pull request for this, so you didn't need to waste your time vibe coding it.
siegevjorn@reddit
MTP is already implemented in llama.cpp's recent update, check it out. And yeah, the randomized Hadamard transform (the core of the TurboQuant algo) was already implemented back in March-ish by ggerganov himself.
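For anyone unfamiliar with it, the rough idea of the randomized Hadamard transform is to flip random signs and then apply an orthonormal Hadamard rotation, so outlier channels get smeared across the whole vector before low-bit quantization; being orthogonal, it's exactly invertible. A toy numpy sketch of the idea (not the actual llama.cpp or TurboQuant kernels):

```python
import numpy as np

def hadamard(n):
    """Sylvester construction; n must be a power of two."""
    h = np.array([[1.0]])
    while h.shape[0] < n:
        h = np.block([[h, h], [h, -h]])
    return h

def randomized_hadamard(x, rng):
    """Random sign flips + orthonormal Hadamard rotation: spreads outlier
    channels across the vector so a 4-bit quantizer wastes less range on them."""
    n = x.shape[-1]
    signs = rng.choice([-1.0, 1.0], size=n)
    h = hadamard(n) / np.sqrt(n)          # orthonormal, so exactly invertible
    return (x * signs) @ h, signs, h

rng = np.random.default_rng(0)
x = rng.standard_normal(128)
x[3] = 40.0                               # one large outlier channel
rot, signs, h = randomized_hadamard(x, rng)
print("max |x| before:", np.abs(x).max(), "after:", round(float(np.abs(rot).max()), 2))
print("exactly recovered:", np.allclose((rot @ h.T) * signs, x))
```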
Charming-Author4877@reddit
If you want speed, use MTP without TurboQuant.
If you want context, use normal Q4_1 or Q4_0 quantization.
If you want both, use both.
Or is there something special about your Mac that makes TurboQuant interesting?