Multi-Token Prediction (MTP) for Qwen on LLaMA.cpp + TurboQuant
Posted by gladkos@reddit | LocalLLaMA | View on Reddit | 38 comments
Implemented Multi-Token Prediction for Qwen on LLaMA.cpp
+40% performance! 90% acceptance rate. TurboQuant enabled
Running locally on a MacBook Pro M5 Max 64GB RAM
Patched LLaMA.cpp with MTP and TurboQuant: https://github.com/AtomicBot-ai/atomic-llama-cpp-turboquant
Quantized Qwen 3.6 27B (and 35B) into GGUF with MTP: https://huggingface.co/collections/AtomicChat/qwen-36-udt-mtp
Local AI Models App: Atomic.Chat
havenoammo@reddit
There was a TurboQuant pull request to llama.cpp itself, but it was not accepted because llama.cpp already has rotations for the Q4 KV quantization levels: there was not much gain, and Q4 quantization was already faster. I think it was only useful at Q3, but then quality suffers anyway. That is why I did not add it to my llama.cpp Docker builds.
FatheredPuma81@reddit
Do you have the PR on hand by chance? I didn't see it and search is terrible at the best of times.
havenoammo@reddit
It was this PR and the comment helped me make the decision: https://github.com/ggml-org/llama.cpp/pull/21089#issuecomment-4184785523 Also, there are tons of TurboQuant PRs that have been declined.
FatheredPuma81@reddit
I think this is what you're looking for, actually: https://github.com/ggml-org/llama.cpp/pull/21089#issuecomment-4187227563, or one of CISC's comments, since Mushoz isn't even a contributor.
And AFAIK the vast majority of the TurboQuant PRs were closed because they were too large and/or slop in the maintainer's eyes. This one wasn't closed for that reason, but it hasn't been merged because it shows no gains.
It's pretty sad that people are still going nuts over TurboQuant when it's basically on par with Q4_0 while being slower. I think the bare minimum would be at least Q4_1 quality, and preferably at least Q5_0. Maybe the 10 posts with their own forks actually improved on TurboQuant and that PR is outdated? Doubtful though.
superdariom@reddit
To be fair, I thought the timeline was that the Q4 rotations were added after the PR was submitted, but your logic still stands.
gladkos@reddit (OP)
Fair enough
shenglong@reddit
Do the implementations work well with AMD hardware, specifically using ROCm?
Alternative-Way-7894@reddit
Anyone know if MTP models work in LM Studio?
nickm_27@reddit
Why do people keep posting these with TurboQuant as if it is faster, when it is in fact slower than f16 or even Q8/Q4?
gladkos@reddit (OP)
Right, it depends on the task. With a large context, TurboQuant is more effective, especially for agentic tasks. For smaller prompts, it's slower.
Automatic-Arm8153@reddit
No it’s not.
It’s a downgrade. A serious one.
Your AI becomes stupid, while not being any faster than running the KV cache quantized at Q8/Q4.
But in all honesty, the Qwen models already have such an efficient KV cache. I wouldn't run anything less than BF16.
RnRau@reddit
Depends on hardware. I benched f16 vs bf16 on the Strix Halo and there is a dramatic slowdown with bf16.
Model was Qwen3.6-35B-A3B-UD-Q6_K_XL
Picard12832@reddit
bf16 FA is not yet implemented in Vulkan, but should be added soon-ish. In your test it's falling back to CPU.
RnRau@reddit
Ahh cool, I look forward to the release! Thanks for the heads-up.
JustANerd420@reddit
Right, because everyone can afford to run the BF16 /s
rpkarma@reddit
For KV cache? Yeah most can. And if you can’t you’re paying a massive intelligence cost
llitz@reddit
Even at bf16, when the cache gets to 200k, Qwen starts to... misremember things. I am surprised if at these lower quants it has any accuracy left at all.
rpkarma@reddit
I can say FP8 KV is dogshit lol. Like noticeably worse. Gets stuck in loops and randomly emits stop tokens and shit
llitz@reddit
Yeah, I am getting some of those in bf16, but likely the issue lies with the drafting/MTP and breaking thinking tags.
Such a pain -_-
gladkos@reddit (OP)
TurboQuant significantly compresses context, at least for the Gemma models. We ran tests. For Qwen it's less effective, agreed.
Automatic-Arm8153@reddit
Yes, it will most definitely compress context for the Gemma models, at the cost of intelligence.
Even a Q8 KV cache for Gemma is a serious intelligence loss.
FatheredPuma81@reddit
I more want to know why people keep posting these. This is literally like the 8th post where someone has got this working in llama.cpp, and more if you include vLLM. It's becoming slop at this point imo.
Jeidoz@reddit
I have a feeling that those posts are promos or from bots. Each similar post had a link to some service website and sounded alike. In the last 5-7 days I have already seen 3+ such posts.
gladkos@reddit (OP)
i'm not a bot)
Still-Wafer1384@reddit
That's what they all say
SailIntelligent2633@reddit
So then this is a promo?
Daniel_H212@reddit
Yeah, and Qwen is already pretty damn memory efficient with KV cache compared to traditional full attention. And TurboQuant is better than Q4 but still way worse than full precision. MTP is the only exciting part here.
tomz17@reddit
TurboQuant can indeed be faster in situations where you are memory-bandwidth limited vs. compute-limited. So it 100% depends on the hardware, and on whether you are interested in prefill vs. decode, single user vs. multi-user (e.g. in multi-user setups prefill can tank your decode), etc. etc. etc.
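To put rough numbers on it, here's a back-of-envelope sketch of how much KV cache a single decode step has to read at f16 vs a ~4-bit format. The layer/head/dim values are made-up placeholders, not Qwen's actual config; when decode is bandwidth-bound, that read volume is roughly what sets your tokens/s ceiling.

```python
# Back-of-envelope: KV-cache bytes read per generated token, f16 vs ~4-bit.
# All model dimensions below are illustrative placeholders, not Qwen's real config.
def kv_bytes_per_token(n_layers, n_kv_heads, head_dim, ctx_len, bytes_per_elem):
    # Each decode step attends over the whole K and V cache once (hence the 2x).
    return 2 * n_layers * n_kv_heads * head_dim * ctx_len * bytes_per_elem

ctx = 128_000
f16 = kv_bytes_per_token(48, 8, 128, ctx, 2.0)     # 16-bit cache
q4  = kv_bytes_per_token(48, 8, 128, ctx, 0.5625)  # ~4.5 bits/element incl. scales

print(f"f16 KV read per token: {f16 / 1e9:.1f} GB")
print(f"~q4 KV read per token: {q4 / 1e9:.1f} GB")
# At, say, ~500 GB/s of memory bandwidth the smaller cache directly raises the
# attainable tokens/s; if you are compute-bound instead, it buys you nothing.
```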
Dazzling_Equipment_9@reddit
It looks fast, but what about the quality?
Sabin_Stargem@reddit
TurboQuant4 offers greater compression than Q4, but higher quality. Mind, it has to be asymmetric KV for that: Q8 K and Q4 V. Doing stuff to the K severely reduces the quality.
You can find assorted benches at TheTom's repository. I am going by the M5 Max's stat sheet for the quality & compression comparisons, but there are many other hardware tests.
https://github.com/TheTom/turboquant_plus
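To make the asymmetry concrete, here's a toy numpy sketch (not TurboQuant's or llama.cpp's actual block formats) that round-trips random K and V slices through 8-bit and 4-bit per-block quantizers. It only shows the precision gap; the reason K is the more sensitive side (errors there get amplified through the softmax over attention scores) isn't captured here.

```python
import numpy as np

def quantize_blockwise(x, bits, block=32):
    """Toy symmetric per-block round-trip quantizer (stand-in for q8_0/q4_0)."""
    x = x.reshape(-1, block)
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(x).max(axis=1, keepdims=True) / qmax
    scale[scale == 0] = 1.0
    q = np.clip(np.round(x / scale), -qmax - 1, qmax)
    return (q * scale).reshape(-1)

rng = np.random.default_rng(0)
k = rng.standard_normal(4096 * 128).astype(np.float32)  # toy K cache slice
v = rng.standard_normal(4096 * 128).astype(np.float32)  # toy V cache slice

for name, x, bits in [("K @ 8-bit", k, 8), ("V @ 4-bit", v, 4), ("K @ 4-bit", k, 4)]:
    err = np.abs(quantize_blockwise(x, bits) - x).mean()
    print(f"{name}: mean abs round-trip error {err:.4f}")
```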
Dazzling_Equipment_9@reddit
Thanks for the explanation.
However, I previously saw someone claim that any kind of quantization on the key-value cache for Qwen 3.5 (and probably 3.6 as well?) causes a noticeable drop in quality. They say the Qwen series is particularly sensitive to KV cache quantization.
I’m not a professional, so I don’t really understand why this happens or what the actual reason is. This rumor has been holding me back from using any KV cache quantization — including TurboQuant. I’m not sure if anyone can clarify what’s really going on here.
Sabin_Stargem@reddit
If you need absolute quality, definitely don't use KV quantization. That said, it depends on what sort of memory constraints you are working with. If I am using 100B+ models, TQ+ would give me more wiggle room, and the larger parameter count helps limit the damage of KV quanting.
In your case, I recommend just trying out KV quantization and seeing how the AI feels to you. If it ain't great, just switch back to a higher KV precision.
In any case, I hope TQ+ is implemented in Kobold, TextGen, and other llama.cpp-based applications. I want to try out TQ+ myself, rather than trusting naysayers and advocates.
Automatic-Arm8153@reddit
Qwen models quantise considerably better than the Gemma models.
But still, keep the KV at bf16 if you can, or at worst Q8.
You shouldn't be trying to reach the model's full context anyway.
If you're able to get a 100k context window with bf16, that's ideal; anything beyond that is a bonus.
Automatic-Arm8153@reddit
No, that's not what those results are saying at all.
Go down to the heading that reads "KL Divergence vs f16".
Q4 is better than TurboQuant. That's the only metric that matters on that page.
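For anyone wondering what that number actually measures: it's the average KL divergence between the next-token distributions of an f16-cache run and a quantized-cache run over the same text, lower being better. A minimal sketch of the computation (the .npy file names are hypothetical; llama.cpp's perplexity tooling does this properly):

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def mean_kl_vs_f16(logits_f16, logits_quant, eps=1e-10):
    """Mean per-token KL(P_f16 || P_quant) over a shared evaluation text."""
    p = softmax(logits_f16)
    q = softmax(logits_quant)
    kl = np.sum(p * (np.log(p + eps) - np.log(q + eps)), axis=-1)
    return float(kl.mean())

# Hypothetical usage with (n_tokens, vocab_size) logit dumps from two runs:
# print(mean_kl_vs_f16(np.load("logits_f16.npy"), np.load("logits_turboquant.npy")))
```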
gladkos@reddit (OP)
Similar quality, with a 90% acceptance rate.
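For context on why a 90% acceptance rate translates into roughly +40% rather than +90%: here's a rough model, assuming each drafted token is accepted independently and ignoring the cost of the extra MTP head and verification, so real numbers land lower.

```python
# Expected tokens produced per forward pass for MTP/speculative-style decoding.
# Assumes each of the k drafted tokens is accepted independently with prob a,
# and one token always comes from the verification pass itself.
def expected_tokens_per_step(a, k):
    accepted_drafts = sum(a ** i for i in range(1, k + 1))  # a + a^2 + ... + a^k
    return 1 + accepted_drafts

for a in (0.7, 0.9):
    for k in (1, 2):
        print(f"acceptance={a:.0%}, draft length={k}: "
              f"~{expected_tokens_per_step(a, k):.2f} tokens per pass")
# The extra head and the verification pass add compute, so the end-to-end
# speedup is smaller than the raw tokens-per-pass ratio.
```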
Distinct_Lion7157@reddit
Don't use MTP, use dflash; it's 30-40% faster than the built-in MTP.
There was also already a pull request for this, so you didn't need to waste your time vibe coding it.
siegevjorn@reddit
MTP is already implemented in llama.cpp's recent update, check it out. And yeah, the randomized Hadamard transform (the core of the TurboQuant algo) was already implemented back in March-ish by ggerganov himself.
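For anyone unfamiliar with it, the rough idea of the randomized Hadamard transform is to flip random signs and then apply an orthonormal Hadamard rotation, so outlier channels get smeared across the whole vector before low-bit quantization; being orthogonal, it's exactly invertible. A toy numpy sketch of the idea (not the actual llama.cpp or TurboQuant kernels):

```python
import numpy as np

def hadamard(n):
    """Sylvester construction; n must be a power of two."""
    h = np.array([[1.0]])
    while h.shape[0] < n:
        h = np.block([[h, h], [h, -h]])
    return h

def randomized_hadamard(x, rng):
    """Random sign flips + orthonormal Hadamard rotation: spreads outlier
    channels across the vector so a 4-bit quantizer wastes less range on them."""
    n = x.shape[-1]
    signs = rng.choice([-1.0, 1.0], size=n)
    h = hadamard(n) / np.sqrt(n)          # orthonormal, so exactly invertible
    return (x * signs) @ h, signs, h

rng = np.random.default_rng(0)
x = rng.standard_normal(128)
x[3] = 40.0                               # one large outlier channel
rot, signs, h = randomized_hadamard(x, rng)
print("max |x| before:", np.abs(x).max(), "after:", round(float(np.abs(rot).max()), 2))
print("exactly recovered:", np.allclose((rot @ h.T) * signs, x))
```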
Charming-Author4877@reddit
If you want speed, use MTP without TurboQuant.
If you want context, use normal Q4_1 or Q4_0 quantization.
If you want both, use both.
Or is there something special about your Mac that makes TurboQuant interesting?