Multi-Token Prediction (MTP) for LLaMA.cpp - Gemma 4 speedup by 40%
Posted by gladkos@reddit | LocalLLaMA | 107 comments
Implemented Multi-Token Prediction for LLaMA.cpp.
Quantized Gemma 4 assistant models into GGUF format.
Ran tests on a MacBook Pro (M5 Max). With MTP drafting, Gemma 26B generates tokens about 40% faster.
Prompt: Write a Python program to find the nth Fibonacci number using recursion
Outputs:
LLaMA.cpp: 97 tokens/s
LLaMA.cpp + MTP: 138 tokens/s
Gemma4-assistant GGUF Quantized model: https://huggingface.co/collections/AtomicChat/gemma-4-assistant-gguf
Local AI models app: http://atomic.chat
Patched llama.cpp: https://github.com/AtomicBot-ai/atomic-llama-cpp-turboquant
Inevitable-Log5414@reddit
Great test
oldeastvan@reddit
I can't seem to make it work when I enable the MTP assistant. The server loads without errors, but the first request it gets, even just 'hello', crashes the server and closes the console window before I can see anything. If I just run without loading the MTP assistant, the server runs fine. I'm coming from the LM Studio / Kobold world, sorry if this is a dumb question. Are there any logs I can look at?
gladkos@reddit (OP)
Hi! What's your device? We didn't test under LM Studio, sorry.
oldeastvan@reddit
No, I built the atomic repo on Win 11. Running CUDA on a 3090.
Quirky_Inflation@reddit
Nice so llama.cpp running gemma4 can now crash 40% faster
error_museum@reddit
How do I get this to work in LM studio?
JamesEvoAI@reddit
Thank you for your work on this, I've replicated your results on Strix Halo:
https://sleepingrobots.com/dreams/gemma4-mtp-assistant-strix-halo/
The world of local coding models keeps getting better by the day!
gladkos@reddit (OP)
nice! great to see you achieved similar results.
thetaFAANG@reddit
in this moment I am euphoric
opossum_cz@reddit
The promise was 2-3x, so 40% is pretty low. I'm testing it myself and it goes from 10 t/s to about 14 t/s, which is consistent with what you showed.
Disappointing. Normal speculative drafting seems to be much better.
bleakj@reddit
As someone who hadn't been reading about this before, 40% seems like a solid jump to me.
What do you mean about normal speculative drafting being better, though? (Like, is it better while using MTP, or without? The wording has me confused.)
opossum_cz@reddit
Can you run something like this for me? Alter the parameters to match your own testing, but use ggml-org/gemma-4-31B-it-GGUF:Q4_K_M with ggml-org/gemma-4-E2B-it-GGUF:Q8_0 as the drafter. I would honestly like to have somebody to compare results with.
```
llama-server --host 0.0.0.0 --port 18080 --alias gemma4:31b \
  --ctx-size 8192 --parallel 1 --n-gpu-layers 999 --flash-attn on \
  --cache-type-k q8_0 --cache-type-v q8_0 \
  --jinja --reasoning off --chat-template-kwargs \{\"enable_thinking\":false\} \
  --temp 1.0 --top-k 64 --top-p 0.95 \
  --hf-repo ggml-org/gemma-4-31B-it-GGUF:Q4_K_M \
  --spec-draft-n-max 32 --spec-draft-n-min 0 \
  --n-gpu-layers-draft 999 --cache-type-k-draft q8_0 --cache-type-v-draft q8_0 \
  --hf-repo-draft ggml-org/gemma-4-E2B-it-GGUF:Q8_0
```
bleakj@reddit
I'll do this tomorrow (just seeing now)
With ctx-size set to 8192 though, do you find this usable, or is it for testing purposes only? (I usually have context set between 64k and 128k for local runs.)
caetydid@reddit
It is solid for MoE. I'd expect 2x for 31B dense.
MiaBchDave@reddit
It is ;-)
b1231227@reddit
Does it only support Gemma 4?
gladkos@reddit (OP)
at the moment only Gemma. working on QWEN
bleakj@reddit
The second Qwen3.6 27B is working, let me know.
I've got it running on a 4090ti at the moment and it does good work/speed; however, if I could manage to make it more usable on my 4070ti PC, it'd be a huge win for me lol.
Running it on my 4070ti currently gives around 15-20 tok/s.
audioen@reddit
You've been living under a rock. https://github.com/ggml-org/llama.cpp/pull/22673 and a whole bunch of others are running MTP versions already.
bleakj@reddit
Oooh - yeah, cave living is big for me.
Thanks for the info
SavingsWeather1659@reddit
Gemma 4 26B was fast, but what we need is the 31B dense model to improve on this.
pepedombo@reddit
I've already checked MTP for Qwen 27B and it gave me around a 40% increase, so I think it'll be the same with 31B.
In Qwen I had 25 tps at the start and now it hits 35-37 (Q5 variant). At the same time it looks like it needs more VRAM, so I had to change the KV cache from f16 to q8 to keep the same ctx.
pabloodiablo@reddit
Where did you get compiled version of llama.cpp ?
pepedombo@reddit
Compiled it myself because the public revision doesn't expose MTP yet.
juandann@reddit
Can you link which commit/PR you used?
pepedombo@reddit
here
MiaBchDave@reddit
Gemma 4 31B MTP works fine using MLX. I followed the instructions on the Google assistant page: https://huggingface.co/mlx-community/gemma-4-31B-it-assistant-bf16
Quick test:
Gemma 4 31B BF16 Without MTP Draft: Generation: 75 tokens, 7.318 tokens-per-sec
Gemma 4 31B BF16 With MTP Draft: Generation: 75 tokens, 13.408 tokens-per-sec
M5 Max 128GB
Kaioh_shin@reddit
Does anyone know a fork that has MTP + TQ and works with Qwen3.6 27B?
gladkos@reddit (OP)
MLX supports MTP on Apple silicon. On llama.cpp we'll release MTP for Qwen next week.
juandann@reddit
how about mtp support for gemma4?
MiaBchDave@reddit
For Mac, oMLX 0.3.9 supports this.
d4nger_n00dle@reddit
I would like to know that as well
AnonLlamaThrowaway@reddit
Would this help in scenarios where you don't have enough VRAM and you've got half the model in VRAM, and the other half in RAM?
gladkos@reddit (OP)
Not possible, unfortunately. The model has to be hosted in VRAM.
FerLuisxd@reddit
Vram usage?
gladkos@reddit (OP)
Around 20 GB for the main model, 900 MB for the assistant, and 700 MB for the KV cache, or 100 MB cache with turboquant.
Adorable-Sir-773@reddit
For some reason it actually slowed down generation on my 5060ti 16GB, idk what I missed.
Muted_Masterpiece342@reddit
Any way this works on AMD?
Material_Tone_6855@reddit
Is it the dense model?
More-Bed-2557@reddit
Is this not compatible with GGUF quants? I tried running it with gemma4-31B-Q3_K_S.gguf, but got an error while starting up llama-server with your fork, saying the assistant and model could not be loaded.
```
llama_model_load: error loading model: invalid vector subscript
llama_model_load_from_file_impl: failed to load model
llama_model_load_mtp_from_file: failed to load assistant from ...
```
Hot_Cupcake_6158@reddit
Thanks for the development u/gladkos! ❤️
I'll wait for a pre-compiled release to try it out, because I'm not terminal-savvy enough to compile your fork using the base documentation written for base Llama.cpp. 😔
Own_Dimension_4513@reddit
40% speedup on a MacBook M5Max is no joke — MTP draft tokens are underrated for local inference. Gemma 4 26B at that speed starts to feel actually usable for real workloads without a GPU rack.
MiaBchDave@reddit
Gemma 4 31B BF16 with MTP is showing 2x speed increase on my same hardware.
DKO75@reddit
How do you run it from your app?
gladkos@reddit (OP)
Hi! Not yet. We are working to add MTP to our atomic.chat app. Just install it; an update will come in the next few days with a popup window.
j0j0n4th4n@reddit
Does this work with finetunes/heretic/ablated/etc. versions of Gemma 4, or just the official model?
gladkos@reddit (OP)
Each model requires its own small assistant. We took the official model pairs. Not sure if it works with others..
nickleodoen@reddit
visualization looks sick
gladkos@reddit (OP)
thank you! we have a small tasting stage in our lab
ChessGibson@reddit
Very cool tests! Did you try with Gemma E2B and E4B?
gladkos@reddit (OP)
thank you! not yet. these are simple models for easy tasks mostly.
IrisColt@reddit
Thanks for the patched llama.cpp!!!
Confident-Aerie-6222@reddit
does it work in lmstudio?
jtjstock@reddit
No, this needs to be merged into the mainline first, then lm studio will need to update. Something to watch though, the performance improvement is amazing.
Temporary-Roof2867@reddit
but does it only work for MAC? 👀👀
gladkos@reddit (OP)
You can compile and run llama.cpp on Mac, Windows, or Linux, or just use the atomic.chat harness.
Qwen3_6_27b_UD_Q4XL@reddit
Need to force them to answer as similarly as possible to compare quality.
gladkos@reddit (OP)
Yeah, I might run a bunch of prompts soon
fallingdowndizzyvr@reddit
Crank the temp down to 0.
mxforest@reddit
I thought it was crank up and tone down?
fallingdowndizzyvr@reddit
No. The higher the temp, the more variable it becomes. If you crank it down to 0, it's effectively deterministic.
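For intuition, here's a toy temperature sampler in plain NumPy (illustration only, not llama.cpp's actual sampling code): dividing the logits by the temperature sharpens or flattens the distribution, and at temperature 0 it collapses to a plain argmax, which is why the output becomes deterministic.
```
import numpy as np

def sample_token(logits, temperature, rng):
    # Toy sampler for illustration; not llama.cpp's real implementation.
    if temperature <= 0:
        # Temperature 0: greedy argmax, fully deterministic.
        return int(np.argmax(logits))
    # Higher temperature flattens the softmax, so sampling gets more variable.
    scaled = np.asarray(logits, dtype=np.float64) / temperature
    probs = np.exp(scaled - scaled.max())
    probs /= probs.sum()
    return int(rng.choice(len(probs), p=probs))

rng = np.random.default_rng(42)
logits = [2.0, 1.5, 0.3]
print(sample_token(logits, 0.0, rng))  # always token 0
print(sample_token(logits, 1.0, rng))  # usually token 0, sometimes 1 or 2
```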
TJKDev@reddit
I think they were making a point about the way you used “crank down” not about the actual temperature of the model.
fallingdowndizzyvr@reddit
Ah.... In that case that's also wrong.
Crank just means "turn". As in turning a crank. You can crank things up or down.
https://www.collinsdictionary.com/dictionary/english/crank-handle
https://dictionary.reverso.net/english-definition/crank+down
Alternative-Suit5541@reddit
Just post a long text in context and tell it to copy it back.
Done.
rockseller@reddit
How is the quality of the generated output? Since it's based on guessing, idk, does it have worse results or a downside?
Shinkai_I@reddit
This isn't just about guessing; it's about combining rapid guessing with large-scale parallel validation.
When a guess is rejected by the validator, the original generation method takes over for the subsequent tokens, which preserves the quality of the generated output.
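For anyone who wants the mechanics, here's a rough sketch of one greedy draft-and-verify round; `draft_next` and `main_greedy_for` are hypothetical stand-ins for the small assistant and the main model, not the actual llama.cpp/MTP API.
```
def speculative_step(prefix, k, draft_next, main_greedy_for):
    # One draft/verify round of greedy speculative decoding (illustration only).
    # draft_next(tokens)              -> assistant's guess for the next token (hypothetical)
    # main_greedy_for(prefix, drafts) -> the main model's own greedy choice at each
    #                                    drafted position, from one batched forward
    #                                    pass (hypothetical)

    # 1. The small assistant cheaply drafts k tokens, one at a time.
    drafted = []
    for _ in range(k):
        drafted.append(draft_next(prefix + drafted))

    # 2. The main model checks all k drafts in a single forward pass; this batch
    #    verification is why the round is cheaper than generating k tokens with
    #    the main model alone.
    targets = main_greedy_for(prefix, drafted)

    # 3. Keep drafted tokens only while they match the main model; the first
    #    mismatch is replaced by the main model's own token, so the final text
    #    matches what the main model would have produced by itself.
    out = list(prefix)
    for guess, target in zip(drafted, targets):
        out.append(target)
        if guess != target:
            break
    return out
```
In the best case all k drafts are accepted per main-model pass; in the worst case you still make the same progress as ordinary decoding, just with a little extra drafting overhead.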
rockseller@reddit
Thanks for sharing! Does it also improve the prefill speed or not? Like, I noticed it takes a while to read files (using llama.cpp).
Shinkai_I@reddit
From what I understand, it won't speed up the pre-fill stage.
In fact, it might slightly increase TTFT latency, but not significantly (typically between 5% and 15%, depending on your environment).
I think what's more noteworthy is the increased usage of inference memory.
opossum_cz@reddit
No. It doesn't change output.
grumd@reddit
Would be interesting to see the same comparison but with the same seed and temp 0.0; supposedly the output would be exactly the same, proving MTP isn't degrading quality.
m18coppola@reddit
The MTP heads predict multiple tokens at the same time, so the main model can verify them in parallel. If the main model doesn't agree, the tokens are rejected and resampled by the large model. MTP is effectively speculative decoding and will not affect the output at all, just generation speed.
grumd@reddit
I know. I said it would be nice to see how it works in a gif, instead of OP's gif showing different outputs.
Also, bugs in the implementation can cause errors during matmuls or quantization or whatnot, leading to different outputs.
Former-Ad-5757@reddit
Supposedly yes; in reality, with quantization etc. it can shift slightly, and because they are just next-word predictors, one tiny shift means everything after it will be different.
Basically, the outputs can only be ~99% the same if the error happens in one of the last tokens. If the error happens earlier, similarity immediately drops because everything after it will also be different.
gladkos@reddit (OP)
The model won't answer the same way every time. A judge is needed.
k4ch0w@reddit
Set the seed to the same value, like 42, and then set temp to 0.
mxforest@reddit
That will work between 2 runs of MTP but not between runs that have it enabled and disabled.
opossum_cz@reddit
Nonsense. Do not comment if you don't know enough.
petuman@reddit
Even if the draft model is producing random noise, it should have no effect on the final output, as each individual draft token gets verified by the main model.
the320x200@reddit
The full model checks a batch of predictions from the draft model and they only get accepted if they match the full model. There should be zero output change with MTP on vs off.
Ok-Measurement-1575@reddit
This looks great but the burning question is:
Can 27b with mtp enabled STILL fix the slop produced by opus?
stoppableDissolution@reddit
Mtp has no effect on the output itself
Ok-Measurement-1575@reddit
bollocks
stoppableDissolution@reddit
You don't know how speculative decoding works, do you
Ok-Measurement-1575@reddit
no idea
gvij@reddit
Why is the difference not as large as mentioned in the release notes?
pizzaboyreddit@reddit
Also getting great results in vLLM; it's really made the 31B usable.
bleakj@reddit
Honestly, the Gemma 4 models have all been really usable on "reasonable" consumer gear.
Even my 4070ti system runs 31B fine (I do offload some into system RAM, which slows it down a bit, but I don't find it too bad; it's still usually 40-50 tok/s or better). Mind you, even the jump from the 4070ti to a 4090ti makes a huge difference.
TheRealMasonMac@reddit
Try DFlash. I heard that it’s even faster?
gladkos@reddit (OP)
will release Dflash support very soon! It's faster, but with losses
bleakj@reddit
What type of losses? (memory?)
TheLipovoy@reddit
What about Ollama?
opossum_cz@reddit
Ollama is badly maintained in my opinion.
Nexter92@reddit
The landing page looks very, very good.
gladkos@reddit (OP)
thank you!
HavenTerminal_com@reddit
97 was already enough. 138 is showing off.
WithoutReason1729@reddit
Your post is getting popular and we just featured it on our Discord! Come check it out!
You've also been given a special flair for your contribution. We appreciate your post!
I am a bot and this action was performed automatically.
innovasior@reddit
Does this work with ollama and lm studio?
false79@reddit
You. You have SOTA local. That is pretty cool.
CBW1255@reddit
Either you are lying to him or to yourself.
gladkos@reddit (OP)
thanks! It's a Monster!
kjbbbreddd@reddit
I'm running gemma 4 31b Heretic for image captioning, and it's taking 10 minutes per image. I'm excited to see what happens.
Monkey_1505@reddit
Try qwen 9b instead, it'll be faster.
DashinTheFields@reddit
I have a vision model that identifies items in like half a second; it's only like 3-4b. Shouldn't your caption need a dramatically smaller model? Or do you need something special from the model you are using?
EvilGuy@reddit
10 mins? Is this running on your CPU?
Zauos@reddit
u/gladkos please make heretic (https://github.com/p-e-w/heretic) ggufs! you would do me a great favour
gladkos@reddit (OP)
nice! will play a bit thank you