Multi-Token Prediction (MTP) for LLaMA.cpp - Gemma 4 speedup by 40%
Posted by gladkos@reddit | LocalLLaMA | 107 comments
Implemented Multi-Token Prediction for LLaMA.cpp.
Quantized Gemma 4 assistant models into GGUF format.
Ran tests on a MacBook Pro (M5 Max). With MTP drafting, Gemma 26B generates tokens about 40% faster.
Prompt: Write a Python program to find the nth Fibonacci number using recursion
Outputs:
LLaMA.cpp: 97 tokens/s
LLaMA.cpp + MTP: 138 tokens/s
Gemma4-assistant GGUF Quantized model: https://huggingface.co/collections/AtomicChat/gemma-4-assistant-gguf
Local AI models app: http://atomic.chat
Patched llama.cpp: https://github.com/AtomicBot-ai/atomic-llama-cpp-turboquant
Inevitable-Log5414@reddit
Great test
oldeastvan@reddit
I can't seem to make it work when I enable the MTP assistant. The server loads without errors, but the first request it gets, even just 'hello', crashes the server and closes the console window before I can see anything. If I just run without loading the MTP assistant, the server runs fine. I'm coming from the LM Studio / Kobold world, sorry if this is a dumb question. Are there any logs I can look at?
gladkos@reddit (OP)
Hi! What's your device? We didn't test under LM Studio, sorry.
oldeastvan@reddit
No, I built the atomic repo on Win 11. Running CUDA on a 3090.
Quirky_Inflation@reddit
Nice so llama.cpp running gemma4 can now crash 40% faster
error_museum@reddit
How do I get this to work in LM studio?
JamesEvoAI@reddit
Thank you for your work on this, I've replicated your results on Strix Halo:
https://sleepingrobots.com/dreams/gemma4-mtp-assistant-strix-halo/
The world of local coding models keeps getting better by the day!
gladkos@reddit (OP)
nice! great to see you achieved similar results.
thetaFAANG@reddit
in this moment I am euphoric
opossum_cz@reddit
The promise was 2-3x, so 40% is pretty low. I'm testing it myself and it goes from 10 t/s to about 14 t/s, which is consistent with what you showed.
Disappointing. Normal speculative drafting seems to be much better.
bleakj@reddit
As someone who hadn't been reading about this before, 40% seems like a solid jump to me.
What do you mean about normal speculative drafting being better, though? (Like, is it better while using MTP, or without? The wording has me confused.)
opossum_cz@reddit
Can you run something like this for me? Alter the parameters to match your own testing, but use ggml-org/gemma-4-31B-it-GGUF:Q4_K_M with ggml-org/gemma-4-E2B-it-GGUF:Q8_0 as the drafter. I would honestly like to have somebody to compare results with.
```
llama-server --host 0.0.0.0 --port 18080 --alias gemma4:31b \
  --ctx-size 8192 --parallel 1 --n-gpu-layers 999 --flash-attn on \
  --cache-type-k q8_0 --cache-type-v q8_0 \
  --jinja --reasoning off --chat-template-kwargs \{\"enable_thinking\":false\} \
  --temp 1.0 --top-k 64 --top-p 0.95 \
  --hf-repo ggml-org/gemma-4-31B-it-GGUF:Q4_K_M \
  --spec-draft-n-max 32 --spec-draft-n-min 0 \
  --n-gpu-layers-draft 999 --cache-type-k-draft q8_0 --cache-type-v-draft q8_0 \
  --hf-repo-draft ggml-org/gemma-4-E2B-it-GGUF:Q8_0
```
bleakj@reddit
I'll do this tomorrow (just seeing now)
With ctx-size set to 8192 though, do you find this usable, or is it for testing purposes only? (I usually have context set between 64k and 128k for local runs.)
caetydid@reddit
It is solid for MoE. I'd expect 2x for 31B dense.
MiaBchDave@reddit
It is ;-)
b1231227@reddit
Does it only support Gemma 4?
gladkos@reddit (OP)
at the moment only Gemma. working on QWEN
bleakj@reddit
The second Qwen3.6 27B is working, let me know.
I've got it running on a 4090ti at the moment and it does good work/speed; however, if I could manage to make it more usable on my 4070ti PC, it'd be a huge win for me lol.
Running it on my 4070ti currently gives around 15-20 tok/s.
audioen@reddit
You've been living under a rock. https://github.com/ggml-org/llama.cpp/pull/22673 and a whole bunch of others are running MTP versions already.
bleakj@reddit
Oooh - yeah, cave living is big for me.
Thanks for the info
SavingsWeather1659@reddit
Gemma 4 26B was fast, but what we need is the 31B dense model to improve on this.
pepedombo@reddit
I've already checked MTP for Qwen 27B and it gave me around a 40% increase, so I think it'll be the same with 31B.
In Qwen I had 25 tps at the start and now it hits 35-37 (Q5 variant). At the same time it looks like it needs more VRAM, so I had to change the KV cache from f16 to q8 to keep the same ctx.
pabloodiablo@reddit
Where did you get compiled version of llama.cpp ?
pepedombo@reddit
Compiled it myself because the public revision doesn't expose MTP yet.
juandann@reddit
Can you link which commit/PR you used?
pepedombo@reddit
here
MiaBchDave@reddit
Gemma 4 31B MTP works fine using MLX. I followed the instructions on the Google assistant page: https://huggingface.co/mlx-community/gemma-4-31B-it-assistant-bf16
Quick test:
Gemma 4 31B BF16 Without MTP Draft: Generation: 75 tokens, 7.318 tokens-per-sec
Gemma 4 31B BF16 With MTP Draft: Generation: 75 tokens, 13.408 tokens-per-sec
M5 Max 128GB
Kaioh_shin@reddit
Does anyone know a fork that has MTP + TQ and works with Qwen3.6 27B?
gladkos@reddit (OP)
MLX supports MTP on Apple silicon. On llama.cpp we'll release MTP for Qwen next week.
juandann@reddit
how about mtp support for gemma4?
MiaBchDave@reddit
For Mac, oMLX 0.3.9 supports this.
d4nger_n00dle@reddit
I would like to know that as well
AnonLlamaThrowaway@reddit
Would this help in scenarios where you don't have enough VRAM and you've got half the model in VRAM, and the other half in RAM?
gladkos@reddit (OP)
Not possible, unfortunately. The model has to be hosted in VRAM.
FerLuisxd@reddit
Vram usage?
gladkos@reddit (OP)
Around 20 GB for the main model, 900 MB for the assistant, and 700 MB for the KV cache, or 100 MB cache with turboquant.
Adorable-Sir-773@reddit
For some reason it actually slowed down generation on my 5060ti 16GB, idk what I missed.
Muted_Masterpiece342@reddit
Any way this works on AMD?
Material_Tone_6855@reddit
Is it the dense model?
More-Bed-2557@reddit
Is this not compatible with GGUF quants? I tried running it with gemma4-31B-Q3_K_S.gguf, but got an error while starting up llama-server with your fork, saying the assistant and model could not be loaded.
```
llama_model_load: error loading model: invalid vector subscript
llama_model_load_from_file_impl: failed to load model
llama_model_load_mtp_from_file: failed to load assistant from ...
```
Hot_Cupcake_6158@reddit
Thanks for the development u/gladkos! ❤️
I'll wait for a pre-compiled release to try it out, because I'm not terminal-savvy enough to compile your fork using the base documentation written for base Llama.cpp. 😔
Own_Dimension_4513@reddit
40% speedup on a MacBook M5Max is no joke — MTP draft tokens are underrated for local inference. Gemma 4 26B at that speed starts to feel actually usable for real workloads without a GPU rack.
MiaBchDave@reddit
Gemma 4 31B BF16 with MTP is showing 2x speed increase on my same hardware.
DKO75@reddit
How do you run it from your app?
gladkos@reddit (OP)
Hi! Not yet. We are working to add MTP to our atomic.chat app. Just install it; an update will come in the next few days with a popup window.
j0j0n4th4n@reddit
Does this work with finetunes/heretic/ablated/etc. versions of Gemma 4, or just the official model?
gladkos@reddit (OP)
Each model requires its own small assistant. We took the official model pairs. Not sure if it works with others..
nickleodoen@reddit
visualization looks sick
gladkos@reddit (OP)
thank you! we have a small tasting stage in our lab
ChessGibson@reddit
Very cool tests! Did you try with Gemma E2B and E4B?
gladkos@reddit (OP)
thank you! not yet. these are simple models for easy tasks mostly.
IrisColt@reddit
Thanks for the patched llama.cpp!!!
Confident-Aerie-6222@reddit
does it work in lmstudio?
jtjstock@reddit
No, this needs to be merged into the mainline first, then lm studio will need to update. Something to watch though, the performance improvement is amazing.
Temporary-Roof2867@reddit
but does it only work for MAC? 👀👀
gladkos@reddit (OP)
You can compile and run llama.cpp on Mac, Windows, or Linux, or just use the atomic.chat harness.
Qwen3_6_27b_UD_Q4XL@reddit
Need to force them to answer as similarly as possible to compare quality.
gladkos@reddit (OP)
Yeah, I might run a bunch of prompts soon
fallingdowndizzyvr@reddit
Crank the temp down to 0.
mxforest@reddit
I thought it was crank up and tone down?
fallingdowndizzyvr@reddit
No. The higher the temp, the more variable it becomes. If you crank it down to 0, it's effectively deterministic.
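For intuition, here's a toy temperature sampler in plain NumPy (illustration only, not llama.cpp's actual sampling code): dividing the logits by the temperature sharpens or flattens the distribution, and at temperature 0 it collapses to a plain argmax, which is why the output becomes deterministic.
```
import numpy as np

def sample_token(logits, temperature, rng):
    # Toy sampler for illustration; not llama.cpp's real implementation.
    if temperature <= 0:
        # Temperature 0: greedy argmax, fully deterministic.
        return int(np.argmax(logits))
    # Higher temperature flattens the softmax, so sampling gets more variable.
    scaled = np.asarray(logits, dtype=np.float64) / temperature
    probs = np.exp(scaled - scaled.max())
    probs /= probs.sum()
    return int(rng.choice(len(probs), p=probs))

rng = np.random.default_rng(42)
logits = [2.0, 1.5, 0.3]
print(sample_token(logits, 0.0, rng))  # always token 0
print(sample_token(logits, 1.0, rng))  # usually token 0, sometimes 1 or 2
```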
TJKDev@reddit
I think they were making a point about the way you used “crank down” not about the actual temperature of the model.
fallingdowndizzyvr@reddit
Ah.... In that case that's also wrong.
Crank just means "turn". As in turning a crank. You can crank things up or down.
https://www.collinsdictionary.com/dictionary/english/crank-handle
https://dictionary.reverso.net/english-definition/crank+down
Alternative-Suit5541@reddit
Just post a long text in context and tell it to copy it back.
Done.
rockseller@reddit
How is the quality of the generated output? Since it's based on guessing, idk, does it have worse results or a downside?
Shinkai_I@reddit
This isn't just about guessing; it's about combining rapid guessing with large-scale parallel validation.
When a guess is rejected by the validator, the original generation method takes over for the subsequent tokens, which preserves the quality of the generated output.
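For anyone who wants the mechanics, here's a rough sketch of one greedy draft-and-verify round; `draft_next` and `main_greedy_for` are hypothetical stand-ins for the small assistant and the main model, not the actual llama.cpp/MTP API.
```
def speculative_step(prefix, k, draft_next, main_greedy_for):
    # One draft/verify round of greedy speculative decoding (illustration only).
    # draft_next(tokens)              -> assistant's guess for the next token (hypothetical)
    # main_greedy_for(prefix, drafts) -> the main model's own greedy choice at each
    #                                    drafted position, from one batched forward
    #                                    pass (hypothetical)

    # 1. The small assistant cheaply drafts k tokens, one at a time.
    drafted = []
    for _ in range(k):
        drafted.append(draft_next(prefix + drafted))

    # 2. The main model checks all k drafts in a single forward pass; this batch
    #    verification is why the round is cheaper than generating k tokens with
    #    the main model alone.
    targets = main_greedy_for(prefix, drafted)

    # 3. Keep drafted tokens only while they match the main model; the first
    #    mismatch is replaced by the main model's own token, so the final text
    #    matches what the main model would have produced by itself.
    out = list(prefix)
    for guess, target in zip(drafted, targets):
        out.append(target)
        if guess != target:
            break
    return out
```
In the best case all k drafts are accepted per main-model pass; in the worst case you still make the same progress as ordinary decoding, just with a little extra drafting overhead.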
rockseller@reddit
Thanks for sharing! Does it also improve the prefill speed or not? Like, I noticed it takes a while to read files (using llama.cpp).
Shinkai_I@reddit
From what I understand, it won't speed up the pre-fill stage.
In fact, it might slightly increase TTFT latency, but not significantly (typically between 5% and 15%, depending on your environment).
I think what's more noteworthy is the increased usage of inference memory.
opossum_cz@reddit
No. It doesn't change output.
grumd@reddit
Would be interesting to see the same comparison but with the same seed and temp 0.0; supposedly the output would be exactly the same, proving MTP isn't degrading quality.
m18coppola@reddit
The MTP heads predict multiple tokens at the same time, so the main model can verify them in parallel. If the main model doesn't agree, the tokens are rejected and resampled by the large model. MTP is effectively speculative decoding and will not affect the output at all, just generation speed.
grumd@reddit
I know. I said it would be nice to see how it works in a gif, instead of OP's gif showing different outputs.
Also, bugs in the implementation can cause errors during matmuls or quantization or whatnot, leading to different outputs.
Former-Ad-5757@reddit
Supposedly yes; in reality, with quantization etc. it can shift slightly, and because they are just next-word predictors, one tiny shift means everything after it will be different.
Basically, the outputs can only be ~99% the same if the error happens in one of the last tokens. If the error happens earlier, similarity immediately drops because everything after it will also be different.
gladkos@reddit (OP)
The model won't answer the same way every time. A judge is needed.
k4ch0w@reddit
Set the seed to the same value, like 42, and then set temp to 0.
mxforest@reddit
That will work between 2 runs of MTP but not between runs that have it enabled and disabled.
opossum_cz@reddit
Nonsense. Do not comment if you don't know enough.
petuman@reddit
Even if the draft model is producing random noise, it should have no effect on the final output, as each individual draft token gets verified by the main model.
the320x200@reddit
The full model checks a batch of predictions from the draft model and they only get accepted if they match the full model. There should be zero output change with MTP on vs off.
Ok-Measurement-1575@reddit
This looks great but the burning question is:
Can 27b with mtp enabled STILL fix the slop produced by opus?
stoppableDissolution@reddit
Mtp has no effect on the output itself
Ok-Measurement-1575@reddit
bollocks
stoppableDissolution@reddit
You don't know how speculative decoding works, do you
Ok-Measurement-1575@reddit
no idea
gvij@reddit
Why is the difference not as large as mentioned in the release notes?
pizzaboyreddit@reddit
Also getting great results in vLLM; it's really made the 31B usable.
bleakj@reddit
Honestly, the Gemma 4 models have all been really usable on "reasonable" consumer gear.
Even my 4070ti system runs 31B fine (I do offload some into system RAM, which slows it down a bit, but I don't find it too bad; it's still usually 40-50 tok/s or better). Mind you, even the jump from the 4070ti to a 4090ti makes a huge difference.
TheRealMasonMac@reddit
Try DFlash. I heard that it’s even faster?
gladkos@reddit (OP)
will release Dflash support very soon! It's faster, but with losses
bleakj@reddit
What type of losses? (memory?)
TheLipovoy@reddit
What about Ollama?
opossum_cz@reddit
Ollama is badly maintained in my opinion.
Nexter92@reddit
The landing page looks very, very good.
gladkos@reddit (OP)
thank you!
HavenTerminal_com@reddit
97 was already enough. 138 is showing off.
WithoutReason1729@reddit
Your post is getting popular and we just featured it on our Discord! Come check it out!
You've also been given a special flair for your contribution. We appreciate your post!
I am a bot and this action was performed automatically.
innovasior@reddit
Does this work with ollama and lm studio?
false79@reddit
You. You have SOTA local. That is pretty cool.
CBW1255@reddit
Either you are lying to him or to yourself.
gladkos@reddit (OP)
thanks! It's a Monster!
kjbbbreddd@reddit
I'm running gemma 4 31b Heretic for image captioning, and it's taking 10 minutes per image. I'm excited to see what happens.
Monkey_1505@reddit
Try qwen 9b instead, it'll be faster.
DashinTheFields@reddit
I have a vision model that identifies items in like half a second; it's only like 3-4b. Shouldn't your caption need a dramatically smaller model? Or do you need something special from the model you are using?
EvilGuy@reddit
10 mins? Is this running on your CPU?
Zauos@reddit
u/gladkos please make heretic (https://github.com/p-e-w/heretic) ggufs! you would do me a great favour
gladkos@reddit (OP)
nice! will play a bit thank you