Gemma 4 MTP released
Posted by rerri@reddit | LocalLLaMA | View on Reddit | 258 comments
https://huggingface.co/google/gemma-4-31B-it-assistant
This model card is for the Multi-Token Prediction (MTP) drafters for the Gemma 4 models. MTP is implemented by extending the base model with a smaller, faster draft model. When used in a Speculative Decoding pipeline, the draft model predicts several tokens ahead, which the target model then verifies in parallel. This results in significant decoding speedups (up to 2x) while guaranteeing the exact same quality as standard generation, making these checkpoints perfect for low-latency and on-device applications.
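For readers new to the technique, here is a minimal sketch of the draft-then-verify loop the card describes. The draft_model / target_model callables and the greedy acceptance rule are illustrative stand-ins, not the actual Gemma 4 implementation.

    # Toy speculative decoding step: the small draft model proposes k tokens,
    # the large target model scores them all in one parallel forward pass and
    # keeps the longest matching prefix. draft_model / target_model are
    # hypothetical objects assumed to expose the methods used below.
    def speculative_step(target_model, draft_model, tokens, k=3):
        # 1. Draft k tokens autoregressively with the cheap model.
        draft, ctx = [], list(tokens)
        for _ in range(k):
            t = draft_model.next_token(ctx)      # fast, small forward pass
            draft.append(t)
            ctx.append(t)

        # 2. One forward pass of the big model over tokens + draft yields its
        #    greedy choice after each prefix: k + 1 predictions in total.
        verified = target_model.greedy_tokens(tokens, draft)

        # 3. Accept drafted tokens while they match; at the first mismatch keep
        #    the target's own token instead. If everything matched, the
        #    (k+1)-th prediction is a free bonus token.
        out = list(tokens)
        for i, d in enumerate(draft):
            if d == verified[i]:
                out.append(d)
            else:
                out.append(verified[i])
                return out
        out.append(verified[k])
        return out

Because every token that survives is one the target itself would have produced, output quality matches standard generation; only the speed changes.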
inthesearchof@reddit
With the Gemma 4 fixes and updates, Gemma 4 and Qwen 3.6 are trading blows.
rerri@reddit (OP)
Depends on the use case too. Gemma 4 31B is vastly better at writing Finnish than Qwen 27B.
BusRevolutionary9893@reddit
LoL, quite the flex.
SirBardBarston@reddit
Wen Unsloth?
hackerllama@reddit
Enjoy!
dampflokfreund@reddit
Awesome, thank you! Right in time for llama.cpp support.
hackerllama@reddit
Yes, excited for it to land!
In the meantime, we're landing transformers, Ollama, VLLM, SGLang, and MLX support.
Virtamancer@reddit
What did you mean about MLX? My use case is entirely macOS so if I can get MTP support in MLX (or maybe even in MLX-vllm) that would be HUGE.
hackerllama@reddit
https://github.com/Blaizzy/mlx-vlm/pull/1112
https://huggingface.co/collections/mlx-community/gemma-4-assistant-mtp
boutell@reddit
That PR has been merged. But so far I'm getting an error trying to use the draft model with up to date MLX via pip in a fresh venv. Have you had any luck?
fatboy93@reddit
clone the repo, cd into it, and do a pip install --force .
boutell@reddit
Thanks! Got much farther along, but wound up opening a ticket. Of course it's possible I'm still doing something wrong:
https://github.com/Blaizzy/mlx-vlm/issues/1122
illusionmist@reddit
Nice!
DigiDecode_@reddit
The Gemma 4 implementation seems quite a bit different from Qwen 3.6's MTP, i.e. shared KV cache, activations from the target shared with the drafter, clustering of the embedding for the drafter
so my guess is it will need its own implementation, and is likely to take more time to be supported by llama.cpp unless the work is already done
mythikal03@reddit
I built from the open vLLM PR today (#41745 — [Spec Decode]) and wanted to share some results from my RTX 6000. This is bonkers.
FIdelity88@reddit
Where did you get that benchmark script? If it’s yours; care to share on GitHub?
mythikal03@reddit
You bet, it's just a shell script I've been adding to over time. Enjoy: https://gist.github.com/mythikal03/57ec60665fa41b23c43fb904a25af4e0
Intelligent-Lynx-953@reddit
The memory overhead here is almost trivially small. 930MB for the 31B drafter, 78M for the E2B. For anyone already running these models with a few GB of headroom, it's basically free throughput.
The real question is whether that 2x speedup holds when you're doing partial CPU offload. Speculative decoding needs the draft and verify steps to run fast on the same device, and if your main model is split across GPU and CPU, the verify bottleneck on the slower path could eat into the gains. Would be curious to see benchmarks from a mixed offload setup.
Rikers88@reddit
This is super cool! I get that this is a super specialized model for Gemma, but isn't there already the option in llama.cpp to plug in a drafter? It only works well about 66% of the time if you don't fine-tune it like the Gemma people did. Am I wrong?
Thanks for sharing!!
Daemontatox@reddit
How are you people running it? vLLM says multimodal MTP is not supported yet and llama.cpp still has a pending PR
MaartenGr@reddit
For those interested in how they work, I updated my visual guide with some snippets here and there: https://newsletter.maartengrootendorst.com/i/193064129/multi-token-prediction-mtp-with-gemma-4
zzzzlugg@reddit
Thanks for the nice write up. I'm curious what you are using to make the diagrams? They're nice and crisp.
DigiDecode_@reddit
it seems to be a verbatim copy from https://x.com/googlegemma/status/2051694045869879749 with no link to the original source
djdanlib@reddit
Isn't it the same author though?
DigiDecode_@reddit
the link I posted is from official GoogleGemma account on X
djdanlib@reddit
What I'm saying is, the person you're replying to looks like the actual author of the paper, and appears to work on that specific team at Google.
wren6991@reddit
One thing I find weird about the Gemma 4 family: why does the 31B not use PLE? I wouldn't mind having an extra 30 GB on disk if it meant better model performance for the same VRAM, bandwidth and compute.
log_2@reddit
How were gradients backpropagated through the clustering head?
superdariom@reddit
Really helpful
Champignac1@reddit
Great read, easy to understand even for a non native speaker !
IrisColt@reddit
Thanks!!
cleversmoke@reddit
Beautiful write up. Thank you!
portmanteaudition@reddit
Great guide
APFrisco@reddit
Such a great write up, thanks! I’ll be coming back to this one often
keepthepace@reddit
Thanks, that's a great write-up. I had never really understood the per-layer embeddings correctly.
I am pretty sure right now that some people are working hard at combining it with engrams and I can't wait to see what's happening there!
marscarsrars@reddit
Genius.
getpodapp@reddit
Great write up
JoNike@reddit
Great writing, great explanations!
hackerllama@reddit
This is the way!
Craftkorb@reddit
The E2B model had a 78M draft model - Cuuute!
No_Afternoon_4260@reddit
Can someone explain to me how MTP is different from speculative decoding?
No-Refrigerator-1672@reddit
In the case of Gemma 4 it isn't; they published speculative decoding drafters. In the case of Qwen 3.5 and Next, MTP is done as a secondary output layer that looks into the internal states of the model.
No_Afternoon_4260@reddit
That is my feeling, thanks !
arbv@reddit
UwU tensor
NineThreeTilNow@reddit
I think some people think you need hundreds of millions or a billion parameters in models to do useful stuff.
Some of the heaviest lifting done by Gemma is within the vocabulary Google built. The tokenizer is extremely well trained, which is how the model ends up performing so well pound for pound against other models.
Someone at Google questioned the first principles of scaling. Parameters for the sake of parameters doesn't make sense if you have hardware to train an amazing tokenizer. It was the original Qwen 500m? model that demonstrated the strength of it. I think that model uses like 300m of those parameters for the tokenizer and only 200m for the weights of the model.
Gemma 4 uses a 262k-entry vocabulary, versus Llama's 32k in version 2 and 128k in version 3.
I think DeepSeek v4 should have used a larger tokenizer but they stuck with the 128k.
That little draft model is borrowing heavily from their tokenizer, which is something like ~3B parameters.
DistanceSolar1449@reddit
Meanwhile o200k
First_Ad6432@reddit
look at this tiny little safetensor, so small XD
kingo86@reddit
*squeals*
GirlNumber20@reddit
I have found my people. 🤗
Acceptable_Home_@reddit
He is so small he only needs one popcorn 🥹🥹
OuterKey@reddit
Surprisingly small draft model
Queasy-Contract9753@reddit
I need to clear space on my phone and try this out. It's only 6 GB, so it'll fit.
No-Upstairs-4031@reddit
Is this for real? When did Google get so generous?
cass1o@reddit
Eh, when they pioneered the entire modern LLM field.
br33213@reddit
Since DeepMind took over (they always were generous: see AlphaFold, WeatherNext, the research from AlphaGo, ...). Google itself never was, but Hassabis made a deal so they don't constantly have to struggle for funds.
kvothe5688@reddit
Saying Google was never generous is misleading. Google has always published lots of research, is one of the biggest contributors to the Linux kernel, and gave us Kubernetes, Angular, and Go, plus lots of health-related research and flood and wildfire warning systems. Google does not usually hoard research, unlike most other tech companies.
combrade@reddit
The easiest counterexample to Google is Amazon. They literally have entire cloud products, like their managed versions of Airflow and Kubernetes, that are just open source projects repackaged. So many other AWS products do the same.
Warrenio@reddit
Google pretty much started the AI boom by publishing the "Attention Is All You Need" paper.
DistanceSolar1449@reddit
To be fair, everyone inside and outside Google recognizes now that if they had known what they had, they wouldn't have published it publicly. They just had no clue how big a deal it was going to become.
BoobooSmash31337@reddit
Isn't Go literally named after Google? I think the creator at least worked there. Maybe it's a coincidence.
ninjasaid13@reddit
Didn't they delay research by months?
draconic_tongue@reddit
I don't think all of their AI stuff is through DeepMind. https://huggingface.co/google: T5, SigLIP, BERT, the transformers paper...
br33213@reddit
Fair point, good correction.
hackerllama@reddit
The team is cooking!
arbv@reddit
Gemmas are the most balanced models one can run locally, and probably the best ones for non-English speakers, second only to Google's own cloud models.
Now I am hoping for a rumored Gemma 4 122B AxB (I hope it wasn't too good to be shelved - someone has to dethrone GPT-OSS 120B), and a QAT release series (like it was for Gemma 3).
quickreactor@reddit
long may it continue
dampflokfreund@reddit
I'm very grateful for what you have released, Gemma 4 is awesome. However I do hope you will keep the momentum up! Gemma 4.1 with even better quality, QAT, more reliable tool calling/agentic would be amazing.
Altruistic_Heat_9531@reddit
They always were. They are pretty much THE heavy lifter on LLM and other OSS.
Transformer, MoE, Instruct model, BERT, all from them.
DigiDecode_@reddit
maybe sponsored & paid by Apple for Apple Intelligence
andybrohol@reddit
Hassabis mentioned that they wanted to open source to help academia, and that if they put the models on device they're already exposed, so why not just open source them.
GreenGreasyGreasels@reddit
That made me realize there will be no Gemma4-124B :(
First_Ad6432@reddit
They are... probably addicted to AI ;v
MoneyPowerNexis@reddit
my qwen 27b Q8 results with ~1k tokens generated / 250k context limit:
A6000 RTX
2x A6000 --split-mode tensor
Very Nice
erisian2342@reddit
That blog post was pretty great right up until the last line about the “Gemmaverse”. What lazy marketing moron comes up with this crap? Anything *verse lost all meaning and value when the Metaverse was vomited into the world.
everyoneisodd@reddit
I understand this is a great speed boost for local inference on llama.cpp. Wanted to understand if there is any benefit on inference engines like vLLM? I am under the impression from the previous speculative decoding conversations that it doesn't matter much for inference engines. Please correct me if I am wrong.
rerri@reddit (OP)
MTP definitely matters for inference engines and should work well in vLLM. A fellow redditor commented two days ago that for Qwen 3.6 27B they are getting about 2x tg speed:
https://www.reddit.com/r/LocalLLaMA/comments/1t3guzw/comment/ojvbi9l/
Dry-Reveal4114@reddit
Has anyone tested these yet with quantized Gemma 4 models? Wondering how much of the speedup remains after quantization.
rm-rf-rm@reddit
I'm really doubtful/fearful that, given the pitiful state of benchmarks at actually measuring intelligence, these engineering improvements narrowly focused on speed/latency may be causing quality regressions that go unmeasured.
soldture@reddit
I'm still wondering why diffusion‑based LLMs like Mercury 2 are not widely adopted. Mercury is so fast
LetsGoBrandon4256@reddit
Sounds awesome. What's the catch though?
coder543@reddit
If you need energy efficiency more than you need speed, drafting is probably the wrong choice. It trades extra compute for speed, and a lot of the drafted tokens will be rejected as nothing more than wasted energy.
You also need more memory.
Those are the only tradeoffs.
DigiDecode_@reddit
with a low acceptance rate it's all overhead and no benefit
coder543@reddit
If the acceptance rate for MTP is low, then the MTP is broken.
DigiDecode_@reddit
yeah, but acceptance rate depends on the context domain, i.e. coding might get high AR, whereas a foreign language that the drafter was not trained on will see low AR
coder543@reddit
MTP is trained on everything the model is trained on. That's what I'm saying. If the MTP doesn't know that language, neither does the main model.
rerri@reddit (OP)
There's a small catch: Slightly higher memory requirements.
BitGreen1270@reddit
How much higher? My gemma4-26B apex model is about 21gb. How much memory will MTP take?
2Norn@reddit
Not a lot; the 26B assistant is only ~400M parameters I believe, so it can't be more than 1.5 GB at most.
rerri@reddit (OP)
The MTP model for Gemma 4 26B is ~800 MB, but the llama.cpp implementation will most likely require some more on top of that though. Hard to say how much.
nickm_27@reddit
That is the safetensors size; I think llama.cpp uses Q8_0 for MTP?
I had Gemini read the PR and guess what the extra VRAM usage would be and this is what it gave
TheTerrasque@reddit
the qwen3.6 27b model apparently takes roughly 3gb extra at runtime
IrisColt@reddit
Is there a Qwen 3.6 27B MTP model already supported by llama.cpp?
TheTerrasque@reddit
The creator of the PR made a model, and some have grafted the mtp part onto other quant models and got it working.
BannedGoNext@reddit
You forgot that if it gets 3 speculations wrong in a row it summons Beetlejuice, but that's really a small price to pay.
IrisColt@reddit
heh, but true
Silver-Champion-4846@reddit
Like, mashed insects?
legos_on_the_brain@reddit
Beetlejuice, Beetlejuice, Beetlejuice!
dtdisapointingresult@reddit
There's none. All speculative decoding[*] is a free lunch, a guaranteed 30%-100% generation speed boost; I'm not sure why it's not the first thing recommended to people. This is assuming you don't get greedy and configure a higher number of predicted tokens than your hardware can crunch in one pass without slowing down the overall generation. Just do some experiments with the number of tokens once, and find your sweet spot. Use a real prompt, like the same coding task, or the same essay request.
[*] MTP, Eagle3, and 'separate small draft model'
Double_Cause4609@reddit
Well, the logic of speculative decoding is that you already paid the catch.
Basically, autoregressive models (like most LLMs) which predict the next token are really wasteful. They use a ton of memory bandwidth, but not really a lot of compute.
Modern processors are generally rich in compute, but low in bandwidth.
What this means is that if you're running a single user context (self hosting a chatbot, etc), you generally are massively under-utilizing your hardware.
All multi token prediction and speculative decoding do is move you from a memory bound scenario to a compute bound one, and give you some extra token predictions along the way.
For reference, Diffusion language models are already compute bound and so do not need this process, and that emphasis on compute is how they derive their massive speedups compared to autoregressive baselines.
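Rough, illustrative arithmetic for that point (every number below is an assumption, not a measurement):

    # Single-user decode streams the full set of weights from memory once per
    # forward pass, so the token rate is roughly bandwidth / model size no
    # matter how much compute sits idle.
    weights_gb = 18.0       # e.g. a ~31B model at ~4.5 bits/weight (assumption)
    bandwidth_gbps = 900.0  # memory bandwidth in GB/s (assumption)

    t_pass = weights_gb / bandwidth_gbps
    print(f"plain decode: ~{1 / t_pass:.0f} tok/s (memory bound)")

    # Speculative decoding / MTP verifies several drafted tokens in one pass
    # for roughly the same weight traffic; the extra work is compute, which
    # was idle anyway. Assume each verify pass yields ~2.5 tokens on average
    # (acceptance dependent) and drafting adds ~15% overhead.
    tokens_per_pass = 2.5
    overhead = 1.15
    print(f"speculative:  ~{tokens_per_pass / (t_pass * overhead):.0f} tok/s")

With those made-up numbers the decode rate roughly doubles, which is in the same ballpark as what people are reporting.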
shroddy@reddit
When benchmarking prompt processing, is that how fast a GPU or CPU would be when purely compute bound?
Double_Cause4609@reddit
Prompt processing is fundamentally compute bound, but it can get a bit nuanced with MoE models.
For dense models it's pretty simple. You're essentially running a batch of activations through each weight tensor. In fact, arguably, you can even load each individual layer to a GPU (or even individual tensors!), run the forward for that loaded block of weights, and then load the next layer. LCPP doesn't do this for example, but Krasis does to my memory.
For MoE models it gets a bit complicated because there's an expert co-occurrence factor that you have to account for. With low expert overlap, CPUs in particular slow down a lot more than you'd expect for prompt processing (they can look bandwidth bound here), but with high co-occurrence they're compute bound, just like dense models.
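To make the co-occurrence point concrete, here is a toy calculation assuming uniform routing (real routers are not uniform, so this only shows the shape of the effect):

    # With E experts and a active per token, a batch of B tokens touches on
    # average E * (1 - (1 - a/E)**B) distinct experts, so the weight traffic
    # per expert is only amortized once batches get fairly large.
    E, a = 256, 8
    for B in (1, 4, 16, 64, 256):
        distinct = E * (1 - (1 - a / E) ** B)
        print(f"batch {B:3d}: ~{distinct:5.0f} experts loaded, "
              f"~{B * a / distinct:4.1f} tokens per loaded expert")

Small batches load a nearly fresh set of experts for every token, which is why prompt processing on a bandwidth-limited CPU can still look bandwidth bound for MoE until the batch is large.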
Healthy-Nebula-3603@reddit
What?
You know why Mac Pro computers are slow with LLMs in spite of very fast memory (800 GB/s)?
Their CPU is too slow to produce output fast.
LLMs need not only fast RAM but also a fast CPU (or GPU).
For instance, underclocking my RTX 3090 by 50% without touching VRAM, I'm losing almost 30% token generation speed.
Double_Cause4609@reddit
What do you mean?
Macs have very fast CPUs overall. I think you're mixing something up.
Macs have relatively low compute compared to GPUs or other solutions, so at long context they can be a bit slow, but in situations where you're bandwidth bound (low context LLM inference for example) Macs are roughly as fast as their bandwidth would indicate.
Now, if you're talking about real-world inference, like booting up LlamaCPP and comparing speeds, the GPU may have a speed advantage because so much work was put into the CUDA backend or something, there can be *software* overhead, but this is more down to inefficient software than a fundamental statement on the scaling properties of LLMs on hardware.
I'm very confused as to where you're getting your ideas from, and I'd have to see the numbers you're looking at because what you're describing isn't something I've seen personally.
Healthy-Nebula-3603@reddit
DID you read my post to the end?
Freonr2@reddit
Besides what others pointed out on higher mem use, speculative decoding uses compute that could be otherwise used to increase concurrency. Won't matter if you are not using concurrency ofc.
Intelligent_Ice_113@reddit
I'm sure there must be one! LLM addiction or bill for electricity.
This just cannot be true.
dampflokfreund@reddit
For speculative decoding, the draft model usually uses quite a lot of memory. I haven't had any luck on my laptop with it. Hopefully for MTP it is different.
Top_Break1374@reddit
How do I run it?
coder543@reddit
llama.cpp does not have MTP support yet, so that rules out a lot of people for now. Maybe soon.
tarruda@reddit
https://github.com/ggml-org/llama.cpp/pull/22673 looks like very soon
audioen@reddit
Built the pr, testing it on Vulkan. The Q8_0 GGUF provides around 21 tokens/s early on in the context on a Strix Halo. I'm using spec-draft-n-max = 3 and it seems like it always generates maximum length drafts because the numbers are 1:1 the same. This is a little disappointing to me -- I assumed that the draft model predicts probabilities and so the regular speculative decoding confidence could produce variable length drafts according to the speculation head's probabilities, but evidently it either works differently or this is a minor oversight that will be corrected soon.
Other limitations: only parallel=1 works, meaning no multiple streams decoding in parallel. This is hopefully going to be next item on the list to fix.
But I don't really care to complain. I'm elated. This is easily double the performance I'm used to getting, and I was already willing to wait for 27b's results because they are that good. Much less waiting now, so that's incredibly good.
Excellent work from the llama.cpp team, especially am17an. Thank you for the solid work and the biggest performance gain I've ever seen on this software.
ricesteam@reddit
Assuming I downloaded the right GGUF, do I just run it normally or do I need some specific flags?
IrisColt@reddit
Any answer on this?
nickm_27@reddit
-spec-type mtp --spec-draft-n-max 3
IrisColt@reddit
Thanks!!!
nickm_27@reddit
-spec-type mtp --spec-draft-n-max 3
coder543@reddit
This PR does not appear to implement p-min for MTP.
tarruda@reddit
Nice to know. I currently get around 16 tokens/second on 3.6 27b with a M1 ultra and hopefully this will bring me close to 30 tokens/second
Zeeplankton@reddit
idk how llamacpp maintainers don't go insane trying to support every new feature
philmarcracken@reddit
oh it's not them slowly going nuts
keepthepace@reddit
They were insane to start with!
(jk, we love you!)
Top-Rub-4670@reddit
ggerganov seems like a very pragmatic leader.
Thank god for that! A lesser man would have allowed llama.cpp to devolve and we'd probably need docker + npm + python + rust to run it and a 28-step process to build/bundle it.
But nope, he stayed true to the mission. A powerful yet portable program. It doesn't try to be everything, it just tries to be a building block. The pillar on which the entire local inference community is built, really.
jld1532@reddit
So in theory, once this is implemented we can use this for any model sets that have the same general architecture? Looks like he was using qwen 3.5 0.8B with the larger 3.6 models.
Public_Umpire_1099@reddit
Warning, info dump essay incoming
Yeah, basically. The key requirement is that the draft model and target model share the same tokenizer, since the draft has to produce token IDs that the target understands. Same model family is the easiest way to guarantee that — Qwen draft + Qwen target works, Gemma draft + Gemma target works, but Qwen draft + Llama target won't because the vocabularies differ.
Quick clarification on terms because I went down this rabbit hole myself recently--
Speculative decoding is the general technique: a small fast "draft" model proposes N tokens, and the big "target" model verifies them all in one parallel pass. If the draft was right, you accept those tokens at the speed of one forward pass instead of N. It's already in llama.cpp via --model-draft. Basically, the small model is writing the essay FAST and shows the large model its homework so they can cheat on the exam. For each part, the large model either says "yep, that's what I would've written" and keeps it, or "nope" and writes the rest itself starting from the rejection point. The large model does this, then turns it in to the teacher (you). The end result the teacher sees is that the large model turned in a pretty good exam, and did it faster than it usually would have. It was slower than what the small model alone could do, but far more accurate and informative, because it only kept the parts that made sense. The large model balances speed, efficiency, and accuracy.
MTP (multi-token prediction) in the strict sense is an architectural feature where the model has multiple prediction heads built in (DeepSeek V3 popularized this). Google's recent Gemma 4 announcement uses "MTP" loosely — what they actually released is small drafter models for classical speculative decoding, not built-in heads.
On the Gemma E2B/E4B side: those are dense models, not MoE. The "E" stands for Edge (Google's edge model family). E2B is ~2B params, E4B is ~4B params, all parameters active per token. These should be straightforward speculative decoding targets. It is really important that they release these, because everyone has been waiting on it. They teased it a few weeks ago when they showed benchmarks using this "MTP" method, and a lot of people found themselves a bit disappointed at the speed.
One important thing I discovered:
On Qwen3.6-35B-A3B: it's MoE — 35B total params, ~3B active per token. The router selects 8 of 256 experts per token. Speculative decoding still works on MoE, but the gain is somewhat smaller than on dense models. When the target verifies N draft tokens in parallel, those tokens may route to different experts, so the weight-load amortization that makes spec decoding fast is partial rather than complete.
For the smaller-Qwen-as-drafter idea (Qwen3.5-0.8B drafting for 3.6-35B-A3B): tokenizer compatibility is the first thing to check. If those two share vocabulary, it should work. Acceptance rate will determine actual speedup — could be anywhere from 1.2x to 2x depending on how well the small model predicts the big one's distribution. Theory says spec decoding on MoE just trims the win because parallel verification doesn't amortize as cleanly. In practice on my hardware, it was a 3× regression (10.5 tok/s vs 30 tok/s baseline) even with 100% acceptance rate using same-family Q4 target + Q2 draft. Your mileage really does depend on whether your hardware is compute-bound or bandwidth-bound.
BUT!! Here are the big caveats: using a 0.8B as a drafter for a much better model is almost certainly going to give you only a very small increase, ~10-20%. For drafting to work, the small model needs to get a decent amount of the information correct, and 0.8B isn't really gonna cut it. Also, spec decoding is just a lot less efficient on MoE models. Unless Qwen releases a model specifically for drafting for their MoE model, or their 27B dense model, you might not find a huge jump. Or you could, I guess. The more I mess with this stuff the less I think I understand lol. Everything depends on whether your setup is compute bound or bandwidth bound. Once you know which you fall under, predicting gains becomes a lot easier.
If you want to test it: --model-draft is the flag. Watch acceptance rate in the server logs. If acceptance is high but your tok/s is lower than target-alone, you've hit the same wall I did.
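A quick way to sanity-check the tokenizer-compatibility point before downloading anything big is something like this (the repo ids are placeholders, substitute whichever pair you're considering):

    # Compare draft and target vocabularies: for speculative decoding the
    # draft's token ids must line up with the target's.
    from transformers import AutoTokenizer

    draft_tok = AutoTokenizer.from_pretrained("org/draft-model")    # placeholder
    target_tok = AutoTokenizer.from_pretrained("org/target-model")  # placeholder

    dv, tv = draft_tok.get_vocab(), target_tok.get_vocab()
    mismatched = [tok for tok, idx in dv.items() if tv.get(tok) != idx]
    print(f"draft vocab {len(dv)}, target vocab {len(tv)}, "
          f"tokens with differing ids: {len(mismatched)}")

If the mismatch count isn't zero (or at least confined to special tokens), the target can't directly verify the draft's proposals and the pairing won't work.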
ParadigmComplex@reddit
My low-confidence understanding is that this isn't typical MTP with additional built-in heads, but it also isn't classic speculative decoding with stand-alone draft models, either.
From the blog post:
which almost feels like slapping additional layers onto the model. It seems like it blurs the line between traditional MTP heads and traditional draft model.
BoobooSmash31337@reddit
E is "effective" afaik. E2B is 4B and E4B is 8B.
StardockEngineer@reddit
I'm using it, too! It works!
JsThiago5@reddit
I am using it and it's working.
tarruda@reddit
Using with Gemma 4?
UnWiseSageVibe@reddit
They had a merge yesterday to add it: https://github.com/ggml-org/llama.cpp/pull/22673
chindoza@reddit
This has not yet been merged as of writing.
coder543@reddit
🤔 yes...?
michaelsoft__binbows@reddit
I have the same question. these might be the precursor models that the quantizers will use to prepare us the AWQ/autoround/int4 stuff for use with vllm and GGUFs for use with llama (which is also getting MTP soon). looking forward to the coming goodies.
unique-moi@reddit
VLLM supports MTP with qwen3.6 models
michaelsoft__binbows@reddit
This is what I have been using yes. 120 tok/s or so on the 27B with a 5090. Really good perf.
praxis22@reddit
There is a configuration option in LMStudio; if you enable it, it gives you a file chooser.
IShitMyselfNow@reddit
Isn't that just speculative decoding?
LMStudio uses llama cpp behind the scenes so I'm a bit confused as to how they'd support something that lcpp doesn't :D
chimph@reddit
Ok, so I think what’s happening is that there will be models that have the MTP drafter built in but these Gemma drafters are separate models that target the Gemma 4 models. Therefore it is both speculative decoding and MTP.. just separated.
2Norn@reddit
exactly
chimph@reddit
It surely also means that you can’t run from lmstudio since that uses llama.cpp and that doesn’t support this specific implementation yet?
2Norn@reddit
https://github.com/ggml-org/llama.cpp/pull/22673
yeah kinda, so it's not readily available i suppose but we'll get it soon
RickyRickC137@reddit
It says no compatible model found in LMStudio. I am using GGUFs for the original model btw.
praxis22@reddit
There are four model links above, matching the four model sizes of Gemma 4. So if you have the Gemma 4 31B model installed, you would download the smaller model from the first link above. The video I posted in the link above shows you how to proceed from there.
grumpydad67@reddit
Total n00b here. I tried downloading one of the Gemma4 the assistant models from within LM Studio, but they don't show up in the model picker (yet). I assume this is normal?
praxis22@reddit
You can select a base model directory to download all models to, and it will scan that at startup. All models you download via LMStudio go there, and any you download from Hugging Face go there too.
I presume, though I have never tried, that you need to select the smaller model from the gear wheel attached to the model entry for the model you downloaded, in the interface as shown in the video above.
grumpydad67@reddit
Yeah, the problem is that you still need to download one of the drafting models, and those don't show up (yet) in LM Studio. Will keep trying!
OfficeNinja42@reddit
The list of supported prediction models seems to be hard-coded in LMStudio. One cannot use custom GGUFs, only a few combinations of older models. Hopefully one of the next releases fixes this.
jld1532@reddit
It's either hard-coded or functionality based. I was able to get it to work with Qwen 2.5 but nothing newer. Apparently vision capability may disable it?
helpmefindmycat@reddit
lm studio has had some issues regarding draft model to main model. I tried it early on, and found it was pretty good but something went awry. I think vllm supports speculative draft models in a more robust manner these days.
Top_Break1374@reddit
Where? I can't find any docs or any setting in my LMStudio.
praxis22@reddit
My PC has been offline for a while. However:
https://www.youtube.com/watch?v=eLdItqdMKK8
Top_Break1374@reddit
Thanks, found it myself already, but there is no GGUF of the draft models
jld1532@reddit
This is under speculative decoding, yes?
King0fFud@reddit
It'll be out for Ollama soon if you're running with MLX: https://github.com/ollama/ollama/releases/tag/v0.23.1-rc0
pmttyji@reddit
Same question here. ELI5 version please
florinandrei@reddit
You wait for the tools (llama.cpp, etc) to catch up with it, then you run it.
horribleGuy3115@reddit
Sigh! With llama.cpp there's no MTP yet; maybe fire up that vLLM script.
Qxz3@reddit
So can these be used as speculative decoding models in LM Studio?
tarruda@reddit
Not before https://github.com/ggml-org/llama.cpp/pull/22673 is merged
DigiDecode_@reddit
Gemma 4 MTP support will likely require more changes
2Norn@reddit
lm studio llamacpp version is like 2 weeks behind usually
marscarsrars@reddit
This is the way.
dero_name@reddit
This is the way.
Paradigmind@reddit
Is this the way?
genpfault@reddit
Token prediction failed, aborting decode.
imp_12189@reddit
Does anyone know da way?!
Silver-Champion-4846@reddit
Alex!
Specter_Origin@reddit
Way is this the?
Silver-Champion-4846@reddit
Do you know de way?
Don_Moahskarton@reddit
This is the way.
-JustAsking4AFriend@reddit
You mean, “This is the.. (MTP invoke) way”
Mother_Context_2446@reddit
Sweet! Does anyone know how to enable it with vLLM?
mythikal03@reddit
I had to pull in and build from vLLM PR #41745 that was opened for merge to main today.
Ok_Warning2146@reddit
Very good. Please also give us QAT version of 31B and 26BA3B.
Adventurous-Paper566@reddit
I assume it won't be compatible with the vision?
FerLuisxd@reddit
What about vram usage? How much did it increase?
Powerful_Evening5495@reddit
Reminds me of the Intel speed-up hack
boutell@reddit
All right, what am I doing wrong?
I created a Python venv and activated it
I did a fresh "pip install mlx mlx-lm"
I verified "which mlx_lm.generate" is coming from the venv:
Then I ran:
I got back:
...
... etc
Also noteworthy:
Thanks!
Eelz_@reddit
MTP support is currently in mlx-vlm, not mlx-lm. Also note, an updated version is not on PyPI so you will have to install from the main branch on GitHub.
https://github.com/Blaizzy/mlx-vlm/tree/main#gemma-4-mtp
boutell@reddit
Thank you
spac420@reddit
who's running it? what did you use?
3Xtrax@reddit
doesn't seem like there's a way to run it on LM Studio or llama.cpp atm :(
it should be a really promising speed up of 2-3x, will just have to wait and see.
WolpertingerRumo@reddit
ELI5, what’s MTP? I just can’t keep up with all the new slang every day.
ParadigmComplex@reddit
Lets say you're a super duper diligent student that loves doing homework and being ahead in class. You finish all your homework up early, then sit there bored. How could you get even more ahead? Well, if you can guess what the next homework assignment is, you can get started on it now. If you guess right, you're even more ahead! If you guess wrong, it didn't cost you anything, because you love doing homework.
Modern computers typically have a part that does math (the CPU, GPU, TPU, etc) and a part that remembers things (RAM/VRAM). What usually limits how fast an AI model can talk is the connection between these parts. The math part will finish the math very quickly then sit there for a while doing nothing but waiting for the remembering part to send it more numbers with which to do math. MTP ("Multi-Token Prediction") has the AI not only say things to the user, but also say guesses about future math the computer will have to do. The computer math part can then work on that guess when it's waiting for more information. If it's correct, the result is the AI can talk faster. If it's wrong, well, the math part wasn't doing anything productive during that time anyways.
It isn't always the right trick (e.g. competes with batching for compute headroom, better for dense rather than MoE models, requires additional memory, etc) but sometimes it can let AIs talk around twice as fast as they would otherwise on the same machine!
WolpertingerRumo@reddit
Interesting. So systems with low bandwidth but high compute and VRAM will profit most?
ParadigmComplex@reddit
The more drastic the low bandwidth vs high compute is, the more someone could potentially benefit from this. I suspect there may be diminishing returns as the likelihood of acceptance of the speculative tokens will reduce the farther out the model speculates, but I haven't seen either proofs or empirical testing to confirm this hunch yet.
Proportionally, the additional VRAM isn't that much. It's less about having a lot of VRAM than just not already having been right at the limit. If you can only just barely load a given model with context, this might be what pushes you over. But if you already had a bit of extra room left over, these additional layers might squeeze in there.
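The diminishing-returns hunch is easy to eyeball with a toy model where each drafted token is independently accepted with probability p (real acceptance isn't independent, so treat this as a sketch, not a proof):

    # Expected accepted tokens per verify pass: the accepted prefix length is
    # p + p^2 + ... + p^n, which saturates at p / (1 - p) no matter how far
    # ahead you draft.
    p = 0.7  # assumed per-token acceptance probability
    for n in (1, 2, 3, 5, 8, 16):
        expected = sum(p ** i for i in range(1, n + 1))
        print(f"draft {n:2d} ahead -> ~{expected:.2f} accepted per pass (+1 bonus)")
    print(f"asymptote: {p / (1 - p):.2f}")

So beyond a handful of drafted tokens the extra speculation mostly just burns compute, which matches the intuition above.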
arbv@reddit
Gemma 4 122B when?
llama-impersonator@reddit
yes google, please cough up this model
dtdisapointingresult@reddit
Anyone an expert on the effect of quantization on MTP layers? Is this like vision models where you gotta make sure you run the MTP layers unquantized? My Gemma 4 31B will be the 4-bit AWQ.
Healthy-Nebula-3603@reddit
That MTP fp16 model for Gemma 4 31B is only 930 MB... it's small
ready_or_not_3434@reddit
Official draft models are great for latency, but loading both the base and drafter usually kills the VRAM budget on consumer cards. Definitely waiting to see some real world t/s numbers once llama.cpp supports this pipeline.
Healthy-Nebula-3603@reddit
You know that the MTP model is only 940 MB for Gemma 4 31B?
annodomini@reddit
So I've been using E4B and E2B as draft models for 31B already, and it's worked pretty well. Will be interested to try this to see how it compares.
I'm wondering, though; has anyone run evals to see how draft models affect the performance of a model? Since the draft model is the one producing tokens, which the main model is merely accepting or rejecting, I wonder if it would influence the quality of results. I could see it going either way; in some ways, you might get slightly better results as the draft tokens are effectively agreed on by two different models.
But on the other hand, it might reduce variety; it might be that there's some next tokens that the main model could produce, but the draft model never would, and while the draft model produces tokens that the main model finds acceptable, it might miss some possibilities.
What evals have been done comparing performance on tasks with speculative decoding, not just raw tokens/sec?
Healthy-Nebula-3603@reddit
Quality should be exactly the same.
Current draft models for Gemma 4 are big and their token acceptance rate is quite low.
The MTP model (each version is designed for one specific base model) is much smaller. For Gemma 4 31B it's only 930 MB at fp16, with token acceptance over 70%.
Guilty_Rooster_6708@reddit
Do I still get the benefit of MTP if I already partially offload the main model to my CPU?
CombinationKitchen76@reddit
Based on what I've read it is more compute demanding (we have a lot of that) and less bandwidth demanding (we don't have a lot of that). So yeah, it seems like a win-win especially for the VRAM poor
oShievy@reddit
If this were to become a norm, I imagine strix halo and similar devices that are bandwidth bound would be much more attractive
FluoroquinolonesKill@reddit
Hoping someone can clear this up. I thought speculative decoding was only useful if you could load the entire main model into VRAM. Happy to be corrected.
earslap@reddit
No, I don't see the connection. The speculative model in classical speculative decoding is just a separate model with a lot fewer parameters. You run it instead of the main model (a lot faster) for a few tokens and run the predictions/draft past the larger model. It takes almost the same time for the larger model to check the multiple tokens of the draft as it does to generate a single token itself. If the draft is accepted, you got those tokens for almost free. And you even get a free extra token from the large model at the end. If the speculative model's acceptance rate is high, you will almost always get a speed benefit.
First_Ad6432@reddit
Never tried MTP, but I think if it's designed as a draft model the output will be faster, whereas if it's a generic small LLM used as a draft model you will pay the price for running 2 LLMs (output will be slightly faster, but resource usage will go brrr)
marscarsrars@reddit
Now all we need is someone to Claude-distill this version.
Version 4.6 opus only kindly thank you very much.
2Norn@reddit
those are placebo my dude
wren6991@reddit
Community distills perform worse than originals.
Also there is no such thing as an Opus 4.6 distill, since there is no public Opus 4.6 CoT data to actually distill from. Anthropic treat that material as a trade secret and it doesn't leave their datacentres. The "distills" are trained on the summarised CoT available from the public API, which has been laundered through a smaller model to remove the CoT structure. This isn't a conspiracy, it's in Anthropic's API documentation.
chimph@reddit
I'm a bit confused. So this is speculative decoding where a separate drafting model (MTP) is used, but it's not supported by llama.cpp even though llama.cpp supports speculative decoding.. 🤔
nickm_27@reddit
Instead of an entirely different model (like using GemmaE2B for Gemma31B) it uses tensors built into the primary model to do speculative decoding.
There are many types of speculative decoding:
ngram-mod (based on text repetition), draft (uses a separate full LLM model), EAGLE (uses a separate draft model created from the main model), and then MTP (Multi Token Prediction). The advantage of MTP, once it is implemented in llama.cpp, will be that it is all distributed in a single GGUF, not separately, and it should hopefully have lower VRAM requirements compared to the other options (besides ngram).
shokuninstudio@reddit
When the GGUF comes, will it work automatically in current llama.cpp? If so, do we need to add extra flags?
rerri@reddit (OP)
Current release version of llama.cpp does not yet have MTP support. It is being worked on.
Fluxing_Capacitor@reddit
Pretty sure you can still use this model for normal speculative decoding in the meantime, no?
BillDStrong@reddit
And the current work is on the Qwen 3 MTP support, so 3.5,3.6, Coder-Next.
Each model family is going to need a bring up step.
ieatdownvotes4food@reddit
you can already get mtp working great with the qwen models and vllm.
benchmark tune + --speculative-config method:qwen3_next_mtp, num_speculative_tokens:2
really good prediction acceptance
BillDStrong@reddit
Okay, but this subtopic is about llama.cpp support for it.
I can't use vLLM, my P40 is not supported, so this isn't useful for me at all. Llama.cpp is, potentially, useful, until they finally move to CUDA 7+ only.
ieatdownvotes4food@reddit
oooh my bad. I just saw MTP excitement. It'll end up in llama.cpp soon, I'm sure!
BillDStrong@reddit
It's in testing now, for Qwen 3.5/3.6.
No worries.
CroquetteLauncher@reddit
How does it compare with dflash speculators ? https://huggingface.co/RedHatAI/gemma-4-31B-it-speculator.dflash or https://huggingface.co/z-lab/gemma-4-31B-it-DFlash
Maximum-Fact-5832@reddit
If I understand correctly, those speed up PP (prompt processing), while this speeds up decode (tokens per second). That makes the DFlash models useful for DGX Spark and Mac, which struggle with PP, and this useful for those as well as GPUs (at least GPUs that can run this model; I'm not familiar beyond Nvidia). That's my understanding anyway.
dtdisapointingresult@reddit
Lo and behold! We are come again!
MaruluVR@reddit
What are the odds we could use the E2B draft model as a tiny STT model exclusively
cnmoro@reddit
I like the way you think
_-_David@reddit
I would get excited but I have never had Gemma 4 stop repeating outputs.
No_Swimming6548@reddit
Will this work with partial offloads too?
rz2000@reddit
The 31B model @ bf16 is my favorite model for chat among anything I can run using up to 170GB of memory. It's so efficient at getting to the point that it barely matters that it only outputs about 10 tok/second. If speculative decoding accelerates that, it will be even better.
akavel@reddit
I wonder why they worked with Ollama to support it, but not with llama.cpp?
dryadofelysium@reddit
https://github.com/google-ai-edge/LiteRT-LM 0.11 has Gemma 4 MTP support and added Windows native support today
xanduonc@reddit
Yay! Google delivers
mortenmoulder@reddit
Tbh Google is pretty damn cool for releasing this. Can't wait to try it!
Fine_Nectarine9328@reddit
Can someone tell me what this is in an easy way please? And second, llama.cpp officially doesn't support turboquant but there is an unofficial fork on GitHub by someone named Tom; how do I install that, or does vLLM support turboquant? Please someone clear up these two doubts, and please don't downvote, my karma is low.
SQrQveren@reddit
It's this repo: https://github.com/TheTom/llama-cpp-turboquant. You install/compile it just like you would the original llama.cpp.
ChatGPT can easily explain in detail how to do it.
When you have done so, you need to find models that are turbo'ed and the right parameters for them.
I have found one good example that suits me quite fine, though I have only tested it for a few hours; /u/drepublic made a really cool post you can use as a base: https://www.reddit.com/r/LocalLLM/comments/1sz7ih3/qwen359b_running_on_8gb_vram_is_insane/oizsauk/
In the same thread he gives a link to the model to download; it really can't be any easier than this.
First_Ad6432@reddit
MTP: Multi Token Prediction, will make your output faster
Turboquant isn't that much better than what we have today, I think, so you can forget it.
Weak-Shelter-1698@reddit
W Gemma team.
nunodonato@reddit
when gguf
Look_0ver_There@reddit
Fairly easy to create your own Q8_0 near-lossless GGUF just by following the convert_hf_to_gguf.py instructions from llama.cpp if you're impatient.
popoppypoppylovelove@reddit
I wanted to try this to see if it works as a separate draft model, but it doesn't convert because the architecture is unknown:
ERROR:hf-to-gguf:Model Gemma4AssistantForCausalLM is not supported
Look_0ver_There@reddit
Just pulled fresh and tried it myself, even upgrading to latest transformers and everything, and you are indeed correct. How weird that we can convert the full model, but not the MTP model!!
VoiceApprehensive893@reddit
Now we need audio input for 26B, since it's E4B but uses more RAM and is 5 times better.
Blues520@reddit
Is there any accuracy tradeoffs when using MTP?
Is it like quantization where you sacrifice accuracy for performance?
rerri@reddit (OP)
No, accuracy remains 100%. The main model checks every token that the MTP model generates and corrects when needed.
ThrowawayProgress99@reddit
How does this work with offloading, do both models need to be fully on GPU? What about kv cache, can that be on RAM? My current config is to override all ffn_down tensors. Also does this work with the (on RAM) mmproj for vision?
Character_Split4906@reddit
From what I understand, llama.cpp has limitations on using a draft model with an mmproj model due to how the KV cache is shared with the main model. Will MTP support help with running mmproj and a draft model in parallel?
jacek2023@reddit
Looks like my love for Gemma 4 will continue
zipzak@reddit
will this work with fine-tuned variants of gemma4?
finevelyn@reddit
I love Google. I also hate Google.
msp26@reddit
I take back everything bad I ever said about google
No-Falcon-8135@reddit
Mlx quant version possible?
Intelligent_Ice_113@reddit
I believe it's possible to convert it with mlx-vlm (or mlx-lm? not sure)
Iory1998@reddit
GGUF when?
Healthy-Nebula-3603@reddit
The MTP model for Gemma 4 31B is only 930 MB 😍
Healthy-Nebula-3603@reddit
Nice but not working under llamacpp yet
dai_app@reddit
Where can I find the gguf?
Potential_Block4598@reddit
Imagine Qwen3.5 9B running on 4.5GB with GPT-4 performance on an iPhone
Whoa!
Eyelbee@reddit
Does this come with a slight degradation in quality?
ResidentPositive4122@reddit
No, the small model proposes tokens and those proposals are verified by the big model. If they don't match with the top prediction, they get discarded and a new token is being generated the normal way.
There might be implementation bugs, or batch related inaccuracies here and there, but in theory the quality should be identical to the big model, just faster.
Deep-Vermicelli-4591@reddit
none
Intelligent_Ice_113@reddit
does LM studio support mlx draft models?
Comrade_Vodkin@reddit
Awesome!