Gemma 4 MTP released
Posted by rerri@reddit | LocalLLaMA | View on Reddit | 258 comments
https://huggingface.co/google/gemma-4-31B-it-assistant
This model card is for the Multi-Token Prediction (MTP) drafters for the Gemma 4 models. MTP is implemented by extending the base model with a smaller, faster draft model. When used in a Speculative Decoding pipeline, the draft model predicts several tokens ahead, which the target model then verifies in parallel. This results in significant decoding speedups (up to 2x) while guaranteeing the exact same quality as standard generation, making these checkpoints perfect for low-latency and on-device applications.
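For readers new to the technique, here is a minimal sketch of the draft-then-verify loop the card describes. The draft_model / target_model callables and the greedy acceptance rule are illustrative stand-ins, not the actual Gemma 4 implementation.

    # Toy speculative decoding step: the small draft model proposes k tokens,
    # the large target model scores them all in one parallel forward pass and
    # keeps the longest matching prefix. draft_model / target_model are
    # hypothetical objects assumed to expose the methods used below.
    def speculative_step(target_model, draft_model, tokens, k=3):
        # 1. Draft k tokens autoregressively with the cheap model.
        draft, ctx = [], list(tokens)
        for _ in range(k):
            t = draft_model.next_token(ctx)      # fast, small forward pass
            draft.append(t)
            ctx.append(t)

        # 2. One forward pass of the big model over tokens + draft yields its
        #    greedy choice after each prefix: k + 1 predictions in total.
        verified = target_model.greedy_tokens(tokens, draft)

        # 3. Accept drafted tokens while they match; at the first mismatch keep
        #    the target's own token instead. If everything matched, the
        #    (k+1)-th prediction is a free bonus token.
        out = list(tokens)
        for i, d in enumerate(draft):
            if d == verified[i]:
                out.append(d)
            else:
                out.append(verified[i])
                return out
        out.append(verified[k])
        return out

Because every token that survives is one the target itself would have produced, output quality matches standard generation; only the speed changes.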
inthesearchof@reddit
With the Gemma 4 fixes and updates, Gemma 4 and Qwen 3.6 are trading blows.
rerri@reddit (OP)
Depends on the use case too. Gemma 4 31B is vastly better at writing Finnish than Qwen 27B.
BusRevolutionary9893@reddit
LoL, quite the flex.
SirBardBarston@reddit
Wen Unsloth?
hackerllama@reddit
Enjoy!
dampflokfreund@reddit
Awesome, thank you! Right in time for llama.cpp support.
hackerllama@reddit
Yes, excited for it to land!
In the meantime, we're landing transformers, Ollama, VLLM, SGLang, and MLX support.
Virtamancer@reddit
What did you mean about MLX? My use case is entirely macOS so if I can get MTP support in MLX (or maybe even in MLX-vllm) that would be HUGE.
hackerllama@reddit
https://github.com/Blaizzy/mlx-vlm/pull/1112
https://huggingface.co/collections/mlx-community/gemma-4-assistant-mtp
boutell@reddit
That PR has been merged. But so far I'm getting an error trying to use the draft model with up to date MLX via pip in a fresh venv. Have you had any luck?
fatboy93@reddit
clone the repo, cd into it, and do a pip install --force .
boutell@reddit
Thanks! Got much farther along, but wound up opening a ticket. Of course it's possible I'm still doing something wrong:
https://github.com/Blaizzy/mlx-vlm/issues/1122
illusionmist@reddit
Nice!
DigiDecode_@reddit
The Gemma 4 implementation seems quite a bit different from Qwen 3.6's MTP, i.e. shared KV cache, activations from the target shared with the drafter, clustering of the embedding for the drafter
so my guess is it will need its own implementation, and is likely to take more time to be supported by llama.cpp unless the work is already done
mythikal03@reddit
I built from the open vLLM PR today (#41745 — [Spec Decode]) and wanted to share some results from my RTX 6000. This is bonkers.
FIdelity88@reddit
Where did you get that benchmark script? If it’s yours; care to share on GitHub?
mythikal03@reddit
You bet, it's just a shell script I've been adding to over time. Enjoy: https://gist.github.com/mythikal03/57ec60665fa41b23c43fb904a25af4e0
Intelligent-Lynx-953@reddit
The memory overhead here is almost trivially small. 930MB for the 31B drafter, 78M for the E2B. For anyone already running these models with a few GB of headroom, it's basically free throughput.
The real question is whether that 2x speedup holds when you're doing partial CPU offload. Speculative decoding needs the draft and verify steps to run fast on the same device, and if your main model is split across GPU and CPU, the verify bottleneck on the slower path could eat into the gains. Would be curious to see benchmarks from a mixed offload setup.
Rikers88@reddit
This is super cool! I get that this is a super specialized model for Gemma, but isn't there already the option in llama.cpp to plug in a drafter? It only works well about 66% of the time if you don't fine-tune it like the Gemma people did. Am I wrong?
Thanks for sharing!!
Daemontatox@reddit
How are you people running it? vLLM says multimodal MTP is not supported yet and llama.cpp still has a pending PR
MaartenGr@reddit
For those interested in how they work, I updated my visual guide with some snippets here and there: https://newsletter.maartengrootendorst.com/i/193064129/multi-token-prediction-mtp-with-gemma-4
zzzzlugg@reddit
Thanks for the nice write up. I'm curious what you are using to make the diagrams? They're nice and crisp.
DigiDecode_@reddit
it seems to be a verbatim copy from https://x.com/googlegemma/status/2051694045869879749 with no link to the original source
djdanlib@reddit
Isn't it the same author though?
DigiDecode_@reddit
the link I posted is from official GoogleGemma account on X
djdanlib@reddit
What I'm saying is, the person you're replying to looks like the actual author of the paper, and appears to work on that specific team at Google.
wren6991@reddit
One thing I find weird about the Gemma 4 family: why does the 31B not use PLE? I wouldn't mind having an extra 30 GB on disk if it meant better model performance for the same VRAM, bandwidth and compute.
log_2@reddit
How were gradients backpropagated through the clustering head?
superdariom@reddit
Really helpful
Champignac1@reddit
Great read, easy to understand even for a non native speaker !
IrisColt@reddit
Thanks!!
cleversmoke@reddit
Beautiful write up. Thank you!
portmanteaudition@reddit
Great guide
APFrisco@reddit
Such a great write up, thanks! I’ll be coming back to this one often
keepthepace@reddit
Thanks, that's a great write-up. I had never really understood the per-layer embeddings correctly.
I am pretty sure right now that some people are working hard at combining it with engrams and I can't wait to see what's happening there!
marscarsrars@reddit
Genius.
getpodapp@reddit
Great write up
JoNike@reddit
Great writing, great explanations!
hackerllama@reddit
This is the way!
Craftkorb@reddit
The E2B model had a 78M draft model - Cuuute!
No_Afternoon_4260@reddit
Can someone explain to me how MTP is different from speculative decoding?
No-Refrigerator-1672@reddit
In the case of Gemma 4 it isn't; they published speculative decoding drafters. In the case of Qwen 3.5 and Next, MTP is done as a secondary output layer that looks into the internal states of the model.
No_Afternoon_4260@reddit
That is my feeling, thanks !
arbv@reddit
UwU tensor
NineThreeTilNow@reddit
I think some people think you need hundreds of millions or a billion parameters in models to do useful stuff.
Some of the heaviest lifting done by Gemma is within the vocabulary Google built. The tokenizer is extremely well trained, which is how the model ends up performing so well pound for pound against other models.
Someone at Google questioned the first principles of scaling. Parameters for the sake of parameters doesn't make sense if you have hardware to train an amazing tokenizer. It was the original Qwen 500m? model that demonstrated the strength of it. I think that model uses like 300m of those parameters for the tokenizer and only 200m for the weights of the model.
Gemma 4 uses a 262k-entry vocabulary, versus Llama's 32k in version 2 and 128k in version 3.
I think DeepSeek v4 should have used a larger tokenizer but they stuck with the 128k.
That little draft model is borrowing heavily from their tokenizer, which is something like ~3B parameters.
DistanceSolar1449@reddit
Meanwhile o200k
First_Ad6432@reddit
look at this tiny little safetensor, so small XD
kingo86@reddit
*squeals*
GirlNumber20@reddit
I have found my people. 🤗
Acceptable_Home_@reddit
He is so small he only needs one popcorn 🥹🥹
OuterKey@reddit
Surprisingly small draft model
Queasy-Contract9753@reddit
I need to clear space on my phone and try this out. It's only 6 GB, so it'll fit.
No-Upstairs-4031@reddit
Is this for real? When did Google get so generous?
cass1o@reddit
Eh, when they pioneered the entire modern LLM field.
br33213@reddit
Since DeepMind took over (they always were generous: see AlphaFold, WeatherNext, the research from AlphaGo, ...). Google itself never was, but Hassabis made a deal so they don't constantly have to struggle for funds.
kvothe5688@reddit
Saying Google was never generous is misleading. Google has always published lots of research, is one of the biggest contributors to the Linux kernel, and gave us Kubernetes, Angular, and Go, plus lots of health-related research and flood and wildfire warning systems. Google does not usually hoard research, unlike most other tech companies.
combrade@reddit
The easiest counterexample to Google is Amazon. They literally have entire cloud products, like their managed versions of Airflow and Kubernetes, that are just open source projects repackaged. So many other AWS products do the same.
Warrenio@reddit
Google pretty much started the AI boom by publishing the "Attention Is All You Need" paper.
DistanceSolar1449@reddit
To be fair, everyone inside and outside Google recognizes now that if they had known what they had, they wouldn't have published it publicly. They just had no clue how big a deal it was going to become.
BoobooSmash31337@reddit
Isn't Go literally named after Google? I think the creator at least worked there. Maybe it's a coincidence.
ninjasaid13@reddit
Didn't they delay research by months?
draconic_tongue@reddit
I don't think all of their AI stuff is through DeepMind. https://huggingface.co/google: T5, SigLIP, BERT, the transformers paper...
br33213@reddit
Fair point, good correction.
hackerllama@reddit
The team is cooking!
arbv@reddit
Gemmas are the most balanced models one can run locally, and probably the best ones for non-English speakers, second only to Google's own cloud models.
Now I am hoping for a rumored Gemma 4 122B AxB (I hope it wasn't too good to be shelved - someone has to dethrone GPT-OSS 120B), and a QAT release series (like it was for Gemma 3).
quickreactor@reddit
long may it continue
dampflokfreund@reddit
I'm very grateful for what you have released, Gemma 4 is awesome. However I do hope you will keep the momentum up! Gemma 4.1 with even better quality, QAT, more reliable tool calling/agentic would be amazing.
Altruistic_Heat_9531@reddit
They always were. They are pretty much THE heavy lifter on LLM and other OSS.
Transformer, MoE, Instruct model, BERT, all from them.
DigiDecode_@reddit
maybe sponsored & paid by Apple for Apple Intelligence
andybrohol@reddit
Hassabis mentioned that they wanted to open source to help academia, and that if they put the models on device they're already exposed, so why not just open source them.
GreenGreasyGreasels@reddit
That made me realize there will be no Gemma4-124B :(
First_Ad6432@reddit
They are... probably addicted to AI ;v
MoneyPowerNexis@reddit
my qwen 27b Q8 results with ~1k tokens generated / 250k context limit:
A6000 RTX
2x A6000 --split-mode tensor
Very Nice
erisian2342@reddit
That blog post was pretty great right up until the last line about the “Gemmaverse”. What lazy marketing moron comes up with this crap? Anything *verse lost all meaning and value when the Metaverse was vomited into the world.
everyoneisodd@reddit
I understand this is a great speed boost for local inference on llama.cpp. Wanted to understand if there is any benefit on inference engines like vLLM? I am under the impression from the previous speculative decoding conversations that it doesn't matter much for inference engines. Please correct me if I am wrong.
rerri@reddit (OP)
MTP definitely matters for inference engines and should work well in vLLM. A fellow redditor commented two days ago that for Qwen 3.6 27B they are getting about 2x tg speed:
https://www.reddit.com/r/LocalLLaMA/comments/1t3guzw/comment/ojvbi9l/
Dry-Reveal4114@reddit
Has anyone tested these yet with quantized Gemma 4 models? Wondering how much of the speedup remains after quantization.
rm-rf-rm@reddit
I'm really doubtful/fearful that, given the pitiful state of benchmarks at actually measuring intelligence, these engineering improvements narrowly focused on speed/latency may be causing quality regressions that go unmeasured.
soldture@reddit
I'm still wondering why diffusion‑based LLMs like Mercury 2 are not widely adopted. Mercury is so fast
LetsGoBrandon4256@reddit
Sounds awesome. What's the catch though?
coder543@reddit
If you need energy efficiency more than you need speed, drafting is probably the wrong choice. It trades extra compute for speed, and a lot of the drafted tokens will be rejected as nothing more than wasted energy.
You also need more memory.
Those are the only tradeoffs.
DigiDecode_@reddit
with a low acceptance rate it's all overhead and no benefit
coder543@reddit
If the acceptance rate for MTP is low, then the MTP is broken.
DigiDecode_@reddit
yeah, but acceptance rate depends on the context domain, i.e. coding might get high AR, whereas a foreign language that the drafter was not trained on will see low AR
coder543@reddit
MTP is trained on everything the model is trained on. That's what I'm saying. If the MTP doesn't know that language, neither does the main model.
rerri@reddit (OP)
There's a small catch: Slightly higher memory requirements.
BitGreen1270@reddit
How much higher? My gemma4-26B apex model is about 21gb. How much memory will MTP take?
2Norn@reddit
Not a lot; the 26B assistant is only ~400M parameters I believe, so it can't be more than 1.5 GB at most.
rerri@reddit (OP)
The MTP model for Gemma 4 26B is ~800 MB, but the llama.cpp implementation will most likely require some more on top of that though. Hard to say how much.
nickm_27@reddit
That is the safetensors size; I think llama.cpp uses Q8_0 for MTP?
I had Gemini read the PR and guess what the extra VRAM usage would be and this is what it gave
TheTerrasque@reddit
the qwen3.6 27b model apparently takes roughly 3gb extra at runtime
IrisColt@reddit
Is there a Qwen 3.6 27B MTP model already supported by llama.cpp?
TheTerrasque@reddit
The creator of the PR made a model, and some have grafted the mtp part onto other quant models and got it working.
BannedGoNext@reddit
You forgot that if it gets 3 speculations wrong in a row it summons Beetlejuice, but that's really a small price to pay.
IrisColt@reddit
heh, but true
Silver-Champion-4846@reddit
Like, mashed insects?
legos_on_the_brain@reddit
Beetlejuice, Beetlejuice, Beetlejuice!
dtdisapointingresult@reddit
There's none. All speculative decoding[*] is a free lunch, a guaranteed 30%-100% generation speed boost; I'm not sure why it's not the first thing recommended to people. This is assuming you don't get greedy and configure a higher number of predicted tokens than your hardware can crunch in one pass without slowing down the overall generation. Just do some experiments with the number of tokens once, and find your sweet spot. Use a real prompt, like the same coding task, or the same essay request.
[*] MTP, Eagle3, and 'separate small draft model'
Double_Cause4609@reddit
Well, the logic of speculative decoding is that you already paid the catch.
Basically, autoregressive models (like most LLMs) which predict the next token are really wasteful. They use a ton of memory bandwidth, but not really a lot of compute.
Modern processors are generally rich in compute, but low in bandwidth.
What this means is that if you're running a single user context (self hosting a chatbot, etc), you generally are massively under-utilizing your hardware.
All multi token prediction and speculative decoding do is move you from a memory bound scenario to a compute bound one, and give you some extra token predictions along the way.
For reference, Diffusion language models are already compute bound and so do not need this process, and that emphasis on compute is how they derive their massive speedups compared to autoregressive baselines.
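Rough, illustrative arithmetic for that point (every number below is an assumption, not a measurement):

    # Single-user decode streams the full set of weights from memory once per
    # forward pass, so the token rate is roughly bandwidth / model size no
    # matter how much compute sits idle.
    weights_gb = 18.0       # e.g. a ~31B model at ~4.5 bits/weight (assumption)
    bandwidth_gbps = 900.0  # memory bandwidth in GB/s (assumption)

    t_pass = weights_gb / bandwidth_gbps
    print(f"plain decode: ~{1 / t_pass:.0f} tok/s (memory bound)")

    # Speculative decoding / MTP verifies several drafted tokens in one pass
    # for roughly the same weight traffic; the extra work is compute, which
    # was idle anyway. Assume each verify pass yields ~2.5 tokens on average
    # (acceptance dependent) and drafting adds ~15% overhead.
    tokens_per_pass = 2.5
    overhead = 1.15
    print(f"speculative:  ~{tokens_per_pass / (t_pass * overhead):.0f} tok/s")

With those made-up numbers the decode rate roughly doubles, which is in the same ballpark as what people are reporting.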
shroddy@reddit
When benchmarking prompt processing, is that how fast a GPU or CPU would be when purely compute bound?
Double_Cause4609@reddit
Prompt processing is fundamentally compute bound, but it can get a bit nuanced with MoE models.
For dense models it's pretty simple. You're essentially running a batch of activations through each weight tensor. In fact, arguably, you can even load each individual layer to a GPU (or even individual tensors!), run the forward for that loaded block of weights, and then load the next layer. LCPP doesn't do this for example, but Krasis does to my memory.
For MoE models it gets a bit complicated because there's an expert co-occurrence factor that you have to account for. With low expert overlap, CPUs in particular slow down a lot more than you'd expect for prompt processing (they can look bandwidth bound here), but with high co-occurrence they're compute bound, just like dense models.
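To make the co-occurrence point concrete, here is a toy calculation assuming uniform routing (real routers are not uniform, so this only shows the shape of the effect):

    # With E experts and a active per token, a batch of B tokens touches on
    # average E * (1 - (1 - a/E)**B) distinct experts, so the weight traffic
    # per expert is only amortized once batches get fairly large.
    E, a = 256, 8
    for B in (1, 4, 16, 64, 256):
        distinct = E * (1 - (1 - a / E) ** B)
        print(f"batch {B:3d}: ~{distinct:5.0f} experts loaded, "
              f"~{B * a / distinct:4.1f} tokens per loaded expert")

Small batches load a nearly fresh set of experts for every token, which is why prompt processing on a bandwidth-limited CPU can still look bandwidth bound for MoE until the batch is large.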
Healthy-Nebula-3603@reddit
What?
You know why Mac Pro computers are slow with LLMs in spite of very fast memory (800 GB/s)?
Their CPU is too slow to produce output fast.
LLMs need not only fast RAM but also a fast CPU (or GPU).
For instance, underclocking my RTX 3090 by 50% without touching VRAM, I'm losing almost 30% token generation speed.
Double_Cause4609@reddit
What do you mean?
Macs have very fast CPUs overall. I think you're mixing something up.
Macs have relatively low compute compared to GPUs or other solutions, so at long context they can be a bit slow, but in situations where you're bandwidth bound (low context LLM inference for example) Macs are roughly as fast as their bandwidth would indicate.
Now, if you're talking about real-world inference, like booting up LlamaCPP and comparing speeds, the GPU may have a speed advantage because so much work was put into the CUDA backend or something, there can be *software* overhead, but this is more down to inefficient software than a fundamental statement on the scaling properties of LLMs on hardware.
I'm very confused as to where you're getting your ideas from, and I'd have to see the numbers you're looking at because what you're describing isn't something I've seen personally.
Healthy-Nebula-3603@reddit
DID you read my post to the end?
Freonr2@reddit
Besides what others pointed out on higher mem use, speculative decoding uses compute that could be otherwise used to increase concurrency. Won't matter if you are not using concurrency ofc.
Intelligent_Ice_113@reddit
I'm sure there must be one! LLM addiction or bill for electricity.
This just cannot be true.
dampflokfreund@reddit
For speculative decoding, the draft model usually uses quite a lot of memory. I haven't had any luck on my laptop with it. Hopefully for MTP it is different.
Top_Break1374@reddit
How do I run it?
coder543@reddit
llama.cpp does not have MTP support yet, so that rules out a lot of people for now. Maybe soon.
tarruda@reddit
https://github.com/ggml-org/llama.cpp/pull/22673 looks like very soon
audioen@reddit
Built the pr, testing it on Vulkan. The Q8_0 GGUF provides around 21 tokens/s early on in the context on a Strix Halo. I'm using spec-draft-n-max = 3 and it seems like it always generates maximum length drafts because the numbers are 1:1 the same. This is a little disappointing to me -- I assumed that the draft model predicts probabilities and so the regular speculative decoding confidence could produce variable length drafts according to the speculation head's probabilities, but evidently it either works differently or this is a minor oversight that will be corrected soon.
Other limitations: only parallel=1 works, meaning no multiple streams decoding in parallel. This is hopefully going to be next item on the list to fix.
But I don't really care to complain. I'm elated. This is easily double the performance I'm used to getting, and I was already willing to wait for 27b's results because they are that good. Much less waiting now, so that's incredibly good.
Excellent work from the llama.cpp team, especially am17an. Thank you for the solid work and the biggest performance gain I've ever seen on this software.
ricesteam@reddit
Assuming I downloaded the right GGUF, do I just run it normally or do I need some specific flags?
IrisColt@reddit
Any answer on this?
nickm_27@reddit
-spec-type mtp --spec-draft-n-max 3
IrisColt@reddit
Thanks!!!
nickm_27@reddit
-spec-type mtp --spec-draft-n-max 3
coder543@reddit
This PR does not appear to implement p-min for MTP.
tarruda@reddit
Nice to know. I currently get around 16 tokens/second on 3.6 27b with a M1 ultra and hopefully this will bring me close to 30 tokens/second
Zeeplankton@reddit
idk how llamacpp maintainers don't go insane trying to support every new feature
philmarcracken@reddit
oh it's not them slowly going nuts
keepthepace@reddit
They were insane to start with!
(jk, we love you!)
Top-Rub-4670@reddit
ggerganov seems like a very pragmatic leader.
Thank god for that! A lesser man would have allowed llama.cpp to devolve and we'd probably need docker + npm + python + rust to run it and a 28-step process to build/bundle it.
But nope, he stayed true to the mission. A powerful yet portable program. It doesn't try to be everything, it just tries to be a building block. The pillar on which the entire local inference community is built, really.
jld1532@reddit
So in theory, once this is implemented we can use this for any model sets that have the same general architecture? Looks like he was using qwen 3.5 0.8B with the larger 3.6 models.
Public_Umpire_1099@reddit
Warning, info dump essay incoming
Yeah, basically. The key requirement is that the draft model and target model share the same tokenizer, since the draft has to produce token IDs that the target understands. Same model family is the easiest way to guarantee that — Qwen draft + Qwen target works, Gemma draft + Gemma target works, but Qwen draft + Llama target won't because the vocabularies differ.
Quick clarification on terms because I went down this rabbit hole myself recently--
Speculative decoding is the general technique: a small fast "draft" model proposes N tokens, and the big "target" model verifies them all in one parallel pass. If the draft was right, you accept those tokens at the speed of one forward pass instead of N. It's already in llama.cpp via --model-draft. Basically, the small model is writing the essay FAST and shows the large model its homework so they can cheat on the exam. For each part, the large model either says "yep, that's what I would've written" and keeps it, or "nope" and writes the rest itself starting from the rejection point. The large model does this, then turns it in to the teacher (you). The end result the teacher sees is that the large model turned in a pretty good exam, and did it faster than it usually would have. It was slower than what the small model alone could do, but far more accurate and informative, because it only kept the parts that made sense. The large model balances speed, efficiency, and accuracy.
MTP (multi-token prediction) in the strict sense is an architectural feature where the model has multiple prediction heads built in (DeepSeek V3 popularized this). Google's recent Gemma 4 announcement uses "MTP" loosely — what they actually released is small drafter models for classical speculative decoding, not built-in heads.
On the Gemma E2B/E4B side: those are dense models, not MoE. The "E" stands for Edge (Google's edge model family). E2B is ~2B params, E4B is ~4B params, all parameters active per token. These should be straightforward speculative decoding targets. It is really important that they release these, because everyone has been waiting on it. They teased it a few weeks ago when they showed benchmarks using this "MTP" method, and a lot of people found themselves a bit disappointed at the speed.
One important thing I discovered:
On Qwen3.6-35B-A3B: it's MoE — 35B total params, ~3B active per token. The router selects 8 of 256 experts per token. Speculative decoding still works on MoE, but the gain is somewhat smaller than on dense models. When the target verifies N draft tokens in parallel, those tokens may route to different experts, so the weight-load amortization that makes spec decoding fast is partial rather than complete.
For the smaller-Qwen-as-drafter idea (Qwen3.5-0.8B drafting for 3.6-35B-A3B): tokenizer compatibility is the first thing to check. If those two share vocabulary, it should work. Acceptance rate will determine actual speedup — could be anywhere from 1.2x to 2x depending on how well the small model predicts the big one's distribution. Theory says spec decoding on MoE just trims the win because parallel verification doesn't amortize as cleanly. In practice on my hardware, it was a 3× regression (10.5 tok/s vs 30 tok/s baseline) even with 100% acceptance rate using same-family Q4 target + Q2 draft. Your mileage really does depend on whether your hardware is compute-bound or bandwidth-bound.
BUT!! Here are the big caveats: using a 0.8B as a drafter for a much better model is almost certainly going to give you only a very small increase, ~10-20%. For drafting to work, the small model needs to get a decent amount of the information correct, and 0.8B isn't really gonna cut it. Also, spec decoding is just a lot less efficient on MoE models. Unless Qwen releases a model specifically for drafting for their MoE model, or their 27B dense model, you might not find a huge jump. Or you could, I guess. The more I mess with this stuff the less I think I understand lol. Everything depends on whether your setup is compute bound or bandwidth bound. Once you know which you fall under, predicting gains becomes a lot easier.
If you want to test it: --model-draft is the flag. Watch acceptance rate in the server logs. If acceptance is high but your tok/s is lower than target-alone, you've hit the same wall I did.
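A quick way to sanity-check the tokenizer-compatibility point before downloading anything big is something like this (the repo ids are placeholders, substitute whichever pair you're considering):

    # Compare draft and target vocabularies: for speculative decoding the
    # draft's token ids must line up with the target's.
    from transformers import AutoTokenizer

    draft_tok = AutoTokenizer.from_pretrained("org/draft-model")    # placeholder
    target_tok = AutoTokenizer.from_pretrained("org/target-model")  # placeholder

    dv, tv = draft_tok.get_vocab(), target_tok.get_vocab()
    mismatched = [tok for tok, idx in dv.items() if tv.get(tok) != idx]
    print(f"draft vocab {len(dv)}, target vocab {len(tv)}, "
          f"tokens with differing ids: {len(mismatched)}")

If the mismatch count isn't zero (or at least confined to special tokens), the target can't directly verify the draft's proposals and the pairing won't work.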
ParadigmComplex@reddit
My low-confidence understanding is that this isn't typical MTP with additional built-in heads, but it also isn't classic speculative decoding with stand-alone draft models, either.
From the blog post:
which almost feels like slapping additional layers onto the model. It seems like it blurs the line between traditional MTP heads and traditional draft model.
BoobooSmash31337@reddit
E is "effective" afaik. E2B is 4B and E4B is 8B.
StardockEngineer@reddit
I'm using it, too! It works!
JsThiago5@reddit
I am using it and it's working.
tarruda@reddit
Using with Gemma 4?
UnWiseSageVibe@reddit
They had a merge yesterday to add it: https://github.com/ggml-org/llama.cpp/pull/22673
chindoza@reddit
This has not yet been merged as of writing.
coder543@reddit
🤔 yes...?
michaelsoft__binbows@reddit
I have the same question. these might be the precursor models that the quantizers will use to prepare us the AWQ/autoround/int4 stuff for use with vllm and GGUFs for use with llama (which is also getting MTP soon). looking forward to the coming goodies.
unique-moi@reddit
VLLM supports MTP with qwen3.6 models
michaelsoft__binbows@reddit
This is what I have been using yes. 120 tok/s or so on the 27B with a 5090. Really good perf.
praxis22@reddit
There is a configuration option in LMStudio; if you enable it, it gives you a file chooser.
IShitMyselfNow@reddit
Isn't that just speculative decoding?
LMStudio uses llama cpp behind the scenes so I'm a bit confused as to how they'd support something that lcpp doesn't :D
chimph@reddit
Ok, so I think what’s happening is that there will be models that have the MTP drafter built in but these Gemma drafters are separate models that target the Gemma 4 models. Therefore it is both speculative decoding and MTP.. just separated.
2Norn@reddit
exactly
chimph@reddit
It surely also means that you can’t run from lmstudio since that uses llama.cpp and that doesn’t support this specific implementation yet?
2Norn@reddit
https://github.com/ggml-org/llama.cpp/pull/22673
yeah kinda, so it's not readily available i suppose but we'll get it soon
RickyRickC137@reddit
It says no compatible model found in LMStudio. I am using GGUFs for the original model btw.
praxis22@reddit
There are four model links above, matching the four model sizes of Gemma 4. So if you have the Gemma 4 31B model installed, you would download the smaller model from the first link above. The video I posted in the link above shows you how to proceed from there.
grumpydad67@reddit
Total n00b here. I tried downloading one of the Gemma4 the assistant models from within LM Studio, but they don't show up in the model picker (yet). I assume this is normal?
praxis22@reddit
You can select a base model directory to download all models to, and it will scan that at startup. All models you download via LMStudio go there, and any you download from Hugging Face go there too.
I presume, though I have never tried, that you need to select the smaller model from the gear wheel attached to the model entry for the model you downloaded, in the interface as shown in the video above.
grumpydad67@reddit
Yeah, the problem is that you still need to download one of the drafting models, and those don't show up (yet) in LM Studio. Will keep trying!
OfficeNinja42@reddit
The list of supported prediction models seems to be hard-coded in LMStudio. One cannot use custom GGUFs, only a few combinations of older models. Hopefully one of the next releases fixes this.
jld1532@reddit
It's either hard-coded or functionality based. I was able to get it to work with Qwen 2.5 but nothing newer. Apparently vision capability may disable it?
helpmefindmycat@reddit
lm studio has had some issues regarding draft model to main model. I tried it early on, and found it was pretty good but something went awry. I think vllm supports speculative draft models in a more robust manner these days.
Top_Break1374@reddit
Where? I can't find any docs or any setting in my LMStudio.
praxis22@reddit
My PC has been offline for a while. However:
https://www.youtube.com/watch?v=eLdItqdMKK8
Top_Break1374@reddit
Thanks, found it myself already, but there is no GGUF of the draft models
jld1532@reddit
This is under speculative decoding, yes?
King0fFud@reddit
It'll be out for Ollama soon if you're running with MLX: https://github.com/ollama/ollama/releases/tag/v0.23.1-rc0
pmttyji@reddit
Same question here. ELI5 version please
florinandrei@reddit
You wait for the tools (llama.cpp, etc) to catch up with it, then you run it.
horribleGuy3115@reddit
Sigh! With llama.cpp there's no MTP yet; maybe fire up that vLLM script.
Qxz3@reddit
So can these be used as speculative decoding models in LM Studio?
tarruda@reddit
Not before https://github.com/ggml-org/llama.cpp/pull/22673 is merged
DigiDecode_@reddit
Gemma 4 MTP support will likely require more changes
2Norn@reddit
lm studio llamacpp version is like 2 weeks behind usually
marscarsrars@reddit
This is the way.
dero_name@reddit
This is the way.
Paradigmind@reddit
Is this the way?
genpfault@reddit
Token prediction failed, aborting decode.
imp_12189@reddit
Does anyone know da way?!
Silver-Champion-4846@reddit
Alex!
Specter_Origin@reddit
Way is this the?
Silver-Champion-4846@reddit
Do you know de way?
Don_Moahskarton@reddit
This is the way.
-JustAsking4AFriend@reddit
You mean, “This is the.. (MTP invoke) way”
Mother_Context_2446@reddit
Sweet! Does anyone know how to enable it with vLLM?
mythikal03@reddit
I had to pull in and build from vLLM PR #41745 that was opened for merge to main today.
Ok_Warning2146@reddit
Very good. Please also give us QAT version of 31B and 26BA3B.
Adventurous-Paper566@reddit
I assume it won't be compatible with the vision?
FerLuisxd@reddit
What about vram usage? How much did it increase?
Powerful_Evening5495@reddit
Reminds me of the Intel speed-up hack
boutell@reddit
All right, what am I doing wrong?
I created a Python venv and activated it
I did a fresh "pip install mlx mlx-lm"
I verified "which mlx_lm.generate" is coming from the venv:
Then I ran:
I got back:
...
... etc
Also noteworthy:
Thanks!
Eelz_@reddit
MTP support is currently in mlx-vlm, not mlx-lm. Also note, an updated version is not on PyPI so you will have to install from the main branch on GitHub.
https://github.com/Blaizzy/mlx-vlm/tree/main#gemma-4-mtp
boutell@reddit
Thank you
spac420@reddit
who's running it? what did you use?
3Xtrax@reddit
doesn't seem like there's a way to run it on LM Studio or llama.cpp atm :(
it should be a really promising speed up of 2-3x, will just have to wait and see.
WolpertingerRumo@reddit
ELI5, what’s MTP? I just can’t keep up with all the new slang every day.
ParadigmComplex@reddit
Lets say you're a super duper diligent student that loves doing homework and being ahead in class. You finish all your homework up early, then sit there bored. How could you get even more ahead? Well, if you can guess what the next homework assignment is, you can get started on it now. If you guess right, you're even more ahead! If you guess wrong, it didn't cost you anything, because you love doing homework.
Modern computers typically have a part that does math (the CPU, GPU, TPU, etc) and a part that remembers things (RAM/VRAM). What usually limits how fast an AI model can talk is the connection between these parts. The math part will finish the math very quickly then sit there for a while doing nothing but waiting for the remembering part to send it more numbers with which to do math. MTP ("Multi-Token Prediction") has the AI not only say things to the user, but also say guesses about future math the computer will have to do. The computer math part can then work on that guess when it's waiting for more information. If it's correct, the result is the AI can talk faster. If it's wrong, well, the math part wasn't doing anything productive during that time anyways.
It isn't always the right trick (e.g. competes with batching for compute headroom, better for dense rather than MoE models, requires additional memory, etc) but sometimes it can let AIs talk around twice as fast as they would otherwise on the same machine!
WolpertingerRumo@reddit
Interesting. So systems with low bandwidth but high compute and VRAM will profit most?
ParadigmComplex@reddit
The more drastic the low bandwidth vs high compute is, the more someone could potentially benefit from this. I suspect there may be diminishing returns as the likelihood of acceptance of the speculative tokens will reduce the farther out the model speculates, but I haven't seen either proofs or empirical testing to confirm this hunch yet.
Proportionally, the additional VRAM isn't that much. It's less about having a lot of VRAM than just not already having been right at the limit. If you can only just barely load a given model with context, this might be what pushes you over. But if you already had a bit of extra room left over, these additional layers might squeeze in there.
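The diminishing-returns hunch is easy to eyeball with a toy model where each drafted token is independently accepted with probability p (real acceptance isn't independent, so treat this as a sketch, not a proof):

    # Expected accepted tokens per verify pass: the accepted prefix length is
    # p + p^2 + ... + p^n, which saturates at p / (1 - p) no matter how far
    # ahead you draft.
    p = 0.7  # assumed per-token acceptance probability
    for n in (1, 2, 3, 5, 8, 16):
        expected = sum(p ** i for i in range(1, n + 1))
        print(f"draft {n:2d} ahead -> ~{expected:.2f} accepted per pass (+1 bonus)")
    print(f"asymptote: {p / (1 - p):.2f}")

So beyond a handful of drafted tokens the extra speculation mostly just burns compute, which matches the intuition above.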
arbv@reddit
Gemma 4 122B when?
llama-impersonator@reddit
yes google, please cough up this model
dtdisapointingresult@reddit
Anyone an expert on the effect of quantization on MTP layers? Is this like vision models where you gotta make sure you run the MTP layers unquantized? My Gemma 4 31B will be the 4-bit AWQ.
Healthy-Nebula-3603@reddit
That MTP fp16 model for Gemma 4 31B is only 930 MB... it's small
ready_or_not_3434@reddit
Official draft models are great for latency, but loading both the base and drafter usually kills the VRAM budget on consumer cards. Definitely waiting to see some real world t/s numbers once llama.cpp supports this pipeline.
Healthy-Nebula-3603@reddit
You know that the MTP model is only 940 MB for Gemma 4 31B?
annodomini@reddit
So I've been using E4B and E2B as draft models for 31B already, and it's worked pretty well. Will be interested to try this to see how it compares.
I'm wondering, though; has anyone run evals to see how draft models affect the performance of a model? Since the draft model is the one producing tokens, which the main model is merely accepting or rejecting, I wonder if it would influence the quality of results. I could see it going either way; in some ways, you might get slightly better results as the draft tokens are effectively agreed on by two different models.
But on the other hand, it might reduce variety; it might be that there's some next tokens that the main model could produce, but the draft model never would, and while the draft model produces tokens that the main model finds acceptable, it might miss some possibilities.
What evals have been done comparing performance on tasks with speculative decoding, not just raw tokens/sec?
Healthy-Nebula-3603@reddit
Quality should be exactly the same.
Current draft models for Gemma 4 are big and their token acceptance rate is quite low.
The MTP model (each version is designed for one specific base model) is much smaller. For Gemma 4 31B it's only 930 MB at fp16, with token acceptance over 70%.
Guilty_Rooster_6708@reddit
Do I still get the benefit of MTP if I already partially offload the main model to my CPU?
CombinationKitchen76@reddit
Based on what I've read it is more compute demanding (we have a lot of that) and less bandwidth demanding (we don't have a lot of that). So yeah, it seems like a win-win especially for the VRAM poor
oShievy@reddit
If this were to become a norm, I imagine strix halo and similar devices that are bandwidth bound would be much more attractive
FluoroquinolonesKill@reddit
Hoping someone can clear this up. I thought speculative decoding was only useful if you could load the entire main model into VRAM. Happy to be corrected.
earslap@reddit
No, I don't see the connection. The speculative model in classical speculative decoding is just a separate model with a lot fewer parameters. You run it instead of the main model (a lot faster) for a few tokens and run the predictions/draft past the larger model. It takes almost the same time for the larger model to check the multiple tokens of the draft as it does to generate a single token itself. If the draft is accepted, you got those tokens for almost free. And you even get a free extra token from the large model at the end. If the speculative model's acceptance rate is high, you will almost always get a speed benefit.
First_Ad6432@reddit
Never tried MTP, but I think if it's designed as a draft model the output will be faster, whereas if it's a generic small LLM used as a draft model you will pay the price for running 2 LLMs (output will be slightly faster, but resource usage will go brrr)
marscarsrars@reddit
Now all we need is someone to Claude-distill this version.
Version 4.6 opus only kindly thank you very much.
2Norn@reddit
those are placebo my dude
wren6991@reddit
Community distills perform worse than originals.
Also there is no such thing as an Opus 4.6 distill, since there is no public Opus 4.6 CoT data to actually distill from. Anthropic treat that material as a trade secret and it doesn't leave their datacentres. The "distills" are trained on the summarised CoT available from the public API, which has been laundered through a smaller model to remove the CoT structure. This isn't a conspiracy, it's in Anthropic's API documentation.
chimph@reddit
I'm a bit confused. So this is speculative decoding where a separate drafting model (MTP) is used, but it's not supported by llama.cpp even though llama.cpp supports speculative decoding.. 🤔
nickm_27@reddit
Instead of an entirely different model (like using GemmaE2B for Gemma31B) it uses tensors built into the primary model to do speculative decoding.
There are many types of speculative decoding:
ngram-mod (based on text repetition), draft (uses a separate full LLM model), EAGLE (uses a separate draft model created from the main model), and then MTP (Multi Token Prediction). The advantage of MTP, once it is implemented in llama.cpp, will be that it is all distributed in a single GGUF, not separately, and it should hopefully have lower VRAM requirements compared to the other options (besides ngram).
shokuninstudio@reddit
When the GGUF comes, will it work automatically in current llama.cpp? If so, do we need to add extra flags?
rerri@reddit (OP)
Current release version of llama.cpp does not yet have MTP support. It is being worked on.
Fluxing_Capacitor@reddit
Pretty sure you can still use this model for normal speculative decoding in the meantime, no?
BillDStrong@reddit
And the current work is on the Qwen 3 MTP support, so 3.5,3.6, Coder-Next.
Each model family is going to need a bring up step.
ieatdownvotes4food@reddit
you can already get mtp working great with the qwen models and vllm.
benchmark tune + --speculative-config method:qwen3_next_mtp, num_speculative_tokens:2
really good prediction acceptance
BillDStrong@reddit
Okay, but this subtopic is about llama.cpp support for it.
I can't use vLLM, my P40 is not supported, so this isn't useful for me at all. Llama.cpp is, potentially, useful, until they finally move to CUDA 7+ only.
ieatdownvotes4food@reddit
oooh my bad. I just saw MTP excitement. It'll end up in llama.cpp soon, I'm sure!
BillDStrong@reddit
It's in testing now, for Qwen 3.5/3.6.
No worries.
CroquetteLauncher@reddit
How does it compare with dflash speculators ? https://huggingface.co/RedHatAI/gemma-4-31B-it-speculator.dflash or https://huggingface.co/z-lab/gemma-4-31B-it-DFlash
Maximum-Fact-5832@reddit
If I understand correctly, those speed up PP (prompt processing), while this speeds up decode (tokens per second). That makes the DFlash models useful for DGX Spark and Mac, which struggle with PP, and this useful for those as well as GPUs (at least GPUs that can run this model; I'm not familiar beyond Nvidia). That's my understanding anyway.
dtdisapointingresult@reddit
Lo and behold! We are come again!
MaruluVR@reddit
What are the odds we could use the E2B draft model as a tiny STT model exclusively
cnmoro@reddit
I like the way you think
_-_David@reddit
I would get excited but I have never had Gemma 4 stop repeating outputs.
No_Swimming6548@reddit
Will this work with partial offloads too?
rz2000@reddit
The 31B model @ bf16 is my favorite model for chat among anything I can run using up to 170GB of memory. It's so efficient at getting to the point that it barely matters that it only outputs about 10 tok/second. If speculative decoding accelerates that, it will be even better.
akavel@reddit
I wonder why they worked with Ollama to support it, but not with llama.cpp?
dryadofelysium@reddit
https://github.com/google-ai-edge/LiteRT-LM 0.11 has Gemma 4 MTP support and added Windows native support today
xanduonc@reddit
Yay! Google delivers
mortenmoulder@reddit
Tbh Google is pretty damn cool for releasing this. Can't wait to try it!
Fine_Nectarine9328@reddit
Can someone tell me what this is in an easy way please? And second, llama.cpp officially doesn't support turboquant but there is an unofficial fork on GitHub by someone named Tom; how do I install that, or does vLLM support turboquant? Please someone clear up these two doubts, and please don't downvote, my karma is low.
SQrQveren@reddit
It's this repo: https://github.com/TheTom/llama-cpp-turboquant. You install/compile it just like you would the original llama.cpp.
ChatGPT can easily explain in detail how to do it.
When you have done so, you need to find models that are turbo'ed and the right parameters for them.
I have found one good example that suits me quite fine, though I have only tested it for a few hours; /u/drepublic made a really cool post you can use as a base: https://www.reddit.com/r/LocalLLM/comments/1sz7ih3/qwen359b_running_on_8gb_vram_is_insane/oizsauk/
In the same thread he gives a link to the model to download; it really can't be any easier than this.
First_Ad6432@reddit
MTP: Multi Token Prediction, will make your output faster
Turboquant isn't that much better than what we have today, I think, so you can forget it.
Weak-Shelter-1698@reddit
W Gemma team.
nunodonato@reddit
when gguf
Look_0ver_There@reddit
Fairly easy to create your own Q8_0 near-lossless GGUF just by following the convert_hf_to_gguf.py instructions from llama.cpp if you're impatient.
popoppypoppylovelove@reddit
I wanted to try this to see if it works as a separate draft model, but it doesn't convert because the architecture is unknown:
ERROR:hf-to-gguf:Model Gemma4AssistantForCausalLM is not supported
Look_0ver_There@reddit
Just pulled fresh and tried it myself, even upgrading to latest transformers and everything, and you are indeed correct. How weird that we can convert the full model, but not the MTP model!!
VoiceApprehensive893@reddit
Now we need audio input for 26B, since it's E4B but uses more RAM and is 5 times better.
Blues520@reddit
Is there any accuracy tradeoffs when using MTP?
Is it like quantization where you sacrifice accuracy for performance?
rerri@reddit (OP)
No, accuracy remains 100%. The main model checks every token that the MTP model generates and corrects when needed.
ThrowawayProgress99@reddit
How does this work with offloading, do both models need to be fully on GPU? What about kv cache, can that be on RAM? My current config is to override all ffn_down tensors. Also does this work with the (on RAM) mmproj for vision?
Character_Split4906@reddit
From what I understand, llama.cpp has limitations on using a draft model with an mmproj model due to how the KV cache is shared with the main model. Will MTP support help with running mmproj and a draft model in parallel?
jacek2023@reddit
Looks like my love for Gemma 4 will continue
zipzak@reddit
will this work with fine-tuned variants of gemma4?
finevelyn@reddit
I love Google. I also hate Google.
msp26@reddit
I take back everything bad I ever said about google
No-Falcon-8135@reddit
Mlx quant version possible?
Intelligent_Ice_113@reddit
I believe it's possible to convert it with mlx-vlm (or mlx-lm? not sure)
Iory1998@reddit
GGUF when?
Healthy-Nebula-3603@reddit
The MTP model for Gemma 4 31B is only 930 MB 😍
Healthy-Nebula-3603@reddit
Nice but not working under llamacpp yet
dai_app@reddit
Where can I find the gguf?
Potential_Block4598@reddit
Imagine Qwen3.5 9B running on 4.5GB with GPT-4 performance on an iPhone
Whoa!
Eyelbee@reddit
Does this come with a slight degradation in quality?
ResidentPositive4122@reddit
No, the small model proposes tokens and those proposals are verified by the big model. If they don't match with the top prediction, they get discarded and a new token is being generated the normal way.
There might be implementation bugs, or batch related inaccuracies here and there, but in theory the quality should be identical to the big model, just faster.
Deep-Vermicelli-4591@reddit
none
Intelligent_Ice_113@reddit
does LM studio support mlx draft models?
Comrade_Vodkin@reddit
Awesome!