Speculative Decoding works great for Gemma 4 31B with E2B draft (+29% avg, +50% on code)
Posted by PerceptionGrouchy187@reddit | LocalLLaMA | 99 comments
Following up on my previous Gemma 4 31B benchmark post, I tested speculative decoding with Gemma 4 E2B (4.65B) as the draft model.
The results were much better than I expected, so I wanted to share some controlled benchmark numbers.
Setup
- GPU: RTX 5090 (32GB VRAM)
- Main model: Gemma 4 31B UD-Q4_K_XL (18.3GB)
- Draft model: Gemma 4 E2B UD-Q4_K_XL (3.0GB)
- Backend: llama.cpp fork with TurboQuant KV cache (turbo3)
- Config: 128K context, parallel=1, Flash Attention,
--draft-max 8 --draft-min 1
Benchmark Results
Same server config for both, max_tokens=500, temp=0.7, warm-up query discarded before measuring.

| Query Type | Baseline (t/s) | SpecDec (t/s) | Accept Rate | Speedup |
|---|---|---|---|---|
| Math explanation | 57.45 | 85.86 | 62.9% | +49.5% |
| Korean poetry | 56.93 | 62.34 | 44.1% | +9.5% |
| Code generation | 57.15 | 86.05 | 60.7% | +50.5% |
| Science explanation | 57.19 | 71.14 | 50.9% | +24.4% |
| Translation + analysis | 57.14 | 63.26 | 42.2% | +10.7% |
| Average | 57.17 | 73.73 | 52.2% | +29.0% |
Even at a 42% acceptance rate, speculative decoding is still +10% faster, because there's zero token-translation overhead when the vocabs are compatible.
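For intuition on why even a 42% accept rate pays off: the standard speculative-decoding estimate gives the expected tokens banked per expensive 31B forward pass. Treating the table's aggregate accept rate as an i.i.d. per-token probability is a simplification, so take this as a rough model only:

```python
def expected_tokens_per_pass(p: float, k: int) -> float:
    """Expected tokens emitted per target verification pass when each of
    up to k drafted tokens is accepted independently with probability p
    (geometric series 1 + p + p^2 + ... + p^k)."""
    return (1 - p ** (k + 1)) / (1 - p)

# With --draft-max 8:
print(expected_tokens_per_pass(0.60, 8))  # ~2.47 tokens/pass (code/math)
print(expected_tokens_per_pass(0.42, 8))  # ~1.72 tokens/pass (creative)
```

Each verification pass always yields at least one token (the target's own), so with a cheap draft and no token translation the floor is roughly baseline speed rather than a slowdown.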
The GGUF Version Trap
I initially got terrible results — the draft model was slower than no draft at all (7.31 t/s vs 57 t/s baseline). Every draft model combo gave this warning:
the target and draft vocabs are not compatible - tokens will be translated between the two
After digging into speculative.cpp, I found the compatibility check compares add_bos_token between target and draft. My 31B GGUF was from early April when Gemma 4 first dropped, and it had add_bos_token = false. The E2B model (downloaded later) had add_bos_token = true. This single metadata mismatch forced llama.cpp into token translation mode, killing all performance gains.
Re-downloading the 31B GGUF (Unsloth re-quantized all Gemma 4 GGUFs recently with the fix) made the warning disappear and unlocked the full +29% speedup.
TL;DR: If you downloaded your Gemma 4 GGUF in early April 2026, re-download it. The tokenizer metadata was fixed.
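The check itself boils down to comparing tokenizer metadata between the two GGUFs. A simplified Python mirror of the idea; the key names and the vocab size below are illustrative placeholders, not the exact fields speculative.cpp compares:

```python
def vocabs_compatible(target_meta: dict, draft_meta: dict) -> bool:
    """Simplified mirror of the idea behind llama.cpp's draft-compatibility
    check: tokenizer metadata that disagrees between target and draft
    forces token-translation mode. Keys/values here are illustrative."""
    for key in ("add_bos_token", "add_eos_token", "vocab_size"):
        if target_meta.get(key) != draft_meta.get(key):
            return False
    return True

# Illustrative metadata (vocab size is a placeholder, not Gemma's real one):
old_31b = {"add_bos_token": False, "add_eos_token": False, "vocab_size": 262144}
e2b     = {"add_bos_token": True,  "add_eos_token": False, "vocab_size": 262144}

print(vocabs_compatible(old_31b, e2b))  # False: forced token-translation mode
```

When a check like this fails, llama.cpp falls back to translating every draft token between vocabularies, which is what produced the 7 t/s numbers above.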
Practical Tips
Add these flags to your existing llama-server command:
-md gemma-4-E2B-it-UD-Q4_K_XL.gguf
-ngld 99
--draft-max 8
--draft-min 1
--parallel 1
Things to watch out for:
- --parallel 1 is mandatory — with auto (=4), the draft model's KV cache is allocated 4x, eating VRAM and tanking speed to 7 t/s
- No vision — speculative decoding and multimodal can't be used together
- Q4 draft is fine — Q8 (4.8GB) doesn't improve speed over Q4 (3.0GB), and Q4 leaves more VRAM headroom
- Extra VRAM ~2.3GB — total ~23.4GB with 128K context on a 32GB card (256K fits too, ~25.5GB).
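As back-of-envelope arithmetic, the VRAM budget works out like this (all figures are the rounded numbers from this post, not measurements of your own setup):

```python
# Rounded figures from this post (GB), not measurements of your setup:
main_weights = 18.3   # 31B UD-Q4_K_XL
draft_extra = 2.3     # draft weights resident + its KV cache
                      # (less than the 3.0GB file: embeddings sit on CPU)
total_128k = 23.4     # observed total at 128K context

main_kv_128k = total_128k - main_weights - draft_extra
print(f"implied main-model KV at 128K: {main_kv_128k:.1f} GB")   # ~2.8 GB
print(f"headroom on a 32GB card: {32 - total_128k:.1f} GB")      # ~8.6 GB
```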
Content-dependent speedup
The gains scale with how predictable the output is:
- Code / Math (structured, repetitive patterns): ~60% accept rate → +50% speed
- Explanations (semi-structured): ~50% accept rate → +24%
- Creative / Translation (less predictable): ~42% accept rate → +10%
Even the worst case is still a net positive, which is the key difference from the incompatible-vocab situation, where even a 65% acceptance rate produced zero gains.
CheatCodesOfLife@reddit
Have you tried using a low quant of the 26B MoE as a draft model?
PerceptionGrouchy187@reddit (OP)
Too heavy for a draft model
CheatCodesOfLife@reddit
For me it's better to just use ik_llama.cpp though with graph-split:
slot print_timing: id 0 | task 0 |
prompt eval time =   96.53 ms /  17 tokens (  5.68 ms per token, 176.12 tokens per second)
       eval time = 1543.67 ms /  63 tokens ( 24.50 ms per token,  40.81 tokens per second)
      total time = 1640.20 ms /  80 tokens
PaMRxR@reddit
Prompt eval is much slower with -sm graph though, or?
CheatCodesOfLife@reddit
Usually it's faster. But it depends on your PCIe bandwidth. Here's a bigger prompt since you don't get an accurate reading when the prompt_len < pp speed:
2 x 3090 with q5_k_m with NVLINK -sm graph:
2 x 3090 with q5_k_m with pcie4.0 x8 -sm graph:
2 x 3090 with q5_k_m with pcie4.0 x8 -sm layer:
See, pretty much always faster
CheatCodesOfLife@reddit
I just tested it (thanks for reminding me about this feature) on six 3090's.
Just a crude "Hi" prompt:
"Hi"
Baseline:
MoE As Draft:
That's Q8_0 for both models.
PerceptionGrouchy187@reddit (OP)
Cool results, thanks for testing it!
cviperr33@reddit
Thank you!!! I've been testing it since yesterday and it's amaziiiing — fits exactly at 23.5GB VRAM used, and the speed is as fast as 26B for coding, but it feels smarter and less prone to errors in agentic tool calls.
drallcom3@reddit
I can't select a draft model in LM Studio. Is there a trick?
That model is 5.1gb here.
PaMRxR@reddit
Maybe it counts context as well, that's why it shows larger?
robertpro01@reddit
So dflash is already supported on llama.cpp? Is there a guide?
PerceptionGrouchy187@reddit (OP)
If you mean the draft model (speculative decoding), yeah it's built into llama.cpp.
snugglezone@reddit
They're asking about dflash which is a new form of diffusion based speculative decoding. Doesn't support Gemma yet, but it's in progress according to the GitHub issues
robertpro01@reddit
Yeah, that's exactly what I'm asking, I'm still a newbie here.
snugglezone@reddit
https://github.com/z-lab/dflash/issues/50
Not yet as far as I can tell.
sk1kn1ght@reddit
God... LlamaCpp makes me miss RSS feeds. I have no idea what's happening on the latest merges
PerceptionGrouchy187@reddit (OP)
TIL about dflash, thanks for the correction! I assumed it was a typo for "draft" earlier.
SkyFeistyLlama8@reddit
For us unified RAM laptop folks, would this make Gemma dense 31B comparable with the 26B MOE in terms of speed but with much higher intelligence?
XniX@reddit
Tested on my llama.cpp installation (RTX5090) and speed results are impressive!
Benchmark: gemma-4-26B-A4B-it-GGUF:UD-Q4_K_XL
Summary
Benchmark: gemma-4-31B-it-UD-Q4_K_XL + gemma-4-E2B-it-UD-Q4_K_XL
Summary
Benchmark: gemma-4-31B-it-UD-Q4_K_XL
Summary
---
(If you need the script, it's here: https://gist.github.com/PierpaoloPernici/4f980ced0e6e8379a695016253f6cf27)
camwasrule@reddit
What context length can you push on it? Nice one 🙋
Acceptable_Home_@reddit
Is self-speculative decoding available for Gemma4 26B A4B? Even a Gemma4 E2B draft model would offload to CPU for me, making it all even slower.
sk1kn1ght@reddit
Spec decoding is good for dense models. For MoE I haven't seen any yet that benefit from it.
camwasrule@reddit
Prompt processing eval speeds?
ComputersAndTrees@reddit
What's the full llama-server command you're using? Thanks!
PerceptionGrouchy187@reddit (OP)
Note: the turbo3 KV cache types are from the TurboQuant fork. If you're on mainline llama.cpp, use --cache-type-k q8_0 --cache-type-v q8_0 instead.
behohippy@reddit
Using these exact settings, on a dual 5060ti 16g setup, I'm getting significantly slower pp and tg with the draft model. Compiled from master about 30 min ago (q8 k/v crashes as well now, my previous compile worked fine). With a large query (55k tokens), I'm getting 1.2 to 2.4t/s tg. The same model without draft will do 12t/s. According to logs, both models are fully sitting in vram.
BasilTrue2981@reddit
Try to pin the drafting model to only one GPU like:
StardockEngineer@reddit
You don’t need these flags. They’re defaults
--n-gpu-layers 99 \
--no-mmap \
--flash-attn on \
Or context size. It’ll autofit.
yeah-ok@reddit
Might be but at least it keeps the knowledge of what's going on fresh, defaults can change!
andres_garrido@reddit
Interesting setup. Have you tried combining speculative decoding with retrieval-heavy prompts?
In my case, once I start pulling multiple files into context, the bottleneck shifts away from token generation and more into context construction + memory pressure.
Curious if you’ve seen similar behavior when prompts get large.
alew3@reddit
After building from source from this fork I still get the error below. Is there a flag that needs to be set to enable turboquant in the build?
Addyad@reddit
I merged the latest llama.cpp with turboquant from https://github.com/johndpope/llama-cpp-turboquant/tree/feature/Planarquant-kv-cache.
You can find it here: https://github.com/Addy-ad/llama-cpp-turbo-planar-iso/tree/addyad-latest
From the feature/Planarquant-kv-cache branch, I noticed that the turboquant variants work with Gemma4, but other special quants like iso(rotor) and planar quants don't, because Gemma4 has this sliding-window mechanism. Also, just a couple of hours ago, llama.cpp added support for Gemma4 audio. The audio part works like a charm.
PerceptionGrouchy187@reddit (OP)
Thanks for sharing! Interesting that iso/planar quants don't work with Gemma4's sliding window. Re: audio — I thought that was E2B/E4B only, does 31B support it too?
Addyad@reddit
I don't think 31B supports audio.
https://huggingface.co/google/gemma-4-31B-it
According to the description here (the table), I see 31B has only text and image capabilities.
ScoreUnique@reddit
Turbo3 better than q8 in any way?
PerceptionGrouchy187@reddit (OP)
Less VRAM for KV cache with similar quality — lets you fit longer contexts. It's from the TurboQuant fork.
ComputersAndTrees@reddit
Appreciate it!
JamesEvoAI@reddit
I managed to adapt this for Strix Halo if anyone is interested, got a 2x speedup!
https://sleepingrobots.com/dreams/speculative-decoding-gemma4-strix-halo/
sampdoria_supporter@reddit
Very interesting, thank you! Had been wondering about this.
EdenistTech@reddit
Thanks for the tip. I tried it on my 5070Ti/5060Ti combo. I usually get ~25 t/s, but with the draft model loaded, it jumped to 40 t/s (128K ctx). Not too bad! I'll check if I can fit the Q5 quant I usually run.
wreckerone1@reddit
I have the same setup but only getting 32 TPS, care to share your llama server command?
Corosus@reddit
Wow really, nice! I have the same cards, but it was worse for me testing OP's setup. What's your OS, DDR and link speed? I might have to go Linux to avoid Windows making PCIe communication go through system RAM instead of direct PCIe to PCIe.
PerceptionGrouchy187@reddit (OP)
This looks really valuable for multi-GPU setups going forward. The draft model fits on a single GPU so it avoids the cross-GPU communication overhead entirely.
somerussianbear@reddit
Now that’s a post that’s used AI but is absolutely no slop.
letsgoiowa@reddit
What is this and how do I use it?
tacticaltweaker@reddit
I'm having the same issue of only getting ~7t/s on Vulkan, even after redownloading the latest models. The 31B by itself gives me ~25t/s. Any ideas?
WithoutReason1729@reddit
Your post is getting popular and we just featured it on our Discord! Come check it out!
You've also been given a special flair for your contribution. We appreciate your post!
I am a bot and this action was performed automatically.
StephenSRMMartin@reddit
I tried Gemma4:31b + E2B draft, and it barely helped. To be clear, my draft model is fully on GPU, and my main model is split. I think that split basically handicaps it. With 10 layers (I could add more) on GPU, I get ~4 t/s; with speculative decoding it may be 5-7 t/s.
Likewise, when I use Gemma4:26b + E2B draft, with draft fully in GPU, and using cpu-moe for 26b, I actually get half the performance. Without a draft model, it's ~30-36 t/s; with a draft model it drops down to 15-18 t/s.
I tried both with draft-max 2, 4, 8, 16, and 32. Nothing helped.
FoxiPanda@reddit
I tried to recreate your results and I largely did but I ran into a few weird issues.
One, when trying to run the TurboQuant version of llama.cpp I see the following errors that are causing draft-max to not work properly:
However, even at a max of 3, I still got some pretty good results:
BathroomSad6366@reddit
RTX 4090 is powerful but when it’s at 25-35% utilization it still draws a lot of watts. Is there a clean way to throttle or sleep the card when not needed?
sk1kn1ght@reddit
Are you on Linux or Windows? I haven't benched Gemma, but with qwen 27b dense I had power limited it to 267 watts and overclocked it. Was getting 84-86 fps on FurMark with 44-46 t/s on llama. Idle according to nvidia-smi was 15W or so.
If you are on Linux try a program called LACT
Lorian0x7@reddit
you should try with low quant like Q1, Q2 , and see if the speed up is the same despite the memory saved
PerceptionGrouchy187@reddit (OP)
Haven't tested lower quants for the draft model yet. My gut feeling is the acceptance rate would drop since the draft predictions get less accurate, but I don't have numbers to back that up. I just like Q4 as a sweet spot — good enough quality and still leaves plenty of VRAM headroom.
Lorian0x7@reddit
That's true for your 5090, but for a 4090, if the acceptance doesn't drop too much, Q2 or Q1 would save lots of precious memory.
PerceptionGrouchy187@reddit (OP)
Tested it. Couldn't find a Q1 quant for E2B, but IQ2_M (2.29GB) vs Q4 (3.17GB), same benchmark:
Only 3.5% less speedup while saving ~870MB. Math was actually faster with IQ2_M (93 vs 86 t/s) since the draft model runs quicker. Creative writing is basically baseline either way. Looks like a solid option for 4090 users.
sk1kn1ght@reddit
Wouldn't it also be of benefit to 5090? 1.5gb of extra cache should be able to hit 180k~
Lorian0x7@reddit
Thanks 🙏 that's what I was hoping for!
PerceptionGrouchy187@reddit (OP)
That's a fair point for 4090 users. Would be interesting to see the numbers.
OlegDoDo@reddit
Interesting results. I've been running qwen2.5:7b on CPU only (16GB RAM, no GPU) for document work — contracts, summaries, client files. Response time is 20–40 seconds but for that use case nobody's waiting on real-time replies anyway. Curious whether speculative decoding helps at all in CPU-only setups or if it's purely a GPU optimization.
PhilippeEiffel@reddit
I would expect a similar gain factor based on the underlying principle.
PerceptionGrouchy187@reddit (OP)
Haven't tested CPU-only, but I'd guess the gains would be minimal since CPU has less parallelism to exploit. Worth a try though.
slippery@reddit
Thanks for running these benchmarks!
It's nice to see some independent verification and results.
Acceptable_Home_@reddit
Unfortunately for me even the E2B draft model would have to sit on the CPU, and it would slow down my gemma4 26B A4B (q4) even more.
Getting 24 t/s right now with 5GB in VRAM (8GB total), and the rest offloaded to CPU takes around 21GB with a 48K ctx window (24GB total).
PaMRxR@reddit
With multi-gpu definitely add an option like this: --device-draft CUDA0, otherwise it was pretty much same as baseline for me.
With that tg went from 23 -> 36 for me with IQ2_M (and 34 with UD-IQ2_M)
Gvara@reddit
Thanks for sharing your findings with us, very informative.
May I ask how you are running your benchmarks for the speculative decoding, and how you are able to determine the accept rate? I'd like to check, as I am getting the same tg speed with or without SD on my 3090.
Far-Low-4705@reddit
I just wish it didn't disable vision.
For me that completely negates its usability.
Then-Topic8766@reddit
One more important thing. I have two GPUs (3090 and 4060ti, a total of 40GB VRAM). So far I haven't had much luck with speculative decoding (small improvements). Today I did a bit of research on the llama.cpp flags for draft and decided to try '--device-draft "CUDA0"' (this puts all of the draft model on the faster card). Bingo! Speed for 31b went from 18 t/s to 29 t/s for 'write song' and 40 t/s for 'write code for...'. The more code there is, the faster the speed. Just fantastic!
IrisColt@reddit
Thanks!!! Speeding-up the 31B model would be amazing!
andres_garrido@reddit
Interesting results, especially the +50% on code.
Have you noticed if the speedup actually translates into better usability for multi-file tasks?
I’ve been experimenting with codebase-level queries locally, and one thing I’m seeing is that latency matters less than context quality once you start pulling multiple files. Even small delays in retrieval or planning end up dominating.
Curious if speculative decoding still holds up when prompts get large and less predictable (e.g. multi-file reasoning vs code completion).
Uncle___Marty@reddit
Great read OP. Thanks for listing your findings, but did you use the 0.7 temp for the coding test as well? I'd be interested to know the difference with it coding at 0.0.
brahh85@reddit
can you try this as draft model?
https://huggingface.co/HauhauCS/Gemma-4-E4B-Uncensored-HauhauCS-Aggressive
Any result would be interesting: it will show whether the abliteration degrades the model, or if some areas actually improve (probably math and coding).
anonynousasdfg@reddit
And how about the accuracy rate?
PerceptionGrouchy187@reddit (OP)
The accept rates are in the table (42-63% depending on content type). And since speculative decoding is lossless — the target model always verifies every token — output quality is identical to running without it. No accuracy tradeoff.
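The greedy case makes the losslessness easy to see: every token that ends up in the output is exactly the target's own choice for that position, so the result provably matches target-only decoding. A toy sketch, not llama.cpp's actual code (real samplers at temp > 0 use a rejection-sampling rule instead of exact matching):

```python
def greedy_spec_decode(target, draft, prompt, n, k=4):
    """Toy greedy speculative decoding. `target` and `draft` map a token
    sequence to the next token id. Every kept token is exactly
    target(prefix), so the output matches target-only greedy decoding."""
    seq = list(prompt)
    while len(seq) < len(prompt) + n:
        spec = list(seq)
        for _ in range(k):                  # draft proposes k tokens
            spec.append(draft(spec))
        for i in range(len(seq), len(spec)):
            t = target(spec[:i])            # verify position i
            seq.append(t)                   # target's choice is always kept
            if t != spec[i] or len(seq) == len(prompt) + n:
                break                       # mismatch: discard remaining drafts
    return seq

# Deterministic toy "models" (stand-ins, obviously not real LLMs):
target = lambda s: (sum(s) * 3 + 1) % 7
draft = lambda s: (sum(s) * 3 + 1) % 7 if sum(s) % 2 else 0  # often wrong

plain = [1, 2]
for _ in range(10):                         # plain greedy, target only
    plain.append(target(plain))

assert greedy_spec_decode(target, draft, [1, 2], 10) == plain  # identical
```

The draft only changes how many target forward passes are needed, never which tokens come out.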
nilsfg@reddit
How did you measure the acceptance rate? I'm still relatively new to self-hosting and still learning about benchmarking, so I'm not up to speed with everything.
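One way to get the number: llama-server logs acceptance stats, and you can grep them out. The log line below is approximated, so check what your build actually prints before relying on the regex:

```python
import re

# Log format approximated from a llama-server run; verify against your
# own logs, the exact wording can differ between builds:
sample = ("slot release: id  0 | task 123 | "
          "draft acceptance rate = 0.62800 ( 157 accepted / 250 generated)")

m = re.search(r"draft acceptance rate = ([\d.]+) \( *(\d+) accepted / *(\d+) generated\)",
              sample)
rate, accepted, generated = float(m.group(1)), int(m.group(2)), int(m.group(3))
print(f"{accepted}/{generated} drafted tokens accepted ({rate:.1%})")
```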
Miserable-Dare5090@reddit
what is the actual improvement you note with turboquant fork? as I understand it there is no merge to main lcpp for this
ethertype@reddit
turboquant permits longer context by compressing the KV cache with lower loss than you would otherwise see with the same number of bits per Key and Value. So you can fit more KV cache = more context.
In addition, the performance appears to stay flat(ter) as context increases, whereas longer context without tq tends to result in fewer tokens/s.
There are a few outliers where turboquant'ed KV-cache result in increased performance, but this is largely not the case.
But turboquant is a bit finicky to get right, and how sensitive the model is to quantization of the KV cache depends on the model/model-architecture. Some handle compression of K badly, whereas V-compression is generally safer.
As I understand it, it is quite an undertaking to test and validate this for all the different back-ends. And as icing on the cake, ggerganov does not appear to be particularly convinced by either the code or the results so far.
And finally, there's been a lot of benchmarking in the turboquant discussion. But largely PPL and KLD. I understand that these values aren't always a good proxy for actual benchmark results. (knowledge, code, creative writing etc.)
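For a sense of what's being compressed: the KV cache grows linearly with context, at 2 × layers × kv_heads × head_dim values per token. A sketch with made-up dimensions (not Gemma 4's real config, and ignoring sliding-window layers, which cache far less):

```python
def kv_cache_gib(n_layers, n_kv_heads, head_dim, n_ctx, bytes_per_val):
    # K and V each store n_layers * n_kv_heads * head_dim values per token
    return 2 * n_layers * n_kv_heads * head_dim * n_ctx * bytes_per_val / 2**30

# Hypothetical 48-layer GQA model (4 KV heads of dim 128), full attention:
print(kv_cache_gib(48, 4, 128, 131072, 2))  # f16 at 128K ctx -> 12.0 GiB
print(kv_cache_gib(48, 4, 128, 131072, 1))  # ~q8_0 (1 byte/val) -> 6.0 GiB
```

q8_0 actually costs slightly more than one byte per value because of block scales; the point is the linear scaling, which is why cache quantization buys context.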
Miserable-Dare5090@reddit
Yep, I’m not asking what turboquant is. I mean specifically, in MLX jang quants, whether it really improves compression that much vs dynamic quants with full attention paths, etc.
Otherwise, AWQ performs better
ethertype@reddit
Multiple forks mentioned in the above discussion. These two stood out in my opinion:
- Amesian fork
- TheTom fork
PerceptionGrouchy187@reddit (OP)
For me it's worth it
albuz@reddit
You may also want to try to squeeze a bit of VRAM by offloading per-layer embeddings of a draft model with --override-tensor-draft "per_layer_token_embd\.weight=CPU". It should not affect inference speed in theory.
PerceptionGrouchy187@reddit (OP)
Good tip! In my case the E2B draft model's embeddings are already on CPU (~1.8GB) due to the auto-fit mechanism, so there's nothing extra to offload. But useful to know for setups where VRAM is tighter.
Danmoreng@reddit
Wouldn’t it increase performance to also have the draft model fully on GPU and instead reduce context a bit?
You can use the fit and fit-ctx params for the main model instead of ngl and ctx-size and use fit-target to create space for the draft model on GPU. That’s what I did when running Qwen3.5 with mmproj on GPU: https://github.com/Danmoreng/local-qwen3-coder-env?tab=readme-ov-file#server-optimization-details
Past-Reception-424@reddit
The E2B model as a draft candidate is a smart pick actually. Similar architecture means better acceptance rates than using a random smaller model. Almost 30 percent speedup with basically no quality loss is the dream setup for local inference
cibernox@reddit
Question: how good do you reckon the 31B is vs the MoE, and how knowledgeable is it in general? I'm building a RAG system with qwen, and while it's smart, it's not the greatest conversationalist. Gemma is much nicer to talk to.
PerceptionGrouchy187@reddit (OP)
31B dense feels smarter than the 26B MoE in my experience, but for RAG the MoE might be enough since it mostly needs to synthesize retrieved context rather than reason from scratch. And yeah, Gemma is just nicer to talk to than Qwen — hard to quantify but you feel it.
cibernox@reddit
If you had to give it a number, how much smarter is Gemma 26b over the 8B E4E in tool calling?
juaps@reddit
Hi everyone, I know this might be a silly question, but I'm curious about how you all set up the draft model. I'm using LM Studio, and I have both models, but LM Studio doesn't allow me to set Gemma 4 E2B as the draft of 31B. The instructions in the documentation are unclear, and neither Claude, GPT, nor Grok seems to know this either. Can someone please provide me with a hint? Thanks!
Beginning-Window-115@reddit
Unfortunately for me it seems to be slower on average compared to just running the model by itself.
PerceptionGrouchy187@reddit (OP)
Are you seeing the "vocabs not compatible" warning in the server logs? I had the same issue initially — turned out my 31B GGUF had add_bos_token = false (early release bug) while E2B had true, which forced token translation mode and killed all performance. Re-downloading the latest 31B GGUF from Unsloth fixed it. Check your server logs for that warning.
Beginning-Window-115@reddit
I think since I'm running UD Q3 for both of them, it's probably messing with the acceptance rate. So I'll redownload at 4-bit.
ethertype@reddit
What is and how old is your main model? You may have to re-download. main and draft models must be compatible, and the initial drop of the larger models were not fully compatible (for drafting) with the smaller ones.
Beginning-Window-115@reddit
I got it after the fix
Odd-Ordinary-5922@reddit
have you tried different values on:
PerceptionGrouchy187@reddit (OP)
Thanks for the suggestion! I ran a sweep and here are the results:
draft-max 8 is the sweet spot for mixed workloads. 16 hits 99 t/s on math but regresses on creative/translation, so the average is about the same. Updated the post.
Beginning-Window-115@reddit
How come draft-min 1 when the default is 0? Any reason?
PerceptionGrouchy187@reddit (OP)
Honestly no strong reason — I followed the setting from another post that recommended it. Haven't tested --draft-min 0 vs 1 specifically. The default might work just as well.
sid_276@reddit
Why can’t vision be used?
PerceptionGrouchy187@reddit (OP)
Vision is explicitly blocked by llama.cpp — when mmproj is loaded, the server refuses to initialize speculative decoding with "speculative decoding is not supported with multimodal". So you have to pick one or the other for now.
I tried patching the check out of the source but ran into deeper assertions in the token handling code, so it's not a trivial fix. Would be nice if upstream supported this though — the draft model only needs text tokens anyway.