Speculative Decoding works great for Gemma 4 31B with E2B draft (+29% avg, +50% on code)

Posted by PerceptionGrouchy187@reddit | LocalLLaMA | View on Reddit | 99 comments


Following up on my previous Gemma 4 31B benchmark post, I tested speculative decoding with Gemma 4 E2B (4.65B) as the draft model.

The results were much better than I expected, so I wanted to share some controlled benchmark numbers.

Setup

Benchmark Results

Same server config for both runs: `max_tokens=500`, `temp=0.7`, with a warm-up query discarded before measuring.

| Query Type | Baseline (t/s) | SpecDec (t/s) | Accept Rate | Speedup |
|---|---|---|---|---|
| Math explanation | 57.45 | 85.86 | 62.9% | +49.5% |
| Korean poetry | 56.93 | 62.34 | 44.1% | +9.5% |
| Code generation | 57.15 | 86.05 | 60.7% | +50.5% |
| Science explanation | 57.19 | 71.14 | 50.9% | +24.4% |
| Translation + analysis | 57.14 | 63.26 | 42.2% | +10.7% |
| **Average** | **57.17** | **73.73** | **52.2%** | **+29.0%** |
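As a sanity check, the Speedup column is just the ratio of the two throughput numbers. A few lines of Python reproduce it from the table above (values copied verbatim):

```python
# Reproduce the Speedup column: speedup = specdec / baseline - 1.
# Throughput numbers copied from the benchmark table above.
rows = {
    "Math explanation": (57.45, 85.86),
    "Korean poetry": (56.93, 62.34),
    "Code generation": (57.15, 86.05),
    "Science explanation": (57.19, 71.14),
    "Translation + analysis": (57.14, 63.26),
}

for name, (base, spec) in rows.items():
    speedup = (spec / base - 1) * 100
    print(f"{name}: +{speedup:.1f}%")

# Average speedup computed the same way from the average throughputs:
avg = (73.73 / 57.17 - 1) * 100
print(f"Average: +{avg:.1f}%")  # ~ +29.0%
```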

Even at a 42% acceptance rate, speculative decoding is still ~10% faster, because there is zero token-translation overhead when the vocabularies are compatible.
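To see why even modest acceptance rates pay off, here is a standard back-of-the-envelope model (my own simplification, not from my measurements): if each drafted token is accepted independently with probability p, a draft of K tokens yields a geometric series of expected output per verification pass.

```python
# Simplified model (an assumption, not the post's measurement): with
# per-token acceptance probability p and draft length K, the expected
# number of output tokens per target-model verification pass is
#   E = 1 + p + p^2 + ... + p^K = (1 - p^(K+1)) / (1 - p)
# (the run of accepted draft tokens, plus one token the target model
# contributes itself on every pass).
def expected_tokens_per_pass(p: float, k: int) -> float:
    if p == 1.0:
        return k + 1.0
    return (1 - p ** (k + 1)) / (1 - p)

# At the ~52% average acceptance rate with --draft-max 8, roughly two
# tokens come out of every expensive 31B forward pass instead of one.
print(expected_tokens_per_pass(0.52, 8))
```

In this toy model the speedup is smaller than 2x because the draft model's own forward passes are not free, which matches the observed +29% average rather than +100%.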

The GGUF Version Trap

I initially got terrible results — the draft model was slower than no draft at all (7.31 t/s vs 57 t/s baseline). Every draft model combo gave this warning:

    the target and draft vocabs are not compatible - tokens will be translated between the two

After digging into `speculative.cpp`, I found the compatibility check compares `add_bos_token` between the target and draft. My 31B GGUF was from early April, when Gemma 4 first dropped, and it had `add_bos_token = false`. The E2B model (downloaded later) had `add_bos_token = true`. This single metadata mismatch forced llama.cpp into token-translation mode, killing all performance gains.
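For illustration, the gist of that gate can be sketched in a few lines of Python. This is a paraphrase of the behavior described above, not the actual llama.cpp code; the `tokenizer.ggml.add_bos_token` key name is my assumption about where this flag lives in GGUF metadata.

```python
# Sketch of the vocab-compatibility gate described above (a paraphrase,
# NOT the real speculative.cpp code). If the metadata disagrees,
# llama.cpp falls back to translating tokens between the two vocabs,
# which is what killed performance.
def vocabs_compatible(target_meta: dict, draft_meta: dict) -> bool:
    # The field that bit me: a bare metadata flag, not the vocab itself.
    key = "tokenizer.ggml.add_bos_token"  # assumed GGUF key name
    return target_meta.get(key) == draft_meta.get(key)

# Early-April 31B GGUF vs. newer E2B GGUF: mismatch -> slow path.
old_31b = {"tokenizer.ggml.add_bos_token": False}
e2b = {"tokenizer.ggml.add_bos_token": True}
print(vocabs_compatible(old_31b, e2b))  # False -> token translation
fixed_31b = {"tokenizer.ggml.add_bos_token": True}
print(vocabs_compatible(fixed_31b, e2b))  # True -> fast path
```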

Re-downloading the 31B GGUF (Unsloth re-quantized all Gemma 4 GGUFs recently with the fix) made the warning disappear and unlocked the full +29% speedup.

TL;DR: If you downloaded your Gemma 4 GGUF in early April 2026, re-download it. The tokenizer metadata was fixed.

Practical Tips

Add these flags to your existing llama-server command:

    -md gemma-4-E2B-it-UD-Q4_K_XL.gguf   # draft model
    -ngld 99                             # offload all draft-model layers to GPU
    --draft-max 8                        # speculate up to 8 tokens per round
    --draft-min 1                        # but allow as few as 1
    --parallel 1                         # single request slot

Things to watch out for:

Content-dependent speedup

The gains scale with how predictable the output is (numbers from the table above):

- Code and math: ~60% acceptance, around +50%
- Science explanations: ~51% acceptance, +24%
- Creative and multilingual text (poetry, translation): ~42-44% acceptance, around +10%

Even the worst case is still a net positive. That's the key difference from the incompatible-vocab situation, where even a 65% acceptance rate yielded no gain at all because of the token-translation overhead.