Speculative Decoding works great for Gemma 4 31B with E2B draft (+29% avg, +50% on code)
Posted by PerceptionGrouchy187@reddit | LocalLLaMA | 99 comments
Following up on my previous Gemma 4 31B benchmark post, I tested speculative decoding with Gemma 4 E2B (4.65B) as the draft model.
The results were much better than I expected, so I wanted to share some controlled benchmark numbers.
Setup
- GPU: RTX 5090 (32GB VRAM)
- Main model: Gemma 4 31B UD-Q4_K_XL (18.3GB)
- Draft model: Gemma 4 E2B UD-Q4_K_XL (3.0GB)
- Backend: llama.cpp fork with TurboQuant KV cache (turbo3)
- Config: 128K context, parallel=1, Flash Attention,
--draft-max 8 --draft-min 1
Benchmark Results
Same server config for both, max_tokens=500, temp=0.7, warm-up query discarded before measuring.

| Query Type | Baseline (t/s) | SpecDec (t/s) | Accept Rate | Speedup |
|---|---|---|---|---|
| Math explanation | 57.45 | 85.86 | 62.9% | +49.5% |
| Korean poetry | 56.93 | 62.34 | 44.1% | +9.5% |
| Code generation | 57.15 | 86.05 | 60.7% | +50.5% |
| Science explanation | 57.19 | 71.14 | 50.9% | +24.4% |
| Translation + analysis | 57.14 | 63.26 | 42.2% | +10.7% |
| Average | 57.17 | 73.73 | 52.2% | +29.0% |
Even at a 42% acceptance rate, speculative decoding is still +10% faster, because there's zero token-translation overhead when the vocabs are compatible.
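For intuition on why even a 42% accept rate pays off: the standard speculative-decoding estimate gives the expected tokens banked per expensive 31B forward pass. Treating the table's aggregate accept rate as an i.i.d. per-token probability is a simplification, so take this as a rough model only:

```python
def expected_tokens_per_pass(p: float, k: int) -> float:
    """Expected tokens emitted per target verification pass when each of
    up to k drafted tokens is accepted independently with probability p
    (geometric series 1 + p + p^2 + ... + p^k)."""
    return (1 - p ** (k + 1)) / (1 - p)

# With --draft-max 8:
print(expected_tokens_per_pass(0.60, 8))  # ~2.47 tokens/pass (code/math)
print(expected_tokens_per_pass(0.42, 8))  # ~1.72 tokens/pass (creative)
```

Each verification pass always yields at least one token (the target's own), so with a cheap draft and no token translation the floor is roughly baseline speed rather than a slowdown.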
The GGUF Version Trap
I initially got terrible results — the draft model was slower than no draft at all (7.31 t/s vs 57 t/s baseline). Every draft model combo gave this warning:
the target and draft vocabs are not compatible - tokens will be translated between the two
After digging into speculative.cpp, I found the compatibility check compares add_bos_token between target and draft. My 31B GGUF was from early April when Gemma 4 first dropped, and it had add_bos_token = false. The E2B model (downloaded later) had add_bos_token = true. This single metadata mismatch forced llama.cpp into token translation mode, killing all performance gains.
Re-downloading the 31B GGUF (Unsloth re-quantized all Gemma 4 GGUFs recently with the fix) made the warning disappear and unlocked the full +29% speedup.
TL;DR: If you downloaded your Gemma 4 GGUF in early April 2026, re-download it. The tokenizer metadata was fixed.
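The check itself boils down to comparing tokenizer metadata between the two GGUFs. A simplified Python mirror of the idea; the key names and the vocab size below are illustrative placeholders, not the exact fields speculative.cpp compares:

```python
def vocabs_compatible(target_meta: dict, draft_meta: dict) -> bool:
    """Simplified mirror of the idea behind llama.cpp's draft-compatibility
    check: tokenizer metadata that disagrees between target and draft
    forces token-translation mode. Keys/values here are illustrative."""
    for key in ("add_bos_token", "add_eos_token", "vocab_size"):
        if target_meta.get(key) != draft_meta.get(key):
            return False
    return True

# Illustrative metadata (vocab size is a placeholder, not Gemma's real one):
old_31b = {"add_bos_token": False, "add_eos_token": False, "vocab_size": 262144}
e2b     = {"add_bos_token": True,  "add_eos_token": False, "vocab_size": 262144}

print(vocabs_compatible(old_31b, e2b))  # False: forced token-translation mode
```

When a check like this fails, llama.cpp falls back to translating every draft token between vocabularies, which is what produced the 7 t/s numbers above.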
Practical Tips
Add these flags to your existing llama-server command:
-md gemma-4-E2B-it-UD-Q4_K_XL.gguf
-ngld 99
--draft-max 8
--draft-min 1
--parallel 1
Things to watch out for:
- --parallel 1 is mandatory — with auto (=4), the draft model's KV cache is allocated 4x, eating VRAM and tanking speed to 7 t/s
- No vision — speculative decoding and multimodal can't be used together
- Q4 draft is fine — Q8 (4.8GB) doesn't improve speed over Q4 (3.0GB), and Q4 leaves more VRAM headroom
- Extra VRAM ~2.3GB — total ~23.4GB with 128K context on a 32GB card (256K fits too, ~25.5GB).
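As back-of-envelope arithmetic, the VRAM budget works out like this (all figures are the rounded numbers from this post, not measurements of your own setup):

```python
# Rounded figures from this post (GB), not measurements of your setup:
main_weights = 18.3   # 31B UD-Q4_K_XL
draft_extra = 2.3     # draft weights resident + its KV cache
                      # (less than the 3.0GB file: embeddings sit on CPU)
total_128k = 23.4     # observed total at 128K context

main_kv_128k = total_128k - main_weights - draft_extra
print(f"implied main-model KV at 128K: {main_kv_128k:.1f} GB")   # ~2.8 GB
print(f"headroom on a 32GB card: {32 - total_128k:.1f} GB")      # ~8.6 GB
```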
Content-dependent speedup
The gains scale with how predictable the output is:
- Code / Math (structured, repetitive patterns): ~60% accept rate → +50% speed
- Explanations (semi-structured): ~50% accept rate → +24%
- Creative / Translation (less predictable): ~42% accept rate → +10%
Even the worst case is still a net positive, which is the key difference from the incompatible-vocab situation, where even a 65% acceptance rate produced zero gains.
CheatCodesOfLife@reddit
Have you tried using a low quant of the 26B MoE as a draft model?
PerceptionGrouchy187@reddit (OP)
Too heavy for a draft model
CheatCodesOfLife@reddit
For me it's better to just use ik_llama.cpp though with graph-split:
slot print_timing: id 0 | task 0 |
prompt eval time =   96.53 ms /  17 tokens (  5.68 ms per token, 176.12 tokens per second)
       eval time = 1543.67 ms /  63 tokens ( 24.50 ms per token,  40.81 tokens per second)
      total time = 1640.20 ms /  80 tokens
PaMRxR@reddit
Prompt eval is much slower with -sm graph though, or?
CheatCodesOfLife@reddit
Usually it's faster. But it depends on your PCIe bandwidth. Here's a bigger prompt since you don't get an accurate reading when the prompt_len < pp speed:
2 x 3090 with q5_k_m with NVLINK -sm graph:
2 x 3090 with q5_k_m with pcie4.0 x8 -sm graph:
2 x 3090 with q5_k_m with pcie4.0 x8 -sm layer:
See, pretty much always faster
CheatCodesOfLife@reddit
I just tested it (thanks for reminding me about this feature) on six 3090's.
Just a crude "Hi" prompt:
"Hi"
Baseline:
MoE As Draft:
That's Q8_0 for both models.
PerceptionGrouchy187@reddit (OP)
Cool results, thanks for testing it!
cviperr33@reddit
Thank you!!! I've been testing it since yesterday and it's amaziiiing — fits exactly at 23.5GB VRAM used, and the speed is as fast as 26B for coding, but it feels smarter and less prone to errors in agentic tool calls.
drallcom3@reddit
I can't select a draft model in LM Studio. Is there a trick?
That model is 5.1gb here.
PaMRxR@reddit
Maybe it counts context as well, that's why it shows larger?
robertpro01@reddit
So dflash is already supported on llama.cpp? Is there a guide?
PerceptionGrouchy187@reddit (OP)
If you mean the draft model (speculative decoding), yeah it's built into llama.cpp.
snugglezone@reddit
They're asking about dflash which is a new form of diffusion based speculative decoding. Doesn't support Gemma yet, but it's in progress according to the GitHub issues
robertpro01@reddit
Yeah, that's exactly what I'm asking, I'm still a newbie here.
snugglezone@reddit
https://github.com/z-lab/dflash/issues/50
Not yet as far as I can tell.
sk1kn1ght@reddit
God... LlamaCpp makes me miss RSS feeds. I have no idea what's happening on the latest merges
PerceptionGrouchy187@reddit (OP)
TIL about dflash, thanks for the correction! I assumed it was a typo for "draft" earlier.
SkyFeistyLlama8@reddit
For us unified RAM laptop folks, would this make Gemma dense 31B comparable with the 26B MOE in terms of speed but with much higher intelligence?
XniX@reddit
Tested on my llama.cpp installation (RTX5090) and speed results are impressive!
Benchmark: gemma-4-26B-A4B-it-GGUF:UD-Q4_K_XL
Summary
Benchmark: gemma-4-31B-it-UD-Q4_K_XL + gemma-4-E2B-it-UD-Q4_K_XL
Summary
Benchmark: gemma-4-31B-it-UD-Q4_K_XL
Summary
---
(If you need the script, it's here: https://gist.github.com/PierpaoloPernici/4f980ced0e6e8379a695016253f6cf27)
camwasrule@reddit
What context length can you push on it? Nice one 🙋
Acceptable_Home_@reddit
Is self-speculative decoding available for Gemma4 26B A4B? Even a Gemma4 E2B draft model would offload to CPU for me, making it all even slower.
sk1kn1ght@reddit
Spec decoding is good for dense models. For MoE I haven't seen any yet that benefit from it.
camwasrule@reddit
Prompt processing eval speeds?
ComputersAndTrees@reddit
What's the full llama-server command you're using? Thanks!
PerceptionGrouchy187@reddit (OP)
Note: the turbo3 KV cache types are from the TurboQuant fork. If you're on mainline llama.cpp, use --cache-type-k q8_0 --cache-type-v q8_0 instead.
behohippy@reddit
Using these exact settings, on a dual 5060ti 16g setup, I'm getting significantly slower pp and tg with the draft model. Compiled from master about 30 min ago (q8 k/v crashes as well now, my previous compile worked fine). With a large query (55k tokens), I'm getting 1.2 to 2.4t/s tg. The same model without draft will do 12t/s. According to logs, both models are fully sitting in vram.
BasilTrue2981@reddit
Try to pin the drafting model to only one GPU like:
StardockEngineer@reddit
You don’t need these flags. They’re defaults
--n-gpu-layers 99 \
--no-mmap \
--flash-attn on \
Or context size. It’ll autofit.
yeah-ok@reddit
Might be but at least it keeps the knowledge of what's going on fresh, defaults can change!
andres_garrido@reddit
Interesting setup. Have you tried combining speculative decoding with retrieval-heavy prompts?
In my case, once I start pulling multiple files into context, the bottleneck shifts away from token generation and more into context construction + memory pressure.
Curious if you’ve seen similar behavior when prompts get large.
alew3@reddit
After building from source from this fork I still get the error below. Is there a flag that needs to be set to enable turboquant in the build?
Addyad@reddit
I merged the latest llama.cpp with turboquant from https://github.com/johndpope/llama-cpp-turboquant/tree/feature/Planarquant-kv-cache.
You can find it here: https://github.com/Addy-ad/llama-cpp-turbo-planar-iso/tree/addyad-latest
From the feature/Planarquant-kv-cache branch, I noticed that the turboquant variants work with Gemma4, but other special quants like iso(rotor) and planar quants don't, because Gemma4 has this sliding-window mechanism. Also, just a couple of hours ago, llama.cpp added support for Gemma4 audio. The audio part works like a charm.
PerceptionGrouchy187@reddit (OP)
Thanks for sharing! Interesting that iso/planar quants don't work with Gemma4's sliding window. Re: audio — I thought that was E2B/E4B only, does 31B support it too?
Addyad@reddit
I don't think 31B supports audio.
https://huggingface.co/google/gemma-4-31B-it
According to the description here (the table), I see 31B has only text and image capabilities.
ScoreUnique@reddit
Turbo3 better than q8 in any way?
PerceptionGrouchy187@reddit (OP)
Less VRAM for KV cache with similar quality — lets you fit longer contexts. It's from the TurboQuant fork.
ComputersAndTrees@reddit
Appreciate it!
JamesEvoAI@reddit
I managed to adapt this for Strix Halo if anyone is interested, got a 2x speedup!
https://sleepingrobots.com/dreams/speculative-decoding-gemma4-strix-halo/
sampdoria_supporter@reddit
Very interesting, thank you! Had been wondering about this.
EdenistTech@reddit
Thanks for the tip. I tried it on my 5070Ti/5060Ti combo. I usually get ~25 t/s, but with the draft model loaded, it jumped to 40 t/s (128K ctx). Not too bad! I'll check if I can fit the Q5 quant I usually run.
wreckerone1@reddit
I have the same setup but only getting 32 TPS, care to share your llama server command?
Corosus@reddit
Wow really, nice! I have the same cards, but it was worse for me testing OP's setup. What's your OS, DDR and link speed? I might have to go Linux to avoid Windows making PCIe communication go through system RAM instead of direct PCIe to PCIe.
PerceptionGrouchy187@reddit (OP)
This looks really valuable for multi-GPU setups going forward. The draft model fits on a single GPU so it avoids the cross-GPU communication overhead entirely.
somerussianbear@reddit
Now that’s a post that’s used AI but is absolutely no slop.
letsgoiowa@reddit
What is this and how do I use it?
tacticaltweaker@reddit
I'm having the same issue of only getting ~7t/s on Vulkan, even after redownloading the latest models. The 31B by itself gives me ~25t/s. Any ideas?
WithoutReason1729@reddit
Your post is getting popular and we just featured it on our Discord! Come check it out!
You've also been given a special flair for your contribution. We appreciate your post!
I am a bot and this action was performed automatically.
StephenSRMMartin@reddit
I tried Gemma4:31b + E2B draft, and it barely helped. To be clear, my draft model is fully on GPU, and my main model is split. I think that split basically handicaps it. With 10 layers (I could add more) on GPU, I get ~4 t/s; with speculative decoding it may be 5-7 t/s.
Likewise, when I use Gemma4:26b + E2B draft, with draft fully in GPU, and using cpu-moe for 26b, I actually get half the performance. Without a draft model, it's ~30-36 t/s; with a draft model it drops down to 15-18 t/s.
I tried both with draft-max 2, 4, 8, 16, and 32. Nothing helped.
FoxiPanda@reddit
I tried to recreate your results and I largely did but I ran into a few weird issues.
One, when trying to run the TurboQuant version of llama.cpp I see the following errors that are causing draft-max to not work properly:
However, even at a max of 3, I still got some pretty good results:
BathroomSad6366@reddit
RTX 4090 is powerful but when it’s at 25-35% utilization it still draws a lot of watts. Is there a clean way to throttle or sleep the card when not needed?
sk1kn1ght@reddit
Are you on Linux or Windows? I haven't benched Gemma, but with qwen 27b dense I had power limited it to 267 watts and overclocked it. Was getting 84-86 fps on FurMark with 44-46 t/s on llama. Idle according to nvidia-smi was 15W or so.
If you are on Linux try a program called LACT
Lorian0x7@reddit
you should try with low quant like Q1, Q2 , and see if the speed up is the same despite the memory saved
PerceptionGrouchy187@reddit (OP)
Haven't tested lower quants for the draft model yet. My gut feeling is the acceptance rate would drop since the draft predictions get less accurate, but I don't have numbers to back that up. I just like Q4 as a sweet spot — good enough quality and still leaves plenty of VRAM headroom.
Lorian0x7@reddit
That's true for your 5090, but for a 4090, if the acceptance doesn't drop too much, Q2 or Q1 would save lots of precious memory.
PerceptionGrouchy187@reddit (OP)
Tested it. Couldn't find a Q1 quant for E2B, but IQ2_M (2.29GB) vs Q4 (3.17GB), same benchmark:
Only 3.5% less speedup while saving ~870MB. Math was actually faster with IQ2_M (93 vs 86 t/s) since the draft model runs quicker. Creative writing is basically baseline either way. Looks like a solid option for 4090 users.
sk1kn1ght@reddit
Wouldn't it also be of benefit to 5090? 1.5gb of extra cache should be able to hit 180k~
Lorian0x7@reddit
Thanks 🙏 that's what I was hoping for!
PerceptionGrouchy187@reddit (OP)
That's a fair point for 4090 users. Would be interesting to see the numbers.
OlegDoDo@reddit
Interesting results. I've been running qwen2.5:7b on CPU only (16GB RAM, no GPU) for document work — contracts, summaries, client files. Response time is 20–40 seconds but for that use case nobody's waiting on real-time replies anyway. Curious whether speculative decoding helps at all in CPU-only setups or if it's purely a GPU optimization.
PhilippeEiffel@reddit
I would expect a similar gain factor based on the underlying principle.
PerceptionGrouchy187@reddit (OP)
Haven't tested CPU-only, but I'd guess the gains would be minimal since CPU has less parallelism to exploit. Worth a try though.
slippery@reddit
Thanks for running these benchmarks!
It's nice to see some independent verification and results.
Acceptable_Home_@reddit
Unfortunately for me even the E2B draft model would have to sit on the CPU, and it would slow down my gemma4 26B A4B (q4) even more.
Getting 24 t/s right now with 5GB in VRAM (8GB total), and the rest offloaded to CPU takes around 21GB with a 48K ctx window (24GB total).
PaMRxR@reddit
With multi-gpu definitely add an option like this: --device-draft CUDA0, otherwise it was pretty much same as baseline for me.
With that tg went from 23 -> 36 for me with IQ2_M (and 34 with UD-IQ2_M)
Gvara@reddit
Thanks for sharing your findings with us, very informative.
May I ask how you are running your benchmarks for the speculative decoding, and how you are able to determine the accept rate? I'd like to check, as I am getting the same tg speed with or without SD on my 3090.
Far-Low-4705@reddit
I just wish it didn't disable vision.
For me that completely negates its usability.
Then-Topic8766@reddit
One more important thing. I have two GPUs (3090 and 4060ti, a total of 40GB VRAM). So far I haven't had much luck with speculative decoding (small improvements). Today I did a bit of research on the llama.cpp flags for draft and decided to try '--device-draft "CUDA0"' (this puts all of the draft model on the faster card). Bingo! Speed for 31b went from 18 t/s to 29 t/s for 'write song' and 40 t/s for 'write code for...'. The more code there is, the faster the speed. Just fantastic!
IrisColt@reddit
Thanks!!! Speeding-up the 31B model would be amazing!
andres_garrido@reddit
Interesting results, especially the +50% on code.
Have you noticed if the speedup actually translates into better usability for multi-file tasks?
I’ve been experimenting with codebase-level queries locally, and one thing I’m seeing is that latency matters less than context quality once you start pulling multiple files. Even small delays in retrieval or planning end up dominating.
Curious if speculative decoding still holds up when prompts get large and less predictable (e.g. multi-file reasoning vs code completion).
Uncle___Marty@reddit
Great read OP. Thanks for listing your findings, but did you use the 0.7 temp for the coding test as well? I'd be interested to know the difference with it coding at 0.0.
brahh85@reddit
can you try this as draft model?
https://huggingface.co/HauhauCS/Gemma-4-E4B-Uncensored-HauhauCS-Aggressive
Any result would be interesting: it will show whether the abliteration degrades the model, or if some areas actually improve (probably math and coding).
anonynousasdfg@reddit
And how about the accuracy rate?
PerceptionGrouchy187@reddit (OP)
The accept rates are in the table (42-63% depending on content type). And since speculative decoding is lossless — the target model always verifies every token — output quality is identical to running without it. No accuracy tradeoff.
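The greedy case makes the losslessness easy to see: every token that ends up in the output is exactly the target's own choice for that position, so the result provably matches target-only decoding. A toy sketch, not llama.cpp's actual code (real samplers at temp > 0 use a rejection-sampling rule instead of exact matching):

```python
def greedy_spec_decode(target, draft, prompt, n, k=4):
    """Toy greedy speculative decoding. `target` and `draft` map a token
    sequence to the next token id. Every kept token is exactly
    target(prefix), so the output matches target-only greedy decoding."""
    seq = list(prompt)
    while len(seq) < len(prompt) + n:
        spec = list(seq)
        for _ in range(k):                  # draft proposes k tokens
            spec.append(draft(spec))
        for i in range(len(seq), len(spec)):
            t = target(spec[:i])            # verify position i
            seq.append(t)                   # target's choice is always kept
            if t != spec[i] or len(seq) == len(prompt) + n:
                break                       # mismatch: discard remaining drafts
    return seq

# Deterministic toy "models" (stand-ins, obviously not real LLMs):
target = lambda s: (sum(s) * 3 + 1) % 7
draft = lambda s: (sum(s) * 3 + 1) % 7 if sum(s) % 2 else 0  # often wrong

plain = [1, 2]
for _ in range(10):                         # plain greedy, target only
    plain.append(target(plain))

assert greedy_spec_decode(target, draft, [1, 2], 10) == plain  # identical
```

The draft only changes how many target forward passes are needed, never which tokens come out.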
nilsfg@reddit
How did you measure the acceptance rate? I'm still relatively new to self-hosting and still learning about benchmarking, so I'm not up to speed with everything.
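One way to get the number: llama-server logs acceptance stats, and you can grep them out. The log line below is approximated, so check what your build actually prints before relying on the regex:

```python
import re

# Log format approximated from a llama-server run; verify against your
# own logs, the exact wording can differ between builds:
sample = ("slot release: id  0 | task 123 | "
          "draft acceptance rate = 0.62800 ( 157 accepted / 250 generated)")

m = re.search(r"draft acceptance rate = ([\d.]+) \( *(\d+) accepted / *(\d+) generated\)",
              sample)
rate, accepted, generated = float(m.group(1)), int(m.group(2)), int(m.group(3))
print(f"{accepted}/{generated} drafted tokens accepted ({rate:.1%})")
```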
Miserable-Dare5090@reddit
what is the actual improvement you note with turboquant fork? as I understand it there is no merge to main lcpp for this
ethertype@reddit
turboquant permits longer context by compressing the KV cache with lower loss than you would otherwise see with the same number of bits per Key and Value. So you can fit more KV cache = more context.
In addition, the performance appears to stay flat(ter) as context increases, whereas longer context without tq tends to result in fewer tokens/s.
There are a few outliers where turboquant'ed KV-cache result in increased performance, but this is largely not the case.
But turboquant is a bit finicky to get right, and how sensitive the model is to quantization of the KV cache depends on the model/model-architecture. Some handle compression of K badly, whereas V-compression is generally safer.
As I understand it, it is quite an undertaking to test and validate this for all the different back-ends. And as icing on the cake, ggerganov does not appear to be particularly convinced by either the code or the results so far.
And finally, there's been a lot of benchmarking in the turboquant discussion. But largely PPL and KLD. I understand that these values aren't always a good proxy for actual benchmark results. (knowledge, code, creative writing etc.)
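For a sense of what's being compressed: the KV cache grows linearly with context, at 2 × layers × kv_heads × head_dim values per token. A sketch with made-up dimensions (not Gemma 4's real config, and ignoring sliding-window layers, which cache far less):

```python
def kv_cache_gib(n_layers, n_kv_heads, head_dim, n_ctx, bytes_per_val):
    # K and V each store n_layers * n_kv_heads * head_dim values per token
    return 2 * n_layers * n_kv_heads * head_dim * n_ctx * bytes_per_val / 2**30

# Hypothetical 48-layer GQA model (4 KV heads of dim 128), full attention:
print(kv_cache_gib(48, 4, 128, 131072, 2))  # f16 at 128K ctx -> 12.0 GiB
print(kv_cache_gib(48, 4, 128, 131072, 1))  # ~q8_0 (1 byte/val) -> 6.0 GiB
```

q8_0 actually costs slightly more than one byte per value because of block scales; the point is the linear scaling, which is why cache quantization buys context.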
Miserable-Dare5090@reddit
Yep, I’m not asking what turboquant is. I mean specifically, in MLX jang quants, whether it really improves compression that much vs dynamic quants with full attention paths, etc.
Otherwise, AWQ performs better
ethertype@reddit
Multiple forks mentioned in the above discussion. These two stood out in my opinion:
- Amesian fork
- TheTom fork
PerceptionGrouchy187@reddit (OP)
For me it's worth it
albuz@reddit
You may also want to try to squeeze a bit of VRAM by offloading per-layer embeddings of a draft model with --override-tensor-draft "per_layer_token_embd\.weight=CPU". It should not affect inference speed in theory.
PerceptionGrouchy187@reddit (OP)
Good tip! In my case the E2B draft model's embeddings are already on CPU (~1.8GB) due to the auto-fit mechanism, so there's nothing extra to offload. But useful to know for setups where VRAM is tighter.
Danmoreng@reddit
Wouldn’t it increase performance to also have the draft model fully on GPU and instead reduce context a bit?
You can use the fit and fit-ctx params for the main model instead of ngl and ctx-size and use fit-target to create space for the draft model on GPU. That’s what I did when running Qwen3.5 with mmproj on GPU: https://github.com/Danmoreng/local-qwen3-coder-env?tab=readme-ov-file#server-optimization-details
Past-Reception-424@reddit
The E2B model as a draft candidate is a smart pick actually. Similar architecture means better acceptance rates than using a random smaller model. Almost 30 percent speedup with basically no quality loss is the dream setup for local inference
cibernox@reddit
Question: how good do you reckon the 31B is vs the MoE, and how knowledgeable is it in general? I'm building a RAG system with qwen, and while it's smart, it's not the greatest conversationalist. Gemma is much nicer to talk to.
PerceptionGrouchy187@reddit (OP)
31B dense feels smarter than the 26B MoE in my experience, but for RAG the MoE might be enough since it mostly needs to synthesize retrieved context rather than reason from scratch. And yeah, Gemma is just nicer to talk to than Qwen — hard to quantify but you feel it.
cibernox@reddit
If you had to give it a number, how much smarter is Gemma 26b over the 8B E4E in tool calling?
juaps@reddit
Hi everyone, I know this might be a silly question, but I'm curious about how you all set up the draft model. I'm using LM Studio, and I have both models, but LM Studio doesn't allow me to set Gemma 4 E2B as the draft of 31B. The instructions in the documentation are unclear, and neither Claude, GPT, nor Grok seems to know this either. Can someone please provide me with a hint? Thanks!
Beginning-Window-115@reddit
Unfortunately for me it seems to be slower on average compared to just running the model by itself.
PerceptionGrouchy187@reddit (OP)
Are you seeing the "vocabs not compatible" warning in the server logs? I had the same issue initially — turned out my 31B GGUF had add_bos_token = false (early release bug) while E2B had true, which forced token translation mode and killed all performance. Re-downloading the latest 31B GGUF from Unsloth fixed it. Check your server logs for that warning.
Beginning-Window-115@reddit
I think since I'm running UD Q3 for both of them, it's probably messing with the acceptance rate. So I'll redownload at 4-bit.
ethertype@reddit
What is and how old is your main model? You may have to re-download. main and draft models must be compatible, and the initial drop of the larger models were not fully compatible (for drafting) with the smaller ones.
Beginning-Window-115@reddit
I got it after the fix
Odd-Ordinary-5922@reddit
have you tried different values on:
PerceptionGrouchy187@reddit (OP)
Thanks for the suggestion! I ran a sweep and here are the results:
draft-max 8 is the sweet spot for mixed workloads. 16 hits 99 t/s on math but regresses on creative/translation, so the average is about the same. Updated the post.
Beginning-Window-115@reddit
How come draft-min 1 when the default is 0? Any reason?
PerceptionGrouchy187@reddit (OP)
Honestly no strong reason — I followed the setting from another post that recommended it. Haven't tested --draft-min 0 vs 1 specifically. The default might work just as well.
sid_276@reddit
Why can’t vision be used?
PerceptionGrouchy187@reddit (OP)
Vision is explicitly blocked by llama.cpp — when mmproj is loaded, the server refuses to initialize speculative decoding with "speculative decoding is not supported with multimodal". So you have to pick one or the other for now.
I tried patching the check out of the source but ran into deeper assertions in the token handling code, so it's not a trivial fix. Would be nice if upstream supported this though — the draft model only needs text tokens anyway.