Gemma 4 MTP vs DFlash on 1x H100: dense vs MoE results
Posted by LayerHot@reddit | LocalLLaMA | 27 comments
Benchmarked Gemma 4 MTP and z-lab's DFlash on a single H100 80GB using vLLM and NVIDIA's SPEED-Bench qualitative dataset.
Setup:
- Hardware: 1x H100 80GB
- Runtime: vLLM
- Dataset: SPEED-Bench qualitative
- Prompts: 880 total, 80 per category across 11 categories
- Models: google/gemma-4-31B-it and google/gemma-4-26B-A4B-it
- MTP drafts: Google's matching Gemma 4 assistant models
- DFlash drafts: z-lab's matching Gemma 4 DFlash models
- MTP used num_speculative_tokens=8
- DFlash used num_speculative_tokens=15
- Context length / max model length: 32768
- Temperature: 0
- Prefix caching was disabled
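To make the setup concrete, here is a minimal sketch of the kind of vLLM launch these settings map to. The exact scripts are in the repo; the `speculative_config` method name and the draft-model wiring below are assumptions, so treat this as illustrative rather than the actual commands:

```python
# Rough sketch of one MTP run (illustrative only, not the exact benchmark script).
# The speculative_config keys follow vLLM's documented interface; the "method"
# value and the DFlash draft-model IDs are assumptions, since the exact z-lab
# repo names are not listed above.
from vllm import LLM, SamplingParams

llm = LLM(
    model="google/gemma-4-31B-it",
    max_model_len=32768,
    gpu_memory_utilization=0.95,
    enable_prefix_caching=False,          # prefix caching disabled for all runs
    speculative_config={
        "method": "mtp",                  # assumption: swap for the DFlash method on those runs
        "num_speculative_tokens": 8,      # 8 for MTP, 15 for DFlash
    },
)

out = llm.generate(
    ["Write a binary search in Python."],
    SamplingParams(temperature=0, max_tokens=256),
)
print(out[0].outputs[0].text)
```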
Results:
- For Gemma 4 31B dense, MTP was 3.11x faster and DFlash was 3.03x faster than baseline decoding at concurrency 1. Baseline hit 40.3 output tok/s, MTP hit 125.3 output tok/s, and DFlash hit 122.1 output tok/s. At concurrency 16, baseline reached 375 tok/s, MTP reached 953 tok/s, and DFlash reached 725 tok/s.
- For Gemma 4 26B-A4B MoE, the result flipped. DFlash was 1.73x faster and MTP was 1.49x faster than baseline decoding at concurrency 1. Baseline hit 177.1 output tok/s, MTP hit 264.2 output tok/s, and DFlash hit 306.4 output tok/s. At concurrency 16, baseline reached 975 tok/s, MTP reached 1808 tok/s, and DFlash reached 1957 tok/s.
- The MoE speedups were smaller than the dense-model speedups because the baseline MoE target is already relatively cheap to run. Gemma 4 26B-A4B has 25.2B total parameters but only 3.8B active parameters during inference, so speculative decoding has less target-model compute to remove than with the dense 31B model.
- The gains were not uniform across workloads. Coding, math, STEM, and reasoning benefited more because these tasks often have more predictable token patterns. Writing, summarization, and roleplay improved less because there are many valid ways for the model to continue the text.
- Higher per-position acceptance did not automatically mean higher throughput. MTP accepted more draft tokens, but DFlash showed better throughput on the MoE model. Acceptance is only one side of it: DFlash drafts the whole block in a single forward pass, while MTP drafts token by token. When the target is this fast, the cheaper draft path can matter more even with lower acceptance (the toy model after this list illustrates the tradeoff).
- Most accepted draft tokens came from the first few positions. Position-1 acceptance was around 80% for MTP and 75% for DFlash, but by position 8 it dropped to under 20% for both.
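To make the acceptance-vs-throughput point concrete, here is a toy model of one draft-and-verify cycle. All numbers are illustrative assumptions, not measured values; it only shows why a cheaper block draft can beat a higher-acceptance token-by-token draft:

```python
# Back-of-envelope model for the "acceptance isn't everything" point above.
# Every number below is an illustrative assumption, not a benchmark result.

def expected_accepted(per_position_acceptance):
    """Expected number of accepted draft tokens when acceptance must be
    consecutive from position 1 (a rejection discards the rest of the draft)."""
    expected, survive = 0.0, 1.0
    for p in per_position_acceptance:
        survive *= p          # probability the chain is still alive at this position
        expected += survive
    return expected

def tokens_per_second(per_position_acceptance, t_draft_s, t_verify_s):
    """Tokens emitted per second of wall time for one draft+verify cycle.
    Each cycle emits the accepted draft tokens plus one token from the target."""
    emitted = expected_accepted(per_position_acceptance) + 1.0
    return emitted / (t_draft_s + t_verify_s)

# Hypothetical costs: MTP drafts 8 tokens sequentially (higher acceptance, costlier
# draft); DFlash drafts a 15-token block in one pass (lower acceptance, cheaper draft).
mtp    = tokens_per_second([0.80, 0.70, 0.60, 0.50, 0.40, 0.30, 0.25, 0.20], 0.008, 0.010)
dflash = tokens_per_second([0.75, 0.60, 0.45, 0.35, 0.25, 0.20, 0.15, 0.12,
                            0.10, 0.08, 0.06, 0.05, 0.04, 0.03, 0.02], 0.002, 0.010)
print(f"MTP ~{mtp:.0f} tok/s, DFlash ~{dflash:.0f} tok/s")  # the cheaper draft can win
```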

For a real deployment, try both approaches on your own setup and workload instead of assuming one will always be better. The results can change with the model, prompts, hardware, and serving configuration. Hope these numbers give people a useful reference point.
All the benchmark setup and the scripts needed to reproduce these results are in the GitHub repository.
You can find more results and in-depth analysis in our blog: https://jarvislabs.ai/blog/gemma-4-mtp-vs-dflash-benchmark
bettertoknow@reddit
I cannot reproduce this at all on a single H100. I just tried with 31B and MTP, using 0.20.2rc1.dev49+g9b4e83934.
There is no hard log data in the results you've shared, which leads me to suspect the LLM that orchestrated the tests and crunched the numbers picked up the average draft token generation speed rather than the actual average output token speed. Especially with num_speculative_tokens=8 for MTP: the acceptance rate is only >50% on the first 3-4 token positions, and after that it's near 0% and just overhead/wasted compute. But it would show >100 t/s draft token generation speed at concurrency 1.
Can you please spot-check some of your results?
LayerHot@reddit (OP)
Btw the published tok/s is the `output_throughput` field from the JSON (completed tokens / wall time)
All 440 raw JSONs are here if you want to spot-check:
https://huggingface.co/datasets/Gladiator/gemma4-mtp-dflash-speed-bench-results
The "125 tok/s @ c=1" for 31B MTP is the mean across 11 SPEED-Bench categories — per-category it ranges from 76 (roleplay, acc_len 2.62) to 173 (coding, acc_len 5.90). Your acc_len 2.95 sits right in our roleplay/QA range, so prompt mix probably explains a lot of the gap.
bettertoknow@reddit
Thank you for the raw data, this clears it up -- I was leaning too much on the misleading 'max_output_tokens_per_s' figure, where without MTP enabled it's e.g. 42.0 and with MTP enabled it's 32.0 (gemma4_31b_baseline_summarization_c1.json / gemma4_31b_mtp_nt8_summarization_c1.json)
...and the better/correct metric to follow (as you've mentioned) is output_throughput, which does show the ~2-3x increase.
leonbollerup@reddit
i love the focus on performance here..
but what about the quality.. is anyone actually testing quality of the output ??
IrisColt@reddit
In my use case, it tears through complex probabilistic problems and tricky simplex integrals every bit as accurately as the original Gemma 4.
leonbollerup@reddit
i have no doubt it's good.. but it should be verified - i am building a test platform for this - but i am nowhere near ready.. it seems to me that we need verified data on the actual quality and not just a focus on performance..
IrisColt@reddit
Thanks for your effort, systematic verification is always welcome.
IrisColt@reddit
How is your baseline only 25% faster than my baseline when I am using an RTX 3090 and you an H100? Genuinely asking.
danish334@reddit
Nice. I too noticed the acceptance rate of dflash wasn't as good as mtp but zlab do mention lossless inference. You should benchmark their claim.
fiery_prometheus@reddit
The draft tokens are verified by the larger model, incorrect predictions should be rejected, the process is inherently "lossless"? You just waste compute in case of rejections afaik
danish334@reddit
U r right.
Organic_Scarcity_495@reddit
solid benchmarks, thanks for sharing. the position-1 vs position-8 acceptance dropoff is the key detail — most of the speedup comes from the first 2-3 draft tokens. that lines up with what we've seen too. the practical takeaway is that for short generations (like classification or routing calls) speculative decoding barely helps. it shines on long-form where the draft has room to accumulate savings across many positions.
-elmuz-@reddit
How can DFlash produce overall throughput that is on par with or faster than MTP's, given that DFlash has a significantly lower acceptance rate? Or did I misunderstand the numbers here?
coder543@reddit
DFlash uses diffusion to generate a larger batch of tokens. MTP is autoregressive, one token at a time, and it isn't worth generating more than 2 or 3 tokens with MTP. For the same cost, you can generate more tokens in DFlash, and when those additional tokens are right, they are helpful for performance.
Qwoctopussy@reddit
even if generating 15 tokens with dflash is not more expensive than generating less, do you still pay for it in the main model’s PP time? i.e., could you see performance gains from truncating the drafted tokens even if it doesn’t make the draft faster?
that seemed to be the case when i was messing with ngram spec decode settings — if i set it to draft too many tokens, it would sit there for a while before rejecting the draft. there’s a sweet spot in draft length that maximizes speed—not too short, not too long—even when generating the draft is basically free
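the kind of sweep i mean looks roughly like this (one run per draft length; the ngram option names are from memory, so double-check them against your vLLM version):

```python
# Rough sketch of sweeping the draft length for ngram speculative decoding.
# Run once per k (e.g. from a shell loop); the config key names should be verified
# against your vLLM version, and the model/prompts here are placeholders.
import sys
import time
from vllm import LLM, SamplingParams

k = int(sys.argv[1])                                  # draft length to test
prompts = ["Summarize the following log ..."] * 8     # placeholder workload
params = SamplingParams(temperature=0, max_tokens=512)

llm = LLM(
    model="google/gemma-4-26B-A4B-it",
    speculative_config={
        "method": "ngram",
        "num_speculative_tokens": k,
        "prompt_lookup_max": 4,                       # assumed option name
    },
)

start = time.perf_counter()
outputs = llm.generate(prompts, params)
elapsed = time.perf_counter() - start
tokens = sum(len(o.outputs[0].token_ids) for o in outputs)
print(f"k={k}: {tokens / elapsed:.1f} output tok/s")
```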
coder543@reddit
It depends on how much compute you have relative to how much bandwidth you have. CPUs suck at drafting, for example.
-elmuz-@reddit
Ok, I got the point thank you both
Kitchen-Year-8434@reddit
Lower acceptance but more tokens generated faster. If you can gen 15 dflash in the time you gen 5 mtp for instance, even if acceptance is lower it can come out on top in aggregate.
AdventurousFly4909@reddit
DFlash is better for MoE models with few active parameters, like Qwen 35B, because for the price of 3 tokens with normal MTP you can generate 15 with DFlash (idk, I am just guessing, but it should be around that figure). So when tokens 4, 5, 6, etc. hit, you will obviously have better throughput.
FullOf_Bad_Ideas@reddit
That's great. It looks like there's nothing that really meaningfully speeds up tasks that don't have boilerplate and are actually information dense, like roleplay. You can't skip a big model when you need brains.
FBIFreezeNow@reddit
Slower than expected on a H100. Weird..
MadPelmewka@reddit
Heard about DDtree? Can it be tested with Gemma 4 right now?
DunderSunder@reddit
What was the prompt length and does it matter?
MrLlamaGnome@reddit
Nice writeup! As a GTX 1050 3GB potato enjoyer, I wonder how this comparison would change on more constrained hardware... Is one method more compute or I/O dependent in practice than the other?
LayerHot@reddit (OP)
Thanks! I only tested on an H100, so I would not extrapolate too hard to truly constrained cards. The main constraint is VRAM first: both approaches need the target model plus the draft model plus the KV cache. My guess is that on smaller GPUs the gains can shrink or disappear, because the draft-model overhead starts competing with the target model.
EveningIncrease7579@reddit
What about peak VRAM usage, is there a difference?
LayerHot@reddit (OP)
I set `--gpu-memory-utilization 0.95` in vLLM for all runs, so I didn't measure or profile the exact peak usage since it occupied 95% of VRAM anyway.