Gemma 4 MTP vs DFlash on 1x H100: dense vs MoE results

Posted by LayerHot@reddit | LocalLLaMA

Benchmarked Gemma 4 MTP and z-lab's DFlash on a single H100 80GB using vLLM and NVIDIA's SPEED-Bench qualitative dataset.
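For anyone setting up a similar run, draft-model speculative decoding in recent vLLM versions is configured roughly like this. This is a sketch, not the exact command from these benchmarks: the model names are placeholders, and the `--speculative-config` JSON fields assume a recent vLLM release, so check your version's docs.

```shell
# Sketch of a vLLM launch with draft-model speculative decoding on one GPU.
# <target-model> and <draft-model> are placeholders; num_speculative_tokens
# is the draft length per verification step and is workload-dependent.
vllm serve <target-model> \
  --gpu-memory-utilization 0.9 \
  --speculative-config '{"model": "<draft-model>", "num_speculative_tokens": 4}'
```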

Setup:

Results:

For a real deployment, try both approaches on your own setup and workload instead of assuming one is always better; the winner can change with the model, prompts, hardware, and serving configuration. Hope these numbers give people a useful reference point.
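A useful rule of thumb when reasoning about why results like these flip between setups: in the standard speculative decoding analysis, the expected number of tokens committed per target-model verification step depends only on the draft length and the per-token acceptance rate. A minimal sketch (the function name and the example numbers are mine, for illustration):

```python
def expected_tokens(alpha: float, k: int) -> float:
    """Expected tokens committed per verification step in speculative
    decoding, given draft length k and per-token acceptance rate alpha
    (geometric-series result from the standard analysis)."""
    if alpha == 1.0:
        return k + 1  # every drafted token plus the bonus token is accepted
    return (1 - alpha ** (k + 1)) / (1 - alpha)

# With an 80% acceptance rate and 4 drafted tokens per step, each target
# forward pass commits about 3.36 tokens on average.
print(expected_tokens(0.8, 4))
```

Since acceptance rate varies with the prompt distribution, the same method can win on one workload and lose on another, which is why per-workload benchmarking matters.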

All the benchmark scripts and configuration needed to reproduce these results are in the GitHub repository.

More results and in-depth analysis are in our blog post: https://jarvislabs.ai/blog/gemma-4-mtp-vs-dflash-benchmark