Turns out Gemma 4 had MTP (multi-token prediction) all along

Posted by Electrical-Monitor27@reddit | LocalLLaMA

Hey everyone. While I was trying to run Gemma 4 through the LiteRT API in my Android app, I noticed that loading the model on my Google Pixel 9 test device threw errors about the "mtp weights being an incompatible tensor shape". I did some digging and found out that the LiteRT files contain additional MTP prediction heads, which are there for speculative decoding and much faster output.
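For anyone wondering why extra prediction heads would speed things up, here is a toy sketch of greedy speculative decoding. To be clear, this is not Gemma's or LiteRT's actual implementation: `target_next` and `draft_next` are made-up stand-ins for the full model and a cheap drafter (such as an MTP head), and the token rule is arbitrary. The idea is that the drafter guesses `k` tokens, the full model checks them (in a real runtime, in one batched forward pass), and the longest correct prefix is kept, so the output matches plain greedy decoding while needing fewer full-model passes.

```python
from typing import Callable, List, Tuple

# A "model" here is just a function from a token context to the next token.
Model = Callable[[List[int]], int]


def target_next(ctx: List[int]) -> int:
    # Hypothetical stand-in for the full model: an arbitrary deterministic rule.
    return (sum(ctx) * 31 + len(ctx)) % 50


def draft_next(ctx: List[int]) -> int:
    # Hypothetical stand-in for a cheap drafter (e.g. an MTP head).
    # It agrees with the target except when the context length is a multiple of 4.
    guess = target_next(ctx)
    return guess if len(ctx) % 4 else (guess + 1) % 50


def greedy_generate(prompt: List[int], n_new: int, target: Model) -> List[int]:
    # Baseline: one full-model call per generated token.
    out = list(prompt)
    for _ in range(n_new):
        out.append(target(out))
    return out


def speculative_generate(
    prompt: List[int], n_new: int, k: int, target: Model, draft: Model
) -> Tuple[List[int], int]:
    out = list(prompt)
    target_passes = 0  # counts verification rounds, not individual target() calls
    while len(out) - len(prompt) < n_new:
        # 1) The drafter proposes k tokens autoregressively.
        proposal: List[int] = []
        for _ in range(k):
            proposal.append(draft(out + proposal))

        # 2) The target verifies the proposal. A real runtime would score all
        #    positions in one batched forward pass; here we loop for clarity.
        target_passes += 1
        accepted: List[int] = []
        for tok in proposal:
            expected = target(out + accepted)
            if tok != expected:
                accepted.append(expected)  # fix the first wrong token, then stop
                break
            accepted.append(tok)
        else:
            # Every drafted token was right, so the target adds one bonus token.
            accepted.append(target(out + accepted))

        out.extend(accepted)
    return out[: len(prompt) + n_new], target_passes
```

Calling `speculative_generate([1, 2, 3], 20, 4, target_next, draft_next)` returns the same tokens as `greedy_generate([1, 2, 3], 20, target_next)`, but with fewer verification rounds than generated tokens. The actual speedup on a real model depends on how often the drafter is right and how cheap it is compared to the full model.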

Well, it turns out I got confirmation today from a Google employee that Gemma 4 DOES INDEED have MTP, but it was "removed on purpose" for "ensuring compatibility and broad usability".

To be honest, it would've been great if they had released the full model instead, considering we already didn't get the Gemma 124B model that was accidentally leaked in Jeff Dean's tweet. Much faster Gemma 4 generation would have been nice to have, ideally on the already fast MoE. Maybe someone can reverse engineer this and extract the tensors and the math from the compute graph in LiteRT?

Here's a link to the conversation:

https://huggingface.co/google/gemma-4-E4B-it/discussions/5