Multi-Token Prediction (MTP) for llama.cpp - 40% speedup on Gemma 4

Posted by gladkos@reddit | LocalLLaMA | View on Reddit | 107 comments

Implemented Multi-Token Prediction (MTP) in llama.cpp.

Quantized the Gemma 4 assistant models to GGUF.

Ran tests on a MacBook Pro M5 Max. With MTP, Gemma 4 26B decodes tokens about 40% faster.
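The post doesn't show the patch itself, so here is a minimal toy sketch of the general draft-and-verify idea behind multi-token prediction / speculative decoding: a cheap draft proposes several tokens per step, and the target model keeps the longest prefix it agrees with. `draft_model` and `target_model` are stand-in callables (token ids in, logits out), not llama.cpp APIs.

```python
def greedy_next(logits):
    # pick the highest-scoring token id
    return max(range(len(logits)), key=logits.__getitem__)

def mtp_decode(target_model, draft_model, prompt, n_new, k=4):
    """Generate n_new tokens; the draft proposes k tokens per step and
    the target accepts the longest matching prefix."""
    tokens = list(prompt)
    while len(tokens) < len(prompt) + n_new:
        # 1. draft k candidate tokens cheaply
        draft, ctx = [], tokens[:]
        for _ in range(k):
            t = greedy_next(draft_model(ctx))
            draft.append(t)
            ctx.append(t)
        # 2. verify: the target checks each drafted position in order
        #    (a real implementation scores all k positions in one batched pass)
        accepted, ctx = 0, tokens[:]
        for t in draft:
            if greedy_next(target_model(ctx)) == t:
                accepted += 1
                ctx.append(t)
            else:
                break
        # 3. keep the accepted prefix; on a mismatch, take the
        #    target's own token so output equals plain target decoding
        tokens.extend(draft[:accepted])
        if accepted < k:
            tokens.append(greedy_next(target_model(tokens)))
    return tokens[len(prompt):][:n_new]
```

The speedup comes from step 2: verifying k drafted tokens costs one target forward pass in practice, so every accepted token is a target pass saved.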

Prompt: Write a Python program to find the nth Fibonacci number using recursion
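For reference, the kind of program this benchmark prompt elicits, in its minimal naive-recursion form (the fib(0)=0, fib(1)=1 indexing convention is assumed):

```python
def fib(n):
    """Return the nth Fibonacci number using naive recursion."""
    if n < 2:
        return n  # base cases: fib(0) = 0, fib(1) = 1
    return fib(n - 1) + fib(n - 2)
```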

Results:

llama.cpp: 97 tokens/s

llama.cpp + MTP: 138 tokens/s
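As a quick sanity check, the two throughput figures above imply roughly a 42% speedup, consistent with the ~40% headline claim:

```python
baseline = 97   # tokens/s, llama.cpp
mtp = 138       # tokens/s, llama.cpp + MTP
speedup = mtp / baseline - 1
print(f"{speedup:.0%}")  # → 42%
```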

Gemma 4 assistant GGUF quantized models: https://huggingface.co/collections/AtomicChat/gemma-4-assistant-gguf

Local AI models app: http://atomic.chat

Patched llama.cpp: https://github.com/AtomicBot-ai/atomic-llama-cpp-turboquant