Speculative decoding in llama.cpp for Gemma 4 31B IT / Qwen 3.5 27B?

Posted by No_Algae1753@reddit | LocalLLaMA | 34 comments

Has anyone here tested speculative decoding in llama.cpp with Gemma 4 31B IT or Qwen 3.5 27B?

For Gemma, I was thinking of pairing it with a smaller draft model from the same family.
For Qwen 3.5, I'm not sure speculative decoding works well in llama.cpp at all.
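For context, here's roughly what a draft-model setup looks like with llama-server. This is a sketch, not a tested config: the flag names (`--model-draft`, `--draft-max`, `--draft-min`, `-ngld`) follow recent llama.cpp builds but can change between versions, so check `llama-server --help` on your build, and the model file names are placeholders:

```shell
# Hypothetical sketch: serve a target model with a smaller same-family draft model.
# Flag names follow recent llama.cpp builds; verify with `llama-server --help`.
# Model file names are placeholders, not real releases.
./llama-server \
  --model models/gemma-large-it.Q4_K_M.gguf \
  --model-draft models/gemma-small-it.Q4_K_M.gguf \
  --draft-max 16 \
  --draft-min 1 \
  -ngl 99 -ngld 99 \
  --ctx-size 8192
```

The key constraint is that the draft and target models need compatible tokenizers/vocabularies (which is why a smaller same-family model is the usual choice), and the speedup depends on how often the target accepts the draft's proposed tokens.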

If you've tried it, which draft model worked best, and did you see a real speedup?