Anyone on Strix Halo using NPU for speculative decoding (llama.cpp / Qwen3.6-27B)?
Posted by Intelligent-Form6624@reddit | LocalLLaMA
Using a Bosgame M5: Strix Halo (128GB RAM) on Ubuntu 24.04
According to this thread, any draft model can be used: https://www.reddit.com/r/LocalLLaMA/s/CHGiWGMbvT
Anyone using the NPU (FastFlowLM) to speed up your main models? If so, how’s it going? Any tips?
spaceman_@reddit
That thread is about self-drafting, which mostly helps when a run repeatedly goes back and iterates over its own output (not uncommon in coding agents).
In theory any model can do this type of speculative decoding, but some architectures are untested and may be buggy when using this feature.
This is different from speculative decoding with a separate draft model, which is what you're asking about. You seem to have a few misconceptions:
Both of these requirements (a compatible tokenizer and an in-process draft model) are theoretically solvable, but solving them would add so much latency to the token drafting that it would kill most or all of the benefit of using a draft model in the first place.
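To make the tokenizer point concrete, here is a toy sketch of the draft-then-verify loop behind draft-model speculative decoding (greedy variant). The "models" are just deterministic functions over integer token IDs, and all names are illustrative, not llama.cpp internals; the key thing it shows is that acceptance compares token IDs directly, so draft and main model must share one token-ID space (one tokenizer):

```python
# Toy sketch of greedy speculative decoding with a draft model.
# main_model / draft_model are stand-ins operating on integer token IDs.

def main_model(context):
    # Deterministic toy target model: next token = (sum of context) % 7.
    return sum(context) % 7

def draft_model(context):
    # Toy draft model that agrees with the target most of the time,
    # but diverges whenever the context sum is divisible by 5.
    guess = sum(context) % 7
    return (guess + 1) % 7 if sum(context) % 5 == 0 else guess

def speculative_decode(prompt, n_tokens, n_draft=4):
    """Draft model proposes up to n_draft tokens; the main model verifies
    each one. On the first mismatch we keep the main model's token and
    start a new drafting round. Acceptance is a token-ID comparison,
    which is why both models must use the same tokenizer."""
    out = list(prompt)
    while len(out) - len(prompt) < n_tokens:
        # 1. Draft phase: the cheap model proposes a short continuation.
        drafted, ctx = [], list(out)
        for _ in range(n_draft):
            t = draft_model(ctx)
            drafted.append(t)
            ctx.append(t)
        # 2. Verify phase: the main model checks each drafted token.
        ctx = list(out)
        for t in drafted:
            target = main_model(ctx)
            if target == t:
                ctx.append(t)        # accepted: IDs match
            else:
                ctx.append(target)   # rejected: take the main model's token
                break
        out = ctx
    return out[len(prompt):][:n_tokens]
```

With a greedy target model the output is identical to plain greedy decoding; the draft model only changes how many expensive verification calls are needed per emitted token.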