Anyone on Strix Halo using NPU for speculative decoding (llama.cpp / Qwen3.6-27B)?
Posted by Intelligent-Form6624@reddit | LocalLLaMA
Using a Bosgame M5: Strix Halo (128GB RAM) on Ubuntu 24.04
According to this thread, any draft model can be used: https://www.reddit.com/r/LocalLLaMA/s/CHGiWGMbvT
Anyone using the NPU (FastFlowLM) to speed up your main models? If so, how’s it going? Any tips?
spaceman_@reddit
That thread is about self-drafting, which mostly helps when a run repeatedly goes back and iterates over its own output (not uncommon in coding agents).
In theory any model can do this type of speculative decoding, but some architectures are untested and may be buggy when using this feature.
This is different from speculative decoding with a separate draft model, which is what you're asking about. You seem to have a few misconceptions:
Both of these requirements (a compatible tokenizer and an in-process draft model) are theoretically solvable, but solving them would add so much latency to the token drafting that it would kill most or all of the benefit of using a draft model in the first place.
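To make the tokenizer point concrete, here is a toy sketch of the draft-then-verify loop behind draft-model speculative decoding (greedy variant). The "models" are just deterministic functions over integer token IDs, and all names are illustrative, not llama.cpp internals; the key thing it shows is that acceptance compares token IDs directly, so draft and main model must share one token-ID space (one tokenizer):

```python
# Toy sketch of greedy speculative decoding with a draft model.
# main_model / draft_model are stand-ins operating on integer token IDs.

def main_model(context):
    # Deterministic toy target model: next token = (sum of context) % 7.
    return sum(context) % 7

def draft_model(context):
    # Toy draft model that agrees with the target most of the time,
    # but diverges whenever the context sum is divisible by 5.
    guess = sum(context) % 7
    return (guess + 1) % 7 if sum(context) % 5 == 0 else guess

def speculative_decode(prompt, n_tokens, n_draft=4):
    """Draft model proposes up to n_draft tokens; the main model verifies
    each one. On the first mismatch we keep the main model's token and
    start a new drafting round. Acceptance is a token-ID comparison,
    which is why both models must use the same tokenizer."""
    out = list(prompt)
    while len(out) - len(prompt) < n_tokens:
        # 1. Draft phase: the cheap model proposes a short continuation.
        drafted, ctx = [], list(out)
        for _ in range(n_draft):
            t = draft_model(ctx)
            drafted.append(t)
            ctx.append(t)
        # 2. Verify phase: the main model checks each drafted token.
        ctx = list(out)
        for t in drafted:
            target = main_model(ctx)
            if target == t:
                ctx.append(t)        # accepted: IDs match
            else:
                ctx.append(target)   # rejected: take the main model's token
                break
        out = ctx
    return out[len(prompt):][:n_tokens]
```

With a greedy target model the output is identical to plain greedy decoding; the draft model only changes how many expensive verification calls are needed per emitted token.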