Anyone on Strix Halo using NPU for speculative decoding (llama.cpp / Qwen3.6-27B)?

Posted by Intelligent-Form6624@reddit | LocalLLaMA

Setup: a Bosgame M5 with Strix Halo (128GB RAM) on Ubuntu 24.04.

According to this thread, llama.cpp's speculative decoding can accept pretty much any draft model: https://www.reddit.com/r/LocalLLaMA/s/CHGiWGMbvT
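For reference, a speculative-decoding run with llama.cpp's server looks roughly like this. This is a sketch, not from the linked thread: the model paths are placeholders, and the flag names match recent llama.cpp builds but may differ in older versions.

```shell
# Sketch only: speculative decoding with llama-server (llama.cpp).
# Model paths and offload values below are illustrative assumptions.
#   -m          : large target model
#   -md         : small draft model (needs a compatible vocabulary)
#   --draft-max : max tokens the draft model proposes per step
#   -ngl/-ngld  : GPU layers to offload for target / draft model
llama-server \
  -m models/target.gguf \
  -md models/draft.gguf \
  --draft-max 16 \
  --draft-min 1 \
  -ngl 99 \
  -ngld 99 \
  --port 8080
```

A quick sanity check is to benchmark tokens/s with and without `-md` on the same prompt; if the draft's acceptance rate is low, speculation can actually slow generation down.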

Is anyone using the NPU (via FastFlowLM) to run a draft model and speed up their main models? If so, how's it going? Any tips?