PSA: Ubuntu 26.04 makes it easier to get started with AMD XDNA2 NPU
Posted by jfowers_amd@reddit | LocalLLaMA | View on Reddit | 12 comments
DevelopmentBorn3978@reddit
On a side note, I've just discovered that on Strix Halo (using Linux) the NPU power mode can be switched from "performance" (or "default") to "turbo" through the command xrt-smi configure -d 0000:c6:00.1 --pmode turbo (where 0000:c6:00.1 is the BDF reported by the command xrt-smi examine). Still to be tested to quantify the effective performance gains, though.
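For anyone who wants to try this, a minimal sketch of the sequence (the BDF is machine-specific; the --pmode flag and its values are taken from the comment above, so verify against xrt-smi --help on your own system):

```sh
# Find the NPU's BDF (bus:device.function) address
xrt-smi examine

# Switch the NPU power mode to turbo (substitute your own BDF)
xrt-smi configure -d 0000:c6:00.1 --pmode turbo

# Revert afterwards (assumption: "default" is an accepted value here)
xrt-smi configure -d 0000:c6:00.1 --pmode default
```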
DevelopmentBorn3978@reddit
On a side note to the side note: running the same prompt as above with llama-cli compiled for Vulkan, using the same model (ud-q8_k_xl), gives these statistics: [ Prompt: 370.7 t/s | Generation: 43.1 t/s ], even though it takes longer to answer.
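For reference, a rough sketch of how one might reproduce the Vulkan numbers with llama.cpp (the build flags and CLI options are standard llama.cpp usage; the model path and prompt are placeholders):

```sh
# Build llama.cpp with the Vulkan backend enabled
cmake -B build -DGGML_VULKAN=ON
cmake --build build --config Release

# Run the same prompt against the same GGUF model;
# prompt/generation timings are printed when the run finishes
./build/bin/llama-cli -m model-ud-q8_k_xl.gguf -p "your prompt here" -n 256
```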
Glittering-Call8746@reddit
How useful is a 2B model? Use case pls?
RobotRobotWhatDoUSee@reddit
Very interested, but don't know much about NPU performance. On something like a strix halo machine, should I think of this as a way to run another small fast model in parallel with a bigger slower model on the igpu?
Or what should I think of as NPU use cases?
ImportancePitiful795@reddit
You are in for a treat with what the NPU can do.
https://www.youtube.com/@FastFlowLM-YT
For a tiny ~1.5W chip inside the I/O die (about the size of a little fingernail), it is extremely powerful.
A full chiplet-sized version (no bigger than the 8-core CCD of the AMD 395) could run far larger models at high speed with massive efficiency.
RobotRobotWhatDoUSee@reddit
Ah, interesting, thanks for the NPU experiments channel!
ImportancePitiful795@reddit
There are so many things for Strix Halo currently in the works. There is even a beta for full MLX support.
DevelopmentBorn3978@reddit
Being quite energy efficient and quite fast, especially regarding the time to first token, one use case I've envisioned so far for the NPU is near-real-time voice transcription without bogging down the CPU/GPU. My prototype experiments in this area have yielded nice early results, mostly by accessing the FastFlowLM server (part of the Lemonade framework, yet independently usable) serving Whisper models.

Hopefully more will come if/when the IRON compiler gets incorporated not just into Windows-only Copilot+ stuff but also into projects like llama.cpp (or maybe vLLM), and when it becomes more feasible to independently train/quantize/fine-tune more recent models for the NPU than the ones shared by AMD through onnxmodelzoo and similar repositories, including hybrid mode (which should be a sort of speculative decoding, where the prompt/RAG is first computed by the fast but less capable NPU and then passed to the more powerful but slower GPU). Other exciting use cases could be related to fast visual recognition.
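If you want to poke at the transcription idea yourself, Lemonade exposes a local server; assuming it follows the OpenAI-style audio API (the port, endpoint path, and model name below are assumptions, so check the Lemonade/FastFlowLM docs), a request could look like:

```sh
# Hypothetical request to a local Lemonade/FastFlowLM server serving a
# Whisper model; port, path, and model name are assumptions, not confirmed.
curl http://localhost:8000/v1/audio/transcriptions \
  -F file=@clip.wav \
  -F model=whisper-base
```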
RobotRobotWhatDoUSee@reddit
Oh very interesting use case, thank you!
DevelopmentBorn3978@reddit
Thanks a lot for the much-needed Linux advancements for the NPU! Q: would it be as easy to install/upgrade on Arch (and derivative) distros, or on any other shades of penguin, as it is on Ubuntu?
jfowers_amd@reddit (OP)
Cheers! I’m not an expert on Arch, but there are maintainers on the Lemonade Discord who would be happy to help you.
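One distro-agnostic sanity check: the amdxdna driver landed in mainline Linux 6.14, so on any distro you can verify the kernel side before worrying about userspace packages (the device paths below are the usual DRM accel-subsystem ones; adjust as needed):

```sh
# Check the running kernel (amdxdna was mainlined in Linux 6.14)
uname -r

# Confirm the amdxdna module is available and loaded
modinfo amdxdna | head -n 3
lsmod | grep amdxdna

# The NPU should appear under the DRM accel subsystem
ls /dev/accel/
```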