PSA: Ubuntu 26.04 makes it easier to get started with AMD XDNA2 NPU
Posted by jfowers_amd@reddit | LocalLLaMA | View on Reddit | 12 comments
DevelopmentBorn3978@reddit
On a side note, I've just discovered that on Strix Halo (using Linux) the NPU power mode can be switched from "performance" (or "default") to "turbo" through the command xrt-smi configure -d 0000:c6:00.1 --pmode turbo (where 0000:c6:00.1 is the BDF reported by the command xrt-smi examine). Still to be tested to quantify the effective performance gains, though.
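For anyone who wants to try this, a minimal sketch of the sequence (the BDF is machine-specific; the --pmode flag and its values are taken from the comment above, so verify against xrt-smi --help on your own system):

```sh
# Find the NPU's BDF (bus:device.function) address
xrt-smi examine

# Switch the NPU power mode to turbo (substitute your own BDF)
xrt-smi configure -d 0000:c6:00.1 --pmode turbo

# Revert afterwards (assumption: "default" is an accepted value here)
xrt-smi configure -d 0000:c6:00.1 --pmode default
```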
DevelopmentBorn3978@reddit
On a side note to the side note: running the same prompt as above with llama-cli compiled for Vulkan, using the same model (ud-q8_k_xl), gives these statistics: [ Prompt: 370.7 t/s | Generation: 43.1 t/s ], even though it takes longer to answer.
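For reference, a rough sketch of how one might reproduce the Vulkan numbers with llama.cpp (the build flags and CLI options are standard llama.cpp usage; the model path and prompt are placeholders):

```sh
# Build llama.cpp with the Vulkan backend enabled
cmake -B build -DGGML_VULKAN=ON
cmake --build build --config Release

# Run the same prompt against the same GGUF model;
# prompt/generation timings are printed when the run finishes
./build/bin/llama-cli -m model-ud-q8_k_xl.gguf -p "your prompt here" -n 256
```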
Glittering-Call8746@reddit
How useful is a 2B model? Use case pls?
RobotRobotWhatDoUSee@reddit
Very interested, but don't know much about NPU performance. On something like a strix halo machine, should I think of this as a way to run another small fast model in parallel with a bigger slower model on the igpu?
Or what should I think of as NPU use cases?
ImportancePitiful795@reddit
You are in for a treat with what the NPU can do.
https://www.youtube.com/@FastFlowLM-YT
For a tiny ~1.5W chip inside the I/O die (about the size of a little fingernail), it is extremely powerful.
A full chiplet-sized version (no bigger than the 8-core CCD of the AMD 395) could run far larger models at high speed with massive efficiency.
RobotRobotWhatDoUSee@reddit
Ah, interesting, thanks for the NPU experiments channel!
ImportancePitiful795@reddit
There are so many things for Strix Halo currently in the works. There is even a beta for full MLX support.
DevelopmentBorn3978@reddit
Being quite energy efficient and quite fast, especially regarding the time to first token, one use case I've envisioned so far for the NPU is near-real-time voice transcription without bogging down the CPU/GPU. My prototype experiments in this area have yielded nice early results, mostly by accessing the FastFlowLM server (part of the Lemonade framework, yet independently usable) serving Whisper models.

Hopefully more will come if/when the IRON compiler gets incorporated not just into Windows-only Copilot+ stuff but also into projects like llama.cpp (or maybe vLLM), and when it becomes more feasible to independently train/quantize/fine-tune more recent models for the NPU than the ones shared by AMD through onnxmodelzoo and similar repositories, including hybrid mode (which should be a sort of speculative decoding, where the prompt/RAG is first computed by the fast but less capable NPU and then passed to the more powerful but slower GPU). Other exciting use cases could be related to fast visual recognition.
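If you want to poke at the transcription idea yourself, Lemonade exposes a local server; assuming it follows the OpenAI-style audio API (the port, endpoint path, and model name below are assumptions, so check the Lemonade/FastFlowLM docs), a request could look like:

```sh
# Hypothetical request to a local Lemonade/FastFlowLM server serving a
# Whisper model; port, path, and model name are assumptions, not confirmed.
curl http://localhost:8000/v1/audio/transcriptions \
  -F file=@clip.wav \
  -F model=whisper-base
```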
RobotRobotWhatDoUSee@reddit
Oh very interesting use case, thank you!
DevelopmentBorn3978@reddit
Thanks a lot for the much-needed Linux advancements for the NPU! Q: would it be as easy to install/upgrade on Arch (and derivative) distros, or on any other shades of penguin, as it is on Ubuntu?
jfowers_amd@reddit (OP)
Cheers! I’m not an expert on Arch, but there are maintainers on the Lemonade Discord who would be happy to help you.
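One distro-agnostic sanity check: the amdxdna driver landed in mainline Linux 6.14, so on any distro you can verify the kernel side before worrying about userspace packages (the device paths below are the usual DRM accel-subsystem ones; adjust as needed):

```sh
# Check the running kernel (amdxdna was mainlined in Linux 6.14)
uname -r

# Confirm the amdxdna module is available and loaded
modinfo amdxdna | head -n 3
lsmod | grep amdxdna

# The NPU should appear under the DRM accel subsystem
ls /dev/accel/
```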