Performance of AMD NPU, such as Ryzen 7 8845HS in some mini PCs, for local LLM inference?
Posted by hedgehog0@reddit | LocalLLaMA | View on Reddit | 16 comments
Dear all,
Recently I bought a Beelink SER5 and noticed that the SER8 has a Ryzen 7 8845HS, which packs a CPU, GPU, and even an NPU. The specific AI performance data from AMD's website is below: https://www.amd.com/en/products/processors/laptop/ryzen/8000-series/amd-ryzen-7-8845hs.html
AI Engine Capabilities
Brand Name AMD Ryzen™ AI
Performance Up to 16 TOPS
Total Processor Performance Up to 38 TOPS
NPU Performance Up to 16 TOPS
So I was wondering how good or bad it would be for local LLM inference, or even fine-tuning?
Thanks a lot!
Pajonico@reddit
If I understood the replies, it's better to get the SER8 with the 8745HS (same as the 8845HS but without the NPU) and an external accelerator (via eGPU).
Is that right?
Ordinary_Blood_5867@reddit
The NPU might become usable with some future driver update.
Pajonico@reddit
With the recent ROCm updates the NPU is usable. It's not as fast as a full-fledged GPU, but it's usable.
WeaknessWorldly@reddit
Which one? I have not been able to use it in any way
Pajonico@reddit
Here's my working ROCm setup on a GMKtec NucBox K8 Plus (Ryzen 7 8845HS, Radeon 780M with 16GB shared VRAM), running CachyOS.
**Hardware**: Radeon 780M (Hawk Point, gfx1103; reported as gfx_target_version 110003)
**Kernel parameters** (in `/etc/default/grub` → `GRUB_CMDLINE_LINUX_DEFAULT`):
```
amd_iommu=on iommu=pt amdgpu.cwsr_enable=0
```
- `amd_iommu=on iommu=pt` — passthrough mode for IOMMU, needed for GPU device access in containers
- `amdgpu.cwsr_enable=0` — disables Compute Wave Save/Restore (CWSR), which fixes hangs/crashes with some ROCm workloads on iGPUs
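For reference, a minimal sketch of what that GRUB line might look like; the existing parameters on your machine will differ, and the regeneration command assumes an Arch-based distro like CachyOS:

```bash
# /etc/default/grub — append the ROCm-related parameters to the default line
GRUB_CMDLINE_LINUX_DEFAULT="quiet amd_iommu=on iommu=pt amdgpu.cwsr_enable=0"

# Regenerate the config and reboot for the change to take effect:
# sudo grub-mkconfig -o /boot/grub/grub.cfg
```

After rebooting, `cat /proc/cmdline` should show the new parameters.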
**Environment variable** (in `~/.zshrc`, `~/.bashrc`, or `/etc/environment`):
```bash
export HSA_OVERRIDE_GFX_VERSION=11.0.2
```
This is the key one: the 780M reports as gfx1103, which ROCm doesn't officially support. Setting `HSA_OVERRIDE_GFX_VERSION=11.0.2` makes it behave as gfx1102 (Radeon RX 7600 series), which is supported. Without this, most ROCm apps will refuse to run or crash.
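For context on the naming (my own illustration of the convention, not from the ROCm docs): the `major.minor.step` override string corresponds to the `gfx` target name with the digits concatenated, and `gfx_target_version` packs the same digits as `major*10000 + minor*100 + step`:

```bash
# Decode an HSA_OVERRIDE_GFX_VERSION string into the gfx target name
# and the gfx_target_version integer reported by the kernel driver.
ver="11.0.2"
major=${ver%%.*}; rest=${ver#*.}; minor=${rest%%.*}; step=${rest#*.}
echo "gfx${major}${minor}${step}"                             # gfx1102
echo "gfx_target_version $(( major*10000 + minor*100 + step ))"  # 110002
```

The same arithmetic explains why the 780M's 11.0.3 shows up as gfx_target_version 110003.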
**Device permissions**: Make sure your user is in the `render` and `video` groups:
```bash
sudo usermod -aG render,video $USER
```
Devices needed: `/dev/kfd` (kernel fusion driver) and `/dev/dri/renderD128` (render node).
**For Docker/containers** (e.g. Immich ML, Ollama):
```yaml
devices:
- /dev/kfd:/dev/kfd
- /dev/dri:/dev/dri
environment:
- HSA_OVERRIDE_GFX_VERSION=11.0.2
group_add:
- video
- render
```
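For a fuller picture, here's what those fields look like in a complete compose file. The service name, port, and volume are my assumptions; `ollama/ollama:rocm` is the ROCm build of the official Ollama image:

```yaml
services:
  ollama:
    image: ollama/ollama:rocm
    devices:
      - /dev/kfd:/dev/kfd
      - /dev/dri:/dev/dri
    environment:
      - HSA_OVERRIDE_GFX_VERSION=11.0.2
    group_add:
      - video
      - render
    ports:
      - "11434:11434"
    volumes:
      - ollama:/root/.ollama
volumes:
  ollama:
```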
**ROCm version**: 7.2.1 (via CachyOS repos — includes pytorch-opt-rocm, hipblas, rocblas, miopen-hip, etc.)
**VRAM**: 16GB shared (configurable in BIOS — some NucBox models default to 4GB, you want to set it to max). Check with:
```bash
cat /sys/class/drm/card*/device/mem_info_vram_total
# Should show 17179869184 (16 GB)
```
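That byte count is easy to sanity-check yourself (pure arithmetic, nothing system-specific):

```bash
# 17179869184 bytes / 1024^3 = 16 GiB exactly
bytes=17179869184
echo "$(( bytes / 1024 / 1024 / 1024 )) GiB"
```

If you see a number like 4294967296 (4 GiB) instead, the BIOS carveout is still at its default.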
This setup runs PyTorch with ROCm, Immich ML (photo recognition), and various LLM inference workloads without issues. The 780M is surprisingly capable for an iGPU — roughly equivalent to a discrete GTX 1650 in ML benchmarks.
TheActualStudy@reddit
Short answer: AMD's 780M iGPU is one of the slowest options. https://dev.to/maximsaplin/running-local-llms-cpu-vs-gpu-a-quick-speed-test-2cjn
This architecture can dynamically allocate memory between the CPU and the GPU/NPU internally. However, there's an issue with using "UMA" to extend VRAM dynamically: the current PyTorch and llama.cpp implementations run slower than on CPU, and not all BIOSes support explicitly allocating VRAM arbitrarily. Even when you can get a model loaded into the BIOS-allocated VRAM, the speed boost is modest. Overall, I don't think this is good bang-for-buck if you don't already own it. However, if you're a chip integrator and you design the firmware with LLM enthusiasts in mind, there is some potential here.
VayuAir@reddit
This is being changed with Linux kernel 6.10
https://lore.kernel.org/dri-devel/CAPM=9txzvSpHASKuse2VFjbdVKftTfWNtPP8Jibck6jC_n_c1Q@mail.gmail.com/
https://www.phoronix.com/news/Linux-6.10-AMDKFD-Small-APUs
AMD engineer Lang Yu summed up the change as: "Small APUs(i.e., consumer, embedded products) usually have a small carveout device memory which can't satisfy most compute workloads memory allocation requirements.
We can't even run a Basic MNIST Example with a default 512MB carveout. https://github.com/pytorch/examples/tree/main/mnist.
Though we can change BIOS settings to enlarge carveout size, which is inflexible and may bring complaint. On the other hand, the memory resource can't be effectively used between host and device.
The solution is MI300A approach, i.e., let VRAM allocations go to GTT."
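If you want to see the split the patch is talking about on your own machine, the amdgpu driver exposes both pools through sysfs; the paths below are standard amdgpu nodes, though the card index may vary:

```bash
# Kernel version (the GTT-spill behavior for small APUs landed in 6.10):
uname -r
# BIOS carveout ("VRAM") vs. GTT (driver-managed system RAM), in bytes:
cat /sys/class/drm/card*/device/mem_info_vram_total 2>/dev/null || true
cat /sys/class/drm/card*/device/mem_info_gtt_total  2>/dev/null || true
```

On a small-carveout APU with a new enough kernel, the GTT figure is what compute allocations can actually grow into.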
kryptkpr@reddit
NPU is just using local RAM right, no dedicated fast GDDR/HBM? That's going to be the bottleneck for LLMs.
hedgehog0@reddit (OP)
Yes I believe so.
kryptkpr@reddit
Performance will then largely depend on how much memory bandwidth you have and, to a lesser degree, on which quantizations these kernels support (fewer bits = less memory = faster).
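A rough back-of-envelope illustrates why: decode speed for a memory-bound LLM is capped by bandwidth divided by model size, since every weight is streamed once per generated token. The numbers below are illustrative assumptions, not measurements:

```bash
# Upper bound: tokens/sec ≈ memory bandwidth / bytes of weights per token.
bw_gbs=90        # assume dual-channel DDR5-5600, ~90 GB/s peak
q4_gb=4          # assume a ~7B model at ~4-bit quantization
f16_gb=14        # same model at FP16
echo "Q4 ceiling:  $(( bw_gbs / q4_gb )) tok/s"
echo "F16 ceiling: $(( bw_gbs / f16_gb )) tok/s"
```

Real throughput lands below these ceilings, but the ratio shows why lower-bit quantization helps so much on shared-RAM hardware.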
hedgehog0@reddit (OP)
Thank you for the explanation. I also heard that the P100 is popular for LLM purposes; how does it compare to the P40?
Also, I read on this sub that the P40 is a server GPU, meaning it's intended to be installed in a server rack with external cooling. What would be more budget-friendly, other than a 3060 or a second-hand 3090/4090?
kryptkpr@reddit
I have both P40 and P100 because I couldn't decide! Batch performance of the P40 is poor compared to the P100, but single-stream can in some cases actually be better.
My P40 is in a proper rack machine intended for them; my P100s live in a home-built frame floating above the server.
You can 3D-print adapters for various fans and blowers, or some people just use aluminum foil tape to attach them directly to the GPUs.
anobfuscator@reddit
You can get a 3D-printed fan shroud and fan for a P40 for like $20.
Kafka-trap@reddit
Like many others have said, memory speed is the limiting factor on performance. That said, AMD APUs with a 256-bit memory bus that should support reasonably fast LPDDR5 are close to being released (hopefully it will come with this announcement):
https://videocardz.com/newz/amd-reportedly-set-to-launch-next-gen-ryzen-for-mini-pcs-in-august
fallingdowndizzyvr@reddit
It's not compute power holding back LLM inference speed; it's memory bandwidth. The NPU uses slow system RAM, so whether you run on the CPU, GPU, or NPU, it'll be bottlenecked by that.
dsjlee@reddit
There is a tutorial from AMD on how to prepare a model to run on the AMD NPU: Developer Blog: Build a Chatbot with Ryzen™ AI Pro... - AMD Community