[Benchmark] RK3588 NPU vs Raspberry Pi 5 - Llama 3.1 8B, Qwen 3B, DeepSeek 1.5B tested
Posted by tre7744@reddit | LocalLLaMA | View on Reddit | 16 comments
Been lurking here for a while, finally have some data worth sharing.
I wanted to see if the 6 TOPS NPU on the RK3588S actually makes a difference for local inference compared to Pi 5 running CPU-only. Short answer: yes.
Hardware tested:
- Indiedroid Nova (RK3588S, 16GB RAM, 64GB eMMC)
- NPU driver v0.9.7, RKLLM runtime 1.2.1
- Debian 12
Results:
| Model | Nova (NPU) | Pi 5 16GB (CPU) | Difference |
|---|---|---|---|
| DeepSeek 1.5B | 11.5 t/s | ~6-8 t/s | ~1.4-1.9x faster |
| Qwen 2.5 3B | 7.0 t/s | ~2-3 t/s* | ~2.3-3.5x faster |
| Llama 3.1 8B | 3.72 t/s | 1.99 t/s | 1.87x faster |
Pi 5 8B number from Jeff Geerling's benchmarks. I don't have a Pi 5 16GB to test directly.
*Pi 5 3B estimate based on similar-sized models (Phi 3.5 3.8B community benchmarks)
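The speedup column falls straight out of the raw numbers. A quick sketch using the measured and estimated t/s values from the table (the Pi 5 figures for the two smaller models are ranges, per the footnotes):

```python
# NPU-vs-CPU speedup ranges from the benchmark table.
# (nova_tps, (pi5_low, pi5_high)) - the Llama entry is a single measured value.
results = {
    "DeepSeek 1.5B": (11.5, (6.0, 8.0)),
    "Qwen 2.5 3B":   (7.0,  (2.0, 3.0)),
    "Llama 3.1 8B":  (3.72, (1.99, 1.99)),
}

for model, (nova, (lo, hi)) in results.items():
    print(f"{model}: {nova / hi:.2f}x - {nova / lo:.2f}x faster")
# Llama line: 1.87x - 1.87x faster
```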
The thing that surprised me:
The Nova's advantage isn't just speed - it's that 16GB RAM + NPU headroom lets you run the 3B+ models that actually give correct answers, at speeds the Pi 5 only hits on smaller models. When I tested state capital recall, Qwen 3B got all 50 right. DeepSeek 1.5B started hallucinating around state 30.
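If you want to replicate the recall test, the scoring side is simple. A minimal sketch - `ask_model` here is a placeholder for whatever inference call you use (RKLLM, llama.cpp, etc.), and only three states are shown:

```python
# Score a model's US state-capital recall against a reference answer key.
# ask_model() is a stand-in for your actual inference call.
CAPITALS = {"Alabama": "Montgomery", "Alaska": "Juneau", "Arizona": "Phoenix"}  # all 50 in practice

def score_recall(ask_model, answer_key):
    correct = 0
    for state, capital in answer_key.items():
        reply = ask_model(f"What is the capital of {state}? Answer with the city name only.")
        if capital.lower() in reply.lower():
            correct += 1
    return correct, len(answer_key)

# Example with a fake "model" that happens to contain every answer:
correct, total = score_recall(lambda q: "Montgomery Juneau Phoenix", CAPITALS)
print(f"{correct}/{total} correct")  # -> 3/3 correct
```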
What sucked:
- Pre-converted models from mid-2024 throw "model version too old" errors. Had to hunt for newer conversions (VRxiaojie and c01zaut on HuggingFace work).
- Ecosystem is fragmented compared to `ollama pull whatever`.
- Setup took ~3 hours to first inference. Documentation and reproducibility took longer.
NPU utilization during 8B inference: 79% average across all 3 cores, 8.5GB RAM sustained. No throttling over 2+ minute runs.
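The per-core utilization figures come from the rknpu debugfs node. A small parser sketch - the exact line format may vary by driver version; this assumes the common `NPU load: Core0: 79%, ...` layout:

```python
import re

# Parse per-core NPU load from the rknpu driver's debugfs node.
# Assumed line format: "NPU load:  Core0: 79%, Core1: 82%, Core2: 76%,"
def parse_npu_load(text):
    loads = [int(m) for m in re.findall(r"Core\d+:\s*(\d+)%", text)]
    return loads, (sum(loads) / len(loads) if loads else 0.0)

# On the board you'd read the file directly (debugfs usually needs root):
#   with open("/sys/kernel/debug/rknpu/load") as f:
#       loads, avg = parse_npu_load(f.read())
sample = "NPU load:  Core0: 79%, Core1: 82%, Core2: 76%,"
loads, avg = parse_npu_load(sample)
print(loads, f"avg {avg:.0f}%")  # [79, 82, 76] avg 79%
```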
Happy to answer questions if anyone wants to reproduce this.
Setup scripts and full methodology: github.com/TrevTron/indiedroid-nova-llm
Methodology note: Hardware provided by AmeriDroid. Benchmarks are my own.
EugenePopcorn@reddit
The RK3588S is also said to have a strong mobile GPU. How does its Vulkan performance compare with RKLLM?
tre7744@reddit (OP)
I didn't test Vulkan/GPU inference on this run - I was more focused specifically on the NPU path with RKLLM.
The Mali-G610 is decent for graphics but I'd expect the NPU to win for inference workloads - that's what it's optimized for, even if it was originally designed more for vision tasks than LLMs.
If anyone has llama.cpp Vulkan numbers on the RK3588, I'd love to see them too.
jdchmiel@reddit
The Mali GPU was way faster than the NPU when I last tested on an Orange Pi 5 over a year ago. At that point, even getting the NPU working was bleeding edge, and it wasn't expected to be very useful given the vision-task-centric focus of the early dev kit docs.
arbv@reddit
If you have tested an RK3588 board, can you tell me what to expect? I am considering buying one and using the GPU to run the BGE-M3 reranker and embedding models on the device (both ~0.6B).
I am not sure if it is viable.
jdchmiel@reddit
It has been a long time since I ran anything other than Minecraft for my kids on the Orange Pi 5. What are you looking for, requests/s with that specific model?
arbv@reddit
Oh, then don't bother. Thank you. I don't even know how to measure the performance of an embedding model outside of a complete stack.
Can you recall what you were running and what the performance was?
tre7744@reddit (OP)
Good to know - that tracks with what I've read about early RK3588 NPU support. Sounds like RKLLM has come a long way since then.
rolyantrauts@reddit
I am not so sure, as the NPU rating is for small int4 models that fit into the reserved memory area.
Anything but small image models has to fit into standard memory, and you get DMA overhead passing data back and forth to the NPU.
The CPU is also very strong thanks to the Armv8.2 ML vector instructions on the Cortex-A76, which the Pi 5 shares.
rolyantrauts@reddit
The GPU is approximately 75% of CPU speed, since the CPU is very strong thanks to Armv8.2 having ML vector instructions such as MatMul.
tre7744@reddit (OP)
That makes sense - the 8B model was definitely hitting standard memory (8.5GB sustained). Good context on the DMA overhead, too; that might explain why the NPU advantage shrinks at larger model sizes.
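Rough arithmetic supports the memory-bound framing: at decode time, every generated token streams roughly all the weights from DRAM once. A sketch, assuming ~4-bit weights for the 8B model and an effective LPDDR4X bandwidth of ~25-30 GB/s (neither value measured here):

```python
# Back-of-envelope decode ceiling for a memory-bandwidth-bound LLM.
params = 8e9            # Llama 3.1 8B
bytes_per_param = 0.5   # ~4-bit quantization (assumed)
weight_bytes = params * bytes_per_param  # ~4 GB of weights

for bw_gbs in (25, 30):  # assumed effective DRAM bandwidth, GB/s
    ceiling = bw_gbs * 1e9 / weight_bytes
    print(f"{bw_gbs} GB/s -> ~{ceiling:.2f} t/s theoretical ceiling")

# The measured 3.72 t/s sits well under this ~6-7.5 t/s ceiling,
# consistent with DMA/driver overhead eating into the NPU path.
```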
rolyantrauts@reddit
The reserved memory area details are in this doc; you need to set it up yourself, but it is tiny anyway:
https://github.com/rockchip-linux/rknpu2/blob/master/doc/RK3588_NPU_SRAM_usage.md
tre7744@reddit (OP)
I appreciate the link - I hadn't dug into the SRAM reserved memory setup yet. Makes sense that small int4 models would hit closer to the 6 TOPS ceiling. Might be worth testing a smaller quantized model to see the difference.
FullstackSensei@reddit
Sounds like a chatgpt written marketing post for the indiedroid board.
The Rockchip NPU driver has always been a pain to work with. The whole software stack is kind of a black box, and you're kind of dependent on Rockchip to support models.
Mesa has been working on an open source driver for a while. The driver and userspace runtime were merged into mainline 6.18 over six months ago. The runtime is based on TF-Lite. The whole thing was developed using the RK3588 as a testbed.
tre7744@reddit (OP)
Not ChatGPT - I wrote this myself over a week of testing. You can check the commit history on the GitHub repo if you want receipts.
You're right that the Rockchip stack isn't as polished as Ollama - I said as much in the post. But "pain to work with" might be outdated: it took me about 3 hours to first inference, and I documented the gotchas along the way.
The Mesa/TF-Lite mainline work is interesting though.
exaknight21@reddit
Can you run qwen3:4b instruct?
tre7744@reddit (OP)
I didn't test Qwen3 specifically - I used Qwen 2.5 3B for the Qwen benchmarks. But there's a Qwen3-4B converted for RKLLM v1.2.0 here: https://huggingface.co/ThomasTheMaker/Qwen3-4B-RKLLM-v1.2.0 (haven't tested it myself).
Should work with the same setup. Let me know if you try it - curious how it compares.