Strix Halo, Debian 13@6.16.12&6.17.8, Qwen3Coder-Q8 CTX<=131k, llama.cpp@Vulkan&ROCm, Power & Efficiency
Posted by Educational_Sun_8813@reddit | LocalLLaMA | View on Reddit | 21 comments
Hi, I wanted to check kernel improvements in Strix Halo support under Debian GNU/Linux.
The latest minor versions of 6.16.x improved GTT handling, and I wanted to see if it could get even better.
So I tested it on Debian 13 with the latest kernel from testing, 6.16.12+deb14+1-amd64,
and one precompiled, performance-optimized kernel, 6.17.8-x64v3-xanmod1. I ran tests against Qwen3-Coder-Q8 with full context enabled,
but benchmarked up to 131k. The llama.cpp versions I used for the tests: Vulkan build 5be353ec4 (7109) and the ROCm TheRock precompiled build 416e7c7 (1).
Side note: I finally managed to compile llama.cpp with the external AMD libraries for HIP support, so from now on I will use the same build for Vulkan and ROCm.
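For reference, a build along those lines can be sketched like this; the gfx1151 target and the ROCm paths are assumptions, adjust them for your install:

```shell
# hypothetical HIP build of llama.cpp for Strix Halo (gfx1151), assuming ROCm is installed
HIPCXX="$(hipconfig -l)/clang" HIP_PATH="$(hipconfig -R)" \
  cmake -S . -B build -DGGML_HIP=ON -DAMDGPU_TARGETS=gfx1151 -DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release -j
```

The same build then provides both the Vulkan and HIP backends if the corresponding SDKs are present at configure time.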
Since I also wanted to find the sweet spot in energy efficiency, I captured power usage as well and compared it with compute performance.
So in the end I tested that model with both backends and both kernels, changing the context size in a few steps, to find out.
Conclusion: the latest kernel from testing, 6.16.12, works just great! The performance kernel is maybe a fraction faster (at most 2%).
Besides, the stock kernel idled at 4W (in balanced mode), while the performance kernel always drew a minimum of 9-10W.
And since I run the fans at 0RPM below 5% PWM, the machine is completely silent at idle, and audible under heavy load, especially with ROCm.
Anyway, the most efficient power profile for computation is latency-performance; accelerator-performance is not worth it in the long run.
A note for Strix Halo users on Debian (and probably other distros too, though current Arch and Fedora already ship newer kernels):
you need at least 6.16.x for a good experience on this platform.
On Debian GNU/Linux the easiest way is to install a newer kernel from backports, or move to testing for the latest one.
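A minimal sketch of the backports route on Debian 13 ("trixie"); the suite name and metapackage are the usual ones, but verify against your sources:

```shell
# enable backports and install the newer kernel from it (assumes Debian 13 "trixie")
echo 'deb http://deb.debian.org/debian trixie-backports main' | sudo tee /etc/apt/sources.list.d/backports.list
sudo apt update
sudo apt install -t trixie-backports linux-image-amd64
sudo reboot
```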
Update: I just noticed with apt update that 6.16.12 is now in stable, which is great: nothing extra to do for Debian users. :)
And testing has moved to 6.17.8+deb14-amd64, so I will have that kernel soon and will test it again from the Debian branch.
Ha, what irony; it took me quite a while to write this up. So, another update: I just tested 6.17.8+deb14-amd64, and idle is now 6W in balanced mode,
a bit more than before, but less than with the custom kernel.
Performance-wise, Vulkan is faster in TG but significantly slower in PP, especially with long context. ROCm, on the other hand, is much faster in PP
and a bit slower in TG, but its PP improvement is so large that the TG difference does not matter for long context (ROCm is around 2.7x faster at a 131k CTX window).
Vulkan is very fast for shorter chats, but beyond 32k CTX it gets much slower. Under load (tested with the accelerator-performance profile in tuned),
ROCm can draw around 120W (this backend also uses more CPU for PP), while Vulkan peaked around 70W.
I found that the best physical batch size (-ub) is 512 (the default) for Vulkan, but 2048 for ROCm (~16% faster than the default).
With that, you also have to increase the logical batch size (-b) to 8192 for best ROCm performance. For Vulkan, just leave the logical batch size at its default.
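As a sketch, those batch-size settings map to llama.cpp flags like this (the model filename is a placeholder):

```shell
# ROCm build: larger physical (-ub) and logical (-b) batches paid off here
./build/bin/llama-bench -m qwen3-coder-q8_0.gguf -ub 2048 -b 8192 -p 4096 -n 128
# Vulkan build: the defaults (-ub 512) were already fastest
./build/bin/llama-bench -m qwen3-coder-q8_0.gguf -p 4096 -n 128
```

The same -ub/-b flags work on llama-server and llama-cli, so the tuned values carry over to real usage.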
BONUS section, agent test: after the benchmarks I wanted to try Qwen3-Coder-Q8 with some tooling, so I installed kubectl-ai,
connected it to my local llama-server, and performed some tasks on a local Kubernetes cluster (4 nodes). From a natural-language prompt, the model was able to install JupyterHub from Helm charts, using ~50k tokens for the task.
Notebooks were runnable after some 8-10 minutes. That model works really well on Strix Halo; worth checking out if you haven't yet.
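If you want to reproduce the agent setup, the server side is just llama-server exposing its OpenAI-compatible API; the model path and context size here are placeholders:

```shell
# serve the model locally; agent tools (e.g. kubectl-ai) can then point at http://localhost:8080/v1
./build/bin/llama-server -m qwen3-coder-q8_0.gguf -c 131072 --host 0.0.0.0 --port 8080
```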
I hope someone finds this valuable, and the diagram clear enough. :)
machineglow@reddit
Hi,
Just wondering how things are going with the Strix Halo 128GB chip? I'm just starting my local LLM journey and curious what PP and TG performance is like in agentic applications like openclaw, hermes, etc., with newer MoE or dense models from Gemma 4, Qwen3.6, Deepseek, etc. Have you tried any of them out?
Thanks!
Educational_Sun_8813@reddit (OP)
Yes, I'm trying all the models that can fit. Dense models are slow, but if you're not in a rush they can produce meaningful output. MoE models work great. I use most models in Q8, with a few exceptions for models >100B, which I run in Q5 or Q6, and I make my own quants for those.
machineglow@reddit
Have you run agents? Curious what your experience is and how much context length you're able to get out of the 30B-70B models...
Educational_Sun_8813@reddit (OP)
For 30-70B models you can run the full context.
_murb@reddit
I really need to get ROCm working on mine and do some testing. I've been using Vulkan with gpt-oss-120b and Qwen3-Coder. I've noticed that with Vulkan it won't load more than 64GB into VRAM, even when set to 96GB in the BIOS. I'm running Arch (mainly for the 6.18 kernel) on a GMKtec, but I'm debating returning it for a Framework due to noise. How do you find the noise levels with your system?
fallingdowndizzyvr@reddit
I don't have that problem. In fact, ROCm is the one I'm having a problem with. I can go up to 112GB with Vulkan with my config since it will use GTT. ROCm, on the other hand, won't use GTT, so I'm stuck at 96GB.
Educational_Sun_8813@reddit (OP)
I'm using GTT, no issues now with either backend. In previous versions of ROCm I also had problems crossing 64GB... but Vulkan worked all the time.
fallingdowndizzyvr@reddit
That's weird, since Vulkan was broken and the fix was only merged a couple of weeks ago.
https://github.com/ggml-org/llama.cpp/pull/17110
Educational_Sun_8813@reddit (OP)
But that's some Windows issue, not related to this at all; Vulkan has been working fine since day one on this device under GNU/Linux.
fallingdowndizzyvr@reddit
No. That PR fixed it on both Windows and Linux. It was not working on Linux before that PR.
Here's the output from B6931 before that PR.
"Vulkan0: AMD Radeon Graphics (RADV GFX1151) (84650 MiB, 84530 MiB free)"
Here's the output from B7018 after that PR got merged.
"Vulkan0: AMD Radeon Graphics (RADV GFX1151) (126976 MiB, 126795 MiB free)"
That PR definitely made Vulkan see the GTT under Linux that it didn't see before.
No-Statement-0001@reddit
did you set the kernel params to increase how much unified RAM can be shared?
Educational_Sun_8813@reddit (OP)
Yes, I use additional flags in GRUB:
amd_iommu=off amdgpu.gttsize=131072 ttm.pages_limit=33554432
Those are documented as optimal in many resources; nothing special here.
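For anyone wanting to replicate this, the flags go on the kernel command line via the GRUB config; the values below match the post (gttsize is in MiB and pages_limit in 4 KiB pages, both working out to 128 GiB here), while the "quiet" entry is just a placeholder for whatever is already in your cmdline:

```shell
# /etc/default/grub: append the flags to the default cmdline
GRUB_CMDLINE_LINUX_DEFAULT="quiet amd_iommu=off amdgpu.gttsize=131072 ttm.pages_limit=33554432"
# then regenerate the config and reboot
sudo update-grub && sudo reboot
```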
Educational_Sun_8813@reddit (OP)
At idle it's noiseless, but under load it can get louder. With the power settings I found best (latency-performance) it's just OK, and the fan only gets louder during some parts of the process. Also, you can choose different fans; I settled on an Arctic Cooling fan with higher static pressure than the default Noctua (it also has slightly higher airflow). There is an option to add a second fan for heavier loads, but it only improves cooling when both run at full speed, which I found unnecessary.
ElSrJuez@reddit
Thanks for this; I didn't quite get most of it. Are you following a guide?
Educational_Sun_8813@reddit (OP)
what do you mean "following a guide"?
Shadowmind42@reddit
This looks really interesting. Could you post a higher resolution version or a link to a page?
Educational_Sun_8813@reddit (OP)
Hi, I just checked: if you click the picture it goes full res. It's 2000x1000 with clear text. I only posted it here and crossposted to a few other channels.
Shadowmind42@reddit
Thanks for looking. My phone just rendered it somewhat blurry. What do you think of the Strix Halo so far? Was it worth it? Do you have a laptop or a mini desktop?
Educational_Sun_8813@reddit (OP)
I like it. I have the Framework Desktop, but since then I've found there are other (and cheaper) options, for example something like this: https://www.bosgamepc.com/products/bosgame-m5-ai-mini-desktop-ryzen-ai-max-395 But still, any device with this APU can run GLM-4.5-Air Q4, Qwen3-Coder Q8, and a few other models which are very good. And it's quite power efficient in itself, so another point for me.
Shadowmind42@reddit
I have a R9700 AI pro showing up on Monday. I'm excited to do some benchmarks on that device.