Hipfire dev update: full AMD arch validation incoming (RDNA 1 thru 4, plus Strix Halo and bc250)

Posted by schuttdev@reddit | LocalLLaMA | View on Reddit | 54 comments

Hipfire local dev lab coming together. MS-S1 MAX (Strix Halo, RDNA 3.5) + R9700 (RDNA 4 Pro) just landed. 9070 XT and 6950 XT incoming.

With the 5700 XTs, 7900 XTX, and Skillfish already here, that's every dp4a/WMMA capability tier AMD has shipped:

- no dp4a: 5700 XT, Skillfish (gfx1013)

- dp4a: 6950 XT

- WMMA: 7900 XTX

- iGPU+WMMA: Strix Halo

- RDNA 4: R9700, 9070 XT

Excited to see how much perf I can squeeze out! Also glad I’ll be able to validate PR’s against any RDNA target. Hipfire is just getting started!

[-]

Accomplished_Code141@reddit

I'm testing Hipfire in my rig and is great, thanks for all the effort. I have a RDNA2 gpu, W6800 and I'm getting 22-24 t/s with Qwen 3.6 27B (vs 17-19 t/s in llama.cpp vulkan) and 105-110 t/s with qwen3.6-35b-a3b (vs 78 t/s in llama.cpp vulkan). I'm having some issues with long contexts and some weird loops in the reasoning steps, not sure if it's related to hipfire or the harness I'm using (that works fine with llama.cpp using vulkan backend)

[-]

NoMS_TOCP@reddit

Same here, but I'm using a heavily OCed 8700G IGPU (GFX1103), with llama.cpp I get \~33t/s with qwen3.6-35b-a3b and it got over 50t/s with Hipfire, but with the same issues as you, context loops, repetitions and all.

One thing I observed is that with default 1.05 repetition-penalty, the simply cuts the token processing early on, with 1.00 with keeps going until it start looping. I'm on cachyos.

Let's hope for a fix, this project is great news for Radeon owners. :)

[-]

Wise-Hunt7815@reddit

I’ve got a Strix Halo too! Thanks for your contributions—great work!

[-]

my_name_isnt_clever@reddit

Same. I hope OP is considering looking into how the NPU can be utilized. I haven't seen much talk about it since AMD has been dragging their feet on Linux support.

[-]

alphatrad@reddit

Let me know if you need a volunteer to attempt to validate numbers or something to prove results. 👍

[-]

Doct0r0710@reddit

Where's the patreon / kofi / etc link? I already decided of picking up an RX 6800 instead of selling my RX 6700 and saving up for a 7900 XTX, if you can pull off multi-gpu generation then it's even better.

[-]

MisticRain69@reddit

Also got a strix halo but running a 3090 egpu. Does strix halo still have that AMD egpu glitch to where it power limits the AMD egpu to the PL of the igpu? If its fixed then a mi 50 could be tempting for me to add a 2nd EGPU in the future provided I have enough pcie lanes for another EGPU running at x4.

[-]

Mountain_Patience231@reddit

u wish i could run 2x9070xt in Hipfire

[-]

ps5cfw@reddit

Does hipfire support Hybrid CPU + GPU inference? If so I'd gladly try It, it's the only way I can run Qwen 35B on my 6800xt

[-]

schuttdev@reddit (OP)

Hipfire does not support hybrid inference yet 🤔what sort of speeds are you getting with your current inference backend

[-]

RoomyRoots@reddit

I actually tried the hybrid yesterday and that was a breaking point for me since to run 27B and 35B I need that.

[-]

theepicflyer@reddit

I'm getting 30-35 tok/s at the start of a fresh context. Slows down as the context fills but I'm not sure by how much.


llama-server \ --mmproj models/Qwen3.6-mmproj-FP16.gguf \ --no-mmproj-offload \ -m models/Qwen3.6-35B-A3B-UD-Q4_K_M.gguf \ --alias "qwen3.6-35b-a3b" \ -ngl 99 \ --cpu-moe \ -c 400000 \ --parallel 2 \ --cache-type-k q8_0 \ --cache-type-v q8_0 \ --flash-attn on \ -b 4096 -ub 2048 \ --threads 8 --threads-batch 8 \ --mlock \ --no-context-shift \ --jinja \ --chat-template-kwargs '{"preserve_thinking": false}' \ --presence-penalty 0.0 \ --temp 0.6 --top-k 20 --top-p 0.95 --min-p 0.0 \ --host 0.0.0.0 --port 8080 \

9070XT with Ryzen 7700.

[-]

ps5cfw@reddit

I am getting 12 to 16 token per second on token generation ant upwards to 1000 token per second on prompt processing with llama.cpp ROCM, running on a 6800xt and 96GB of dual channel DDR4 memory at 3333 MHz with a Ryzen 5800x.

It's not bad, but this is with qwen 35b, 122b is much slower and it would be nice to have something a bit faster, if that can happen (not being optimistic though, I know my setup is too limited for the purpose)

[-]

Rosht54@reddit

I have about 25 tok/s in qwen3.6-35b-a3b on my RX 6750 XT, with CPU offload (--n-cpu-moe 26). Looks like you need to do some tinkering with your inference app My CPU is Ryzen 5600, with 32GB dual-channel DDR4-3200

[-]

Rosht54@reddit

My command line looks like this:

../llama-rocm/llama-server --model Qwen_Qwen3.6-35B-A3B-Q5_K_M.gguf \
  --n-gpu-layers all \
  --n-cpu-moe 26 \
  --temp 0.6 \
  --top-p 0.95 \
  --min-p 0.00 \
  --top-k 20 \
  --ctx-size 32768 \
  --host 0.0.0.0 \
  --chat-template_kwargs '{"preserve_thinking":true}' \
  --cache-type-k q8_0 \
  --cache-type-v q8_0 \
  --no-mmap \
  --jinja

[-]

ps5cfw@reddit

I am running 140k unquantized cache at q8_k_XL, that will definitely affect speeds.

[-]

Rosht54@reddit

Oh, okay. I just find myself a bit uncomfortable when speed is less than 20 tok/s, so I prefer to use model versions with less quantization

[-]

schuttdev@reddit (OP)

I’ll see what I can do, seems like an interesting problem to solve re: MoE

[-]

robstaerick@reddit

What of performance gains do we expect with hipfire on a Strix AI 395+ with the 8060s 128gb without additional dedicated GPU?

[-]

eur0child@reddit

I have desperately tried to compile and use it for my 9070 XT to no avail this week. I get hipmalloc errors with Qwen3.6 27B and even with Qwen3.5 9B (which should fit easily in 16 GB VRAM). Looking forward for the improvements !

[-]

schuttdev@reddit (OP)

What OS are you running? And yes both of those should 100% fit at mq4 on your card.

[-]

eur0child@reddit

I tried on Windows and Ubuntu through WSL. Is it expected?

[-]

schuttdev@reddit (OP)

Darn. Yeah it’s my build for windows that’s the problem. I haven’t used windows in a while, but it comes preinstalled on the Strix Halo, so I will definitely look into it while I’m booted into windows tomorrow. Hopefully I can solve the issue for both WSL and native while I’m there.

[-]

eur0child@reddit

I also have a native ubuntu partition, I'll try again on it and report if I am also having issues there (which would imply the issues lies between the chair and screen :D)

[-]

AntuaW@reddit

How are you going to connect R9700 to stróż halo?

[-]

schuttdev@reddit (OP)

USB4 dock today, pci riser ~Thursday night

[-]

Formal-Exam-8767@reddit

How the you plan to power it when using pci riser?

[-]

Due_Net_3342@reddit

via oculink. I done it myself with 7900xtx, but currently hipfire doesn’t support multi-gpu so not really helpful right now

[-]

schuttdev@reddit (OP)

That’s going to be a fun one to untangle tomorrow. But yes I’m aiming for smaller quants + multigpu PoC tomorrow

[-]

putrasherni@reddit

actuallly connecting 4 R9700 with a strix halo will be great , but it don't work like that

[-]

putrasherni@reddit

I wish

[-]

MDSExpro@reddit

Does it support multiGPU setups? I have 8x R9700 that would like more love than ROCm's version vLLM gets.

[-]

schuttdev@reddit (OP)

It’s posted as a feature under issues, will be working on it tomorrow

[-]

BevinMaster@reddit

I guess I have to check hip fire then :), I have a bunch of 32GB gfx1030 and 24GB gfx1100 gpus, was doing locally some vllm builds with custom hip kernels to make it work.

[-]

schuttdev@reddit (OP)

Hipfire is at its core a very similar shape to what you’ve been doing with vllm then, don’t be afraid to contribute!

[-]

BevinMaster@reddit

Yeah I have a bunch of rdna2 v620 and rdna3 7900xtx I am really not satisfied with concurrency use of llamacpp so I have been trying to find a good solution :)

[-]

shoraaa@reddit

whether i will upgrade my rx 6800 to a 7900xtx or 3090 depend on your works now lol gl

[-]

schuttdev@reddit (OP)

🫡 I won’t fail you

[-]

BringMeTheBoreWorms@reddit

I thought i would give this a go earlier today, and tested qwen 3.5 0.8b 4b 9b models and they all had 1.5 to 2 times higher t/s and x10 on prefill. If this is an indicator of what can be done then AMD cards will seriously compete.

This will be great if it can be productionized.

[-]

schuttdev@reddit (OP)

I believe it is possible. We are early on now, my thesis with this has always been that rebuilding from zero targeting AMD silicon directly via custom HIP will always beat CUDA-shaped code that targets…not AMD lol

[-]

BringMeTheBoreWorms@reddit

Im testing it right now and it seems to be a damn good start. I was ok with qwen3.627b chugging at 24 to 30 t/s but if it can over 40 consistently then that's a nice boost.

I just picked up a R9700 earlier this afternoon to go with the 2 xtx cards i have, so will try and test some more later.

[-]

SkyFeistyLlama8@reddit

I wish Qualcomm engineers or someone who knows the ins and outs of the Adreno GPU can do something like this. The llama OpenCL Adreno backend needs updating.

[-]

schuttdev@reddit (OP)

🤔Using my method in hipfire as a reference, it’s possible. I’ll lay that out. 1. Inspect how the cpu talks to the GPU 2. Find the layer that dispatches commands from the cpu 3. Inspect those commands 4. Research the silicon, instruction set, which instructions are low overhead 5. From the commands you’ve inspected, bootstrap your own commands, ensuring to respect and optimize for the arch. 6. You now have qualaunch!

But honestly I wasn’t so rigid about it, I just kept throwing ideas out there based on the research until something stuck and was measurably better than the baseline, and I still do that.

[-]

coyo-teh@reddit

Is Strix Point planned, would it even be useful?

[-]

schuttdev@reddit (OP)

I don’t see why not. Any usable compute is potentially useful.

[-]

CornerLimits@reddit

Do you have any plan to support the vega mi50? Its old but has dp4a and is pretty popular in this field

[-]

schuttdev@reddit (OP)

Anything is technically possible to support as long as it can accept HIP instructions. There are agent skills in the repo for porting to any arch/smoke testing it. If you end up going that route, please do create an issue/PR and I will address it

[-]

Kokospalme@reddit

Looked at some of the models for my RX 9070. But with Q4 quantization being around 15GB for good models und no lower quantization available in hipfire I see no usecase for me.

Yes the benchmarks may be fast but Qwen3.6-27B-Q3_K_S.gguf on llama.cpp uses only 11.5GB of VRAM which allows a huge kv cache. Thus ensuring the model is actually useful with 16GB vram in total.

[-]

schuttdev@reddit (OP)

Will be working on lower quants when I wake up, kicking off research phase currently.

[-]