Hipfire dev update: full AMD arch validation incoming (RDNA 1 thru 4, plus Strix Halo and bc250)
Posted by schuttdev@reddit | LocalLLaMA | 54 comments
Hipfire local dev lab coming together. MS-S1 MAX (Strix Halo, RDNA 3.5) + R9700 (RDNA 4 Pro) just landed. 9070 XT and 6950 XT incoming.
With the 5700 XTs, 7900 XTX, and Skillfish already here, that's every dp4a/WMMA capability tier AMD has shipped:
- no dp4a: 5700 XT, Skillfish (gfx1013)
- dp4a: 6950 XT
- WMMA: 7900 XTX
- iGPU+WMMA: Strix Halo
- RDNA 4: R9700, 9070 XT
Excited to see how much perf I can squeeze out! Also glad I’ll be able to validate PRs against any RDNA target. Hipfire is just getting started!
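For anyone curious what gating on those tiers looks like, here's a rough sketch (just the shape of it, not Hipfire's actual code) using the standard HIP device-properties query. The arch-prefix matching is illustrative; a real backend would key off feature checks:

```cpp
#include <hip/hip_runtime.h>
#include <cstdio>
#include <cstring>

// Pick a kernel tier from the reported gfx arch name.
enum class Tier { Baseline, Dp4a, Wmma };

Tier pick_tier(int device) {
    hipDeviceProp_t props;
    hipGetDeviceProperties(&props, device);
    const char* arch = props.gcnArchName;  // e.g. "gfx1030:sramecc-:xnack-"
    if (strncmp(arch, "gfx101", 6) == 0) return Tier::Baseline; // RDNA1: no dp4a
    if (strncmp(arch, "gfx103", 6) == 0) return Tier::Dp4a;     // RDNA2
    return Tier::Wmma;                                          // RDNA3/3.5/4
}

int main() {
    int count = 0;
    hipGetDeviceCount(&count);
    for (int d = 0; d < count; ++d)
        printf("device %d -> tier %d\n", d, (int)pick_tier(d));
}
```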
Accomplished_Code141@reddit
I'm testing Hipfire on my rig and it's great, thanks for all the effort. I have an RDNA2 GPU (W6800) and I'm getting 22-24 t/s with Qwen 3.6 27B (vs 17-19 t/s in llama.cpp Vulkan) and 105-110 t/s with qwen3.6-35b-a3b (vs 78 t/s in llama.cpp Vulkan). I'm having some issues with long contexts and some weird loops in the reasoning steps; not sure if it's related to Hipfire or the harness I'm using (which works fine with llama.cpp's Vulkan backend).
NoMS_TOCP@reddit
Same here, but I'm using a heavily OCed 8700G iGPU (GFX1103). With llama.cpp I get ~33 t/s with qwen3.6-35b-a3b, and it got over 50 t/s with Hipfire, but with the same issues as you: context loops, repetitions and all.
One thing I observed: with the default 1.05 repetition penalty, it simply cuts token generation short early on; with 1.00 it keeps going until it starts looping. I'm on CachyOS.
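For reference, the usual shape of a repetition penalty (a minimal llama.cpp-style sketch, not Hipfire's actual sampler): tokens already seen get their logits scaled down, so 1.0 is a no-op and 1.05 is a mild nudge. That 1.00 vs 1.05 changes behavior this much makes the sampler path a reasonable first suspect.

```cpp
#include <vector>
#include <unordered_set>

// Minimal sketch: positive logits are divided by the penalty,
// negative ones multiplied, so penalty = 1.0 leaves logits untouched.
void apply_repetition_penalty(std::vector<float>& logits,
                              const std::unordered_set<int>& seen,
                              float penalty) {
    for (int tok : seen) {
        float& l = logits[tok];
        l = (l > 0.0f) ? l / penalty : l * penalty;
    }
}
```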
Let's hope for a fix, this project is great news for Radeon owners. :)
Wise-Hunt7815@reddit
I’ve got a Strix Halo too! Thanks for your contributions—great work!
my_name_isnt_clever@reddit
Same. I hope OP is considering looking into how the NPU can be utilized. I haven't seen much talk about it since AMD has been dragging their feet on Linux support.
alphatrad@reddit
Let me know if you need a volunteer to attempt to validate numbers or something to prove results. 👍
Doct0r0710@reddit
Where's the Patreon / Ko-fi / etc link? I already decided on picking up an RX 6800 instead of selling my RX 6700 and saving up for a 7900 XTX; if you can pull off multi-GPU generation, then it's even better.
MisticRain69@reddit
Also got a Strix Halo, but running a 3090 eGPU. Does Strix Halo still have that AMD eGPU glitch where it power-limits the AMD eGPU to the PL of the iGPU? If it's fixed, then an MI50 could be tempting for me to add as a second eGPU in the future, provided I have enough PCIe lanes for another eGPU running at x4.
Mountain_Patience231@reddit
I wish I could run 2x 9070 XT in Hipfire
ps5cfw@reddit
Does Hipfire support hybrid CPU + GPU inference? If so I'd gladly try it; it's the only way I can run Qwen 35B on my 6800 XT.
schuttdev@reddit (OP)
Hipfire does not support hybrid inference yet 🤔 What sort of speeds are you getting with your current inference backend?
RoomyRoots@reddit
I actually tried it yesterday, and hybrid was the breaking point for me, since I need it to run 27B and 35B.
theepicflyer@reddit
I'm getting 30-35 tok/s at the start of a fresh context. Slows down as the context fills but I'm not sure by how much.
9070XT with Ryzen 7700.
ps5cfw@reddit
I am getting 12 to 16 tokens per second on token generation and upwards of 1000 tokens per second on prompt processing with llama.cpp ROCm, running on a 6800 XT and 96GB of dual-channel DDR4 memory at 3333 MHz with a Ryzen 5800X.
It's not bad, but this is with Qwen 35B; 122B is much slower, and it would be nice to have something a bit faster, if that can happen (not being optimistic though, I know my setup is too limited for the purpose).
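Rough math on the ceiling, assuming generation is memory-bandwidth-bound: dual-channel DDR4-3333 is about 2 × 8 B × 3333 MT/s ≈ 53 GB/s. If ~3B parameters are active per token (the "a3b" in the name) at ~4.5 bits each, that's ~1.7GB of weights to stream per token, so a pure-CPU ceiling sits around 30 t/s; with only part of the model fitting on the GPU, 12-16 t/s is about where you'd expect to land.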
Rosht54@reddit
I get about 25 tok/s with qwen3.6-35b-a3b on my RX 6750 XT, with CPU offload (--n-cpu-moe 26). Looks like you need to do some tinkering with your inference app. My CPU is a Ryzen 5600, with 32GB of dual-channel DDR4-3200.
Rosht54@reddit
My command line looks like this:
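Something along these lines (illustrative; the exact file name and context size are guesses, but the flags match what I described):

```
# illustrative — model filename and context length are guesses
llama-server -m Qwen3.6-35B-A3B-Q4_K_M.gguf -ngl 99 --n-cpu-moe 26 -c 16384
```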
ps5cfw@reddit
I am running a 140k unquantized cache with the model at Q8_K_XL; that will definitely affect speeds.
Rosht54@reddit
Oh, okay. I just find myself a bit uncomfortable when speed is less than 20 tok/s, so I prefer to use model versions with less quantization
schuttdev@reddit (OP)
I’ll see what I can do, seems like an interesting problem to solve re: MoE
robstaerick@reddit
What kind of performance gains can we expect with Hipfire on a Strix Halo (Ryzen AI Max+ 395) with the 8060S and 128GB, without an additional dedicated GPU?
eur0child@reddit
I have desperately tried to compile and use it for my 9070 XT this week, to no avail. I get hipMalloc errors with Qwen3.6 27B and even with Qwen3.5 9B (which should fit easily in 16GB of VRAM). Looking forward to the improvements!
schuttdev@reddit (OP)
What OS are you running? And yes both of those should 100% fit at mq4 on your card.
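If you want to rule out a plain out-of-memory versus something else, a quick probe with the standard HIP calls (a minimal sketch, nothing Hipfire-specific) shows what the runtime actually sees:

```cpp
#include <hip/hip_runtime.h>
#include <cstdio>

// Report free/total VRAM, then attempt one large allocation.
int main() {
    size_t free_b = 0, total_b = 0;
    hipMemGetInfo(&free_b, &total_b);
    printf("free: %.2f GiB / total: %.2f GiB\n",
           free_b / 1073741824.0, total_b / 1073741824.0);

    void* p = nullptr;
    hipError_t err = hipMalloc(&p, 8ull << 30);  // 8 GiB test allocation
    printf("hipMalloc(8 GiB): %s\n", hipGetErrorString(err));
    if (err == hipSuccess) hipFree(p);
}
```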
eur0child@reddit
I tried on Windows and Ubuntu through WSL. Is it expected?
schuttdev@reddit (OP)
Darn. Yeah it’s my build for windows that’s the problem. I haven’t used windows in a while, but it comes preinstalled on the Strix Halo, so I will definitely look into it while I’m booted into windows tomorrow. Hopefully I can solve the issue for both WSL and native while I’m there.
eur0child@reddit
I also have a native Ubuntu partition; I'll try again on it and report if I'm also having issues there (which would imply the issue lies between the chair and the screen :D)
AntuaW@reddit
How are you going to connect the R9700 to the Strix Halo?
schuttdev@reddit (OP)
USB4 dock today, pci riser ~Thursday night
Formal-Exam-8767@reddit
How do you plan to power it when using the PCIe riser?
Due_Net_3342@reddit
Via OCuLink. I've done it myself with a 7900 XTX, but currently Hipfire doesn't support multi-GPU, so it's not really helpful right now.
schuttdev@reddit (OP)
That’s going to be a fun one to untangle tomorrow. But yes I’m aiming for smaller quants + multigpu PoC tomorrow
putrasherni@reddit
Actually, connecting 4x R9700 to a Strix Halo would be great, but it doesn't work like that
putrasherni@reddit
I wish
MDSExpro@reddit
Does it support multi-GPU setups? I have 8x R9700 that would like more love than ROCm's vLLM build gives them.
schuttdev@reddit (OP)
It’s posted as a feature under issues, will be working on it tomorrow
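The broad shape of it (a sketch of the bookkeeping, not the planned implementation): enumerate devices, give each one its own stream, then assign layers across them. The round-robin policy here is purely illustrative:

```cpp
#include <hip/hip_runtime.h>
#include <vector>
#include <cstdio>

int main() {
    int n = 0;
    hipGetDeviceCount(&n);
    std::vector<hipStream_t> streams(n);
    for (int d = 0; d < n; ++d) {
        hipSetDevice(d);
        hipStreamCreate(&streams[d]);
    }
    const int n_layers = 48;  // illustrative model depth
    for (int l = 0; l < n_layers; ++l)
        printf("layer %d -> device %d\n", l, l % n);
    for (int d = 0; d < n; ++d) {
        hipSetDevice(d);
        hipStreamDestroy(streams[d]);
    }
}
```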
BevinMaster@reddit
I guess I have to check out Hipfire then :). I have a bunch of 32GB gfx1030 and 24GB gfx1100 GPUs; I was doing some local vLLM builds with custom HIP kernels to make them work.
schuttdev@reddit (OP)
Hipfire is, at its core, a very similar shape to what you've been doing with vLLM then; don't be afraid to contribute!
BevinMaster@reddit
Yeah, I have a bunch of RDNA2 V620 and RDNA3 7900 XTX cards. I'm really not satisfied with llama.cpp's concurrency, so I've been trying to find a good solution :)
shoraaa@reddit
Whether I upgrade my RX 6800 to a 7900 XTX or a 3090 depends on your work now lol, gl
schuttdev@reddit (OP)
🫡 I won’t fail you
BringMeTheBoreWorms@reddit
I thought I would give this a go earlier today and tested Qwen 3.5 0.8B, 4B, and 9B models; they all had 1.5 to 2x higher t/s and 10x on prefill. If this is an indicator of what can be done, then AMD cards will seriously compete.
This will be great if it can be productionized.
schuttdev@reddit (OP)
I believe it is possible. We are early on now; my thesis with this has always been that rebuilding from zero, targeting AMD silicon directly via custom HIP, will always beat CUDA-shaped code that targets…not AMD lol
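Concretely, "targeting the silicon" means reaching for things like the RDNA dot-product instructions directly. A toy example (a sketch, not an actual Hipfire kernel): an int8 dot product via the clang builtin that maps to v_dot4_i32_i8 on dp4a-capable parts, with a scalar fallback elsewhere:

```cpp
#include <hip/hip_runtime.h>

// Each int packs four int8 values; dp4a multiplies the four pairs
// and accumulates them in one instruction on parts that have it.
__global__ void dot_i8(const int* a, const int* b, int* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
#if defined(__gfx1030__)
    out[i] = __builtin_amdgcn_sdot4(a[i], b[i], 0, false);
#else
    int acc = 0;
    for (int k = 0; k < 4; ++k) {
        int av = (signed char)((a[i] >> (8 * k)) & 0xff);
        int bv = (signed char)((b[i] >> (8 * k)) & 0xff);
        acc += av * bv;
    }
    out[i] = acc;
#endif
}
```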
BringMeTheBoreWorms@reddit
I'm testing it right now and it seems to be a damn good start. I was OK with qwen3.6 27B chugging along at 24 to 30 t/s, but if it can do over 40 consistently, then that's a nice boost.
I just picked up an R9700 earlier this afternoon to go with the 2 XTX cards I have, so I'll try and test some more later.
SkyFeistyLlama8@reddit
I wish Qualcomm engineers, or someone who knows the ins and outs of the Adreno GPU, could do something like this. The llama.cpp OpenCL Adreno backend needs updating.
schuttdev@reddit (OP)
🤔 Using my method in Hipfire as a reference, it's possible. I'll lay it out:
1. Inspect how the CPU talks to the GPU
2. Find the layer that dispatches commands from the CPU
3. Inspect those commands
4. Research the silicon and instruction set: which instructions are low overhead
5. From the commands you've inspected, bootstrap your own commands, making sure to respect and optimize for the arch
6. You now have qualaunch!
But honestly I wasn’t so rigid about it, I just kept throwing ideas out there based on the research until something stuck and was measurably better than the baseline, and I still do that.
coyo-teh@reddit
Is Strix Point planned? Would it even be useful?
schuttdev@reddit (OP)
I don’t see why not. Any usable compute is potentially useful.
CornerLimits@reddit
Do you have any plans to support the Vega MI50? It's old, but it has dp4a and is pretty popular in this field
schuttdev@reddit (OP)
Anything is technically possible to support as long as it can accept HIP instructions. There are agent skills in the repo for porting to any arch/smoke testing it. If you end up going that route, please do create an issue/PR and I will address it
Kokospalme@reddit
Looked at some of the models for my RX 9070, but with Q4 quantization being around 15GB for good models and no lower quantization available in Hipfire, I see no use case for me.
Yes, the benchmarks may be fast, but Qwen3.6-27B-Q3_K_S.gguf on llama.cpp uses only 11.5GB of VRAM, which allows a huge KV cache, ensuring the model is actually useful with 16GB of VRAM in total.
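For scale: per-token KV cache is roughly 2 × n_layers × n_kv_heads × head_dim × bytes_per_element. With illustrative dims (48 layers, 8 KV heads, head dim 128, fp16) that's 2 × 48 × 8 × 128 × 2 B ≈ 192 KiB per token, so the ~3.5GB freed by dropping a quant tier buys on the order of 18k extra context tokens.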
schuttdev@reddit (OP)
Will be working on lower quants when I wake up, kicking off research phase currently.
Quereller@reddit
Looking at your project with great interest.
ismaelgokufox@reddit
Owners of RX 6800 salute you! Thanks for all the effort!
Hydroskeletal@reddit
awesome stuff. this could really open up the dual card space.
o0genesis0o@reddit
Eagerly waiting for your performance metrics from that R9700. I'm tempted by that card too, but just a bit skittish about AMD drivers, given how my 780M has been useless since kernel 6.18 last year.
drubus_dong@reddit
As I'm using the R9700, I appreciate your effort very much.