AMD Hipfire - a new inference engine optimized for AMD GPUs
Posted by Thrumpwart@reddit | LocalLLaMA | 66 comments
Came across hipfire the other day. It's a brand new inference engine focused on all AMD GPUs (not just the latest).
It uses a special mq4 quantization method. The hipfire creator is pumping out models on huggingface.
I don't know enough about quantization to know how good these quants are in terms of quality, but as an RDNA3 aficionado I'm happy AMD is getting some attention.
Localmaxxing is a new LLM benchmarking site and shows some pretty dramatic speedups for hipfire inference.
Skid_gates_99@reddit
dropping a hipfire engine plus a custom quant format plus a benchmarking site simultaneously is wild commitment. solo dev energy. the rdna3 community has been carrying that "what about us" torch for like two years, nice to see something land.
curious how mq4 holds up against q4_k_m on the same model, anyone benched yet?
shing3232@reddit
mq4 shouldn't be that much different from Q4_K_M, from the look of it.
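For context, both are 4-bit block formats at heart; here's a toy block quantizer in the spirit of Q4_0 (one scale per 32-weight block). Purely illustrative: mq4's actual layout isn't documented here, and Q4_K_M layers super-blocks and 6-bit scales on top of this idea.

```python
import numpy as np

# Toy 4-bit block quantization: 32 weights per block, one fp16 scale each.
# Quality differences between 4-bit formats mostly come down to block size
# and how scales (and any mins/offsets) are stored, not the 4-bit part.
def quantize_4bit(w: np.ndarray) -> tuple[np.ndarray, np.ndarray]:
    blocks = w.reshape(-1, 32)
    scales = np.abs(blocks).max(axis=1, keepdims=True) / 7.0 + 1e-12
    q = np.clip(np.round(blocks / scales), -8, 7).astype(np.int8)
    return q, scales.astype(np.float16)

def dequantize_4bit(q: np.ndarray, scales: np.ndarray) -> np.ndarray:
    return (q.astype(np.float32) * scales.astype(np.float32)).ravel()
```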
Flamenverfer@reddit
input
$ HIP_VISIBLE_DEVICES=0 hipfire run ~/.hipfire/models/qwen3.5-27b.mq4 "Write me a very long python script"
output
GPU: gfx1100 (25.8 GB VRAM, HIP 6.3)
pre-compiled kernels: .hipfire_kernels/gfx1100
[hipfire] DFlash disabled (dflash_mode=off)
loading token_embd... (Q8_0 raw, 1350 MB)
loading layer 63/64 (FullAttention)...
KV cache: asym3 (K rotated-3b 100B + V Q8 272B = 372 B/head, 5.5x vs fp32, physical_cap=32768 / max_seq=32768)
[qwen3_5] 5120d 64L 248320 vocab
[512 tok, 42 tok/s]
Thrumpwart@reddit (OP)
25.8GB VRAM? What Frankenstein card are you running?
Flamenverfer@reddit
It has to be a bug! lol. I have two plain jane xtx cards.
alphatrad@reddit
I just started testing this on my RX 7900 XTX tonight.
I gotta finish some tests... but I tested a 9B on the XTX with DFlash on a code prompt: 306.27 tok/s vs the AR baseline of 106 t/s = 2.86× speedup, with coherent output.
THAT'S A BIG BUMP - doesn't work with my R9700 though. Currently testing an MXFP4 that seems to really bump things up on that card.
But Hipfire has REAL POTENTIAL. How it translates to daily use... I dunno yet.
Sometimes speed tests aren't reality when you're talking wall time and doing actual coding and stuff.
soyalemujica@reddit
What do you mean, MXFP4 has potential on the 7900 XTX?
alphatrad@reddit
Not on the 7900 XTX, on my R9700.
ArkaunGaming@reddit
Is there any explanation as to why it doesn't work with R9700s?
alphatrad@reddit
Not fully implemented yet - experimental - you can try it. I just skipped it for now.
My other test was with MXFP4: Qwen3.6-35B on llama.cpp vs vLLM (MXFP4) is a massive jump on my coding benchmarks, best run scoring 50 out of 64 pts.
TL;DR: the FP4 quant is way nicer to the MoE router, plus vLLM's KV cache crushes it.
Downside: lost TPS / tokens per prompt, and wall time went up.
I had dual RX 7900 XTXs, which are FAST... but the cards are SO HUGE that I'm in the middle of swapping over to R9700s. I can only fit two of the XTXs in my case; I'll be able to fit three of the others.
politerate@reddit
Excited to test on my 7900 XTX, though there's no support for gfx906/MI50.
Thrumpwart@reddit (OP)
It's very new but has a lot of potential. Hoping for good things here - he seems to be really squeezing every drop of performance out of AMD GPUs.
ahaw_work@reddit
Any chance for gfx906 (Mi50/Mi60) support?
schuttdev@reddit
gfx908 is supported, so I don't see why not. I have an arch port and a tuning skill in the repo's .skills dir if you'd like to point your agent at it.
RoomyRoots@reddit
As much as I despise Rust, I guess I will try this. I do hope whatever is happening in the background can be implemented in llama.cpp.
Cool-Chemical-5629@reddit
This part of the info from GitHub tells a different story:
RDNA is the latest technology. Scrolling down also reveals ROCm being involved, and that's something you won't get running on older cards such as the Radeon RX Vega 56.
Remove_Ayys@reddit
Looks like vibecoded slop TBH.
SufficientlySuper@reddit
Yeah, pretty much. I vibeslop-ported it to Windows using Kimi k2.6 just to test it, because I couldn't be bothered to set up WSL2 or boot into Linux on my PC with my 7900 XT.
And yes, it is fast... but honestly pretty useless in this state. Loaded it up with pi and tried to use it, and it made Qwen 3.6 27B dumb as a box of rocks.
Definitely needs another few months to cook; if it gains any traction, maybe it will start to get better.
Nindaleth@reddit
Yeah... 100% of my requests result in a loop after a few hundred tokens. Sure, it dumps bullshit tokens fast :D
That's with the same models, HW and inference settings that I use with llama.cpp with no problem.
FullstackSensei@reddit
Would've been easier if they just supported GGUF, even on a limited set of quants. Heck, I wish the entire industry had adopted GGUF instead of every other guy trying to roll their own.
schuttdev@reddit
working on it. can't promise 1:1 perf/quality with default mq4/mq6 tho
crantob@reddit
Thanks for your great work. Would love to see it extended to gfx902, which is in a lot of office laptops.
As you go lower down the hardware stack, the potential impact of your work increases. Just a thought.
Thanks again, Mister Schutt, for your work.
fallingdowndizzyvr@reddit
How would that help? GGUF is just a file format. It's not magic. So even if he adopted the format, it's not like you could load the resulting file into llama.cpp and have it run. It wouldn't support the quantization. Similarly, it's not like you could load say a Q4_0 GGUF into this program and have it run. It's not the proper quantization. The quantization is the point.
crantob@reddit
I'll point out that GGUF has a particular format for memory mapping: the data in the file is stored the same way as a model loaded in RAM, which is what makes memory mapping possible. Depending on how this inference engine works, it's likely not advisable to wrap the weights in GGUF metadata: the engine would remain incompatible with existing GGUFs, and llama.cpp would not be able to load data in this format, leading to great confusion.
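The mmap-friendliness is visible right from the header. A minimal sketch of peeking at one with Python's mmap (the file path is just a placeholder):

```python
import mmap
import struct

# Minimal sketch: read a GGUF header through mmap, the same zero-copy
# mechanism llama.cpp uses. Tensor data later in the file is stored in
# its quantized in-memory layout, so the OS can page weights straight
# into the process without an unpacking step.
with open("model.gguf", "rb") as f:  # placeholder path
    buf = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    magic, version = struct.unpack_from("<4sI", buf, 0)
    n_tensors, n_kv = struct.unpack_from("<QQ", buf, 8)
    assert magic == b"GGUF"
    print(f"GGUF v{version}: {n_tensors} tensors, {n_kv} metadata pairs")
```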
FullstackSensei@reddit
Of course it's just a file format, but it's a very widely supported one. For one, you wouldn't have to wait for OP to quantize and upload a model. For another, there are several people already doing an amazing job quantizing models with great quality. For a third, even if the quantization isn't currently supported, you could add it, because, as you said, it's just a file format.
fallingdowndizzyvr@reddit
Again, none of that matters, since this is a unique quantization; any existing GGUF file wouldn't work because it's not this quantization.
You don't have to wait now. The dev already provides the means for you to make your own quantization. It's limited to Qwen 3.X models, though.
None of them are making this quantization, so whether it's a GGUF file or not doesn't matter.
It's a chicken-and-egg problem, then: if someone adds support for it in llama.cpp, you'd be able to make GGUF files with this quantization. Until that happens, there's no advantage to it being a GGUF file.
droptableadventures@reddit
Yeah, this. It could support GGUF, but they'd be storing the model quantized as MQ4 in the GGUF. So if you loaded it into llama.cpp, you'd just get an "unknown quantization type ..." error.
Puzzleheaded_Base302@reddit
GGUF has its limitations. Why do you think people in production environments primarily use safetensors with vLLM and SGLang? Are they so stupid they've never heard of GGUF?
crantob@reddit
Skimming their README, it seems the custom format they innovated for speed is not compatible with GGUF.
buttplugs4life4me@reddit
Exactly how I feel about ONNX as well. The xkcd about standards comes to mind.
FullstackSensei@reddit
Literally first thing that came to my mind as I wrote it
PrzemChuck@reddit
The Strix Halo results from the poster above seem to suffer from slow prefill - perhaps the biggest weakness of this hardware. The project is very promising; AMD is suuper slow with software support. Already starred the repo, waiting for updates!
autonomousdev_@reddit
ran some models through hipfire last week, mistral and a couple of small qwen ones. like 2-3x faster on my 7900xtx vs ollama with llama.cpp. still kinda rough though, some ops aren't supported yet. worth it if you're on amd and don't mind stuff breaking sometimes
Own_Suspect5343@reddit
Here’s a quick Strix Halo / Radeon 8060S test of hipfire vs llama.cpp on Qwen3.5.
Hardware/software:
AMD Ryzen AI Max+ 395 / Radeon 8060S Graphics
gfx1151
ROCm 7.2
hipfire v0.1.8-alpha checkout
llama.cpp build d6f303004 / b8738
Models tested:
llama.cpp: Qwen3.5-9B Q4_K_M GGUF
hipfire: Qwen3.5-9B MQ4
hipfire: Qwen3.5-9B MQ4 + DFlash draft
General/prose bench (all numbers tok/s):

config                   pp128    pp512    pp1024   decode
llama.cpp Q4_K_M         1021.7   1078.6   1084.9   34.5
hipfire MQ4, no draft    302.4    285.2    283.1    45.0
hipfire MQ4 + DFlash     291.1    274.7    271.5    37.0
So for this prose-style bench, hipfire AR decode was about 30% faster than llama.cpp decode, but llama.cpp was much faster on prefill. DFlash was slower on this prose prompt, which seems expected, because speculative decoding only helps when the draft has high acceptance.
I also tested DFlash on code prompts using dflash_spec_demo:
prompt       AR tok/s   DFlash tok/s   speedup   tau     accept rate
merge_sort   45.6       157.2          3.45x     10.90   0.727
LRUCache     44.7       93.9           2.10x     6.56    0.438
Takeaway: on Strix Halo, llama.cpp currently wins prefill by a lot, hipfire wins AR decode, and DFlash is very workload-dependent. It can lose on prose, but gives large speedups on structured/code generation.
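If you want to sanity-check DFlash numbers against standard speculative-decoding math, the usual expected-tokens-per-verification-step formula (Leviathan et al., 2023) is below. I don't know how hipfire defines tau internally, so treat this as the generic model, not its implementation:

```python
def expected_tokens_per_step(alpha: float, gamma: int) -> float:
    # Expected tokens committed per verification step when drafting gamma
    # tokens with per-token acceptance probability alpha:
    #   E = (1 - alpha**(gamma + 1)) / (1 - alpha)
    # As gamma grows this caps out at 1 / (1 - alpha), which is why a low
    # acceptance rate (like on the prose prompt) can erase the speedup.
    return (1 - alpha ** (gamma + 1)) / (1 - alpha)

# e.g. at the merge_sort run's 0.727 acceptance, the ceiling is
# 1 / (1 - 0.727) ~= 3.7 tokens per step.
print(expected_tokens_per_step(0.727, 8))
```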
GPU-Appreciator@reddit
Thanks for this, will be watching hipfire closely.
Own_Suspect5343@reddit
I don't know why the prefill speed is pretty low :(
New_Spray_7886@reddit
Noob question - where is the list of currently supported architectures? I've looked around the docs on the GitHub but I'm not finding it; curious about gfx1030.
kamikazechaser@reddit
So it still requires ROCm to be installed? The Why section comparing it to llama.cpp made it sound otherwise.
schuttdev@reddit
you need the HIP runtime (libamdhip64.so) but not the full ROCm SDK/dev tooling.
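To illustrate: the runtime alone is enough to enumerate and drive the GPU. A rough ctypes sketch of the dlopen approach (the engine itself is Rust; this is just to show the principle):

```python
import ctypes

# Load the HIP runtime the way a dlopen-based engine would: no ROCm SDK,
# no compile-time linking, just the shared library from the driver stack.
hip = ctypes.CDLL("libamdhip64.so")

count = ctypes.c_int(0)
# hipGetDeviceCount is part of the public HIP runtime API; hipSuccess == 0.
if hip.hipGetDeviceCount(ctypes.byref(count)) == 0:
    print(f"{count.value} HIP device(s) visible")
else:
    print("HIP runtime loaded but no usable devices")
```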
kamikazechaser@reddit
Thanks for clarifying. I've created an issue about some Fedora-related problems.
Awwtifishal@reddit
That's the part I like least about it. Lemonade-server is the same, and I couldn't figure out how to run a GGUF that I had already downloaded.
schuttdev@reddit
I appreciate the feedback; I made that call early on. I already have a config TUI, so I may as well incorporate a chat TUI.
wh33t@reddit
Someone please test on MI50.
politerate@reddit
Not listed in supported architectures
schuttdev@reddit
I don't see why it couldn't be scoped and ported. I've tried to distill the development process into .skills/ so that the barrier to porting to any AMD hardware is lower.
DefNattyBoii@reddit
Nice work! How is the speed on the 7900 XTX for Qwen3.6 27B at longer contexts? It seems like Localmaxxing only has 4K ctx results so far.
schuttdev@reddit
I have triattention sidecars in the oven currently for both 3.5 and 3.6 27B (as well as 35B), so once those are done (12ish hours from now) you should be able to get full 132k ctx at near-4096 perf using token eviction.
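For anyone unfamiliar with token eviction: the general idea is to hold the KV cache to a fixed budget by dropping the cached tokens that attention has barely used. A toy, H2O-style sketch of the concept (not the engine's actual policy):

```python
import numpy as np

# Toy attention-mass-based KV eviction: keep a fixed cache budget by
# dropping the tokens that have received the least cumulative attention.
# Real policies typically also pin recent tokens and attention sinks.
def evict(attn_mass: np.ndarray, budget: int) -> np.ndarray:
    # attn_mass: cumulative attention each cached token received, (seq,).
    # Returns the indices of the tokens to keep, in positional order.
    keep = np.argsort(attn_mass)[-budget:]
    return np.sort(keep)
```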
Fit_Advice8967@reddit
Phenomenal. Go go local models! Getting two Framework Desktop 128GB machines when they were normally priced was a good move.
Quiet-Owl9220@reddit
Sounds awesome, love to see improvements targeted specifically at my hardware. But I only see censored Qwen models so far... hope that changes.
schuttdev@reddit
any qwen3.5+ model is supported (up to 35b currently), including finetunes. carnice and qwopus are in the MQ registry, Ornstein works too https://www.localmaxxing.com/runs/cmofgd9iz000ml4055xsdgvuy
sarcasmguy1@reddit
Does it support offloading like llama.cpp?
Own_Suspect5343@reddit
This is great! I can help to implement some features. What is missing now?
Own_Suspect5343@reddit
It's great! I can make an MR for Strix Halo support this week.
DUFRelic@reddit
Any plans to support the MI50?
KvAk_AKPlaysYT@reddit
Curious, how is it different from lemonade?
schuttdev@reddit
lemonade wraps llama.cpp/FastFlowLM, hipfire is its own engine, own kernels, own quant, own dispatch path through dlopen'd HIP.
FullstackSensei@reddit
I think lemonade wraps llama.cpp
fallingdowndizzyvr@reddit
Amongst other things. It also wraps FastFlowLM.
NaturalCriticism3404@reddit
Cries in gfx1010
schuttdev@reddit
I started the project on the gfx1010, you may not get the full suite, but it *should* run. If you have issues, let me know and I'll swap back to my 5700xt and rebuild for it.
NaturalCriticism3404@reddit
Awesome, I will try it for sure. I previously tried compiling llama.cpp for this 5500 XT, but the performance was consistently worse than Vulkan in all my tests. I also attempted to compile PyTorch for training, but it required patches and other stuff beyond my current expertise.
Charming_Support726@reddit
Looks promising. I've got a gfx1152 and a gfx1201. Neither seems to be fully supported yet.
Maybe a good project to keep an eye on.
taking_bullet@reddit
GFX1200 is supported, so gfx1201 will work fine. These are almost the same GPU (RX 9070 and 9070 XT).
SemaMod@reddit
I tried searching the repo to no avail, but does the engine natively support multi-GPU setups?
RedParaglider@reddit
Hell yea, thanks for posting. This is the shit I love this sub for!