[-]

Skid_gates_99@reddit

dropping a hipfire engine plus a custom quant format plus a benchmarking site simultaneously is wild commitment. solo dev energy. the rdna3 community has been carrying that "what about us" torch for like two years, nice to see something land.

curious how mq4 holds up against q4_k_m on the same model, anyone benched yet?

[-]

shing3232@reddit

mq4 shouldn't that much different from Q4km from the look of it.

[-]

Flamenverfer@reddit

intput

$ HIP_VISIBLE_DEVICES=0 hipfire run ~/.hipfire/models/qwen3.5-27b.mq4 "Write me a very long python script"

output

GPU: gfx1100 (25.8 GB VRAM, HIP 6.3) pre-compiled kernels: .hipfire_kernels/gfx1100 [hipfire] DFlash disabled (dflash_mode=off). loading token_embd... (Q8_0 raw, 1350 MB)

#Layers

loading layer 63/64 (FullAttention)... KV cache: asym3 (K rotated-3b 100B + V Q8 272B = 372 B/head, 5.5x vs fp32, physical_cap=32768 / max_seq=32768) [qwen3_5] 5120d 64L 248320 vocab

[512 tok, 42 tok/s]

[-]

Thrumpwart@reddit (OP)

25.8GB VRAM? What Frankenstein card are you running?

[-]

Flamenverfer@reddit

It has to be a bug! lol. I have two plain jane xtx cards.

 ./build/bin/llama-server --list-devices    
    Available devices:
  ROCm0: AMD Radeon RX 7900 XTX (24560 MiB, 24524 MiB free)
  ROCm1: AMD Radeon RX 7900 XTX (24560 MiB, 24494 MiB free)

[-]

alphatrad@reddit

I just started testing this on my RX 7900 XTX tonight.

I gotta finish some tests.... but I tested a 9B on the XTX with DFlash on code prompt. 306.27 tok/s vs AR baseline 106 t/s = 2.86× speedup with coherent output.

THATS A BIG BUMP - doesn't work with my R9700 though. Currently testing an MXFP4 that seems to really bump things up on that card.

But the Hipfire has REAL POTENTIAL. How it translates to daily use.... I dunno yet.

Sometimes speed tests aren't reality when you're talking wall time and doing actually coding and stuff.

[-]

soyalemujica@reddit

How do you mean that MXFP4 has potential in the 7900 XTX?

[-]

alphatrad@reddit

Not on the 7900 XTX my R9700

[-]

ArkaunGaming@reddit

Is there any explanation as to why it doesn't work with R9700s?

[-]

alphatrad@reddit

Not fully implemented yet - experimental - you can try it. I just skipped it for now.

My other test was with MXFP4 - Qwen3.6-35B on llama.cpp vs vLLM (MXFP4) is a massive jump on my coding benchmarks. Best run scoring 50 out of 64 pts.

TLDR: FP4 quant is way nicer to MoE router + vLLM’s KV cache crushes it.

Downside: Lost TPS / tokens per prompt and wall time went up.

I had dual RX 7900 XTX's which are FAST... but the cards so SO HUGE so I'm in the middle of swapping over to 9700's. I can only fit two of the XTX's in my case. I will be able to fit 3 of the other ones.

[-]

politerate@reddit

Exited to test on my 7900xtx, though no support for gfx906/mi50

[-]

Thrumpwart@reddit (OP)

It's very new but has alot of potential. Hoping for good things for this - he seems to be really squeezing every drop of performance out of AMD GPU's.

[-]

ahaw_work@reddit

Any chance for gfx906 (Mi50/Mi60) support?

[-]

schuttdev@reddit

gfx908 is supported, so I don't see why not. I have an arch port and tuning skill in the repo's .skills dir if you'd like to point your agent at it

[-]

RoomyRoots@reddit

As much as I despise Rust, I guess I will try this. I do hope whatever is happening in the background can be implemented on llama.cpp.

[-]

Cool-Chemical-5629@reddit

It's a brand new inference engine focused on all AMD GPU's (not just the latest).

This part of the info from Github says a different story:

RDNA-native LLM inference engine in Rust.

RDNA is the latest technology. Scrolling down also reveals ROCm being involved. That's also something you won't get running on older cards such as Radeon RX Vega 56.

[-]

Remove_Ayys@reddit

Looks like vibecoded slop TBH.

[-]

SufficientlySuper@reddit

Yeah pretty much. I vibeslop ported it to windows using Kimi k2.6 just to test it. because I couldn't be bothered to setup wsl2 or boot into Linux on my PC with my 7900xt.

And yes it is fast.. but honestly pretty useless at this state. Loaded it up with pi and tried to use it and it make qwen 3.6 27b dumb as a box of rocks.

Definitely needs another few months to cook if it gains any traction maybe it will start to get better though.

[-]

Nindaleth@reddit

Yeah... 100% of my requests result in a loop after a few hundered tokens. Sure, it dumps bullshit tokens fast :D

That's with same models, HW and inference settings that I use with llama.cpp with no problem.

[-]

FullstackSensei@reddit

Would've been easier if they just supported GGUF, even on a limited set of quants. Heck, wish the entire industry adapted GGUF instead of every other guy try to roll their own.

[-]

schuttdev@reddit

working on it. can't promise 1:1 perf/quality with default mq4/mq6 tho

[-]

crantob@reddit

Thanks for your great work. Would love to see it extended to gfx902 which is in a lot of office laptops.

As you go lower down the hardware stack, the potential impact of your work increases. Just a thought.

Thanks again Mistr Schutt for your work.

[-]

fallingdowndizzyvr@reddit

How would that help? GGUF is just a file format. It's not magic. So even if he adopted the format, it's not like you could load the resulting file into llama.cpp and have it run. It wouldn't support the quantization. Similarly, it's not like you could load say a Q4_0 GGUF into this program and have it run. It's not the proper quantization. The quantization is the point.

[-]

crantob@reddit

I'll inform you that GGUF has a particular format for memory mapping: that is: the data in the file is stored the same way as a model that is loaded in RAM. That is what makes the memory mapping possible. Dependent on how this inference engine works, it's likely not advisable to wrap the weights in GGUF metadata as the engine would remain incompatible with existing GGUFs and llama.cpp would not be able to load data in this format, leading to great confusiuon.

[-]

FullstackSensei@reddit

Of course it's just a file format, but it's a very widely supported one. For one, you wouldn't have to wait for OP to quantize and upload a model. For another, there are several people already doing an amazing job quantizing models with great quality. For a 3rd, even if the quantization isn't currently supported, you could add it, because as you said it's just a file format.

[-]

fallingdowndizzyvr@reddit

Again, none of that matters since this is a unique quantization. So any existing GGUF file wouldn't work. Since it's not this quantization.

For one, you wouldn't have to wait for OP to quantize and upload a model.

You don't have to wait now. The dev already provides the means for you to make your own quantization. It's limited to Qwen 3.X models though.

For another, there are several people already doing an amazing job quantizing models with great quality.

None of them are making this quantization. Whether it's a GGUF file or not doesn't matter.

For a 3rd, even if the quantization isn't currently supported, you could add it, because as you said it's just a file format.

It's a chicken and egg problem then. Since if someone does add support to it in llama.cpp, then you would be able to make GGUF files with this quantization. Until that happens though, there's no advantage for it to be a GGUF file.

[-]

droptableadventures@reddit

Yeah, this. This could support GGUF, but they'd be storing the model quantised as MQ4 in the GGUF. So if you loaded it into llama.cpp, you'd just get an "unknown quantization type ..." error.

[-]

Puzzleheaded_Base302@reddit

gguf has it's limitation. why do you think in production environment, people primarily use safetensors with vllm and sglang? are they stupid enough that never heard of gguf?

[-]

crantob@reddit

Skimming their readme It seems the custom format they innovated for speed is not compatible with GGUF.

[-]

buttplugs4life4me@reddit

Exactly how I feel about ONNX as well. xkcd comes to mind

[-]

FullstackSensei@reddit

Literally first thing that came to my mind as I wrote it

[-]

PrzemChuck@reddit

The results for strix halo from poster above seem to suffer from slow prefill - perhaps the biggest weakness of this hardware. The project is very promising, AMD is suuper slow with software support. Already starred the repo, waiting for updates!

[-]

autonomousdev_@reddit

ran some models through hipfire last week, mistral and a couple small qwen ones. like 2-3x faster on my 7900xtx vs ollama with llama.cpp. still kinda rough though, some ops not supported yet. worth it if youre on amd and dont mind stuff breaking sometimes

[-]

Own_Suspect5343@reddit

Here’s a quick Strix Halo / Radeon 8060S test of hipfire vs llama.cpp on Qwen3.5.

Hardware/software:

AMD Ryzen AI Max+ 395 / Radeon 8060S Graphics

gfx1151

ROCm 7.2

hipfire v0.1.8-alpha checkout

llama.cpp build d6f303004 / b8738

Models tested:

llama.cpp: Qwen3.5-9B Q4_K_M GGUF

hipfire: Qwen3.5-9B MQ4

hipfire: Qwen3.5-9B MQ4 + DFlash draft

General/prose bench:

llama.cpp Q4_K_M:

pp128: 1021.7 tok/s

pp512: 1078.6 tok/s

pp1024: 1084.9 tok/s

decode: 34.5 tok/s

hipfire MQ4, no draft:

pp128: 302.4 tok/s

pp512: 285.2 tok/s

pp1024: 283.1 tok/s

decode: 45.0 tok/s

hipfire MQ4 + DFlash:

pp128: 291.1 tok/s

pp512: 274.7 tok/s

pp1024: 271.5 tok/s

decode: 37.0 tok/s

So for this prose-style bench, hipfire AR decode was about 30% faster than llama.cpp decode, but llama.cpp was much faster on prefill. DFlash was slower on this prose prompt, which seems expected because

speculative decoding only helps when the draft has high acceptance.

I also tested DFlash on code prompts using dflash_spec_demo:

merge_sort prompt:

hipfire AR: 45.6 tok/s

hipfire DFlash: 157.2 tok/s

speedup: 3.45x

tau: 10.90

accept rate: 0.727

LRUCache prompt:

hipfire AR: 44.7 tok/s

hipfire DFlash: 93.9 tok/s

speedup: 2.10x

tau: 6.56

accept rate: 0.438

Takeaway: on Strix Halo, llama.cpp currently wins prefill by a lot, hipfire wins AR decode, and DFlash is very workload-dependent. It can lose on prose, but gives large speedups on structured/code generation.

[-]

GPU-Appreciator@reddit

Thanks for this, will be watching hipfire closely.

[-]

Own_Suspect5343@reddit

i don't know why prefill speed pretty low(

[-]

New_Spray_7886@reddit

Noob question - where is the list of the current supported architectures? I’ve looked around the docs on the github but not finding it, curious about gfx1030

[-]

kamikazechaser@reddit

So it still requires ROCm installed? The Why section comparing it to llama made it sound otherwise.

[-]

schuttdev@reddit

you need the HIP runtime (libamdhip64.so) but not the full ROCm SDK/dev tooling.

[-]

kamikazechaser@reddit

Thanks for clarifying, I have created an issue about fedora related issues.

[-]

Awwtifishal@reddit

Ollama-style UX

The part I like the least about it. Lemonade-server is the same and I couldn't figure out how to run a GGUF that I had already downloaded.

[-]

schuttdev@reddit

I appreciate the feedback, I made that call early on. I already have a config tui, may as well incorporate a chat tui.

[-]

wh33t@reddit

Someone please test on mi50

[-]

politerate@reddit

Not listed in supported architectures

[-]

schuttdev@reddit

I don't see why it couldn't be scoped+ported, I've tried to distill the development process to the .skills/ such that the barrier of porting to any AMD hardware is lower

[-]

DefNattyBoii@reddit

Nice work! How is the speed on 7900XTX for qwen3.6 27B on longer contexts? It seems like Localmaxxing only has 4k ctx done.

[-]

schuttdev@reddit

I have triattention sidecars in the oven currently for both 3.5 and 3.6 27b (as well as 35b), so once those are done (12ish hours from now) you should be able to get full 132k ctx at near 4096 perf using token eviction.

[-]

Fit_Advice8967@reddit

Phenomenal. Go go local models! Getting two framework desktop 128gb when it was normally priced was a good move

[-]

Quiet-Owl9220@reddit

Sounds awesome, love to see improvements targeted specifically at my hardware. But I only see censored Qwen models so far... hope that changes.

[-]

schuttdev@reddit

any qwen3.5+ model is supported (up to 35b currently), including finetunes. carnice and qwopus are in the MQ registry, Ornstein works too https://www.localmaxxing.com/runs/cmofgd9iz000ml4055xsdgvuy

[-]

sarcasmguy1@reddit

Does it support offloading like llama.cpp?

[-]

Own_Suspect5343@reddit

This is great! I can help to implement some features. What is missing now?

[-]

Own_Suspect5343@reddit

It's great! I can make MR for strix halo support on this week.

[-]

WithoutReason1729@reddit

Your post is getting popular and we just featured it on our Discord! Come check it out!

You've also been given a special flair for your contribution. We appreciate your post!

I am a bot and this action was performed automatically.

[-]

DUFRelic@reddit

Any plans to support the MI50?

[-]

KvAk_AKPlaysYT@reddit

Curious, how is it different from lemonade?

[-]

schuttdev@reddit

lemonade wraps llama.cpp/FastFlowLM, hipfire is its own engine, own kernels, own quant, own dispatch path through dlopen'd HIP.

[-]

FullstackSensei@reddit

I think lemonade wraps llama.cpp

[-]

fallingdowndizzyvr@reddit

Amongst other things. It also wraps FastFlowLM.

[-]

NaturalCriticism3404@reddit

Cries in gfx1010

[-]

schuttdev@reddit

I started the project on the gfx1010, you may not get the full suite, but it *should* run. If you have issues, let me know and I'll swap back to my 5700xt and rebuild for it.

[-]

NaturalCriticism3404@reddit

Awesome I will try it for sure. I previously tried compiling llama.cpp for this 5500XT, but the performance was consistently worse than vulkan in all my tests. I also attempted to compile pytorch for training, but it required patches and other stuff beyond my current expertise.

[-]

Charming_Support726@reddit

Looks promising. I've got a gfx1152 and a gfx1201. Both seem to be not fully supported yet.

Maybe a good project to keep an eye on.

[-]