I got 3× faster HFQ4 prefill on Strix Halo in hipfire with an opt-in MMQ path

Posted by Own_Suspect5343@reddit | LocalLLaMA | 13 comments

I recently contributed an experimental HFQ4-G256 MMQ prefill path to hipfire, an RDNA-focused LLM inference engine.

Disclaimer: I authored the PR, so this is partly a contribution note, but I am mainly looking for independent validation from other AMD users.

Before this PR, HFQ4 prefill in hipfire went through a more generic, slower path. On my Strix Halo system, prompt processing was clearly the bottleneck: longer prefills ran at roughly 310–340 tok/s.

The new path adds an opt-in MMQ-style prefill implementation. In this context, MMQ means a specialized quantized matrix-multiplication path: instead of treating prefill like a less optimized sequence of operations, it packs the work into tiled matrix-matrix kernels that are better suited for GPU execution. The implementation pre-quantizes prefill activations into a Q8_1 MMQ layout and uses i8 WMMA over 128×128 output/batch tiles with LDS staging.
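To make the idea of pre-quantizing activations concrete, here is a minimal pure-Python sketch of a Q8_1-style block quantizer and the integer dot product it enables. It is an illustration under assumptions, not hipfire's code: the block size of 32, the rounding, and the 4-bit weight form `w = w_scale * q + w_min` are all assumed, and `quantize_q8_1` and `dot_q4_q8` are hypothetical names. The point it shows is why a per-block sum is worth storing: the weight offset term collapses to `w_min * d * sum(q)`, so the inner loop stays a pure integer multiply-accumulate.

```python
# Sketch only: block size, rounding, and the 4-bit weight layout are assumptions,
# not hipfire's actual HFQ4-G256 or Q8_1 MMQ formats.

def quantize_q8_1(xs, block=32):
    """Quantize floats into int8 blocks of (scale d, sum of quantized values, values)."""
    blocks = []
    for i in range(0, len(xs), block):
        chunk = xs[i:i + block]
        amax = max(abs(v) for v in chunk)
        d = amax / 127.0 if amax > 0 else 1.0  # avoid dividing by zero on all-zero blocks
        q = [max(-127, min(127, round(v / d))) for v in chunk]
        blocks.append((d, sum(q), q))
    return blocks


def dot_q4_q8(w_scale, w_min, w_q, x_blocks):
    """Dot product of 4-bit weights (w = w_scale * q + w_min) with quantized activations."""
    total = 0.0
    pos = 0
    for d, qsum, q in x_blocks:
        wq = w_q[pos:pos + len(q)]
        int_dot = sum(a * b for a, b in zip(wq, q))  # integer multiply-accumulate
        # Scale term plus offset term; the offset term only needs the stored block sum.
        total += w_scale * d * int_dot + w_min * d * qsum
        pos += len(q)
    return total
```

On real hardware the `int_dot` step is the part that maps onto i8 WMMA; everything else is a handful of floating-point multiplies per block.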

After enabling it with:

HIPFIRE_MMQ=1

I see longer-prefill throughput of roughly 1140–1260 tok/s on Strix Halo / gfx1151.
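For anyone scripting A/B runs, a small Python helper can toggle the variable per process. This is a sketch under assumptions: the `hipfire run <model> <prompt>` form is taken from a command posted in the comments, `mmq_run_command` is a hypothetical name, and treating "variable unset" as "MMQ off" is inferred from the path being opt-in.

```python
import os


def mmq_run_command(model_path, prompt, enable_mmq=True):
    """Build argv and an environment for one hipfire run with the MMQ path toggled."""
    env = dict(os.environ)  # copy, so the parent environment is left untouched
    if enable_mmq:
        env["HIPFIRE_MMQ"] = "1"
    else:
        env.pop("HIPFIRE_MMQ", None)  # assumption: unset means the default (non-MMQ) path
    argv = ["hipfire", "run", model_path, prompt]
    return argv, env

# Usage (requires hipfire on PATH):
#   import subprocess
#   argv, env = mmq_run_command("path/to/model.mq4", "Hello")
#   subprocess.run(argv, env=env, check=True)
```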

What changed:

Adds an opt-in HIPFIRE_MMQ=1 path for HFQ4-G256 prefill.
Targets RDNA3 / RDNA3.5 for now: gfx1100, gfx1101, gfx1102, gfx1103, gfx1150, gfx1151.
Pre-quantizes prefill activations into a Q8_1 MMQ layout.
Uses i8 WMMA over 128×128 output/batch tiles with LDS staging.
Similar in shape to llama.cpp’s AMD MMQ prompt-processing path.
Not enabled by default.
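As a rough illustration of what 128×128 output/batch tiles mean for different prompt lengths, the sketch below counts tiles and the share of tile area that would be padding. It assumes partial tiles are padded up to a full tile, which may not match how the real kernel handles edges; `tile_grid` is a hypothetical helper and the 4096 output width is only an example value.

```python
def tile_grid(out_features, batch_tokens, tile=128):
    """Return (tiles along outputs, tiles along batch, fraction of tile area that is padding)."""
    tiles_out = -(-out_features // tile)    # ceiling division
    tiles_batch = -(-batch_tokens // tile)
    covered = tiles_out * tiles_batch * tile * tile
    useful = out_features * batch_tokens
    return tiles_out, tiles_batch, 1.0 - useful / covered


for tokens in (64, 200, 256, 2048):
    print(tokens, tile_grid(4096, tokens))
```

Under that padding assumption, a 64-token batch fills only half of a tile along the batch dimension, while multiples of 128 waste nothing, so small or ragged batches are the natural edge case to test.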

Benchmark: Qwen3.5 9B HFQ4/MQ4 on Strix Halo / gfx1151

KV mode	pp	MMQ off, tok/s	MMQ on, tok/s	Speedup
q8	256	363.1	1127.6	3.11x
q8	512	352.0	1179.8	3.35x
q8	1024	328.9	1222.7	3.72x
q8	2048	318.2	1168.5	3.67x
asym4	256	368.6	1108.8	3.01x
asym4	512	360.7	1173.3	3.25x
asym4	1024	333.9	1223.0	3.66x
asym4	2048	312.3	1151.7	3.69x
asym3	256	361.4	1124.5	3.11x
asym3	512	359.8	1187.3	3.30x
asym3	1024	329.9	1259.1	3.82x
asym3	2048	314.1	1216.5	3.87x
asym2	256	374.0	1116.2	2.98x
asym2	512	356.6	1173.2	3.29x
asym2	1024	340.1	1208.5	3.55x
asym2	2048	311.4	1142.9	3.67x

So on longer prefills (pp 1024 and 2048), this moved my Strix Halo results from roughly 311–340 tok/s to roughly 1143–1259 tok/s.
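The speedup column is simply MMQ-on throughput divided by MMQ-off throughput. The short sketch below recomputes it for a few rows copied from the table and converts throughput into approximate wall-clock prefill time, assuming the reported tok/s is an average over the whole prompt. The helper names are illustrative.

```python
# (KV mode, prompt tokens, MMQ off tok/s, MMQ on tok/s), copied from the table above
ROWS = [
    ("q8", 1024, 328.9, 1222.7),
    ("q8", 2048, 318.2, 1168.5),
    ("asym3", 1024, 329.9, 1259.1),
    ("asym3", 2048, 314.1, 1216.5),
]


def speedup(off_tps, on_tps):
    """Throughput ratio: how many times faster the MMQ path is."""
    return on_tps / off_tps


def prefill_seconds(tokens, tok_per_s):
    """Approximate time to process the prompt at a given average throughput."""
    return tokens / tok_per_s


for kv, pp, off, on in ROWS:
    print(f"{kv} pp{pp}: {speedup(off, on):.2f}x, "
          f"{prefill_seconds(pp, off):.2f}s -> {prefill_seconds(pp, on):.2f}s")
```

For the q8 pp2048 row, that works out to roughly 6.4 s of prefill without MMQ versus roughly 1.8 s with it.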

Correctness validation so far:

batched prefill compared against sequential token-by-token forward pass
final prefill top token match
selected-logit drift within tolerance
next decode step after prefill also checked, to catch KV-cache write problems
tested across q8, asym4, asym3, asym2 KV modes
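The checks listed above can be sketched as a single comparison function. This is an assumed shape, not the PR's actual test code: logits are plain lists of floats, `check_prefill` and `probe_ids` are hypothetical names, and the tolerance value is arbitrary.

```python
def check_prefill(batched_logits, sequential_logits, probe_ids, tol=1e-2):
    """Compare final-position logits from batched prefill against a sequential reference."""
    top_b = max(range(len(batched_logits)), key=batched_logits.__getitem__)
    top_s = max(range(len(sequential_logits)), key=sequential_logits.__getitem__)
    # Largest absolute difference over the selected vocabulary entries.
    drift = max(abs(batched_logits[i] - sequential_logits[i]) for i in probe_ids)
    top_match = top_b == top_s
    return {"top_match": top_match, "max_drift": drift, "ok": top_match and drift <= tol}
```

Running the same comparison on the logits of the first decode step after prefill is what exercises the KV-cache writes: if the batched path stored wrong keys or values, the drift would show up there even when the prefill logits themselves look fine.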

Caveats:

validated by me mainly on one Strix Halo / gfx1151 system
the path is experimental
it is not enabled by default
I would not call this the final/canonical MMQ implementation yet
more coherence and long-context testing would be useful

The maintainer also tested the merged path on gfx1100 and reported that HIPFIRE_MMQ=1 runs cleanly there, with a smaller but still positive result: +19.8% on 4B pp256.

What I would especially like to check now is whether this implementation generalizes well across other AMD GPUs and APUs, or whether the current tuning is mostly favorable to Strix Halo / gfx1151.

The basic correctness checks pass, but I am not yet fully confident that the KV-cache behavior is completely bulletproof. Subtle KV-cache issues might only appear in longer real workloads, so I would especially appreciate validation on long-context and multi-turn runs.

I would be very interested in results from people with:

7900 XTX / gfx1100
other RDNA3 cards
Strix Halo / gfx1151
RDNA3.5 APUs
and more
long-context agentic workloads where prefill matters more than short chat decode

PR: https://github.com/Kaden-Schutt/hipfire/pull/73

Flamenverfer@reddit

I ran a very quick one last night

Input:
$ HIP_VISIBLE_DEVICES=0 hipfire run ~/.hipfire/models/qwen3.5-27b.mq4 "Write me a very long python script"

Output:
GPU: gfx1100 (25.8 GB VRAM, HIP 6.3)
pre-compiled kernels: .hipfire_kernels/gfx1100
[hipfire] DFlash disabled (dflash_mode=off).
loading token_embd... (Q8_0 raw, 1350 MB)

Layers

loading layer 63/64 (FullAttention)...
KV cache: asym3 (K rotated-3b 100B + V Q8 272B = 372 B/head, 5.5x vs fp32, physical_cap=32768 / max_seq=32768)
[qwen3_5] 5120d 64L 248320 vocab

[512 tok, 42 tok/s]

TheCTRL@reddit

Excellent benchmarks, but are answers with long context still consistent? Are there loop issues, and does tool calling still work on long tasks?

Own_Suspect5343@reddit (OP)

I haven’t tested this yet. I think I’ll have time tomorrow to look into this and make a fix if needed.

spaceman_@reddit

I have gfx1151, gfx1100 and gfx1201 hardware. I'll give this a shot later today.

Tried to run hipfire on my desktop (gfx1100 and gfx1201) and it loads the model but doesn't produce any output at all. I haven't even enabled HIPFIRE_MMQ yet, so it's not caused by OP's change.

Will try on Strix Halo at another time, not working on my laptop today.

Glittering-Call8746@reddit

I have a 7900 XTX and a 7900 XT. It's been a year since I used them, though; drivers and support were horrible, so I moved to a 3080 10GB. Would you be so kind as to point me in the right direction to start installing with this repo?

UnbeliebteMeinung@reddit

Please compare with llama.cpp so we can see if this is working... Without it people are just wasting time trying it out.

Benchmark: prefill only, Radeon 8060S / gfx1151, 3 repeats

pp	llama f16 KV	llama q8_0 KV	hipfire MMQ q8 KV	vs llama q8_0
256	1111.9	1122.3	1120.6	-0.2%
512	1083.5	1047.5	1182.4	+12.9%
1024	1085.6	1070.0	1217.4	+13.8%
2048	1077.0	1053.0	1168.3	+11.0%

This is not a perfect apples-to-apples comparison. llama.cpp is using an Unsloth GGUF Q4_K_M model, while hipfire is using an MQ4 model.

Just a quick question, but does hipfire support using multiple GPUs? And if yes, does it support multiple different GPUs?

As far as I know, this is not implemented yet, but it should appear in the future. I can't test it.

Thank you

fivetide@reddit

this is really phenomenal. thank you so much! i encountered a little issue on batch sizes <128, https://github.com/Kaden-Schutt/hipfire/pull/84 should alleviate that. <3

onyxlabyrinth1979@reddit

Nice bump. The MMQ path makes sense for prefill: you are basically turning it into what GPUs are good at. I would watch KV cache correctness over long multi-turn runs; that is where subtle bugs hide. Also curious how it holds up under mixed batch sizes, not just long single prompts.