I got 3× faster HFQ4 prefill on Strix Halo in hipfire with an opt-in MMQ path
Posted by Own_Suspect5343@reddit | LocalLLaMA | 13 comments
I recently contributed an experimental HFQ4-G256 MMQ prefill path to hipfire, an RDNA-focused LLM inference engine.
Disclaimer: I authored the PR, so this is partly a contribution note, but I am mainly looking for independent validation from other AMD users.
Before this PR, HFQ4 prefill in hipfire went through a more generic, slower path. On my Strix Halo system, prompt processing was clearly the bottleneck: longer prefills ran at roughly 310–340 tok/s.
The new path adds an opt-in MMQ-style prefill implementation. In this context, MMQ means a specialized quantized matrix-multiplication path: instead of treating prefill as a sequence of less-optimized operations, it packs the work into tiled matrix-matrix kernels that are better suited to GPU execution. The implementation pre-quantizes prefill activations into a Q8_1 MMQ layout and uses i8 WMMA over 128×128 output/batch tiles with LDS staging.
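For reference, a Q8_1 block holds one int8 quant per value plus a per-block scale and a sum term used as a correction during accumulation. Below is a minimal C++ sketch of that activation quantization step, assuming a llama.cpp-style 32-value block; the struct and function names are illustrative, not hipfire's actual internals (which also rearrange blocks into the MMQ tile layout and may use half-precision fields).

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>

// Illustrative Q8_1-style block: one scale, one sum term, 32 int8 quants.
// Assumption: mirrors llama.cpp's block_q8_1 in spirit; hipfire's real
// layout may differ.
struct BlockQ8_1 {
    float  d;       // dequantization scale
    float  s;       // d * sum(qs), used as a per-block correction term
    int8_t qs[32];  // quantized activation values
};

// Quantize 32 float activations into one Q8_1 block.
BlockQ8_1 quantize_block_q8_1(const float* x) {
    BlockQ8_1 b{};
    float amax = 0.0f;
    for (int i = 0; i < 32; ++i) amax = std::max(amax, std::fabs(x[i]));
    b.d = amax / 127.0f;
    const float inv_d = (b.d != 0.0f) ? 1.0f / b.d : 0.0f;
    int sum = 0;
    for (int i = 0; i < 32; ++i) {
        int q = static_cast<int>(std::lround(x[i] * inv_d));
        q = std::clamp(q, -127, 127);
        b.qs[i] = static_cast<int8_t>(q);
        sum += q;
    }
    b.s = b.d * static_cast<float>(sum);
    return b;
}
```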
After enabling it with:
HIPFIRE_MMQ=1
I see longer-prefill throughput of roughly 1140–1260 tok/s on Strix Halo / gfx1151.
What changed:
- Adds an opt-in `HIPFIRE_MMQ=1` path for HFQ4-G256 prefill.
- Targets RDNA3 / RDNA3.5 for now: `gfx1100`, `gfx1101`, `gfx1102`, `gfx1103`, `gfx1150`, `gfx1151`.
- Pre-quantizes prefill activations into a Q8_1 MMQ layout.
- Uses i8 WMMA over 128×128 output/batch tiles with LDS staging (see the sketch after this list).
- Similar in shape to llama.cpp’s AMD MMQ prompt-processing path.
- Not enabled by default.
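To make the tiling concrete, here is a minimal HIP sketch of a 128×128 output tile staged through LDS. This is not hipfire's kernel: the inner loop uses a plain int8 dot product where the real path would issue RDNA3 i8 WMMA instructions, the Q8_1/HFQ4 scale and correction application is omitted, and all names and tile parameters are illustrative.

```cpp
#include <hip/hip_runtime.h>
#include <cstdint>

constexpr int TILE  = 128; // output tile edge (128x128 per workgroup)
constexpr int KSTEP = 32;  // K-slice staged through LDS per iteration

// Launched with grid = (ceil(N/TILE), ceil(M/TILE)) and e.g. a 16x16 block.
// C must be zero-initialized by the host; A is MxK activations (int8),
// B is NxK weights (int8), both row-major with K contiguous.
__global__ void mmq_prefill_tile(const int8_t* A, const int8_t* B,
                                 int32_t* C, int M, int N, int K) {
    __shared__ int8_t As[TILE][KSTEP];
    __shared__ int8_t Bs[KSTEP][TILE];

    const int row0     = blockIdx.y * TILE;
    const int col0     = blockIdx.x * TILE;
    const int tid      = threadIdx.y * blockDim.x + threadIdx.x;
    const int nthreads = blockDim.x * blockDim.y;

    for (int k0 = 0; k0 < K; k0 += KSTEP) {
        // Stage the current K-slice of A and B into LDS.
        for (int i = tid; i < TILE * KSTEP; i += nthreads) {
            const int r = i / KSTEP, c = i % KSTEP;
            As[r][c] = (row0 + r < M && k0 + c < K) ? A[(row0 + r) * K + k0 + c] : 0;
            Bs[c][r] = (col0 + r < N && k0 + c < K) ? B[(col0 + r) * K + k0 + c] : 0;
        }
        __syncthreads();

        // Accumulate this K-slice. In the real kernel this is where the int8
        // WMMA fragments would be fed instead of a scalar loop, and the
        // accumulators would stay in registers rather than round-tripping C.
        for (int i = tid; i < TILE * TILE; i += nthreads) {
            const int r = i / TILE, c = i % TILE;
            int32_t acc = 0;
            for (int k = 0; k < KSTEP; ++k)
                acc += int32_t(As[r][k]) * int32_t(Bs[k][c]);
            if (row0 + r < M && col0 + c < N)
                C[(row0 + r) * N + (col0 + c)] += acc;
        }
        __syncthreads();
    }
}
```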
Benchmark: Qwen3.5 9B HFQ4/MQ4 on Strix Halo / gfx1151
| KV mode | pp (tokens) | MMQ off (tok/s) | MMQ on (tok/s) | Speedup |
|---|---|---|---|---|
| q8 | 256 | 363.1 | 1127.6 | 3.11x |
| q8 | 512 | 352.0 | 1179.8 | 3.35x |
| q8 | 1024 | 328.9 | 1222.7 | 3.72x |
| q8 | 2048 | 318.2 | 1168.5 | 3.67x |
| asym4 | 256 | 368.6 | 1108.8 | 3.01x |
| asym4 | 512 | 360.7 | 1173.3 | 3.25x |
| asym4 | 1024 | 333.9 | 1223.0 | 3.66x |
| asym4 | 2048 | 312.3 | 1151.7 | 3.69x |
| asym3 | 256 | 361.4 | 1124.5 | 3.11x |
| asym3 | 512 | 359.8 | 1187.3 | 3.30x |
| asym3 | 1024 | 329.9 | 1259.1 | 3.82x |
| asym3 | 2048 | 314.1 | 1216.5 | 3.87x |
| asym2 | 256 | 374.0 | 1116.2 | 2.98x |
| asym2 | 512 | 356.6 | 1173.2 | 3.29x |
| asym2 | 1024 | 340.1 | 1208.5 | 3.55x |
| asym2 | 2048 | 311.4 | 1142.9 | 3.67x |
So on longer prefills, this moved my Strix Halo results from roughly 311–340 tok/s to roughly 1143–1259 tok/s.
Correctness validation so far:
- batched prefill compared against sequential token-by-token forward pass (see the sketch after this list)
- final prefill top token match
- selected-logit drift within tolerance
- next decode step after prefill also checked, to catch KV-cache write problems
- tested across `q8`, `asym4`, `asym3`, `asym2` KV modes
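A sketch of the kind of check described in the list above: compare the final-position logits from the batched prefill against a sequential token-by-token reference, requiring the top token to match and the selected logit to stay within a tolerance. Function names and the tolerance value are illustrative, not hipfire's actual test harness.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Index of the largest logit (the top token).
size_t argmax(const std::vector<float>& v) {
    size_t best = 0;
    for (size_t i = 1; i < v.size(); ++i)
        if (v[i] > v[best]) best = i;
    return best;
}

// True if the batched prefill path agrees with the sequential reference:
// same top token, and drift on that selected logit within tolerance.
bool prefill_matches(const std::vector<float>& batched,
                     const std::vector<float>& reference,
                     float tol = 5e-2f) {
    assert(batched.size() == reference.size());
    const size_t top_b = argmax(batched);
    const size_t top_r = argmax(reference);
    if (top_b != top_r) return false;                            // top token must match
    return std::fabs(batched[top_b] - reference[top_b]) <= tol;  // selected-logit drift
}
```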
Caveats:
- validated by me mainly on one Strix Halo / `gfx1151` system
- the path is experimental
- it is not enabled by default
- I would not call this the final/canonical MMQ implementation yet
- more coherence and long-context testing would be useful
The maintainer also tested the merged path on gfx1100 and reported that HIPFIRE_MMQ=1 runs cleanly there, with a smaller but still positive result: +19.8% on 4B pp256.
What I would especially like to check now is whether this implementation generalizes well across other AMD GPUs and APUs, or whether the current tuning is mostly favorable to Strix Halo / gfx1151.
The basic correctness checks pass, but I am not yet fully confident that the KV-cache behavior is completely bulletproof. Subtle KV-cache issues might only appear in longer real workloads, so I would especially appreciate validation on long-context and multi-turn runs.
I would be very interested in results from people with:
- 7900 XTX / `gfx1100`
- other RDNA3 cards
- Strix Halo / `gfx1151`
- RDNA3.5 APUs
- and more
- long-context agentic workloads where prefill matters more than short chat decode
Flamenverfer@reddit
I ran a very quick one last night
input:
$ HIP_VISIBLE_DEVICES=0 hipfire run ~/.hipfire/models/qwen3.5-27b.mq4 "Write me a very long python script"
output:
GPU: gfx1100 (25.8 GB VRAM, HIP 6.3)
pre-compiled kernels: .hipfire_kernels/gfx1100
[hipfire] DFlash disabled (dflash_mode=off).
loading token_embd... (Q8_0 raw, 1350 MB)
Layers
loading layer 63/64 (FullAttention)... KV cache: asym3 (K rotated-3b 100B + V Q8 272B = 372 B/head, 5.5x vs fp32, physical_cap=32768 / max_seq=32768)
[qwen3_5] 5120d 64L 248320 vocab
[512 tok, 42 tok/s]
TheCTRL@reddit
Excellent benchmarks, but are answers with long context still consistent? Any loop issues, and does tool calling still work on long tasks?
Own_Suspect5343@reddit (OP)
I haven’t tested this yet. I think I’ll have time tomorrow to look into this and make a fix if needed.
spaceman_@reddit
I have gfx1151, gfx1100 and gfx1201 hardware. I'll give this a shot later today.
spaceman_@reddit
Tried to run hipfire on my desktop (gfx1100 and gfx1201) and it loads the model but doesn't produce any output at all. I haven't even enabled `HIPFIRE_MMQ` yet, so it's not caused by OP's change. Will try on Strix Halo at another time; not working on my laptop today.
Glittering-Call8746@reddit
I have a 7900 XTX and a 7900 XT. It's been a year since I used them though; drivers and support were horrible, so I moved to a 3080 10GB. Would you be so kind as to point me in the right direction to start installing with this repo?
UnbeliebteMeinung@reddit
Please compare with llama.cpp so we can see if this is working... Without it, people are just wasting time trying it out.
Own_Suspect5343@reddit (OP)
Benchmark: prefill only, Radeon 8060S / `gfx1151`, 3 repeats
This is not a perfect apples-to-apples comparison: llama.cpp is using an unsloth GGUF `Q4_K_M` model, while hipfire is using an `MQ4` model.
spaceman_@reddit
Just a quick question, but does hipfire support using multiple gpus? And if yes, does it support multiple different gpus?
Own_Suspect5343@reddit (OP)
As far as I know, this is not implemented yet, but it should appear in the future. I can't test it.
UnbeliebteMeinung@reddit
Thank you
fivetide@reddit
this is really phenomenal. thank you so much! i encountered a little issue on batch sizes <128, https://github.com/Kaden-Schutt/hipfire/pull/84 should alleviate that. <3
onyxlabyrinth1979@reddit
Nice bump. The MMQ path makes sense for prefill; you are basically turning it into what GPUs are good at. I would watch KV cache correctness over long multi-turn runs, that is where subtle bugs hide. Also curious how it holds up under mixed batch sizes, not just long single prompts.