Luce DFlash + PFlash on AMD Strix Halo: Qwen3.6-27B at 2.23x decode and 3.05x prefill vs llama.cpp HIP
Posted by sandropuppo@reddit | LocalLLaMA | 19 comments
Hey fellow Llamas, keeping it short.
We just shipped DFlash and PFlash support for the AMD Ryzen AI MAX+ 395 iGPU (gfx1151, Strix Halo, 128 GiB unified memory). Same Luce DFlash stack from the RTX 3090 post a couple weeks back, now running on the consumer AMD APU class.
Repo: https://github.com/Luce-Org/lucebox-hub (MIT)
TL;DR
- End-to-end on Qwen3.6-27B Q4_K_M with the Luce Q8_0 DFlash drafter: 26.85 tok/s decode and 20.2 s prefill at 16K context.
- That is 2.23x faster decode and 3.05x faster prefill than llama.cpp HIP on the same silicon. At a 16K-prompt + 1K-generation workload, total wall clock drops from 147 s to 58 s, 2.5x faster end to end (see the sketch below).
- The same 128 GiB box hosts checkpoints up to ~100 GiB, a class of models a 24 GiB consumer GPU cannot touch (Qwen3.5-122B-A10B, MiniMax-M2.7-REAP 139B-A10B, full BF16 27B).
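For anyone wanting to sanity-check the end-to-end claim, here is the back-of-envelope from the numbers above, assuming wall clock is roughly prefill time plus generated tokens divided by decode rate:

```python
# Decomposing the 2.5x end-to-end claim from the post's own numbers
# (our sketch; assumes wall clock ~= prefill + gen / decode rate).
gen = 1000
hip  = 61.69 + gen / 12.02   # ~144.9 s (post reports 147 s measured)
luce = 20.2  + gen / 26.85   # ~57.4 s  (post reports 58 s measured)
print(f"{hip:.0f} s vs {luce:.0f} s -> {hip / luce:.1f}x end to end")
```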
The numbers
- Hardware: Ryzen AI MAX+ 395, Radeon 8060S iGPU (gfx1151), 128 GiB LPDDR5X-8000, ROCm 7.2.2
- Target: Qwen3.6-27B Q4_K_M (15.65 GiB)
- Drafter: Lucebox/Qwen3.6-27B-DFlash-GGUF Q8_0 with DFLASH27B_DRAFT_SWA=2048
- Bench: 10-prompt HumanEval-style, --n-gen 128 --ddtree-budget 22 --fast-rollback
Decode (Qwen3.6-27B Q4_K_M, tok/s):
| Engine | tok/s | vs AR |
|---|---|---|
| llama.cpp HIP AR | 12.02 | 1.00x |
| llama.cpp Vulkan AR | 12.45 | 1.04x |
| Luce DFlash (this PR) | 26.85 | 2.23x |
Prefill (Qwen3.6-27B, 16K tokens):
| Engine | TTFT | vs AR |
|---|---|---|
| llama.cpp HIP AR | 61.69 s | 1.00x |
| Luce PFlash | 20.2 s | 3.05x |
Speedup grows with context: PFlash compress is O(S), while AR prefill is O(S^2). NIAH retrieval still passes at 16K.
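To make the scaling claim concrete, here is a crude extrapolation (ours, not a measurement) that fits one constant per curve to the 16K data point above:

```python
# Why the prefill speedup should grow with context: fit one constant
# each to the 16K point (61.69 s AR, 20.2 s PFlash) and compare.
S0 = 16_384
k2 = 61.69 / S0**2   # AR prefill modeled as k2 * S^2
k1 = 20.2 / S0       # PFlash compress modeled as k1 * S
for S in (16_384, 32_768, 131_072):
    print(f"{S:>7} tokens: {k2 * S**2 / (k1 * S):.1f}x")
# 16K: 3.1x, 32K: 6.1x, 128K: 24.4x. The post's own 128K projection
# (7-10x) is lower: real prefill isn't purely quadratic and PFlash
# carries its own overheads, so treat the trend, not the numbers.
```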
Tuning note: --ddtree-budget=22 is the gfx1151 optimum. Higher budgets accept more tokens per step, but each step gets more expensive on LPDDR5X; bandwidth caps the benefit before tile utilization pays off. Contrast with gfx1100 (7900 XTX, GDDR6 at 936 GB/s), where budget=8 wins because tile waste matters more than launch amortization. The shipped default is arch-aware.
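A toy throughput model makes the budget split plausible. The per-step fixed cost below is just the target weights streamed once at each bus's bandwidth; the per-draft-token costs are pure assumptions standing in for tile waste and launch overhead, chosen to show the shape rather than reproduce the exact optima:

```python
import numpy as np

MODEL_GB = 15.65  # Q4_K_M target weights streamed once per verify step

def tok_per_s(budget, bw_gbs, per_tok_s, max_accept=8.0, b0=6.0):
    # Accepted tokens per step saturate as the draft budget grows.
    accepted = max_accept * (1 - np.exp(-budget / b0))
    # Step time = stream the weights once + assumed per-draft-token cost.
    step_s = MODEL_GB / bw_gbs + per_tok_s * budget
    return accepted / step_s

budgets = np.arange(1, 33)
strix = tok_per_s(budgets, bw_gbs=273, per_tok_s=0.0006)  # bandwidth-starved iGPU
xtx   = tok_per_s(budgets, bw_gbs=936, per_tok_s=0.0030)  # tile-waste-bound dGPU

print("toy gfx1151 optimum:", budgets[strix.argmax()])  # 18 (measured: 22)
print("toy gfx1100 optimum:", budgets[xtx.argmax()])    # 7  (measured: 8)
```

The point is the ratio of fixed to marginal cost per step: when streaming the weights dominates (LPDDR5X), big budgets amortize it; when it is cheap (GDDR6), per-token waste pushes the optimum down.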
Reproduce
```bash
# 1. Build PR #119 for gfx1151
git clone https://github.com/Luce-Org/lucebox-hub.git
cd lucebox-hub
git fetch origin pull/119/head:pr119 && git checkout pr119
git submodule update --init --recursive
cd dflash
cmake -B build -S . \
-DCMAKE_BUILD_TYPE=Release \
-DDFLASH27B_GPU_BACKEND=hip \
-DDFLASH27B_HIP_ARCHITECTURES=gfx1151 \
-DDFLASH27B_HIP_SM80_EQUIV=ON
cmake --build build --target test_dflash -j
# 2. Models: Qwen3.6-27B target + Lucebox Q8_0 DFlash drafter
mkdir -p models/draft
hf download unsloth/Qwen3.6-27B-GGUF Qwen3.6-27B-Q4_K_M.gguf --local-dir models/
hf download Lucebox/Qwen3.6-27B-DFlash-GGUF dflash-draft-3.6-q8_0.gguf --local-dir models/draft/
# 3. Bench (DFlash decode + PFlash long-context prefill)
LD_LIBRARY_PATH=/opt/rocm/lib:$LD_LIBRARY_PATH \
DFLASH_BIN=$PWD/build/test_dflash \
DFLASH_TARGET=$PWD/models/Qwen3.6-27B-Q4_K_M.gguf \
DFLASH_DRAFT=$PWD/models/draft/dflash-draft-3.6-q8_0.gguf \
DFLASH27B_DRAFT_SWA=2048 \
DFLASH27B_PREFILL_UBATCH=512 \
python3 scripts/bench_he.py --n-gen 128 --ddtree-budget 22
```
`DFLASH27B_PREFILL_UBATCH=512` applies the PR #159 fix on top of PR #119; once #159 merges, it becomes the daemon default.
What is still missing
- BSA scoring kernel on HIP. The drafter compress-score path uses BSA (block-sparse attention) on CUDA. PR #119 disables it on HIP and falls back to ggml's flash_attn_ext, which the daemon's own warning flags as ~3.4x slower. A rocWMMA-native sparse-FA kernel closes the gap; once it lands, PFlash TTFT at 16K drops from 27.6 s to roughly 8 s, and at 128K we project 7-10x over llama.cpp AR. (A toy sketch of the BSA idea follows this list.)
- Multi-row q4_K decode GEMV. An RDNA-native multi-row pattern (R=4-8 output rows sharing activation register state) for the drafter forward, currently 30% of compress time at long context.
- Phase 2 tile shape tuning for gfx1151. Current rocWMMA flashprefill tiles are tuned for gfx1100. Strix Halo has different LDS and VGPR characteristics.
- 70B+ MoE targets. 128 GiB headroom is wasted on a 27B. Qwen3.5-122B-A10B and MiniMax-M2.7-REAP 139B-A10B both fit. DFlash math ports cleanly to MoE; big work is wiring the expert-routed forward into the spec verify loop.
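Since BSA keeps coming up, here is a minimal toy sketch of the block-sparse attention idea from the first bullet, assuming a single query vector and a block-mean proxy score. The real kernel's block selection and scoring are more involved; treat this as illustration only:

```python
import numpy as np

# Toy block-sparse attention: attend only to the top-k key *blocks*
# ranked by a cheap block-level score, so compute scales with k
# instead of the full sequence length.
def block_sparse_attention(q, k, v, block=64, topk=4):
    S, d = k.shape
    nblk = S // block
    kb = k[: nblk * block].reshape(nblk, block, d)
    vb = v[: nblk * block].reshape(nblk, block, d)
    blk_scores = q @ kb.mean(axis=1).T        # cheap proxy: q vs block-mean key
    keep = np.argsort(blk_scores)[-topk:]     # indices of the best blocks
    k_sel = kb[keep].reshape(-1, d)           # gather only the kept blocks
    v_sel = vb[keep].reshape(-1, d)
    att = q @ k_sel.T / np.sqrt(d)            # dense attention on the subset
    w = np.exp(att - att.max())
    w /= w.sum()
    return w @ v_sel

q = np.random.randn(64)
k = np.random.randn(4096, 64)
v = np.random.randn(4096, 64)
out = block_sparse_attention(q, k, v)
```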
Constraints
ROCm 7.2.2+, gfx1151 tuned (gfx1100 also supported with arch-aware defaults), greedy verify only, no Vulkan / Metal / multi-GPU on this path yet.
We're working hard on this but we know we need to improve on many things.
Feedback is more than welcome :)
Due_Net_3342@reddit
“128 GiB headroom is wasted on a 27B.” no it is not, you should run q8 or full version
hurdurdur7@reddit
Even if you run it at Q8 you would only be using half of the VRAM. And it would be 2x slower than the numbers shown here, not usable for realtime tasks.
Really sad that AMD didn't find a better memory solution for this platform; this would have been an immense box at 600 GB/s memory bandwidth.
snapo84@reddit
single session yes.... how does the vram consumption look if i do 16 parallel 131k ctx? ......
hurdurdur7@reddit
With the computing power of an APU that has less than 300GB/s memory bandwidth? Nobody would ever know ...
RevolutionaryPick241@reddit
Could you add instructions for running it like llama-server? I see the project includes a server.py; is it meant to be run as a llama-server replacement? And is mmproj supported?
lumos675@reddit
It's not a PR to llama.cpp, unfortunately. Doesn't seem worth the time.
AI-Agent-Payments@reddit
The angle nobody's raised yet: Strix Halo's unified memory bandwidth is shared between the CPU and iGPU, so under concurrent load (OS, background tasks, another process doing anything memory-intensive) your 204 GB/s effective bandwidth can drop noticeably and these decode numbers will regress more than they would on a discrete GPU with dedicated VRAM. Worth running the same bench with a browser and a few background processes alive to see how stable the 26.85 tok/s actually is in practice. On a machine I tested with similar unified memory architecture, peak-to-floor variance under light multitasking was around 15-20%, which matters a lot if you're quoting wall-clock for real workloads.
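A quick way to probe that (a sketch on our part, not part of the posted bench): run a memory-bandwidth hog next to the benchmark and compare tok/s:

```python
import numpy as np

# Crude background bandwidth hog: run this in a second terminal while
# re-running the bench, then compare decode tok/s. Ctrl-C to stop.
src = np.random.randn(1 << 26)   # 512 MiB of float64
dst = np.empty_like(src)
while True:
    np.copyto(dst, src)          # streams ~1 GiB over the bus per pass
```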
laul_pogan@reddit
Running Spark GB10 (also 128 GB LPDDR5X, different arch), and the bandwidth ceiling comment in the post is the real story here. Measured ~273 GB/s peak on that bus. At Q4 for a 27B model, you're moving ~15 GB per decode step at batch=1, so theoretical max sits around 18 tok/s before any kernel overhead. Their llama.cpp 12 tok/s baseline lands right where you'd expect with unoptimized ROCm ops eating the gap.
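That arithmetic as a one-liner, for anyone who wants to plug in their own bus:

```python
# Roofline check on the comment above: batch=1 decode streams every
# active weight byte per token, so bandwidth caps tok/s.
bw_gbs   = 273     # measured LPDDR5X peak from this comment (GB/s)
model_gb = 15      # ~Q4 27B weights streamed per decode step (GB)
print(bw_gbs / model_gb)   # ~18 tok/s ceiling before kernel overhead
```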
The `--ddtree-budget` arch split (budget=22 for gfx1151, budget=8 for gfx1100) makes sense from first principles. On LPDDR5X, each speculative step costs bandwidth whether the draft accepts or not, so you need a high acceptance rate to amortize it. GDDR6 at 936 GB/s has ~3.4x the headroom, so you can afford more wasted steps before tile launch overhead wins instead.
One practical note for anyone loading Qwen3.x checkpoints from HF: some weights carry multimodal config lineage even when text-only. Watch for `mrope_section` keys in `config.json` and a `model.language_model.*` prefix in the safetensors shards; both need stripping before a clean load. This shouldn't affect GGUF conversion directly, but if your conversion script reads the HF config for rope params, the mrope path silently produces garbage rotary embeddings at long context. Worth a sanity check on NIAH at 32K+ if scores look off.
The BSA kernel gap on HIP is the real unlock waiting. A ~3.4x-slower fallback to ggml flash_attn_ext means PFlash is currently underperforming its own projection.
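For the checkpoint cleanup described above, a hedged sketch; where `mrope_section` actually lives varies by checkpoint, so inspect your config.json before trusting these exact paths:

```python
import json

# Strip mrope lineage from a text-only Qwen3.x config (sketch only).
# Stripping the model.language_model.* prefix from the safetensors
# shards is a separate step not shown here.
with open("config.json") as f:
    cfg = json.load(f)
cfg.pop("mrope_section", None)
if isinstance(cfg.get("rope_scaling"), dict):
    cfg["rope_scaling"].pop("mrope_section", None)
with open("config.json", "w") as f:
    json.dump(cfg, f, indent=2)
```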
Old-Sherbert-4495@reddit
quick question, does dflash only support q4km?? can't I use q3 something from unsloth?? to fit everything in 16 gb vram?
Dazzling_Equipment_9@reddit
Awesome! The speed boost is seriously exciting.
That said, I’m curious about the quality and stability. Does the output quality degrade significantly as the context length increases?
Also, to be honest, Q4_K_M isn’t really practical on Strix Halo. You might want to try validating it with Q8_0 quantization instead.
Thanks again for all the great work!
Edenar@reddit
Yeah, I guess it depends how much DFlash degrades quality vs prefill speed. But for decode, I got the same results with the MTP PR from llama.cpp (Vulkan/RADV backend). And I also realized while testing that Q4 is really not that good; I would say it's inferior to 35B-A3B at Q8, for example. So since Strix Halo has high memory capacity, maybe switch to Q8_K_XL or at least a Q6 something.
Dany0@reddit
Here's my feedback, you say it's welcome, so here you go:
DFlash has been proven not useful again and again, because the drafter models we have have only been trained up to, what, like 32k?
And in practice it's only really faster up to 4k tokens. If you have a use case like that, congrats.
But just my base system prompt is 35k, and that's after trimming it significantly from 64k, though that was by making it more context dependent, so in practice, depending on the task, it's closer to the original.
I'm getting ~112 tok/s on my 5090 with vllm b12x kernel + MTP, vision, and 172k context, on a mix of coding & general tasks. Nothing else has come close so far.
Glittering-Call8746@reddit
Cuda 12.x or 13.x ?
Anbeeld@reddit
I hope this time it's less completely broken? When I tried it based on that post, it surely was, with no disclaimers about all the issues in the post body, for some reason.
sandropuppo@reddit (OP)
Sorry to hear that. We do our best to make it work well for everyone. We've fixed a lot of things over the last few weeks, and I think our DFlash is better now. What errors/issues did you see? Would love to help.
Anbeeld@reddit
Turned out your server implementation had ~3 GB VRAM spikes, so the advertised config didn't actually work on a 3090 outside of one-off prompts. I did 2 PRs fixing them in your repo, only to find out that reasoning is output as plain text, and tool calls didn't work at all, like they just didn't.
I understand you are writing your own thing from scratch, but if it's a technical demo, just say so instead of baiting. None of the issues I've met were mentioned in that 3090 post; on the contrary, it was presented as some kind of plug-and-play solution, down to "no llama.cpp installation required".
Sufficient-Bid3874@reddit
Didn't read the whole post, so it might be there.
Is PFLASH lossless?
sandropuppo@reddit (OP)
PFlash is not lossless, but you can decide how much context to maintain.
Anbeeld@reddit
It's not.