PFlash: 10x prefill speedup over llama.cpp at 128K on a RTX 3090

Posted by sandropuppo@reddit | LocalLLaMA

Hey fellow Llamas, thank you for all the nice words and great feedback on the last post I made. We have something new we thought would be useful to share. As always your time is precious, so I'll keep it short.

We built speculative prefill for long-context decode on quantized 27B targets, C++/CUDA only. A small drafter loaded in-process scores token importance over the full prompt; the heavy target only prefills the spans that matter.

Repo: github.com/Luce-Org/lucebox-hub (open source, MIT).

Head-to-head on Qwen3.6-27B Q4_K_M, RTX 3090, single-shot, cold cache: 24.8 s TTFT vs 248.4 s for vanilla llama.cpp = \~10.0× at 128K (and 13.5 s vs 134.95 s = 10.0× at 64K), with NIAH retrieval preserved end-to-end. No Python, no Triton, no PyTorch in the inference loop.

The problem

Q4_K_M Qwen3.6-27B on a 24 GB 3090 decodes fast (\~74 tok/s with DFlash spec decode), but prefill scales O(S²) in the prompt length S. On a 131K-token prompt, vanilla llama.cpp takes 248.4 s cold (llama-bench pp131072 --no-warmup -r 1, 527.6 tok/s). That is 4.1 minutes of staring at a blank screen before the first token. Decode is fast, but the wait kills the UX. Warmed steady-state is better (169.3 s at 128K) but still painful, and it grows quadratically as you push context.
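
The cold figure follows directly from the reported llama-bench throughput. A quick sanity check using only the numbers quoted above (the variable names are ours, for illustration):

```python
# Prefill time implied by a llama-bench throughput figure.
prompt_tokens = 131072      # pp131072
cold_tok_per_s = 527.6      # reported cold prefill throughput

cold_seconds = prompt_tokens / cold_tok_per_s
print(f"{cold_seconds:.1f} s = {cold_seconds / 60:.1f} min")  # prints "248.4 s = 4.1 min"
```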

Standing on shoulders

This work stands on two recent papers, both excellent reads: Speculative Prefill and FlashPrefill.

Our contribution is the C++/CUDA composition of these two algorithms, in-process, on a 24 GB consumer card. As far as we are aware, the two papers had not been combined in an open implementation before.

What we built

Setup

bash

# clone with submodules (pulls llama.cpp/ggml + Block-Sparse-Attention)
git clone --recurse-submodules https://github.com/Luce-Org/lucebox-hub
cd lucebox-hub/dflash

# build dflash + BSA kernel (sm_80+, ~10 min cold compile pulls cutlass)
cmake -B build -S . -DCMAKE_BUILD_TYPE=Release \
                    -DCMAKE_CUDA_ARCHITECTURES=86 \
                    -DDFLASH27B_ENABLE_BSA=ON
cmake --build build --target test_dflash test_flashprefill_kernels -j

# fetch weights (target + drafter + spec-decode draft)
huggingface-cli download unsloth/Qwen3.6-27B-GGUF Qwen3.6-27B-Q4_K_M.gguf --local-dir models/
huggingface-cli download Qwen/Qwen3-0.6B model.safetensors tokenizer.json --local-dir models/drafter/
huggingface-cli download z-lab/Qwen3.6-27B-DFlash --local-dir models/draft/

# bench
cd ../pflash && pip install -e .
python tests/niah_gen.py --n 1 --ctx 131072 --out /tmp/niah_128k.jsonl
python tests/bench_niah_cpp.py \
  --bin    ../dflash/build/test_dflash \
  --target ../dflash/models/Qwen3.6-27B-Q4_K_M.gguf \
  --draft  ../dflash/models/draft/model.safetensors \
  --drafter-gguf ../dflash/models/drafter/qwen3-0.6b.gguf \
  --cases /tmp/niah_128k.jsonl --keep-ratio 0.05

Numbers

Single-shot on RTX 3090, Qwen3.6-27B Q4_K_M target, q4_0 KV, DFLASH_FP_USE_BSA=1 DFLASH_FP_ALPHA=0.85 keep_ratio=0.05. NIAH single-needle is the end-to-end retrieval check. The baseline is vanilla llama.cpp with its default f16 KV, so the KV side is apples-to-oranges; we benchmarked q4_0 KV as costing \~3% spec-decode acceptance length (AL) at short context, 8.56 to 8.33.

| Context | PFlash TTFT | llama.cpp cold | Speedup (cold) | llama.cpp warmed |
|---|---|---|---|---|
| 64K | 13.5 s | 134.95 s | 10.0x | (smaller) |
| 128K | 24.8 s | 248.4 s | 10.0x | 169.3 s |

These are cold-cache numbers (first request after process boot). Warmed-vs-warmed is a smaller multiplier because llama.cpp settles into \~169 s at 128K once caches are hot. Both numbers are real and the right one depends on your workload; if you keep an engine resident, use warmed.
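
To make the cold-versus-warmed difference concrete, here is the arithmetic on the figures from the table above. Note the assumption: a warmed PFlash TTFT is not reported in this post, so the warmed line compares warmed llama.cpp against the single-shot PFlash number.

```python
# Speedup multipliers from the reported timings (seconds).
pflash_ttft = {"64K": 13.5, "128K": 24.8}
llama_cold = {"64K": 134.95, "128K": 248.4}
llama_warm_128k = 169.3

cold_speedup = {ctx: llama_cold[ctx] / pflash_ttft[ctx] for ctx in pflash_ttft}
warm_speedup_128k = llama_warm_128k / pflash_ttft["128K"]

for ctx, x in cold_speedup.items():
    print(f"{ctx}: {x:.1f}x vs cold llama.cpp")
print(f"128K: {warm_speedup_128k:.1f}x vs warmed llama.cpp")
```

At 128K that works out to roughly 6.8x against a warmed llama.cpp, versus 10.0x cold.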

Decode after prefill is the standard DFlash spec-decode path with DDTree (\~74 tok/s sustained on Qwen3.6-27B Q4_K_M).

Quality

NIAH single-needle (magic-key + 7-digit answer randomly placed in filler) retrieved at every context tested from 32K through 128K, keep_ratio=0.05, DFLASH_FP_ALPHA=0.85.
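
For readers unfamiliar with the probe, here is a minimal sketch of what a single-needle case looks like. This is not the repo's tests/niah_gen.py; the filler sentence, the needle wording, and the helper names are all made up for illustration.

```python
import random

def make_niah_case(n_filler_sentences, seed=0):
    """Build one single-needle case: filler text with a key/answer sentence hidden inside."""
    rng = random.Random(seed)
    answer = str(rng.randrange(10**6, 10**7))       # 7-digit number, no leading zero
    needle = f"The magic key is {answer}."
    filler = ["The grass is green and the sky is blue."] * n_filler_sentences
    position = rng.randrange(len(filler) + 1)       # random insertion point
    sentences = filler[:position] + [needle] + filler[position:]
    prompt = " ".join(sentences) + " What is the magic key?"
    return {"prompt": prompt, "answer": answer, "position": position}

def is_retrieved(model_reply, answer):
    """Pass if the 7-digit answer appears anywhere in the model's reply."""
    return answer in model_reply

case = make_niah_case(1000, seed=42)
print(case["position"], case["answer"], len(case["prompt"]))
```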

Honest flag: NIAH single-needle is a structurally easy probe for an attention-based selection method like ours, since the algorithm is well-suited to finding a single high-attention span. RULER and NIAH multi-needle are next on the list; a fair audit should wait for those numbers.

Why the stack works

Speculative prefill solves a quality problem: how do you compress without losing the answer-relevant content? FlashPrefill solves a speed problem inside the drafter step: how do you make the drafter fast enough at 128K that it doesn't become the bottleneck? They compose cleanly because the target side (DFlash spec decode) is unchanged; it just receives a much shorter prompt with full attention enabled.
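
The compression step can be pictured with a toy sketch. Assumptions, loudly: the fixed-size spans, the max-score ranking, and the function name are ours for illustration; the post does not describe how the C++/CUDA code forms or ranks spans.

```python
def select_spans(scores, span_len, keep_ratio):
    """Return sorted token indices to keep, choosing the highest-scoring fixed-size spans."""
    n = len(scores)
    spans = [(start, min(start + span_len, n)) for start in range(0, n, span_len)]
    # Rank each span by its most important token (an assumption; a mean would also work).
    ranked = sorted(spans, key=lambda s: max(scores[s[0]:s[1]]), reverse=True)
    budget = max(1, round(n * keep_ratio))
    kept, used = [], 0
    for start, end in ranked:
        if used >= budget:
            break
        kept.append((start, end))
        used += end - start
    # Restore prompt order so the target sees the surviving tokens in sequence.
    return [i for start, end in sorted(kept) for i in range(start, end)]

scores = [0.01] * 100       # hypothetical drafter importance, one value per token
scores[42] = 0.9            # the "needle" token
keep = select_spans(scores, span_len=5, keep_ratio=0.05)
print(keep)                 # prints [40, 41, 42, 43, 44]
```

The target then prefills only the tokens in `keep`, which is why its cost depends on the survivor count and not on the full prompt length.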

At 128K, drafter scoring is now the dominant cost (\~12 s of the 24.8 s TTFT). Target prefill on the compressed \~6.5K survivors is \~10 s; the remaining \~3 s is the park/unpark/free dance. The next obvious lever is a smaller or distilled drafter, which we have not done yet.
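
The same budget as a back-of-the-envelope calculation, using the approximate stage timings quoted above (the dictionary keys are just labels of ours):

```python
ttft_s = 24.8
stages_s = {"drafter scoring": 12, "target prefill": 10, "park/unpark/free": 3}  # approximate

survivors = round(131072 * 0.05)    # tokens the target actually prefills at keep_ratio=0.05
for name, seconds in stages_s.items():
    print(f"{name}: ~{seconds} s ({seconds / ttft_s:.0%} of TTFT)")
print(f"survivors: {survivors} tokens")
```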

Tuning

bash

DFLASH_FP_USE_BSA=1     # dispatch sparse FA forward through BSA (sm_80+, required for 10x)
DFLASH_FP_ALPHA=0.85    # block-selection threshold; higher = stricter = fewer K-blocks per Q-row
DFLASH_FP_PROFILE=1     # log per-stage timings (mean_K / score / select / forward)
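
To illustrate why a higher alpha means fewer K-blocks, here is a toy sketch. The selection rule used here (keep a block when its score is at least alpha times the best score in the row) is an assumption for illustration only and may not match what the kernel in the repo does; the scores are invented.

```python
def select_k_blocks(block_scores, alpha):
    """Keep the K-blocks whose score is at least alpha times the row's best score."""
    top = max(block_scores)
    return [i for i, s in enumerate(block_scores) if s >= alpha * top]

row = [0.20, 0.95, 0.90, 1.00, 0.10]        # hypothetical per-block scores for one Q-row
loose = select_k_blocks(row, alpha=0.85)    # [1, 2, 3]
strict = select_k_blocks(row, alpha=0.99)   # [3]
print(loose, strict)
```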

keep_ratio=0.05 is the default. Dropping it to 0.02 cuts target prefill from \~10 s to \~3 s but starts losing the needle. Raising DFLASH_FP_ALPHA to 0.99 cuts \~1 s at 128K with a small NIAH-margin loss. Calibration territory.
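
If you want to calibrate on your own prompts, a small sweep driver could look like the sketch below. It only reuses the bench command and environment variables shown in this post; the helper names and the sweep grid are ours, and the script paths assume you run it from the pflash directory.

```python
import os

def bench_command(keep_ratio, cases="/tmp/niah_128k.jsonl"):
    """Build argv for the bench script shown under Setup."""
    return [
        "python", "tests/bench_niah_cpp.py",
        "--bin", "../dflash/build/test_dflash",
        "--target", "../dflash/models/Qwen3.6-27B-Q4_K_M.gguf",
        "--draft", "../dflash/models/draft/model.safetensors",
        "--drafter-gguf", "../dflash/models/drafter/qwen3-0.6b.gguf",
        "--cases", cases,
        "--keep-ratio", str(keep_ratio),
    ]

def bench_env(alpha):
    """Copy the current environment and add the tuning variables."""
    env = dict(os.environ)
    env.update({"DFLASH_FP_USE_BSA": "1",
                "DFLASH_FP_ALPHA": str(alpha),
                "DFLASH_FP_PROFILE": "1"})
    return env

sweep = [(kr, a) for kr in (0.02, 0.05) for a in (0.85, 0.99)]
for keep_ratio, alpha in sweep:
    cmd, env = bench_command(keep_ratio), bench_env(alpha)
    print(f"DFLASH_FP_ALPHA={env['DFLASH_FP_ALPHA']}", " ".join(cmd))
    # To actually run it: subprocess.run(cmd, env=env, check=True)
```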

Any feedback is more than welcome!