PFlash: 10x prefill speedup over llama.cpp at 128K on an RTX 3090
Posted by sandropuppo@reddit | LocalLLaMA | View on Reddit | 90 comments
Hey fellow Llamas, thank you for all the nice words and great feedback on the last post I made. We have something new we thought would be useful to share. As always your time is precious, so I'll keep it short.
We built speculative prefill for long-context decode on quantized 27B targets, C++/CUDA only. A small drafter loaded in-process scores token importance over the full prompt; the heavy target only prefills the spans that matter.
Repo: github.com/Luce-Org/lucebox-hub (open source, MIT).
Head-to-head on Qwen3.6-27B Q4_K_M, RTX 3090, single-shot: 24.8 s TTFT vs ~248 s for vanilla llama.cpp = ~10.0× at 128K (and 13.5 s vs 134.95 s = 10.0× at 64K), with NIAH retrieval preserved end-to-end. No Python, no Triton, no PyTorch in the inference loop.
The problem
Q4_K_M Qwen3.6-27B on a 24 GB 3090 decodes fast (~74 tok/s with DFlash spec decode), but prefill scales O(S²). On a 131K-token prompt, vanilla llama.cpp takes 248.4 s cold (llama-bench pp131072 --no-warmup -r 1, 527.6 tok/s). That is 4.1 minutes staring at a blank screen before the first token. Decode is fast, but the wait kills the UX. Warmed steady-state is better (169.3 s at 128K) but still painful, and grows quadratically as you push context.
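For intuition on where the quadratic comes from, here is a rough cost model (my simplification for this post, not something measured out of the repo): every one of the S prompt positions attends over up to S keys, while the projection/MLP work is only linear in S.

```latex
% Rough single-request prefill cost over a prompt of length S:
%   a*S   -> projections / MLP (linear in prompt length)
%   b*S^2 -> attention (each of the S query positions scans up to S keys)
T_{\mathrm{prefill}}(S) \;\approx\; a\,S \;+\; b\,S^{2}
% The b*S^2 attention term is what keeps growing as you push context further.
```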
Standing on shoulders
This work stands on two recent papers (both excellent reads) plus two open-source building blocks:
- Speculative Prefill (Liu et al, arXiv 2502.02789) and Cross-Family Speculative Prefill (SambaNova, ICLR 2026). Insight: a small draft model's attention pattern over a long prompt faithfully predicts which tokens matter for the answer. Run the draft, score per-token importance, keep the top spans, drop the rest.
- FlashPrefill (Fan et al, 2026). Block-sparse attention so the drafter itself does not pay O(S²) at 128K.
- mit-han-lab/Block-Sparse-Attention (BSA) for the FA-2-derived sm_80+ sparse forward.
- ggml / llama.cpp for the runtime. We link libggml*.a and never libllama.
Our contribution is the C++/CUDA composition of these two algorithms, in-process, on a 24 GB consumer card. As far as we are aware, the two papers had not been combined in an open implementation before.
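For readers who want the selection step spelled out, here is a minimal host-side sketch of the speculative-prefill idea in plain C++. It is illustrative only: `Span`, `select_spans`, and the padding/merging rule are my own names and assumptions, not the repo's API; the real importance scores come from the drafter's attention toward the final prompt positions.

```cpp
#include <algorithm>
#include <cstdint>
#include <numeric>
#include <vector>

// One contiguous run of surviving prompt tokens, [begin, end).
struct Span { int begin; int end; };

// Speculative prefill, boiled down:
//  1) the drafter has already produced one importance score per prompt token,
//  2) keep the top keep_ratio fraction of tokens,
//  3) pad and merge the kept positions into contiguous spans so the target
//     prefills coherent chunks (at their original positions, for RoPE).
std::vector<Span> select_spans(const std::vector<float>& importance,
                               float keep_ratio, int pad) {
    const int n = static_cast<int>(importance.size());
    if (n == 0) return {};
    const int keep = std::max(1, std::min(n, static_cast<int>(n * keep_ratio)));

    // Rank token indices by importance, highest first.
    std::vector<int> order(n);
    std::iota(order.begin(), order.end(), 0);
    std::partial_sort(order.begin(), order.begin() + keep, order.end(),
                      [&](int a, int b) { return importance[a] > importance[b]; });

    // Mark survivors, padding each hit so a little local context survives too.
    std::vector<uint8_t> kept(n, 0);
    for (int i = 0; i < keep; ++i) {
        const int lo = std::max(0, order[i] - pad);
        const int hi = std::min(n, order[i] + pad + 1);
        std::fill(kept.begin() + lo, kept.begin() + hi, uint8_t{1});
    }

    // Merge marked positions into contiguous spans.
    std::vector<Span> spans;
    for (int i = 0; i < n; ++i) {
        if (!kept[i]) continue;
        int j = i;
        while (j < n && kept[j]) ++j;
        spans.push_back({i, j});
        i = j;
    }
    return spans;
}
```

The heavy target then runs full attention over only the tokens inside these spans (roughly keep_ratio × prompt length of them), and decode continues unchanged.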
What we built
- In-process composition. Drafter forward (custom Qwen3-0.6B BF16 ggml graph), FlashPrefill scoring, sparse attention, target prefill, and DFlash spec decode all run in one C++/CUDA process sharing one ggml allocator. No subprocess, no IPC, no Python, Triton, or PyTorch in the inference loop.
- CUDA port of FlashPrefill. The reference (qhfan/FlashPrefill) is Triton. We wrote 4 CUDA kernels from scratch (mean_K, score, select, sparse_fwd; a rough CPU sketch of the selection logic follows this list) and dispatched the sparse forward through mit-han-lab/Block-Sparse-Attention. BSA ships as a libtorch C++ extension; pulling 2 GB of libtorch into a 24 GB inference loop was a non-starter, so we wired it in via a 3-header ATen/c10 stub set under dflash/deps/bsa_stubs/.
- 24 GB memory orchestration. Drafter (1.3 GB weights + KV + ~600 MB BSA scratch at 128K) and the DFlash daemon (15 GB target + 3 GB draft + 3 GB KV) do not coexist on a 3090. The daemon parks, unparks, and frees weights between stages over a stdin protocol; ~3 s per request, and it makes the whole pipeline fit on a single consumer card.
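The sketch below shows, as a CPU reference, roughly what the mean_K / score / select stages in the list above compute for one Q-row block. The actual kernels run on GPU blocks, and the exact thresholding rule behind DFLASH_FP_ALPHA is my assumption (the post only says higher alpha = stricter = fewer K-blocks per Q-row), so treat this as a reading of the algorithm rather than the repo's implementation.

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// CPU reference of FlashPrefill-style block selection for one Q-row block.
//   K       : [n_tokens][head_dim] keys of one drafter head
//   q_block : [head_dim] pooled query representing one block of Q rows
//   block   : tokens per K block (BSA operates on fixed-size blocks)
//   alpha   : selection threshold; higher -> stricter -> fewer K blocks kept
std::vector<int> select_k_blocks(const std::vector<std::vector<float>>& K,
                                 const std::vector<float>& q_block,
                                 int block, float alpha) {
    const int d = static_cast<int>(q_block.size());
    const int n_blocks = (static_cast<int>(K.size()) + block - 1) / block;
    if (n_blocks == 0) return {};

    // mean_K: pool each block of `block` keys down to one d-dim vector.
    std::vector<std::vector<float>> pooled(n_blocks, std::vector<float>(d, 0.f));
    for (size_t t = 0; t < K.size(); ++t)
        for (int c = 0; c < d; ++c)
            pooled[t / block][c] += K[t][c] / static_cast<float>(block);

    // score: one scaled dot product per (Q-row block, K block).
    std::vector<float> score(n_blocks, 0.f);
    for (int b = 0; b < n_blocks; ++b) {
        for (int c = 0; c < d; ++c) score[b] += q_block[c] * pooled[b][c];
        score[b] /= std::sqrt(static_cast<float>(d));
    }

    // select: always keep the best block, plus any block scoring within alpha
    // of the row maximum (one plausible "threshold" reading; the repo's exact
    // rule may differ). The surviving block list then drives sparse_fwd, the
    // FA-2-derived sparse attention forward dispatched through BSA.
    const int best = static_cast<int>(
        std::max_element(score.begin(), score.end()) - score.begin());
    std::vector<int> keep;
    for (int b = 0; b < n_blocks; ++b)
        if (b == best || score[b] >= alpha * score[best]) keep.push_back(b);
    return keep;
}
```

The point is that the drafter's own attention only touches the kept blocks, so its prefill cost stays well below dense O(S²) even at 128K.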
Setup
```bash
# clone with submodules (pulls llama.cpp/ggml + Block-Sparse-Attention)
git clone --recurse-submodules https://github.com/Luce-Org/lucebox-hub
cd lucebox-hub/dflash
# build dflash + BSA kernel (sm_80+, ~10 min cold compile pulls cutlass)
cmake -B build -S . -DCMAKE_BUILD_TYPE=Release \
-DCMAKE_CUDA_ARCHITECTURES=86 \
-DDFLASH27B_ENABLE_BSA=ON
cmake --build build --target test_dflash test_flashprefill_kernels -j
# fetch weights (target + drafter + spec-decode draft)
huggingface-cli download unsloth/Qwen3.6-27B-GGUF Qwen3.6-27B-Q4_K_M.gguf --local-dir models/
huggingface-cli download Qwen/Qwen3-0.6B model.safetensors tokenizer.json --local-dir models/drafter/
huggingface-cli download z-lab/Qwen3.6-27B-DFlash --local-dir models/draft/
# bench
cd ../pflash && pip install -e .
python tests/niah_gen.py --n 1 --ctx 131072 --out /tmp/niah_128k.jsonl
python tests/bench_niah_cpp.py \
--bin ../dflash/build/test_dflash \
--target ../dflash/models/Qwen3.6-27B-Q4_K_M.gguf \
--draft ../dflash/models/draft/model.safetensors \
--drafter-gguf ../dflash/models/drafter/qwen3-0.6b.gguf \
--cases /tmp/niah_128k.jsonl --keep-ratio 0.05
```
Numbers
Single-shot on RTX 3090, Qwen3.6-27B Q4_K_M target, q4_0 KV, DFLASH_FP_USE_BSA=1 DFLASH_FP_ALPHA=0.85 keep_ratio=0.05. NIAH single-needle as the end-to-end retrieval check. Baseline is vanilla llama.cpp with default f16 KV (apples-to-oranges on KV; q4_0 KV costs ~3% AL at short context, 8.56 to 8.33, benchmarked).
| Context | PFlash TTFT | llama.cpp cold | Speedup (cold) | llama.cpp warmed |
|---|---|---|---|---|
| 64K | 13.5 s | 134.95 s | 10.0x | (smaller) |
| 128K | 24.8 s | 248.4 s | 10.0x | 169.3 s |
These are cold-cache numbers (first request after process boot). Warmed-vs-warmed is a smaller multiplier because llama.cpp settles into \~169 s at 128K once caches are hot. Both numbers are real and the right one depends on your workload; if you keep an engine resident, use warmed.
Decode after prefill is the standard DFlash spec-decode path with DDTree (~74 tok/s sustained on Qwen3.6-27B Q4_K_M).
Quality
NIAH single-needle (magic-key + 7-digit answer randomly placed in filler): the needle was retrieved at every context length tested from 32K through 128K, keep_ratio=0.05, DFLASH_FP_ALPHA=0.85.
Honest flag: NIAH single-needle is a structurally easy probe for an attention-based selection method like ours, since the algorithm is well-suited to finding a single high-attention span. RULER and NIAH multi-needle are next on the list; a fair audit should wait for those numbers.
Why the stack works
Speculative prefill solves a quality problem: how do you compress without losing the answer-relevant content? FlashPrefill solves a speed problem inside the drafter step: how do you make the drafter fast enough at 128K that it doesn't become the bottleneck? They compose cleanly because the target side (DFlash spec decode) is unchanged; it just receives a much shorter prompt with full attention enabled.
At 128K, drafter scoring is now the dominant cost (~12 s of the 24.8 s TTFT). Target prefill on the compressed ~6.5K survivors is ~10 s; the remaining ~3 s is the park/unpark/free dance. The next obvious lever is a smaller or distilled drafter, which we have not done yet.
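Putting those figures together (my arithmetic from the numbers above, rounded):

```latex
% TTFT budget at 128K, rounded from the numbers above:
%   survivors kept by the drafter: 0.05 \times 131072 \approx 6.5\mathrm{K}\ \text{tokens}
\mathrm{TTFT} \;\approx\; \underbrace{12\,\mathrm{s}}_{\text{drafter scoring}}
  \;+\; \underbrace{10\,\mathrm{s}}_{\text{target prefill on survivors}}
  \;+\; \underbrace{3\,\mathrm{s}}_{\text{park/unpark/free}}
  \;\approx\; 25\,\mathrm{s}
% Even with a free target prefill, drafter + swap leave roughly 15 s on the floor,
% i.e. about 248/15 \approx 16\times over the cold baseline is the ceiling until
% the drafter itself gets smaller or distilled.
```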
Tuning
```bash
DFLASH_FP_USE_BSA=1 # dispatch sparse FA forward through BSA (sm_80+, required for 10x)
DFLASH_FP_ALPHA=0.85 # block-selection threshold; higher = stricter = fewer K-blocks per Q-row
DFLASH_FP_PROFILE=1 # log per-stage timings (mean_K / score / select / forward)
```
keep_ratio=0.05 is the default. 0.02 cuts target prefill from ~10 s to ~3 s but starts losing the needle. DFLASH_FP_ALPHA=0.99 cuts ~1 s at 128K with a small NIAH-margin loss. Calibration territory.
Any feedback is more than welcome!
randomfoo2@reddit
Interesting technique but if I'm reading this correctly this is a super lossy way to process prefill?
Shoddy-Tutor9563@reddit
Someone has to run benchmarks to see the difference (not me!)
xienze@reddit
Yeah I dunno why everyone in this space seems to forget that EVERYTHING in computing is a space/time/quality tradeoff. You generally don't get 10x improvements in well-researched areas without massive tradeoffs.
randomfoo2@reddit
Although sometimes.. you can. (about to publish some of my work after a few weeks of grinding kernels that literally scores >10x memory improvements w/ faster than vLLM prefill/decode at c=1 and c=8 with near 0 quality loss - 0.003 and 0.005 KLD).
FullOf_Bad_Ideas@reddit
Awesome - please link it once you publish your work, I'd love to read it
randomfoo2@reddit
OK, ended up being 6-8x (there's more that could be squeezed but it runs slower than I'd like) https://www.reddit.com/r/LocalLLaMA/comments/1t3vlrx/fastdms_64x_kvcache_compression_running_faster/
rpkarma@reddit
Yeah there is a lot of low hanging fruit right now because a lot of the useful research and tricks are all private inside proprietary labs
Thick-Protection-458@reddit
Well, sometimes it is also kinds of bandwidth threshold. Like flash attention, for example.
In which case you can make exact compute more optimal.
Other than that it is also a question of how much quality we lose (and how exactly).
Succubus-Empress@reddit
But 32 bit to 16 bit, the tradeoff is minimal but the gain is double speed and half the memory.
ElementNumber6@reddit
Resolution only ever needs to be so sharp. Like 8k to 4k. Most won't notice a difference. 4k to 1080, however, and the pixels begin to show. 1080 to 420p, and now you've got some serious problems.
KallistiTMP@reddit
Spec decoding is generally lossless.
The tradeoff is it eats up your batch size.
If you're running at industrial scales where throughput is important, yeah, it tanks your throughput. Most personal users are only running one request at a time though, so it's kind of a free lunch for most hobbyists.
DominusIniquitatis@reddit
"Generally"? D:
FatheredPuma81@reddit
Because it is possible if someone smart enough were to dedicate an absurd amount of their time towards optimizing it. AI being so new means that there ARE a lot of areas that can be optimized and you can look at things like Turboquants (or more importantly the KV Cache Rotation PR in llama.cpp) to see that.
FatheredPuma81@reddit
Because it's simply not true. The only case this is true is when EVERYTHING is perfectly optimized and is basically handwritten by geniuses in machine code. There are almost always faster ways of doing things with minimal to no quality trade-off; the difficult part is convincing someone who's smart enough to find it.
I would say just look at video compression but that has also grown with compute (and hardware decoders) so it's hard to compare.
somerussianbear@reddit
Underrated comment
Intelligent-Form6624@reddit
wrong. it works like this:
browse reddit —-> see 10x magic post —-> win
pseudonerv@reddit
Yeah. It’s 10x faster. But how much dumber is the real question.
User_Deprecated@reddit
NIAH feels more like a retrieval test. It's mostly checking whether the model can find a specific fact buried in the context, which is kind of the easy case when the "needle" is already a clean span.
Where this probably breaks is when the answer needs stitching things together across the prompt. If the drafter drops one of those chunks, you just lose context without noticing. Multi-hop QA would stress that a lot more.
DefNattyBoii@reddit
Can this be done for 9B Qwen 3.5 for the 12 GB VRAM bros?
tomByrer@reddit
IIRC prefill's impact is smaller on smaller models. So might be only 2x, not 10x.
uhuge@reddit
for smallish context sure
Eyelbee@reddit
I tried a 70K token prompt on ud_q4_k_xl and prompt processing took just under 90 seconds.
tomByrer@reddit
90sec vs....?
Cferra@reddit
Does this scale to multiple 3090s?
tomByrer@reddit
I asked if PFlash can be on 2nd GPU in this GH issue: https://github.com/Luce-Org/lucebox-hub/issues/102
MarketsandMayhem@reddit
Will this work on lower grade cards like 3060?
warL0ck57@reddit
i am guessing yes, maybe it was only the hardware it was tested on.
tomByrer@reddit
VRAM & memory bandwidth might be an issue;
I'll let you guess which one is which 😉
PFlash as they implemented it seems to have to load & unload to make room for the `Qwen3.6-27B Q4_K_M target, q4_0 KV, DFLASH_FP_USE_BSA=1 DFLASH_FP_ALPHA=0.85 keep_ratio=0.05`.
sandropuppo@reddit (OP)
Yes, it should work with a bit of iteration
caetydid@reddit
The amount of optimizations popping up on the new Qwen models is insane! I am genuinely looking forward to all of these maturing and getting merged into llama.cpp - I see a bright future for my local LLM stack sporting two 3090s!
Remove_Ayys@reddit
This is not a "10x speedup", this is a 10x speedup with a bunch of asterisks. Any kind of lossy optimizations need rigorous testing for quality.
crantob@reddit
nice to see Remove_Ayys putting things to right.
I like to tell people it's like skimming a book.
qwen_next_gguf_when@reddit
Can anyone replicate this? The dflash thingy always OOMs on my 4090.
Boozybrain@reddit
I've yet to get Qwen running on a 3090, always OOM
Path-Exact@reddit
Try a NVFP4 quantized model, and do not offload context to the gpu, with that i get about 22-23gb of vram usage.
Boozybrain@reddit
I'm using the int4 quantized model but unsure about where context lives, will check on that. https://github.com/noonghunna/club-3090 is one of the repos I tried running.
RogerRamjet999@reddit
A bunch of people got the 4 bit quants to work fine (for 27B). If that's not what you're trying, then try that. If that is what you're trying, and it isn't working, it would seem that you need to go over your config and check everything.
bguberfain@reddit
I had the same issues, following all instructions… I gave up after burning an hour on this.
Hot_Turnip_3309@reddit
nope
-dysangel-@reddit
There's nothing to replicate really - it's effectively what happens when you type /compact in Claude code, using a smaller model to extract out the important summary
S3ssionCalc@reddit
Like, not at all...
-dysangel-@reddit
Conceptually and practically it's the same procedure - trimming down a large context with a smaller model, so that the big model doesn't have to process everything. Sure it's a cool technique, but it doesn't need to be "replicated" in the same sense as a scientific finding, because it's already trivially obvious that it works, and what the trade-offs would be. A proper implementation with references and the ability to use tool calls to find missing context would be cool though. That's pretty much how Claude Code does things these days.
Such_Advantage_6949@reddit
24GB VRAM is not enough for this model at Q4 + context + dflash
ai_without_borders@reddit
the comparison against vanilla llama.cpp matters here -- llama.cpp's CUDA prefill path doesn't have proper flash attention at these context lengths, so part of that 10x is recovering that overhead anyway. the interesting claim is the speculative part: the drafter scores token importance and the heavy model only prefills the flagged spans, which is genuinely different from just flash attention -- it's an approximation. NIAH is the right benchmark to stress this because the failure mode for sparse prefill is the drafter systematically underweighting the relevant needle tokens. curious what architecture the drafter is and how much VRAM overhead it adds loading in-process
darkwalker247@reddit
supposedly the drafter is based on qwen3-0.6b. i wonder how this affects stability of conversations for larger prefill sizes
ai_without_borders@reddit
qwen3-0.6b has a 32k native context — at 128k it would need positional extrapolation, which is where the stability concern gets real. the drafter job is to score importance across the full window, so if it degrades past its training length the sparse prefill will systematically drop tokens in that region. not random errors but a reproducible blind spot. curious whether they rope-extended the drafter or if the 10x claim is only benchmarked under 32k
SectionCrazy5107@reddit
will this work on a V100?
siegevjorn@reddit
Tl AI DR THX
No_Conversation9561@reddit
does this support multi-gpu?
pixelpoet_nz@reddit
... but when I flash my P all I get is 18 months community service >:(
jamu85@reddit
I tried it yesterday and it ran nicely on my 3090. When do you add tool calls to the server?
Shinkai_I@reddit
This sounds like a more radical application of the RAG concept to KV Cache.
We're already struggling to combat the information loss caused by RAG Chunk fragmentation.
Now we might have to worry even more about information loss in KV Cache.
Shinkai_I@reddit
They tried to make the context window bigger, but now it's so slow that it only allows the model to read a small portion.
alex20_202020@reddit
What is "Warmed steady-state"? During conversation all previous is usually cached and response is fast, but here it is only 1.5 faster than cold. So what is it? When does it happen? TIA
wazymandias@reddit
Prefill at 128K is the metric that actually decides whether long-context agentic workflows are usable on consumer cards or not. Curious whether the 10x holds at 32K and 64K or whether it's a curve that only diverges hard at the top end. Decode tok/s comparison would also be nice for the people running this as a daily driver, not just for one-shot ingestion.
Daniel_H212@reddit
Vulkan/ROCm version pls
sandropuppo@reddit (OP)
Working on it… cooking for Ryzen strix halo
hughk@reddit
That would be interesting. I'm doing most of my Qwen stuff using unified memory on a Strix Halo. I do have a desktop with a 3090 but don't tend to run it so much now with the MiniPC.
Fedor_Doc@reddit
What is "(smaller)" value in llama.cpp warm column for 64K context? Is it the Time To First Token value? Can you share actual value in seconds?
llama.cpp warms models by default, so it should provide a better comparison. 7x prefill speed improvement is still respectable.
The question is, for what types of work this will be a valid optimization, considering possible reduction in output quality. Finding pre-defined string in a text is much easier with classic string search algorithm. No more complex worflows were tested, though
Obvious-Ad-2454@reddit
To be honest, 10x sounds too good to be true. But I am too lazy to replicate myself. So I will wait for others to do it. Anyway thank you for contributing.
No-Refrigerator-1672@reddit
10x prefill over llama.cpp on 4-bit quants is just casual reality of vLLM. If this pflash works, then it just brings the performance to proper level, nothing to be skeptical about.
FullOf_Bad_Ideas@reddit
llama.cpp has a pretty good prefill if you aren't offloading to CPU RAM, I don't believe the difference could be 10x on a model like Qwen 3.6 27B.
No-Refrigerator-1672@reddit
This graph is from my review of Chinese cards with modded VRAM. vLLM is clearly 10x faster. The llama.cpp numbers more or less agreed with other numbers I saw here for the 3090. All engine versions and launch commands are available in said review, you're free to verify it yourself.
P.S. yes, this is single request performance, for multiple parallel requests vllm speeds up even more.
FullOf_Bad_Ideas@reddit
Thanks for sharing, I think for small dense models on single GPU the difference would be much smaller.
No-Refrigerator-1672@reddit
Yep, it's "just" 3x to 5x for Qwen3 VL 14B, same style graph is available in the same review. Llama.cpp only was faster than vLLM on MXFP4, which, I believe, is because Ampere does not support this quant, and vLLM featured no optimizations for such case.
FullOf_Bad_Ideas@reddit
Ok, I agree with you completely now. I was under the impression that the difference was smaller but seeing the numbers for Qwen 3 14B I'm fully convinced.
sandropuppo@reddit (OP)
I know, we were also a bit scared to release this because of the claim. But it's true. That's why we released everything needed to replicate it. A user on Discord already got better than 10x as well.
Obvious-Ad-2454@reddit
How does model accuracy change ?
tmvr@reddit
Unless I'm missing something in your post, or you missed something, I'm not too surprised you get 10x prefill results if you ran it like above. That model does not fit into 24GB VRAM with 131K tokens and default FP16 KV even when using the IQ4_XS quant, which is over a gigabyte smaller than Q4_K_M. With the settings above you ran out of VRAM, spilled over to system RAM, and that killed your prefill performance.
a_beautiful_rhind@reddit
Aren't these all based on context being super homogeneous and predictable? So for code good, for other things basically nothing?
marutichintan@reddit
waiting for multi gpu support
kiwibonga@reddit
Hmm, vanilla llamacpp has awful prefill.
sudeposutemizligi@reddit
llama.cpp doesn't make me wait that much. What is 24 seconds of waiting? That's vLLM's habit.
alex20_202020@reddit
Maybe I am not understanding something, I am a newbie in LLMs. Does the above mean one starts llama.cpp and gives it 131K tokens as the initial prompt? Because otherwise the KV cache is used to speed things up. My use cases are far from that. How common is giving long initial input? What are typical use cases? TIA
hurdurdur7@reddit
Developers resuming work on their code or switching to a new task. On bigger projects a 60k-100k initial load is not that rare at all.
ga239577@reddit
If this can be replicated for ROCm that would be amazing!
inevitabledeath3@reddit
DFlash works on 3090? I had issues when I tried.
Long_comment_san@reddit
I can't read this AI writing. What year is it, 2023? Use minimax or kimi to make this readable
Rattling33@reddit
Great, thanks for Luce's effort! Also looking forward to this working on Strix Halo!
mrmontanasagrada@reddit
Very cool guys - this has a lot of locallama spirit!
Did you do any quality comparisons already?
And do you think we can combine this with rotorquant or something similarly new, even? Perhaps that could give yet another multiple of speedup?
temperature_5@reddit
Someone run this and then have it make changes to a large python project to see if it remembers the code accurately. In production, of course!
Foreign_Risk_2031@reddit
Will streaming pre-fill work with this?
I'm doing streaming prefill for some low latency inputs, and I have a feeling this may break it
-dysangel-@reddit
I guess it would work if you stream straight into the small model instead of the large one
tarruda@reddit
I just hope this eventually becomes possible on Apple Silicon. Would bring new life to my mac studio for using larger models as coding agents.
-dysangel-@reddit
It's already possible to summarise/compress context if you want to. I find it odd that OpenCode doesn't have an option to do this with the small utility model by default
New_Comfortable7240@reddit
Please make a PR to llama.cpp
pmttyji@reddit
+1
hannibal27@reddit
Would this work on a Mac?
Prestigious-Use5483@reddit
That speedup is juicy. How does speculative decoding differ from having it off in terms of quality (i.e. intelligence and creativity)? Thanks.