[-]

Zestyclose_Yak_3174@reddit

This sounds promising. However there have been so many projects that made huge promise that were either never fully developed or turned out to be wrong or overpromising. I really hope this time is different. Exposure is needed for these kind of projects. I am sure the future will use many components of similar breakthroughs to create a mix of eclectic inference optimizations. Just like the vanilla Turboquant, on its own not necessarily earth shattering but has potential. But all of the newer community improvements are looking really promising.

[-]

Kitchen-Year-8434@reddit

Dflash in vllm on qwen3.5 27b took me from 80 ish tps with MTP to 150-180. Insane speed up. Just waiting on gemma4 now.

[-]

Interesting_Key3421@reddit

can dflash be integrated in llama.cpp ?

[-]

-dysangel-@reddit

I've got Claude working on an mlx version atm. If we get it working well, I can try llama.cpp too

[-]

pmttyji@reddit

.... I can try llama.cpp too

Please do it. Thanks

[-]

DerDave@reddit

When you say "we" - do you mean yourself and Claude or an actual team behind you? ;-)

[-]

-dysangel-@reddit

myself and Claude

[-]

Beginning-Window-115@reddit

any update

[-]

tomakorea@reddit

hope it works, fingers crossed

[-]

Monkey_1505@reddit

Yeah would be nice to see for sure. VLLM is really geared to multi-instance commercial implementation, and doesn't support single end user things as much, like eg, offloading select expert tensors to cpu.

This tech seems genuinely great and would be lovely to have it nearer to the average end user.

[-]

eugene20@reddit

This + turboquant + WHT Lloyd-Max centroid weight compression is really going to open up what locally run models can do.

[-]

snapo84@reddit

i would prefer rotorquant kv cache (much faster and better than turboquant) , dflash
those both would allow me to run qwen 3.5 27B at a staggering 60 token/s

[-]

Silver-Champion-4846@reddit

When will this be mature enough to be freely plug-and-play on things like Jan?

[-]

Clear-Ad-9312@reddit

When will this be mature enough

when it gets mature? idk its too open for debate as tech moves too fast that by the time things are being figured out another groundbreaking announcement/release. If possible, maybe one year or two for actual maturity, but you can likely start using it in like one to three months if devs are able. Consider supporting them, that is all we can do, haha

[-]

9r4n4y@reddit

Can someone please give me explanation of what's happening?

[-]

brandarchist@reddit

Take this as a vaguely-accurate-but-probably-not-totally explanation...

Despite running on GPUs, token gen is largely a serial operation. Speculative uses a "draft" model to guess and a larger one verifies them; this can give a 2-3x improvement by delivering chunks instead of individual tokens.

What this is doing is cheating a bit by basically taking the "LLMs are just autocomplete" and pointing it at the internal state of the larger model above, i.e.. the one actually generating tokens. As it is actively generating, the smaller models are (in parallel) predicting the next chunk of tokens.

If you watch utilization, GPU spikes heavy on attention (before tokens generate) and then drops pretty significantly as it generates. This project aims to leverage a more significant portion of the GPU during the generation process.

[-]

Direct-Salt-9577@reddit

Great explanation thanks

[-]

9r4n4y@reddit

Thank you 🤗

[-]

kulchacop@reddit

Here's the abstract from the paper. Make of that what you will:

Autoregressive large language models (LLMs) deliver strong performance but require inherently sequential decoding, leading to high inference latency and poor GPU utilization. Speculative decoding mitigates this bottleneck by us- ing a fast draft model whose outputs are verified in parallel by the target LLM.

However, existing methods still rely on autoregressive drafting, which remains sequential and constrains practi- cal speedups.

Diffusion LLMs offer a promising alternative by enabling parallel generation, but current diffusion models typically underperform compared with autoregressive models.

In this paper, we introduce DFlash, a speculative decoding framework that employs a lightweight block diffusion model for parallel drafting. We show that speculative decoding provides a natural and effective setting for diffusion models.

By generating draft tokens in a single forward pass, DFlash enables efficient drafting, and by conditioning the draft model on context features extracted from the target model, it achieves high-quality drafts with higher acceptance rates.

Experiments show that DFlash achieves over 6× lossless acceleration across a range of models and tasks, delivering up to 2.5× higher speedup than the state-of the-art speculative decoding method EAGLE-3.

[-]

9r4n4y@reddit

Ohh

[-]

NickCanCode@reddit

free lossless speed up

[-]

divide0verfl0w@reddit

Imma pile on too

[-]

jadhavsaurabh@reddit

dont mind me just commenting for more info

[-]

Substantial_Swan_144@reddit

I don't know what is happening precisely, but I sure like it!

[-]

Tyrannas@reddit

Don't mind me, just commenting to also be notified of the explanation

[-]

ortegaalfredo@reddit

4x decoding speed? this is the kind of paper that makes nvidia loss 500 Billions in market cap.

I wonder what's the size of the draft. Apparently it's quite bigger than that of the Eagle3 MTP.

[-]

Mochila-Mochila@reddit

Doesn't scale up so well apparently, so it may not be Earth-shattering with the biggest models.

[-]

twnznz@reddit

Looks like inference might be an edge problem rather than a DC problem

[-]

Finanzamt_Endgegner@reddit

not really though, everyone profits from faster inference with same hardware

[-]

Finanzamt_Endgegner@reddit

It wont because it wont get the hype of turboquant, which is a shame because this is arguably better lol

[-]

ortegaalfredo@reddit

Much better

[-]

Conscious-content42@reddit

I wonder how the scaling works for larger models. In their blog they see a 2.5x speed up over Eagle 3 (so a 6x total speed up over no speculative decoding) for an 8B model. Maybe a bit more modest gains for larger models?

[-]

Conscious-content42@reddit

Answer... read the paper: https://arxiv.org/pdf/2602.06036

For qwen 30B A3B, it's like 2.2-3x speed up without speculative decoding.

[-]

z_latent@reddit

Left to right numerical columns are different concurrency levels (1 2 4 8 16).

Looks like a \~3x speed-up for concurrency = 1. Unfortunately lacks a comparison with EAGLE for this model.

[-]

JayPSec@reddit

WTF is going on? A week ago we're all crying that maybe they would stop releasing openweights and now it's effing christmas everyday???

[-]

TheRealMasonMac@reddit

ZLab isn't ZAI

[-]

AdventurousFly4909@reddit

Would this work with speculative speculative decoding?

https://arxiv.org/pdf/2603.03251

[-]

Dany0@reddit

This feels like a bigger deal than the TurboQuant hype. \~10-20% VRAM more requirement (max, less so for larger models) in exchange for 6x speed

[-]

Dany0@reddit

Some clanker summary (abbreviated by me):

From the code, generation is blockwise, not one diffusion chain that runs forever. In spec_generate(), each loop:

takes the current context,
runs the draft model to propose a block,
runs the target model on that block,
computes an acceptance_length,
commits the accepted tokens,
crops caches and continues from the new position.

Does diffusion continue steps as generation continues?

Yes, but only in the sense that it is re-run repeatedly on the newly extended context.

It is not one uninterrupted diffusion trajectory over the whole response. Instead, each new block is a fresh “drafting” pass conditioned on:

Does target confirmation improve the diffusion model’s guesses?

Indirectly, yes, the improvement is from more context, cleaner prefix, target hidden-state features extracted from the confirmed segment

vram estimates for q8 27b + dflash

27B q8: \~30 GB

Draft model: \~3–8 GB

Total (including cache/overhead): \~40–48 GB for standard use, 64 GB+ for long context.

[-]

Dany0@reddit

They use a Qwen3-based block diffusion draft model, not a generic standalone diffusion architecture.

Specifically, in this repo the draft model class is:

Text

DFlashDraftModel(Qwen3PreTrainedModel)

and it’s implemented as a Qwen3-style decoder stack modified for block diffusion. The README shows model pairs like:

Qwen3.5-4B-DFlash
Qwen3.5-9B-DFlash
Qwen3.5-27B-DFlash
Qwen3.5-35B-A3B-DFlash

So the diffusion model in the paper/repo is basically:

a small draft model derived from the same family as the target
trained as a block diffusion model
conditioned on target hidden states plus noisy token embeddings

Important nuance

It is not “a smaller Qwen3-like diffusion model” in the sense of being a separate classical image-style diffusion model. It’s a language-model draft network with diffusion/block-denoising behavior.

What the code suggests

The draft model:

takes noise_embedding from the target token embeddings
takes target_hidden from selected hidden layers of the big model
runs a special decoder stack
predicts a block of tokens in parallel

So it’s more like:

If you mean “which exact base model?”

For the examples in the README, it’s Qwen3.5-family variants such as:

z-lab/Qwen3.5-27B-DFlash
z-lab/Qwen3.5-8B-DFlash-b16

[-]

BagComprehensive79@reddit

What is the meaning of “losses” here? Does it mean it would produce exact same output if temp set to “0”?

[-]

Careful_Letter_9223@reddit

400 T/s is the minimum for ideal inference (for me at least). The point where it’s looks less like typing and more like streaming answers.

Future looks bright

[-]

BeeegZee@reddit

First of all, kudos to your work. Really strange no one has done it before in the open (although we had a brief Gemini Diffusion sneak peak, which died young)

Did you test it vs MTP available from day one for Qwen3.5 model family?

[-]

BeeegZee@reddit

Tested Qwen3.5 family on H100 + vllm

HEAD-TO-HEAD (same target weights, H100 80GB, single-stream, 20 reqs warm)

Model	MTP=3 TPS	DFlash(15) TPS	Δ	Winner
Qwen3.5-9B (BF16)	168.8	153.1	+10.3%	MTP
Qwen3.5-27B-FP8	108.8	103.9	+4.7%	MTP
Qwen3.5-27B-GPTQ-Int4	107.7	105.0	+2.6%	TIE/MTP
Qwen3.5-35B-A3B-FP8	171.8	170.2	+0.9%	TIE
Qwen3.5-35B-A3B-GPTQ-Int4	197.2	160.6	+22.8%	MTP

CUDA GRAPHS CAPTURED (for 9B):

DFlash 9B → 32 PIECEWISE prefill-decode graphs + 32 FULL decode graphs, 4s

MTP 9B → 33 PIECEWISE prefill-decode graphs + 17 FULL decode graphs, 4s

Both have batch=1 in the capture set → bench hits the graph, not eager fallback.

[-]

eribob@reddit

Oh that looks like a bummer? No speedup?

[-]

BeeegZee@reddit

idk, I have no idea if i tested it with the best possible configs, but seems so.

MTP heads implemented natively (Qwen3.5 is relatively new) is no joke. It's like at first sight "we have EAGLE3 at home", but under the hood it's the one she told you not to worry about.

[-]

Own_Suspect5343@reddit

I hope this would work well on strix halo later

[-]

EveningIncrease7579@reddit

Really impressive. Maybe we can adapt for qwen 3.5 in the same way? And what about results running on cpu exclusively, seems improve performance too?

[-]

EveningIncrease7579@reddit

Forgive my first question, in repository i see support for qwen 3.5

[-]

BeeegZee@reddit

did some tests in the adjacent comment

[-]

Randomdotmath@reddit

currently not support for gpu offload i think, looking for it too

[-]

king_of_jupyter@reddit

Awesome!

[-]

helpmefindmycat@reddit

is it possible to get this to work with gemm 3 31B in lm studio, because I suspect that would be amazing.

[-]

Ok_Zookeepergame8714@reddit

They are working on it. Says so in their GitHub repo issues. ☺️

[-]

Substantial_Swan_144@reddit

At those speeds, any local model could crush the much more intelligent models, because you could swarm agents to improve on the input at very little cost.

[-]

oxygen_addiction@reddit

If your application has proper reward functions to target. You could do swarms of small llms even now.

Swarm Bonsai and beat Claude.

[-]

Substantial_Swan_144@reddit

What I mean is that with current speed, calling agents would be expensive. But definitely not so at 400 token / seconds.

[-]

kulchacop@reddit

The person who named this DFlash deserves an award. /s

[-]

Christosconst@reddit

What hardware is the demo running on

[-]

QuackerEnte@reddit

speculative decoding but diffusion based why didn't I think of that

[-]

ortegaalfredo@reddit

Many teams thought of that in the past but they couldn't get enough quality predicted tokens. Diffusion models are not super accurate, but this one is.

[-]

Jeidoz@reddit

Can someone me tell how I can download and use it for LM Studio? I wanna try it with Qwen 3.5 option.

[-]

miniocz@reddit

I spent literally last night testing speculative decoding. I could have slept and just wait till today. Great news anyway.

[-]

Specter_Origin@reddit

Supported model is missing gemma : (

[-]

pmttyji@reddit

From their github repo:

Feel free to open a GitHub issue to request support for additional models. We will also open-source the training recipe soon, so you can train your own DFlash draft model to accelerate any LLM.

[-]