Is there a DFlash draft model compatible with Qwen3.6 27B yet?
Posted by butterfly_labs@reddit | LocalLLaMA | 22 comments
Title.
I have the draft for Qwen3.5 (not 3.6) 27B. Would it be compatible? I tried that combination in oMLX and PP speed is actually much worse.
One-Replacement-37@reddit
Yes, there is.
https://huggingface.co/z-lab/Qwen3.6-27B-DFlash
As of this morning, though, since the model is still being trained, the embedded MTP layers give a much higher acceptance rate: I was only getting ~2 tokens accepted with DFlash vs. 4-5 with the MTP layers.
If your quant dropped the layers, ask a model to write a stitching script to bring them back.
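Something like this should work, assuming the checkpoints are plain safetensors files and the MTP tensors share a recognizable name prefix (the paths and the `mtp` prefix below are guesses, check the actual tensor names first):

```python
# Hypothetical stitching sketch: copy MTP-layer tensors from the original
# checkpoint into a quant that dropped them. File names and the "mtp"
# tensor-name prefix are assumptions, not the actual DFlash layout.
from safetensors.torch import load_file, save_file

original = load_file("Qwen3.6-27B-DFlash/model.safetensors")
quant = load_file("Qwen3.6-27B-quant/model.safetensors")

# Grab every tensor whose name marks it as an MTP layer.
mtp_tensors = {name: t for name, t in original.items() if "mtp" in name}
print(f"Stitching {len(mtp_tensors)} MTP tensors back in")

quant.update(mtp_tensors)
save_file(quant, "Qwen3.6-27B-quant/model.stitched.safetensors")
```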
Familiar_Wish1132@reddit
You didn't have this problem?
https://github.com/spiritbuun/buun-llama-cpp/issues/25
One-Replacement-37@reddit
I did not. Running:
Koalababies@reddit
I keep getting errors similar to `Access to model z-lab/Qwen3.6-27B-DFlash is restricted and you are not in the authorized list` - is this expected right now?
One-Replacement-37@reddit
You need to open the link, sign in, and accept the terms.
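If it still errors after that, make sure your local environment is authenticated too, e.g. via huggingface_hub (the token string below is just a placeholder for your own access token):

```python
# Authenticate locally after accepting the terms in the browser, then pull the repo.
from huggingface_hub import login, snapshot_download

login(token="hf_xxx")  # or call login() with no args for an interactive prompt
snapshot_download("z-lab/Qwen3.6-27B-DFlash")
```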
Koalababies@reddit
Can't believe I missed the "accept terms" banner there. Thank you for the fast reply!
LaurentPayot@reddit
Btw, a GGUF version can be found here: https://huggingface.co/spiritbuun/Qwen3.6-27B-DFlash-GGUF
mouseofcatofschrodi@reddit
is there a way to get a notification when that improvement happens?
RoomyRoots@reddit
Follow the account. They have a Git and a blog too.
butterfly_labs@reddit (OP)
Thanks! I know some of those words 😅 I am running the 8-bit unsloth quant.
soyalemujica@reddit
Sadly DFlash does not work with AMD
suavedude2005@reddit
A PR was merged in vLLM a couple of days ago adding DFlash support on AMD MI GPUs.
soyalemujica@reddit
AMD MI GPUs, mind you, not even consumer hardware like the 7900 XTX.
RoomyRoots@reddit
As an AMD user, this is pretty much how it's always been for us. Gotta wait a long time for consumer hardware support.
SwanManThe4th@reddit
https://github.com/Kaden-Schutt/hipfire
FullOf_Bad_Ideas@reddit
The Qwen 3.5 27B DFlash draft model did work with the Qwen 3.6 27B BF16 model in SGLang for me, but only at lower context lengths and not on all requests. Speeds ranged anywhere from 150 down to 30 t/s.
-dysangel-@reddit
Seems odd to have a speculative model affect PP, since you already know the exact tokens you're processing and so don't need to run the speculative model during those passes...?
Evening_Ad6637@reddit
I think I've missed something important. Could a kind soul please briefly explain to me what DFlash is?
tuliosarmento@reddit
To give a really brief summary, DFlash is an approach that uses a small draft model built from the original model (in this case, likely a ~0.8B version of Qwen 3.6) together with the original model acting as a "supervisor". This makes generation faster.
What happens is:
Your prompt > draft model generates 4-5 tokens > original model accepts/rejects some of those tokens > draft model drafts again from the last accepted token > ... until all the output tokens are produced.
"Why?" one could ask. Well, imagine you have a performance of 20 tok/s of token gen and 200 tok/s on the prefill (reading part) using the 27B model. The same prompt would lead to hundreds of tok/s with the draft model. Obviously, since most of these tokens generated by the draft model are rejected, your final speed wouldn't be hundreds of tokens/s, but instead, something around 1.5x-4x (with some papers claiming 6x) the performance of the original/big model.
In this example, you could expect to go from 20 tok/s to something around 50 tok/s. One useful analogy: imagine the draft model as an autocomplete tool attached to your brain; it would let you type/create text much faster.
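If it helps, here is a toy greedy draft-and-verify loop in plain transformers that shows the idea (the model names and the 4-token draft length are just stand-ins for illustration; real DFlash/MTP implementations are more sophisticated than this):

```python
# Toy sketch of draft-then-verify decoding. The models here are small stand-ins,
# NOT the actual Qwen3.6-27B / DFlash pair; greedy decoding only.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-1.5B-Instruct")
big = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-1.5B-Instruct")    # "supervisor"
draft = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")  # cheap drafter

ids = tok("The capital of France is", return_tensors="pt").input_ids
k = 4  # tokens drafted per round

for _ in range(8):  # a few speculation rounds
    # 1. Draft model proposes k tokens cheaply.
    proposal = draft.generate(ids, max_new_tokens=k, do_sample=False)
    drafted = proposal[0, ids.shape[1]:]

    # 2. Big model scores the prompt plus the drafted tokens in ONE forward pass.
    logits = big(proposal).logits[0, ids.shape[1] - 1 : -1]
    verified = logits.argmax(-1)  # what the big model would have emitted

    # 3. Keep drafted tokens only while they match the big model's choices.
    n_accept = 0
    for d, v in zip(drafted, verified):
        if d != v:
            break
        n_accept += 1

    # 4. Append the accepted tokens plus one corrected token from the big model.
    keep = torch.cat([drafted[:n_accept], verified[n_accept:n_accept + 1]])
    ids = torch.cat([ids, keep.unsqueeze(0)], dim=1)

print(tok.decode(ids[0], skip_special_tokens=True))
```

The acceptance rate is what decides how close you get to that 1.5x-4x range: each round costs one big-model pass, but it can commit several tokens at once.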
audioen@reddit
This seems somewhat incorrect. You are describing vanilla speculative decoding. DFlash is based on speculation, yes, but a large number of the speculated tokens are constructed with a diffusion model, which is probably what the D stands for. So the diffusion produces something like 100 tokens at once, and the main model is somehow used to either validate or guide the diffusion, is what I'm guessing.
The point is that the approach is lossless and extremely fast, much faster than speculative decoding can be. Even speculative speculative decoding, where the speculator runs before the main model picks the next tokens on its inference pass, is likely not going to be as fast. That said, speculative speculative decoding lets the main model run continuously on one piece of hardware while speculation happens on another, so there's no competition for resources, since the speculator evaluates many more token possibilities and works much harder.
audioen@reddit
Prompt processing is not going to improve, since this only speeds up generation. Surely you meant token generation speed? DFlash is very interesting because it promises to increase generation speed by something like an order of magnitude if it can be made to work...