DFlash speculative decoding on Apple Silicon: 4.1x on Qwen3.5-9B, now open source (MLX, M5 Max)
Posted by No_Shift_4543@reddit | LocalLLaMA | View on Reddit | 40 comments
A few weeks ago I posted early results from a native MLX implementation of DFlash. Since then I rewrote the benchmark methodology, fixed numerical issues, and open sourced the whole thing.
A small draft model generates 16 tokens in parallel via block diffusion; the target verifies them in one forward pass. Every emitted token is verified against the target model before being committed, so the output is lossless. Stock MLX, no fork.
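For anyone unfamiliar with the mechanics, here's a minimal sketch of the draft-then-verify loop in plain Python. The function names and the toy `target_next`/`draft_block` models are illustrative stand-ins, not the repo's API; in the real implementation the draft is a block-diffusion model and verification is a single batched target forward pass.

```python
# Toy sketch of lossless draft-then-verify speculative decoding.
# `draft_block` proposes a block of tokens; tokens are committed only
# while draft and target agree, so the output is identical to running
# the target alone (greedy decoding assumed for simplicity).

BLOCK = 16  # draft block size from the post

def target_next(ctx):
    # Stand-in for one greedy target forward pass.
    return (sum(ctx) * 31 + len(ctx)) % 100

def draft_block(ctx, k=BLOCK):
    # Stand-in for the block-diffusion draft: proposes k tokens at once.
    # By construction it matches the target except on the last token.
    out, c = [], list(ctx)
    for i in range(k):
        t = target_next(c)
        if i == k - 1:          # deliberately diverge at the end
            t = (t + 1) % 100
        out.append(t)
        c.append(t)
    return out

def speculative_step(ctx):
    proposal = draft_block(ctx)
    accepted = []
    for tok in proposal:
        expected = target_next(ctx + accepted)  # one batched verify pass in reality
        if tok != expected:
            accepted.append(expected)           # emit the target's token instead
            break
        accepted.append(tok)
    return accepted

out = speculative_step([1, 2, 3])
print(len(out), out[:3])
```

The key property: every committed token equals what the target would have produced on its own, which is why the method is lossless regardless of draft quality; acceptance rate only affects speed.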
Setup: M5 Max, 64GB, MLX 0.31.1. Baseline is stock mlx_lm.stream_generate, not a custom loop. 3 runs, median reported, 10s cooldown.
Results @ 2048 tokens
| Model | Baseline | DFlash | Speedup | Acceptance |
|---|---|---|---|---|
| Qwen3.5-4B | 53.74 tok/s | 219.83 tok/s | 4.10x | 89.3% |
| Qwen3.5-9B | 30.96 tok/s | 127.07 tok/s | 4.13x | 89.4% |
| Qwen3.5-27B-4bit | 32.35 tok/s | 62.78 tok/s | 1.90x | 89.1% |
| Qwen3.5-35B-A3B-4bit | 142.12 tok/s | 240.21 tok/s | 1.69x | 88.7% |
Full results at 1024/2048/4096 in the repo.
What changed since last post
- Baseline is now stock mlx_lm (was a custom Python loop that was slower, inflating the speedup)
- Tape-replay rollback: custom Metal kernel that replays only accepted steps through GatedDeltaNet recurrent state. No full checkpoint save/restore. This is what keeps acceptance at 89% over long generations.
- JIT 2-pass SDPA kernel for long-context verify (N >= 1024)
- Numerically stable bf16 paths across speculative cycles
- Acceptance went from ~82% to ~89% thanks to precision fixes
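The tape-replay idea can be illustrated with a toy recurrent state in plain Python. The real kernel is Metal and the real recurrence is GatedDeltaNet; everything below is an illustrative stand-in for the concept of replaying only accepted steps rather than checkpointing the full state.

```python
# Toy illustration of tape-replay rollback for a recurrent state.
# A GatedDeltaNet-style layer carries a state that advances once per
# token, so rejected draft tokens would leave it corrupted. Instead of
# saving/restoring a full checkpoint, keep the state at the last
# committed token and replay only the accepted steps.

def step(state, tok):
    # Stand-in for one recurrent update (a simple affine recurrence here).
    return (state * 3 + tok) % 1_000_003

def advance_committed(state, accepted_toks):
    # Replay exactly the accepted steps; the speculative steps for the
    # rejected tail are simply never applied to the committed state.
    for t in accepted_toks:
        state = step(state, t)
    return state

committed = 7
proposal = [4, 8, 15, 16, 23, 42]   # draft block
accepted = proposal[:4]             # verifier accepted the first 4
committed = advance_committed(committed, accepted)
# committed now equals the state you'd get if only the accepted
# tokens had ever been run, with no checkpoint save/restore.
```

The save/restore alternative would copy the full recurrent state every cycle; replaying only accepted steps avoids that traffic, which matters on bandwidth-bound hardware.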
What I learned
On unified memory everything is bandwidth-bound. Custom Metal kernels (batched-GEMV, fused gated SiLU, custom SDPA) all came back slower than stock MLX. The wins came from numerical precision, not compute optimization.
The 27B-4bit speedup is lower because the quantized target is already fast, making the bf16 draft the bottleneck. That's a structural limitation of speculative decoding on bandwidth-bound hardware with quantized targets.
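A back-of-envelope model makes the effect concrete. All timing numbers below are illustrative, not measurements from the repo; the point is that shrinking the target's cost while the draft's cost stays fixed caps the achievable speedup.

```python
# Why a fast (quantized) target caps the speedup: the speculative cycle
# pays a fixed draft cost per block, so cheapening only the target/verify
# side shrinks the ratio. Timings are made-up ms figures for illustration.

def speedup(t_target, t_draft_block, t_verify, tokens_per_cycle):
    # Baseline emits 1 token per t_target; speculative emits
    # tokens_per_cycle tokens per (draft + verify) cycle.
    baseline_rate = 1.0 / t_target
    spec_rate = tokens_per_cycle / (t_draft_block + t_verify)
    return spec_rate / baseline_rate

accepted = 14.2  # ~89% acceptance over a 16-token block

# bf16 target: the target pass is slow, so the draft overhead is small
# relative to the work it saves.
slow_target = speedup(t_target=32.0, t_draft_block=50.0,
                      t_verify=60.0, tokens_per_cycle=accepted)

# 4-bit target: target and verify get ~3x cheaper, but the bf16 draft
# block costs exactly the same.
fast_target = speedup(t_target=11.0, t_draft_block=50.0,
                      t_verify=21.0, tokens_per_cycle=accepted)

print(round(slow_target, 2), round(fast_target, 2))
```

With these toy numbers the bf16 case lands around 4x while the quantized case drops toward 2x at identical acceptance, matching the shape (not the exact values) of the table above.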
Built specifically for Qwen3.5's hybrid GatedDeltaNet + attention architecture. Pure attention models (Qwen3, Gemma) work but without the tape-replay benefits.
Roadmap
- Sustained acceptance at 4096+ tokens
- Full-attention model optimization
- Draft model compression
zhijianliu@reddit
We are also working on releasing drafts for more models. Name your favorites!
mr_il@reddit
Great work! I was able to nearly reproduce on M5 Max with Qwen3.5-4B @ 2048 tokens. Baseline: 54 tok/s. DFlash: 140 tok/s. Speedup: 2.6x. Acceptance: 82%. MLX 0.31.1. I will test with other models too, but I wonder what might explain the variation?
Anyway, a bigger question is what is your ambition with this implementation? Are you planning to develop a serving layer yourself or propose this implementation for mlx_lm?
No_Shift_4543@reddit (OP)
Keeping it standalone for now
mr_il@reddit
In that case, would you consider implementing a streaming OpenAI-compatible API that can parse reasoning and tool calls, so we could experiment with coding agents?
No_Shift_4543@reddit (OP)
it’s already implemented
mr_il@reddit
It didn’t parse tool calls or reasoning tokens when I tried using Qwen3.5-27B via OpenCode and dflash-serve. But if you’re saying it should be working, I’ll raise an issue on GitHub.
No_Shift_4543@reddit (OP)
updated: https://github.com/bstnxbt/dflash-mlx/commit/a6ecff4e9ccbcf793b23de3ac7e860c9b7d8be5b
No_Shift_4543@reddit (OP)
thanks for flagging this, you're right, tool calls and reasoning tokens aren't parsed in the current server. I'll add proper support for both (tool_calls + reasoning_content in the SSE stream)
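For anyone wiring this up client-side in the meantime, here's a minimal sketch of parsing such a stream. It assumes the common OpenAI-style delta fields (`content`, `tool_calls`, and the `reasoning_content` extension popularized by DeepSeek-style APIs); the server's actual schema may differ.

```python
import json

# Minimal client-side parser for an OpenAI-style SSE stream that
# carries text, reasoning, and tool-call deltas. Field names follow
# the common convention; this is a sketch, not dflash-serve's API.

def parse_sse_lines(lines):
    text, reasoning, tool_args = [], [], []
    for line in lines:
        if not line.startswith("data: "):
            continue
        payload = line[len("data: "):]
        if payload == "[DONE]":
            break
        delta = json.loads(payload)["choices"][0]["delta"]
        if delta.get("reasoning_content"):
            reasoning.append(delta["reasoning_content"])
        if delta.get("content"):
            text.append(delta["content"])
        for call in delta.get("tool_calls", []):
            # Argument strings arrive in fragments and are concatenated.
            tool_args.append(call["function"].get("arguments", ""))
    return "".join(text), "".join(reasoning), "".join(tool_args)

stream = [
    'data: {"choices":[{"delta":{"reasoning_content":"thinking..."}}]}',
    'data: {"choices":[{"delta":{"content":"Hello"}}]}',
    'data: {"choices":[{"delta":{"tool_calls":[{"index":0,"function":{"name":"ls","arguments":"{}"}}]}}]}',
    "data: [DONE]",
]
text, reasoning, tools = parse_sse_lines(stream)
```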
Its-all-redditive@reddit
I'm getting considerably higher benchmarks for the 4B 4096-token tests: consistent (over 10 benchmark runs) ~200 tok/s generation vs the expected ~150 tok/s. At 4096 tokens, the draft seems to be accepting about 1.2x more tokens per cycle than in the 1024-token runs, which must be the reason for the faster generation. Will be testing with the 9B and 27B 4-bit tomorrow. M5 Max 128GB
Safe_Sky7358@reddit
Let me know how it goes. Do we have to use the DFlash version of the model as the draft from z-lab's HF, or are there any alternatives?
On my potato MacBook Air with an M4 (16GB) I get about a 25% speedup (from about 32 tok/s to 40 tok/s) for the 4B model using 4B-DFlash as the draft model, but it actually slows down for the 9B with 9B-DFlash as the draft. :(
If I could get 9B to 30's or even mid 20's in terms of tps that would be a dream come true.
A somewhat isolated finding: I noticed that the MLX variants waste a lot of tokens on reasoning even when using the recommended parameters.
A Qwen3.5-9B-4bit from HF/mlx-community spits out about 2x the reasoning tokens solving the same prompt compared to HF/bartowski's 9B-4bit GGUF.
Prompt used : "Read the following information carefully and answer the questions given below:
i. There is a group of five persons A, B, C, D and E.
ii. One of them is a horticulturist, one is a physicist, one is a journalist, one is an industrialist and one is an advocate.
iii. Three of them A, C and advocate prefer tea to coffee and two of them - B and the journalist prefer coffee to tea.
iv. The industrialist and D and A, are friends to one another but two of them prefer coffee to tea.
v. The horticulturist is C's brother. What are the professions for A, B, C, D, E ? Be Brief in your response."
Answer : "A is the horticulturist,
B is the industrialist,
C is the physicist,
D is the journalist,
E is the advocate."
DerDave@reddit
Great work buddy! Wonder how well these diffusion models behave when compressed/quantized.
Colecoman1982@reddit
If I'm understanding your post correctly, it looks like this post might answer some of our questions: https://old.reddit.com/r/LocalLLaMA/comments/1skesyq/dflash_speculative_decoding_on_apple_silicon_41x/og0sswv/
DerDave@reddit
No I was talking about quantization of the draft model.
No-Judgment9726@reddit
Nice work. One thing I've been wondering with speculative decoding on Apple Silicon — how's the memory overhead looking? I've been running some 13B/30B models locally on M-series and memory is basically always the constraint. Would love to know if this stays practical once you go beyond ~13B.
ieatrox@reddit
/u/cryingneko
what are the chances we could see this in omlx?
Dorkits@reddit
Does something similar exist for the Windows environment?
putrasherni@reddit
Beautiful to see, although there are 4-5 repos doing the same thing.
I think you're ahead of them all on Qwen3.5 dense model performance.
mr_il@reddit
I benchmarked the Qwen3.5-27B bf16 and 8-bit versions (by mlx-community), comparing this implementation against https://github.com/Aryagm/dflash-mlx on an M5 Max 128GB. The DFlash draft models are at bf16. Same prompt at 2048 tokens, temp=0.
| LLM Quant | Baseline tg, tok/s | Aryagm DFlash tg, tok/s (speed-up) | bstnxbt DFlash tg, tok/s (speed-up) |
|---|---|---|---|
| bf16 | 9.1 | 29.8 (3.3x) | 29.3 (3.2x) |
| 8bit | 18.3 | 39.1 (2.1x) | 37.2 (2.0x) |
At 40 tok/s Qwen3.5-27B becomes almost usable.
Objective-Picture-72@reddit
Why do you say 40 tok/s is "almost" usable? That's about what the cloud API providers send across.
No_Shift_4543@reddit (OP)
Thanks for the detailed comparison! The 27B numbers are expected: at that model size, both implementations converge to the same throughput because it's pure memory bandwidth. The speedup differences show up on smaller models; 9B bf16 is where I get 4.1x thanks to the precision work.
putrasherni@reddit
And it’s 8 bit , really good to see
mr_il@reddit
Second this! 27B is the best coder of this generation of open models, especially considering its memory footprint. It's slow, though, and speeding it up 2-3x should make a great deal of difference for local development ergonomics.
putrasherni@reddit
Great work, sharing some higher quant results
Hardware : Apple M4 Max, 128GB unified memory
Qwen3.5-27B
Qwen3.5-35B-A3B
No_Shift_4543@reddit (OP)
Thanks for the thorough testing! Good to see acceptance holding at 89%+ across all quants.
putrasherni@reddit
Can you try getting Qwen3 Coder and Qwen3 Coder Next optimised as well?
putrasherni@reddit
Have you tested at longer contexts like 32K, 64K, 128K, and 256K?
How are acceptance and speed at those levels?
apetersson@reddit
Did you get this to work with Gemma 4 models? I tried to enable it with oMLX, but no observable speedup yet.
mr_il@reddit
How did you enable DFlash in oMLX?
No_Shift_4543@reddit (OP)
Not yet, waiting on z-lab to release a DFlash draft model for Gemma 4.
apetersson@reddit
Is gemma-4-E2B not suitable as a draft model for the 31B?
layer4down@reddit
The DFlash approach specifically calls for diffusion draft models, not transformers.
KubeKidOnTheBlock@reddit
Does this method of speculative decoding affect the benchmarks?
coder543@reddit
Speculative decoding does not affect model output quality, only speed and how much computation is used.
THS_Cardiacz@reddit
I would love for there to be a Swift implementation of this somewhere so I could embed it in my app. I may take a crack at it if no one else does.
layer4down@reddit
Dope! I happened to catch the repo commits when it was just 35 mins old. My specific interest is 27B bf16, and hot damn, those are lovely results! I just tested a few randos I had on deck.
Do you have training recipes or pointers for training the 397b model? I've been working on the same problem over the weekend but wasn't getting past ~38% acceptance.
layer4down@reddit
Oh and PS Thanks for your work! This is such a big deal for Apple users and I don't think people really appreciate that yet!
coder543@reddit
A few weeks ago? It wasn't even announced a few weeks ago, was it?
No_Shift_4543@reddit (OP)
Good catch, fixed.
My implementation is optimized for Qwen3.5's hybrid GDN.
DanzakFromEurope@reddit
The paper was released 2 months ago