DFlash speculative decoding on Apple Silicon: 4.1x on Qwen3.5-9B, now open source (MLX, M5 Max)
Posted by No_Shift_4543@reddit | LocalLLaMA | View on Reddit | 40 comments
A few weeks ago I posted early results from a native MLX implementation of DFlash. Since then I rewrote the benchmark methodology, fixed numerical issues, and open sourced the whole thing.
A small draft model generates 16 tokens in parallel via block diffusion; the target verifies them in one forward pass. Every emitted token is verified against the target model before being committed, so the output is lossless. Stock MLX, no fork.
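For anyone unfamiliar with the mechanics, here's a minimal sketch of the draft-then-verify loop in plain Python. The function names and the toy `target_next`/`draft_block` models are illustrative stand-ins, not the repo's API; in the real implementation the draft is a block-diffusion model and verification is a single batched target forward pass.

```python
# Toy sketch of lossless draft-then-verify speculative decoding.
# `draft_block` proposes a block of tokens; tokens are committed only
# while draft and target agree, so the output is identical to running
# the target alone (greedy decoding assumed for simplicity).

BLOCK = 16  # draft block size from the post

def target_next(ctx):
    # Stand-in for one greedy target forward pass.
    return (sum(ctx) * 31 + len(ctx)) % 100

def draft_block(ctx, k=BLOCK):
    # Stand-in for the block-diffusion draft: proposes k tokens at once.
    # By construction it matches the target except on the last token.
    out, c = [], list(ctx)
    for i in range(k):
        t = target_next(c)
        if i == k - 1:          # deliberately diverge at the end
            t = (t + 1) % 100
        out.append(t)
        c.append(t)
    return out

def speculative_step(ctx):
    proposal = draft_block(ctx)
    accepted = []
    for tok in proposal:
        expected = target_next(ctx + accepted)  # one batched verify pass in reality
        if tok != expected:
            accepted.append(expected)           # emit the target's token instead
            break
        accepted.append(tok)
    return accepted

out = speculative_step([1, 2, 3])
print(len(out), out[:3])
```

The key property: every committed token equals what the target would have produced on its own, which is why the method is lossless regardless of draft quality; acceptance rate only affects speed.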
Setup: M5 Max, 64GB, MLX 0.31.1. Baseline is stock mlx_lm.stream_generate, not a custom loop. 3 runs, median reported, 10s cooldown.
Results @ 2048 tokens
| Model | Baseline | DFlash | Speedup | Acceptance |
|---|---|---|---|---|
| Qwen3.5-4B | 53.74 tok/s | 219.83 tok/s | 4.10x | 89.3% |
| Qwen3.5-9B | 30.96 tok/s | 127.07 tok/s | 4.13x | 89.4% |
| Qwen3.5-27B-4bit | 32.35 tok/s | 62.78 tok/s | 1.90x | 89.1% |
| Qwen3.5-35B-A3B-4bit | 142.12 tok/s | 240.21 tok/s | 1.69x | 88.7% |
Full results at 1024/2048/4096 in the repo.
What changed since last post
- Baseline is now stock mlx_lm (was a custom Python loop that was slower, inflating the speedup)
- Tape-replay rollback: custom Metal kernel that replays only accepted steps through GatedDeltaNet recurrent state. No full checkpoint save/restore. This is what keeps acceptance at 89% over long generations.
- JIT 2-pass SDPA kernel for long-context verify (N >= 1024)
- Numerically stable bf16 paths across speculative cycles
- Acceptance went from ~82% to ~89% thanks to precision fixes
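The tape-replay idea can be illustrated with a toy recurrent state in plain Python. The real kernel is Metal and the real recurrence is GatedDeltaNet; everything below is an illustrative stand-in for the concept of replaying only accepted steps rather than checkpointing the full state.

```python
# Toy illustration of tape-replay rollback for a recurrent state.
# A GatedDeltaNet-style layer carries a state that advances once per
# token, so rejected draft tokens would leave it corrupted. Instead of
# saving/restoring a full checkpoint, keep the state at the last
# committed token and replay only the accepted steps.

def step(state, tok):
    # Stand-in for one recurrent update (a simple affine recurrence here).
    return (state * 3 + tok) % 1_000_003

def advance_committed(state, accepted_toks):
    # Replay exactly the accepted steps; the speculative steps for the
    # rejected tail are simply never applied to the committed state.
    for t in accepted_toks:
        state = step(state, t)
    return state

committed = 7
proposal = [4, 8, 15, 16, 23, 42]   # draft block
accepted = proposal[:4]             # verifier accepted the first 4
committed = advance_committed(committed, accepted)
# committed now equals the state you'd get if only the accepted
# tokens had ever been run, with no checkpoint save/restore.
```

The save/restore alternative would copy the full recurrent state every cycle; replaying only accepted steps avoids that traffic, which matters on bandwidth-bound hardware.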
What I learned
On unified memory everything is bandwidth-bound. Custom Metal kernels (batched-GEMV, fused gated SiLU, custom SDPA) all came back slower than stock MLX. The wins came from numerical precision, not compute optimization.
The 27B-4bit speedup is lower because the quantized target is already fast, making the bf16 draft the bottleneck. That's a structural limitation of speculative decoding on bandwidth-bound hardware with quantized targets.
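A back-of-envelope model makes the effect concrete. All timing numbers below are illustrative, not measurements from the repo; the point is that shrinking the target's cost while the draft's cost stays fixed caps the achievable speedup.

```python
# Why a fast (quantized) target caps the speedup: the speculative cycle
# pays a fixed draft cost per block, so cheapening only the target/verify
# side shrinks the ratio. Timings are made-up ms figures for illustration.

def speedup(t_target, t_draft_block, t_verify, tokens_per_cycle):
    # Baseline emits 1 token per t_target; speculative emits
    # tokens_per_cycle tokens per (draft + verify) cycle.
    baseline_rate = 1.0 / t_target
    spec_rate = tokens_per_cycle / (t_draft_block + t_verify)
    return spec_rate / baseline_rate

accepted = 14.2  # ~89% acceptance over a 16-token block

# bf16 target: the target pass is slow, so the draft overhead is small
# relative to the work it saves.
slow_target = speedup(t_target=32.0, t_draft_block=50.0,
                      t_verify=60.0, tokens_per_cycle=accepted)

# 4-bit target: target and verify get ~3x cheaper, but the bf16 draft
# block costs exactly the same.
fast_target = speedup(t_target=11.0, t_draft_block=50.0,
                      t_verify=21.0, tokens_per_cycle=accepted)

print(round(slow_target, 2), round(fast_target, 2))
```

With these toy numbers the bf16 case lands around 4x while the quantized case drops toward 2x at identical acceptance, matching the shape (not the exact values) of the table above.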
Built specifically for Qwen3.5's hybrid GatedDeltaNet + attention architecture. Pure attention models (Qwen3, Gemma) work but without the tape-replay benefits.
Roadmap
- Sustained acceptance at 4096+ tokens
- Full-attention model optimization
- Draft model compression
zhijianliu@reddit
We are also working on releasing drafts for more models. Name your favorites!
mr_il@reddit
Great work! I was able to nearly reproduce on M5 Max with Qwen3.5-4B @ 2048 tokens. Baseline: 54 tok/s. DFlash: 140 tok/s. Speedup: 2.6x. Acceptance: 82%. MLX 0.31.1. I will test with other models too, but I wonder what might explain the variation?
Anyway, a bigger question is what is your ambition with this implementation? Are you planning to develop a serving layer yourself or propose this implementation for mlx_lm?
No_Shift_4543@reddit (OP)
Keeping it standalone for now
mr_il@reddit
In that case, would you consider implementing a streaming OpenAI-compatible API that can parse reasoning and tool calls, so we could experiment with coding agents?
No_Shift_4543@reddit (OP)
it’s already implemented
mr_il@reddit
It didn’t parse tool calls or reasoning tokens when I tried using Qwen3.5-27B via OpenCode and dflash-serve. But if you’re saying it should be working, I’ll raise an issue on GitHub.
No_Shift_4543@reddit (OP)
updated: https://github.com/bstnxbt/dflash-mlx/commit/a6ecff4e9ccbcf793b23de3ac7e860c9b7d8be5b
No_Shift_4543@reddit (OP)
thanks for flagging this, you're right, tool calls and reasoning tokens aren't parsed in the current server. I'll add proper support for both (tool_calls + reasoning_content in the SSE stream)
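For anyone wiring this up client-side in the meantime, here's a minimal sketch of parsing such a stream. It assumes the common OpenAI-style delta fields (`content`, `tool_calls`, and the `reasoning_content` extension popularized by DeepSeek-style APIs); the server's actual schema may differ.

```python
import json

# Minimal client-side parser for an OpenAI-style SSE stream that
# carries text, reasoning, and tool-call deltas. Field names follow
# the common convention; this is a sketch, not dflash-serve's API.

def parse_sse_lines(lines):
    text, reasoning, tool_args = [], [], []
    for line in lines:
        if not line.startswith("data: "):
            continue
        payload = line[len("data: "):]
        if payload == "[DONE]":
            break
        delta = json.loads(payload)["choices"][0]["delta"]
        if delta.get("reasoning_content"):
            reasoning.append(delta["reasoning_content"])
        if delta.get("content"):
            text.append(delta["content"])
        for call in delta.get("tool_calls", []):
            # Argument strings arrive in fragments and are concatenated.
            tool_args.append(call["function"].get("arguments", ""))
    return "".join(text), "".join(reasoning), "".join(tool_args)

stream = [
    'data: {"choices":[{"delta":{"reasoning_content":"thinking..."}}]}',
    'data: {"choices":[{"delta":{"content":"Hello"}}]}',
    'data: {"choices":[{"delta":{"tool_calls":[{"index":0,"function":{"name":"ls","arguments":"{}"}}]}}]}',
    "data: [DONE]",
]
text, reasoning, tools = parse_sse_lines(stream)
```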
Its-all-redditive@reddit
I'm getting considerably higher benchmarks for the 4B 4096-token tests: consistent (over 10 benchmark runs) ~200 tok/s generation vs the expected ~150 tok/s. At 4096 tokens, the draft seems to be accepting about 1.2x more tokens per cycle than in the 1024-token runs, which must be the reason for the faster generation. Will be testing with the 9B and 27B 4-bit tomorrow. M5 Max 128GB
Safe_Sky7358@reddit
Let me know how it goes. Do we have to use the DFlash version of the model as the draft from z-lab's HF, or are there any alternatives?
On my potato MacBook Air with an M4 (16GB) I get about a 25% speedup (from about 32 tok/s to 40 tok/s) for the 4B model using 4B-DFlash as the draft model, but it actually slows down for the 9B with 9B-DFlash as the draft. :(
If I could get 9B to 30's or even mid 20's in terms of tps that would be a dream come true.
A somewhat isolated finding: I noticed that the MLX variants waste a lot of tokens on reasoning even when using the recommended parameters.
A Qwen3.5-9B-4bit from HF/mlx-community spits out about 2x the reasoning tokens solving the same prompt compared to HF/bartowski's 9B-4bit GGUF.
Prompt used : "Read the following information carefully and answer the questions given below:
i. There is a group of five persons A, B, C, D and E.
ii. One of them is a horticulturist, one is a physicist, one is a journalist, one is an industrialist and one is an advocate.
iii. Three of them A, C and advocate prefer tea to coffee and two of them - B and the journalist prefer coffee to tea.
iv. The industrialist and D and A, are friends to one another but two of them prefer coffee to tea.
v. The horticulturist is C's brother. What are the professions for A, B, C, D, E ? Be Brief in your response."
Answer : "A is the horticulturist,
B is the industrialist,
C is the physicist,
D is the journalist,
E is the advocate."
DerDave@reddit
Great work buddy! Wonder how well these diffusion models behave when compressed/quantized.
Colecoman1982@reddit
If I'm understanding your post correctly, it looks like this post might answer some of our questions: https://old.reddit.com/r/LocalLLaMA/comments/1skesyq/dflash_speculative_decoding_on_apple_silicon_41x/og0sswv/
DerDave@reddit
No I was talking about quantization of the draft model.
No-Judgment9726@reddit
Nice work. One thing I've been wondering with speculative decoding on Apple Silicon — how's the memory overhead looking? I've been running some 13B/30B models locally on M-series and memory is basically always the constraint. Would love to know if this stays practical once you go beyond ~13B.
ieatrox@reddit
/u/cryingneko
what are the chances we could see this in omlx?
Dorkits@reddit
Does something similar exist for the Windows environment?
putrasherni@reddit
Beautiful to see, although there are 4-5 repos doing the same thing.
I think you're ahead of them all on Qwen3.5 dense model performance.
mr_il@reddit
I benchmarked the Qwen3.5-27B bf16 and 8-bit versions (by mlx-community), comparing this implementation against https://github.com/Aryagm/dflash-mlx on an M5 Max 128GB. The DFlash draft models are at bf16. Same prompt at 2048 tokens, temp=0.
| LLM Quant | Baseline tg, tok/s | Aryagm DFlash tg, tok/s (speed-up) | bstnxbt DFlash tg, tok/s (speed-up) |
|---|---|---|---|
| bf16 | 9.1 | 29.8 (3.3x) | 29.3 (3.2x) |
| 8bit | 18.3 | 39.1 (2.1x) | 37.2 (2.0x) |
At 40 tok/s Qwen3.5-27B becomes almost usable.
Objective-Picture-72@reddit
Why do you say 40 tok/s is "almost" usable? That's about what the cloud API providers send across.
No_Shift_4543@reddit (OP)
Thanks for the detailed comparison! The 27B numbers are expected: at that model size, both implementations converge to the same throughput because it's pure memory bandwidth. The speedup differences show up on smaller models; 9B bf16 is where I get 4.1x thanks to the precision work.
putrasherni@reddit
And it’s 8 bit , really good to see
mr_il@reddit
Second this! 27B is the best coder of this generation of open models, especially considering its memory footprint. It's slow, though, and speeding it up 2-3x should make a great deal of difference for local development ergonomics.
putrasherni@reddit
Great work, sharing some higher quant results
Hardware : Apple M4 Max, 128GB unified memory
Qwen3.5-27B
Qwen3.5-35B-A3B
No_Shift_4543@reddit (OP)
Thanks for the thorough testing! Good to see acceptance holding at 89%+ across all quants.
putrasherni@reddit
Can you try getting Qwen3 Coder and Qwen3 Coder Next optimised as well?
putrasherni@reddit
Have you tested at longer contexts like 32K, 64K, 128K, and 256K?
How are acceptance and speed at those levels?
apetersson@reddit
Did you get this to work with Gemma 4 models? I tried to enable it with oMLX, but no observable speedup yet.
mr_il@reddit
How did you enable DFlash in oMLX?
No_Shift_4543@reddit (OP)
Not yet, waiting on z-lab to release a DFlash draft model for Gemma 4.
apetersson@reddit
Is gemma-4-E2B not suitable as a draft model for the 31B?
layer4down@reddit
The DFlash approach specifically calls for diffusion draft models, not transformers.
KubeKidOnTheBlock@reddit
Does this method of speculative decoding affect the benchmarks?
coder543@reddit
Speculative decoding does not affect model output quality, only speed and how much computation is used.
THS_Cardiacz@reddit
I would love for there to be a Swift implementation of this somewhere so I could embed it in my app. I may take a crack at it if no one else does.
layer4down@reddit
Dope! I happened to catch the repo commits when it was just 35 mins old. My specific interest is 27B bf16, and hot damn, those are lovely results! I just tested a few randos I had on deck.
Do you have training recipes or pointers for training the 397b model? I've been working on the same problem over the weekend but wasn't getting past ~38% acceptance.
layer4down@reddit
Oh and PS Thanks for your work! This is such a big deal for Apple users and I don't think people really appreciate that yet!
coder543@reddit
A few weeks ago? It wasn't even announced a few weeks ago, was it?
No_Shift_4543@reddit (OP)
Good catch, fixed.
My implementation is optimized for Qwen3.5's hybrid GDN.
DanzakFromEurope@reddit
The paper was released 2 months ago