DFlash speculative decoding on Apple Silicon: 85 tok/s, 3.3x on Qwen3.5-9B (MLX, M5 Max)
Posted by No_Shift_4543@reddit | LocalLLaMA | View on Reddit | 36 comments
I'm building a native MLX implementation of DFlash (paper) for Apple Silicon. A small draft model generates 16 tokens in parallel via block diffusion, the target verifies them in one forward pass. Output is bit-for-bit identical to baseline (greedy exact argmax match).
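For readers unfamiliar with the verify step: under greedy decoding, the target accepts the longest prefix of the draft block that matches its own argmax at each position, and replaces the first mismatch with its own token, which is why the output is bit-for-bit identical to the baseline. A minimal NumPy stand-in for that acceptance rule (illustration only, not the actual runtime; shapes and tokens are made up):

```python
import numpy as np

def greedy_verify(draft_tokens, target_logits):
    """Accept the longest prefix of the draft block matching the target's
    argmax; the first mismatch is replaced by the target's own token, so
    the final sequence equals what the baseline would have generated.

    draft_tokens:  (k,)   token ids proposed by the draft model
    target_logits: (k, V) target logits at each drafted position
    """
    target_tokens = target_logits.argmax(axis=-1)
    accepted = []
    for d, t in zip(draft_tokens, target_tokens):
        if d != t:
            accepted.append(int(t))  # correction token from the target
            break
        accepted.append(int(d))
    return accepted

# Toy example: target's argmax per position is [2, 0, 4, 1].
logits = np.full((4, 5), -1.0)
for i, tok in enumerate([2, 0, 4, 1]):
    logits[i, tok] = 1.0
print(greedy_verify([2, 0, 3, 1], logits))  # → [2, 0, 4]
```

Every accepted token costs one verify pass amortized over the whole block, which is where the speedup comes from.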
Setup: M5 Max, 64GB, MLX, no CUDA.
Results
Qwen3.5-9B bf16
| Gen length | DFlash | Baseline | Speedup |
|---|---|---|---|
| 1024 tokens | 85 tok/s | 26 tok/s | 3.3x |
| 2048 tokens | 80 tok/s | 26 tok/s | 3.1x |
Qwen3.5-4B bf16
| Gen length | DFlash | Baseline | Speedup |
|---|---|---|---|
| 1024 tokens | 109 tok/s | 41 tok/s | 2.7x |
| 2048 tokens | 133 tok/s | 42 tok/s | 3.2x |
The 4B actually gets faster at longer generation. The model is small enough that the draft/verify balance stays healthy as context grows.
Qwen3.5-27B quantized
| Quant | Gen length | DFlash | Baseline | Speedup |
|---|---|---|---|---|
| 8bit | 1024 tokens | 35 tok/s | 14 tok/s | 2.5x |
| 8bit | 2048 tokens | 26 tok/s | 11 tok/s | 2.3x |
| 4bit | 1024 tokens | 44 tok/s | 24 tok/s | 1.9x |
| 4bit | 2048 tokens | 40 tok/s | 23 tok/s | 1.7x |
8bit gives better speedup ratios than 4bit. int4 makes the verify so fast that the bf16 draft becomes the bottleneck. With int8, the draft/verify balance is healthier.
All numbers are generation only (first token to last token, no prefill). Acceptance around 80-87% across all models.
What I built
No DFlash MLX implementation existed. I wrote the runtime from scratch. What actually moved the numbers:
head_dim=256 patch. Qwen3.5-9B uses head_dim=256, which MLX's steel_attention didn't support. A 2-line patch unlocked the fast SDPA path.
Sync elision. Restructured the pipeline from 2 GPU→CPU syncs per cycle to 1. At 80+ tok/s each sync costs ~0.5ms.
Packed QKV projection. 3 matmuls → 1 matmul + split. Fewer kernel dispatches per layer.
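The packed projection is just weight concatenation: stack the Q, K, and V weight matrices once at load time, run one matmul per layer, and slice the output. A NumPy sketch of the equivalence (dims are hypothetical, not Qwen's actual shapes):

```python
import numpy as np

d_model, d_q, d_kv = 64, 64, 16  # hypothetical dims for illustration
rng = np.random.default_rng(0)
x  = rng.standard_normal((8, d_model))   # (tokens, d_model)
Wq = rng.standard_normal((d_model, d_q))
Wk = rng.standard_normal((d_model, d_kv))
Wv = rng.standard_normal((d_model, d_kv))

# Three kernel dispatches:
q, k, v = x @ Wq, x @ Wk, x @ Wv

# One dispatch + split: concatenate weights along the output dim once,
# then slice the single result. Numerically identical, fewer launches.
W_qkv = np.concatenate([Wq, Wk, Wv], axis=1)
qkv = x @ W_qkv
q2, k2, v2 = np.split(qkv, [d_q, d_q + d_kv], axis=1)

assert np.allclose(q, q2) and np.allclose(k, k2) and np.allclose(v, v2)
```

The win is pure dispatch overhead: the FLOPs are identical, but one large GEMM keeps the GPU busier than three smaller ones.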
Lessons on Apple Silicon
On unified memory everything is bandwidth-bound, which changes the speculative decoding game:
Custom Metal kernels (batched-GEMV, fused gated SiLU, custom SDPA) all came back slower, running at 0.5-0.8x the throughput of stock MLX steel GEMM. Ended up reverting all of them.
Verify cost is almost flat from 4 to 16 tokens (57ms vs 59ms). Weight loading dominates, not token count. "Verify fewer tokens when confidence is low" doesn't help here.
On quantized models, the optimization landscape flips: the draft (bf16) becomes slower than the verify (int4/int8). This is the opposite of the bf16 case and is a structural limitation of speculative decoding on bandwidth-bound hardware with quantized targets.
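One way to see the flip is a back-of-envelope cycle model: each cycle pays one (parallel) draft pass plus one verify pass and yields roughly acceptance × block-size tokens. All millisecond figures below are invented for illustration, not measurements:

```python
def cycle(t_draft_ms, t_verify_ms, block=16, acceptance=0.85):
    """Toy speculative-decoding cycle model (no sync or sampling overhead).
    Returns (tokens/s, fraction of the cycle spent in the draft)."""
    tokens = acceptance * block
    total = t_draft_ms + t_verify_ms
    return 1000.0 * tokens / total, t_draft_ms / total

# bf16 target: verify dominates, so a fixed-cost bf16 draft is nearly free.
tps_bf16, share_bf16 = cycle(t_draft_ms=20, t_verify_ms=140)  # draft ~12%

# int4 target: verify shrinks ~4x but the bf16 draft does not shrink at all,
# so the draft's share of the cycle roughly triples and caps the speedup.
tps_int4, share_int4 = cycle(t_draft_ms=20, t_verify_ms=35)   # draft ~36%
print(share_bf16, share_int4)
```

Since Amdahl's law applies to the draft's share of the cycle, quantizing the target without also shrinking the draft eventually stops paying off.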
Currently working on
Draft compression/distillation for the 27B to fix the bf16 draft bottleneck on quantized targets.
Long context stability. Speedup degrades past 2K tokens due to KV cache growth.
MoE models. DFlash drafts exist for Qwen3.5-35B-A3B (35B total, 3B active). Verify cost of a small model, quality of a large one.
Everything is still very much under construction. Will open source when ready.
akavel@reddit
knowing nothing about DFlash: is there a memory overhead? or some other tradeoff? or is it "free lunch"? the sub's favorite qwen3.5-27b currently "barely" fits (actually starts swapping already, but doesn't OOM-crash yet) on my 32gb M4 with llama.cpp; will this give me "free speedup", or I won't be able to run it at all? is it maybe M5+ only?
trying to manage my expectations/excitement 😂
VoiceApprehensive893@reddit
yes, it's speculative decoding, so there is a second model in RAM
fallingdowndizzyvr@reddit
In this instance. But not all speculative decoding requires a second model. There is self speculative decoding.
oxygen_addiction@reddit
Which only reuses existing context. So basically doing copy and paste for bits that were already mentioned before (editing code, displaying a large bit of text, etc.)
akavel@reddit
Understood, so probably no chance on my machine; pity but thanks!
ezyz@reddit
Any plans to add this to mlx-lm? Or is this standalone?
Mochila-Mochila@reddit
That's really cool ! Thanks for working on this and sharing your results 👍
I hope DFlash will be generalised across models and platforms.
DonnaPollson@reddit
This is the kind of optimization work that actually moves local inference forward: not a vague "Apple is fast" claim, but a clear demonstration that bandwidth realities change which tricks win. The bit that stands out to me is your 8-bit result, because it shows the bottleneck can migrate so hard that the draft becomes the liability, which is exactly the sort of systems insight most benchmark posts skip. If you open source this with notes on where MLX helped versus where it fought you, it’ll probably teach people more than the raw tok/s number.
layer4down@reddit
This is amazing work! Now we’ve got to find the DFlash speculative prefill paper and address the real Apple Silicon use bottleneck. Even just 2-4x boost in prefill performance on Apple Silicon would be massive for long suffering Apple users.
GroundbreakingMall54@reddit
85 tok/s on a 9B is genuinely impressive. block diffusion generating 16 tokens in parallel is such a clever approach, way more interesting than just throwing bigger gpus at the problem. apple silicon keeps quietly becoming the best bang for buck for local inference
layer4down@reddit
I think it’s more so that we all know the potential of this platform and we’re all the more determined to get the most bang for this very high buck!
putrasherni@reddit
and it's BF16
No_Shift_4543@reddit (OP)
thanks! yeah block diffusion is a really nice fit for this hardware. still pushing it, the 27B is next
zeth0s@reddit
Why specifically for this hardware?
Thrumpwart@reddit
Awesome. Any plans for incorporating into mlx-lm?
nonerequired_@reddit
Can’t wait for the 27B
CATLLM@reddit
This is amazing. This makes me even more excited about my M5 Max MBP.
BeeegZee@reddit
What about DFlash vs MTP on the same HW?
alexx_kidd@reddit
Has anyone tried this on an M5 Pro?
snugglezone@reddit
God bless you
Zestyclose_Yak_3174@reddit
This looks promising. And it is very welcome on Apple Silicon, especially for dense models, which can get slow real quick
Remarkable_Jicama775@reddit
Great work — the sync elision and head_dim=256 patch are exactly the kind of Apple Silicon-specific insights that don't show up in the paper.
I'm building an open-source MLX port: github.com/eauchs/mlx-dflash — weight conversion from z-lab safetensors, native MLX draft model, full speculative loop with the same optimizations you documented (packed QKV, single GPU→CPU sync). Will drop benchmarks on M3 Max 128GB this weekend.
Happy to coordinate if you open-source yours first — no point duplicating.
Master-Refuse-6094@reddit
All the updates and the creation date are from 10-15 minutes ago, and you even put the original post's author as a reference in your README?
Can we see the prompt you sent to your QWEN 3.5B asking it to reproduce someone else's work so you could try to pass it off as your own, or not?
Remarkable_Jicama775@reddit
Fair callout on the timing, I posted within hours of your post, that looks bad.
To be clear: I didn't use an LLM to reproduce your work. I read your post, recognized the architecture from the paper, and implemented an MLX port from scratch using the z-lab model card and modeling_dflash.py directly from HuggingFace.
The README references your post because your benchmarks (M5 Max numbers) are the only published MLX results, I don't have mine yet, benchmark is running right now. I should have framed that more explicitly as "reference numbers, not mine."
If you open-source yours first I'll defer entirely and link to it. If not, happy to coordinate so we're not duplicating effort.
VoiceApprehensive893@reddit
https://i.redd.it/mv1rx9tqllug1.gif
JacketHistorical2321@reddit
Says the person who has no idea what they are talking about...
riceinmybelly@reddit
MLX?
Specter_Origin@reddit
still waiting for gemma
ML-Future@reddit
Dear God:
If you allow this to be implemented tomorrow on llama.cpp, I will never be evil again.
pmttyji@reddit
Someone gonna try that soon or later
cryptofriday@reddit
Nice...
aigemie@reddit
So the 27b could get around 30 t/s? Great job! Can't wait!
Equal-Document4213@reddit
Anyone know if they plan on releasing a training recipe for dflash? Trying to figure out how to use this without performance loss on finetuned models.
Sugaaray@reddit
Wow
DerDave@reddit
Love that people get their hands on this! Can't wait for the first llama.cpp implementations!
dinerburgeryum@reddit
Good early results. Can’t wait for the repo.