DFlash Doubles the T/S Gen Speed of Qwen3.5 27B (BF16) on Mac M5 Max
Posted by MiaBchDave@reddit | LocalLLaMA | View on Reddit | 31 comments
The new DFlash support in oMLX 0.3.5 RC1 looks like it doubles (!!!) the speed of Qwen3.5 27B (BF16). Initial test. Generation T/S went from 9 to 22 T/S!
Models used (HuggingFace)
Main Model: Jackrong/MLX-Qwopus3.5-27B-v3-bf16
Draft Model: z-lab/Qwen3.5-27B-DFlash
System: M5 Max 128GB
DFlash on Github: https://github.com/bstnxbt/dflash-mlx?tab=readme-ov-file
oMLX (v0.3.5 RC1): https://omlx.ai
I'm not affiliated with any of the developers. Since the Qwen3.5 27B model is so good for the size, with speed being the only thing holding it back, I thought that this may help deploy this model locally at higher quants/full weights.
I've yet to test with OpenCode or other harnesses.
R_Duncan@reddit
Nice, now please test 4-bit quantized and multi-user, as other reports are saying:
- with 4-bit quantization the DFlash gain is minimal
- multi-user / streaming gain decreases with the number of users: roughly halved at 2 users, ~20% at 4, ~0% at 8.
j_osb@reddit
that prefill though. god.
xienze@reddit
Yeah, that's the Achilles' heel of Macs. If you want to use it for agentic coding purposes, forget it (at least with this particular model).
MiaBchDave@reddit (OP)
The M5 Max does not have a pre-fill "problem" like older M series chips. It's not a desktop 5090, but not slow at all. Plus oMLX has hot/cold cache for agentic work and large codebases, making things MUCH faster compared to llama.cpp for any harness.
xienze@reddit
Isn't the OP running this on an M5 Max and getting 17 tokens/sec PP?
Thrumpwart@reddit
oMLX also has prefix caching, so for agentic tasks it dramatically speeds up PP by caching pre-computed context.
MiaBchDave@reddit (OP)
The prefill number isn't reading correctly… at least cold, with a one-line prompt. The full benchmark with the same model shows prefill at 950 t/s on my same system with 4K context. The oMLX website has community benchmarks if you're interested in regular PP/TS numbers. Those use normalized settings though, with static model configs and no speculative options.
Fast-Gold125@reddit
DFlash for Gemma 4 31B pls
Specter_Origin@reddit
Even with that you will get like 20 TPS, how is that even usable for day to day? A MoE would be a much better pick... you could get a usable 100 TPS if it truly doubles.
robertpro01@reddit
Does anyone have a workaround for llama.cpp? I want to use it for 122B or 27B.
OrkanFlorian@reddit
I have been experimenting with trying to implement DFlash for llama.cpp, but it turns out it is really, really hard (even with the help of Codex and CC).
If I do get it working somehow, I will try to get it upstream, but with how llama.cpp maintainers tick that could take a while. (not that I judge them, they are being flooded with vibe coded, bullshit PRs sadly)
robertpro01@reddit
I hope so my friend!
Expensive_Demand1069@reddit
Sorry if this is a noob question... but does this also work on llama.cpp with CUDA/ROCm?
MiaBchDave@reddit (OP)
You can use it with vLLM, no llama.cpp yet AFAIK. The DFlash model on HF has install instructions.
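For reference, a vLLM launch with a draft model typically looks something like the sketch below. This is a hypothetical invocation: `<target-model>` is a placeholder, and the assumption is that the DFlash draft plugs into vLLM's standard draft-model speculative decoding path (the exact flags vary by vLLM version, so check the install instructions on the HF model card).

```shell
# Hypothetical: serve a target model with z-lab/Qwen3.5-27B-DFlash as the
# speculative draft. <target-model> is a placeholder for the full-size model.
vllm serve <target-model> \
  --speculative-config '{"model": "z-lab/Qwen3.5-27B-DFlash", "num_speculative_tokens": 4}'
```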
po_stulate@reddit
Are you using it on a finetuned qwen3.5-27b? Wouldn't that lead to low acceptance rate?
MiaBchDave@reddit (OP)
Likely. This one “thinks less” than stock, as it’s fine-tuned on Opus. The acceptance rates are already really high, judging from the DFlash GitHub readme. Worth reverting to the regular 27B to check on a real codebase, though.
ofan@reddit
How’s pp speed?
j_osb@reddit
it lists it there. 17 vs... 1. t/s. that's why the second has a ttft of like 15s.
MiaBchDave@reddit (OP)
That's not the actual PP speed (one line, cold); the plain model benches at 950 t/s at 4K context, 750 t/s at 1K. Benches are on the oMLX website for M5 Max.
maschayana@reddit
How many max tokens?
Dany0@reddit
use the base qwen model ideally and don't use qwopus v3, it underperforms in my real tests and isn't worth the token savings. the only "finetune" I can vouch for is jackasda211233/Qwen3.5-27B-Uncensored-RYS-Reasoner-GGUF, but since the dflash model is trained to predict the base model you'll get vastly lower acceptance rates (20-40% as opposed to 60-70% in coding tasks, and even lower in general tasks)
Once zlab releases all their code we can finetune our own dflash drafters though!
Robos_Basilisk@reddit
Speculative decoding depends on the acceptance rate. The acceptance rate is higher in code than in prose, so take the numbers with a grain of salt.
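To make that concrete, here is a rough model of why acceptance rate matters so much. Assuming an i.i.d. per-token acceptance probability (the simplified analysis common in the speculative decoding literature; the numbers below are illustrative, not DFlash measurements), the expected tokens gained per verification step is:

```python
def expected_tokens_per_step(alpha: float, k: int) -> float:
    """Expected tokens emitted per target-model verification step,
    with k drafted tokens and per-token acceptance probability alpha.
    Geometric series: 1 + alpha + alpha^2 + ... + alpha^k."""
    return (1 - alpha ** (k + 1)) / (1 - alpha)

# Illustrative rates: code-like output vs. general prose, 4 drafted tokens.
print(expected_tokens_per_step(0.7, 4))  # ~2.77 tokens/step for code
print(expected_tokens_per_step(0.3, 4))  # ~1.43 tokens/step for prose
```

So at a 70% acceptance rate you emit nearly 3 tokens per expensive target-model pass, but at 30% you barely beat plain decoding, and the draft-model overhead can eat the remaining gain.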
Ciffa_@reddit
There is a similar concept that can be applied to prompt processing, you can search for it: speculative prefill. A couple of new papers came out in early 2026.
mr_Owner@reddit
Yeah, I'm fast at math too but doesn't mean I'm good at it.
Did you do some proper benchmarks?
po_stulate@reddit
The output quality should be 100% the same, but in one of the GH issues they said that the current draft model was only trained on small contexts, with no agentic training, so the acceptance rate is very low in those cases, which will actually slow the model down instead of speeding it up.
snapo84@reddit
You should try DFlash + DDTree
https://x.com/nash_su/status/2043924682802712600
Beginning-Window-115@reddit
would this still work at lower quants?
WhatTheFlukz@reddit
so how much extra vram does this take up?
MiaBchDave@reddit (OP)
Draft model is around 3.6GB
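Back-of-envelope on the total footprint (assuming 2 bytes per parameter for BF16 and decimal GB; the 3.6 GB draft figure is from above):

```python
def bf16_footprint_gb(params_billion: float) -> float:
    # BF16 stores 2 bytes per parameter (decimal GB, weights only,
    # excluding KV cache and runtime overhead).
    return params_billion * 2

main = bf16_footprint_gb(27)   # 54.0 GB for the 27B target model
total = main + 3.6             # + draft model reported above
print(total)                   # ~57.6 GB, well within 128 GB unified memory
```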
Objective-Picture-72@reddit
22 t/s is legit