DFlash Doubles the T/S Gen Speed of Qwen3.5 27B (BF16) on Mac M5 Max
Posted by MiaBchDave@reddit | LocalLLaMA | View on Reddit | 31 comments
The new DFlash support in oMLX 0.3.5 RC1 looks like it doubles (!!!) the speed of Qwen3.5 27B (BF16). Initial test. Generation T/S went from 9 to 22 T/S!
Models used (HuggingFace)
Main Model: Jackrong/MLX-Qwopus3.5-27B-v3-bf16
Draft Model: z-lab/Qwen3.5-27B-DFlash
System: M5 Max 128GB
DFlash on Github: https://github.com/bstnxbt/dflash-mlx?tab=readme-ov-file
oMLX (v0.3.5 RC1): https://omlx.ai
I'm not affiliated with any of the developers. Since the Qwen3.5 27B model is so good for the size, with speed being the only thing holding it back, I thought that this may help deploy this model locally at higher quants/full weights.
I've yet to test with OpenCode or other harnesses.
R_Duncan@reddit
Nice, now please test 4-bit quantized and multi-user, as other reports are saying:
- with 4-bit quantization the DFlash gain is minimal
- multi-user / streaming gain decreases with the number of users: roughly halved at 2 users, ~20% at 4, ~0% at 8.
j_osb@reddit
that prefill though. god.
xienze@reddit
Yeah, that's the Achilles' heel of Macs. If you want to use it for agentic coding purposes, forget it (at least with this particular model).
MiaBchDave@reddit (OP)
The M5 Max does not have a pre-fill "problem" like older M series chips. It's not a desktop 5090, but not slow at all. Plus oMLX has hot/cold cache for agentic work and large codebases, making things MUCH faster compared to llama.cpp for any harness.
xienze@reddit
Isn't the OP running this on an M5 Max and getting 17 tokens/sec PP?
Thrumpwart@reddit
oMLX also has prefix caching, so for agentic tasks it dramatically speeds up PP by caching pre-computed context.
MiaBchDave@reddit (OP)
The prefill number isn't reading correctly… at least cold, with a one-line prompt. The full benchmark with the same model shows prefill at 950 t/s on my same system with 4K context. The oMLX website has community benchmarks if you're interested in regular PP/TS numbers. Those use normalized settings though, with static model configs and no speculative options.
Fast-Gold125@reddit
DFlash for Gemma 4 31B pls
Specter_Origin@reddit
Even with that you will get like 20 TPS, how is that even usable for day to day? A MoE would be a much better pick... you could get a usable 100 TPS if it truly doubles.
robertpro01@reddit
Does anyone have a workaround for llama.cpp? I want to use it for 122B or 27B.
OrkanFlorian@reddit
I have been experimenting with trying to implement DFlash for llama.cpp, but it turns out it is really, really hard (even with the help of Codex and CC).
If I do get it working somehow, I will try to get it upstream, but with how llama.cpp maintainers tick that could take a while. (not that I judge them, they are being flooded with vibe coded, bullshit PRs sadly)
robertpro01@reddit
I hope so my friend!
Expensive_Demand1069@reddit
Sorry if this is a noob question... but does this also work on llama.cpp with CUDA/ROCm?
MiaBchDave@reddit (OP)
You can use it with vLLM, no llama.cpp yet AFAIK. The DFlash model on HF has install instructions.
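For reference, a vLLM launch with a draft model typically looks something like the sketch below. This is a hypothetical invocation: `<target-model>` is a placeholder, and the assumption is that the DFlash draft plugs into vLLM's standard draft-model speculative decoding path (the exact flags vary by vLLM version, so check the install instructions on the HF model card).

```shell
# Hypothetical: serve a target model with z-lab/Qwen3.5-27B-DFlash as the
# speculative draft. <target-model> is a placeholder for the full-size model.
vllm serve <target-model> \
  --speculative-config '{"model": "z-lab/Qwen3.5-27B-DFlash", "num_speculative_tokens": 4}'
```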
po_stulate@reddit
Are you using it on a finetuned qwen3.5-27b? Wouldn't that lead to low acceptance rate?
MiaBchDave@reddit (OP)
Likely. This one “thinks less” than stock, as it’s fine-tuned on Opus. The acceptance rates are already really high, judging from the DFlash GitHub readme. Worth reverting to the regular 27B to check on a real codebase, though.
ofan@reddit
How’s pp speed?
j_osb@reddit
it lists it there. 17 vs... 1. t/s. that's why the second has a ttft of like 15s.
MiaBchDave@reddit (OP)
That's not the actual PP speed (one line, cold); the plain model benches at 950 t/s at 4K context, 750 t/s at 1K. Benches are on the oMLX website for M5 Max.
maschayana@reddit
How many max tokens?
Dany0@reddit
use the base qwen model ideally and don't use qwopus v3, it underperforms in my real tests and isn't worth the token savings. the only "finetune" I can vouch for is jackasda211233/Qwen3.5-27B-Uncensored-RYS-Reasoner-GGUF, but since the dflash model is trained to predict the base model you'll get vastly lower acceptance rates (20-40% as opposed to 60-70% in coding tasks, and even lower in general tasks)
Once zlab releases all their code we can finetune our own dflash drafters though!
Robos_Basilisk@reddit
Speculative decoding depends on the acceptance rate. The acceptance rate is higher in code than in prose, so take the numbers with a grain of salt.
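To make that concrete, here is a rough model of why acceptance rate matters so much. Assuming an i.i.d. per-token acceptance probability (the simplified analysis common in the speculative decoding literature; the numbers below are illustrative, not DFlash measurements), the expected tokens gained per verification step is:

```python
def expected_tokens_per_step(alpha: float, k: int) -> float:
    """Expected tokens emitted per target-model verification step,
    with k drafted tokens and per-token acceptance probability alpha.
    Geometric series: 1 + alpha + alpha^2 + ... + alpha^k."""
    return (1 - alpha ** (k + 1)) / (1 - alpha)

# Illustrative rates: code-like output vs. general prose, 4 drafted tokens.
print(expected_tokens_per_step(0.7, 4))  # ~2.77 tokens/step for code
print(expected_tokens_per_step(0.3, 4))  # ~1.43 tokens/step for prose
```

So at a 70% acceptance rate you emit nearly 3 tokens per expensive target-model pass, but at 30% you barely beat plain decoding, and the draft-model overhead can eat the remaining gain.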
Ciffa_@reddit
There is a similar concept that can be applied to prompt processing, you can search for it: speculative prefill. A couple of new papers came out in early 2026.
mr_Owner@reddit
Yeah, I'm fast at math too but doesn't mean I'm good at it.
Did you do some proper benchmarks?
po_stulate@reddit
The output quality should be 100% the same, but in one of the GH issues they said that the current draft model was only trained on small contexts, with no agentic training, so the acceptance rate is very low in those cases, which will actually slow the model down instead of speeding it up.
snapo84@reddit
You should try DFlash + DDTree
https://x.com/nash_su/status/2043924682802712600
Beginning-Window-115@reddit
would this still work at lower quants?
WhatTheFlukz@reddit
so how much extra vram does this take up?
MiaBchDave@reddit (OP)
Draft model is around 3.6GB
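Back-of-envelope on the total footprint (assuming 2 bytes per parameter for BF16 and decimal GB; the 3.6 GB draft figure is from above):

```python
def bf16_footprint_gb(params_billion: float) -> float:
    # BF16 stores 2 bytes per parameter (decimal GB, weights only,
    # excluding KV cache and runtime overhead).
    return params_billion * 2

main = bf16_footprint_gb(27)   # 54.0 GB for the 27B target model
total = main + 3.6             # + draft model reported above
print(total)                   # ~57.6 GB, well within 128 GB unified memory
```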
Objective-Picture-72@reddit
22 t/s is legit