DFlash is real: x2 tg on small context with oMLX
Posted by dpswt@reddit | LocalLLaMA | View on Reddit | 7 comments
Right from the oven with the latest commit:
DFLASH_MAX_CTX=8192 uv run python -m omlx.cli serve
oMLX - LLM inference, optimized for your Mac
https://github.com/jundot/omlx
Benchmark Model: Qwen3.5-35B-A3B-MLX-MXFP4-FP16
================================================================================
Single Request Results
--------------------------------------------------------------------------------
Test TTFT(ms) TPOT(ms) pp TPS tg TPS E2E(s) Throughput Peak Mem
pp1024/tg128 1471.2 6.94 696.0 tok/s 145.3 tok/s 2.352 489.8 tok/s 21.24 GB
pp4096/tg128 7213.7 6.76 567.8 tok/s 149.0 tok/s 8.073 523.3 tok/s 23.49 GB
pp8192/tg128 13674.1 14.23 599.1 tok/s 70.8 tok/s 15.481 537.4 tok/s 21.51 GB
pp16384/tg128 25626.5 17.10 639.3 tok/s 58.9 tok/s 27.798 594.0 tok/s 22.76 GB
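As a sanity check, the derived columns in the table are internally consistent: pp TPS ≈ pp tokens / TTFT, and the overall throughput ≈ (pp + tg) / E2E. A quick script to verify, with values copied from the rows above:

```python
# Verify the benchmark table's derived columns against its raw timings.
# Each row: (pp tokens, tg tokens, TTFT ms, E2E s, reported pp TPS, reported throughput)
rows = [
    (1024, 128, 1471.2, 2.352, 696.0, 489.8),
    (4096, 128, 7213.7, 8.073, 567.8, 523.3),
    (8192, 128, 13674.1, 15.481, 599.1, 537.4),
    (16384, 128, 25626.5, 27.798, 639.3, 594.0),
]

for pp, tg, ttft_ms, e2e_s, pp_tps, thr in rows:
    calc_pp_tps = pp / (ttft_ms / 1000)   # prefill speed = prompt tokens / TTFT
    calc_thr = (pp + tg) / e2e_s          # throughput over the whole request
    assert abs(calc_pp_tps - pp_tps) < 0.5, (calc_pp_tps, pp_tps)
    assert abs(calc_thr - thr) < 0.2, (calc_thr, thr)
print("table is internally consistent")
```

Note that tg TPS is the only column that benefits from DFlash: prefill is unchanged, so the E2E gain shrinks as the prompt dominates the request.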
More benchmarks here.
Puzzleheaded_Base302@reddit
Seconding this. Got it working on my RTX PRO 4500 32GB.
Token rate went from 22 tps to 60 tps for qwen3.5-27b-awq, almost a 3x improvement.
Unfortunately, 32 GB of VRAM is right on the edge for running qwen3.5-27b on vLLM; I can only do a 2048 context length.
In case anyone wants to reproduce this, here is a known-working command line.
Claude found a dtype-mismatch bug in vLLM nightly and patched it in the command line below; otherwise vLLM won't start.
docker run -it --rm \
--name vllm-dflash \
--gpus all \
--ipc=host \
-p 8000:8000 \
-v ~/.cache/huggingface:/root/.cache/huggingface \
--entrypoint /bin/bash \
vllm/vllm-openai:nightly \
-c '
python3 -c "
import re, pathlib
f = pathlib.Path(\"/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/qwen3_dflash.py\")
src = f.read_text()
old = \"result = self.model.fc(hidden_states)\"
new = \"result = self.model.fc(hidden_states.to(self.model.fc.weight.dtype))\"
if old in src:
    f.write_text(src.replace(old, new))
    print(\"Patched qwen3_dflash.py\")
else:
    print(\"Pattern not found - check the file manually\")
"
exec python3 -m vllm.entrypoints.openai.api_server \
--model QuantTrio/Qwen3.5-27B-AWQ \
--tokenizer Qwen/Qwen3.5-27B \
--served-model-name qwen3.5-27b \
--port 8000 \
--gpu-memory-utilization 0.92 \
--max-model-len 2048 \
--max-num-batched-tokens 8192 \
--max-num-seqs 64 \
--trust-remote-code \
--enforce-eager \
--dtype float16 \
--speculative-config '"'"'{"method": "dflash", "model": "z-lab/Qwen3.5-27B-DFlash", "num_speculative_tokens": 15}'"'"'
'
====================================
^^ don't forget the closing ' on the last line.
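Once the container is up, a minimal sanity check against the OpenAI-compatible endpoint could look like this (the model name and port match the `--served-model-name` and `-p` flags above; the prompt is just an example):

```python
import json

# Build a chat-completion request for the vLLM server started above.
# "qwen3.5-27b" matches --served-model-name; port 8000 matches -p 8000:8000.
def build_chat_request(prompt: str) -> tuple[str, bytes]:
    url = "http://localhost:8000/v1/chat/completions"
    body = {
        "model": "qwen3.5-27b",
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 64,
    }
    return url, json.dumps(body).encode()

url, payload = build_chat_request("Say hello in one word.")
print(url)
print(payload.decode())
# POST it with urllib.request, an OpenAI client pointed at the URL, or:
#   curl $URL -H 'Content-Type: application/json' -d "$PAYLOAD"
```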
R_Duncan@reddit
OK on small context... but what about large context? We already found that the performance advantage drops with more than one concurrent stream/user, and drops with quantization. Does it also drop at large context?
dpswt@reddit (OP)
It drops badly, which is kind of expected, in my understanding.
With DFLASH_MAX_CTX=32768, notice the throughput.
DerDave@reddit
Why is it expected? I don't really understand the connection between the performance of speculative decoding and context length...
dpswt@reddit (OP)
Actually, the answer might be even "simpler":
From https://huggingface.co/z-lab/Qwen3.5-27B-DFlash
There's no mention of that in the 35B-A3B-DFlash README, but I think it's similar.
So the draft model has never seen such long contexts during training, and I'd expect the draft-token acceptance rate to drop accordingly.
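A back-of-the-envelope model of why acceptance rate matters so much: with k drafted tokens per step and a per-token acceptance probability a (assumed independent here, which real drafts are not), the expected number of tokens committed per target-model verification step is the standard speculative-decoding result (1 - a^(k+1)) / (1 - a). A sketch, using the k=15 from the vLLM config above; the acceptance values are illustrative, not measured:

```python
# Expected tokens committed per verification step in speculative decoding,
# assuming each drafted token is accepted independently with probability a.
def expected_tokens(a: float, k: int) -> float:
    if a >= 1.0:
        return k + 1.0  # every draft accepted, plus the verified token
    return (1 - a ** (k + 1)) / (1 - a)

# k=15 speculative tokens, as in "num_speculative_tokens": 15 above.
for a in (0.9, 0.7, 0.5, 0.3):
    print(f"acceptance={a:.1f} -> {expected_tokens(a, 15):.2f} tokens/step")
```

Dropping the acceptance rate from ~0.9 to ~0.5 cuts the expected yield from roughly 8 tokens per step to roughly 2, which is at least qualitatively consistent with the tg TPS collapse at larger contexts.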
dpswt@reddit (OP)
That's beyond what I really know, but I think it all starts drowning in token verification and overall bandwidth pressure.
gyzerok@reddit
Yeah, the speed is amazing. The only sad news is that it's limited to very small contexts; in my project, just the initial context takes around 10k :(