DFlash is real: x2 tg on small context with oMLX
Posted by dpswt@reddit | LocalLLaMA | View on Reddit | 7 comments
Right from the oven with the latest commit:
DFLASH_MAX_CTX=8192 uv run python -m omlx.cli serve
oMLX - LLM inference, optimized for your Mac
https://github.com/jundot/omlx
Benchmark Model: Qwen3.5-35B-A3B-MLX-MXFP4-FP16
================================================================================
Single Request Results
--------------------------------------------------------------------------------
Test TTFT(ms) TPOT(ms) pp TPS tg TPS E2E(s) Throughput Peak Mem
pp1024/tg128 1471.2 6.94 696.0 tok/s 145.3 tok/s 2.352 489.8 tok/s 21.24 GB
pp4096/tg128 7213.7 6.76 567.8 tok/s 149.0 tok/s 8.073 523.3 tok/s 23.49 GB
pp8192/tg128 13674.1 14.23 599.1 tok/s 70.8 tok/s 15.481 537.4 tok/s 21.51 GB
pp16384/tg128 25626.5 17.10 639.3 tok/s 58.9 tok/s 27.798 594.0 tok/s 22.76 GB
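As a sanity check, the derived columns in the table are internally consistent: pp TPS ≈ pp tokens / TTFT, and the overall throughput ≈ (pp + tg) / E2E. A quick script to verify, with values copied from the rows above:

```python
# Verify the benchmark table's derived columns against its raw timings.
# Each row: (pp tokens, tg tokens, TTFT ms, E2E s, reported pp TPS, reported throughput)
rows = [
    (1024, 128, 1471.2, 2.352, 696.0, 489.8),
    (4096, 128, 7213.7, 8.073, 567.8, 523.3),
    (8192, 128, 13674.1, 15.481, 599.1, 537.4),
    (16384, 128, 25626.5, 27.798, 639.3, 594.0),
]

for pp, tg, ttft_ms, e2e_s, pp_tps, thr in rows:
    calc_pp_tps = pp / (ttft_ms / 1000)   # prefill speed = prompt tokens / TTFT
    calc_thr = (pp + tg) / e2e_s          # throughput over the whole request
    assert abs(calc_pp_tps - pp_tps) < 0.5, (calc_pp_tps, pp_tps)
    assert abs(calc_thr - thr) < 0.2, (calc_thr, thr)
print("table is internally consistent")
```

Note that tg TPS is the only column that benefits from DFlash: prefill is unchanged, so the E2E gain shrinks as the prompt dominates the request.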
More benchmarks here.
Puzzleheaded_Base302@reddit
Seconding this. Got it working on my RTX PRO 4500 32GB.
Token rate went from 22 tps to 60 tps for qwen3.5-27b-awq, almost a 3x improvement.
Unfortunately, 32 GB of VRAM is right on the edge for running qwen3.5-27b on vLLM; I can only do a 2048 context length.
In case anyone wants to reproduce this, here is a known-working command line.
Claude found a dtype-mismatch bug in vLLM nightly and patched it in the command line below; otherwise vLLM won't start.
docker run -it --rm \
--name vllm-dflash \
--gpus all \
--ipc=host \
-p 8000:8000 \
-v ~/.cache/huggingface:/root/.cache/huggingface \
--entrypoint /bin/bash \
vllm/vllm-openai:nightly \
-c '
python3 -c "
import re, pathlib
f = pathlib.Path(\"/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/qwen3_dflash.py\")
src = f.read_text()
old = \"result = self.model.fc(hidden_states)\"
new = \"result = self.model.fc(hidden_states.to(self.model.fc.weight.dtype))\"
if old in src:
    f.write_text(src.replace(old, new))
    print(\"Patched qwen3_dflash.py\")
else:
    print(\"Pattern not found - check the file manually\")
"
exec python3 -m vllm.entrypoints.openai.api_server \
--model QuantTrio/Qwen3.5-27B-AWQ \
--tokenizer Qwen/Qwen3.5-27B \
--served-model-name qwen3.5-27b \
--port 8000 \
--gpu-memory-utilization 0.92 \
--max-model-len 2048 \
--max-num-batched-tokens 8192 \
--max-num-seqs 64 \
--trust-remote-code \
--enforce-eager \
--dtype float16 \
--speculative-config '"'"'{"method": "dflash", "model": "z-lab/Qwen3.5-27B-DFlash", "num_speculative_tokens": 15}'"'"'
'
====================================
^^ don't forget the closing ' on the last line.
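Once the container is up, a minimal sanity check against the OpenAI-compatible endpoint could look like this (the model name and port match the `--served-model-name` and `-p` flags above; the prompt is just an example):

```python
import json

# Build a chat-completion request for the vLLM server started above.
# "qwen3.5-27b" matches --served-model-name; port 8000 matches -p 8000:8000.
def build_chat_request(prompt: str) -> tuple[str, bytes]:
    url = "http://localhost:8000/v1/chat/completions"
    body = {
        "model": "qwen3.5-27b",
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 64,
    }
    return url, json.dumps(body).encode()

url, payload = build_chat_request("Say hello in one word.")
print(url)
print(payload.decode())
# POST it with urllib.request, an OpenAI client pointed at the URL, or:
#   curl $URL -H 'Content-Type: application/json' -d "$PAYLOAD"
```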
R_Duncan@reddit
OK on small context... but what about large context? We already found that the performance advantage drops with more than one concurrent stream/user, and drops with quantization. Does it also drop at large context?
dpswt@reddit (OP)
It drops badly, which is kind of expected, in my understanding.
With DFLASH_MAX_CTX=32768, notice the throughput.
DerDave@reddit
Why is it expected? I don't really understand the connection between the performance of speculative decoding and context length...
dpswt@reddit (OP)
Actually, the answer might be even "simpler":
From https://huggingface.co/z-lab/Qwen3.5-27B-DFlash
There's no mention of that in the 35B-A3B-DFlash README, but I think it's similar.
So the draft model has never seen such long contexts during training, and I'd expect the draft-token acceptance rate to drop accordingly.
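A back-of-the-envelope model of why acceptance rate matters so much: with k drafted tokens per step and a per-token acceptance probability a (assumed independent here, which real drafts are not), the expected number of tokens committed per target-model verification step is the standard speculative-decoding result (1 - a^(k+1)) / (1 - a). A sketch, using the k=15 from the vLLM config above; the acceptance values are illustrative, not measured:

```python
# Expected tokens committed per verification step in speculative decoding,
# assuming each drafted token is accepted independently with probability a.
def expected_tokens(a: float, k: int) -> float:
    if a >= 1.0:
        return k + 1.0  # every draft accepted, plus the verified token
    return (1 - a ** (k + 1)) / (1 - a)

# k=15 speculative tokens, as in "num_speculative_tokens": 15 above.
for a in (0.9, 0.7, 0.5, 0.3):
    print(f"acceptance={a:.1f} -> {expected_tokens(a, 15):.2f} tokens/step")
```

Dropping the acceptance rate from ~0.9 to ~0.5 cuts the expected yield from roughly 8 tokens per step to roughly 2, which is at least qualitatively consistent with the tg TPS collapse at larger contexts.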
dpswt@reddit (OP)
That's beyond what I really know, but I think it all starts drowning in token verification and overall bandwidth pressure.
gyzerok@reddit
Yeah, the speed is amazing. The only sad news is that it's limited to very small contexts; in my project, just the initial context takes around 10k :(