DFlash is real: x2 tg on small context with oMLX

Posted by dpswt@reddit | LocalLLaMA

Fresh out of the oven with the latest commit:

DFLASH_MAX_CTX=8192 uv run python -m omlx.cli serve

oMLX - LLM inference, optimized for your Mac
https://github.com/jundot/omlx
Benchmark Model: Qwen3.5-35B-A3B-MLX-MXFP4-FP16
================================================================================

Single Request Results
--------------------------------------------------------------------------------
Test                TTFT(ms)    TPOT(ms)        pp TPS        tg TPS      E2E(s)    Throughput    Peak Mem
pp1024/tg128          1471.2        6.94   696.0 tok/s   145.3 tok/s       2.352   489.8 tok/s    21.24 GB
pp4096/tg128          7213.7        6.76   567.8 tok/s   149.0 tok/s       8.073   523.3 tok/s    23.49 GB
pp8192/tg128         13674.1       14.23   599.1 tok/s    70.8 tok/s      15.481   537.4 tok/s    21.51 GB
pp16384/tg128        25626.5       17.10   639.3 tok/s    58.9 tok/s      27.798   594.0 tok/s    22.76 GB
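Reading the table: tg TPS sits around 145–149 tok/s at pp1024/pp4096 but drops to ~59–71 tok/s at pp8192/pp16384, presumably because DFLASH_MAX_CTX=8192 gates the draft path beyond 8K context. The summary columns are also internally consistent under the usual definitions, which the post doesn't state, so treat them as assumptions: Throughput ≈ (pp + tg) / E2E and E2E ≈ TTFT + (tg − 1) · TPOT. A quick sanity check:

```python
# Sanity-check the benchmark summary columns against assumed metric
# definitions (these formulas are my assumption, not from the post):
#   throughput = (pp + tg) tokens / E2E seconds
#   E2E        ≈ TTFT + (tg - 1) * TPOT

rows = [
    # (pp, tg, ttft_ms, tpot_ms, e2e_s, reported_throughput_tps)
    (1024, 128, 1471.2, 6.94, 2.352, 489.8),
    (4096, 128, 7213.7, 6.76, 8.073, 523.3),
]

for pp, tg, ttft_ms, tpot_ms, e2e_s, reported in rows:
    derived_tps = (pp + tg) / e2e_s
    est_e2e_s = (ttft_ms + (tg - 1) * tpot_ms) / 1000
    print(f"pp{pp}: throughput {derived_tps:.1f} tok/s (reported {reported}), "
          f"E2E est {est_e2e_s:.3f}s (reported {e2e_s}s)")
```

Both rows reproduce the reported throughput and E2E to within rounding, which suggests those are indeed the definitions the benchmark script uses.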

More benchmarks here.