Can a 5090 with Qwen3.6 achieve >3,000 tok/s? Bring your pitchforks (Open-dLLM)

Posted by Revolutionary_Ask154@reddit | LocalLLaMA | View on Reddit | 8 comments

So, background - these people (Fred Zhangzhi Peng, Shuibai Zhang, Alex Tong) worked on converting AR models -> diffusion (it's already working with older models).

https://oval-shell-31c.notion.site/Open-dLLM-Open-Diffusion-Large-Language-Model-25e03bf6136480b7a4ebe3d53be9f68a

I forked the codebase and ran it through opencode with free deepseek-flash / GLM5.1 overnight to upgrade it to support qwen3.6, since the codebase is > 6 months old. I also got the AI to mash LDLM, a very recent paper, into the mix: https://arxiv.org/pdf/2605.07933v1 by Viacheslav Meshchaninov, Alexander Shabalin, Egor Chimbulatov, Nikita Gushchin, Ilya Koziev, Alexander Korotin, and Dmitry Vetrov - these guys spent 3 years getting that paper working.

https://x.com/Viacheslav91112/status/2054613430082957443?s=20

I asked it to build a config for the qwen3.6 model, upgrade it with LDLM, and spitball some numbers on outputs under "honest" assumptions - the big one is sequence length: throughput will likely fall off at longer output lengths.

Inference Throughput (Qwen3.6 LDLM, untrained, RTX 5090 32GB)

| Model | Dim | Trainable Params | Diffusion Steps | Throughput |
|---|---|---|---|---|
| Qwen3.6-35B-A3B | 2048 | 1.39B | 10 | 3,238 tok/s |
| Qwen3.6-35B-A3B | 2048 | 1.39B | 4 | ~6,500 tok/s |
| Qwen3.6-27B | 5120 | 6.75B | 10 | 745 tok/s |
| Qwen3.6-27B | 5120 | 6.75B | 4 | ~1,500 tok/s |
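Rough intuition for why dropping diffusion steps buys throughput, but sub-linearly (10 steps -> 4 steps is only ~2x, not 2.5x): each decoded block pays a fixed per-block overhead on top of the per-step forward passes. This is a toy back-of-envelope sketch, not measured - `block_len`, `step_ms`, and `overhead_ms` are made-up illustrative numbers, not profiled values from the repo.

```python
# Toy throughput model for block-diffusion decoding.
# Assumption: a block of `block_len` tokens is denoised in `steps`
# forward passes costing `step_ms` each, plus a fixed per-block
# overhead `overhead_ms` (sampling, bookkeeping, etc.).
# All numbers below are hypothetical, for illustration only.

def est_throughput(block_len: int, steps: int, step_ms: float, overhead_ms: float) -> float:
    """Estimated tokens/sec for decoding one block."""
    block_time_s = (steps * step_ms + overhead_ms) / 1000.0
    return block_len / block_time_s

for steps in (10, 4):
    tok_s = est_throughput(block_len=32, steps=steps, step_ms=0.8, overhead_ms=2.0)
    print(f"{steps} steps -> ~{tok_s:.0f} tok/s")
```

Because `overhead_ms` doesn't shrink with `steps`, cutting steps from 10 to 4 gives less than a 2.5x speedup - same shape as the gap between the rows in the table above.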

Assumptions & caveats: numbers are from an untrained model, and throughput will likely drop at longer sequence lengths.

Code is here, with GitHub issues enabled:

https://github.com/scrya-com/Open-dLLM

wandb training metrics

https://wandb.ai/snoozie/Qwen3.6-35B-A3B-LDLM?nw=nwusersnoozie

If anyone has spare vast.ai / azure / google credits, hook me up.