Can a 5090 with Qwen3.6 achieve >3,000 tok/s? Bring your pitchforks (Open-dLLM)
Posted by Revolutionary_Ask154@reddit | LocalLLaMA | View on Reddit | 8 comments
So, background - these people: Fred Zhangzhi Peng, Shuibai Zhang, Alex Tong, worked on converting AR -> diffusion (it's already working with older models).
I forked the codebase and ran it through opencode with free deepseek-flash / GLM5.1 overnight to add Qwen3.6 support, since the codebase is > 6 months old. I also got the AI to mash LDLM, a very recent paper, into the mix: https://arxiv.org/pdf/2605.07933v1 by Viacheslav Meshchaninov, Alexander Shabalin, Egor Chimbulatov, Nikita Gushchin, Ilya Koziev, Alexander Korotin, Dmitry Vetrov - these guys spent 3 years getting that paper working.
https://x.com/Viacheslav91112/status/2054613430082957443?s=20
I asked it to build a config for the Qwen3.6 model, upgrade it with LDLM, and spitball some numbers on outputs under "honest" assumptions - the big one being sequence length, since throughput will likely fall off at longer outputs.
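To make those assumptions concrete, here's roughly the kind of config I'm talking about. The field names are hypothetical (not the actual Open-dLLM config schema), but the values mirror the benchmark table below:

```python
# Hypothetical config sketch for the Qwen3.6-35B-A3B + LDLM run.
# Field names are illustrative only; values match the benchmark table below.
ldlm_qwen36_config = {
    "base_model": "Qwen3.6-35B-A3B",   # frozen AR backbone (MoE, hidden dim 2048)
    "latent_dim": 2048,                # matches the backbone hidden size
    "trainable_params": "~1.39B",      # Perceiver decoder + diffusion head only
    "diffusion_steps": 10,             # 4 steps roughly doubles throughput
    "seq_len": 64,                     # short benchmark length - see caveats
    "batch_size": 1,
    "dtype": "bf16",
    "trust_remote_code": True,         # Qwen3.6 uses custom architecture code
}
```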
Inference Throughput (Qwen3.6 LDLM, untrained, RTX 5090 32GB)
| Model | Hidden Dim | Trainable Params | Diffusion Steps | Throughput |
|---|---|---|---|---|
| Qwen3.6-35B-A3B | 2048 | 1.39B | 10 | 3,238 tok/s |
| Qwen3.6-35B-A3B | 2048 | 1.39B | 4 | ~6,500 tok/s (extrapolated) |
| Qwen3.6-27B | 5120 | 6.75B | 10 | 745 tok/s |
| Qwen3.6-27B | 5120 | 6.75B | 4 | ~1,500 tok/s (extrapolated) |
Assumptions & Caveats
- Untrained weights: These benchmarks use randomly initialized Perceiver/decoder/diffusion-head weights. A trained model will have identical throughput but produce coherent output. Quality benchmarks (perplexity, HumanEval) will be published after training completes.
- No encoder in the loop: The frozen Qwen3.6 encoder is not used during generation - it's only needed for training (to produce latent targets). At inference, the diffusion head denoises random noise, then the Perceiver decoder maps latents to tokens. The encoder is deleted before benchmarking (`del autoencoder.token_encoder`) - see the sketch after this list.
- Seq len = 64: The benchmark uses a short sequence length (64 tokens). Longer sequences will reduce throughput proportionally. The 4-step throughput numbers are linear extrapolations from the 10-step measurements.
- Batch size = 1: Single-sequence generation only. Throughput scales near-linearly with batch size for the 35B-A3B (dim=2048 fits easily in VRAM), less so for the 27B (dim=5120).
- CPU RAM requirement: While the encoder is not used at inference, it must fit in system RAM during training (~54GB for 27B, ~22GB for 35B-A3B in bf16). The Qwen3.6 architecture uses Triton kernels (flash-linear-attention) that cannot run on CPU, so the encoder forward pass during training requires GPU offloading - a multi-GPU setup is recommended for training.
- Qwen3.6 requires `trust_remote_code=True`: The model uses custom architecture code (`Qwen3_5ForConditionalGeneration`) that is not in standard transformers releases. Ensure your transformers version supports it (>=4.54).
- 35B-A3B is MoE: Only 3B of its 35B parameters are active per token, giving it a much smaller hidden dim (2048) than the 27B dense model (5120). This is why the LDLM trainable components are 5x smaller and 4x faster.
- Not an apples-to-apples comparison with AR models: The diffusion model generates all tokens in parallel across N diffusion steps, while AR generates one token at a time. The "tok/s" metric favors diffusion for short sequences but does not reflect output quality, which depends on training convergence.
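For anyone who wants to poke at the numbers, here's a minimal sketch of the inference path the benchmark measures. The loader, module names, and model id (`load_ldlm_autoencoder`, `diffusion_head.denoise`, `perceiver_decoder`, `"Qwen3.6-35B-A3B"`) are placeholders, not the fork's real API - treat this as pseudocode with types:

```python
import time
import torch
from transformers import AutoTokenizer

# Hypothetical loader - stands in for however the fork builds the frozen-backbone
# autoencoder (Perceiver decoder + diffusion head on top of Qwen3.6).
from open_dllm import load_ldlm_autoencoder  # placeholder import, not the repo's real API

MODEL_ID = "Qwen3.6-35B-A3B"               # placeholder model id
SEQ_LEN, LATENT_DIM, STEPS = 64, 2048, 10  # matches the 35B-A3B / 10-step row above

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)  # custom Qwen3.6 code
autoencoder = load_ldlm_autoencoder(MODEL_ID, dtype=torch.bfloat16).cuda()
del autoencoder.token_encoder              # encoder is training-only; drop it before benchmarking
torch.cuda.empty_cache()

# Start from pure noise in latent space (batch size 1, as in the table).
latents = torch.randn(1, SEQ_LEN, LATENT_DIM, device="cuda", dtype=torch.bfloat16)

torch.cuda.synchronize()
start = time.perf_counter()
with torch.inference_mode():
    for step in range(STEPS):                            # iterative denoising
        latents = autoencoder.diffusion_head.denoise(latents, step)
    token_ids = autoencoder.perceiver_decoder(latents)   # latents -> token ids
torch.cuda.synchronize()
elapsed = time.perf_counter() - start

print(f"{SEQ_LEN / elapsed:,.0f} tok/s")          # all SEQ_LEN tokens emitted in parallel
print(tokenizer.decode(token_ids[0].tolist()))    # gibberish until the heads are trained
```

The tok/s figure is just seq_len / wall_time, so dropping from 10 diffusion steps to 4 roughly doubles it (minus the fixed decoder cost) - that's where the ~6,500 and ~1,500 extrapolations come from.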
Code is here - with GitHub issues enabled
https://github.com/scrya-com/Open-dLLM
wandb training metrics
https://wandb.ai/snoozie/Qwen3.6-35B-A3B-LDLM?nw=nwusersnoozie
If anyone has spare vast.ai / Azure / Google credits, hook me up
EbbNorth7735@reddit
So you convert a transformers model into a diffusion model? Or are you training a diffusion model? Is there a continue-generation output it can use to say it's not done and needs to diffuse the next section? Just wondering about context length and how that works.
OldBlackEye@reddit
!remindme 1 month
theblizz4rd@reddit
!remindme 1 month
IslamNofl@reddit
!remindme 1 week
robertpro01@reddit
!remindme 1 month
RemindMeBot@reddit
I will be messaging you in 1 month on 2026-06-16 02:51:09 UTC to remind you of this link
Sofakingwetoddead@reddit
Thanks "John" 😉
Elkal277@reddit
cool numbers but seq len 64 and untested weights are huge asterisks. would love to see real trained benchmarks at 512+ tokens