Can a 5090 with Qwen3.6 achieve >3,000 tok/s? Bring your pitchforks (Open-dLLM)
Posted by Revolutionary_Ask154@reddit | LocalLLaMA | View on Reddit | 8 comments
So, background - these people: Fred Zhangzhi Peng, Shuibai Zhang, Alex Tong, worked on converting AR -> diffusion (it's already working with older models).
I forked the codebase and ran it through opencode with free deepseek-flash / GLM5.1 overnight to add Qwen3.6 support, since the codebase is > 6 months old. I also got the AI to mash LDLM, a very recent paper, into the mix: https://arxiv.org/pdf/2605.07933v1 by Viacheslav Meshchaninov, Alexander Shabalin, Egor Chimbulatov, Nikita Gushchin, Ilya Koziev, Alexander Korotin, Dmitry Vetrov - these guys spent 3 years getting that paper working.
https://x.com/Viacheslav91112/status/2054613430082957443?s=20
I asked it to build a config for the Qwen3.6 model, upgrade it with LDLM, and spitball some numbers on outputs under "honest" assumptions - the big one being sequence length, since throughput will likely fall off at longer outputs.
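To make those assumptions concrete, here's roughly the kind of config I'm talking about. The field names are hypothetical (not the actual Open-dLLM config schema), but the values mirror the benchmark table below:

```python
# Hypothetical config sketch for the Qwen3.6-35B-A3B + LDLM run.
# Field names are illustrative only; values match the benchmark table below.
ldlm_qwen36_config = {
    "base_model": "Qwen3.6-35B-A3B",   # frozen AR backbone (MoE, hidden dim 2048)
    "latent_dim": 2048,                # matches the backbone hidden size
    "trainable_params": "~1.39B",      # Perceiver decoder + diffusion head only
    "diffusion_steps": 10,             # 4 steps roughly doubles throughput
    "seq_len": 64,                     # short benchmark length - see caveats
    "batch_size": 1,
    "dtype": "bf16",
    "trust_remote_code": True,         # Qwen3.6 uses custom architecture code
}
```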
Inference Throughput (Qwen3.6 LDLM, untrained, RTX 5090 32GB)
| Model | Hidden Dim | Trainable Params | Diffusion Steps | Throughput |
|---|---|---|---|---|
| Qwen3.6-35B-A3B | 2048 | 1.39B | 10 | 3,238 tok/s |
| Qwen3.6-35B-A3B | 2048 | 1.39B | 4 | ~6,500 tok/s (extrapolated) |
| Qwen3.6-27B | 5120 | 6.75B | 10 | 745 tok/s |
| Qwen3.6-27B | 5120 | 6.75B | 4 | ~1,500 tok/s (extrapolated) |
Assumptions & Caveats
- Untrained weights: These benchmarks use randomly initialized Perceiver/decoder/diffusion-head weights. A trained model will have identical throughput but produce coherent output. Quality benchmarks (perplexity, HumanEval) will be published after training completes.
- No encoder in the loop: The frozen Qwen3.6 encoder is not used during generation - it's only needed for training (to produce latent targets). At inference, the diffusion head denoises random noise, then the Perceiver decoder maps latents to tokens. The encoder is deleted before benchmarking (`del autoencoder.token_encoder`) - see the sketch after this list.
- Seq len = 64: The benchmark uses a short sequence length (64 tokens). Longer sequences will reduce throughput proportionally. The 4-step throughput numbers are linear extrapolations from the 10-step measurements.
- Batch size = 1: Single-sequence generation only. Throughput scales near-linearly with batch size for the 35B-A3B (dim=2048 fits easily in VRAM), less so for the 27B (dim=5120).
- CPU RAM requirement: While the encoder is not used at inference, it must fit in system RAM during training (~54GB for 27B, ~22GB for 35B-A3B in bf16). The Qwen3.6 architecture uses Triton kernels (flash-linear-attention) that cannot run on CPU, so the encoder forward pass during training requires GPU offloading - a multi-GPU setup is recommended for training.
- Qwen3.6 requires `trust_remote_code=True`: The model uses custom architecture code (`Qwen3_5ForConditionalGeneration`) that is not in standard transformers releases. Ensure your transformers version supports it (>=4.54).
- 35B-A3B is MoE: Only 3B of its 35B parameters are active per token, giving it a much smaller hidden dim (2048) than the 27B dense model (5120). This is why the LDLM trainable components are 5x smaller and 4x faster.
- Not an apples-to-apples comparison with AR models: The diffusion model generates all tokens in parallel across N diffusion steps, while AR generates one token at a time. The "tok/s" metric favors diffusion for short sequences but does not reflect output quality, which depends on training convergence.
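For anyone who wants to poke at the numbers, here's a minimal sketch of the inference path the benchmark measures. The loader, module names, and model id (`load_ldlm_autoencoder`, `diffusion_head.denoise`, `perceiver_decoder`, `"Qwen3.6-35B-A3B"`) are placeholders, not the fork's real API - treat this as pseudocode with types:

```python
import time
import torch
from transformers import AutoTokenizer

# Hypothetical loader - stands in for however the fork builds the frozen-backbone
# autoencoder (Perceiver decoder + diffusion head on top of Qwen3.6).
from open_dllm import load_ldlm_autoencoder  # placeholder import, not the repo's real API

MODEL_ID = "Qwen3.6-35B-A3B"               # placeholder model id
SEQ_LEN, LATENT_DIM, STEPS = 64, 2048, 10  # matches the 35B-A3B / 10-step row above

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)  # custom Qwen3.6 code
autoencoder = load_ldlm_autoencoder(MODEL_ID, dtype=torch.bfloat16).cuda()
del autoencoder.token_encoder              # encoder is training-only; drop it before benchmarking
torch.cuda.empty_cache()

# Start from pure noise in latent space (batch size 1, as in the table).
latents = torch.randn(1, SEQ_LEN, LATENT_DIM, device="cuda", dtype=torch.bfloat16)

torch.cuda.synchronize()
start = time.perf_counter()
with torch.inference_mode():
    for step in range(STEPS):                            # iterative denoising
        latents = autoencoder.diffusion_head.denoise(latents, step)
    token_ids = autoencoder.perceiver_decoder(latents)   # latents -> token ids
torch.cuda.synchronize()
elapsed = time.perf_counter() - start

print(f"{SEQ_LEN / elapsed:,.0f} tok/s")          # all SEQ_LEN tokens emitted in parallel
print(tokenizer.decode(token_ids[0].tolist()))    # gibberish until the heads are trained
```

The tok/s figure is just seq_len / wall_time, so dropping from 10 diffusion steps to 4 roughly doubles it (minus the fixed decoder cost) - that's where the ~6,500 and ~1,500 extrapolations come from.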
Code is here - with GitHub issues enabled
https://github.com/scrya-com/Open-dLLM
wandb training metrics
https://wandb.ai/snoozie/Qwen3.6-35B-A3B-LDLM?nw=nwusersnoozie
If anyone has spare vast.ai / Azure / Google credits, hook me up
EbbNorth7735@reddit
So you convert a transformers model into a diffusion model? Or are you training a diffusion model? Is there a continue-generation output it can use to say it's not done and needs to diffuse the next section? Just wondering about context length and how that works.
OldBlackEye@reddit
!remindme 1 month
theblizz4rd@reddit
!remindme 1 month
IslamNofl@reddit
!remindme 1 week
robertpro01@reddit
!remindme 1 month
RemindMeBot@reddit
I will be messaging you in 1 month on 2026-06-16 02:51:09 UTC to remind you of this link
Sofakingwetoddead@reddit
Thanks "John" 😉
Elkal277@reddit
cool numbers but seq len 64 and untested weights are huge asterisks. would love to see real trained benchmarks at 512+ tokens