Nemotron-Labs-Diffusion from NVIDIA
Posted by jacek2023@reddit | LocalLLaMA | 25 comments
https://huggingface.co/nvidia/Nemotron-Labs-Diffusion-VLM-8B
https://huggingface.co/nvidia/Nemotron-Labs-Diffusion-14B-Base
https://huggingface.co/nvidia/Nemotron-Labs-Diffusion-14B
https://huggingface.co/nvidia/Nemotron-Labs-Diffusion-8B-Base
https://huggingface.co/nvidia/Nemotron-Labs-Diffusion-8B
https://huggingface.co/nvidia/Nemotron-Labs-Diffusion-3B-Base
https://huggingface.co/nvidia/Nemotron-Labs-Diffusion-3B
Model Overview
Nemotron-Labs-Diffusion is a tri-mode language model that supports both AR decoding and diffusion-based parallel decoding by simply switching the attention pattern of the same model during inference. The synergy between these two modes enables a third mode, called self-speculation: the same model performs diffusion-based parallel drafting and AR verification with shared KV cache, achieving high acceptance lengths and decoding efficiency. The seamless mode switching by simply changing attention patterns enables high efficiency at different concurrency levels in varying deployment scenarios with one single model.
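The draft-then-verify loop described above can be sketched with a toy example. This is only an illustration of greedy speculative decoding in general, not NVIDIA's implementation: `ar_next` and `draft_block` are hypothetical stand-ins for the model's AR and diffusion modes, the drafter is made wrong on purpose at every fourth position, and the shared KV cache is left out entirely.

```python
from typing import List, Tuple

VOCAB = 50  # toy vocabulary size (assumption for the sketch)


def ar_next(prefix: List[int]) -> int:
    """Hypothetical stand-in for one greedy AR step of the model."""
    return (sum(prefix) * 31 + len(prefix) * 7 + 3) % VOCAB


def ar_generate(prompt: List[int], n_new: int) -> List[int]:
    """Plain AR decoding: one token per step."""
    out = list(prompt)
    for _ in range(n_new):
        out.append(ar_next(out))
    return out[len(prompt):]


def draft_block(prefix: List[int], k: int) -> List[int]:
    """Hypothetical stand-in for parallel drafting of k tokens.

    It mostly agrees with ar_next but guesses wrong whenever the
    current context length is 3 modulo 4, so some drafts get rejected.
    """
    ctx = list(prefix)
    block = []
    for _ in range(k):
        tok = ar_next(ctx)
        if len(ctx) % 4 == 3:
            tok = (tok + 1) % VOCAB  # deliberately wrong guess
        block.append(tok)
        ctx.append(tok)
    return block


def self_speculate(prompt: List[int], n_new: int, k: int = 8) -> Tuple[List[int], int]:
    """Draft k tokens, keep the prefix the AR verifier agrees with.

    Returns the generated tokens and the number of draft+verify rounds.
    """
    out = list(prompt)
    target = len(prompt) + n_new
    rounds = 0
    while len(out) < target:
        rounds += 1
        for tok in draft_block(out, k):
            expected = ar_next(out)
            out.append(expected)      # accepted draft, or the AR correction
            if tok != expected:
                break                 # first mismatch ends this round
        else:
            out.append(ar_next(out))  # whole block accepted: bonus AR token
    return out[len(prompt):target], rounds


if __name__ == "__main__":
    tokens, rounds = self_speculate([1, 2, 3], 20)
    print(tokens == ar_generate([1, 2, 3], 20), rounds)
```

Because every committed token is what the AR verifier would have produced, the output matches plain AR decoding while needing fewer rounds than tokens. In a real system the verification of a whole block happens in one forward pass rather than one call per token as in this toy.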
Highlights
- SOTA 3B, 8B, and 14B dense LM family (base, instruct, and vision-language variants) supporting AR, diffusion, and self-speculation, with a focus on decoding efficiency.
- Generation moved from a memory-bound regime toward a compute-bound regime. Model weights are loaded once and reused to compute multiple tokens during generation.
- Self-speculation uses diffusion for drafting and AR for verification, providing a stronger alternative to MTP approaches:
  - 3x higher acceptance length and 2.2x speed-up vs. Qwen3-8B-Eagle3 in SGLang.
  - 5.9x tokens per forward over Qwen3-8B (no MTP) with the same accuracy.
- Real-device speed-up across platforms:
  - DGX Spark (8B, concurrency 1): 2.7x faster with 112 tok/sec vs. 41.8 tok/sec AR using w4a16.
  - GB200 (8B, concurrency 1): 3.3x faster with 850 tok/sec vs. 253 tok/sec AR and 360 tok/sec Eagle3. Custom CUDA kernels boost to 1015 tok/sec (4x).
- Diffusion speedup-of-light analysis shows that throughput can be further doubled (vs. current best) for a single user with better sampling - future research.
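The speed-up factors quoted in the list above can be recomputed from the tok/sec figures. A minimal sketch; the dictionary keys are just labels chosen here, and the throughput values are copied from the list.

```python
# Throughput figures (tok/sec) as quoted in the highlights, 8B at concurrency 1.
measurements = {
    "DGX Spark (w4a16)": {"self_spec": 112.0, "ar": 41.8},
    "GB200": {"self_spec": 850.0, "ar": 253.0, "eagle3": 360.0, "custom_kernels": 1015.0},
}


def speedup(new_tps: float, baseline_tps: float) -> float:
    """Ratio of two throughputs."""
    return new_tps / baseline_tps


if __name__ == "__main__":
    spark = measurements["DGX Spark (w4a16)"]
    gb200 = measurements["GB200"]
    print(f"Spark self-spec vs AR:     {speedup(spark['self_spec'], spark['ar']):.2f}x")
    print(f"GB200 self-spec vs AR:     {speedup(gb200['self_spec'], gb200['ar']):.2f}x")
    print(f"GB200 self-spec vs Eagle3: {speedup(gb200['self_spec'], gb200['eagle3']):.2f}x")
    print(f"GB200 custom kernels vs AR: {speedup(gb200['custom_kernels'], gb200['ar']):.2f}x")
```

The ratios come out at roughly 2.68x, 3.36x, 2.36x, and 4.01x, which lines up with the quoted 2.7x, 3.3x, and 4x once rounding is allowed for.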
Borkato@reddit
Can we all agree to never ever compare any model to Qwen 3 ever again?
Finanzamt_Endgegner@reddit
It's based on Qwen3, just like Orthrus lol
West_Ad1573@reddit
No, it is based on Ministral3 actually, same tokenizer as main Nemotron models.
Finanzamt_Endgegner@reddit
Interesting, then I genuinely wonder why they compared it to Qwen3 lol
West_Ad1573@reddit
It still makes sense when comparing to dense models with normal attention. I guess this model is in the top 3 downloads on HF for <8B.
HumanDrone8721@reddit
But then how would they show improvement?
Borkato@reddit
Compare to Qwen 3.6, is my point
harrro@reddit
There is no 8B/9B for Qwen 3.6
FerLuisxd@reddit
3.6 is 35B and 27B; none of these models are close to those.
Double_Cause4609@reddit
Why?
This is about architecture / model serving, not about performance quality.
The same thing applies to Qwen 3.6, it would just be more expensive to prototype and prove out; it doesn't really matter what model you do it on. They could have done it on Llama 3.2 3B and it would mean the same thing for the industry. This is just research, not the final model artifact.
What benefit would it serve to get the same result on Qwen 3.6?
Finanzamt_Endgegner@reddit
The same thing can be trained for 3.6; it's just easier for 3 since that one doesn't use gated DeltaNet 😅
AppealSame4367@reddit
The point here is the extreme speed-up. You can have that in llama.cpp in 1-2 months.
coder543@reddit
These models are derived from Qwen3. Why wouldn't they compare them to Qwen3?
The key question is whether Nvidia will apply this research to their own future Nemotron models or not.
Finanzamt_Endgegner@reddit
Orthrus wants to do something similar, basically the spec decoding part (which imo is the most interesting part), with Qwen3.5/3.6 and release their training code, so we can get this for any local model relatively soon (;
West_Ad1573@reddit
Author here, happy to answer questions!
noddy432@reddit
Interested in the "LoRA-enhanced Drafter". I haven't noticed this before, only in image generation. Could you expand a little on this and perhaps link an example? Thanks for your work. 🙂
oxygen_addiction@reddit
This sounds a lot like Orthrus, which is good because it validates that approach.
West_Ad1573@reddit
I think many similar works came after Set Block Decoding, including our prior work on TiDAR: https://tidarlm.github.io/ An important aspect of Nemotron-Labs-Diffusion is that we validated it on real inference in SGLang. Many papers just report acceptance length, but would require 2 forward passes. Because of joint AR and dLLM training, acceptance length is above 10 on benchmark datasets, but the real speed-up will be 4x.
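One way to see why acceptance length and wall-clock speed-up differ is a back-of-the-envelope relation: speed-up equals tokens committed per step divided by how much one step costs relative to a plain AR step. The sketch below assumes "acceptance length" means the average number of tokens committed per draft-and-verify step and uses the round figures 10 and 4x from the comment; the function names are made up here.

```python
def wall_clock_speedup(acceptance_length: float, step_cost: float) -> float:
    """Speed-up over plain AR decoding.

    step_cost is the time of one draft+verify step measured in units
    of one ordinary AR decoding step.
    """
    return acceptance_length / step_cost


def implied_step_cost(acceptance_length: float, speedup: float) -> float:
    """Step cost (in AR-step units) implied by a measured speed-up."""
    return acceptance_length / speedup


if __name__ == "__main__":
    # Round numbers from the comment: acceptance length ~10, speed-up ~4x.
    print(implied_step_cost(10.0, 4.0))   # cost of one step, in AR steps
    # A scheme needing two full forward passes per step, at the same acceptance length:
    print(wall_clock_speedup(10.0, 2.0))
```

Under these assumptions, an acceptance length of 10 with a 4x speed-up implies each draft-and-verify step costs about 2.5 AR steps, which is why acceptance length alone overstates the gain.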
Franck_Dernoncourt@reddit
(Pointer to Orthrus in case anyone is curious: https://github.com/chiennv2000/orthrus)
laul_pogan@reddit
The 112 tok/sec number on Spark makes sense mechanically. LPDDR5X tops out around 273 GB/s, which is what caps AR decode on unified memory boxes; you're loading weights once per token and bandwidth runs out fast. The self-speculation trick works here because it reuses the loaded weights across multiple draft steps, shifting from memory-bound toward compute-bound. That regime change is worth more on unified memory (where bandwidth is the hard ceiling) than on HBM-backed discrete cards where the ceiling sits higher. Curious whether the w4a16 quantization they used is also doing most of the heavy lifting on weight loading, or if the architecture change alone at bf16 shows similar gains.
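The bandwidth argument above can be put into rough numbers. This sketch assumes the model has exactly 8e9 parameters (the "8B" label is nominal), that w4a16 means 0.5 bytes per weight and bf16 means 2 bytes, and it ignores KV-cache reads, activations, and quantization scales, so the ceilings are optimistic.

```python
def ar_decode_ceiling(n_params: float, bytes_per_param: float, bandwidth_bytes_per_s: float) -> float:
    """Upper bound on AR tokens/sec when every token requires reading all weights once."""
    weight_bytes = n_params * bytes_per_param
    return bandwidth_bytes_per_s / weight_bytes


BANDWIDTH = 273e9   # bytes/sec, the LPDDR5X figure quoted in the comment
N_PARAMS = 8e9      # assumption: nominal parameter count for the 8B model

w4_ceiling = ar_decode_ceiling(N_PARAMS, 0.5, BANDWIDTH)    # 4-bit weights
bf16_ceiling = ar_decode_ceiling(N_PARAMS, 2.0, BANDWIDTH)  # 16-bit weights

if __name__ == "__main__":
    print(f"w4a16 AR ceiling: {w4_ceiling:.2f} tok/s")
    print(f"bf16 AR ceiling:  {bf16_ceiling:.2f} tok/s")
    # Reported Spark numbers: 41.8 tok/s AR and 112 tok/s self-speculation.
    print(f"tokens per weight read needed for 112 tok/s: {112.0 / w4_ceiling:.2f}")
```

With these assumptions the one-token-per-weight-read ceiling is about 68 tok/s at w4a16 and about 17 tok/s at bf16. The reported 41.8 tok/s AR sits below the w4a16 ceiling, while 112 tok/s is above it, which is only possible if more than one token is produced per pass over the weights.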
Silver-Champion-4846@reddit
How's the licence?
FerLuisxd@reddit
Nvidia license /s
Silver-Champion-4846@reddit
Hell yeah /s
Finanzamt_Endgegner@reddit
Okay, it's basically just Orthrus lol
Finanzamt_Endgegner@reddit
But this shows that the methodology itself is good!