Nemotron-Labs-Diffusion from NVIDIA

Model Overview

Nemotron-Labs-Diffusion is a tri-mode language model that supports both autoregressive (AR) decoding and diffusion-based parallel decoding by switching the attention pattern of the same model at inference time. Combining the two modes enables a third, called self-speculation: the same model performs diffusion-based parallel drafting and AR verification with a shared KV cache, achieving high acceptance lengths and decoding efficiency. Because switching modes only requires changing the attention pattern, a single model can run efficiently at different concurrency levels across deployment scenarios.

Highlights

SOTA 3B, 8B, 14B dense LM family (base, instruct, and vision-language variants) supporting AR, diffusion, and self-speculation, with a focus on decode efficiency.
Generation moves from a memory-bound regime toward a compute-bound regime: model weights are loaded once and reused to compute multiple tokens during generation.
Self-speculation uses diffusion for drafting and AR for verification, providing a stronger alternative to MTP approaches:
3x higher acceptance length and 2.2x speed-up vs. Qwen3-8B-Eagle3 in SGLang.
5.9x tokens per forward pass over Qwen3-8B (no MTP) with the same accuracy.
Real-device speed-up across platforms:
DGX Spark (8B, concurrency 1): 2.7x faster with 112 tok/sec vs. 41.8 tok/sec AR using w4a16.
GB200 (8B, concurrency 1): 3.3x faster with 850 tok/sec vs. 253 tok/sec AR and 360 tok/sec Eagle3. Custom CUDA kernels boost to 1015 tok/sec (4x).
Diffusion speedup-of-light analysis shows that single-user throughput could be doubled again (vs. the current best) with better sampling; this is left to future research.
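
The draft-then-verify idea behind self-speculation can be sketched with generic greedy speculative decoding. This is a toy illustration, not NVIDIA's implementation: `draft_block` stands in for the diffusion-mode drafter, `target_next` for the AR verifier, and both toy models are hypothetical. A real system verifies all drafted positions in one batched forward pass with a shared KV cache; the sketch calls the verifier once per position for readability.

```python
from typing import Callable, List, Tuple

Token = int
DraftFn = Callable[[List[Token], int], List[Token]]
TargetFn = Callable[[List[Token]], Token]


def speculative_step(seq: List[Token], draft_block: DraftFn,
                     target_next: TargetFn, k: int) -> List[Token]:
    """One greedy draft-then-verify step; returns the tokens to append."""
    drafted = draft_block(seq, k)  # k tokens proposed in parallel
    accepted: List[Token] = []
    for tok in drafted:
        expected = target_next(seq + accepted)  # verifier's greedy choice
        if tok != expected:
            accepted.append(expected)  # replace the first mismatch and stop
            return accepted
        accepted.append(tok)
    # Every draft matched, so verification also yields one extra token.
    accepted.append(target_next(seq + accepted))
    return accepted


def generate(prefix: List[Token], n_new: int, draft_block: DraftFn,
             target_next: TargetFn, k: int) -> Tuple[List[Token], int]:
    """Decode n_new tokens; also return how many draft/verify steps ran."""
    seq = list(prefix)
    steps = 0
    while len(seq) - len(prefix) < n_new:
        seq += speculative_step(seq, draft_block, target_next, k)
        steps += 1
    return seq[:len(prefix) + n_new], steps


# Hypothetical toy models: the "AR model" counts upward, and the drafter
# agrees with it except that its 4th drafted token is always wrong.
def toy_target_next(seq: List[Token]) -> Token:
    return (seq[-1] + 1) % 50


def toy_draft_block(seq: List[Token], k: int) -> List[Token]:
    out = [(seq[-1] + i + 1) % 50 for i in range(k)]
    if k >= 4:
        out[3] = (out[3] + 7) % 50  # deliberate draft error
    return out


if __name__ == "__main__":
    tokens, steps = generate([0], 12, toy_draft_block, toy_target_next, k=8)
    print(tokens, steps)
```

With these toy models each step yields 4 tokens (3 accepted drafts plus 1 correction), so 12 tokens take 3 steps instead of 12, and the output is identical to plain greedy AR decoding.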

Borkato@reddit

Can we all agree to never ever compare any model to qwen 3 ever again

Finanzamt_Endgegner@reddit

its based on qwen3 just like orthrus lol

West_Ad1573@reddit

No, it is based on Ministral3 actually, same tokenizer as main Nemotron models.

Interesting then I genuinely wonder why they compared it to qwen3 lol

It still makes sense when comparing to dense models with normal attention. I guess this model is in the top 3 downloads on HF for <8B.

HumanDrone8721@reddit

But then how would they show improvement?

Compare to qwen 3.6 is my point

harrro@reddit

There is no 8B/9B for Qwen 3.6

FerLuisxd@reddit

3.6 is 35b and 27b, none of these models are close to those

Double_Cause4609@reddit

Why?

This is about architecture / model serving, not about performance quality.

The same thing applies to Qwen 3.6, it would just be more expensive to prototype and prove out; it doesn't really matter what model you do it on. They could have done it on Llama 3.2 3B and it would mean the same thing for the industry. This is just research, not the final model artifact.

What benefit would it serve to get the same result on Qwen 3.6?

the same thing can be trained for 3.6 its just easier for 3 since that one doesnt use gated deltanet 😅

AppealSame4367@reddit

The point here is the extreme speed-up. You can have that in llama.cpp in 1-2 months.

coder543@reddit

These models are derived from Qwen3. Why wouldn't they compare them to Qwen3?

The key question is whether Nvidia will apply this research to their own future Nemotron models or not.

Orthrus wants to do something similar, basically the spec decoding part (which imo is the most interesting part) with qwen3.5/3.6 and release their training code so we can get this for any local model relatively soon (;

Author here, happy to answer questions!

noddy432@reddit

Interested in the "LoRA-enhanced Drafter". I haven't noticed this before, only in image generation. Could you expand a little on this and perhaps link an example? Thanks for your work. 🙂

oxygen_addiction@reddit

This sounds a lot like Orthrus, which is good because it validates that approach.

I think many similar works came after Set Block Decoding, including our prior work on TiDAR: https://tidarlm.github.io/ An important aspect of Nemotron-Labs-Diffusion is that we validated it on real inference in SGLang. Many papers only show acceptance length, but their approaches would require 2 forward passes. Because of joint AR and dLLM training, acceptance length is above 10 on benchmark datasets, but the real speed-up will be 4x.
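
The gap between acceptance length and wall-clock speed-up can be illustrated with rough arithmetic. This is a simplified model, not the authors' analysis: it assumes speed-up is tokens produced per step divided by the cost of one step relative to a plain AR forward pass, and it ignores scheduling and sampling overhead. Both function names are hypothetical.

```python
def effective_speedup(tokens_per_step: float, step_cost_vs_ar: float) -> float:
    """Speed-up over AR if one step costs `step_cost_vs_ar` AR forward passes."""
    return tokens_per_step / step_cost_vs_ar


def implied_step_cost(tokens_per_step: float, measured_speedup: float) -> float:
    """Relative step cost implied by an acceptance length and a measured speed-up."""
    return tokens_per_step / measured_speedup


if __name__ == "__main__":
    # A method with acceptance length 10 that needs 2 forward passes per step
    # (one reading of the remark above) cannot exceed 5x under this model.
    print(effective_speedup(10, 2.0))
    # Acceptance length ~10 with a measured 4x implies each draft/verify step
    # costs about 2.5 plain AR forward passes under this model.
    print(implied_step_cost(10, 4.0))
```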

Franck_Dernoncourt@reddit

(Pointer to Orthrus in case anyone is curious: https://github.com/chiennv2000/orthrus)

laul_pogan@reddit

The 112 tok/sec number on Spark makes sense mechanically. LPDDR5X tops out around 273 GB/s, which is what caps AR decode on unified memory boxes; you're loading weights once per token and bandwidth runs out fast. The self-speculation trick works here because it reuses the loaded weights across multiple draft steps, shifting from memory-bound toward compute-bound. That regime change is worth more on unified memory (where bandwidth is the hard ceiling) than on HBM-backed discrete cards where the ceiling sits higher. Curious whether the w4a16 quantization they used is also doing most of the heavy lifting on weight loading, or if the architecture change alone at bf16 shows similar gains.
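
A back-of-the-envelope check of that bandwidth ceiling, as a minimal sketch. The assumptions are mine, not from the release: exactly 8e9 parameters, 0.5 bytes per weight for w4a16 with no overhead for quantization scales, and no KV-cache or activation traffic. The 273 GB/s, 41.8 tok/sec, and 112 tok/sec figures are the ones quoted in this thread.

```python
def ar_decode_ceiling_tok_s(params: float, bytes_per_param: float,
                            bandwidth_gb_s: float) -> float:
    """Upper bound on AR tokens/sec if every token reads all weights once."""
    weight_gb = params * bytes_per_param / 1e9
    return bandwidth_gb_s / weight_gb


if __name__ == "__main__":
    w4 = ar_decode_ceiling_tok_s(8e9, 0.5, 273)    # 4-bit weights
    bf16 = ar_decode_ceiling_tok_s(8e9, 2.0, 273)  # 16-bit weights
    print(f"w4a16 ceiling: {w4:.2f} tok/s")
    print(f"bf16 ceiling:  {bf16:.2f} tok/s")
    # The reported AR figure sits below the w4a16 ceiling, while the reported
    # self-speculation figure sits above it, which is only possible if each
    # weight load produces more than one token.
    print(f"AR fraction of ceiling: {41.8 / w4:.2f}")
    print(f"min tokens per weight load at 112 tok/s: {112 / w4:.2f}")
```

Under these assumptions the w4a16 ceiling is about 68 tok/sec and the bf16 ceiling about 17 tok/sec, so quantization raises the AR ceiling roughly 4x, but exceeding it at all requires the multi-token-per-load regime.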

Silver-Champion-4846@reddit

How's the licence?

Nvidia license /s

Hell yeah /s

okay its basically just orthrus lol

But this shows that the methodology itself is good!