Nemotron-Labs-Diffusion from NVIDIA

Posted by jacek2023@reddit | LocalLLaMA

https://huggingface.co/nvidia/Nemotron-Labs-Diffusion-VLM-8B

https://huggingface.co/nvidia/Nemotron-Labs-Diffusion-14B-Base

https://huggingface.co/nvidia/Nemotron-Labs-Diffusion-14B

https://huggingface.co/nvidia/Nemotron-Labs-Diffusion-8B-Base

https://huggingface.co/nvidia/Nemotron-Labs-Diffusion-8B

https://huggingface.co/nvidia/Nemotron-Labs-Diffusion-3B-Base

https://huggingface.co/nvidia/Nemotron-Labs-Diffusion-3B

Model Overview

Nemotron-Labs-Diffusion is a tri-mode language model. It supports both autoregressive (AR) decoding and diffusion-based parallel decoding, switching between them at inference time by changing only the attention pattern of the same model. Combining the two modes enables a third, called self-speculation: the same model drafts tokens in parallel with diffusion and verifies them with AR decoding over a shared KV cache, achieving high acceptance lengths and decoding efficiency. Because switching modes only requires changing the attention pattern, a single model can run efficiently at different concurrency levels across varying deployment scenarios.
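
To make the idea of "switching the attention pattern" and "draft, then verify" more concrete, here is a minimal toy sketch in plain Python. It is not NVIDIA's implementation: the block-wise mask layout, the greedy acceptance rule, and the names `causal_mask`, `block_mask`, and `accept_prefix` are all assumptions chosen for illustration, since the overview above does not spell out these details.

```python
from typing import List, Sequence


def causal_mask(n: int) -> List[List[bool]]:
    """AR mode: position i may attend only to positions j <= i."""
    return [[j <= i for j in range(n)] for i in range(n)]


def block_mask(n: int, block: int) -> List[List[bool]]:
    """Parallel-decoding mode (ASSUMED layout, not confirmed by the post):
    full attention inside a block, causal attention across blocks."""
    return [[(j // block) <= (i // block) for j in range(n)] for i in range(n)]


def accept_prefix(draft: Sequence[int], verified: Sequence[int]) -> List[int]:
    """Greedy verification (ASSUMED rule): keep draft tokens while they match
    the verifier's tokens; at the first mismatch, keep the verifier's token
    for that position and stop."""
    accepted: List[int] = []
    for d, v in zip(draft, verified):
        accepted.append(v)  # equals d when the tokens match
        if d != v:
            break
    return accepted


if __name__ == "__main__":
    # Same sequence length, two different attention patterns.
    for row in causal_mask(4):
        print(["x" if allowed else "." for allowed in row])
    for row in block_mask(4, block=2):
        print(["x" if allowed else "." for allowed in row])

    # Hypothetical token ids: the draft proposes 4 tokens, the verifier
    # disagrees at the third position.
    print(accept_prefix([5, 6, 7, 8], [5, 6, 9, 8]))
```

In this toy setup, the number of tokens returned by `accept_prefix` per verification step plays the role of the acceptance length mentioned above: the longer the matching prefix, the fewer sequential AR steps are needed.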

Highlights