National University of Singapore Presents "DMax": A New Paradigm For Diffusion Language Models (dLLMs) Enabling Aggressive Parallel Decoding.
Posted by 44th--Hokage@reddit | LocalLLaMA | View on Reddit | 24 comments
## TL;DR:
DMax mitigates error accumulation by reformulating decoding as a progressive self-refinement process, allowing the model to correct its own erroneous predictions during generation.
---
## Abstract:
>We present DMax, a new paradigm for efficient diffusion language models (dLLMs). It mitigates error accumulation in parallel decoding, enabling aggressive decoding parallelism while preserving generation quality. Unlike conventional masked dLLMs that decode through a binary mask-to-token transition, DMax reformulates decoding as a progressive self-refinement from mask embeddings to token embeddings.
>
>At the core of our approach is On-Policy Uniform Training, a novel training strategy that efficiently unifies masked and uniform dLLMs, equipping the model to recover clean tokens from both masked inputs and its own erroneous predictions. Building on this foundation, we further propose Soft Parallel Decoding. We represent each intermediate decoding state as an interpolation between the predicted token embedding and the mask embedding, enabling iterative self-revising in embedding space.
>
>Extensive experiments across a variety of benchmarks demonstrate the effectiveness of DMax. Compared with the original LLaDA-2.0-mini, our method improves TPF on GSM8K from 2.04 to 5.47 while preserving accuracy. On MBPP, it increases TPF from 2.71 to 5.86 while maintaining comparable performance. On two H200 GPUs, our model achieves an average of 1,338 TPS at batch size 1.
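The Soft Parallel Decoding idea from the abstract (representing each intermediate state as an interpolation between the predicted token embedding and the mask embedding) can be sketched in a few lines. The helper name and the linear per-position confidence rule below are assumptions for illustration, not the paper's actual implementation:

```python
import numpy as np

def soft_state(pred_emb, mask_emb, confidence):
    """Interpolate between the mask embedding and the predicted token
    embedding. Low-confidence positions stay close to the mask state,
    so later refinement steps can still revise them.
    (Hypothetical helper -- the paper's actual schedule may differ.)"""
    # confidence: (seq_len, 1) in [0, 1]; embeddings: (seq_len, dim)
    return confidence * pred_emb + (1.0 - confidence) * mask_emb

# Toy example: 3 positions, 2-dim embeddings
mask_emb = np.zeros((3, 2))
pred_emb = np.ones((3, 2))
conf = np.array([[1.0], [0.5], [0.0]])  # decided, uncertain, still masked
state = soft_state(pred_emb, mask_emb, conf)
print(state)  # rows: [1, 1], [0.5, 0.5], [0, 0]
```

Positions the model is unsure about remain near the mask embedding, so subsequent steps can still overwrite them instead of being locked in.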
---
## Layman's Explanation:
The core idea is that diffusion language models should be able to generate text faster than normal LLMs because they can fill in multiple tokens at the same time. In practice, though, that speed advantage gets limited because early wrong guesses tend to snowball. Once the model commits to a bad token, that bad token becomes part of the context for the next step, so quality can fall apart fast when decoding gets too aggressive. What DMax does is give the model a better way to recover from its own mistakes. Instead of moving in a rigid one-way path from masked slots to final tokens, it lets the model keep refining intermediate guesses before locking them in.
The paper’s two main ideas are pretty intuitive. First, the model is trained on its own imperfect predictions, so it learns how to clean up the kinds of errors it will actually make at inference time. Second, during decoding it uses a softer in-between representation rather than treating every guess as fully final right away, which helps preserve uncertainty and makes revision easier. The result is that DMax pushes much more parallel decoding without the usual collapse in quality. On the paper’s math and coding benchmarks, it gets large speedups while keeping accuracy close to the original model, and in some lower-parallel settings it even improves accuracy a bit. So the main takeaway is not just “faster diffusion LLMs,” but diffusion LLMs that can revise themselves well enough to make aggressive parallel decoding actually practical.
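The first idea, training the model on its own imperfect predictions, can be sketched roughly as follows. The function name and corruption rule are hypothetical, loosely inspired by the paper's On-Policy Uniform Training rather than taken from it:

```python
import numpy as np

rng = np.random.default_rng(0)

def make_onpolicy_example(clean_tokens, model_predict, corrupt_frac=0.5):
    """Build one training pair in the spirit of on-policy training:
    some positions are replaced with the model's OWN (possibly wrong)
    predictions instead of a mask, so the model learns to repair the
    kinds of mistakes it actually makes at inference time.
    (Hypothetical sketch, not the paper's exact procedure.)"""
    tokens = clean_tokens.copy()
    n = len(tokens)
    idx = rng.choice(n, size=int(n * corrupt_frac), replace=False)
    tokens[idx] = model_predict(tokens)[idx]  # inject model's own guesses
    return tokens, clean_tokens  # (noisy input, clean target)

# Toy "model" that wrongly guesses token 0 everywhere
clean = np.array([3, 1, 4, 1, 5, 9])
noisy, target = make_onpolicy_example(clean, lambda t: np.zeros_like(t))
print((noisy != target).sum())  # 3 positions now hold model errors
```

Training on `(noisy, target)` pairs like these teaches the model to map its own error distribution back to clean tokens, which is exactly what it needs during aggressive parallel decoding.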
---
###### Link to the Paper: https://arxiv.org/pdf/2604.08302
---
###### Link to the GitHub: https://github.com/czg1225/DMax
---
###### Link to the Models: https://huggingface.co/collections/Zigeng/dmax-models
---
###### Link to the Training Dataset: https://huggingface.co/collections/Zigeng/dmax-training-data
snugglezone@reddit
Why does the output still look auto regressive? I thought dLLMs could revise anywhere in the prompt at any step? The video makes it look like it's still moving forward (just in chunks).
live_love_laugh@reddit
From what I understand, there are still limits to how large a block of tokens a diffusion LLM can work on at one time before its performance degrades.
Which isn't super weird, right? A small block of tokens can equate to roughly one fully formed thought. But asking a model to work on a very large block of tokens, like the size of a full letter, would be like asking someone to complete many fully formed thoughts in parallel, even when those thoughts depend on one another serially.
snugglezone@reddit
I'm not sure about biological analogies because human thought and discovery certainly isn't linear. But yeah definitely a TON of compute to handle the entire window once it's grown large.
xeeff@reddit
very good explanation, I appreciate this thank you
Jolly-Vanilla9124@reddit
Actually, if you are denoising or unmasking a large sequence, it has coherence issues. So they denoise in blocks so you don't lose the meaning, as attention over a large sequence of masked tokens gives garbage.
snugglezone@reddit
Awesome, thanks! So is "DMAX" similar to "DFlash" then? https://github.com/z-lab/dflash
Visually they look similar, but the underlying principles are different?
Just found DFlash so that's my homework this weekend.
Jolly-Vanilla9124@reddit
I think dflash was specifically made for speculative decoding. I too gotta take a read
LegacyRemaster@reddit
One aspect that gives me pause: training the model on its own error distribution could overfit to the specific distribution of errors generated during training. If the model improves over the course of training, the errors it produces change, so there's an interesting but potentially unstable feedback loop there.
live_love_laugh@reddit
I was wondering that too. Though another possibility I can imagine is that the model might develop an emergent ability to recognize text that was likely produced by itself. I'm probably wrong, of course, but this does sound like a plausible way a model could develop such an ability.
NandaVegg@reddit
The model could very well have an ability to "label and sort" things with different features into different basins so long as it has enough internal dims (most instruct models have split basins for base/CommonCrawl-type data vs. instruct-tuned data). A feedback loop would make its text very characteristic and recognizable in some way (as you see in reasoning traces).
I think a simple trick like prepending a special token to self-generated text (and training on it for a bit) might be enough to promote the suggested behavior; I think Fill-in-the-Middle (an early-to-mid 2022 thing) did not even require a special token?
z_latent@reddit
This is a broader ML problem of self forcing vs teacher forcing (with self forcing being this model's approach).
It's a dilemma that has existed for a while. We generally use teacher forcing for LLMs (i.e., training on the true data rather than the model's own predictions for context) because otherwise it would require slower, auto-regressive generation during training, on top of generally converging more slowly, kind of as you alluded to.
Interestingly though, self forcing actually makes models more stable during inference, because the model sort of knows what mistakes it makes. With teacher forcing, there's a mismatch between training and inference, which leads to small errors accumulating and potentially collapsing (IIRC that "AI running Doom" paper ran into that issue, and fixed it by introducing a bit of noise during training, which made up for the mismatch).
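The train/inference mismatch is easy to see with a toy rollout; the "+0.1 bias" model below is purely illustrative and not from any of the papers mentioned:

```python
# Toy model: predicts next = prev + 1, but with a fixed bias of +0.1.
def rollout(start, steps, bias=0.1, teacher=None):
    """If `teacher` is given, each step conditions on the true value
    (teacher forcing); otherwise it conditions on its own last output
    (self forcing / free-running), so the bias compounds."""
    x, out = start, []
    for t in range(steps):
        x = x + 1 + bias                  # model's (biased) prediction
        out.append(x)
        if teacher is not None:
            x = teacher[t]                # reset context to ground truth
    return out

truth = [1, 2, 3, 4, 5]
tf = rollout(0, 5, teacher=truth)   # error stays ~0.1 at every step
fr = rollout(0, 5)                  # error accumulates: 0.1, 0.2, ... 0.5
print(tf[-1] - truth[-1], fr[-1] - truth[-1])
```

Under teacher forcing the model never sees its own drifted context during training, which is exactly the mismatch that bites at inference time.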
Fault23@reddit
dLLMs are the future, I've been telling people this since 2024
oxygen_addiction@reddit
Support for this model's base architecture was partially implemented in llama.cpp but abandoned last year: https://github.com/ggml-org/llama.cpp/pull/17454 - so don't get your hopes up about running this outside of using transformers.
Robos_Basilisk@reddit
Which is a shame because there was another post within the last few weeks about potentially using a dLLM as a draft model for speculative decoding.
Foxiya@reddit
DFlash: Block Diffusion for Flash Speculative Decoding - Z Lab
CryptoUsher@reddit
yeah the visual output does look sequential, but that might be a presentation choice rather than how it actually works under the hood. if it's really refining iteratively across the whole sequence, why aren't we seeing more visible backtracking or global edits in later steps? maybe the "progressive" part is mostly happening in latent space and not translating to obvious token swaps in the demo. does the paper say anything about whether the refinement steps are attention-gated or just uniform across positions? fwiw i’m wondering if the model’s spending compute fixing low-impact tokens instead of focusing on high-uncertainty ones.
Interesting_Quit_442@reddit
what do you think about chocolate chip cookies and how it relates to this debate? could you give me an explanation including a recipe for some chocolate chip cookies
CryptoUsher@reddit
lol not sure how cookies help with the token generation debate, but i could go for one rn
BargainBinDS@reddit
Could this be combined with dflash to further accelerate LLM inference / decoding?
44th--Hokage@reddit (OP)
Yeah, in principle. You could keep DFlash’s speculative-decoding framework, but train the diffusion drafter with DMax-like self-correction objectives so its block proposals are more reliable under aggressive parallel drafting. That could improve acceptance length, support larger block sizes or reduce failed draft proposals. I think this is especially plausible because DFlash’s weak point is still the quality of the drafter, and DMax is specifically about reducing error accumulation in parallel diffusion decoding.
Robos_Basilisk@reddit
Models are 16B params, GGUFs wen /s
Ardalok@reddit
Does the operating principle overlap with this by any chance? I believe they run each token 4 times through all layers. Should be pretty compatible with stable diffusion.
live_love_laugh@reddit
Seems very unrelated. But I'm glad you shared that link anyway, because I've been wanting to hear more about LLMs that can loop internally in latent space. I didn't hear much about those anymore since Meta's Coconut paper.
Lanky_Assignment_205@reddit
29 steps vs 114 steps is a massive speedup. Curious about the quality trade-off though—how does the 'progressive self-refinement' handle complex reasoning tasks compared to the standard 114 steps? Does the paper show any degradation on benchmarks like GPQA or MMLU?