Diffusion Language Models are Super Data Learners
Posted by Ashishpatel26@reddit | LocalLLaMA | 21 comments
Diffusion Language Models (DLMs) are a new way to generate text. Unlike traditional models that predict one word at a time, they refine the whole sentence in parallel through a denoising process.
Key advantages:
• Parallel generation: DLMs create entire sentences at once, making it faster.
• Error correction: They can fix earlier mistakes by iterating.
• Controllable output: Like filling in blanks in a sentence, similar to image inpainting.
Example:
Input: “The cat sat on the ___.”
Output: “The cat sat on the mat.”
DLMs generate and refine the full sentence in multiple steps to ensure it sounds right.
Applications: text generation, translation, summarization, and question answering, all done more efficiently and accurately than with traditional word-by-word models.
In short, DLMs overcome many limits of old models by thinking about the whole text at once, not just word by word.
https://jinjieni.notion.site/Diffusion-Language-Models-are-Super-Data-Learners-239d8f03a866800ab196e49928c019ac?pvs=149
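For intuition, here is a toy Python sketch of the iterative unmasking loop described above; toy_model is a hypothetical stand-in for a trained denoiser, and none of this is code from the linked paper.

```python
# Toy sketch of iterative masked-diffusion decoding (not the paper's code).
# A real DLM would replace toy_model with a trained bidirectional transformer.
import random

MASK = "[MASK]"
VOCAB = ["the", "cat", "sat", "on", "mat", "."]

def toy_model(tokens):
    """Stand-in for a trained denoiser: guess a (token, confidence) pair
    for every masked position, conditioned on the whole partial sequence."""
    return {i: (random.choice(VOCAB), random.random())
            for i, t in enumerate(tokens) if t == MASK}

def diffusion_decode(prompt_tokens, length=8, steps=4):
    # Start from the prompt followed by fully masked positions.
    tokens = list(prompt_tokens) + [MASK] * (length - len(prompt_tokens))
    for step in range(steps):
        guesses = toy_model(tokens)
        if not guesses:
            break
        # Commit only the most confident guesses this step; re-denoise the rest.
        keep = max(1, len(guesses) // (steps - step))
        best = sorted(guesses.items(), key=lambda kv: kv[1][1], reverse=True)[:keep]
        for pos, (tok, _) in best:
            tokens[pos] = tok
    return tokens

print(diffusion_decode(["The", "cat"]))
```

The output is nonsense because the "model" guesses at random, but the control flow (predict all masked slots in parallel, keep the most confident, repeat) is the part the post is describing.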
ohgoditsdoddy@reddit
I don’t think these are new. They also have drawbacks (e.g. autoregressive models are better at coherence; in image terms, think a hand with 7 fingers, or disconnected extra hands generated along with the handlebars, etc.).
Check this GIF (from this post advocating for a hybrid approach).
https://i.redd.it/e4wxcpz1a8if1.gif
LyAkolon@reddit
Also, block diffusion resembles how humans think.
Skylion007@reddit
Author of the paper here, happy to answer any questions.
Photoperiod@reddit
Have there been any advances in bd3-lms since this paper was published? Seems like these models aren't quite as accurate as straight autoregressive models. Do you see some clear next steps to improve upon this hybrid approach? Awesome work BTW!
Skylion007@reddit
Cooking up something for ICLR, stay tuned.
This also works marginally better and addresses a lot of the failings with BD3-LMs: https://arxiv.org/abs/2506.01928
roofitor@reddit
Very cool
robertotomas@reddit
Get ready to train on your 5t-30t data sets for 100 epochs instead of 2-4
Crierlon@reddit
LLMs are based in chaos theory. DLMs are competitive but not near parity yet, if ever.
Some have argued it’s an approximation of autoregression.
F4k3r22@reddit
Hey, if anyone wants to experiment and see how a Diffusion Language Model works and how to train it, I'll leave my repo and a checkpoint that I trained so you can see how it behaves :D
Repo: https://github.com/F4k3r22/LLaDA-from-scratch
Checkpoint: https://huggingface.co/Fredtt3/LLaDA-100M-Test
HauntingAd8395@reddit
This reeks of hype language.
Telling ya, the experiments were conducted by training for multiple epochs.
Modern LLMs are all trained with one epoch only because data is abundant.
Given all of that, the experiments seem to have been conducted with ill intent: why is the performance of AR at one epoch higher than the performance of the DLM at 96 epochs? It is easy to see that they trained AR with a very wrong scheduler in order to hype up the DLM.
Irisi11111@reddit
Repeating batches isn’t a big deal for diffusion models. Training runs through multiple noise timesteps in each pass, so even if you see the same data again, the model’s getting different views of it. Gradient descent doesn’t really max out all the useful directions in parameter space in one go, so training the same samples a few more times actually helps cover more ground. That’s pretty different from autoregressive models, where next-token prediction is a very direct, step-by-step objective. In that setup, repeating batches can just lead to faster overfitting without much benefit.
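A minimal sketch of that difference (my own illustration, not the paper’s training code): the same sentence yields freshly sampled corruption every epoch under masked diffusion, but identical (context, target) pairs every epoch under next-token prediction.

```python
# Sketch: one sentence, two objectives, repeated over "epochs".
import random

sentence = ["the", "cat", "sat", "on", "the", "mat"]

def next_token_examples(tokens):
    # Autoregressive objective: the (context -> next token) pairs never change,
    # so a repeated epoch shows the model exactly the same targets again.
    return [(tuple(tokens[:i]), tokens[i]) for i in range(1, len(tokens))]

def masked_diffusion_example(tokens):
    # Masked-diffusion objective: sample a fresh mask ratio and fresh positions,
    # so a repeated epoch is still a new "view" of the same data.
    ratio = random.random()
    masked = [t if random.random() > ratio else "[MASK]" for t in tokens]
    targets = {i: t for i, (t, m) in enumerate(zip(tokens, masked)) if m == "[MASK]"}
    return masked, targets

print(next_token_examples(sentence))           # identical on every epoch
for epoch in range(3):
    print(masked_diffusion_example(sentence))  # different corruption each epoch
```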
DanielKramer_@reddit
OK let's 100x the size of the internet before gpt 6 is trained
No_Efficiency_1144@reddit
It is common for papers to compare to a weak baseline, yes.
PykeAtBanquet@reddit
Imagine that I wish to do some experimenting with it: how do I actually run the code? And what should I read to be able to make it, for example, diffuse text block by block in a specific way, and to build and test out something like this?
No_Efficiency_1144@reddit
They are strong contenders for some uses.
As I said in another comment, they have two downsides:
• Worse inductive prior for autoregressive structures than LLMs. Please note that both language and code have autoregressive structures.
• No KV cache. This is a devastating one for long context.
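A back-of-envelope sketch of why that hurts (my own numbers and assumptions: counting attention cost only, and assuming the diffusion model re-encodes the full sequence at each of 64 refinement steps):

```python
# Rough attention-cost comparison per generated sequence, ignoring constants.
def ar_with_kv_cache(seq_len):
    # Each new token attends to the cached keys/values of all previous tokens.
    return sum(t for t in range(1, seq_len + 1))   # ~ seq_len^2 / 2

def diffusion_refinement(seq_len, steps):
    # Every denoising step re-runs full bidirectional attention over the sequence.
    return steps * seq_len * seq_len                # ~ steps * seq_len^2

print(ar_with_kv_cache(4096))            # ~8.4e6
print(diffusion_refinement(4096, 64))    # ~1.1e9
```

The gap grows with both context length and the number of refinement steps, which is why losing the KV cache matters most at long context.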
ColorlessCrowfeet@reddit
In a long, multi-turn conversation, Gemini Diffusion remembered the earliest context. It acts like it's a hybrid model with diffusion blocks plus a "KV cache equivalent" memory.
Thunderbird120@reddit
There's technically nothing stopping you from using autoregressive models to do bidirectional sequence modeling. You can just autoregressively model a sequence in a random order instead of left-to-right.
It requires some modifications to the attention operation but the required changes are not that big. You get a similar(?) bump to data efficiency from doing this while still allowing you to use KV caches. Training models like this improves the performance of models even when they're only generating new tokens in a left-to-right order.
The main downside is that it's still much more compute intensive to train a good model this way due to the much higher complexity of the problem being learned. Instead of learning to predict the next token, you're asking the model to learn to predict any token in the sequence given any subset of other tokens, which is very hard.
You can make this task easier by making the "random" order of the sequence traversal less random, biasing "next" tokens to be near "previous" tokens or in other ways. You retain most of the data efficiency gains even when you dramatically simplify how "random" the random order sequence traversal is.
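As a rough illustration of what that could look like (my own sketch, assuming a simple distance-based weighting for the "less random" traversal; the commenter's actual method may differ):

```python
# Any-order autoregressive modelling: pick a permuted generation order, then
# build an attention mask where each position may only attend to positions
# that come earlier in that order (so a KV cache still works per order).
import numpy as np

def sample_order(seq_len, locality_bias=0.0, rng=np.random.default_rng()):
    """locality_bias=0 gives a fully random order; larger values bias the
    'next' position to sit near already-generated ones."""
    order = [int(rng.integers(seq_len))]
    remaining = set(range(seq_len)) - set(order)
    while remaining:
        cand = list(remaining)
        # Weight candidates by distance to the closest already-generated token.
        dists = np.array([min(abs(c - g) for g in order) for c in cand], dtype=float)
        weights = np.exp(-locality_bias * dists)
        nxt = cand[rng.choice(len(cand), p=weights / weights.sum())]
        order.append(nxt)
        remaining.remove(nxt)
    return order

def order_attention_mask(order):
    """mask[i, j] = True if position i may attend to position j."""
    seq_len = len(order)
    rank = np.empty(seq_len, dtype=int)
    rank[order] = np.arange(seq_len)
    mask = rank[:, None] > rank[None, :]   # strictly earlier in the generation order
    np.fill_diagonal(mask, True)           # and to itself
    return mask

order = sample_order(6, locality_bias=1.0)
print(order)
print(order_attention_mask(order).astype(int))
```

Training then predicts each token conditioned only on the tokens before it in the sampled order; left-to-right decoding is just the special case where the order is the identity permutation.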
No_Efficiency_1144@reddit
Non-unidirectional autoregressive modelling is great yeah, they use it for images sometimes as well, and you do indeed get your KV cache back.
The inductive prior of such models is different and depends a lot on the exact implementation. I think we are generally not good at matching tasks to inductive priors; there are potentially a lot of gains to be had if we were better at matching our model architectures to our tasks.
The point I made about language and code suiting the unidirectional autoregressive prior still stands somewhat, although ultimately language and code are some kind of graph.
GNNs are in many ways the ultimate model because they can adapt to the data to a greater extent. But the downside is that ideal GNN mathematics and hardware is still being worked out.