University of Hong Kong releases Dream 7B (Diffusion reasoning model). Highest performing open-source diffusion model to date. You can adjust the number of diffusion timesteps for speed vs accuracy
Posted by jd_3d@reddit | LocalLLaMA | View on Reddit | 157 comments
PathIntelligent7082@reddit
as far as I can see, the results are on par with Qwen, so a statement like "most powerful" is inaccurate...
silenceimpaired@reddit
It’s unfortunate that they put the least compelling charts first. There are charts present in the image that make this an interesting model. It doesn’t have to be an either or. It can be both.
PathIntelligent7082@reddit
interesting? yes... but terms like "most powerful" are BS
silenceimpaired@reddit
Across the board? Agreed. Sudoku? Agree to Disagree.
jd_3d@reddit (OP)
It's fascinating watching it generate text:
https://i.redd.it/xci0dlo7hgse1.gif
NullHypothesisCicada@reddit
No wonder it’s so good at sudoku
100thousandcats@reddit
What the actual fuck…
Recoil42@reddit
Wait until you see block diffusion.
Samurai2107@reddit
It's almost how autoregressive models like 4o work, but block diffusion is not left to right or top to bottom. It shows what the Claude research figured out: there is a level in latent space where the model already knows what it is going to show us.
MINIMAN10001@reddit
Considering how they say block diffusion shows a decreasing perplexity, it feels like a hack job in order to increase parallelizability?
ClassyBukake@reddit
Even a minuscule amount of parallelism would massively increase the efficiency of multi-compute environments.
kremlinhelpdesk@reddit
Defrag diffusion.
Many_SuchCases@reddit
Never forget the struggle.
PathIntelligent7082@reddit
and then all the crap gets cleaned up, but one lil' red square remains intact
FaceDeer@reddit
I used to find that to be a strangely relaxing process to watch. Sadly, at some point defragmentation became an automatic background process of the filesystem and we no longer got to see it work.
no_witty_username@reddit
Defrag sound was the original ASMR I fell asleep to at night....
SidneyFong@reddit
Been using SSDs for so many years now that I totally forgot how we kinda knew what the computer was doing by listening to hard disk sounds...
hazed-and-dazed@reddit
click-click
Oh no!!
DaniyarQQQ@reddit
I remember the sound:
trrt...trrt...trrt...trrt...trrt...trrt...trrt...trrt...trrrrrrt.....
ConiglioPipo@reddit
I was there. I won't forget.
switchpizza@reddit
😂
ffiw@reddit
https://i.redd.it/9xtkswjpphse1.gif
WhereIsYourMind@reddit
I wouldn't put it past front-end gimmicks, but I had a ChatGPT 4.5 response that generated in a similar manner. I remember distinctly that it created blank lines and then generated entire sentence chunks at once, instead of outputting tokens one at a time.
I wonder if OpenAI is doing A/B testing using a model with similar architecture. Pure conjecture.
Shoddy_Ad_7853@reddit
That's efficient, it's what I do.
jabblack@reddit
How does it know the spacing for words it hasn’t figured out yet?
People technically write like this: where the initial words are high level ideas and outlines, then add in additional details.
Look at the words that are filled in first:
Joey and Rachel had been dating for awhile but.. …just wasn’t ready… finally they together.
It creates an overarching narrative, then fills in gaps.
tim_Andromeda@reddit
That's a gimmick right? How would it know how much space to leave for text it hasn't outputted yet.
DerfK@reddit
I'm suspicious as well, but I'm guessing what the video shows is a "dramatization" of how the final product was arrived at (maybe even an accurate dramatization of the fragments of the text in the order they actually got generated), rather than actual runtime diffusion snapshots like StableDiffusion where you can see the blurry bits come together.
Pyros-SD-Models@reddit
Why are you guys just guessing instead of just checking out their GitHub or any Hugging Face space of a diffusion LLM and literally trying it out yourself lol
UserXtheUnknown@reddit
Thanks, tried it. It was not particularly good when compared to similar -in size- sequential LLMs, though. Maybe even a bit worse.
DerfK@reddit
OK not quite the same as the video, it is still working in tokens and each token could be longer or shorter so the text isn't fixed in place with a set number of spaces to fill in like OP's video.
Stepfunction@reddit
This example is specifically an infilling example, so the space needed was specified ahead of time.
stddealer@reddit
https://i.redd.it/1qquw5mw7ise1.gif
This is not infilling and shows the same oddity.
Stepfunction@reddit
I imagine that there are probably something like 1024 placeholder tokens, which are then filled in by the diffusion process. In this case, the rest of the placeholders were likely rejected, and only the first section was used for the answer.
This is likely something you would need to specify for any model like this.
The fact that you can specify a response length is, in its own right, a very powerful feature.
Pyros-SD-Models@reddit
Yes, but the response length is like max_tokens with auto regressive llms.
Like if you set the length to 1024 and ask it to answer "What does meow in a word?" it'll answer "cat" and invalidate all the other 1023 tokens
Stepfunction@reddit
That's what I'd imagine. It's like specifying a certain pixel size output latent in an image diffusion model.
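A minimal sketch of the "fixed canvas, then throw away the rest" idea described above. The token names `<mask>` and `<eos>` and both helper functions are assumptions for illustration, not Dream's or LLaDA's actual API:

```python
MASK = "<mask>"  # placeholder token (hypothetical name)
EOS = "<eos>"    # end-of-sequence token (hypothetical name)


def make_canvas(gen_length):
    """Start from a fixed-length row of placeholders, like a max_tokens budget."""
    return [MASK] * gen_length


def finalize(canvas):
    """Keep tokens up to the first EOS; everything after it is discarded."""
    answer = []
    for tok in canvas:
        if tok == EOS:
            break
        answer.append(tok)
    return answer


canvas = make_canvas(8)
# Pretend the denoiser has finished: one real token, the rest resolved to EOS.
canvas[0] = "cat"
canvas[1:] = [EOS] * 7
print(finalize(canvas))  # ['cat']
```

So the length you pick is an upper bound on the answer, not the answer's actual length.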
MountainDry2344@reddit
the visualization here is misleading since it makes it look like the model knows exactly how much whitespace to provision - I tried it out at https://huggingface.co/spaces/multimodalart/LLaDA, and it doesn't pre-calculate the amount of whitespace, it just progressively replaces a row of wildcard tokens with text or nothing. I think technically it could just generate like a normal LLM left to right, but it's not constrained to working in that order, so it places text all over the place and fills the gap in between
stddealer@reddit
LLaDA is a different model
veggytheropoda@reddit
the "16-3-4=9" and "9*2=18" equations are generated simultaneously, and so is the result 18. How could it work out the answer before the equations are filled in? Or does the answer already exist when it reads the prompt, and all the "calculations" are just it explaining how it got the result?
Pyros-SD-Models@reddit
Yes
Anthropic's paper has interactive examples of how, for example, when writing a poem the model figures out the rhymes first and then builds the rest
Or how they do calculations.
https://transformer-circuits.pub/2025/attribution-graphs/biology.html
martinerous@reddit
Looks like you missed this: https://huggingface.co/spaces/multimodalart/LLaDA
And this: https://chat.inceptionlabs.ai/ (signup needed).
Pyros-SD-Models@reddit
https://huggingface.co/spaces/multimodalart/LLaDA works for me, and it works exactly as here https://ml-gsai.github.io/LLaDA-demo/
I don't know what's so hard to grasp: instead of just the token, the position is also part of the distribution. That's like the point of diffusion. Like the whole space gets diffused at the same time, until a token reaches a threshold and is fixed.
It's like if you recognize the eyes in a stable diffusion image first
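Here is a toy sketch of that "fix a token once it crosses a threshold" loop. `toy_predict` is a made-up stand-in for the denoiser, with hard-coded confidences in percent that simply grow each step; a real model would compute them from the whole canvas:

```python
MASK = None

TARGET = ["Hey", "that", "is", "how", "my", "brain", "works"]
BASE_CONF = [50, 20, 90, 30, 70, 95, 40]  # percent, invented for the demo


def toy_predict(canvas, step):
    """Hypothetical denoiser: {position: (token, confidence)} for masked slots only."""
    return {
        i: (TARGET[i], min(100, BASE_CONF[i] + 20 * step))
        for i, tok in enumerate(canvas)
        if tok is MASK
    }


def denoise(length, threshold=90, max_steps=10):
    """Predict every masked slot at once; lock in only the confident ones."""
    canvas = [MASK] * length
    order = []  # positions in the order they were fixed
    for step in range(max_steps):
        preds = toy_predict(canvas, step)
        if not preds:
            break  # nothing left to fill
        for pos, (tok, conf) in preds.items():
            if conf >= threshold:
                canvas[pos] = tok
                order.append(pos)
    return canvas, order


canvas, order = denoise(len(TARGET))
print(order)   # [2, 5, 4, 0, 3, 6, 1]
print(canvas)
```

The fill order is scattered across positions, which is why the visualization looks like words appearing "all over the place".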
martinerous@reddit
Now LLaDA works for me too. But it behaves a bit different - in the visualization it did not output the known ending immediately:
ninjasaid13@reddit
probably a slider for how many tokens you want to generate.
KillerX629@reddit
wasn't mercury almost the same? at least I remember it being like that. probably has a "mean space required" variable and slightly adjusts it with time maybe
florinandrei@reddit
Maybe they waited until the whole message was generated, figured out the empty spaces, then filled them in in the order the words were generated.
Determined-Hedgehog@reddit
Take my upvote!
momono75@reddit
How can we stream this? I think this way doesn't fit well for chatting until the generation process goes much faster.
Thick-Protection-458@reddit
Blockwise generation can be streamed, at very least. The question is compute efficiency of different setups.
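A rough sketch of what streaming block by block could look like. `toy_denoise_block` is a hypothetical stand-in for the per-block diffusion step; the only point is that each block can be shown as soon as it is finished:

```python
def generate_blocks(total_len, block_size, denoise_block):
    """Yield finished blocks left to right; inside a block, tokens may be
    filled in any order by the (hypothetical) denoise_block callback."""
    for start in range(0, total_len, block_size):
        size = min(block_size, total_len - start)
        yield denoise_block(start, size)


SENTENCE = "blockwise generation can be streamed to users".split()


def toy_denoise_block(start, size):
    """Stand-in for a real per-block denoiser: just returns the known words."""
    return SENTENCE[start:start + size]


for block in generate_blocks(len(SENTENCE), 3, toy_denoise_block):
    print(" ".join(block), flush=True)
```

Smaller blocks stream more smoothly but leave less room for parallel generation, which is the trade-off in question.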
momono75@reddit
Yes, technically it will be possible as we see this screenshot, but I didn't feel it was for humans...
Sad-Elk-6420@reddit
I wonder if it is easier to have it follow JSON. Could we pre write the JSON parts and it just fill in?
DerfK@reddit
This is actually what I'm hoping for, that we'll be able to ask the model to "inpaint" text in between what's already written rather than constantly append to the context.
FaceDeer@reddit
I've been doing a lot of work with LLMs generating lyrics lately and this would be really handy, often I'd like it to just try fixing a verse or a single line from a mostly done song. Or insert a new verse between existing ones. Inpainting would be very handy.
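For the JSON case, the idea would be roughly this: pre-write the structure, leave mask slots for the values, and only ever touch the masks. Everything here (`build_template`, `infill`, `toy_fill`, the `<mask>` token) is a hypothetical sketch, not an existing API:

```python
import json

MASK = "<mask>"


def build_template(keys):
    """Pre-write the JSON skeleton, leaving one masked slot per value."""
    tokens = ["{"]
    for i, key in enumerate(keys):
        if i:
            tokens.append(",")
        tokens += [f'"{key}"', ":", MASK]
    tokens.append("}")
    return tokens


def infill(tokens, fill_fn):
    """Replace only masked positions; pre-written tokens are never touched."""
    return [fill_fn(i, tokens) if t == MASK else t for i, t in enumerate(tokens)]


def toy_fill(pos, tokens):
    """Stand-in for the model: picks a value based on the key two slots back."""
    values = {'"name"': '"Dream"', '"params"': '"7B"'}
    return values[tokens[pos - 2]]


template = build_template(["name", "params"])
result = "".join(infill(template, toy_fill))
print(result)              # {"name":"Dream","params":"7B"}
print(json.loads(result))  # parses, because the structure was fixed up front
```

Since the braces, keys, and commas are never generated, the output cannot break the JSON structure as long as each filled value is itself valid.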
reaper2894@reddit
How is it creating words at certain positions? Is it not trained as next token prediction method? Is it not transformer based? What changed ?? 😯
Thick-Protection-458@reddit
It is denoising the whole sequence in parallel, starting from input noise.
So it may become very "sure" about the N-th token before it is sure about the (N-1)-th token.
P.S. Now I wonder if the denoising step for the (N-1)-th token uses the previously denoised (not original) state of the N-th token as input. Otherwise there would be a good chance of it placing a token at an earlier position that doesn't fit the later ones.
spiritualblender@reddit
Definition sucks for 20m context length
Thick-Protection-458@reddit
Why would that necessarily be the case?
It is still a transformer, so if we use causal attention (the state of the N-th token is some kind of function of a dynamically weighted average of inputs 1..N) we will have the same hidden state for the prompt on each diffusion step.
So the actual compute count for diffusion is like O(diffusionSteps * promptSize * completionSize) but (theoretically) well parallelizable, while for the autoregressive setup it is O(promptSize * completionSize) but less parallelizable.
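To put rough numbers on it, taking the two estimates above at face value (they are back-of-envelope figures from this comment, not measurements, and the sizes below are made up):

```python
def diffusion_cost(diffusion_steps, prompt_size, completion_size):
    """Estimate from the comment: steps * prompt * completion."""
    return diffusion_steps * prompt_size * completion_size


def autoregressive_cost(prompt_size, completion_size):
    """Estimate from the comment: prompt * completion."""
    return prompt_size * completion_size


steps, prompt, completion = 64, 1000, 256  # illustrative values only
ratio = diffusion_cost(steps, prompt, completion) / autoregressive_cost(prompt, completion)
print(ratio)  # 64.0
```

Under these estimates the gap is exactly the number of diffusion steps, so fewer steps directly trade accuracy for compute.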
Mart-McUH@reddit
brain that Hey is how works my!
Interesting8547@reddit
Yeah, thought the same when I saw it. This is the way, let's go... AI is advancing faster...
ninjasaid13@reddit
Hey that is how my! brain works
ZachCope@reddit
Hey that is how brain works my!
Interesting8547@reddit
That is it, I really think the diffusion models are the future of AI. Just seeing this I just "know it". I really like diffusion models more. I think the models should be able to "picture" what they imagine, this is the way. It's so fascinating seeing this.
xquarx@reddit
I'm surprised it does not change a word after it's been placed. I would expect it to adjust the direction it's going as it gets closer to the final form. You sometimes see that in image diffusion.
MoffKalast@reddit
Yeah that's really weird, like if a wrong word is just locked in place and fucks everything up, along with a pre-fixed generation length? Probably leaving lots of performance on the table by not letting it remove or shift tokens around.
GrimReaperII@reddit
There are other methods like SEDD that allow the model to edit tokens freely (including generated tokens). Even here, they could randomly mask tokens to allow the model to finetune its output. They just choose not to in this example.
furish@reddit
Anyone correct me if I’m wrong, but if this works similarly to MDLM and SEDD, the underlying continuous-time Markov chain (CTMC) does not allow that, and you would have to change how you train the model. It is possible to use other underlying CTMCs, where in sampling you start from random tokens sampled uniformly and “correct” them until they make sense (similarly to image diffusion, where you start from Gaussian noise), but that does not perform as well as the current masking paradigm.
clduab11@reddit
https://arxiv.org/abs/2502.09992
Actually, the CTMC framework does indeed allow for masking tokens to be used; LLaDAs are usually going to be designed around the CTMC framework so discrete data like text can be utilized. Then follow your typical optimizations from there (gradient descent, etc).
Pretraining for DLLMs masks all tokens randomly at ratio t ~ U, but they apply the SFT paradigm for the training (would be curious to see what DPO would do...). Then the model simulates diffusion from full masking (t = 1) to unmasking (t = 0), predicting all masks simultaneously at each step with flexible remasking.
So it doesn't really start from the same noise that diffusive image generators employ. It starts from masking tokens and refines them down from there. LLaDA was shown to be highly competitive with the autoregressive baseline when looking at apples-to-apples data. Its scalability is a LOT better than conventional NLPs.
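A tiny sketch of the forward (noising) side of this, assuming each token is masked independently with probability t. The function name, the `<mask>` token, and the example sentence are made up for illustration:

```python
import random

MASK = "<mask>"


def mask_tokens(tokens, t, rng):
    """Independently replace each token with MASK with probability t."""
    return [MASK if rng.random() < t else tok for tok in tokens]


rng = random.Random(0)
tokens = "the model predicts all masks at once".split()

t = rng.random()  # masking ratio drawn uniformly from [0, 1)
noisy = mask_tokens(tokens, t, rng)
print(round(t, 2), noisy)

# The two endpoints of the process:
print(mask_tokens(tokens, 1.0, rng))  # t = 1: everything is masked
print(mask_tokens(tokens, 0.0, rng))  # t = 0: nothing is masked
```

Training would then ask the model to recover the original tokens at the masked positions; sampling runs the other way, from the fully masked row back to text.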
ninjasaid13@reddit
Isn't this more of an upscaler diffusion model?
Feztopia@reddit
The third paragraph is basically saying 3 times that she wasn't ready.
clduab11@reddit
GOD I love this. I've been hoping someone was working on the diffusion language model which studies have shown have a LOT more accuracy than sequential generation.
JuniorConsultant@reddit
After reading Anthropic's circuit tracing work, which shows activation of the last token before the first is generated: diffusion might be a better representation of what is going on inside the model. My bet is that diffusion language might be the next generation of architecture.
muyuu@reddit
a bit sceptical that it can perfectly predict the placement of words, i'd suspect it generates the text before it does that
fallingdowndizzyvr@reddit
That's a big downside compared to transformers. With transformers I can read along as it generates. For diffusion, I have to wait for it all to finish before I can read it.
FluffyMoment2808@reddit
Diffusion models are still transformers, they're just not autoregressive
ninjasaid13@reddit
diffusion is quicker anyways.
gangofminotaurs@reddit
Oh. A big move.
Healthy-Nebula-3603@reddit
Looks like a regressive model but random ...
Bitter-College8786@reddit
Let's assume we have a diffusion model which has the same performance as a Transformer model (here Dream vs Qwen). Do diffusion models have any advantages?
Context length, memory consumption for long context, inference speed?
Devatator_@reddit
Afaik diffusion models are faster and apparently allow stuff like "Inpainting" (in quotes because it's text here)
Doctor_moctor@reddit
Shouldn't this be WAY better for lyric generation, especially rap? When writing lyrics in a specific style you often first write one line, then create a rhyme for the end of the next line and fill the space in front afterwards.
MrXavi3@reddit
This could be very good for subtitle translation too! Sometimes llama 3.2 changes how characters address each other, for example in French from "tu" to "vous", which both translate to "you". I wonder if it can fix that
KaleidoscopeFuzzy422@reddit
We need to have a conversation about the testing that is being done for these models.
Like, the tests are not a good measure anymore of their accuracy and practicality. You have some of these models score great on the tests but when you try to use it in practice it's stupid and basic.
The tests need a major overhaul for comparison.
GreedyAdeptness7133@reddit
Over fitting or tests that have properties different from these? (Or both? And different how?)
MountainDry2344@reddit
Sudoku stocks 📉📉
durden111111@reddit
Diffusion LLMs are really cool
Gold_Pen@reddit
For the Cantonese speakers (especially at HKU), DLLM means a lot more than just diffusion LLMs 😂 sauce
Born-Attention-2151@reddit
It used to be DLNM aka “delay no more” aka “xxx xxx xxx xxx” In Cantonese 😂
alvenestthol@reddit
Hong Kong Cantonese lost its L-N distinction at least half a century ago; in fact, it's not even technically valid to have DLNM like DLLM or DNLM is, but because "DeLay No More" sounds like valid English that's stuck
clduab11@reddit
I'm HARDCORE nerding out right now. I've been waiting for a DLLM since the arXiv paper on DLLM generation. This is amazing.
ashirviskas@reddit
You can already run LLaDA.
clduab11@reddit
I'm stoked. I had been too out-of-the-loop on some of the more recent developments since the paper in February re: LLaDAs. I figured it was something immediately deployable as a framework and people had been working on it; I've just not had time to futz around myself with it.
jd_3d@reddit (OP)
Blog post: https://hkunlp.github.io/blog/2025/dream/
github: https://github.com/HKUNLP/Dream
Competitive_Ad_5515@reddit
Did it get taken down? The HF model links in the blog post 404
TheOneThatIsHated@reddit
They say they will upload in a couple of days, whatever that means
hak8or@reddit
Oh, like Sesame labs with their AI demo?
Meaning ruining their image in the eyes of many developers when they had such massive potential?
MINIMAN10001@reddit
Sesame was such a massive bummer.
Any time a new kind of AI comes out as open source, it changes the game.
An entire new field opens up as it opens the window to various companies competing to have the best open source model, and it is amazing. They could have been the gateway that opened up conversational AIs where voice actually functioned.
Enough-Meringue4745@reddit
"lets ignore everything theyre asking"
Competitive_Ad_5515@reddit
Well that's crappy and vague. Where did you read that?
The title of this post and the blog post explicitly say it has been released, which is apparently untrue. Also the Huawei connection is the second-most interesting aspect of this story to me.
"In a joint effort with Huawei Noah’s Ark Lab, we release Dream 7B (Diffusion reasoning model), the most powerful open diffusion large language model to date."
SidneyFong@reddit
Yep, trained using H800s (legal under Nvidia export restrictions to China) too.
TheRealGentlefox@reddit
Noah's Ark Lab is a surprisingly dark name for an AI lab when you really think about it.
TheOneThatIsHated@reddit
On their github....
MoffKalast@reddit
Yeaahhh that's usually code for "we're not releasing this but don't want the backlash for it", otherwise they'd have it ready to go.
TheOneThatIsHated@reddit
I think you are referring to sesame right? In research it does happen more often, but most of the time more because they were lazy or forgot than malice.
We'll see in the coming weeks. It would not surprise me if they either will or will not release it
MoffKalast@reddit
It happens really often. I wouldn't really blame the researchers themselves, there's usually someone higher up the chain that says they can't publish it. Typically someone from the legal department.
Interesting8547@reddit
Was it released and then taken down, or it was never released?!
Creative-robot@reddit
I’m really excited about the potential of diffusion for intelligence applications. It already dominates the image and video generation scene, i wonder if it’s just a matter of time before it dominates language and reasoning too?
bdsmmaster007@reddit
isnt the new Open AI image model explicitly not a diffusion model, and still really fucking good, if not one of the top image models currently?
odragora@reddit
It's a combination of diffusion and autoregression.
From OpenAI release notes:
https://openai.com/index/introducing-4o-image-generation/
"Transfer between Modalities:
Suppose we directly model p(text, pixels, sound) [equation] with one big autoregressive transformer.
Pros:
* image generation augmented with vast world knowledge
* next-level text rendering
* native in-context learning
* unified post-training stack
Cons:
* varying bit-rate across modalities
* compute not adaptive"
(Right) "Fixes:
* model compressed representations
* compose autoregressive prior with a powerful decoder"
On the bottom right of the board, she draws a diagram: "tokens -> [transformer] -> [diffusion] -> pixels"
GrimReaperII@reddit
Yes, but could it be better if it was a multimodal diffusion LLM? Their new model is good because of reinforcement learning + multimodality, not because of some inherent advantage to autoregression. The advantage comes in compute efficiency (KV cache), but that is not exclusive to autoregressive models; block diffusion also allows for a KV cache. Really, autoregression is a subset of diffusion.
BusRevolutionary9893@reddit
Best I've used.
binheap@reddit
I'd be a little more suspicious of it dominating text. Diffusion is particularly good in Fourier space which is presumably why it works so well for images. This could be a form of us optimizing for inductive bias. Text seems inherently more auto regressive in nature (even if we go back and edit from time to time).
ninjasaid13@reddit
I'm more interested in coding and code editing..
Zulfiqaar@reddit
Yes, I'm very interested in "inpainting" for text, something diffusion is exceptional at in visual domains.
It could be the new best FIM architecture, just like RNNs outperformed transformers previously (eg SuperMaven, before their Cursor acquisition)
9acca9@reddit
Can you explain what is diffusion? Thanks
Creative-robot@reddit
https://en.wikipedia.org/wiki/Diffusion_model
jd_3d@reddit (OP)
Me too. They only used 96 GPUs and trained for 11 days. Imagine a 100,000 GPU training run?
logicchains@reddit
Using a pre-trained Qwen model's weights as the base.
smflx@reddit
I read LLaDA & block diffusion papers. Both are similar. LLaDA also mentioned blockwise diffusion.
They are not diffusion like SD. They talk about several diffusion processes, but only masking is used.
The difference from a regular transformer is parallel token generation within a block. But LLaDA generates one token at a time for best quality (similar to AR!), which is very slow.
Blockwise diffusion is for fast parallel token generation within a short block of a few tokens. (Quality is far below AR models.)
To me... it's still basically a transformer with non-sequential one-by-one generation or short-range few-token generation.
I guess this paper might be of a similar kind. I will check the paper anyway.
no_witty_username@reddit
Nice, look at those sudoku stats! and pretty decent at planning too. There must be a bunch of other use cases where this thing shines. Glad to see labs take other architectures besides sequential more seriously....
i3ym@reddit
so how does it know how much space to leave for the not-yet-generated words? strange stuff
dp3471@reddit
Best model of the year. Getting text diffusion to work well is very hard, and this seems awesome. Sure, deepseek is amazing and very beneficial for current LLMs, but this is novel.
idesireawill@reddit
! Remindme 1w
vlodia@reddit
git pls / source? tl;dr
sanobawitch@reddit
In theory, nothing prevents us from slapping a SNAC on top of it, after many hours of training, then we have a tts model?
yukiarimo@reddit
Working on a banger TTS model
yukiarimo@reddit
No, thank you. The word diffusion was enough for me to be uninterested in that
BABA_yaaGa@reddit
Diffusion models are the future
relmny@reddit
based on what happened 1-2 weeks ago with closeai, it seems it's actually the past...
ninjasaid13@reddit
I still put diffusion models over until we actually have an open research paper that shows superiority,
Zulfiqaar@reddit
Have you seen Janus? I'm hoping it's an experiment before they release a full size one on the scale of R1
https://huggingface.co/deepseek-ai/Janus-Pro-7B
ninjasaid13@reddit
That's still a pure autoregression model, I want to see if they can scale up multimodal discrete diffusion model.
Zulfiqaar@reddit
Whoops I was skimming, missed that out. I agree, I definitely think there's a lot more potential in diffusion than is currently available. I'd like something that has a similar parameters count to SOTA LLMs, then we can compare like for like. Flux and Wan are pretty good, and they're only in the 10-15b range
ninjasaid13@reddit
Flux and Wan use an autoregressive model T5 as the text encoder don't they?
Zulfiqaar@reddit
Not 100% sure, haven't been diffusing as much these months so not got deep into the details. Quick search seems to indicate a Umt5 and clip
AppearanceHeavy6724@reddit
fill me in....
frankh07@reddit
It looks like diffusion models will be a game changer.
GreedyAdeptness7133@reddit
Does anyone know how someone can easily run all these benchmarks in python? (Maybe a bit link?) thanks!
swagonflyyyy@reddit
Oh yeah, this is huge news. We desperately need a different architecture than transformers.
Transformers is still king, but I really wanna see how far you can take this architecture.
_yustaguy_@reddit
Diffusion models and transformer models aren't mutually exclusive.
It's a diffusion-transformer model from what I can tell. The real change is that it's not autoregressive anymore (tokens aren't generated one at a time).
MoffKalast@reddit
Tbh that's still autoregressive, just not linearly.
ninjasaid13@reddit
you mean that it follows causality, not autoregressively.
MoffKalast@reddit
Same thing really.
ninjasaid13@reddit
Causality often involves multiple variables (e.g., X causes Y), while autoregression uses past values of the same variable.
MoffKalast@reddit
Well what other variables are there? It's still iterating on a context, much the same as a transformer doing fill in the middle would.
TheRealGentlefox@reddit
Well it's like, half autoregressive, no? There appear to be independent token generations in each pass.
Thick-Protection-458@reddit
Isn't this still transformers, just used in diffusion way rather than autoregressive (with all the diffusion bonuses and problems)
TheRealGentlefox@reddit
I like that it's competitive on all benchmarks, and then is randomly a god at sudoku.
ninjasaid13@reddit
Unique strength of diffusion models, multivariable planning.
ThenExtension9196@reddit
This is the next generation right here.
100thousandcats@reddit
!remindme 2 weeks
JLeonsarmiento@reddit
YES.
FullOf_Bad_Ideas@reddit
Waiting for weights to drop.
pseudonerv@reddit
So it’s like masked attention encoder/decoder, so like Bert?
Competitive_Ad_5515@reddit
Sudoku is never gonna be the same
swagonflyyyy@reddit
There's a Sudoku benchmark? Like, the actual game sudoku?