University of Hong Kong releases Dream 7B (Diffusion reasoning model). Highest performing open-source diffusion model to date. You can adjust the number of diffusion timesteps for speed vs accuracy

[-]

jd_3d@reddit (OP)

It's fascinating watching it generate text:

https://i.redd.it/xci0dlo7hgse1.gif

[-]

Interesting8547@reddit

Yeah, thought the same when I saw it. This is the way, let's go... AI is advancing faster...

[-]

This is also a particularly useful use case for diffusion models. It's also fascinating to think that most LLMs have no idea where they're going to end up. They just walk forward until they get there.

[-]

momono75@reddit

How can we stream this? I think this way doesn't fit well for chatting until the generation process goes much faster.

[-]

r_Sh4d0w@reddit

diffusion models are quick. Give mercury coder by inceptionlabs a try, much faster at spitting out a whole paragraph of code compared to any language model. Even images diffusion models got much faster after a few iterations.

[-]

Thick-Protection-458@reddit

Blockwise generation can be streamed, at very least. The question is compute efficiency of different setups.

[-]

momono75@reddit

Yes, technically it will be possible as we see this screenshot, but I didn't feel it was for humans...

[-]

xquarx@reddit

I'm surprised it does not change a word after it's been placed. I would expect it to adjust the direction it's going as it gets closer to the final form. You sometimes see that in image diffusion.

[-]

MoffKalast@reddit

Yeah that's really weird, like if a wrong word is just locked in place and fucks everything up, along with a pre-fixed generation length? Probably leaving lots of performance on the table by not letting it remove or shift tokens around.

[-]

GrimReaperII@reddit

There are other methods like SEDD that allow the model to edit tokens freely (including generated tokens). Even here, they could randomly mask tokens to allow the model to finetune its output. They just choose not to in this example.
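
A minimal sketch of the "randomly mask tokens" idea from this comment, assuming a finished draft that gets partially re-masked so a model could re-predict those positions. The `remask` helper, the `ratio` parameter, and the `<mask>` string are hypothetical names for illustration, not part of SEDD or Dream.

```python
import random

MASK = "<mask>"  # hypothetical mask token string

def remask(tokens, ratio, rng):
    """Replace a random fraction of tokens with MASK so they can be re-predicted."""
    n = len(tokens)
    k = int(n * ratio)                      # how many positions to reopen
    reopened = set(rng.sample(range(n), k))
    return [MASK if i in reopened else tok for i, tok in enumerate(tokens)]

draft = "Joey and Rachel had been dating for a while".split()
reopened_draft = remask(draft, ratio=0.3, rng=random.Random(0))
print(reopened_draft)
```

Repeating this re-mask and re-predict cycle is one way a model could revise tokens it has already committed, at the cost of extra denoising steps.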

[-]

cms2307@reddit

So with this model can you just let it run for as long as you want doing that technique and it will approach the “optimal” output given its training data?

[-]

GrimReaperII@reddit

Yes. It's still limited by the training data, parameter count, and architecture but it can create a more optimal output than autoregressive model of the same size because it can dedicate more compute (>n) to generating a sequence (of length n).

[-]

Player06@reddit

Pretty sure it does change them, we just dont see it.

Under the hood it might write a full story on the first go, but most words are low confidence. Only the high confidence words are made visible. To us it looks like it writes out of order, when it actually re writes the whole text many times and just shows the parts it is super sure about.

That being said, I have no idea. This is an educated guess.

[-]

nialv7@reddit

yeah how does it know all the 't s so early on?

[-]

furish@reddit

Anyone correct me if I’m wrong, but if this works similarly to MDLM and SEDD, the underlying Continuous Time Markov Chain does not allow to do that and you would have to change how you train the model. It is possible to use other underlying CTMCs, where in sampling you start from random tokens sampled uniformly and you “correct” them to make it have sense (similarly to image diffusion where you start from Gaussian noise), but it does not perform as well as the current masking paradigm.

[-]

clduab11@reddit

https://arxiv.org/abs/2502.09992

Actually, the CTMC framework does indeed allow for masking tokens to be used; LLaDAs are usually going to be designed around the CTMC framework so discrete data like text can be utilized. Then follow your typical optimizations from there (gradient descent, etc).

Pretraining for DLLMs masks all tokens randomly at ratio t ~ U[0, 1], but they apply the SFT paradigm for the training (would be curious to see what DPO would do...). Then the model simulates diffusion from full masking (t = 1) to unmasking (t = 0), predicting all masks simultaneously at each step with flexible remasking.

So it doesn't really start from the same noise that diffusive image generators employ. It starts from masking tokens and refines them down from there. LLaDA was shown to be highly competitive with that of the autoregressive baseline when looking at apples to apples data. Its scalability is a LOT better than conventional NLPs.
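
A minimal sketch of the process described in this comment, assuming a toy stand-in for the model: start from an all-mask sequence (t = 1), predict every masked position at each step, commit only the most confident predictions, and leave the rest masked for later steps. `toy_predict`, `denoise`, and the `<mask>` string are hypothetical names for illustration, not LLaDA's or Dream's actual code; the "model" here simply peeks at a fixed target sentence and attaches random confidences.

```python
import random

MASK = "<mask>"  # hypothetical mask token string

def toy_predict(seq, target, rng):
    """Stand-in for the denoiser: one (token, confidence) pair per position.

    Assumption: a real model would produce logits. Here we read the token
    from a fixed target and draw a random confidence so the loop is runnable.
    """
    return [(target[i], rng.random()) for i in range(len(seq))]

def denoise(target, steps, seed=0):
    """Return the sequence after every step, from fully masked to fully revealed."""
    rng = random.Random(seed)
    n = len(target)
    seq = [MASK] * n                          # t = 1: everything is masked
    history = [list(seq)]
    for step in range(1, steps + 1):
        preds = toy_predict(seq, target, rng)
        keep = round(n * step / steps)        # tokens that should be revealed by now
        masked = [i for i in range(n) if seq[i] == MASK]
        need = keep - (n - len(masked))       # how many more to commit this step
        # Commit the most confident predictions; the others stay masked.
        masked.sort(key=lambda i: preds[i][1], reverse=True)
        for i in masked[:need]:
            seq[i] = preds[i][0]
        history.append(list(seq))
    return history

target = "Joey and Rachel had been dating for a while".split()
history = denoise(target, steps=3)
for seq in history:
    print(" ".join(seq))
```

With `steps` equal to the sequence length, this commits one token per step; with fewer steps it commits several tokens at once. In a real model that is the speed versus accuracy trade-off mentioned in the post title, since tokens committed together cannot condition on each other.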

[-]

ninjasaid13@reddit

Isn't this more of an upscaler diffusion model?

[-]

Pretty_Sand3036@reddit

Ahh this makes and doesn’t make sense at the same time

[-]

100thousandcats@reddit

What the actual fuck…

[-]

Recoil42@reddit

Wait until you see block diffusion.

[-]

kremlinhelpdesk@reddit

Defrag diffusion.

[-]

Many_SuchCases@reddit

Never forget the struggle.

[-]

ConiglioPipo@reddit

I was there. I won't forget.

[-]

Thistleknot@reddit

Elrond?

[-]

PathIntelligent7082@reddit

and then all the crap gets cleaned up, but one lil' red square remains intact

[-]

FaceDeer@reddit

I used to find that to be a strangely relaxing process to watch. Sadly, at some point defragmentation became an automatic background process of the filesystem and we no longer got to see it work.

[-]

no_witty_username@reddit

Defrag sound was the original ASMR I fell asleep to at night....

[-]

SidneyFong@reddit

Been using SSDs for so many years now that I totally forgot how we kinda knew what the computer was doing by listening to hard disk sounds...

[-]

hazed-and-dazed@reddit

click-click

Oh no!!

[-]

DaniyarQQQ@reddit

I remember the sound:

trrt...trrt...trrt...trrt...trrt...trrt...trrt...trrt...trrrrrrt.....

[-]

Samurai2107@reddit

It's almost how autoregressive models like 4o work, but block diffusion is not left to right or top to bottom. It shows what Claude figured out: there is a level in latent space where the model already knows what to show us.

[-]

MINIMAN10001@reddit

Considering how they say block diffusions shows a decreasing perplexity.

It feels like a hack job in order to increase parallelizability?

[-]

ClassyBukake@reddit

Even a minuscule amount of parallelism would massively increase the efficiency of multi-compute environments.

[-]

switchpizza@reddit

😂

[-]

ffiw@reddit

https://i.redd.it/9xtkswjpphse1.gif

[-]

NullHypothesisCicada@reddit

No wonder it’s so good at sudoku

[-]

WhereIsYourMind@reddit

I wouldn't put it past front-end gimmicks, but I had a ChatGPT 4.5 response that generated in a similar manner. I remember distinctly that it created blank lines and then generated entire sentence chunks at once, instead of outputting tokens one at a time.

I wonder if OpenAI is doing A/B testing using a model with similar architecture. Pure conjecture.

[-]

Shoddy_Ad_7853@reddit

That's efficient, it's what I do.

[-]

jabblack@reddit

How does it know the spacing for words it hasn’t figured out yet?

People technically write like this: where the initial words are high level ideas and outlines, then add in additional details.

Look at the words that are filled in first:

Joey and Rachel had been dating for awhile but.. …just wasn’t ready… finally they together.

It creates an overarching narrative, then fills in gaps.

[-]

tim_Andromeda@reddit

That's a gimmick right? How would it know how much space to leave for text it hasn't outputted yet.

[-]

DerfK@reddit

I'm suspicious as well, but I'm guessing what the video shows is a "dramatization" of how the final product was arrived at (maybe even an accurate dramatization of the fragments of the text in the order they actually got generated), rather than actual runtime diffusion snapshots like StableDiffusion where you can see the blurry bits come together.

[-]

Pyros-SD-Models@reddit

Why are you guys just guessing instead of just checking out their GitHub or any Hugging Face space of a diffusion LLM and literally trying it out yourself lol

[-]

UserXtheUnknown@reddit

Thanks, tried it. It was not particularly good when compared to similar -in size- sequential LLMs, though. Maybe even a bit worse.

[-]

DerfK@reddit

OK not quite the same as the video, it is still working in tokens and each token could be longer or shorter so the text isn't fixed in place with a set number of spaces to fill in like OP's video.

[-]

Stepfunction@reddit

This example is specifically an infilling example, so the space needed was specified ahead of time.

[-]

stddealer@reddit

https://i.redd.it/1qquw5mw7ise1.gif

This is not infilling and shows the same oddity.

[-]

Stepfunction@reddit

I imagine that there are probably something like 1024 placeholder tokens, which are then filled in by the diffusion process. In this case, the rest of the placeholders were likely rejected, and only the first section was used for the answer.

This is likely something you would need to specify for any model like this.

The fact that you can specify a response length is, in its own right, a very powerful feature.

[-]

Pyros-SD-Models@reddit

Yes, but the response length is like max_tokens with auto regressive llms.

Like if you set the length to 1024 and ask it to answer "What does meow in a word?" it'll answer "cat" and invalidates all other 1023 tokens
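
A minimal sketch of that behaviour, assuming the unused positions of the fixed-length canvas end up as end-of-sequence tokens that are dropped before display. The `<eos>` string and the `trim_at_eos` helper are hypothetical names for illustration.

```python
EOS = "<eos>"  # hypothetical end-of-sequence token string

def trim_at_eos(tokens):
    """Keep only the tokens before the first end-of-sequence token."""
    if EOS in tokens:
        return tokens[:tokens.index(EOS)]
    return tokens

# A 1024-slot canvas where the model only needed the first slot.
canvas = ["cat"] + [EOS] * 1023
answer = trim_at_eos(canvas)
print(answer)
```

Under this assumption the requested length acts as an upper bound, like `max_tokens`, rather than a target the answer must fill.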

[-]

Stepfunction@reddit

That's what I'd imagine. It's like specifying a certain pixel size output latent in an image diffusion model.

[-]

MountainDry2344@reddit

the visualization here is misleading since it makes it look like the model knows exactly how much whitespace to provision - I tried it out at https://huggingface.co/spaces/multimodalart/LLaDA, and it doesn't pre-calculate the amount of whitespace, it just progressively replaces a row of wildcard tokens with text or nothing. I think technically it could just generate like a normal LLM left to right, but it's not constrained to working in that order, so it places text all over the place and fills the gap in between

[-]

stddealer@reddit

LLaDA is a different model

[-]

veggytheropoda@reddit

The "16-3-4=9" and "9*2=18" equations are generated simultaneously, and so is the result, 18. How could it work out the answer before the equations are filled in? Or does the answer already exist when it reads the prompt, and all the "calculations" are just it explaining how it got the result?

[-]

Pyros-SD-Models@reddit

Yes

Anthropic's paper has interactive examples how for example when writing a poem the model figures out the rhymes at first and then build the rest

Or how they do calculations.

https://transformer-circuits.pub/2025/attribution-graphs/biology.html

[-]

martinerous@reddit

Looks like you missed this: https://huggingface.co/spaces/multimodalart/LLaDA
And this: https://chat.inceptionlabs.ai/ (signup needed).

[-]

Pyros-SD-Models@reddit

https://huggingface.co/spaces/multimodalart/LLaDA works for me, and it works exactly as here https://ml-gsai.github.io/LLaDA-demo/

I don't know what's so hard to grasp: instead of just the token, the position is also part of the distribution. That's like the point of diffusion. The whole space gets diffused at the same time, until a token reaches a threshold and is fixed.

It's like if you recognize the eyes in a stable diffusion image first

[-]

martinerous@reddit

Now LLaDA works for me too. But it behaves a bit different - in the visualization it did not output the known ending immediately:

[-]

ninjasaid13@reddit

probably a slider for how many tokens you want to generate.

[-]

KillerX629@reddit

wasn't mercury almost the same? at least I remember it being like that. probably has a "mean space required" variable and slightly adjusts it with time maybe

[-]

florinandrei@reddit

Maybe they waited until the whole message was generated, figured out the empty spaces, then filled them in in the order the words were generated.

[-]

Determined-Hedgehog@reddit

Take my upvote!

[-]

Sad-Elk-6420@reddit

I wonder if it is easier to have it follow JSON. Could we pre write the JSON parts and it just fill in?

[-]

DerfK@reddit

This is actually what I'm hoping for, that we'll be able to ask the model to "inpaint" text in between what's already written rather than constantly append to the context.
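
A minimal sketch of the JSON "inpainting" idea discussed here, assuming the pre-written parts of the template are frozen and only the masked slots are handed to the model. `build_canvas`, `infill`, and the `fill_fn` callback are hypothetical names; `fill_fn` is a stand-in for a diffusion model and simply looks up canned values.

```python
import json

MASK = "<mask>"  # hypothetical mask token string

def build_canvas(template_tokens):
    """Return (canvas, frozen), where frozen[i] is True for pre-written tokens."""
    frozen = [tok != MASK for tok in template_tokens]
    return list(template_tokens), frozen

def infill(canvas, frozen, fill_fn):
    """Fill only the unfrozen slots; pre-written tokens are never touched."""
    return [tok if keep else fill_fn(i, canvas)
            for i, (tok, keep) in enumerate(zip(canvas, frozen))]

# The JSON structure is written ahead of time; only the values are masked.
template = ['{"name": "', MASK, '", "age": ', MASK, '}']
canvas, frozen = build_canvas(template)

guesses = {1: "Rachel", 3: "29"}  # canned outputs standing in for the model
filled = infill(canvas, frozen, lambda i, c: guesses[i])
result = json.loads("".join(filled))
print(result)
```

Because the braces, keys, and quotes are never regenerated, the structure stays as written; whether the result parses still depends on what the model puts in the slots.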

[-]

FaceDeer@reddit

I've been doing a lot of work with LLMs generating lyrics lately and this would be really handy, often I'd like it to just try fixing a verse or a single line from a mostly done song. Or insert a new verse between existing ones. Inpainting would be very handy.

[-]

reaper2894@reddit

How is it creating words at certain positions? Is it not trained as next token prediction method? Is it not transformer based? What changed ?? 😯

[-]

Thick-Protection-458@reddit

It denoises the sequence from input noise, in parallel.

So it may become very "sure" about the N-th token before it is sure about the (N-1)-th token.

P.S. Now I wonder if the denoising step for the (N-1)-th token uses the previously denoised (not original) state of the N-th token as input. Otherwise it would have a good chance of placing a token in an earlier position that does not fit the later ones.

[-]

spiritualblender@reddit

Definition sucks for 20m context length

[-]

Thick-Protection-458@reddit

Why should it be necessary?

It is still a transformer, so if we use causal attention (state of N-th token is some kind of function of dynamically-weighted average of 1..N inputs) we will have same hidden state for prompts on each diffusion steps.

So the actual compute count for diffusion is like O(diffusionSteps * promptSize * completionSize) but (theoretically) well parallelizable, while for the autoregressive setup it is O(promptSize * completionSize) but less parallelizable.
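
To put rough numbers on that estimate, here is a small sketch that evaluates the two expressions from this comment as given. It counts only the prompt-to-completion interactions the comment names and ignores constants, so it is a back-of-the-envelope comparison under the comment's assumptions, not a measurement. The function names and the example sizes are made up for illustration.

```python
def diffusion_cost(steps, prompt_len, completion_len):
    """Cost estimate from the comment: every step touches the whole completion."""
    return steps * prompt_len * completion_len

def autoregressive_cost(prompt_len, completion_len):
    """Cost estimate from the comment: each completion token attends to the prompt once."""
    return prompt_len * completion_len

prompt_len, completion_len = 1000, 200
ar = autoregressive_cost(prompt_len, completion_len)
for steps in (1, 50, 200):
    ratio = diffusion_cost(steps, prompt_len, completion_len) / ar
    print(f"steps={steps:>3}: diffusion / autoregressive = {ratio:.0f}x")
```

Under these formulas the total work grows linearly with the number of diffusion steps, so any wall-clock advantage has to come from running each step in parallel across positions.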

[-]

Interesting8547@reddit

That is it, I really think the diffusion models are the future of AI. Just seeing this I just "know it". I really like diffusion models more. I think the models should be able to "picture" what they imagine, this is the way. It's so fascinating seeing this.

[-]

Feztopia@reddit

The third paragraph is basically saying 3 times that she wasn't ready.

[-]

clduab11@reddit

GOD I love this. I've been hoping someone was working on the diffusion language model which studies have shown have a LOT more accuracy than sequential generation.

[-]

JuniorConsultant@reddit

After reading Anthropic's circuit tracing work, which shows activation of the last token before the first is generated: diffusion might be a better representation of what is going on inside the model. My bet is that diffusion language might be the next generation of architecture.

[-]

muyuu@reddit

a bit sceptical that it can perfectly predict the placement of words, i'd suspect it generates the text before it does that

[-]

fallingdowndizzyvr@reddit

That's a big downside compared to transformers. With transformers I can read along as it generates. For diffusion, I have to wait for it all to finish before I can read it.

[-]

FluffyMoment2808@reddit

Diffusion models are still transformers, they're just not autoregressive

[-]

ninjasaid13@reddit

diffusion is quicker anyways.

[-]

gangofminotaurs@reddit

Oh. A big move.

[-]

Healthy-Nebula-3603@reddit

Looks like a regressive model but random ...

[-]

Hot_Rice6594@reddit

Looks like it's not diffusion improvement by steps
Early steps will determine the whole content, later steps are like speculative decoding...

[-]

Lazy-Pattern-5171@reddit

THIS IS WHAT I WANTED. Thank you so much.

[-]

pseudonerv@reddit

So it’s like masked attention encoder/decoder, so like Bert?

[-]

BashfulMelon@reddit

BERT is encoder-only.

[-]

xor_2@reddit

I spent a few days analyzing LLaDA, so this model is very interesting to me to see how it differs.

LLaDA is super fun in how it works, but it obviously needs some work. In particular, prompts with short answers seem to require a big block size, yet the model might spend most steps filling in masking tokens, which kinda doesn't make any sense. Not to mention it was strange to me that not a lot of data is carried over from step to step and the model really works on already prepared results. It somehow works, so who am I to question it, but it seems like a big limitation.

What is fun about LLaDA is being able to fill in gaps: I can slap in text with holes and it will fill these holes. Heck, I can randomly start adding holes and the model can arrive at the same results.

Besides the limitation I mentioned, another one is that LLaDA can in theory produce more tokens per step, but for best performance it is just a single token per step. In that case, especially with a bigger block size (which is what gives the best intelligence/performance), there is no speed advantage, and rather a giant speed downgrade along with size limitations.

That said, to really compare performance I would need to run some benchmarks. If the benchmarks were performed with very small block sizes, as the scripts suggest, and are comparable to AR 7B/8B models (or even better), then the situation might be much better than I think.

Still, in LLaDA I see some room for improvement when it comes to selecting tokens and the tendency of the model to self-correct (this functionality exists but the model is hesitant to use it).

Now I shall test "Dream 7B"; from the benchmarks it looks interesting. It will also be interesting to do some other unholy abominations with these models. I had actually been waiting for another model like this to play with this stuff.

[-]

PathIntelligent7082@reddit

as I can see, the results are on par with Qwen, so a statement like "most powerful" is inaccurate...

[-]

silenceimpaired@reddit

It’s unfortunate that they put the least compelling charts first. There are charts present in the image that make this an interesting model. It doesn’t have to be an either or. It can be both.

[-]

PathIntelligent7082@reddit

interesting? yes... but terms like "most powerful" are BS

[-]

silenceimpaired@reddit

Across the board? Agreed. Sudoku? Agree to Disagree.

[-]

Bitter-College8786@reddit

Lets assume we have a diffusion model which has the same performance like a Transformer model (here Dream vs Qwen). Do Diffusion models have any advantages?

Context length, memory consumption for long context, inference speed?

[-]

Devatator_@reddit

Afaik diffusion models are faster and apparently allow stuff like "Inpainting" (in quotes because it's text here)

[-]

Doctor_moctor@reddit

Shouldn't this be WAY better for lyric generation, especially rap? When writing lyrics in a specific style you often first write one line, then create a rhyme for the end of the next line and fill the space in front afterwards.

[-]

MrXavi3@reddit

This could be very good for subtitle translation too! Sometimes llama 3.2 changes how some characters address each other, for example in French from "tu" to "vous", which both translate to "you". I wonder if it can fix that

[-]

KaleidoscopeFuzzy422@reddit

We need to have a conversation about the testing that is being done for these models.

Like, the tests are not a good measure anymore of their accuracy and practicality. You have some of these models score great on the tests but when you try to use it in practice it's stupid and basic.

The tests need a major overhaul for comparison.

[-]

GreedyAdeptness7133@reddit

Over fitting or tests that have properties different from these? (Or both? And different how?)

[-]

MountainDry2344@reddit

Sudoku stocks 📉📉

[-]

durden111111@reddit

Diffusion LLMs are really cool

[-]

Gold_Pen@reddit

For the Cantonese speakers (especially at HKU), DLLM means a lot more than just diffusion LLMs 😂 sauce

[-]

Born-Attention-2151@reddit

It used to be DLNM aka “delay no more” aka “xxx xxx xxx xxx” In Cantonese 😂

[-]

alvenestthol@reddit

Hong Kong Cantonese lost its L-N distinction at least half a century ago; in fact, it's not even technically valid to have DLNM like DLLM or DNLM is, but because "DeLay No More" sounds like valid English that's stuck

[-]

clduab11@reddit

I'm HARDCORE nerding out right now. I've been waiting for a DLLM since the arXiv paper on DLLM generation. This is amazing.

[-]

ashirviskas@reddit

You can already run LLaDA.

[-]

clduab11@reddit

I'm stoked. I had been too out-of-the-loop on some of the more recent developments since the paper in February re: LLaDAs. I figured it was something immediately deployable as a framework and people had been working on it; I've just not had time to futz around myself with it.

[-]

jd_3d@reddit (OP)

Blog post: https://hkunlp.github.io/blog/2025/dream/
github: https://github.com/HKUNLP/Dream

[-]

Competitive_Ad_5515@reddit

Did it get taken down? The HF model links in the blog post 404

[-]

TheOneThatIsHated@reddit

They say they will upload in a couple of days, whatever that means

[-]

hak8or@reddit

Oh, like Sesame Labs with their AI demo?

Meaning ruining their image in the eyes of many developers when they had such massive potential?

[-]

MINIMAN10001@reddit

Sesame was such a massive bummer.

Any time a new AI that comes out into open source changes the game.

An entire new field opens up as it opens the window to various companies competing to have the best open source model, and it is amazing. They could have been the gateway that opened up conversational AIs where voice actually functioned.

[-]

Enough-Meringue4745@reddit

"lets ignore everything theyre asking"

[-]

Competitive_Ad_5515@reddit

Well that's crappy and vague. Where did you read that?

The title of this post and the blog post explicitly say it has been released, which is apparently untrue. Also the Huawei connection is the second-most interesting aspect of this story to me.

"In a joint effort with Huawei Noah’s Ark Lab, we release Dream 7B (Diffusion reasoning model), the most powerful open diffusion large language model to date."

[-]

SidneyFong@reddit

Yep, trained using H800s (legal under Nvidia exports restrictions to China) too.

[-]

TheRealGentlefox@reddit

Noah's Ark Lab is a surprisingly dark name for an AI lab when you really think about it.

[-]

TheOneThatIsHated@reddit

On their github....

[-]

MoffKalast@reddit

Yeaahhh that's usually code for "we're not releasing this but don't want the backlash for it", otherwise they'd have it ready to go.

[-]

TheOneThatIsHated@reddit

I think you are referring to sesame right? In research it does happen more often, but most of the time more because they were lazy or forgot than malice.

We'll see in the coming weeks. It would not surprise me if they either will or will not release it

[-]

MoffKalast@reddit

It happens really often. I wouldn't really blame the researchers themselves, there's usually someone higher up the chain that says they can't publish it. Typically someone from the legal department.

[-]

Interesting8547@reddit

Was it released and then taken down, or it was never released?!

[-]

Creative-robot@reddit

I’m really excited about the potential of diffusion for intelligence applications. It already dominates the image and video generation scene, i wonder if it’s just a matter of time before it dominates language and reasoning too?

[-]

bdsmmaster007@reddit

isnt the new Open AI image model explicitly not a diffusion model, and still really fucking good, if not one of the top image models currently?

[-]

odragora@reddit

It's a combination of diffusion and autoregression.

From OpenAI release notes:

https://openai.com/index/introducing-4o-image-generation/

Transfer between Modalities:

Suppose we directly model p(text, pixels, sound) [equation] with one big autoregressive transformer.

Pros: * image generation augmented with vast world knowledge * next-level text rendering * native in-context learning * unified post-training stack

Cons: * varying bit-rate across modalities * compute not adaptive"

(Right) "Fixes: * model compressed representations * compose autoregressive prior with a powerful decoder"

On the bottom right of the board, she draws a diagram: "tokens -> [transformer] -> [diffusion] -> pixels"

[-]

GrimReaperII@reddit

Yes, but could it be better if it were a multimodal diffusion LLM? Their new model is good because of reinforcement learning + multimodality, not because of some inherent advantage to autoregression. The advantage comes in compute efficiency (KV cache), but that is not exclusive to autoregressive models; block diffusion also allows for a KV cache. Really, autoregression is a subset of diffusion.

[-]

BusRevolutionary9893@reddit

Best I've used.

[-]

binheap@reddit

I'd be a little more suspicious of it dominating text. Diffusion is particularly good in Fourier space which is presumably why it works so well for images. This could be a form of us optimizing for inductive bias. Text seems inherently more auto regressive in nature (even if we go back and edit from time to time).

[-]

ninjasaid13@reddit

I'm more interested in coding, and code editing..

[-]

Zulfiqaar@reddit

Yes, I'm very interested in "inpainting" for text, something diffusion is exceptional at in visual domains.

It could be the new best FIM architecture, just like RNNs outperformed transformers previously (eg SuperMaven, before their Cursor acquisition)

[-]

9acca9@reddit

Can you explain what is diffusion? Thanks

[-]

Creative-robot@reddit

https://en.wikipedia.org/wiki/Diffusion_model

[-]

jd_3d@reddit (OP)

Me too. They only used 96 GPUs and trained for 11 days. Imagine a 100,000 GPU training run?

[-]

logicchains@reddit

Using a pre-trained Qwen model's weights as the base.

[-]

smflx@reddit

I read LLaDA & block diffusion papers. Both are similar. LLaDA also mentioned blockwise diffusion.

They are not a diffusion like SD. Talked about several diffusion process but only masking used.

The difference from transformer is parallel token generation in block. But LLaDA generates 1 by 1 for best quality (similar to AR!) but very slow.

Blockwise diffusion is for a fast parallel token generation within a short block of few tokens. (Quality is far under AR models)

To me... It's still basically transformer with non-sequential 1-by-1 generation or short term few token generation.

I guess this paper might be the similar kind. I will check paper anyway.
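
A minimal sketch of the block-wise schedule described in this comment, assuming blocks are processed left to right and the tokens inside one block are unmasked over a fixed number of steps. `block_schedule` and its parameters (`gen_len`, `block_len`, `steps_per_block`) are hypothetical names for illustration, not taken from the LLaDA or block diffusion code.

```python
def block_schedule(gen_len, block_len, steps_per_block):
    """List (block_start, block_end, tokens_to_unmask) for every denoising step."""
    plan = []
    for start in range(0, gen_len, block_len):   # blocks go left to right
        end = min(start + block_len, gen_len)
        size = end - start
        revealed = 0
        for step in range(1, steps_per_block + 1):
            target = size * step // steps_per_block  # tokens revealed by this step
            plan.append((start, end, target - revealed))
            revealed = target
    return plan

# Two blocks of four tokens, two steps each: two tokens per step in parallel.
parallel_plan = block_schedule(gen_len=8, block_len=4, steps_per_block=2)
print(parallel_plan)

# One step per token: the slow "1 by 1" setting described above.
one_by_one_plan = block_schedule(gen_len=4, block_len=4, steps_per_block=4)
print(one_by_one_plan)
```

Since each block is finished before the next one starts, completed blocks could be shown to the reader while later blocks are still being denoised.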

[-]

no_witty_username@reddit

Nice, look at those sudoku stats! and pretty decent at planning too. There must be a bunch of other use cases where this thing shines. Glad to see labs take other architectures besides sequential more seriously....

[-]

i3ym@reddit

so how does it know how much space to leave for the not-yet-generated words? strange stuff

[-]

dp3471@reddit

Best model of the year. Getting text diffusion to work well is very hard, and this seems awesome. Sure, deepseek is amazing and very beneficial for current LLMs, but this is novel.

[-]

idesireawill@reddit

! Remindme 1w

[-]

vlodia@reddit

git pls / source? tl;dr

[-]

sanobawitch@reddit

In theory, nothing prevents us from slapping a SNAC on top of it, after many hours of training, then we have a tts model?

[-]

yukiarimo@reddit

Working on a banger TTS model

[-]

yukiarimo@reddit

No, thank you. The word diffusion was enough for me to be uninterested in that

[-]

BABA_yaaGa@reddit

Diffusion models are the future

[-]

relmny@reddit

based on what happened 1-2 weeks ago with closeai, it seems it's actually the past...

[-]

ninjasaid13@reddit

I still put diffusion models over until we actually have an open research paper that shows superiority,

[-]

Zulfiqaar@reddit

Have you seen Janus? I'm hoping it's an experiment before they release a full size one on the scale of R1

https://huggingface.co/deepseek-ai/Janus-Pro-7B

[-]

ninjasaid13@reddit

That's still a pure autoregression model, I want to see if they can scale up multimodal discrete diffusion model.

[-]

Zulfiqaar@reddit

Whoops I was skimming, missed that out. I agree, I definitely think there's a lot more potential in diffusion than is currently available. I'd like something that has a similar parameters count to SOTA LLMs, then we can compare like for like. Flux and Wan are pretty good, and they're only in the 10-15b range

[-]

ninjasaid13@reddit

Flux and Wan use an autoregressive model T5 as the text encoder don't they?

[-]

Zulfiqaar@reddit

Not 100% sure, haven't been diffusing as much these months so not got deep into the details. Quick search seems to indicate a Umt5 and clip

[-]

AppearanceHeavy6724@reddit

fill me in....

[-]

frankh07@reddit

It looks like diffusion models will be a game changer.

[-]

GreedyAdeptness7133@reddit

Does anyone know how someone can easily run all these benchmarks in python? (Maybe a bit link?) thanks!

[-]

swagonflyyyy@reddit

Oh yeah, this is huge news. We desperately need a different architecture than transformers.

Transformers is still king, but I really wanna see how far you can take this architecture.

[-]

_yustaguy_@reddit

Diffusion models and transformer models aren't mutually exclusive.

It's a diffusion-transformer model from what I can tell. The real change is that it's not autoregressive anymore (tokens aren't generated one at a time).

[-]

MoffKalast@reddit

Tbh that's still autoregressive, just not linearly.

[-]

ninjasaid13@reddit

Tbh that's still autoregressive, just chronologically instead of positionally.

you mean that it follows causality, not autoregressively.

[-]
