LLaDA2.0 (103B/16B) has been released
Posted by jacek2023@reddit | LocalLLaMA | View on Reddit | 72 comments
LLaDA2.0-flash is a diffusion language model featuring a 100BA6B Mixture-of-Experts (MoE) architecture. As an enhanced, instruction-tuned iteration of the LLaDA2.0 series, it is optimized for practical applications.
LLaDA2.0-mini is a diffusion language model featuring a 16BA1B Mixture-of-Experts (MoE) architecture. As an enhanced, instruction-tuned iteration of the LLaDA series, it is optimized for practical applications.
llama.cpp support in progress https://github.com/ggml-org/llama.cpp/pull/17454
V1rgin_@reddit
It seems they started using block diffusion sampling. As far as I remember, in the previous article they said that block diffusion showed worse results than pure diffusion
Finanzamt_Endgegner@reddit
Quality is a bit worse, but long-context speed improves massively, I think.
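Roughly, block diffusion decodes left to right one block at a time, running the iterative denoising only inside the active block while earlier blocks stay fixed (and can be KV-cached, which is where the long-context speedup comes from). A minimal, schematic sketch of that loop follows; it is not the actual LLaDA2.0 sampler, and the model interface, mask-token id, and unmasking schedule are all assumptions:

```python
import torch

MASK_ID = 0  # hypothetical mask-token id

def block_diffusion_generate(model, prompt_ids, n_blocks=4, block_len=32, steps_per_block=32):
    """Schematic block-diffusion sampling: denoise one block at a time, left to right."""
    seq = prompt_ids.clone()
    for _ in range(n_blocks):
        # Append a fully masked block and denoise only that block.
        seq = torch.cat([seq, torch.full((1, block_len), MASK_ID, dtype=seq.dtype)], dim=1)
        active = slice(seq.shape[1] - block_len, seq.shape[1])
        for step in range(steps_per_block):
            logits = model(seq)                      # earlier, finished blocks could be KV-cached
            conf, pred = logits[:, active].softmax(-1).max(-1)
            still_masked = seq[:, active] == MASK_ID
            # Unmask the most confident still-masked positions this step.
            k = max(1, int(still_masked.sum()) // (steps_per_block - step))
            idx = torch.topk(conf.masked_fill(~still_masked, -1.0), k, dim=-1).indices
            seq[:, active].scatter_(1, idx, pred.gather(1, idx))
    return seq
```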
rm-rf-rm@reddit
Is there any HF space or online demo?
DeProgrammer99@reddit
How do the experts work for MoE diffusion models? I want to assume it's different experts per denoising step, not different experts per block nor different experts per token (since tokens are predicted concurrently).
Double_Cause4609@reddit
Per denoising step, yeah, probably. Actually, diffusion MoEs are a bit cursed in general (their attention is compute-bound because there's no KV caching, I think), so it results in a really weird compute/bandwidth tradeoff.
Overall I think it's better, though.
I do think there's probably an MoE diffusion formulation that absolutely zooms by activating all experts per step but routing different tokens to different experts (this is favorable compared to dense because it should have higher theoretical arithmetic intensity), but to my knowledge nobody has actually done that, and the MoE formulation sounds like an absolute headache. It would make it a nightmare to run on CPU, too. I'd have to think about the specifics a bit more, though.
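For what it's worth, the routing itself would most likely still be per token and per layer: at each denoising step, every (still noised) position gets its own top-k experts, and they all get computed in the same forward pass. A minimal sketch of that kind of token-wise top-k routing, with hypothetical names (`router_w`, `experts`) and no claim about what LLaDA2.0 actually does:

```python
import torch

def moe_layer(x, router_w, experts, top_k=2):
    """Token-wise top-k MoE routing: x is (tokens, d_model), and every position in
    the block is routed in the same forward pass of a denoising step."""
    scores = (x @ router_w).softmax(-1)                # (tokens, n_experts)
    weights, chosen = scores.topk(top_k, dim=-1)       # top-k experts per token
    weights = weights / weights.sum(-1, keepdim=True)  # renormalise the chosen weights
    out = torch.zeros_like(x)
    for e, expert in enumerate(experts):               # experts: list of small MLP modules
        hit_tok, hit_slot = (chosen == e).nonzero(as_tuple=True)
        if hit_tok.numel():
            out[hit_tok] += weights[hit_tok, hit_slot, None] * expert(x[hit_tok])
    return out
```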
Interesting_Fun5410@reddit
Who knows, maybe a freak model is born that works well using a fast NVMe as a VRAM extension. The world is your oyster.
Double_Cause4609@reddit
Existing autoregressive MoEs can already do that, and have been shown to work well-ish in that regime. If you have about 1/2 the parameters in-memory you don't really lose that much speed streaming from SSD (on Linux) with reasonably fast storage.
In particular, Maverick did quite well in this respect for raw decoding speed, due to a rather large shared expert. Presumably you could do a model with an oversized shared expert for reasoning that fits on a cheap GPU, plus a conditional MoE part that is absolutely enormous but has so few active parameters that you can basically just stream it off NVMe for raw general knowledge / background.
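Schematically, that placement idea is: keep the (oversized) shared expert resident on the GPU, and leave the enormous routed-expert weights memory-mapped on NVMe so the OS only pages in the few experts each token actually activates. A toy sketch of the split, with made-up shapes and a hypothetical `experts.bin` file; real inference engines do this with mmap'd GGUF weights rather than anything hand-rolled like this:

```python
import numpy as np
import torch

D, N_EXPERTS = 4096, 256

# Shared expert: small, always active, kept resident on the GPU.
shared_w = torch.randn(D, D, device="cuda", dtype=torch.float16)

# Routed experts: huge, memory-mapped on NVMe; only touched pages ever get read.
routed_w = np.memmap("experts.bin", dtype=np.float16, mode="r", shape=(N_EXPERTS, D, D))

def forward(x, expert_ids):
    """x: (tokens, D) on the GPU; expert_ids: one chosen expert per token."""
    y = x @ shared_w                                               # always-on shared path
    for t, e in enumerate(expert_ids):
        w = torch.from_numpy(np.array(routed_w[e])).to(x.device)   # pages in one expert from disk
        y[t] += x[t] @ w                                           # conditional path, streamed
    return y
```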
Going further in that direction I think you're looking more into event driven arches like Spiking Neural Networks, or possibly non-parametric systems like large knowledge bases, etc.
MmmmMorphine@reddit
This is a fascinating area that I barely understand, would you have any decently performing models to recommend (preferably ones with reasonable documentation) that I can run and study?
In particular, I've got a situation with very limited VRAM but fucktons of RAM and fresh SSDs to thrash, but man, there's so much going on that I'm never sure I'm following the right threads of development.
Double_Cause4609@reddit
Llama 4 Maverick is the poster child for that method of running (it has very few parameters changing between any two tokens compared to other arches); DeepSeek V3 / R1 and Kimi K2 are all amenable too. Jamba 1.7 full and GLM 4.6 are interesting models, but they have more of their parameters as conditional experts than the others, so they're not quite as clean to run this way. While it's a bit of a waste (it feels weird to run smaller models this way), Llama 4 Scout and GLM 4.5 Air do work, too.
Mostly, just build llama.cpp, pick a model + quant whose memory requirement is less than about 2x your available RAM, and let it rip, more or less.
If you want to do something custom in Transformers, you're probably going to want to look into meta devices, etc.
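If you go the custom-Transformers route, the usual pattern is to let the model skeleton be built on the meta device and have Accelerate place and offload the weights; a minimal sketch, where the checkpoint id and offload folder are placeholders:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# device_map="auto" fills the GPU first, then CPU RAM, then spills the rest to disk;
# under the hood the model skeleton is created on the meta device before weights load.
model = AutoModelForCausalLM.from_pretrained(
    "some-org/some-big-moe",        # placeholder checkpoint id
    torch_dtype=torch.bfloat16,
    device_map="auto",
    offload_folder="offload",       # placeholder path for disk offload
)
tokenizer = AutoTokenizer.from_pretrained("some-org/some-big-moe")
```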
No_Afternoon_4260@reddit
The ongoing llama.cpp pull request seems like a possible "nightmare" on CPU.
SlowFail2433@reddit
On the image side, HiDream does MoE with expert routers, more like an LLM.
HawkObjective5498@reddit
I didn't find the paper (if you do find it, please link it and I'll read it). But most likely each noised token is passed to the router to determine its experts, just like in an autoregressive Transformer. There's no problem with the fact that they are computed in parallel.
SlowFail2433@reddit
It can be both
LongPutsAndLongPutts@reddit
I'm interested in the inference speed compared to traditional transformer models
Finanzamt_Endgegner@reddit
UPDATE:
I've found out I'd forgotten about a simplification with the KV cache that speeds this model up quite a bit over long context, making it actually usable. I'm currently cleaning up my source to push this to the PR, so in a few hours you should be able to test performance again with greatly improved speed (at least in real-world usage)!
time per step: 19.47 ms -> 4.28 ms in a 700-token generation
Kamal965@reddit
I switched to u/Finanzamt_Endgegner's PR, downloaded the 16BA1B MoE, quantized it and ran llama-bench:
Q8_0:
This is on 2x MI50 32GB. For comparison, that's faster than GPT-OSS-20B for me, in both prefill and TG, and GPT-OSS is MXFP4, mind you. As for actual quality? I haven't fully tested it yet, as I'm playing around with CLI flags lol.
Finanzamt_Endgegner@reddit
Though I'm not 100% certain this will translate into actually faster performance in the end, since the diffusion steps might get calculated differently 🤔
Finanzamt_Endgegner@reddit
nice!
SlowFail2433@reddit
It's a bit too early to have good speed estimates for this type of model. I've been working on a compiler for GB300 CUDA kernels for this, but it's still early-stage. The cache and register usage patterns are very different. You also lose the token-wise KV cache, which eliminates a lot of the existing speedup infrastructure.
Finanzamt_Endgegner@reddit
You could look into OpenEvolve for that, just as a tip. It's basically an open-source version of AlphaEvolve and uses LLMs to evolve and optimize code; if you set up a proper framework with correctness tests and benchmarking, you can squeeze out a good amount of performance over human-written code in most cases (;
But it still helps to start from an already-optimized kernel, so the search begins in the right direction.
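The loop such a system runs is simple in outline: keep a small population of candidate kernels, let an LLM propose mutations, and only keep candidates that pass correctness tests and beat their parent on a benchmark. A schematic sketch of that loop; this is not OpenEvolve's actual API, and `propose_variant`, `is_correct`, and `benchmark` are placeholders you would supply:

```python
import random

def evolve_kernel(seed_source, propose_variant, is_correct, benchmark,
                  population_size=8, generations=50):
    """LLM-guided evolutionary search over kernel source code (schematic)."""
    population = [(seed_source, benchmark(seed_source))]
    for _ in range(generations):
        parent_src, parent_score = random.choice(population)
        child_src = propose_variant(parent_src)       # LLM rewrites the kernel
        if not is_correct(child_src):                 # reject anything that fails the tests
            continue
        child_score = benchmark(child_src)            # e.g. achieved GB/s or runtime
        if child_score > parent_score:
            population.append((child_src, child_score))
            population.sort(key=lambda p: p[1], reverse=True)
            del population[population_size:]          # keep only the fittest candidates
    return population[0]                              # best (source, score) found so far
```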
Iq1pl@reddit
I tried llada-moe-7b-a1b-instruct and it's not that fast; qwen3-2b is faster.
Finanzamt_Endgegner@reddit
You can already compile from my source https://github.com/wsbagnsv1/llama.cpp, though the currently uploaded GGUF for the preview is old and missing some args; I'll update that in the next 30-40 minutes (;
Finanzamt_Endgegner@reddit
Since it's a diffusion model you should use llama-diffusion-cli, and for this model you should use:
--diffusion-steps 4096 (however many tokens you want to generate)
-n 4096 (I think this needs to match the diffusion steps; I'm not 100% sure though 😅)
--diffusion-block-length 32
--temp 0.0
You can test around a bit, but those work for me (;
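Put together, the invocation looks roughly like the sketch below, which just shells out to llama-diffusion-cli with the flags above; the binary path, GGUF filename, and prompt are placeholders, not files from the PR:

```python
import subprocess

cmd = [
    "./llama-diffusion-cli",                     # built from the fork / PR branch
    "-m", "LLaDA2.0-mini-16BA1B-Q8_0.gguf",      # placeholder GGUF path
    "-p", "Write a short summary of block diffusion.",
    "--diffusion-steps", "4096",                 # however many tokens you want to generate
    "-n", "4096",                                # reportedly should match --diffusion-steps
    "--diffusion-block-length", "32",
    "--temp", "0.0",
]
subprocess.run(cmd, check=True)
```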
jacek2023@reddit (OP)
You can try diffusion models in llama.cpp already
bennmann@reddit
any chance of 256K++ context expansion?
Finanzamt_Endgegner@reddit
Okay, update: I've found a new optimization I'm going to implement next in the PR, which should improve long-context performance a LOT.
Finanzamt_Endgegner@reddit
I mean, you could test it out, but I'm not sure performance will be great /:
Adventurous_Cat_1559@reddit
Ohh, any word on the 103B in GGUF?
Finanzamt_Endgegner@reddit
Well, in theory I COULD upload a GGUF for it and you could run it with my fork https://github.com/wsbagnsv1/llama.cpp
but I'm not sure that's advisable yet, because there might be changes to it before it gets merged into llama.cpp 😅
You could, however, convert and quantize it yourself (;
Zc5Gwu@reddit
The PR still looks like it’s a draft…
Finanzamt_Endgegner@reddit
I've now opened the PR for review (;
jacek2023@reddit (OP)
yes it's a draft
Finanzamt_Endgegner@reddit
Yep, but it works generally (source: I'm the one who made it 😅).
I don't think there are any major issues left; it's just that I want to clean up the code more before opening the PR (;
Few_Painter_5588@reddit
Oh nice, do you work with the team? Just wondering, will you guys be working on a larger model?
Finanzamt_Endgegner@reddit
I was just interested in different model architectures and opened a feature request in llama.cpp, and since no one wanted to do it, I thought I'd try it myself (;
Finanzamt_Endgegner@reddit
Nope, I'm just a rando 😅
But you can ask InclusionAI on their Discord; they generally answer (;
CasulaScience@reddit
Out of date for LLaDA 2, but I made a 2-minute explainer for LLaDA 1 if people are starting from scratch: https://youtube.com/shorts/_6jekTwBxow?si=ldwG-xmmdNNw1OUo
atineiatte@reddit
Diffusion models are cool, but personally it's hard to get excited over weekly sparse MoE #53 anymore
my_name_isnt_clever@reddit
I'm loving them with my Strix Halo. More ~120b MoEs with <10b active please.
do-un-to@reddit
I'm developing an interest in Strix Halo. What do you have your Strix Halo in? Is it your only GPU? How well does it perform generally? Does it end up being complicated knowing what LLMs you can run efficiently?
my_name_isnt_clever@reddit
I have the Framework Desktop, 128 GB. It gives you a lot of VRAM headroom (up to 90 GB on Windows, 121 GB on Linux) to run models on, but the memory bandwidth holds it back from the performance of a dedicated GPU. My other system is a Mac so it was the best way to run larger LLMs for the price.
A model like GPT-OSS-120B runs like a dream because it's only 5B active, but anything dense at 32B and up is so slow as to be almost unusable. It's complicated, but it's really just local LLMs that are complicated, not this arch in particular. It's really all about the active param count when you're memory-bandwidth limited; the hardware handles MoE really well.
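As a rough back-of-the-envelope (my assumptions, not the commenter's numbers): with Strix Halo's roughly 256 GB/s of memory bandwidth, decode speed is capped by how many bytes of weights each token has to read, which is why the active parameter count dominates:

```python
def decode_ceiling_tok_s(active_params_b, bytes_per_param, bandwidth_gb_s=256.0):
    """Upper bound on decode tokens/s when memory-bandwidth limited
    (ignores KV-cache traffic, activations, and compute)."""
    bytes_per_token = active_params_b * 1e9 * bytes_per_param
    return bandwidth_gb_s * 1e9 / bytes_per_token

print(decode_ceiling_tok_s(5, 0.5))   # ~5B active at ~4-bit -> ceiling around 100 tok/s
print(decode_ceiling_tok_s(32, 1.0))  # 32B dense at 8-bit   -> ceiling around 8 tok/s
```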
do-un-to@reddit
The Framework Desktop looks super appealing, and I like the idea of supporting Framework.
Soldered RAM feels really gross to me, but I get it. For the efficiency trade-off, I'm okay with it. From what I'm seeing, the efficiency of the Strix Halo is fantastic. The performance you can get for wattage and temperature means that you can have human-compatible fan noise and heat while doing serious computing. How are you liking your machine from that perspective?
With MoEs seemingly growing more popular, all signs point to a setup like yours.
my_name_isnt_clever@reddit
Yeah I get the RAM issue but in this case there's a technical reason for it. That's why I got the 128 GB, it's just a bad idea to get anything but the most possible RAM on this chip.
And yes, I agree on the efficiency! I've been downvoted in this sub for pointing out that this setup will save hundreds on my power bills annually over the equivalent NVIDIA chips. And I love that I just have this little box on my desk that's secretly my powerhouse machine; I've never noticed it making sound or heat. But I haven't left it at full tilt for very long, I should try that and see how it does. It's unfortunate that the memory bandwidth limitation means you can't take advantage of the full compute in inference, but at least it can multitask well.
I'd be happy to chat more if you have more questions.
jacek2023@reddit (OP)
please elaborate
Freonr2@reddit
At least for flash, their benchmarks seem roughly similar to Qwen3 30B A3B (similar or trading some blows), and they didn't include similar-size, well-known models like GPT-OSS-120B or GLM 4.5 Air, which would have been a fairer comparison IMO.
It's neat, a curiosity since it is diffusion, but not something I'm terribly excited to try.
The note says they are going to run some RL, so maybe it will be more impressive later.
shrug
No-Refrigerator-1672@reddit
Comparison against 30B A3B already says everything we need to know. It's roughly 3x more memory and 6x more compute for the same result. This thing should be treated as a research piece: it's the first time (to my knowledge) that somebody has made a 100B diffusion model; it outputs something usable, which proves this architecture has a right to exist; but it'll take a year or two of additional research to make it match transformers.
Freonr2@reddit
Yeah not trying to piss in their cereal, it's cool that a diffusion model is at least doing a fairly decent job.
atineiatte@reddit
They don't perform in line with the amount of space they take up in VRAM, and CPU inference is intolerably slow. A big MoE model may have more world knowledge than a mid-size dense model, but the latter will still do a better job chewing through 80k tokens of project context with a complex technical writing prompt.
roz303@reddit
Ladies and gentlemen, this is MoE Number 5...
(Cue the Mambo No. 5 parody song)
ortegaalfredo@reddit
If LLMs look like movie magic, this thing looks like alien technology. I will never accept that generating code via diffusion is real.
__Maximum__@reddit
More local diffusion models, yay! I've got so many questions, though.
What is the speed of this on CPU/GPU compared to Qwen3-30B-A3B? If it's not way faster, is this model really meant for end users, given that the two have about the same performance on benchmarks?
I see how the 2.0 mini can replace Qwen3-8B, since it has way fewer active params; it looks pretty great. But I'm not sure about the use case for 2.0 flash, since it has more active params than Qwen3-30B-A3B.
And what is the difference between preview and non-preview? Just more training?
Badger-Purple@reddit
If I learned anything in r/LocalLLaMA, it's this: reserve your praise for when you test it yourself. As others have mentioned, cherry-picking comparisons is so widespread now in AI that none of the benchmark results are truly reliable and consistent.
As a rant, I don't understand this issue. Why not be scientific and thorough in your tests: show comparisons to similar models, show the same benchmarks the compared models reported, and also show the benchmarks they reported but you did not?
Honesty would go a long way in determining how good your model is, if that's what you're interested in rather than download counts, I believe.
__Maximum__@reddit
I don't think that applies here. They compared it with a model that makes them look bad, hence my question: what is the benefit of running the full 2.0 flash if Qwen3-30B-A3B is on the same level of performance but needs fewer params?
stoppableDissolution@reddit
There might be no benefit yet. Text diffusion is still very uncharted territory, and these models are basically a public PoC. It's great if it can solve some real tasks, but that's ultimately not the main goal, I would assume.
__Maximum__@reddit
I agree, but that raises the question of why the mini model does that well.
SlowFail2433@reddit
It didn't really include the main frontier benches (AIME, SWE-bench, HLE, ARC-AGI-2, and agentic benches like tau-bench), so it just needs more benchmarking, really.
__Maximum__@reddit
It scored 60 on AIME25.
SWE-bench would make sense, but Qwen3-30B-A3B also hasn't published SWE results.
HLE and ARC wouldn't make sense for small models like these. Maybe next year we'll see small models score on those.
SlowFail2433@reddit
Thanks, I see the AIME score now. I like to see HLE/ARC anyway, even if the score is low, but I guess they leave off low scores for marketing.
MatlowAI@reddit
OK, it's really cool that the performance looks this promising. So when do we get Qwen-Edit-type principles applied here, with even narrower LoRAs on top? It works great with images, which makes me wonder if it's applicable here...
radarsat1@reddit
I'm not clear on how text diffusion models handle token shift during inference. Like, when the model decides that a token needs to be inserted or deleted, does that actually happen, or are all token positions essentially frozen from the first step?
mantafloppy@reddit
By their own benchmarks, they are barely equal, often worse, and rarely better than Qwen3-30B-A3B-Instruct-2507.
Why would you ever use a much bigger model for worse performance? What am I missing?
power97992@reddit
Wow, 100B but with the performance of a 30B model; that doesn't sound great...
Sufficient-Bid3874@reddit
16BA1B will be interesting for 16GB Mac users. Hoping for 8B-class performance from this.
hapliniste@reddit
Personally I expect 2-4B performance from this, because any model with fewer than 4B active parameters is ass. Still a great choice here if all you want is speed.
SlowFail2433@reddit
Under 4B active is pretty rough, yeah, because the internal representations end up being fairly low-rank, so it's harder for them to represent complex hierarchical structures. Having said that, a decent number of tasks fit within that limitation just fine. Only a certain proportion of tasks require a high-dimensional internal representation.
Sufficient-Bid3874@reddit
Good point.
Zyguard7777777@reddit
It might be worth highlighting the benefits and trade-offs of diffusion models on the model cards.
SlowFail2433@reddit
Two big benefits: it's block-causal, so within a block attention is bidirectional, which can aid text understanding. Weirdly, it can also help with writing backwards.
The main benefit, though, is speed, since it can work on a whole block at once. It's like speculative decoding on steroids.
From a quantitative-research standpoint it has an additional advantage: a different inductive bias from autoregressive LLMs, which matters for certain forms of quantitative modelling.
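For anyone unfamiliar with the term, "block causal" means attention is bidirectional inside a block but causal across blocks; a small sketch of how such a mask can be built (block size and sequence length are arbitrary here):

```python
import torch

def block_causal_mask(seq_len, block_len):
    """True where attention is allowed: full attention within a block,
    plus attention to all earlier blocks."""
    block_idx = torch.arange(seq_len) // block_len   # block index of each position
    return block_idx[:, None] >= block_idx[None, :]  # query's block >= key's block

mask = block_causal_mask(seq_len=8, block_len=4)
# Positions 0-3 attend to 0-3; positions 4-7 attend to 0-7.
```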
jacek2023@reddit (OP)
Well, it may be a good idea to give them feedback in the community comments.
Zyguard7777777@reddit
Good shout
WithoutReason1729@reddit
Your post is getting popular and we just featured it on our Discord! Come check it out!
You've also been given a special flair for your contribution. We appreciate your post!
I am a bot and this action was performed automatically.