First direct side by side MoE vs Dense comparison.
Posted by Different_Fix_2217@reddit | LocalLLaMA | 32 comments
ambient_temp_xeno@reddit
For their small model, sure. It's not replicated with the 26-35b moe vs 27-31b dense Qwen 3.5 and Gemma 4 models.
FullOf_Bad_Ideas@reddit
InclusionAI trained their 1T models based on this paper.
ambient_temp_xeno@reddit
That's super great, whoever they are.
Different_Fix_2217@reddit (OP)
The point was that they trained them side by side with the same method / dataset / amount of tokens. So this is a far better comparison.
ambient_temp_xeno@reddit
Is it your paper? I'm just not sure why you posted it on a forum website if it's not even allowed to be challenged for discussion.
cagriuluc@reddit
Nah, it is allowed to be challenged, you're just not doing a good job of it.
Someone else said it’s not TOO different from a previous rule-of-thumb. I don’t know if it’s actually true, but it sounds like a good way to challenge the paper.
I don’t know what you mean by your comment, though? I suspect you are claiming they didn’t compare it with anything big but… I also suspect you got the whole thing wrong.
ambient_temp_xeno@reddit
Well I was getting warmed up before everyone just downvoted a perfectly cromulent comment (regardless of how good it was).
I'm saying they've cooked up a new 'scaling law' but it only applies to their little 1T-token models. It doesn't seem to match up with real-world models.
cagriuluc@reddit
Downvoting means you think a comment is bad, so it’s not “regardless of how good/bad it was”.
If you have written something like: they didn’t compare Qwen3.5 122BA10B with Qwen3.5 27B, then I would see your point kinda. But I still wouldn’t agree with you.
ambient_temp_xeno@reddit
You're not supposed to downvote just because you disagree with something! Bloody hell...
Anyway this part really bothers me about their experiment:
It's like they've altered the whole thing to produce the results they wanted.
cagriuluc@reddit
I really don’t understand what’s “bloody hell” about downvoting unintelligible comments… if it was well written, it probably wouldn’t have been downvoted even if people disagreed with you. I also don’t see why one can’t downvote something just because they don’t agree with it; what do you believe the downvote is for?
Your point about the paper is legit. You are right to question the methodology, at least because they deviate from previously used methods.
ambient_temp_xeno@reddit
Downvote is supposed to be for off topic/rude comments, but do what you want.
cagriuluc@reddit
Your comment is almost off topic, because it looks like you are complaining about them not comparing similar-size MoE and dense models. Not the point of the post, for sure.
ambient_temp_xeno@reddit
What is the point of the post, anyway?
It feels like you want everyone to just absorb the abstract of the paper and carry on consuming uncritically.
I'm old enough to remember the 'less is more for alignment' paper and how that put everyone, including me, on a wild goose chase for a while.
cagriuluc@reddit
Look… I don’t think you mean they should have compared some 30B MoE with a 30B dense model. That doesn’t make sense, right? But your original comment implies you think that. That’s the reason you are downvoted. Other people are arguing their points in the thread just fine; you just miscommunicated what you actually meant.
ambient_temp_xeno@reddit
Just look at the post title. He's called it a 'direct side by side moe v dense comparison' (the fact it isn't notwithstanding)
ambient_temp_xeno@reddit
Oh yeah, like "this post is from last year", great contribution lol.
Serprotease@reddit
Per their results, the 26-35b MoE should perform better than the 8-9b. They don’t really compare MoE with bigger dense models.
Another interesting point is that MoE seems to be more impacted by the amount of compute available. Since we know that this is what Chinese open-weight models lack compared to their US counterparts, we may expect even better small MoEs in the future.
ResidentPositive4122@reddit
How so? Is 35BA3 not better than 9B dense? Or 122BA10 better than 27B dense? We can't compare gemmas since we don't have smaller dense models...
ambient_temp_xeno@reddit
I'm not even going to talk to plebs on here who downvote because they disagree. fuck off
ResidentPositive4122@reddit
That wasn't me, champ - https://ibb.co/KcZDP2mJ
ambient_temp_xeno@reddit
Ok, I apologize. But 35BA3 has 3.8 times the parameters of 9B; it's not what I'm talking about. They only trained their models on 1T tokens, so it's just not comparable to the real world.
FullOf_Bad_Ideas@reddit
I'm a huge fan of this paper and inclusionAI's research.
Here's an old vibe-coded Gradio tool, built on their formulas, that can help you estimate the EL of your MoE - https://github.com/adamo1139/Ling-V2/blob/main/gradio_model_chooser.py
I used it to decide on the configuration of the small pre-trained MoE that I've been working on in my spare time, Poziomka. It's also based on their BailingMoEV2 architecture; I pre-trained it on ~80B Polish-language tokens, including 28B locally. It's Polish-only, so it won't be of interest to you if you don't know Polish.
In practice I found that EL should be taken only as a guide; it's crucial not to overlook the MFU of your GPUs. Even if your model has good effective leverage, if your compute usage is low because the model is very sparse and the GPUs are idling a lot, the model will just not be that great.
It's great for conceptualizing how model creators decide on design choices for their models. You need high EL first, and then a hardware configuration that keeps the GPUs really busy; that should deliver a good model trained cheaply.
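To make the MFU point concrete, here's a minimal sketch using the standard ~6N FLOPs-per-token training approximation; the hardware peak and throughput numbers below are made-up illustrations, not measurements from the paper or from Poziomka:

```python
# A rough sketch (not InclusionAI's code) of why sparsity can hurt MFU.
# All throughput numbers are hypothetical examples.

def train_flops_per_token(active_params: float) -> float:
    """Standard ~6N approximation for training FLOPs per token
    (forward + backward); only *active* parameters count for a MoE."""
    return 6.0 * active_params

def mfu(tokens_per_sec: float, active_params: float, peak_flops: float) -> float:
    """Model FLOPs Utilization: achieved model FLOPs / peak hardware FLOPs."""
    return tokens_per_sec * train_flops_per_token(active_params) / peak_flops

# Hypothetical peak: ~312 TFLOP/s (an A100 at BF16).
PEAK = 312e12

# A very sparse MoE may push more tokens/s yet still leave the GPU idle:
print(f"dense 9B:        MFU = {mfu(4_000, 9e9, PEAK):.1%}")    # ~69%
print(f"MoE 35B-A3B:     MFU = {mfu(9_000, 3e9, PEAK):.1%}")    # ~52%
print(f"MoE 17.5B-A0.8B: MFU = {mfu(20_000, 0.8e9, PEAK):.1%}") # ~31%
```

So a high-EL but extremely sparse design can still lose in practice if the resulting per-step compute is too small to keep the hardware busy.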
Middle_Bullfrog_6173@reddit
While "old" in terms of AI time, it is an interesting paper. The problem in applying it to production models is that it's about compute optimal training. Almost all real models are overtrained to make inference cheaper . My intuition is that it doesn't change the big picture, but...
FullOf_Bad_Ideas@reddit
It's not only about compute-optimal models; it takes overtraining into account as it derives the whole formula, and you can plug overtrained models into it too.
Endlesscrysis@reddit
This is almost a year old?
FullOf_Bad_Ideas@reddit
I like that paper a lot and I used it for calculating EL for pretraining my small 4B MoE.
I link it in my comments often and my last comment where I linked it got a lot of upvotes. I think that this is causing it to re-circulate now among people who missed it.
Comment - https://www.reddit.com/r/LocalLLaMA/comments/1svbmnc/decreased_intelligence_density_in_deepseek_v4_pro/oi7gg1z/
k_means_clusterfuck@reddit
Not first. https://arxiv.org/abs/2508.18672 was published before
FullOf_Bad_Ideas@reddit
InclusionAI's paper was released on arXiv in July 2025. The paper you linked was released in August 2025. I think neither of them was the first; there were some other papers before, but I find the InclusionAI paper to be the most in-depth, and it's the one that derives a complete formula for calculating effective leverage.
ResidentPositive4122@reddit
This is a bit better than the sqrt(A*T) "rule of thumb" we've been using since early Mistral times: sqrt(0.8 * 17.5) ≈ 3.7, so ~4B; they seem to match it to ~6B. So a bit better (probably the sparseness changed things; Mistral was experimenting with less sparse MoEs at the time, 8x7B with 2 active...).
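For reference, a minimal worked version of that rule of thumb (this is the community heuristic mentioned above, not the paper's effective-leverage formula):

```python
from math import sqrt

def dense_equivalent(active_b: float, total_b: float) -> float:
    """Community rule of thumb: a MoE with A active and T total params
    behaves roughly like a sqrt(A * T)-parameter dense model
    (all values in billions)."""
    return sqrt(active_b * total_b)

# The 17.5B-total / 0.8B-active MoE from the comparison:
print(dense_equivalent(0.8, 17.5))  # ~3.74 -> "about a 4B dense model"
```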
Different_Fix_2217@reddit (OP)
Two issues: you're missing the active-parameter comparison, and the fact that the 17.5B performed a good deal better in the comparison.
ResidentPositive4122@reddit
Sorry, what am I missing? They compare a 17.5BA0.8 with a 6.1B dense.
Different_Fix_2217@reddit (OP)
Just to account for the MoE performing better, particularly where more knowledge matters.