Why MoE models keep converging on ~10B active parameters
Posted by Spare_Pair_9198@reddit | LocalLLaMA | 27 comments
Interesting pattern: despite wildly different total sizes, many recent MoE models land around 10B active params. Qwen 3.5 122B activates 10B. MiniMax M2.7 runs 230B total with 10B active via Top 2 routing.
Training cost scales as C ≈ 6 × N_active × T. At 10B active and 15T tokens, you get ~9e23 FLOPs, roughly 1/7th of a dense 70B on equivalent data. The economics practically force this convergence.
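Sanity-checking that arithmetic with the numbers above (a toy calculation, nothing more):

```python
# Back-of-envelope check of C ≈ 6 * N_active * T
def train_flops(n_active, tokens):
    return 6 * n_active * tokens

moe = train_flops(10e9, 15e12)    # 10B active params, 15T tokens
dense = train_flops(70e9, 15e12)  # dense 70B on the same data
print(f"{moe:.1e}")        # 9.0e+23
print(f"{dense / moe:.0f}")  # 7 -> the MoE run is ~1/7th the compute
```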
Has anyone measured real inference memory scaling when expert count increases but active params stay fixed? KV cache seems to dominate past 32k context regardless.
Embarrassed_Adagio28@reddit
Add "Qwen3 coder next" to that list, 80b total with 10b active. It's still the best agentic coder imo.
EffectiveCeilingFan@reddit
KV cache is only really a concern for full attention models like MiniMax, which are starting to fall out of style. Qwen3.5 KV is teeny tiny. 128k is 4GB at BF16 if my memory serves me right. Practically nothing compared to a 120B MoE. Gemma 4 uses even less since K and V are unified.
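For reference, KV cache size is just 2 (K and V) × layers × kv_heads × head_dim × seq_len × bytes. A quick sketch with hypothetical GQA numbers (not the real Qwen3.5 config, just a config that happens to land at 4 GiB):

```python
# KV cache bytes = 2 (K and V) * layers * kv_heads * head_dim * seq_len * bytes/elem
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, bytes_per=2):  # BF16 = 2 bytes
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per

# hypothetical aggressive-GQA config, chosen for illustration only
gib = kv_cache_bytes(layers=32, kv_heads=2, head_dim=128, seq_len=131072) / 2**30
print(f"{gib:.1f} GiB")  # 4.0 GiB at 128k context
```

With full multi-head attention (kv_heads = heads) the same formula blows up by an order of magnitude, which is the MiniMax problem.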
nuclearbananana@reddit
Bot post. Two out of like fifty models is not "keep converging"
OcelotMadness@reddit
They don't look like a bot, they're active in other subs. You however aren't auditable so maybe you're the bot lol
nuclearbananana@reddit
All their comments look like bot comments and they haven't replied here once. It's obviously a bot
catplusplusok@reddit
You can make your own tests with simple vLLM patches or whatever: try activating fewer experts per token and see the differences in speed and quality. Or potentially more, but since the model isn't trained for that, it may need finetuning to get more smarts that way.
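The knob being tweaked is just the top-k in the router. A minimal pure-Python sketch of top-k gating (hypothetical gate scores, renormalized over the selected experts) to show what "fewer experts per token" changes:

```python
import math

# Top-k expert routing sketch: pick the k highest-scoring experts for a token
# and softmax-renormalize their weights. Lowering k cuts per-token compute.
def route(gate_logits, k):
    topk = sorted(range(len(gate_logits)), key=lambda i: gate_logits[i])[-k:]
    exps = [math.exp(gate_logits[i]) for i in topk]
    total = sum(exps)
    return topk, [e / total for e in exps]

# one token's (made-up) scores over 4 experts
experts, weights = route([0.1, 2.0, -1.0, 1.5], k=2)
print(experts, [round(w, 2) for w in weights])  # [3, 1] [0.38, 0.62]
```

Forcing k below what the router was trained with redistributes weight onto fewer experts, which is exactly why quality usually drops without finetuning.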
a_beautiful_rhind@reddit
It's simply cheaper but not better. 10b is easy to compute on a wide range of hardware. It's easier to train for longer on the tasks you predict the users will want. As a result nobody notices the deficiencies until they do.
silenceimpaired@reddit
Yeah, I’m beginning to think active parameters impact “wisdom” while total parameters impact “knowledge”. I just went straight with the Qwen 3.5 ~30b dense and never touched the 120b after seeing benchmarks.
the__storm@reddit
Begone, bot.
twnznz@reddit
My guess is they're converging on memory bandwidth that a DDR4 Huawei Ascend can sustain with reasonable performance.
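That ceiling is easy to estimate: at batch 1, every decoded token has to stream all active weights through memory once, so bandwidth caps tokens/sec. A rough sketch (the 400 GB/s figure is an illustrative assumption, not a measured Ascend spec):

```python
# Bandwidth-bound decode ceiling: each token streams all active weights once.
def max_tok_per_s(bandwidth_gbs, active_params, bytes_per_param=2):  # BF16 weights
    return bandwidth_gbs * 1e9 / (active_params * bytes_per_param)

print(f"{max_tok_per_s(400, 10e9):.0f} tok/s")  # 20 tok/s at 400 GB/s, 10B active
```

Double the active params and the ceiling halves, which is a strong reason to stop growing N_active once you hit "fast enough."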
Equal-Coyote2023@reddit
que es ascend?
Cold_Tree190@reddit
Huawei’s ai data center chips they’re manufacturing in China to compete domestically with Nvidia
Acceptable-Yam2542@reddit
so the sweet spot is basically one 4090 worth of active params. makes sense tbh.
4xi0m4@reddit
The training cost formula really does pull in the same direction from both ends. C ≈ 6 × N_active × T means the FLOPs budget is directly proportional to active params, so for a fixed compute budget there is an inherent incentive to push N_active as low as the quality floor allows. The inference-side sweet spot of ~10B active hitting the memory bandwidth ceiling on common hardware just compounds that signal. Both converging on the same number is one of those things that looks like coincidence until you realize the constraints are what they are.
rustedrobot@reddit
Small models (<30b active) permit architectural experimentation without massive cost, with the benefit that the public will adopt them and provide feedback for you. If you stretch up into the 30b+ active range (GLM-5) the results are notably different but not practical for the masses and edge compute. The SOTA non-public models that seed everything are larger.
Internally the major players have worked out the GPU topologies to support this efficiently. The fabric between GPUs is now getting specialized to the (MOE) models and other emerging techniques that have compartmentalized NNs.
It's all about the nuance and depth of context and more parameters is king.
Specialist_Golf8133@reddit
honestly think we're watching architecture meet hardware in real time. like 10B active hits this sweet spot where you get meaningful compute without blowing your inference budget, and every lab independently landed there. kinda wild that the 'natural' size for useful sparsity maps so cleanly to what fits in memory. makes you wonder if that number shifts hard once we get different gpu configs
BeneficialVillage148@reddit
Yeah it really feels like an economic sweet spot more than a coincidence
You get near big-model quality while keeping training and inference costs manageable, so everyone ends up around that ~10B active range. Pretty interesting how MoE is shaping that balance.
Enough_Big4191@reddit
I haven’t seen super clean numbers published, but in practice the gains flatten pretty fast once active params are fixed. Routing more experts mostly hits you on memory overhead and latency, not so much the core compute. And yeah, once you push past longer contexts, KV cache becomes the thing you’re actually paying for, not the experts. Curious, are you looking at this for long-context use cases or more standard 4–8k? The tradeoffs feel very different depending on that.
Fun_Nebula_9682@reddit
the training economics argument tracks, but there's also a strong inference-side pull toward 10B active. a dense 70B needs 140GB+ to serve, but with MoE you get 10B worth of active compute per token while the rest sits cold in VRAM. near-70B quality at near-10B inference cost per token
both training and inference economics pointing at the same number feels less like coincidence at this point. 10B also roughly saturates the memory bandwidth of a single modern GPU at batch=1, which probably reinforces the convergence from yet another direction
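The asymmetry in that tradeoff is easy to put numbers on: weight memory scales with total params while per-token compute scales with active params (forward pass ≈ 2 FLOPs per active param). A toy profile using MiniMax-like numbers from the post:

```python
# Serving footprint vs per-token compute for a 230B-total / 10B-active MoE.
def serve_profile(total_params, active_params, bytes_per_param=2):  # BF16
    weights_gb = total_params * bytes_per_param / 1e9  # what must sit in (V)RAM
    flops_per_token = 2 * active_params                # what you pay per token
    return weights_gb, flops_per_token

w, f = serve_profile(230e9, 10e9)
print(f"{w:.0f} GB weights, {f:.0e} FLOPs/token")  # 460 GB weights, 2e+10 FLOPs/token
```

So you pay dense-10B compute per token but dense-230B memory, which is why the cold-experts-in-VRAM framing above is the right way to think about it.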
HealthyCommunicat@reddit
Mistral 4 small being a6b active made it faster than qwen 3.5 122b-a10b, but its benchmark scores were actually higher - ur questions are interesting indeed, at what size of total parameters does 10b active parameters start not being worth it?
Front-Relief473@reddit
10b to 30b is usually the dessert area of reasoning performance, and the price/performance ratio usually isn't great past 30b. So in theory, raising the activation to 30b would give good reasoning, meaning 10b isn't the most perfect; it's just that 10b improves reasoning speed without reducing the model's reasoning ability too much.
ROS_SDN@reddit
Desert?
yensteel@reddit
I think he truly meant dessert. Aka sweet spot.
LagOps91@reddit
sort of... in the 100-250b range, you often have about 10b active parameters. beyond that we have models with a lot more, but some also use only 10b active, like trinity large (a 400b model). beyond that 400b size, active parameters are often around 30b, sometimes higher.
Aaaaaaaaaeeeee@reddit
I didn't realize Minimax was top 2 experts, that's interesting.
There's research on high granularity models:
https://arxiv.org/abs/2602.05711
https://arxiv.org/abs/2508.18756
Please expand on the memory scaling, what do you mean by that?
stddealer@reddit
For the same reason dense models under ~10B parameters tend to fall apart when it comes to solving more complex tasks.
GroundbreakingMall54@reddit
honestly i think its because 10B active is roughly the sweet spot where you get good enough reasoning without needing absurd memory bandwidth. like theres a hardware ceiling most people hit and the model designers know it. fitting on consumer gpus matters more than raw param count at this point