Are dynamic MoE models possible?
Posted by CurrentNew1039@reddit | LocalLLaMA | 7 comments
Is it possible for a MoE model to decide how many billion parameters to activate per token according to the task? E.g. with Qwen 3.6 35B A3B: if a task is harder, it could activate 10B per token; if it's easy, it could stay at 3B active.
I know there's a speed caveat there, like it will slow down if it exceeds my computer's compute.
But what if we could control how many parameters are active ourselves? A 35B model with dynamic MoE would mean I could make it a dense model by activating all parameters, or keep it sparse by reducing the active parameters.
It's just a theory I had, but it would help larger models run on all kinds of devices by manually adjusting this, and that would be awesome.
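To make it concrete, here's a minimal sketch of the kind of router I'm imagining, in PyTorch. Everything here is made up for illustration (the class name, the threshold rule, the sizes); it's not how Qwen or any real model works. The idea: keep adding experts per token until the cumulative routing probability passes a confidence threshold, so "easy" tokens (where the router is confident) use few experts and "hard" tokens (flatter router distribution) use more:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DynamicTopKMoE(nn.Module):
    """Hypothetical MoE layer where the number of active experts varies per token."""

    def __init__(self, d_model=64, n_experts=8, max_k=4, threshold=0.5):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(n_experts)
        )
        self.max_k = max_k          # hard cap on active experts per token
        self.threshold = threshold  # cumulative router-prob mass to reach

    def forward(self, x):  # x: (num_tokens, d_model)
        probs = F.softmax(self.router(x), dim=-1)
        topp, topi = probs.sort(dim=-1, descending=True)
        # Per-token k: smallest k whose cumulative routing prob exceeds the
        # threshold, clamped to max_k. Confident routing -> fewer experts.
        csum = topp.cumsum(dim=-1)
        k_per_token = ((csum < self.threshold).sum(dim=-1) + 1).clamp(max=self.max_k)

        out = torch.zeros_like(x)
        for t in range(x.size(0)):               # naive loop, for clarity only
            k = int(k_per_token[t])
            w = topp[t, :k] / topp[t, :k].sum()  # renormalize selected probs
            for j in range(k):
                out[t] += w[j] * self.experts[int(topi[t, j])](x[t])
        return out, k_per_token

moe = DynamicTopKMoE()
x = torch.randn(5, 64)
y, ks = moe(x)
print(ks)  # number of active experts per token; varies with random init
```

Raising `threshold` (or `max_k`) would be the "activate more parameters" knob; setting `max_k = n_experts` and `threshold = 1.0` would make it effectively dense.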
Monkey_1505@reddit
You'd need a smart model to determine if the problem is hard or easy anyway.
sine120@reddit
This is essentially what thinking does. More complex problems generate more thinking tokens; simpler, more contained tasks require fewer.
suprjami@reddit
"thinking" has nothing to do with what OP is talking about
No-Refrigerator-1672@reddit
Given that a new set of experts is activated for each token, MoE kinda sorta already uses more neurons for more complex tasks than for simpler ones, although indirectly.
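A toy illustration of that point (made-up sizes, randomly initialized router): with standard fixed top-k routing, the *number* of active experts is constant, but *which* experts fire changes from token to token:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
router = torch.nn.Linear(16, 8)   # toy router: hidden size 16, 8 experts
tokens = torch.randn(4, 16)       # 4 tokens

probs = F.softmax(router(tokens), dim=-1)
top2 = probs.topk(2, dim=-1).indices
print(top2)  # a different pair of expert indices for each token
```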
Psyko38@reddit
To answer that: I don't think so, because currently a model has fixed weights, and a dynamic model would therefore have to change its weights. The only dynamic system you could build would be a dynamic n-gram model over the corpus that changes with the query.
Double_Cause4609@reddit
This is a common question; tons of people ask about it. It's been researched somewhat in the literature, but the conclusion is that MoE models are already complex to train, and dynamic MoE is crazy hard to optimize training infrastructure for. It's not impossible, but your codebase would gain an extra 5k-10k lines of extremely difficult-to-debug-and-profile code at minimum (seriously, code for training LLMs at scale is kind of absurd once you factor in the infra, etc.).
Then, at inference, what does it actually get you?
It gets you a pattern that you can emulate by just doing RLVR, which elicits long thinking traces that already mix information across discrete tokens through the attention mechanism anyway (see the Ling Lite 2 ablations paper and their favorable results from allocating more FLOPs to attention in sparse MoE models).
So TL;DR: It's actually just way easier for a model to think longer than to think wider.
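To put rough numbers on the "longer vs. wider" tradeoff (all values made up, using the standard ~2 × active-params approximation for forward FLOPs per token):

```python
# Back-of-envelope: forward FLOPs per token scale roughly with 2 * active params,
# so a sparse model thinking longer can match the compute of a wider one.
flops = lambda active_params, tokens: 2 * active_params * tokens

print(flops(3e9, 1000))  # "think longer":  3B active x 1000 tokens = 6e12 FLOPs
print(flops(10e9, 300))  # "think wider":  10B active x  300 tokens = 6e12 FLOPs
```

Same compute budget either way, but the "longer" path needs no new architecture, just training that elicits longer traces.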
fisherwei@reddit
As I understand it, a dynamic MoE architecture like this would have to be decided on during pre-training; it couldn't just be added by fine-tuning during post-training.