Is the output of only the shared expert(s) in a MoE model coherent?

Posted by gofiend@reddit | LocalLLaMA

Before I fiddle with this, I wanted to see if anyone else has tried deactivating all but the shared expert in a MoE model to evaluate whether its output is coherent ... or if it can be trivially trained to be useful.
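For concreteness, here's a minimal PyTorch sketch of the kind of ablation I mean, assuming a DeepSeek/Kimi-style block where one always-on shared expert sits alongside a set of routed experts. The module layout and names below are made up for illustration, not any real checkpoint's API:

```python
import torch
import torch.nn as nn


class SwiGLU(nn.Module):
    """Gated FFN used for both the shared expert and the routed experts."""
    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.gate = nn.Linear(d_model, d_ff, bias=False)
        self.up = nn.Linear(d_model, d_ff, bias=False)
        self.down = nn.Linear(d_ff, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.down(nn.functional.silu(self.gate(x)) * self.up(x))


class MoEBlock(nn.Module):
    """Hypothetical MoE FFN: shared expert + top-k routed experts."""
    def __init__(self, d_model=512, d_ff=1024, n_routed=8, top_k=2,
                 shared_only=False):
        super().__init__()
        self.shared_only = shared_only          # the switch being discussed
        self.shared_expert = SwiGLU(d_model, d_ff)
        self.router = nn.Linear(d_model, n_routed, bias=False)
        self.routed_experts = nn.ModuleList(
            SwiGLU(d_model, d_ff) for _ in range(n_routed))
        self.top_k = top_k

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = self.shared_expert(x)             # shared expert always runs
        if self.shared_only:
            return out                          # skip router + routed experts
        scores = self.router(x).softmax(dim=-1)
        topk_w, topk_idx = scores.topk(self.top_k, dim=-1)
        for k in range(self.top_k):
            idx = topk_idx[..., k]
            w = topk_w[..., k].unsqueeze(-1)
            # naive per-expert dispatch: fine for an experiment, not for speed
            for e, expert in enumerate(self.routed_experts):
                mask = (idx == e).unsqueeze(-1)
                if mask.any():
                    out = out + torch.where(mask, w * expert(x),
                                            torch.zeros_like(x))
        return out


x = torch.randn(2, 16, 512)
block = MoEBlock(shared_only=True)
print(block(x).shape)  # torch.Size([2, 16, 512])
```

In a real model you'd flip something like `shared_only` on every MoE layer of a pretrained checkpoint and then check whether the generations are still coherent (or become so after a little finetuning).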

More broadly, I'm very interested in the potential of training a single model to work under different inference-time resource budgets (Google's MatFormer work with Gemma 3n is the obvious other approach).

I'd love to see models that can yield coherent output from just the shared expert FFN (you could squeeze out a little more memory by also skipping the router parameters), from a small subset of experts, and of course from the full set.

Yes, this was inspired by the absolutely wild setup in Kimi K2: 384(!) routed FFN experts, with 8 activated per token plus one shared expert... What can just that one shared expert do?