Is the output of only the shared expert(s) in an MoE model coherent?
Posted by gofiend@reddit | LocalLLaMA | View on Reddit | 2 comments
Before I fiddle with this, I wanted to see if anyone else has tried deactivating all but the shared expert in a MoE model to evaluate whether its output is coherent ... or if it can be trivially trained to be useful.
More broadly, I'm very interested in the potential of training a single model to work with different inference resource budgets (Google's MatFormer work with Gemma 3n is the obvious other approach).
I'd love to see models that can yield coherent output from just the shared expert FFN (squeezing out a little more memory by also skipping the router parameters), from a small set of experts, and of course from the full set.
Yes, this was inspired by the absolutely wild setup in Kimi K2: 384(!) routed FFN experts, with 8 activated per token plus one shared expert... What can just that one shared expert do?
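To make it concrete, this is roughly the hack I have in mind (a rough sketch against a hypothetical MoE block; the `layer.mlp` / `shared_expert` layout is an assumption, not any specific model's API):

```python
def shared_expert_only_forward(moe_block, hidden_states):
    """Skip the router and all routed experts; return only the shared expert's output."""
    return moe_block.shared_expert(hidden_states)  # assumed to be a plain FFN

def patch_to_shared_expert_only(model):
    """Monkey-patch every MoE layer so inference runs through just the shared expert."""
    for layer in model.model.layers:      # assumed decoder layer layout
        moe = layer.mlp                   # assumed location of the MoE block
        moe.forward = lambda hidden, _moe=moe: shared_expert_only_forward(_moe, hidden)
```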
Double_Cause4609@reddit
Short answer:
This doesn't do what you want it to do. If you want a dense model, just train a dense model.
MoE is not some magical alternative formulation. The experts are not domain experts skilled in specific areas cleanly delineated by some human-like hierarchical breakdown of compartmentalized subjects.
MoE is just an approximation of a dense neural network (see: Approximating Two-Layer Feedforward Networks for Efficient Transformers, Csordás et al.).
Can you take a row out of a typical FFN's matrix and use just that to get a coherent response? That's effectively what you're asking.
A better approach, IMO, if you really must continue on something like this, is to take a dense model, do an SVD or PCA, keep the top-k components (the most important directions), and save those as a new network. In theory you'll have preserved the most important parts of the model in a now much smaller model.
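As a minimal toy sketch of the truncation step (the matrix size and rank here are made up):

```python
import torch

def rank_k_approx(W: torch.Tensor, k: int) -> torch.Tensor:
    """Best rank-k approximation of W (Eckart-Young) via truncated SVD."""
    U, S, Vh = torch.linalg.svd(W, full_matrices=False)
    return U[:, :k] @ torch.diag(S[:k]) @ Vh[:k, :]

W = torch.randn(1024, 2816)        # stand-in for one dense FFN weight matrix
W_small = rank_k_approx(W, k=256)  # same shape, but only rank 256 of information
```

In practice you'd store the two truncated factors as a pair of smaller linear layers rather than reconstructing the full matrix; that factorization is where the parameter savings actually come from.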
In the same way, treating an MoE as an approximation of a dense network, you can do the same operation on *all* of the experts together as a single "unit", and probably extract the majority of the model's internal representations. This requires some additional tricks to account for the explicit block-sparsity in an MoE model.
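A very rough sketch of the stacked-experts version, with made-up expert shapes (real checkpoints need the block-sparsity bookkeeping mentioned above):

```python
import torch

num_experts, d_model, d_ff = 16, 512, 1408                 # toy sizes
expert_up_proj = [torch.randn(d_ff, d_model) for _ in range(num_experts)]

# Treat all experts as one unit: stack their up-projections along the output dim.
stacked = torch.cat(expert_up_proj, dim=0)                  # (num_experts * d_ff, d_model)

# The top right-singular vectors are the input directions the experts jointly use most.
U, S, Vh = torch.linalg.svd(stacked, full_matrices=False)
k = 256
shared_basis = Vh[:k, :]                                    # (k, d_model) common subspace
print(f"energy captured: {(S[:k].sum() / S.sum()).item():.2%}")  # crude importance measure
```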
Note that while PCA is good, it's not perfect, so expect some quality loss relative to the original model.
I will never understand people's obsession with compressing MoE models into dense networks. The entire point (in the context of local usage for end consumers) is that they trade memory capacity for extra quality, which makes relatively affordable options like system DRAM viable for running quite large, performant models. That tradeoff is much more desirable than just stacking VRAM endlessly.
gofiend@reddit (OP)
I broadly agree with you!
But I'm not quite trying to go where you think I am. I agree that pulling the shared expert FFNs out of an MoE yields a poorly architected dense network, and that there are better ways to distill a smaller model from a bigger one.
What I am interested in is training models that scale to the available memory bandwidth. You *could* train an MoE such that its shared expert is a passable standalone LLM, activating 2 of the 8 expert slots enhances its capabilities, and the full 8 of 384 makes it world-class (you aren't losing much by the experts being somewhat less efficiently specialized).
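Concretely, I'm imagining a training loop along these lines (purely a sketch; `set_top_k` is a made-up hook, and the `model(**batch).loss` call just assumes an HF-style interface):

```python
import random

def train_step(model, optimizer, batch, budgets=(0, 2, 8)):
    """One 'any-budget' MoE training step: sample how many routed experts are active
    so the same checkpoint stays usable at every budget (0 = shared expert only)."""
    k = random.choice(budgets)
    model.set_top_k(k)          # hypothetical hook to cap routing at top-k experts
    loss = model(**batch).loss  # assumes a forward pass that returns .loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return k, loss.item()
```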
Of course I think the MatFormer route is probably a better way to do this, but we're already investing so much in MoE architectures that it would be interesting to explore this alternate route (actually, I should do a bit more of a lit search ... someone must have tried this already).
Step one (what I was asking here) is just to get a sense of whether anybody has actually looked at what role the shared expert plays, and whether it could be trained cheaply to work like a small dense model.
The implication, of course, is that my proposed approach would give up relatively little performance versus the full MoE if the shared expert is "close" to a small standalone dense model, but possibly a lot if it is "far" from one.