new MoE from ai2, EMO
Posted by ghostderp@reddit | LocalLLaMA | View on Reddit | 19 comments
new MoE release from ai2 - EMO, 1b-active/14b-total trained on 1t tokens
interesting thing is the document-level routing: experts cluster around domains like health, news, etc. instead of the surface-level token patterns you usually get with per-token routing
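rough toy sketch of what document-level routing means vs the usual per-token top-k (my own illustration in PyTorch, not their code; `DocRoutedMoE` and all the sizes are made up):

```python
# Toy sketch of document-level routing vs per-token top-k routing.
# Not AI2's code; DocRoutedMoE and all sizes here are made up for illustration.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DocRoutedMoE(nn.Module):
    def __init__(self, d_model=256, n_experts=8, top_k=2):
        super().__init__()
        self.experts = nn.ModuleList(
            [nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                           nn.Linear(4 * d_model, d_model))
             for _ in range(n_experts)]
        )
        self.router = nn.Linear(d_model, n_experts)
        self.top_k = top_k

    def forward(self, x):  # x: (seq_len, d_model), all tokens of one document
        # Route once per document: pool the tokens, pick top-k experts,
        # and send every token in the document to that same small set.
        doc_repr = x.mean(dim=0)
        weights, idx = torch.topk(F.softmax(self.router(doc_repr), dim=-1), self.top_k)
        weights = weights / weights.sum()
        out = torch.zeros_like(x)
        for w, i in zip(weights, idx):
            out = out + w * self.experts[int(i)](x)
        return out

moe = DocRoutedMoE()
doc = torch.randn(128, 256)   # 128 tokens of a single "health" document, say
print(moe(doc).shape)         # torch.Size([128, 256])
```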
EducationalGood495@reddit
How would this run on an RX 5700 XT? It would fit in 8GB of VRAM with some offloading, wouldn't it?
nuclearbananana@reddit
This is what I thought MoE originally was. Makes more sense imo. Deploy, say, a quarter of the model depending on whether you're programming, writing, asking questions, etc.
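If experts really do cluster by domain, you could in principle strip a checkpoint down to just the experts your workload routes to. Purely illustrative sketch, the `moe.experts.{i}` key pattern and the expert ids are made up:

```python
# Keep only the experts your domain actually routes to (toy example).
import torch

def keep_experts(state_dict, keep_ids):
    kept = {}
    for name, tensor in state_dict.items():
        if ".experts." in name:
            expert_id = int(name.split(".experts.")[1].split(".")[0])
            if expert_id not in keep_ids:
                continue  # drop weights for experts we never route to
        kept[name] = tensor
    return kept

# toy checkpoint so this runs standalone
full = {f"moe.experts.{i}.w": torch.randn(4, 4) for i in range(8)}
full["router.weight"] = torch.randn(8, 4)
subset = keep_experts(full, keep_ids={0, 3, 5})  # pretend these are the "code" experts
print(sorted(subset))  # router weights plus experts 0, 3, 5 only
```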
iLaurens@reddit
Yeah, but the beauty of neural nets is that they don't necessarily work like human brains. So what seems obvious to us might not be helpful at all for a neural net. Hence the more freedom you give a network to learn, the more expressive it can be in theory. It can find all kinds of wacky ways to do things that work. Practice is much different, of course, and sometimes you want to architect some bias into the model, the way convolutional nets or transformers do. But forcing document-level MoE routing is a very strong bias to introduce.
Nyghtbynger@reddit
Maybe a good model lies in the hybrid approach (or some very smart approach I can't fathom)
SadBBTumblrPizza@reddit
Almost certainly not. Any time we try to hard-code human ideas and techniques into models, they perform worse than just throwing a huge training set at them and letting them figure it out.
Nyghtbynger@reddit
Different architectures? Like different forms of neurons?
InterestRelative@reddit
> every route needs to learn the same basics independently (e.g. basic language and grammar) in its weights
Isn't that what the shared expert is for? AI2 recently published a paper detailing their approach to training experts separately: https://allenai.org/blog/bar
The idea is that organizations can train experts on their semi-private data and then merge them into an MoE. They can also release an expert publicly without releasing the data (though data may leak).
I suppose this work is a continuation of the expert-separation idea. I acknowledge the downsides you mentioned, but if they can successfully implement this approach, it could be an intriguing path for LLM development. For example, more orgs may train experts in niche domains, which is cheaper than training a big model.
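Very roughly, the merge idea looks like this to me (my own toy PyTorch sketch, not AI2's actual merge code; every name here is made up). The shared expert always runs, so the separately trained domain experts don't each have to relearn the basics:

```python
# Toy sketch: merge independently trained FFN experts behind a router,
# with an always-on shared expert for the common basics.
import torch
import torch.nn as nn
import torch.nn.functional as F

def make_ffn(d_model=256):
    return nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                         nn.Linear(4 * d_model, d_model))

class MergedMoE(nn.Module):
    def __init__(self, domain_experts, shared_expert, d_model=256):
        super().__init__()
        self.experts = nn.ModuleList(domain_experts)  # e.g. trained by different orgs
        self.shared = shared_expert                   # always-on shared expert
        self.router = nn.Linear(d_model, len(domain_experts))

    def forward(self, x):  # x: (tokens, d_model)
        out = self.shared(x)                          # shared basics for every token
        probs = F.softmax(self.router(x), dim=-1)     # (tokens, n_experts)
        weight, idx = probs.max(dim=-1)               # top-1 domain expert per token
        for e, expert in enumerate(self.experts):
            mask = idx == e
            if mask.any():
                out[mask] = out[mask] + weight[mask, None] * expert(x[mask])
        return out

# pretend each FFN came from a different org's private training run
health, news, code = make_ffn(), make_ffn(), make_ffn()
moe = MergedMoE([health, news, code], shared_expert=make_ffn())
print(moe(torch.randn(10, 256)).shape)  # torch.Size([10, 256])
```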
guiopen@reddit
It seems like an experiment and not a final model, just 1T tokens of pretraining.
__Maximum__@reddit
A 14B MoE on 1T tokens is not a joke; it's enough to claim they had a breakthrough imo.
Silver-Champion-4846@reddit
Don't forget that they are going completely open source, which means they try to use the most open data possible, with as few of the gray-area crawls (Reddit chats and the like) as possible. That's why they don't have as many tokens as the leading open-weights models.
guiopen@reddit
I know, but their OLMo series is pretrained on much larger datasets, so even for them this is an experiment-level amount.
Silver-Champion-4846@reddit
yeah, did you test the new model?
ComplexType568@reddit
AllenAI models never get GGUFs... I hope this one does.
Eyelbee@reddit
Allen ai does some great work
Specter_Origin@reddit
They indeed do!
jld1532@reddit
GGUF when? Am I doing this right?
TheRealMasonMac@reddit
They also recently released a robotics model: https://allenai.org/blog/molmoact2
Firstbober@reddit
I wonder how it fares compared to other models. Performance wise it should be excellent while delivering really nice intelligence per tok/s. It would be fire for someone to make 200M active EMO model, and then make it an SSM, but that is a wishful thinking (tho NVIDIA could do it?).
ttkciar@reddit
Yaay! When they released Olmo-3, someone asked about MoE, and they said it was in the works. I've wondered about that from time to time, and now this pops up showing they have indeed been working on it :-) kudos to AllenAI!