new MoE from ai2, EMO
Posted by ghostderp@reddit | LocalLLaMA | View on Reddit | 19 comments
new MoE release from ai2 - EMO, 1b-active/14b-total trained on 1t tokens
interesting thing is the document-level routing: experts cluster around domains like health, news, etc. instead of the surface-level token patterns you usually get with per-token routing
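rough toy sketch of what document-level routing means vs the usual per-token top-k (my own illustration in PyTorch, not their code; `DocRoutedMoE` and all the sizes are made up):

```python
# Toy sketch of document-level routing vs per-token top-k routing.
# Not AI2's code; DocRoutedMoE and all sizes here are made up for illustration.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DocRoutedMoE(nn.Module):
    def __init__(self, d_model=256, n_experts=8, top_k=2):
        super().__init__()
        self.experts = nn.ModuleList(
            [nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                           nn.Linear(4 * d_model, d_model))
             for _ in range(n_experts)]
        )
        self.router = nn.Linear(d_model, n_experts)
        self.top_k = top_k

    def forward(self, x):  # x: (seq_len, d_model), all tokens of one document
        # Route once per document: pool the tokens, pick top-k experts,
        # and send every token in the document to that same small set.
        doc_repr = x.mean(dim=0)
        weights, idx = torch.topk(F.softmax(self.router(doc_repr), dim=-1), self.top_k)
        weights = weights / weights.sum()
        out = torch.zeros_like(x)
        for w, i in zip(weights, idx):
            out = out + w * self.experts[int(i)](x)
        return out

moe = DocRoutedMoE()
doc = torch.randn(128, 256)   # 128 tokens of a single "health" document, say
print(moe(doc).shape)         # torch.Size([128, 256])
```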
EducationalGood495@reddit
How would this run on an RX 5700 XT? It would fit in 8GB of VRAM with some offloading, wouldn't it?
nuclearbananana@reddit
This is what I thought MoE originally was. Makes more sense imo. Deploy, say, a quarter of the model depending on whether you're programming, writing, asking questions, etc.
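If experts really do cluster by domain, you could in principle strip a checkpoint down to just the experts your workload routes to. Purely illustrative sketch, the `moe.experts.{i}` key pattern and the expert ids are made up:

```python
# Keep only the experts your domain actually routes to (toy example).
import torch

def keep_experts(state_dict, keep_ids):
    kept = {}
    for name, tensor in state_dict.items():
        if ".experts." in name:
            expert_id = int(name.split(".experts.")[1].split(".")[0])
            if expert_id not in keep_ids:
                continue  # drop weights for experts we never route to
        kept[name] = tensor
    return kept

# toy checkpoint so this runs standalone
full = {f"moe.experts.{i}.w": torch.randn(4, 4) for i in range(8)}
full["router.weight"] = torch.randn(8, 4)
subset = keep_experts(full, keep_ids={0, 3, 5})  # pretend these are the "code" experts
print(sorted(subset))  # router weights plus experts 0, 3, 5 only
```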
iLaurens@reddit
Yeah, but the beauty of neural nets is that they don't necessarily work like human brains. So what seems obvious to us might not be helpful at all for a neural net. Hence the more freedom you give a network to learn, the more expressive it can be in theory. It can find all kinds of wacky ways to do things that work. Practice is much different, of course, and sometimes you want to architect some bias into the model, the way convolutional nets or transformers do. But forcing document-level MoE routing is a very strong bias to introduce.
Nyghtbynger@reddit
Maybe a good model lies in the hybrid approach (or some very smart approach I can't fathom)
SadBBTumblrPizza@reddit
Almost certainly not. Any time we try to hard-code human ideas and techniques into models, they perform worse than just throwing a huge training set at them and letting them figure it out.
Nyghtbynger@reddit
Different architectures? Like different forms of neurons?
InterestRelative@reddit
> every route needs to learn the same basics independently (e.g. basic language and grammar) in its weights
Isn't that what the shared expert is for? AI2 recently published a paper detailing their approach to training experts separately: https://allenai.org/blog/bar
The idea is that organizations can train experts on their semi-private data and then merge them into an MoE. They can also release an expert publicly without releasing the data (though data may leak).
I suppose this work is a continuation of the expert-separation idea. I acknowledge the downsides you mentioned, but if they can successfully implement this approach, it could be an intriguing path for LLM development. For example, more orgs may train experts in niche domains, which is cheaper than training a big model.
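Very roughly, the merge idea looks like this to me (my own toy PyTorch sketch, not AI2's actual merge code; every name here is made up). The shared expert always runs, so the separately trained domain experts don't each have to relearn the basics:

```python
# Toy sketch: merge independently trained FFN experts behind a router,
# with an always-on shared expert for the common basics.
import torch
import torch.nn as nn
import torch.nn.functional as F

def make_ffn(d_model=256):
    return nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                         nn.Linear(4 * d_model, d_model))

class MergedMoE(nn.Module):
    def __init__(self, domain_experts, shared_expert, d_model=256):
        super().__init__()
        self.experts = nn.ModuleList(domain_experts)  # e.g. trained by different orgs
        self.shared = shared_expert                   # always-on shared expert
        self.router = nn.Linear(d_model, len(domain_experts))

    def forward(self, x):  # x: (tokens, d_model)
        out = self.shared(x)                          # shared basics for every token
        probs = F.softmax(self.router(x), dim=-1)     # (tokens, n_experts)
        weight, idx = probs.max(dim=-1)               # top-1 domain expert per token
        for e, expert in enumerate(self.experts):
            mask = idx == e
            if mask.any():
                out[mask] = out[mask] + weight[mask, None] * expert(x[mask])
        return out

# pretend each FFN came from a different org's private training run
health, news, code = make_ffn(), make_ffn(), make_ffn()
moe = MergedMoE([health, news, code], shared_expert=make_ffn())
print(moe(torch.randn(10, 256)).shape)  # torch.Size([10, 256])
```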
guiopen@reddit
It seems like an experiment and not a final model, just 1T tokens of pretraining.
__Maximum__@reddit
A 14B MoE on 1T tokens is not a joke; it's enough to claim they had a breakthrough imo.
Silver-Champion-4846@reddit
Don't forget that they are going completely open source, which means they try to use the most open data possible, with as few of the gray-area crawls (Reddit chats and the like) as possible. That's why they don't have as many tokens as the leading open-weights models.
guiopen@reddit
I know, but their OLMo series is pretrained on much larger datasets, so even for them this is an experiment-level amount.
Silver-Champion-4846@reddit
yeah, did you test the new model?
ComplexType568@reddit
AllenAI models never get GGUFs... I hope this one does.
Eyelbee@reddit
Allen ai does some great work
Specter_Origin@reddit
They indeed do!
jld1532@reddit
GGUF when? Am I doing this right?
TheRealMasonMac@reddit
They also recently released a robotics model: https://allenai.org/blog/molmoact2
Firstbober@reddit
I wonder how it fares compared to other models. Performance wise it should be excellent while delivering really nice intelligence per tok/s. It would be fire for someone to make 200M active EMO model, and then make it an SSM, but that is a wishful thinking (tho NVIDIA could do it?).
ttkciar@reddit
Yaay! When they released Olmo-3, someone asked about MoE, and they said it was in the works. I've wondered about that from time to time, and now this pops up showing they have indeed been working on it :-) kudos to AllenAI!