OLMoE - a fully open source sparse MoE with only 1 billion active parameters

-p-e-w-@reddit

That's a fantastic choice of parameter counts: 7B parameters easily fit into any laptop's RAM, and you get the speed of a 1B model, which when quantized can be 30-50 tokens/s *without a GPU.* That thing is born to be a local assistant.
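A rough sketch of the arithmetic behind this, assuming round figures of 7e9 total and 1e9 active parameters and counting weights only (no KV cache, activations, or runtime overhead). The bit widths are illustrative, not specific quantization formats:

```python
# Rough weight-memory estimate for a sparse MoE at different bit widths.
# Assumptions (not from the thread): round parameter counts, weights only,
# no KV cache, activations, or runtime overhead.

TOTAL_PARAMS = 7e9   # all experts must be resident in memory
ACTIVE_PARAMS = 1e9  # parameters actually used for each token


def weight_gib(num_params: float, bits_per_weight: float) -> float:
    """Return the size of the weights in GiB for a given bit width."""
    return num_params * bits_per_weight / 8 / 2**30


for bits in (16, 8, 4):
    total = weight_gib(TOTAL_PARAMS, bits)
    active = weight_gib(ACTIVE_PARAMS, bits)
    print(f"{bits:>2}-bit: {total:5.1f} GiB resident, ~{active:4.1f} GiB touched per token")
```

The full 7B has to fit in RAM, but each token only touches roughly the active share of the weights, which is why decoding speed can look closer to that of a 1B dense model.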

Muennighoff@reddit

This is great feedback. We've been thinking about whether for the next model we should do:

1) Larger & a fair bit better (e.g. 2x as many total and active parameters, hence a bit more costly)
2) Same size & slightly better (e.g. by training for a bit longer, i.e. inference cost would be the same)

Given your comment, would you prefer 2)?

jld1532@reddit

Old thread, but I'd personally like to see number one but with updates to this model. Liquid just dropped LFM2 24B A2B, and so now I can run a model with more parameters than GPT OSS 20B faster on my laptop. This model and Granite 4 H Tiny are my small quick models while I think Liquid will be my workhorse. Just a perspective.

Aaaaaaaaaeeeee@reddit (OP)

Since a DeepSeek Lite model exists for this size, 2 sounds better to me. 😁

Have to ask: would it be possible to create a BitNet model? Inference support is available in llama.cpp, vLLM with BitBLAS, and T-MAC (a library compiled with llama.cpp that can reduce energy consumption during inference). MoEs can work too, according to the paper authors. What are your thoughts on this research?
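For readers unfamiliar with the idea, here is a minimal sketch of absmean ternary weight quantization in the style described for BitNet b1.58. It assumes a flat list of floats rather than real tensors, and the function names are hypothetical; it is not taken from any of the libraries mentioned above:

```python
# Sketch of absmean ternary quantization (BitNet b1.58 style).
# Assumption: weights are a plain list of floats; real implementations
# operate on tensors and quantize during training, not after it.


def ternary_quantize(weights, eps=1e-8):
    """Map each weight to -1, 0, or +1 and return (codes, scale)."""
    # Scale by the mean absolute value of the weights.
    scale = sum(abs(w) for w in weights) / len(weights) + eps
    # Round to the nearest integer, then clip into [-1, 1].
    codes = [max(-1, min(1, round(w / scale))) for w in weights]
    return codes, scale


def dequantize(codes, scale):
    """Recover approximate weights from ternary codes."""
    return [c * scale for c in codes]


weights = [0.9, -0.05, 0.4, -1.2, 0.0, 0.3]
codes, scale = ternary_quantize(weights)
print(codes, round(scale, 4))
```

Each weight then needs only about 1.58 bits of information, which is where the energy and memory savings discussed for such models would come from.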

Muennighoff@reddit

Not very familiar with BitNet models but sounds like interesting research! Would love to know if someone gets it to work!

innominato5090@reddit

Hello 👋 one of the authors here. Nice to see excitement about the release, lmk if you have any questions!

The_GSingh@reddit

Any updates?

xXWarMachineRoXx@reddit

Ayyy

sammcj@reddit

That's really interesting! Are you planning on adding support to llama.cpp/GGUF?

DefiantHost6488@reddit

From Ai2 team, here is the link for GGUF: [https://huggingface.co/allenai/OLMoE-1B-7B-0924-Instruct](https://huggingface.co/allenai/OLMoE-1B-7B-0924-Instruct)

sammcj@reddit

Awesome! Thanks!

innominato5090@reddit

We are working on [vLLM support](https://github.com/vllm-project/vllm/pull/7922#issuecomment-2329286620) ATM, but will look at GGUF too!

pallavnawani@reddit

Looks interesting. Waiting for ggufs!

DefiantHost6488@reddit

From Ai2 team, here is the link for GGUF: [https://huggingface.co/allenai/OLMoE-1B-7B-0924-Instruct](https://huggingface.co/allenai/OLMoE-1B-7B-0924-Instruct)

pallavnawani@reddit

Thanks!

exclaim_bot@reddit

> Thanks!

You're welcome!

FullOf_Bad_Ideas@reddit

It seems like they are comparing against DeepSeek v1 16B MoE and not DeepSeek V2 Lite 16B MoE, which is just much better. I really like how open the release is; this model is actually open source, with a nice paper behind it.

Was anyone able to carry the MoE training speed advantage over to finetuning? I tried finetuning DeepSeek V2 Lite Coder via QLoRA recently and it's actually very CPU-heavy: my GPU sits at just 50% utilization (3090 Ti, power usage hovers at 250 W with a 480 W TDP), so I get a total speed of just 400 t/s. That's not great, as I get around 300 t/s when finetuning DeepSeek Coder 33B, which has around 10x more activated parameters and 2x total, so the MoE speed advantage is nonexistent for QLoRA finetuning as far as I can see (16B MoE trained in LLaMA-Factory with FA2, DS 33B trained in Unsloth). I would love to have a MoE where finetuning speed scales with the number of activated parameters.

Ah, and also with the MoE finetuning, the loss quickly stops going down and is basically stable; I don't know why.
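To illustrate why per-token compute in a sparse MoE should in principle track activated rather than total parameters, here is a toy top-k routing sketch. Everything in it is an assumption for illustration: 8 experts with 2 active, scalar "experts" instead of real MLPs, and renormalized top-k weights (one common choice, not necessarily what OLMoE or DeepSeek do):

```python
import math
import random

# Toy top-k MoE routing for a single token.
# Assumptions: 8 experts, 2 active, scalar "experts" standing in for MLPs.
NUM_EXPERTS = 8
TOP_K = 2


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def route(router_logits, k=TOP_K):
    """Return [(expert_index, weight)] for the k most probable experts."""
    probs = softmax(router_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)  # renormalize over the chosen experts
    return [(i, probs[i] / norm) for i in top]


def moe_forward(x, router_logits, experts, k=TOP_K):
    """Evaluate only the selected experts and mix their outputs."""
    out, calls = 0.0, 0
    for idx, weight in route(router_logits, k):
        out += weight * experts[idx](x)
        calls += 1
    return out, calls


random.seed(0)
# Expert i simply multiplies its input by (i + 1).
experts = [lambda x, s=i + 1: s * x for i in range(NUM_EXPERTS)]
logits = [random.gauss(0, 1) for _ in range(NUM_EXPERTS)]
y, calls = moe_forward(1.0, logits, experts)
print(f"experts evaluated: {calls} of {NUM_EXPERTS}")
```

Only `TOP_K` experts run per token, so the expert math scales with activated parameters. In practice, dispatching tokens to experts and keeping all experts in memory add overhead, which may be part of why the measured finetuning speedup is much smaller than the parameter ratio suggests.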

Muennighoff@reddit

Sorry about the missing comparison! We focused on comparing to other general language models like `deepseek-moe-16b-base`, not code models like `DeepSeek-Coder-V2-Lite-Base` or the StarCoder series. Though I guess that `DeepSeek-Coder-V2-Lite-Base` is also good at non-coding tasks despite the name? Maybe we should have added it, sorry about that!

Finetuning via transformers is unfortunately not as fast as it could be, as the implementation is inefficient. There is a discussion about it here: [https://github.com/huggingface/transformers/pull/32406#discussion_r1735470121](https://github.com/huggingface/transformers/pull/32406#discussion_r1735470121). If anyone would like to make it more efficient via a PR there, that would be amazing!

FullOf_Bad_Ideas@reddit

There are various DeepSeek V2 Lite models: a coding-specific model and also a general model (the one that would be relevant but was missed). [That's the base of the general model.](https://huggingface.co/deepseek-ai/DeepSeek-V2-Lite) [And the chat finetune of the base.](https://huggingface.co/deepseek-ai/DeepSeek-V2-Lite-Chat) DeepSeek V2 coding models are initialized from checkpoints of the general models in the middle of pre-training, so they differ in performance significantly from the general models. The coding model isn't too relevant for comparison with your model; it was simply the one I was finetuning this week, and my first experience with finetuning a MoE.

Muennighoff@reddit

Oh seems like I missed those, sorry! I had thought there was only the 236B DeepSeek-V2 model but it seems like they added these smaller ones later on. We are running our evaluation on them and will try to add them to the paper soon!

CosmosisQ@reddit

Ah, so *this* is the wonder of preprints!

exxon_gas4@reddit

It really is. This is one of the larger open-source models on air right now. I have been using AWS to up train my doc OCR but the incremental improvements are becoming more expensive.

thezachlandes@reddit

Curious what you’re doing to improve OCR

robotphilanthropist@reddit

We found that OLMoE was only about 20-40% faster to fine-tune than OLMo 7B (a dense model). I suspect some of that was from rough initial implementations in the HF ecosystem for fine-tuning. I didn't look closely at utilization / batch size.

Imjustmisunderstood@reddit

Cool and all, but why is it comparing itself against last-gen models? Excited to see benchmarks/anecdotes

Muennighoff@reddit

Sorry you got that impression from the abstract. The paper contains a lot more comparisons with current-gen models like Gemma2, Llama3, DCLM etc. I attached the table from the paper with those (also at https://x.com/Muennighoff/status/1831159131920896102).

https://preview.redd.it/ku8wnar76vmd1.png?width=968&format=png&auto=webp&s=9280f1035cd46e441cbb1778546a4eda3abc5a8c

Imjustmisunderstood@reddit

Ah, there we go. Amazing! Honestly, I keep getting shocked by just how incredible Gemma2 is in practice and on paper. But seeing this architecture being explored and optimized is going to be a brilliant path towards better generalization across every size of transformer.

Very thankful for the open data and checkpoints as well. I was speaking to a friend who works at an LLM research lab, and he was pretty annoyed that checkpoints and graphs get lost to the ether. They offer so many different branches of an architecture to explore.

Muennighoff@reddit

Great to hear that the open data & checkpoints are useful! Performance-wise, Gemma2 is indeed very strong. Compared with Gemma2-2.6B, OLMoE-1B-7B is at least able to outperform it on MMLU despite having fewer active parameters, but it does indeed not come close to the much larger Gemma2-9.2B. Hopefully in a future release of OLMoE though!

MoffKalast@reddit

You compare yourself to what you're better than, marketing 101 ;) On one hand it's worse than Mistral 7B for the total size (ouch), but it does roughly seem to match Gemma 2B in performance while at least in theory being a lot faster if you can load it, which might actually be a niche.

catlordX3@reddit

I'd like to see a self-improving model. Maybe I'm just dumb, but there's gotta be a way. To me, the program is accessing the parameters to generate a response. It's like read-only or something, but it would be cool if LLMs were read/write. For example, I start a new conversation and ask "do you know my name?"; it doesn't, so it replies something to the effect of "I dunno". Then I tell it my name, and the program identifies the parameters that were accessed to generate its initial response and updates them. I'm sure it's wrong thinking, but it would be cool to simply save over wrong answers on the fly.

mrshadow773@reddit

Is it pronounced the same as the previous model, or is the E at the end accentuated a bit more heavily?

robotphilanthropist@reddit

I say OLMo-y, but it is up for debate. Also Olmmm M O E

MoffKalast@reddit

O-lmao-E

robotphilanthropist@reddit

Some general comments on what you can expect from post-training behavior.

1. Most of the data is single-turn instruction following. We want to make a v2 that is better at multi-turn.
2. A moderate focus on code/reasoning, but we can still do more.
3. Not that much on system prompts / roleplay, so curious what people find.
4. Working on verifiable instruction following (IFEval). Isn't as good as Llama 3.1 type models, but much better than previous OLMos.

Healthy-Nebula-3603@reddit

I see on the chart it is better than Gemma 2 2B?

Ylsid@reddit

Note that it's being compared with previous generations, but this is still very important research and hopefully can be replicated with more training
