OLMoE - a fully open source sparse MoE with only 1 billion active parameters

Posted by Aaaaaaaaaeeeee@reddit | LocalLLaMA

>We introduce OLMoE, a fully open, state-of-the-art language model leveraging sparse Mixture-of-Experts (MoE). OLMoE-1B-7B has 7 billion (B) parameters but uses only 1B per input token. We pretrain it on 5 trillion tokens and further adapt it to create OLMoE-1B-7B-Instruct. Our models outperform all available models with similar active parameters, even surpassing larger ones like Llama2-13B-Chat and DeepSeekMoE-16B. We present various experiments on MoE training, analyze routing in our model showing high specialization, and open-source all aspects of our work: model weights, training data, code, and logs.
>
>- models: https://huggingface.co/collections/allenai/olmoe-66cf678c047657a30c8cd3da
>- paper: https://arxiv.org/html/2409.02060v1
>- data: https://hf.co/datasets/allenai/OLMoE-mix-0924
>- code: https://github.com/allenai/OLMoE
>- logs: https://wandb.ai/ai2-llm/olmoe/reports/
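
For anyone wondering how a model can hold 7B parameters but use only about 1B per token, here is a minimal sketch of top-k expert routing in plain Python. It is not OLMoE's actual code: the function names (`route_token`, `moe_layer`, `active_fraction`), the toy experts, and the expert counts are all made up for illustration, and the sketch assumes the router weights are the softmax probabilities without renormalization (some implementations renormalize over the selected experts).

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def route_token(router_logits, top_k):
    # Return (expert_index, weight) pairs for the top_k highest-scoring experts.
    probs = softmax(router_logits)
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return [(i, probs[i]) for i in ranked[:top_k]]

def moe_layer(x, experts, router_logits, top_k):
    # Only the selected experts run; their outputs are mixed by router weight.
    out = [0.0] * len(x)
    for idx, weight in route_token(router_logits, top_k):
        y = experts[idx](x)
        out = [o + weight * yi for o, yi in zip(out, y)]
    return out

def active_fraction(shared_params, params_per_expert, num_experts, top_k):
    # Fraction of all parameters that a single token actually touches.
    total = shared_params + num_experts * params_per_expert
    active = shared_params + top_k * params_per_expert
    return active / total

# Toy experts (hypothetical): expert i just scales its input by i + 1.
experts = [lambda x, s=s: [s * v for v in x] for s in range(1, 5)]
router_logits = [0.0, 2.0, 1.0, -1.0]  # illustrative router scores for one token

print(route_token(router_logits, top_k=2))
print(moe_layer([1.0, 2.0], experts, router_logits, top_k=2))
print(active_fraction(shared_params=10, params_per_expert=10, num_experts=8, top_k=2))
```

The last call shows the general idea behind the "1B active out of 7B" figure: every token pays for the shared parameters plus only `top_k` of the expert blocks, so the active count can be a small fraction of the total. The numbers passed in are illustrative, not OLMoE's real configuration.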