Overview of the Largest Mixture of Expert Models Released So Far

Posted by Small-Fall-6500@reddit | LocalLLaMA | View on Reddit | 6 comments

**Quick Introduction** For a detailed overview about how Mixture of Expert (MoE) models work, there is a detailed HuggingFace blog: ["Mixture of Experts Explained."](https://huggingface.co/blog/moe) The TLDR is that MoE models generally have fewer active parameters compared to dense models of the same size, but at the cost of more total parameters. This list is ordered by date of release and covers MoE models that are over 100b in total parameters which are downloadable right now as of posting. The name of each model is hyperlinked to its corresponding HuggingFace page. The lmsys ranks are from the most recent leaderboard update on November 4, 2024. **The List of MoE Models** **1.** [**Switch-C Transformer by Google**](https://huggingface.co/google/switch-c-2048) * **Architecture Details**: * **Parameters**: 1.6T total * **Experts**: 2048 * **Release Date**: November 2022 (upload to HuggingFace) | [Paper: January 2021](https://arxiv.org/abs/2101.03961) * **Quality Assessment**: Largely outdated, not on lmsys * **Notable Details**: One of the earliest and the current largest released MoE model. Accompanied by smaller MoEs also available on [HuggingFace](https://huggingface.co/collections/google/switch-transformers-release-6548c35c6507968374b56d1f). **2.** [**Grok-1 by X AI**](https://huggingface.co/xai-org/grok-1) * **Architecture Details**: * **Parameters**: 314b total * **Experts**: 8, with 2 chosen * **Context Length:** 8k * **Release Date**: March 17, 2024 * **Quality Assessment**: Not available on lmsys, generally not very good nor widely used * **Notable Details**: Supported by **llamacpp**. Grok-2 (and Grok-2 mini) should be much better, *but Grok-2 is not* [(yet)](https://www.reddit.com/r/LocalLLaMA/comments/1fw7ikv/open_sourcing_grok_2_with_the_release_of_grok_3/) *available for download.* Grok-2 ranks well on lmsys: Grok-2-08-13 ranks 5th Overall (8th with style control) and 6th on Hard Prompts (English). **3.** [**DBRX by Databricks**](https://huggingface.co/databricks/dbrx-instruct) * **Architecture Details**: * **Parameters**: 132b total, 36b active * **Experts**: 16, with 4 chosen * **Context Length**: 32k * **Release Date**: March 27, 2024 * **Quality Assessment**: Rank 90 Overall, 78 Hard Prompts (English) * **Notable Details**: Supported by **llamacpp, exllama v2, and vLLM**. **4.** [**Mixtral 8x22b by Mistral AI**](https://huggingface.co/mistralai/Mixtral-8x22B-Instruct-v0.1) * **Architecture Details**: * **Parameters**: 141b total, 39b active * **Experts**: 8, with 2 chosen * **Context Length**: 64k * **Release Date**: April 17, 2024 * **Quality Assessment**: Rank 70 Overall, 66 Hard Prompts (English) * **Notable Details**: Supported by **llamacpp, exllama v2, and vLLM**. **5.** [**Arctic by Snowflake**](https://huggingface.co/Snowflake/snowflake-arctic-instruct) * **Architecture Details**: * **Parameters**: 480b total, 17b active (7b sparse, 10b dense) * **Experts**: 128, with 2 chosen * **Context Length**: 4k * **Release Date**: April 24, 2024 * **Quality Assessment**: Rank 99 Overall, 101 Hard Prompts (English) * **Notable Details**: *Very* few active parameters for its size but limited usefulness due to very short context length and poor quality. Has **vLLM** support. **6.** [**Skywork-MoE by Skywork**](https://huggingface.co/Skywork/Skywork-MoE-Base) * **Architecture Details**: * **Parameters**: 146b total, 22b active * **Experts:** 16, with 2 chosen * **Context Length**: 8k * **Release Date**: June 3, 2024 * **Quality Assessment**: This is only the base model, and it is not available on lmsys * **Notable Details**: Only the base model has been released, with the Chat model promised but still unreleased after five months. Has **vLLM** support. **7.** [**Jamba 1.5 Large by AI21 Labs**](https://huggingface.co/ai21labs/AI21-Jamba-1.5-Large) * **Architecture Details**: * **Parameters**: 398b total, 98b active * **Experts:** 16, with 2 chosen * **Context Length**: 256k * **Release Date**: August 22, 2024 * **Quality Assessment**: Rank 34 Overall, 28 Hard Prompts (English) * **Notable Details**: This is a mamba-transformer hybrid that beats all other models tested on the [RULER context benchmark](https://github.com/NVIDIA/RULER). It was released alongside [Jamba 1.5 mini, a 52b MoE](https://huggingface.co/ai21labs/AI21-Jamba-1.5-Mini). It has **vLLM** support, and work has been done to provide [support for Jamba models in llamacpp](https://github.com/ggerganov/llama.cpp/issues/6372), but it's not yet fully implemented. **8.** [**DeepSeek V2.5 by DeepSeek**](https://huggingface.co/deepseek-ai/DeepSeek-V2.5) * **Architecture Details**: * **Parameters**: 236b total, 21b active * **Experts:** 160, with 6 chosen and 2 shared (total 8 active) * **Context Length**: 128k * **Release Date**: September 6, 2024 * **Quality Assessment**: Rank 18 Overall, 6 in Hard Prompts (English) * **Notable Details**: Top ranked MoE released so far. The [earlier DeepSeek V2](https://huggingface.co/collections/deepseek-ai/deepseek-v2-669a1c8b8f2dbc203fbd7746) was released on May 6, 2024. DeepSeek V2.5 is supported by **vLLM and llamacpp**. **9.** [**Hunyuan Large by Tencent**](https://huggingface.co/tencent/Tencent-Hunyuan-Large) * **Architecture Details**: * **Parameters**: 389b total, 52b active * **Experts:** 16, with 1 chosen and 1 shared (2 total active) * **Context Length**: 128k * **Release Date**: November 5, 2024 * **Quality Assessment**: Not currently ranked on lmsys. * **Notable Details**: Recently released, hopefully it shows up on lmsys. It has **vLLM** support. The current best MoE model released so far appears to be DeepSeek V2.5, but Tencent's Hunyuan Large could end up beating it. If/when Grok-2 is released, it would likely be the best available MoE model. However, the true "best" model always depends on the specific usecase. For example, Jamba 1.5 Large may excel at long context tasks compared to DeepSeek V2.5. I should also add that the rankings on the lmsys chatbot arena do not always provide a reliable assessment of model capabilities (especially long context capabilities), but they should be good enough for a rough comparison between models. As I said above, the true "best" model will depend on your specific usecases. The rankings on lmsys can provide a starting point if you don't have the time or resources to test every model yourself. I thought about scouring every release page for benchmarks like MMLU, but that would take even more time (though perhaps it would be worth adding). This list should cover all of the largest MoEs (>100b) released so far, but if anyone has heard of any others I'd love to hear about them (as well as any notable finetunes, like [Wizard 8x22b](https://huggingface.co/alpindale/WizardLM-2-8x22B)). If anyone knows how many active parameters Switch-C or Grok-1 has or knows how to calculate it, or what the context length of Switch-C is, please add a comment and I'll edit the list. Also, if anyone knows the status of support for these models for different backends, please let me know and I'll edit the post. I only added mention for support that I could easily verify, mainly by checking GitHub and HuggingFace. Lastly, if anyone has gotten Hunyuan Large running or tested it online, I would love to hear about it and how it compares to DeepSeek V2.5 or other models. There have been a lot of smaller MoEs released too, and I might make a similar list of them if I get around to it. The smaller MoEs are certainly a lot more accessible, and such a list may be more useful for most people.

6 Comments

[-]

Durian881@reddit

I'm hoping to try out quantised Deepseek 2.5 with distributed inferencing using my 2 Macs (96GB+64GB).

spookperson@reddit

I got Exo ( https://github.com/exo-explore/exo ) working with two macs and was happy to see that Deepseek 2.5 had pretty decent tokens per second. The time to first token was not super fast though.

Hi, I'm a little lost. Would appreciate it if you could guide me on how to run Deepseek 2.5. is there any config to update? I managed to install and got the tiny hat working. Thanks!

This is the comment thread that got me to try exo and I had a discussion with one of the maintainers about best practices for my Mac hardware: [https://www.reddit.com/r/LocalLLaMA/comments/1fhdkdw/comment/lnz3vws/](https://www.reddit.com/r/LocalLLaMA/comments/1fhdkdw/comment/lnz3vws/) When you're using Macs, it is easiest to use the MLX engine (as opposed to Tinygrad). If I remember correctly, these are the steps I used to get exo running: 1) Made a new python virtual environment for exo (I've tested with conda and [pyenv-virtualenv](https://github.com/pyenv/pyenv-virtualenv) but I assume something like pipx would work too) 2) clone https://github.com/exo-explore/exo.git && cd exo pip install -e . exo then it will be running a webserver that you can go to in your browser at [http://127.0.0.1:8000](http://127.0.0.1:8000) and you can try a small model (like llama-3.2-1b) by selecting in the dropdown and starting a chat (it will download the model file using hugging\_face hub). Once you've confirmed it is working and inferences on both computers separately (at a speed you'd expect) then you can try running exo on both Macs at the same time and if they are on the same wifi/router they should automatically connect to each other

Thank you for the detailed write-up. I managed to install eco and got it running on both Macs at the same time. I could see various Llama models but could not select deepseek-coder-v2.5 from the webUI.

Small-Fall-6500@reddit (OP)

One thing I found interesting while looking over the lmsys leaderboard is that there's a new top Non-Proprietary model: The Llama 3.1 Nemotron 70b holds the highest non-proprietary rank on the lmsys leaderboard for Overall, at rank 9, and second highest for Hard Prompts (English) at rank 6 (beaten by Llama 3.1 405b bf16 at rank 4). *However*... With Style Control, Nemotron 70b drops to rank 25 Overall, sitting within several points of DeepSeek V2.5, Qwen 2.5 72B Instruct, Athene 70B (a Llama 3 finetune), and Llama 3.1 70B Instruct. Of the MoE models that appear on lmsys, their ranking on Overall does not change dramatically with the Style Control filter enabled (DeepSeek V2.5 is still the top MoE released), but the two top models to get ranked closer together: Jamba 1.5 Large's rank *increases* to 30 (from 33) while DeepSeek V2.5's rank *decreases* to 22 (from 18).

Reply to Post

6 Comments