Overview of the Largest Mixture of Experts Models Released So Far
Posted by Small-Fall-6500@reddit | LocalLLaMA
**Quick Introduction**
For a detailed explanation of how Mixture of Experts (MoE) models work, see the HuggingFace blog post ["Mixture of Experts Explained."](https://huggingface.co/blog/moe) The TLDR is that an MoE model routes each token through only a few of its experts, so it has far fewer active parameters per token than a dense model of the same total size, at the cost of needing more total parameters (and memory) than a dense model with similar compute per token.
This list is ordered by release date and covers MoE models with over 100b total parameters that are downloadable as of posting. Each model name links to its HuggingFace page. The lmsys ranks are from the most recent leaderboard update on November 4, 2024.
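To make the "active vs. total parameters" idea concrete, here is a minimal sketch of top-k expert routing in plain Python. It is a toy illustration, not how any model in this list is actually implemented: the "experts" are stand-in functions, the router scores are made up, and renormalizing the selected weights so they sum to 1 is just one common choice that I'm assuming here.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def route_top_k(router_logits, k):
    """Return [(expert_index, weight), ...] for the k highest-scoring experts."""
    probs = softmax(router_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)  # renormalize over the chosen experts (assumption)
    return [(i, probs[i] / norm) for i in top]

def moe_layer(x, experts, router_logits, k=2):
    """Run only the k selected experts and return their weighted sum."""
    chosen = route_top_k(router_logits, k)
    output = sum(w * experts[i](x) for i, w in chosen)
    return output, [i for i, _ in chosen]

# Toy setup: 8 "experts" that just scale their input (hypothetical stand-ins for FFN blocks).
experts = [lambda x, s=s: s * x for s in range(1, 9)]
# Made-up router scores for a single token.
router_logits = [0.1, 2.0, -1.0, 0.5, 1.5, 0.0, -0.5, 0.3]

output, used = moe_layer(3.0, experts, router_logits, k=2)
print(used)  # → [1, 4]
```

All 8 experts have to be held in memory, but only 2 of them do any work for this token. That is the "8, with 2 chosen" pattern that shows up repeatedly in the list below.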
**The List of MoE Models**
**1.** [**Switch-C Transformer by Google**](https://huggingface.co/google/switch-c-2048)
* **Architecture Details**:
* **Parameters**: 1.6T total
* **Experts**: 2048
* **Release Date**: November 2022 (upload to HuggingFace) | [Paper: January 2021](https://arxiv.org/abs/2101.03961)
* **Quality Assessment**: Largely outdated, not on lmsys
* **Notable Details**: One of the earliest MoE models and currently the largest one released. Accompanied by smaller MoEs also available on [HuggingFace](https://huggingface.co/collections/google/switch-transformers-release-6548c35c6507968374b56d1f).
**2.** [**Grok-1 by X AI**](https://huggingface.co/xai-org/grok-1)
* **Architecture Details**:
* **Parameters**: 314b total
* **Experts**: 8, with 2 chosen
* **Context Length:** 8k
* **Release Date**: March 17, 2024
* **Quality Assessment**: Not available on lmsys; generally considered neither very good nor widely used
* **Notable Details**: Supported by **llamacpp**. Grok-2 (and Grok-2 mini) should be much better, *but Grok-2 is not* [(yet)](https://www.reddit.com/r/LocalLLaMA/comments/1fw7ikv/open_sourcing_grok_2_with_the_release_of_grok_3/) *available for download.* Grok-2 ranks well on lmsys: Grok-2-08-13 ranks 5th Overall (8th with style control) and 6th on Hard Prompts (English).
**3.** [**DBRX by Databricks**](https://huggingface.co/databricks/dbrx-instruct)
* **Architecture Details**:
* **Parameters**: 132b total, 36b active
* **Experts**: 16, with 4 chosen
* **Context Length**: 32k
* **Release Date**: March 27, 2024
* **Quality Assessment**: Rank 90 Overall, 78 Hard Prompts (English)
* **Notable Details**: Supported by **llamacpp, exllama v2, and vLLM**.
**4.** [**Mixtral 8x22b by Mistral AI**](https://huggingface.co/mistralai/Mixtral-8x22B-Instruct-v0.1)
* **Architecture Details**:
* **Parameters**: 141b total, 39b active
* **Experts**: 8, with 2 chosen
* **Context Length**: 64k
* **Release Date**: April 17, 2024
* **Quality Assessment**: Rank 70 Overall, 66 Hard Prompts (English)
* **Notable Details**: Supported by **llamacpp, exllama v2, and vLLM**.
**5.** [**Arctic by Snowflake**](https://huggingface.co/Snowflake/snowflake-arctic-instruct)
* **Architecture Details**:
* **Parameters**: 480b total, 17b active (7b sparse, 10b dense)
* **Experts**: 128, with 2 chosen
* **Context Length**: 4k
* **Release Date**: April 24, 2024
* **Quality Assessment**: Rank 99 Overall, 101 Hard Prompts (English)
* **Notable Details**: *Very* few active parameters for its size but limited usefulness due to very short context length and poor quality. Has **vLLM** support.
**6.** [**Skywork-MoE by Skywork**](https://huggingface.co/Skywork/Skywork-MoE-Base)
* **Architecture Details**:
* **Parameters**: 146b total, 22b active
* **Experts:** 16, with 2 chosen
* **Context Length**: 8k
* **Release Date**: June 3, 2024
* **Quality Assessment**: This is only the base model, and it is not available on lmsys
* **Notable Details**: Only the base model has been released, with the Chat model promised but still unreleased after five months. Has **vLLM** support.
**7.** [**Jamba 1.5 Large by AI21 Labs**](https://huggingface.co/ai21labs/AI21-Jamba-1.5-Large)
* **Architecture Details**:
* **Parameters**: 398b total, 94b active
* **Experts:** 16, with 2 chosen
* **Context Length**: 256k
* **Release Date**: August 22, 2024
* **Quality Assessment**: Rank 34 Overall, 28 Hard Prompts (English)
* **Notable Details**: This is a mamba-transformer hybrid that beats all other models tested on the [RULER context benchmark](https://github.com/NVIDIA/RULER). It was released alongside [Jamba 1.5 mini, a 52b MoE](https://huggingface.co/ai21labs/AI21-Jamba-1.5-Mini). It has **vLLM** support, and work has been done to provide [support for Jamba models in llamacpp](https://github.com/ggerganov/llama.cpp/issues/6372), but it's not yet fully implemented.
**8.** [**DeepSeek V2.5 by DeepSeek**](https://huggingface.co/deepseek-ai/DeepSeek-V2.5)
* **Architecture Details**:
* **Parameters**: 236b total, 21b active
* **Experts:** 160, with 6 chosen and 2 shared (total 8 active)
* **Context Length**: 128k
* **Release Date**: September 6, 2024
* **Quality Assessment**: Rank 18 Overall, 6 in Hard Prompts (English)
* **Notable Details**: The top-ranked MoE released so far. The [earlier DeepSeek V2](https://huggingface.co/collections/deepseek-ai/deepseek-v2-669a1c8b8f2dbc203fbd7746) was released on May 6, 2024. DeepSeek V2.5 is supported by **vLLM and llamacpp**.
**9.** [**Hunyuan Large by Tencent**](https://huggingface.co/tencent/Tencent-Hunyuan-Large)
* **Architecture Details**:
* **Parameters**: 389b total, 52b active
* **Experts:** 16, with 1 chosen and 1 shared (2 total active)
* **Context Length**: 128k
* **Release Date**: November 5, 2024
* **Quality Assessment**: Not currently ranked on lmsys.
* **Notable Details**: Recently released; hopefully it shows up on lmsys soon. It has **vLLM** support.
The best MoE model released so far appears to be DeepSeek V2.5, though Tencent's Hunyuan Large could end up beating it. If/when Grok-2 is released, it would likely be the best available MoE model. However, the true "best" model always depends on the specific use case. For example, Jamba 1.5 Large may excel at long-context tasks compared to DeepSeek V2.5.
I should also add that the lmsys chatbot arena rankings are not always a reliable assessment of model capabilities (especially long-context capabilities), but they should be good enough for a rough comparison between models. As I said above, the true "best" model will depend on your specific use cases. The lmsys rankings can provide a starting point if you don't have the time or resources to test every model yourself. I thought about scouring every release page for benchmarks like MMLU, but that would take even more time (though perhaps it would be worth adding).
This list should cover all of the largest MoEs (>100b) released so far, but if anyone has heard of any others I'd love to hear about them (as well as any notable finetunes, like [Wizard 8x22b](https://huggingface.co/alpindale/WizardLM-2-8x22B)). If anyone knows how many active parameters Switch-C or Grok-1 have, how to calculate that, or what the context length of Switch-C is, please add a comment and I'll edit the list. Also, if anyone knows the status of backend support for these models, please let me know and I'll edit the post. I only mentioned support that I could easily verify, mainly by checking GitHub and HuggingFace. Lastly, if anyone has gotten Hunyuan Large running or tested it online, I would love to hear about it and how it compares to DeepSeek V2.5 or other models.
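On the "how to calculate it" question, here is a rough back-of-the-envelope sketch rather than an authoritative method. It assumes every routed expert is the same size and that the total splits cleanly into an always-on part (attention, embeddings, any dense or shared layers) plus a pool of routed experts. The function names are my own, and the inputs are just the figures from the list above.

```python
def estimate_active_params(total_b, always_on_b, num_experts, experts_per_token):
    """Estimate active parameters per token, in billions.

    Assumes equally sized routed experts and that everything not counted
    in always_on_b belongs to the routed expert pool.
    """
    expert_pool_b = total_b - always_on_b
    return always_on_b + expert_pool_b * experts_per_token / num_experts

def estimate_always_on_params(total_b, active_b, num_experts, experts_per_token):
    """Invert the estimate above: infer the always-on part from total and active counts."""
    frac = experts_per_token / num_experts
    return (active_b - frac * total_b) / (1 - frac)

# Arctic, using the list's figures: 480b total, 10b dense, 128 experts, 2 chosen.
arctic_active = estimate_active_params(480, 10, 128, 2)
print(f"Arctic: ~{arctic_active:.1f}b active")  # → Arctic: ~17.3b active

# Mixtral 8x22b, using the list's figures: 141b total, 39b active, 8 experts, 2 chosen.
mixtral_always_on = estimate_always_on_params(141, 39, 8, 2)
print(f"Mixtral 8x22b: ~{mixtral_always_on:.1f}b always-on (implied)")  # → ~5.0b
```

The Arctic result lines up with the listed 17b active (10b dense plus roughly 7b sparse), which suggests the approximation is reasonable for that model. For Switch-C and Grok-1 the missing piece is the always-on part, which isn't in this post, so the estimate can't be filled in without that number.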
There have been a lot of smaller MoEs released too, and I might make a similar list of them if I get around to it. The smaller MoEs are certainly a lot more accessible, and such a list may be more useful for most people.
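For a sense of why the large MoEs are hard to run locally, here is a rough sketch of the memory needed just to hold the weights. Every expert has to be loaded even though only a few are active per token, so the estimate uses total parameters. It assumes a flat number of bits per weight and ignores the KV cache, activations, and quantization overhead, so real requirements will be higher. The function name is my own.

```python
def weight_memory_gb(total_params_b, bits_per_weight):
    """Approximate memory for the weights alone, in GB (1 GB = 1e9 bytes).

    total_params_b is in billions, so billions * bits / 8 gives GB directly.
    """
    return total_params_b * bits_per_weight / 8

# Total parameter counts (in billions) taken from the list above.
models = {"Mixtral 8x22b": 141, "DeepSeek V2.5": 236, "Hunyuan Large": 389}

for name, total_b in models.items():
    sizes = ", ".join(
        f"{bits}-bit ≈ {weight_memory_gb(total_b, bits):.1f} GB" for bits in (16, 8, 4)
    )
    print(f"{name}: {sizes}")
```

By this estimate, DeepSeek V2.5 at 4 bits per weight still needs about 118 GB for the weights alone.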