The Allen Institute for AI (AI2) has released a new open-source model designed to address the need for a large language model (LLM) that is both powerful and cost-effective.
The new model, called OLMoE, uses a sparse mixture-of-experts (MoE) architecture. It has 7 billion parameters but uses only 1 billion parameters per input token. It comes in two versions: OLMoE-1B-7B, a general-purpose model, and OLMoE-1B-7B-Instruct, which is tuned for instruction following.
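For readers who want to try the released checkpoints, here is a minimal loading sketch using Hugging Face transformers. The repo ID and generation settings are assumptions for illustration, not details from AI2's announcement; check AI2's release notes for the exact model name and required library versions.

```python
# Hypothetical usage sketch: loading an OLMoE checkpoint with Hugging Face transformers.
# The repo ID below is an assumption; confirm the exact name in AI2's release notes.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "allenai/OLMoE-1B-7B-0924"  # assumed repo ID
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

inputs = tokenizer("Mixture-of-experts models are", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```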
AI2 emphasized that, unlike other mixture-of-experts models, OLMoE is fully open source.
“However, most MoE models are closed source: while some have made their model weights publicly available, they provide limited to no details about their training data, code, or recipes,” AI2 explained in its paper. “The lack of open resources and insights into these details prevents the development of cost-effective open MoEs in this area that approach the capabilities of closed-source frontier models.”
This makes most MoE models inaccessible to many academics and other researchers.
Nathan Lambert, an AI2 researcher, posted on X (formerly Twitter) that OLMoE “will help policy… this can be a starting point when academic H100 clusters come online.”
Lambert added that the models are part of AI2's goal of building open-source models that perform as well as closed ones.
“We haven't changed our organization or our goals at all since our first OLMo models. We're just slowly improving our open-source infrastructure and data. You can use that too. We've released a truly state-of-the-art model, not just one that's best on one or two evaluations,” he said.
How OLMoE is structured
AI2 explained that when developing OLMoE, it opted for fine-grained routing over 64 small experts, only eight of which are activated at a time. Experiments showed that the model performs as well as comparable models but with significantly lower inference costs and smaller memory requirements.
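To make the fine-grained routing concrete, below is a minimal, illustrative sketch of a sparse MoE layer with 64 small experts and top-8 routing, written in PyTorch. The layer sizes and names are hypothetical and are not taken from the OLMoE codebase; it is a sketch of the general technique, not AI2's implementation.

```python
# Minimal sketch of a sparse mixture-of-experts layer with top-k routing.
# Hypothetical sizes and names for illustration only; not the OLMoE implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseMoELayer(nn.Module):
    def __init__(self, d_model=2048, d_hidden=1024, num_experts=64, top_k=8):
        super().__init__()
        self.top_k = top_k
        # Router scores each token against every expert.
        self.router = nn.Linear(d_model, num_experts, bias=False)
        # Many small feed-forward experts (the "fine-grained" design choice).
        self.experts = nn.ModuleList(
            nn.Sequential(
                nn.Linear(d_model, d_hidden),
                nn.GELU(),
                nn.Linear(d_hidden, d_model),
            )
            for _ in range(num_experts)
        )

    def forward(self, x):                         # x: (num_tokens, d_model)
        scores = self.router(x)                   # (num_tokens, num_experts)
        weights, indices = scores.topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)      # normalize over the chosen experts
        out = torch.zeros_like(x)
        # Only the selected experts run for each token, so per-token compute
        # tracks the small "active" parameter count, not the full total.
        for slot in range(self.top_k):
            for e in indices[:, slot].unique().tolist():
                mask = indices[:, slot] == e      # tokens routed to expert e in this slot
                out[mask] += weights[mask, slot].unsqueeze(-1) * self.experts[e](x[mask])
        return out

# Quick check of the shapes:
layer = SparseMoELayer()
tokens = torch.randn(16, 2048)
print(layer(tokens).shape)  # torch.Size([16, 2048])
```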
OLMoE builds on AI2's previous open-source model, OLMo 1.7-7B, which supported a context window of 4,096 tokens, as well as on the Dolma 1.7 training dataset AI2 developed for OLMo. OLMoE was trained on a mix of data from DCLM and Dolma, including a filtered subset of Common Crawl, Dolma CC, RefinedWeb, StarCoder, C4, Stack Exchange, OpenWebMath, Project Gutenberg, Wikipedia, and others.
According to AI2, OLMoE “outperforms all available models with similar active parameters, even outperforming larger models such as Llama2-13B-Chat and DeepSeekMoE-16B.” In benchmark tests, OLMoE-1B-7B often performed on par with other models of 7B parameters or more, such as Mistral-7B, Llama 3.1, and Gemma 2. In benchmarks against 1B-parameter models, however, OLMoE-1B-7B outperformed other open-source models such as Pythia, TinyLlama, and even AI2's own OLMo.
Open source mixture of experts
One of AI2's goals is to make more fully open-source AI models available to researchers, including for MoE, which is quickly becoming a popular model architecture among developers.
Many AI model developers build on the MoE architecture. Mistral's Mixtral 8x22B, for example, used a sparse MoE system, and Grok, X.ai's AI model, used the same approach, while rumors persist that GPT-4 also relies on MoE.
However, AI2 emphasizes that many of these other AI models do not offer complete openness and do not provide details about their training data or source code.
“This is happening even though MoEs call for more openness, as they introduce complex new design questions for LMs, such as how many total versus active parameters to use, whether to use many small or a few large experts, whether experts should be shared, and which routing algorithm to use,” the company said.
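As a rough illustration of the total-versus-active trade-off mentioned above, the arithmetic below uses made-up per-expert sizes (not OLMoE's real parameter breakdown) to show how activating only 8 of 64 experts keeps the active parameter count far below the total.

```python
# Illustrative arithmetic only; the per-expert and shared sizes are invented.
shared = 0.3e9        # parameters outside the experts (attention, embeddings, ...)
per_expert = 0.105e9  # parameters in one expert's feed-forward block
num_experts, top_k = 64, 8

total_params = shared + num_experts * per_expert   # every expert counts toward the total
active_params = shared + top_k * per_expert        # only the routed experts run per token
print(f"total ≈ {total_params/1e9:.1f}B, active ≈ {active_params/1e9:.1f}B")
# total ≈ 7.0B, active ≈ 1.1B
```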
The Open Source Initiative, which defines what makes something open source and advocates for it, has begun to examine what open source means for AI models.