Yuan 2.0-M32: Mixture of Experts with Attention Router

Read original: arXiv:2405.17976 - Published 5/30/2024 by Shaohua Wu, Jiangang Luo, Xi Chen, Lingjun Li, Xudong Zhao, Tong Yu, Chao Wang, Yue Wang, Fei Wang, Weixu Qiao and 5 others

Overview

Yuan 2.0-M32 is a large language model with a mixture-of-experts architecture
It has 32 expert modules, with 2 active at a time, selected by a new "Attention Router" network
The model was trained from scratch on 2000B tokens and performs competitively on coding, math, and various domains of expertise
Its training computation is only 9.25% of that of a dense model at the same parameter scale
The model outperforms the much larger Llama3-70B on the MATH and ARC-Challenge benchmarks, despite having far fewer active parameters

Plain English Explanation

Yuan 2.0-M32 is an artificial intelligence (AI) model designed to be efficient while remaining capable across a wide range of tasks. At its core, the model has a "mixture-of-experts" architecture, which means it contains 32 parallel modules, or "experts," that can each specialize in different kinds of input.

As the model processes its input, a special "Attention Router" network decides which 2 of the 32 experts are the most relevant and should be activated. This allows the model to spend its computing power only on the experts that matter for the current input, rather than running every parameter for everything.

The researchers trained this model from scratch on an enormous dataset of 2000 billion tokens, which gives it a deep well of knowledge to draw from. Even though the model has a total of 40 billion parameters, only 3.7 billion of them are active at any one time. This makes it much cheaper to run than a traditional "dense" AI model of the same size, which uses all of its parameters for every input.

In fact, the researchers report that Yuan 2.0-M32 outperforms the larger Llama3-70B model, which has 70 billion parameters, on the MATH and ARC-Challenge benchmarks, while its active parameter count and its forward computation per token are each only about 1/19 of Llama3-70B's. The model also shows competitive capability in coding, math, and various domains of expertise.

The models and source code for Yuan 2.0-M32 have been released on GitHub, allowing other researchers and developers to build upon this approach to efficient and capable AI systems.

Technical Explanation

Yuan 2.0-M32 is a large language model whose base architecture is similar to that of the Yuan-2.0 2B model, extended with a mixture-of-experts design. It has 32 expert modules, of which only 2 are active at a time.
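To make the "32 experts, 2 active" idea concrete, here is a minimal sketch of a sparse mixture-of-experts forward pass in plain Python. It is an illustration only, not the released implementation: the toy hidden size, the random linear "experts", the random stand-in router scores, and the names `make_expert` and `moe_forward` are all assumptions made for this example. Only the counts 32 and 2 come from the summary.

```python
import math
import random

NUM_EXPERTS = 32  # from the summary
TOP_K = 2         # experts active at a time, from the summary
DIM = 8           # toy hidden size (assumption)


def make_expert(rng, dim):
    """A toy 'expert': one random linear map (real experts are larger networks)."""
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(dim)] for _ in range(dim)]

    def expert(x):
        return [sum(w * xi for w, xi in zip(row, x)) for row in weights]

    return expert


def softmax(values):
    peak = max(values)  # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def moe_forward(x, experts, router_scores, top_k=TOP_K):
    """Run only the top_k highest-scoring experts and mix their outputs."""
    ranked = sorted(range(len(experts)), key=lambda i: router_scores[i], reverse=True)
    chosen = ranked[:top_k]
    # Renormalise the scores over the chosen experts (a common choice, assumed here).
    gates = softmax([router_scores[i] for i in chosen])
    output = [0.0] * len(x)
    for gate, idx in zip(gates, chosen):
        expert_out = experts[idx](x)  # the other 30 experts are never evaluated
        output = [o + gate * e for o, e in zip(output, expert_out)]
    return output, chosen


rng = random.Random(0)
experts = [make_expert(rng, DIM) for _ in range(NUM_EXPERTS)]
token = [rng.uniform(-1, 1) for _ in range(DIM)]
scores = [rng.uniform(-1, 1) for _ in range(NUM_EXPERTS)]  # stand-in for a router's output
out, chosen = moe_forward(token, experts, scores)
print("active experts:", chosen)
```

The point of the sketch is that the cost of one forward pass depends on the number of experts actually run, not on how many exist.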

A new component called the "Attention Router" network is proposed and used to efficiently select the most relevant experts for each input. This allows the model to focus its computational resources on the appropriate areas, boosting accuracy by 3.8% compared to a classical router network.
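This summary does not give the router's equations, so the following is only a hedged sketch of how an attention-style router could differ from a classical one. The classical router below is the familiar "one linear projection, then softmax" gate. The `attention_router` function is an assumed form modelled on standard attention (query, key, and value projections of the token, each with one entry per expert) and may not match the paper's exact formulation. All weights are random and all names are hypothetical.

```python
import math
import random

NUM_EXPERTS = 32  # from the summary
TOP_K = 2         # from the summary
DIM = 16          # toy hidden size (assumption)


def softmax(values):
    peak = max(values)  # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(matrix, vector):
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def random_matrix(rng, rows, cols):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def classical_router(token, w_gate):
    """Baseline gate: each expert is scored independently of the others."""
    return softmax(matvec(w_gate, token))


def attention_router(token, w_q, w_k, w_v):
    """Assumed attention-style gate: scores = softmax(q k^T) v.

    q, k and v each hold one value per expert, so every expert's final
    score depends on how its query interacts with all experts' keys.
    """
    q = matvec(w_q, token)
    k = matvec(w_k, token)
    v = matvec(w_v, token)
    scores = []
    for q_i in q:
        attn = softmax([q_i * k_j for k_j in k])  # one row of softmax(q k^T)
        scores.append(sum(a * v_j for a, v_j in zip(attn, v)))
    return scores


def select_experts(scores, top_k=TOP_K):
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:top_k]


rng = random.Random(0)
token = [rng.uniform(-1, 1) for _ in range(DIM)]
w_gate = random_matrix(rng, NUM_EXPERTS, DIM)
w_q, w_k, w_v = (random_matrix(rng, NUM_EXPERTS, DIM) for _ in range(3))

classical_probs = classical_router(token, w_gate)
attention_scores = attention_router(token, w_q, w_k, w_v)
print("classical picks:", select_experts(classical_probs))
print("attention picks:", select_experts(attention_scores))
```

The design difference the sketch tries to show is that the classical gate scores experts one by one, while the attention-style gate lets the scores of different experts influence each other before the top 2 are chosen.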

The model was trained from scratch on a massive dataset of 2000 billion tokens. Despite its 40 billion total parameters, only 3.7 billion are active during inference. This makes Yuan 2.0-M32 significantly cheaper than a dense model of the same scale: its training computation is only 9.25% of what such a dense model would require.
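The efficiency figures can be sanity-checked with a few lines of arithmetic. The sketch below assumes that compute scales with the number of active parameters and uses the common rule of thumb of roughly 2 floating-point operations per active parameter per token. Neither assumption is stated in this summary, so treat the result as a consistency check rather than the authors' derivation. The 7.4 GFlops figure it compares against is the one reported in the paper's abstract.

```python
total_params_b = 40.0    # total parameters, in billions (from the summary)
active_params_b = 3.7    # active parameters, in billions (from the summary)
llama3_params_b = 70.0   # Llama3-70B, a dense model (from the summary)

# Share of parameters that are active: matches the reported 9.25%.
active_fraction = active_params_b / total_params_b

# How many times more parameters Llama3-70B uses per token: close to 19.
llama_ratio = llama3_params_b / active_params_b

# Assumed rule of thumb: about 2 FLOPs per active parameter per token.
gflops_per_token = 2 * active_params_b

print(f"active fraction: {active_fraction:.2%}")
print(f"Llama3-70B / active params: {llama_ratio:.1f}x")
print(f"estimated forward compute: {gflops_per_token:.1f} GFLOPs per token")
```

Under these assumptions, 3.7 / 40 = 9.25%, 70 / 3.7 is about 18.9, and 2 × 3.7 = 7.4, which is consistent with the 9.25%, 1/19, and 7.4 GFlops figures quoted for the model.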

Experiments show that Yuan 2.0-M32 has competitive capabilities across a variety of domains, including coding, math, and general expertise. On the MATH and ARC-Challenge benchmarks, it outperforms the larger Llama3-70B model, achieving accuracy scores of 55.89 and 95.8 respectively.

Critical Analysis

The paper on Yuan 2.0-M32 presents an interesting and promising approach to building large language models that are both highly capable and computationally efficient. The use of a mixture-of-experts architecture and the novel Attention Router network seem to be effective in selectively activating the most relevant parts of the model for each task, leading to significant performance gains.

However, the paper does not provide a deep analysis of the limitations or potential downsides of this approach. For example, it's unclear how the model's performance and efficiency scale as the number of experts is increased beyond 32, or how the Attention Router network's complexity affects training and inference times.

Additionally, the paper focuses primarily on benchmarks and does not discuss the model's real-world performance or potential societal impacts. It would be valuable to see further research on how Yuan 2.0-M32 handles tasks like open-ended problem solving, following instructions, and dealing with biases and ethical considerations.

Overall, the technical innovations presented in this paper are promising, but more comprehensive evaluation and discussion of the model's limitations and real-world implications would be valuable for the research community and the public.

Conclusion

Yuan 2.0-M32 is a highly efficient and capable large language model that demonstrates the potential of mixture-of-experts architectures and selective activation mechanisms. By training on a vast dataset and leveraging its 32 expert modules, the model is able to achieve strong performance across a range of tasks while using significantly fewer computational resources than a traditional dense model of the same scale.

The model's ability to outperform the larger Llama3-70B on certain benchmarks is particularly impressive and highlights the benefits of this innovative approach. As the field of AI continues to advance, efficient and adaptable models like Yuan 2.0-M32 will play an increasingly important role in developing practical, real-world applications that can be deployed at scale.

The open-sourcing of the model and its source code is also a welcome development, as it allows other researchers and developers to build upon this work and explore the broader implications of mixture-of-experts architectures for the future of artificial intelligence.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Yuan 2.0-M32: Mixture of Experts with Attention Router

Shaohua Wu, Jiangang Luo, Xi Chen, Lingjun Li, Xudong Zhao, Tong Yu, Chao Wang, Yue Wang, Fei Wang, Weixu Qiao, Houbo He, Zeru Zhang, Zeyu Sun, Junxiong Mao, Chong Shen

Yuan 2.0-M32, with a similar base architecture as Yuan-2.0 2B, uses a mixture-of-experts architecture with 32 experts of which 2 experts are active. A new router network, Attention Router, is proposed and adopted for a more efficient selection of experts, which improves the accuracy compared to the model with classical router network. Yuan 2.0-M32 is trained with 2000B tokens from scratch, and the training computation consumption is only 9.25% of a dense model at the same parameter scale. Yuan 2.0-M32 demonstrates competitive capability on coding, math, and various domains of expertise, with only 3.7B active parameters of 40B in total, and 7.4 GFlops forward computation per token, both of which are only 1/19 of Llama3-70B. Yuan 2.0-M32 surpasses Llama3-70B on the MATH and ARC-Challenge benchmarks, with accuracy of 55.89 and 95.8 respectively. The models and source code of Yuan 2.0-M32 are released on GitHub.

5/30/2024

Qwen2 Technical Report

An Yang, Baosong Yang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Zhou, Chengpeng Li, Chengyuan Li, Dayiheng Liu, Fei Huang, Guanting Dong, Haoran Wei, Huan Lin, Jialong Tang, Jialin Wang, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Ma, Jianxin Yang, Jin Xu, Jingren Zhou, Jinze Bai, Jinzheng He, Junyang Lin, Kai Dang, Keming Lu, Keqin Chen, Kexin Yang, Mei Li, Mingfeng Xue, Na Ni, Pei Zhang, Peng Wang, Ru Peng, Rui Men, Ruize Gao, Runji Lin, Shijie Wang, Shuai Bai, Sinan Tan, Tianhang Zhu, Tianhao Li, Tianyu Liu, Wenbin Ge, Xiaodong Deng, Xiaohuan Zhou, Xingzhang Ren, Xinyu Zhang, Xipin Wei, Xuancheng Ren, Xuejing Liu, Yang Fan, Yang Yao, Yichang Zhang, Yu Wan, Yunfei Chu, Yuqiong Liu, Zeyu Cui, Zhenru Zhang, Zhifang Guo, Zhihao Fan

This report introduces the Qwen2 series, the latest addition to our large language models and large multimodal models. We release a comprehensive suite of foundational and instruction-tuned language models, encompassing a parameter range from 0.5 to 72 billion, featuring dense models and a Mixture-of-Experts model. Qwen2 surpasses most prior open-weight models, including its predecessor Qwen1.5, and exhibits competitive performance relative to proprietary models across diverse benchmarks on language understanding, generation, multilingual proficiency, coding, mathematics, and reasoning. The flagship model, Qwen2-72B, showcases remarkable performance: 84.2 on MMLU, 37.9 on GPQA, 64.6 on HumanEval, 89.5 on GSM8K, and 82.4 on BBH as a base language model. The instruction-tuned variant, Qwen2-72B-Instruct, attains 9.1 on MT-Bench, 48.1 on Arena-Hard, and 35.7 on LiveCodeBench. Moreover, Qwen2 demonstrates robust multilingual capabilities, proficient in approximately 30 languages, spanning English, Chinese, Spanish, French, German, Arabic, Russian, Korean, Japanese, Thai, Vietnamese, and more, underscoring its versatility and global reach. To foster community innovation and accessibility, we have made the Qwen2 model weights openly available on Hugging Face and ModelScope, and the supplementary materials including example code on GitHub. These platforms also include resources for quantization, fine-tuning, and deployment, facilitating a wide range of applications and research endeavors.

9/11/2024

Routers in Vision Mixture of Experts: An Empirical Study

Tianlin Liu, Mathieu Blondel, Carlos Riquelme, Joan Puigcerver

Mixture-of-Experts (MoE) models are a promising way to scale up model capacity without significantly increasing computational cost. A key component of MoEs is the router, which decides which subset of parameters (experts) process which feature embeddings (tokens). In this paper, we present a comprehensive study of routers in MoEs for computer vision tasks. We introduce a unified MoE formulation that subsumes different MoEs with two parametric routing tensors. This formulation covers both sparse MoE, which uses a binary or hard assignment between experts and tokens, and soft MoE, which uses a soft assignment between experts and weighted combinations of tokens. Routers for sparse MoEs can be further grouped into two variants: Token Choice, which matches experts to each token, and Expert Choice, which matches tokens to each expert. We conduct head-to-head experiments with 6 different routers, including existing routers from prior work and new ones we introduce. We show that (i) many routers originally developed for language modeling can be adapted to perform strongly in vision tasks, (ii) in sparse MoE, Expert Choice routers generally outperform Token Choice routers, and (iii) soft MoEs generally outperform sparse MoEs with a fixed compute budget. These results provide new insights regarding the crucial role of routers in vision MoE models.

4/22/2024

360Zhinao Technical Report

360Zhinao Team

We present 360Zhinao models with 7B parameter size and context lengths spanning 4K, 32K and 360K, all available at https://github.com/Qihoo360/360zhinao. For rapid development in pretraining, we establish a stable and sensitive ablation environment to evaluate and compare experiment runs with minimal model size. Under such guidance, we perfect our data cleaning and composition strategies to pretrain 360Zhinao-7B-Base on 3.4T tokens. We also mainly emphasize data during alignment, where we strive to balance quantity and quality with filtering and reformatting. With tailored data, 360Zhinao-7B's context window is easily extended to 32K and 360K. RMs and RLHF are trained following SFT and credibly applied to specific tasks. All together these contributions lead to 360Zhinao-7B's competitive performance among models of similar size.

5/24/2024