LocMoE+: Enhanced Router with Token Feature Awareness for Efficient LLM Pre-Training

2406.00023

Published 6/4/2024 by Jing Li, Zhijie Sun, Dachao Lin, Xuan He, Yi Lin, Binfan Zheng, Li Zeng, Rongqian Zhao, Xin Chen

Abstract

Mixture-of-Experts (MoE) architectures have recently gained increasing popularity within the domain of large language models (LLMs) due to their ability to significantly reduce training and inference overhead. However, MoE architectures face challenges, such as significant disparities in the number of tokens assigned to each expert and a tendency toward homogenization among experts, which adversely affects the model's semantic generation capabilities. In this paper, we introduce LocMoE+, a refined version of the low-overhead LocMoE, incorporating the following enhancements: (1) Quantification and definition of the affinity between experts and tokens. (2) Implementation of a global-level adaptive routing strategy to rearrange tokens based on their affinity scores. (3) Reestimation of the lower bound for expert capacity, which has been shown to progressively decrease as the token feature distribution evolves. Experimental results demonstrate that, without compromising model convergence or efficacy, the number of tokens each expert processes can be reduced by over 60%. Combined with communication optimizations, this leads to an average improvement in training efficiency ranging from 5.4% to 46.6%. After fine-tuning, LocMoE+ exhibits a performance improvement of 9.7% to 14.1% across the GDAD, C-Eval, and TeleQnA datasets.

Overview

This paper introduces LocMoE+, an enhanced router for Mixture of Experts (MoE) models that improves the efficiency of large language model (LLM) pre-training.
LocMoE+ incorporates token feature awareness to route each token to the experts it has the highest affinity with; the abstract reports an average training-efficiency improvement of 5.4% to 46.6%.
The proposed approach builds on previous work on MoE models, such as LocMoE, Routers, HyperMoE, Multi-Head MoE, and Unchosen Experts.

Plain English Explanation

The paper describes a new way to improve the efficiency of large language models, which are AI systems trained on vast amounts of text data to understand and generate human-like language. These models are often very complex and resource-intensive to train and use.

The key idea is to use a Mixture of Experts (MoE) approach, where the model is divided into several "expert" sub-models, each specializing in different types of language tasks. When processing new text, the model uses a "router" to quickly determine which expert(s) are best suited to handle each part of the text.

The researchers' LocMoE+ router builds on this concept by incorporating "token feature awareness" - essentially, the router can better understand the specific characteristics of each word or token in the text, and route it to the most appropriate expert(s). This leads to faster training and inference (using the model) compared to previous MoE approaches.

By making large language models more efficient, the LocMoE+ approach could help reduce the significant computational and energy resources required to develop and deploy these powerful AI systems. This could have important implications for the accessibility and real-world application of advanced language technologies.

Technical Explanation

The paper introduces LocMoE+, an enhanced router for Mixture of Experts (MoE) models that aims to improve the efficiency of large language model (LLM) pre-training. MoE models divide the overall model into several "expert" sub-models, each specializing in different types of language tasks. A router is used to quickly determine which expert(s) are best suited to handle each part of the input text.
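The routing step described above can be illustrated with a minimal sketch of generic top-k routing. This is not the LocMoE+ router itself: the function names, the toy gating vectors, and the choice of a softmax over dot-product scores are assumptions made for illustration only.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def route_top_k(token, gate_weights, k=1):
    """Return [(expert_index, gate_probability)] for the k best experts.

    token        -- feature vector of one token (list of floats)
    gate_weights -- one gating vector per expert (hypothetical router parameters)
    """
    # Score each expert by the dot product of its gating vector with the token.
    logits = [sum(t * w for t, w in zip(token, weights)) for weights in gate_weights]
    probs = softmax(logits)
    # Rank experts from highest to lowest gate probability and keep the top k.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return [(i, probs[i]) for i in ranked[:k]]

# Toy example (made-up numbers): 3 experts, 2-dimensional token features.
gate_weights = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
tokens = [[2.0, 0.1], [0.2, 1.5], [1.0, 0.9]]
assignments = [route_top_k(tok, gate_weights, k=1)[0][0] for tok in tokens]
print(assignments)
```

In this toy setup two of the three tokens land on expert 0 and none on expert 2, a small-scale picture of the uneven token-to-expert assignment that the abstract names as a challenge for MoE architectures.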

LocMoE+ builds on previous work, such as LocMoE, Routers, HyperMoE, Multi-Head MoE, and Unchosen Experts, by incorporating "token feature awareness" into the router. This allows the router to better understand the specific characteristics of each word or token in the input text and route it to the most appropriate expert(s).
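To make "token feature awareness" more concrete, here is a hedged sketch in which the affinity between an expert and a token is taken to be the cosine similarity between the expert's gating vector and the token's feature vector, and each expert keeps only its highest-affinity tokens up to a fixed capacity. The summary does not give the paper's exact affinity definition or routing rule, so the cosine measure, the `threshold` parameter, and all function names here are assumptions.

```python
import math

def cosine(u, v):
    """Cosine similarity of two vectors; 0.0 if either has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu > 0 and nv > 0 else 0.0

def affinity_matrix(tokens, gate_weights):
    """affinity[e][t]: assumed affinity between expert e and token t."""
    return [[cosine(w, tok) for tok in tokens] for w in gate_weights]

def select_by_affinity(tokens, gate_weights, capacity, threshold=0.0):
    """For each expert, return the indices of the tokens it keeps.

    An expert keeps at most `capacity` tokens, highest affinity first, and
    skips any token whose affinity is not above `threshold` (hypothetical).
    """
    selection = []
    for scores in affinity_matrix(tokens, gate_weights):
        ranked = sorted(range(len(tokens)), key=lambda t: scores[t], reverse=True)
        kept = [t for t in ranked if scores[t] > threshold][:capacity]
        selection.append(kept)
    return selection

# Toy example (made-up numbers): 2 experts, 4 tokens with 2-dimensional features.
gate_weights = [[1.0, 0.0], [0.0, 1.0]]
tokens = [[2.0, 0.1], [0.2, 1.5], [1.0, 0.9], [-1.0, -1.0]]
selected = select_by_affinity(tokens, gate_weights, capacity=2, threshold=0.5)
print(selected)
```

Here the last token has negative affinity with both experts and is processed by neither. Dropping low-affinity tokens in this way is one route by which the number of tokens each expert handles could shrink, which is the kind of reduction the abstract describes.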

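The paper presents experiments that demonstrate the effectiveness of LocMoE+ in improving the efficiency of LLM pre-training. According to the abstract, the number of tokens each expert processes can be reduced by over 60% without compromising model convergence or efficacy, and, combined with communication optimizations, this yields an average training-efficiency improvement of 5.4% to 46.6%. After fine-tuning, LocMoE+ shows a performance improvement of 9.7% to 14.1% across the GDAD, C-Eval, and TeleQnA datasets. The authors attribute these gains to the enhanced token feature awareness of the LocMoE+ router.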

Critical Analysis

The paper provides a compelling approach to improving the efficiency of large language models, which is a crucial challenge in the field of AI. The incorporation of token feature awareness into the router is a novel and promising idea that builds on previous research in MoE models.

However, the paper does not fully address the potential limitations or drawbacks of the LocMoE+ approach. For example, it is unclear how the token feature awareness mechanism might perform in the face of highly diverse or specialized language tasks, or how it would scale to even larger and more complex language models.

Additionally, the paper focuses primarily on the technical aspects of the LocMoE+ architecture and its performance metrics, but does not explore the broader implications or potential societal impact of making large language models more efficient and accessible. Further research could investigate these areas and provide a more holistic understanding of the significance of this work.

Conclusion

The LocMoE+ paper presents an innovative approach to improving the efficiency of large language model pre-training by incorporating token feature awareness into the router of a Mixture of Experts (MoE) model. This work builds on previous research in MoE architectures and has the potential to reduce the significant computational and energy resources required to develop and deploy advanced language technologies.

By making large language models more efficient, the LocMoE+ approach could help increase the accessibility and real-world application of these powerful AI systems, with important implications for a wide range of natural language processing tasks and applications. Further research is needed to fully understand the limitations and broader implications of this work, but the findings presented in this paper represent an important step forward in the ongoing effort to make large language models more efficient and impactful.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

LocMoE: A Low-Overhead MoE for Large Language Model Training

Jing Li, Zhijie Sun, Xuan He, Li Zeng, Yi Lin, Entong Li, Binfan Zheng, Rongqian Zhao, Xin Chen

The Mixtures-of-Experts (MoE) model is a widespread distributed and integrated learning method for large language models (LLM), which is favored due to its ability to sparsify and expand models efficiently. However, the performance of MoE is limited by load imbalance and high latency of All-to-All communication, along with relatively redundant computation owing to large expert capacity. Load imbalance may result from existing routing policies that consistently tend to select certain experts. The frequent inter-node communication in the All-to-All procedure also significantly prolongs the training time. To alleviate the above performance problems, we propose a novel routing strategy that combines load balance and locality by converting partial inter-node communication to that of intra-node. Notably, we elucidate that there is a minimum threshold for expert capacity, calculated through the maximal angular deviation between the gating weights of the experts and the assigned tokens. We port these modifications on the PanGu-Sigma model based on the MindSpore framework with multi-level routing and conduct experiments on Ascend clusters. The experiment results demonstrate that the proposed LocMoE reduces training time per epoch by 12.68% to 22.24% compared to classical routers, such as hash router and switch router, without impacting the model accuracy.

5/24/2024

cs.LG cs.AI cs.CL

AdaMoE: Token-Adaptive Routing with Null Experts for Mixture-of-Experts Language Models

Zihao Zeng, Yibo Miao, Hongcheng Gao, Hao Zhang, Zhijie Deng

Mixture of experts (MoE) has become the standard for constructing production-level large language models (LLMs) due to its promise to boost model capacity without causing significant overheads. Nevertheless, existing MoE methods usually enforce a constant top-k routing for all tokens, which is arguably restrictive because various tokens (e.g., vs. apple) may require various numbers of experts for feature abstraction. Lifting such a constraint can help make the most of limited resources and unleash the potential of the model for downstream tasks. In this sense, we introduce AdaMoE to realize token-adaptive routing for MoE, where different tokens are permitted to select a various number of experts. AdaMoE makes minimal modifications to the vanilla MoE with top-k routing -- it simply introduces a fixed number of null experts, which do not consume any FLOPs, to the expert set and increases the value of k. AdaMoE does not force each token to occupy a fixed number of null experts but ensures the average usage of the null experts with a load-balancing loss, leading to an adaptive number of null/true experts used by each token. AdaMoE exhibits a strong resemblance to MoEs with expert choice routing while allowing for trivial auto-regressive modeling. AdaMoE is easy to implement and can be effectively applied to pre-trained (MoE-)LLMs. Extensive studies show that AdaMoE can reduce average expert load (FLOPs) while achieving superior performance. For example, on the ARC-C dataset, applying our method to fine-tuning Mixtral-8x7B can reduce FLOPs by 14.5% while increasing accuracy by 1.69%.

6/21/2024

cs.AI

A Closer Look into Mixture-of-Experts in Large Language Models

Ka Man Lo, Zeyu Huang, Zihan Qiu, Zili Wang, Jie Fu

Mixture-of-experts (MoE) is gaining increasing attention due to its unique properties and remarkable performance, especially for language tasks. By sparsely activating a subset of parameters for each token, MoE architecture could increase the model size without sacrificing computational efficiency, achieving a better trade-off between performance and training costs. However, the underlying mechanism of MoE still lacks further exploration, and its modularization degree remains questionable. In this paper, we make an initial attempt to understand the inner workings of MoE-based large language models. Concretely, we comprehensively study the parametric and behavioral features of three recent MoE-based models and reveal some intriguing observations, including (1) Neurons act like fine-grained experts. (2) The router of MoE usually selects experts with larger output norms. (3) The expert diversity increases as the layer increases, while the last layer is an outlier. Based on the observations, we also provide suggestions for a broad spectrum of MoE practitioners, such as router design and expert allocation. We hope this work could shed light on future research on the MoE framework and other modular architectures. Code is available at https://github.com/kamanphoebe/Look-into-MoEs.

6/27/2024

cs.CL cs.LG

Routers in Vision Mixture of Experts: An Empirical Study

Tianlin Liu, Mathieu Blondel, Carlos Riquelme, Joan Puigcerver

Mixture-of-Experts (MoE) models are a promising way to scale up model capacity without significantly increasing computational cost. A key component of MoEs is the router, which decides which subset of parameters (experts) process which feature embeddings (tokens). In this paper, we present a comprehensive study of routers in MoEs for computer vision tasks. We introduce a unified MoE formulation that subsumes different MoEs with two parametric routing tensors. This formulation covers both sparse MoE, which uses a binary or hard assignment between experts and tokens, and soft MoE, which uses a soft assignment between experts and weighted combinations of tokens. Routers for sparse MoEs can be further grouped into two variants: Token Choice, which matches experts to each token, and Expert Choice, which matches tokens to each expert. We conduct head-to-head experiments with 6 different routers, including existing routers from prior work and new ones we introduce. We show that (i) many routers originally developed for language modeling can be adapted to perform strongly in vision tasks, (ii) in sparse MoE, Expert Choice routers generally outperform Token Choice routers, and (iii) soft MoEs generally outperform sparse MoEs with a fixed compute budget. These results provide new insights regarding the crucial role of routers in vision MoE models.

4/22/2024

cs.CV cs.AI cs.LG