LocMoE: A Low-Overhead MoE for Large Language Model Training

2401.13920

Published 5/24/2024 by Jing Li, Zhijie Sun, Xuan He, Li Zeng, Yi Lin, Entong Li, Binfan Zheng, Rongqian Zhao, Xin Chen

LocMoE: A Low-Overhead MoE for Large Language Model Training

Abstract

The Mixtures-of-Experts (MoE) model is a widespread distributed and integrated learning method for large language models (LLM), which is favored due to its ability to sparsify and expand models efficiently. However, the performance of MoE is limited by load imbalance and high latency of All-to-All communication, along with relatively redundant computation owing to large expert capacity. Load imbalance may result from existing routing policies that consistently tend to select certain experts. The frequent inter-node communication in the All-to-All procedure also significantly prolongs the training time. To alleviate the above performance problems, we propose a novel routing strategy that combines load balance and locality by converting partial inter-node communication to that of intra-node. Notably, we elucidate that there is a minimum threshold for expert capacity, calculated through the maximal angular deviation between the gating weights of the experts and the assigned tokens. We port these modifications on the PanGu-Sigma model based on the MindSpore framework with multi-level routing and conduct experiments on Ascend clusters. The experiment results demonstrate that the proposed LocMoE reduces training time per epoch by 12.68% to 22.24% compared to classical routers, such as hash router and switch router, without impacting the model accuracy.

Create account to get full access

Overview

This paper introduces LocMoE, a low-overhead Mixture-of-Experts (MoE) architecture for efficient training of large language models.
MoE is a model design that uses multiple expert networks to handle different types of inputs, with a routing mechanism that assigns each input to the most appropriate expert.
LocMoE aims to reduce the overhead of traditional MoE approaches, which can be computationally expensive, by using a simplified routing mechanism and other optimizations.

Plain English Explanation

In machine learning, large language models are powerful tools that can perform a wide variety of natural language tasks. However, training these models can be computationally intensive and resource-heavy.

One approach to making large language models more efficient is called Mixture-of-Experts (MoE). The basic idea behind MoE is to have multiple specialized "expert" networks, each of which is responsible for handling a different type of input. A routing mechanism then decides which expert network should process each piece of input.

While MoE can be effective, the routing mechanism in traditional MoE approaches can add significant overhead and complexity to the training process. This is where LocMoE comes in - it aims to reduce this overhead by using a simpler routing mechanism and other optimizations.

The key innovation in LocMoE is the way it assigns inputs to the expert networks. Instead of a complex routing network, LocMoE uses a more straightforward approach that assigns each input to the nearest expert based on its location in the input space. This "localized" routing is computationally less expensive than the approaches used in previous work on MoE for large language models.

Additionally, LocMoE incorporates other optimizations, such as weight sharing between experts, to further reduce the overhead of the MoE architecture. This makes it possible to train large language models using MoE in a more efficient and scalable way.

Technical Explanation

The core of LocMoE is its simplified routing mechanism, which assigns each input to the nearest expert network based on its location in the input space. This is in contrast to more complex routing networks used in previous MoE approaches that can add significant computational overhead.

LocMoE also employs weight sharing between the expert networks, which reduces the total number of parameters in the model. This is important for training large language models, where the model size can be a significant constraint.

The paper presents experiments comparing LocMoE to traditional MoE approaches, as well as to standard transformer-based language models. The results show that LocMoE can achieve similar performance to these other models, but with significant reductions in training time and computational cost.

Critical Analysis

The paper provides a thorough evaluation of LocMoE and demonstrates its advantages over traditional MoE approaches. However, the authors acknowledge that LocMoE may not be optimal for all types of inputs or tasks, and that the localized routing mechanism could potentially limit the model's ability to learn complex routing patterns.

Additionally, while LocMoE reduces the overhead of the MoE architecture, it still introduces some additional complexity compared to a standard transformer-based language model. The authors note that the choice between LocMoE and a simpler model will depend on the specific requirements and constraints of the target application.

Overall, LocMoE represents an interesting and promising approach to making MoE-based language models more efficient and practical for real-world use cases. The paper's insights could also inform future research into other low-overhead techniques for large language model training, such as the multi-head MoE or pre-gated MoE architectures.

Conclusion

The LocMoE paper introduces a novel approach to Mixture-of-Experts (MoE) architectures that significantly reduces the computational overhead associated with traditional MoE methods. By using a simplified routing mechanism and other optimizations, LocMoE makes it possible to train large language models with MoE in a more efficient and scalable way.

The paper's insights could have important implications for the development of powerful, yet resource-efficient, natural language processing systems. As the demands for large language models continue to grow, techniques like LocMoE may play a crucial role in making these models more accessible and practical for a wide range of real-world applications.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

A Closer Look into Mixture-of-Experts in Large Language Models

Ka Man Lo, Zeyu Huang, Zihan Qiu, Zili Wang, Jie Fu

Mixture-of-experts (MoE) is gaining increasing attention due to its unique properties and remarkable performance, especially for language tasks. By sparsely activating a subset of parameters for each token, MoE architecture could increase the model size without sacrificing computational efficiency, achieving a better trade-off between performance and training costs. However, the underlying mechanism of MoE still lacks further exploration, and its modularization degree remains questionable. In this paper, we make an initial attempt to understand the inner workings of MoE-based large language models. Concretely, we comprehensively study the parametric and behavioral features of three recent MoE-based models and reveal some intriguing observations, including (1) Neurons act like fine-grained experts. (2) The router of MoE usually selects experts with larger output norms. (3) The expert diversity increases as the layer increases, while the last layer is an outlier. Based on the observations, we also provide suggestions for a broad spectrum of MoE practitioners, such as router design and expert allocation. We hope this work could shed light on future research on the MoE framework and other modular architectures. Code is available at https://github.com/kamanphoebe/Look-into-MoEs.

6/27/2024

cs.CL cs.LG

LocMoE+: Enhanced Router with Token Feature Awareness for Efficient LLM Pre-Training

Jing Li, Zhijie Sun, Dachao Lin, Xuan He, Yi Lin, Binfan Zheng, Li Zeng, Rongqian Zhao, Xin Chen

Mixture-of-Experts (MoE) architectures have recently gained increasing popularity within the domain of large language models (LLMs) due to their ability to significantly reduce training and inference overhead. However, MoE architectures face challenges, such as significant disparities in the number of tokens assigned to each expert and a tendency toward homogenization among experts, which adversely affects the model's semantic generation capabilities. In this paper, we introduce LocMoE+, a refined version of the low-overhead LocMoE, incorporating the following enhancements: (1) Quantification and definition of the affinity between experts and tokens. (2) Implementation of a global-level adaptive routing strategy to rearrange tokens based on their affinity scores. (3) Reestimation of the lower bound for expert capacity, which has been shown to progressively decrease as the token feature distribution evolves. Experimental results demonstrate that, without compromising model convergence or efficacy, the number of tokens each expert processes can be reduced by over 60%. Combined with communication optimizations, this leads to an average improvement in training efficiency ranging from 5.4% to 46.6%. After fine-tuning, LocMoE+ exhibits a performance improvement of 9.7% to 14.1% across the GDAD, C-Eval, and TeleQnA datasets.

6/4/2024

cs.CL

HyperMoE: Towards Better Mixture of Experts via Transferring Among Experts

Hao Zhao, Zihan Qiu, Huijia Wu, Zili Wang, Zhaofeng He, Jie Fu

The Mixture of Experts (MoE) for language models has been proven effective in augmenting the capacity of models by dynamically routing each input token to a specific subset of experts for processing. Despite the success, most existing methods face a challenge for balance between sparsity and the availability of expert knowledge: enhancing performance through increased use of expert knowledge often results in diminishing sparsity during expert selection. To mitigate this contradiction, we propose HyperMoE, a novel MoE framework built upon Hypernetworks. This framework integrates the computational processes of MoE with the concept of knowledge transferring in multi-task learning. Specific modules generated based on the information of unselected experts serve as supplementary information, which allows the knowledge of experts not selected to be used while maintaining selection sparsity. Our comprehensive empirical evaluations across multiple datasets and backbones establish that HyperMoE significantly outperforms existing MoE methods under identical conditions concerning the number of experts.

5/22/2024

cs.LG cs.AI

Toward Inference-optimal Mixture-of-Expert Large Language Models

Longfei Yun, Yonghao Zhuang, Yao Fu, Eric P Xing, Hao Zhang

Mixture-of-Expert (MoE) based large language models (LLMs), such as the recent Mixtral and DeepSeek-MoE, have shown great promise in scaling model size without suffering from the quadratic growth of training cost of dense transformers. Like dense models, training MoEs requires answering the same question: given a training budget, what is the optimal allocation on the model size and number of tokens? We study the scaling law of MoE-based LLMs regarding the relations between the model performance, model size, dataset size, and the expert degree. Echoing previous research studying MoE in different contexts, we observe the diminishing return of increasing the number of experts, but this seems to suggest we should scale the number of experts until saturation, as the training cost would remain constant, which is problematic during inference time. We propose to amend the scaling law of MoE by introducing inference efficiency as another metric besides the validation loss. We find that MoEs with a few (4/8) experts are the most serving efficient solution under the same performance, but costs 2.5-3.5x more in training. On the other hand, training a (16/32) expert MoE much smaller (70-85%) than the loss-optimal solution, but with a larger training dataset is a promising setup under a training budget.

4/4/2024

cs.LG