Pre-gated MoE: An Algorithm-System Co-Design for Fast and Scalable Mixture-of-Expert Inference

arXiv:2308.12066

Published 4/30/2024 by Ranggi Hwang, Jianyu Wei, Shijie Cao, Changho Hwang, Xiaohu Tang, Ting Cao, Mao Yang

Abstract

Large language models (LLMs) based on transformers have made significant strides in recent years, the success of which is driven by scaling up their model size. Despite their high algorithmic performance, the computational and memory requirements of LLMs present unprecedented challenges. To tackle the high compute requirements of LLMs, the Mixture-of-Experts (MoE) architecture was introduced which is able to scale its model size without proportionally scaling up its computational requirements. Unfortunately, MoE's high memory demands and dynamic activation of sparse experts restrict its applicability to real-world problems. Previous solutions that offload MoE's memory-hungry expert parameters to CPU memory fall short because the latency to migrate activated experts from CPU to GPU incurs high performance overhead. Our proposed Pre-gated MoE system effectively tackles the compute and memory challenges of conventional MoE architectures using our algorithm-system co-design. Pre-gated MoE employs our novel pre-gating function which alleviates the dynamic nature of sparse expert activation, allowing our proposed system to address the large memory footprint of MoEs while also achieving high performance. We demonstrate that Pre-gated MoE is able to improve performance, reduce GPU memory consumption, while also maintaining the same level of model quality. These features allow our Pre-gated MoE system to cost-effectively deploy large-scale LLMs using just a single GPU with high performance.

Overview

Large language models (LLMs) based on transformers have achieved significant advancements, driven by scaling up their model size.
However, the high computational and memory requirements of LLMs present significant challenges.
To address the high compute requirements, the Mixture-of-Experts (MoE) architecture was introduced, allowing LLMs to scale without proportionally increasing computational requirements.
Unfortunately, MoE's high memory demands and dynamic activation of sparse experts limit its real-world applicability.
Previous solutions that offload MoE's memory-hungry expert parameters to CPU memory fall short due to high latency when migrating activated experts from CPU to GPU.

Plain English Explanation

Large language models (LLMs) are a type of artificial intelligence that has become increasingly powerful in recent years, largely because researchers have been able to make these models bigger and more complex, allowing them to process and generate language more effectively.

However, as these LLMs have grown in size and complexity, they have also become much more computationally and memory-intensive. This presents a significant challenge, as the hardware required to run these models can be expensive and power-hungry.

To address this issue, researchers developed an architecture called Mixture-of-Experts (MoE). The idea behind MoE is to split parts of the model into a number of "expert" sub-models and to activate only a small subset of them for each input, with a gating function deciding which experts to use. Because most experts stay idle for any given input, the model can grow in size without proportionally increasing the computational requirements.
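The sparse activation described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the toy scalar experts, the hand-picked router scores, and the helper names `top_k_route` and `moe_forward` are all hypothetical. In a real MoE layer the router scores would come from a learned layer applied to the input.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_route(router_logits, k):
    """Return (expert_index, renormalized_weight) pairs for the k best experts."""
    probs = softmax(router_logits)
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in chosen)
    return [(i, probs[i] / norm) for i in chosen]

def moe_forward(x, experts, router_logits, k=1):
    """Run only the selected experts and mix their outputs by gate weight."""
    routes = top_k_route(router_logits, k)
    output = sum(weight * experts[i](x) for i, weight in routes)
    return output, [i for i, _ in routes]

# Hypothetical toy experts: each one is just a scalar function.
experts = [lambda x: x + 1.0, lambda x: 2.0 * x, lambda x: x * x, lambda x: -x]

# Assumed router scores; expert 1 has the highest score, so only it runs.
output, used = moe_forward(3.0, experts, router_logits=[0.1, 2.0, 0.3, -1.0], k=1)
```

With `k=1`, only one of the four experts is evaluated per input, so the compute cost depends on `k` rather than on the total number of experts, while all four experts still have to be stored somewhere.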

Unfortunately, MoE has its own challenges. All of the expert sub-models must be kept in memory even though only a few are used at a time, and which experts are needed is only known at run time, once the gating function has made its choice. Previous attempts to address this, such as offloading the expert parameters to CPU memory, have fallen short because copying the activated experts from the CPU to the GPU adds high latency.
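To get a feel for why on-demand migration hurts, a rough estimate of the copy time for a single expert helps. Every number below is an illustrative assumption rather than a figure from the paper: the expert is modeled as two weight matrices of a feed-forward layer, stored in 16-bit precision, and copied over a host-to-device link with an assumed bandwidth of 16 GB/s.

```python
def expert_param_count(d_model, d_ff):
    """Parameters in a two-matrix feed-forward expert (biases ignored)."""
    return 2 * d_model * d_ff

def transfer_time_ms(num_params, bytes_per_param, bandwidth_bytes_per_s):
    """Idealized copy time in milliseconds, ignoring launch and setup overheads."""
    return num_params * bytes_per_param / bandwidth_bytes_per_s * 1000.0

# Assumed, illustrative sizes (not taken from the paper).
params = expert_param_count(d_model=1024, d_ff=4096)
copy_ms = transfer_time_ms(params, bytes_per_param=2, bandwidth_bytes_per_s=16e9)
```

Under these assumptions a single expert takes on the order of a millisecond to copy. If that copy can only start after the gate has picked the expert, the cost is paid in every MoE block and sits directly on the critical path of inference.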

Technical Explanation

The Pre-gated MoE system proposed in this paper aims to effectively tackle the compute and memory challenges of conventional MoE architectures through an algorithm-system co-design approach.

The key innovation in the Pre-gated MoE system is the introduction of a novel "pre-gating" function. This function helps to alleviate the dynamic nature of sparse expert activation, allowing the system to better manage the large memory footprint of MoE models while still achieving high performance.
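A simple timeline model shows why knowing the experts ahead of time matters. It assumes, as a reading of the pre-gating idea that this summary does not spell out, that the pre-gate in one MoE block selects the experts for the next block, so the CPU-to-GPU copy for the next block can overlap with the current block's expert computation. The function names and the timing values are hypothetical.

```python
def on_demand_latency(num_blocks, t_gate, t_transfer, t_compute):
    """Conventional offloading: gate, then copy experts, then compute, per block."""
    return num_blocks * (t_gate + t_transfer + t_compute)

def pre_gated_latency(num_blocks, t_gate, t_transfer, t_compute):
    """Assumed pre-gated schedule: the copy for block n+1 overlaps compute of block n."""
    # The first block has no earlier block to hide its gate and copy behind.
    total = t_gate + t_transfer
    for block in range(num_blocks):
        is_last = block == num_blocks - 1
        if is_last:
            # Nothing left to prefetch, so only the expert compute remains.
            total += t_compute
        else:
            # Run the pre-gate for the next block, then overlap its copy with compute.
            total += t_gate + max(t_compute, t_transfer)
    return total

# Illustrative timings in milliseconds (assumptions, not measurements).
baseline = on_demand_latency(num_blocks=4, t_gate=0.1, t_transfer=1.0, t_compute=1.5)
overlapped = pre_gated_latency(num_blocks=4, t_gate=0.1, t_transfer=1.0, t_compute=1.5)
```

In this model the copy time disappears from every block except the first whenever the copy is no slower than the expert computation. When the copy is slower, the block time is bounded by the copy instead, so the benefit depends on the ratio of transfer time to compute time.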

The authors demonstrate that their Pre-gated MoE system improves performance and reduces GPU memory consumption while maintaining the same level of model quality as conventional MoE approaches. These features enable cost-effective, high-performance deployment of large-scale LLMs on a single GPU.

Related work such as SEER-MoE, Multi-Head Mixture of Experts, and Dense Training, Sparse Inference explores different approaches to addressing the challenges of MoE architectures.

Critical Analysis

The paper presents a promising solution to the memory and performance challenges of MoE-based LLMs. The pre-gating function appears to be a clever way to mitigate the dynamic activation of sparse experts, which has been a key limitation of previous MoE approaches.

However, the paper does not provide a detailed analysis of the trade-offs or potential limitations of the Pre-gated MoE system. For example, it is unclear under which conditions the pre-gating function might degrade model quality, or how the approach scales as the number of experts increases.

Additionally, the paper could benefit from a more in-depth discussion of the potential real-world implications and applications of the proposed system. While the authors demonstrate improved performance and reduced memory consumption, it would be helpful to understand the specific use cases or deployment scenarios where Pre-gated MoE could be most beneficial.

Conclusion

The Pre-gated MoE system presented in this paper offers a promising solution to the computational and memory challenges faced by MoE-based large language models. By introducing a novel pre-gating function, the authors reduce the GPU memory needed to serve these models while preserving both performance and model quality.

The ability to deploy large-scale LLMs using a single GPU with high performance could have significant implications for the field of natural language processing, potentially making these powerful models more accessible and cost-effective for a wider range of applications and researchers. As the field of AI continues to advance, innovations like the Pre-gated MoE system will be crucial in pushing the boundaries of what is possible with large language models.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

A Closer Look into Mixture-of-Experts in Large Language Models

Ka Man Lo, Zeyu Huang, Zihan Qiu, Zili Wang, Jie Fu

Mixture-of-experts (MoE) is gaining increasing attention due to its unique properties and remarkable performance, especially for language tasks. By sparsely activating a subset of parameters for each token, MoE architecture could increase the model size without sacrificing computational efficiency, achieving a better trade-off between performance and training costs. However, the underlying mechanism of MoE still lacks further exploration, and its modularization degree remains questionable. In this paper, we make an initial attempt to understand the inner workings of MoE-based large language models. Concretely, we comprehensively study the parametric and behavioral features of three recent MoE-based models and reveal some intriguing observations, including (1) Neurons act like fine-grained experts. (2) The router of MoE usually selects experts with larger output norms. (3) The expert diversity increases as the layer increases, while the last layer is an outlier. Based on the observations, we also provide suggestions for a broad spectrum of MoE practitioners, such as router design and expert allocation. We hope this work could shed light on future research on the MoE framework and other modular architectures. Code is available at https://github.com/kamanphoebe/Look-into-MoEs.

6/27/2024

cs.CL cs.LG

MoNDE: Mixture of Near-Data Experts for Large-Scale Sparse Models

Taehyun Kim, Kwanseok Choi, Youngmock Cho, Jaehoon Cho, Hyuk-Jae Lee, Jaewoong Sim

Mixture-of-Experts (MoE) large language models (LLM) have memory requirements that often exceed the GPU memory capacity, requiring costly parameter movement from secondary memories to the GPU for expert computation. In this work, we present Mixture of Near-Data Experts (MoNDE), a near-data computing solution that efficiently enables MoE LLM inference. MoNDE reduces the volume of MoE parameter movement by transferring only the $\textit{hot}$ experts to the GPU, while computing the remaining $\textit{cold}$ experts inside the host memory device. By replacing the transfers of massive expert parameters with the ones of small activations, MoNDE enables far more communication-efficient MoE inference, thereby resulting in substantial speedups over the existing parameter offloading frameworks for both encoder and decoder operations.

5/30/2024

cs.LG cs.AI cs.AR

LocMoE: A Low-Overhead MoE for Large Language Model Training

Jing Li, Zhijie Sun, Xuan He, Li Zeng, Yi Lin, Entong Li, Binfan Zheng, Rongqian Zhao, Xin Chen

The Mixtures-of-Experts (MoE) model is a widespread distributed and integrated learning method for large language models (LLM), which is favored due to its ability to sparsify and expand models efficiently. However, the performance of MoE is limited by load imbalance and high latency of All-to-All communication, along with relatively redundant computation owing to large expert capacity. Load imbalance may result from existing routing policies that consistently tend to select certain experts. The frequent inter-node communication in the All-to-All procedure also significantly prolongs the training time. To alleviate the above performance problems, we propose a novel routing strategy that combines load balance and locality by converting partial inter-node communication to that of intra-node. Notably, we elucidate that there is a minimum threshold for expert capacity, calculated through the maximal angular deviation between the gating weights of the experts and the assigned tokens. We port these modifications on the PanGu-Sigma model based on the MindSpore framework with multi-level routing and conduct experiments on Ascend clusters. The experiment results demonstrate that the proposed LocMoE reduces training time per epoch by 12.68% to 22.24% compared to classical routers, such as hash router and switch router, without impacting the model accuracy.

5/24/2024

cs.LG cs.AI cs.CL

SwapMoE: Serving Off-the-shelf MoE-based Language Models with Tunable Memory Budget

Rui Kong, Yuanchun Li, Qingtian Feng, Weijun Wang, Xiaozhou Ye, Ye Ouyang, Linghe Kong, Yunxin Liu

Mixture of experts (MoE) is a popular technique to improve capacity of Large Language Models (LLMs) with conditionally-activated parallel experts. However, serving MoE models on memory-constrained devices is challenging due to the large parameter size. Typical solutions such as memory swapping or expert pruning may lead to significantly higher latency or severe accuracy loss. In this paper, we introduce SwapMoE, a framework for efficient serving of MoE-based large language models with tunable memory budgets. The main idea of SwapMoE is to keep a small dynamic set of important experts, namely Virtual Experts, in the main memory for inference, while seamlessly maintaining how the Virtual Experts map to the actual experts. Experiments have shown that SwapMoE can reduce the memory footprint while maintaining reasonable accuracy. For example, on text summarization tasks with Switch Transformer, SwapMoE can reduce the memory consumption from 14.2 GiB to 4.7 GiB, together with 50% latency reduction and a slight Rouge-2 score drop of 0.041.

5/30/2024

cs.AI