SwapMoE: Serving Off-the-shelf MoE-based Language Models with Tunable Memory Budget

Read original: arXiv:2308.15030 - Published 5/30/2024 by Rui Kong, Yuanchun Li, Qingtian Feng, Weijun Wang, Xiaozhou Ye, Ye Ouyang, Linghe Kong, Yunxin Liu

💬

Overview

Mixture of Experts (MoE) is a technique to improve the capacity of Large Language Models (LLMs) by using conditionally-activated parallel experts.
Serving MoE models on memory-constrained devices is challenging due to the large parameter size.
Typical solutions like memory swapping or expert pruning can lead to higher latency or accuracy loss.
The paper introduces SwapMoE, a framework for efficient serving of MoE-based large language models with tunable memory budgets.

Plain English Explanation

The paper discusses a technique called Mixture of Experts (MoE) that can help make large language models more powerful. MoE works by having multiple "expert" models that each specialize in different tasks, and then using a gating mechanism to decide which expert(s) to use for a given input.

This can improve the overall performance of the language model, but it also comes with a significant memory cost. The large number of parameters in the expert models means they can take up a lot of space, which is a problem when trying to run these models on devices with limited memory.

Typical solutions like swapping parts of the model in and out of memory or removing some of the expert models can help reduce the memory usage, but they often come with drawbacks like slower performance or reduced accuracy.

The SwapMoE framework introduced in this paper aims to address this challenge. The key idea is to keep a small, dynamically-updated set of the most important expert models in memory during inference, while still allowing the full set of experts to be accessed when needed. This allows for reduced memory usage without the same level of performance degradation.

The authors show that SwapMoE can significantly reduce the memory footprint of a MoE-based language model, for example from 14.2 GiB down to 4.7 GiB, while only causing a small drop in the model's performance on a text summarization task.

Technical Explanation

The paper presents SwapMoE, a framework for efficient serving of MoE-based large language models with tunable memory budgets. The key idea is to keep a small dynamic set of important experts, called Virtual Experts, in the main memory for inference, while seamlessly maintaining how the Virtual Experts map to the actual experts.

This is in contrast to typical solutions like memory swapping or expert pruning, which can lead to significantly higher latency or severe accuracy loss.

The SwapMoE framework consists of three main components:

Virtual Expert Manager: Maintains the mapping between Virtual Experts and actual experts, and dynamically updates the set of Virtual Experts based on the memory budget and importance of each expert.
Expert Invoker: Responsible for invoking the appropriate Virtual Experts and actual experts during inference.
Expert Importance Estimator: Estimates the importance of each expert to guide the Virtual Expert Manager in maintaining the optimal set of Virtual Experts.

The authors evaluate SwapMoE on a text summarization task using the Switch Transformer model. They show that SwapMoE can reduce the memory consumption from 14.2 GiB to 4.7 GiB, while also achieving a 50% reduction in latency and only a slight drop in the ROUGE-2 score (0.041).

Critical Analysis

The SwapMoE framework presented in this paper is a promising approach to efficiently serve MoE-based large language models on memory-constrained devices. The authors have clearly identified a significant challenge in the deployment of these models and have proposed a novel solution to address it.

One potential limitation of the SwapMoE approach is that it relies on accurately estimating the importance of each expert model, which could be challenging in practice. The Pre-Gated MoE technique introduced in a related paper may be a useful addition to SwapMoE, as it could help improve the accuracy of the expert importance estimates.

Additionally, the authors only evaluate SwapMoE on a single task (text summarization) using the Switch Transformer model. It would be valuable to see how the framework performs on a wider range of tasks and model architectures, including soft mixtures of experts.

Overall, the SwapMoE framework is a promising contribution to the field of efficient deployment of large language models, and the authors have demonstrated its effectiveness in reducing memory usage and latency while maintaining reasonable accuracy. Further research to address the potential limitations and expand the evaluation could further strengthen the impact of this work.

Conclusion

The paper introduces SwapMoE, a framework for efficient serving of MoE-based large language models with tunable memory budgets. SwapMoE addresses the challenge of deploying memory-intensive MoE models on devices with limited memory by maintaining a small, dynamically-updated set of the most important expert models in memory during inference.

The experimental results show that SwapMoE can significantly reduce the memory footprint of MoE-based models, such as from 14.2 GiB to 4.7 GiB for a text summarization task, while only causing a slight drop in model performance. This represents an important step towards enabling the use of powerful MoE-based language models on a wider range of hardware platforms, potentially unlocking new applications and use cases.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Follow @aimodelsfyi on 𝕏 →

Related Papers

💬

SwapMoE: Serving Off-the-shelf MoE-based Language Models with Tunable Memory Budget

Rui Kong, Yuanchun Li, Qingtian Feng, Weijun Wang, Xiaozhou Ye, Ye Ouyang, Linghe Kong, Yunxin Liu

Mixture of experts (MoE) is a popular technique to improve capacity of Large Language Models (LLMs) with conditionally-activated parallel experts. However, serving MoE models on memory-constrained devices is challenging due to the large parameter size. Typical solutions such as memory swapping or expert pruning may lead to significantly higher latency or severe accuracy loss. In this paper, we introduce SwapMoE, a framework for efficient serving of MoE-based large language models with tunable memory budgets. The main idea of SwapMoE is to keep a small dynamic set of important experts, namely Virtual Experts, in the main memory for inference, while seamlessly maintaining how the Virtual Experts map to the actual experts. Experiments have shown that SwapMoE can reduce the memory footprint while maintaining reasonable accuracy. For example, on text summarization tasks with Switch Transformer, SwapMoE can reduce the memory consumption from 14.2 GiB to 4.7 GiB, together with 50% latency reduction and a slight Rouge-2 score drop of 0.041.

5/30/2024

LocMoE: A Low-Overhead MoE for Large Language Model Training

Jing Li, Zhijie Sun, Xuan He, Li Zeng, Yi Lin, Entong Li, Binfan Zheng, Rongqian Zhao, Xin Chen

The Mixtures-of-Experts (MoE) model is a widespread distributed and integrated learning method for large language models (LLM), which is favored due to its ability to sparsify and expand models efficiently. However, the performance of MoE is limited by load imbalance and high latency of All-to-All communication, along with relatively redundant computation owing to large expert capacity. Load imbalance may result from existing routing policies that consistently tend to select certain experts. The frequent inter-node communication in the All-to-All procedure also significantly prolongs the training time. To alleviate the above performance problems, we propose a novel routing strategy that combines load balance and locality by converting partial inter-node communication to that of intra-node. Notably, we elucidate that there is a minimum threshold for expert capacity, calculated through the maximal angular deviation between the gating weights of the experts and the assigned tokens. We port these modifications on the PanGu-Sigma model based on the MindSpore framework with multi-level routing and conduct experiments on Ascend clusters. The experiment results demonstrate that the proposed LocMoE reduces training time per epoch by 12.68% to 22.24% compared to classical routers, such as hash router and switch router, without impacting the model accuracy.

5/24/2024

MoNDE: Mixture of Near-Data Experts for Large-Scale Sparse Models

Taehyun Kim, Kwanseok Choi, Youngmock Cho, Jaehoon Cho, Hyuk-Jae Lee, Jaewoong Sim

Mixture-of-Experts (MoE) large language models (LLM) have memory requirements that often exceed the GPU memory capacity, requiring costly parameter movement from secondary memories to the GPU for expert computation. In this work, we present Mixture of Near-Data Experts (MoNDE), a near-data computing solution that efficiently enables MoE LLM inference. MoNDE reduces the volume of MoE parameter movement by transferring only the $textit{hot}$ experts to the GPU, while computing the remaining $textit{cold}$ experts inside the host memory device. By replacing the transfers of massive expert parameters with the ones of small activations, MoNDE enables far more communication-efficient MoE inference, thereby resulting in substantial speedups over the existing parameter offloading frameworks for both encoder and decoder operations.

5/30/2024

Toward Inference-optimal Mixture-of-Expert Large Language Models

Longfei Yun, Yonghao Zhuang, Yao Fu, Eric P Xing, Hao Zhang

Mixture-of-Expert (MoE) based large language models (LLMs), such as the recent Mixtral and DeepSeek-MoE, have shown great promise in scaling model size without suffering from the quadratic growth of training cost of dense transformers. Like dense models, training MoEs requires answering the same question: given a training budget, what is the optimal allocation on the model size and number of tokens? We study the scaling law of MoE-based LLMs regarding the relations between the model performance, model size, dataset size, and the expert degree. Echoing previous research studying MoE in different contexts, we observe the diminishing return of increasing the number of experts, but this seems to suggest we should scale the number of experts until saturation, as the training cost would remain constant, which is problematic during inference time. We propose to amend the scaling law of MoE by introducing inference efficiency as another metric besides the validation loss. We find that MoEs with a few (4/8) experts are the most serving efficient solution under the same performance, but costs 2.5-3.5x more in training. On the other hand, training a (16/32) expert MoE much smaller (70-85%) than the loss-optimal solution, but with a larger training dataset is a promising setup under a training budget.

4/4/2024