Soft Merging of Experts with Adaptive Routing

Read original: arXiv:2306.03745 - Published 5/14/2024 by Mohammed Muqeeth, Haokun Liu, Colin Raffel

Overview

This paper introduces a new approach called Soft Merging of Experts with Adaptive Routing (SMEAR) that aims to improve the performance of sparsely activated neural networks with conditional computation.
Sparsely activated neural networks can learn to route their inputs through different expert subnetworks, providing a form of modularity. However, these models often underperform their densely activated counterparts as well as models that use non-learned heuristic routing strategies.
The authors hypothesize that these shortcomings stem from the gradient estimation techniques used to train sparsely activated models with non-differentiable discrete routing decisions.
SMEAR avoids discrete routing by using a single merged expert constructed via a weighted average of all of the experts' parameters, enabling standard gradient-based training.

Plain English Explanation

Neural networks are a type of machine learning model inspired by the human brain. They are made up of interconnected nodes, or "neurons," that process information and learn to perform tasks.

Traditionally, neural networks have been "densely activated," meaning that all of their neurons are engaged when processing an input. Sparsely activated neural networks with conditional computation, by contrast, learn to activate only certain subnetworks, or "experts," for a given input. This can provide a form of modularity and efficiency, as the network can route each input through the most appropriate experts.

Despite these potential benefits, sparsely activated models often underperform their densely activated counterparts, as well as models that use pre-defined, "heuristic" routing strategies. The authors of this paper hypothesize that this is because the techniques used to train these models, which involve making discrete routing decisions, can lead to issues with gradient estimation during the training process.

To address this, the researchers introduce SMEAR, a new approach that avoids discrete routing by using a single "merged" expert, constructed as a weighted average of all the experts' parameters. This allows for standard gradient-based training, without the issues that come with non-differentiable routing decisions.

Technical Explanation

The core idea behind SMEAR is to create a single merged expert that is a weighted average of all the individual experts' parameters. This merged expert is then used to process the input, rather than routing the input to a discrete expert.

Specifically, SMEAR first computes a set of routing weights that determine how much each expert should contribute to the merged expert. These weights are learned during training and can adapt based on the input. The merged expert is then constructed as a weighted sum of the individual experts, using these learned routing weights.
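The two steps above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `smear_layer`, `router_w`, `w1`, and `w2` are hypothetical, each expert is assumed to be a two-layer ReLU feed-forward block, and the router is assumed to act on the mean of the input sequence so that one merged expert is built per example.

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over a 1-D vector of router logits.
    e = np.exp(z - np.max(z))
    return e / e.sum()

def smear_layer(x, router_w, experts):
    """Apply one soft-merged expert to a single example.

    x:        (seq_len, d_model) input activations
    router_w: (d_model, n_experts) router weights (hypothetical name)
    experts:  list of dicts with 'w1' (d_model, d_hidden) and 'w2' (d_hidden, d_model)
    """
    # Assumption: the router sees the mean-pooled sequence representation.
    pooled = x.mean(axis=0)
    probs = softmax(pooled @ router_w)  # one routing weight per expert

    # Merge in parameter space: weighted average of every expert's weights.
    w1 = sum(p * e["w1"] for p, e in zip(probs, experts))
    w2 = sum(p * e["w2"] for p, e in zip(probs, experts))

    # A single forward pass through the merged expert.
    return np.maximum(x @ w1, 0.0) @ w2, probs

rng = np.random.default_rng(0)
d_model, d_hidden, n_experts, seq_len = 8, 16, 4, 5
experts = [
    {"w1": rng.normal(size=(d_model, d_hidden)),
     "w2": rng.normal(size=(d_hidden, d_model))}
    for _ in range(n_experts)
]
router_w = rng.normal(size=(d_model, n_experts))
x = rng.normal(size=(seq_len, d_model))

y, probs = smear_layer(x, router_w, experts)
```

Every operation here is differentiable with respect to both the router weights and the expert weights, which is what lets an ordinary autodiff framework train the whole layer end to end.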

By using a single merged expert, SMEAR avoids the need for discrete routing decisions, which are non-differentiable and therefore force sparsely activated models to rely on gradient estimation during training. Additionally, because activations pass through only one merged expert rather than through every expert, SMEAR does not incur a significant increase in computational cost; the main overhead is the averaging of the experts' parameters.
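To make the gradient issue concrete, the toy comparison below estimates the gradient of a loss with respect to the router logits by finite differences, once with a hard argmax choice of expert and once with a soft merge. It is an illustrative sketch only: the linear experts, the squared-error loss, and the names `loss` and `finite_diff_grad` are assumptions made for this example and are not taken from the paper.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

def loss(logits, experts, x, target, hard):
    if hard:
        # Discrete routing: pick exactly one expert.
        w = experts[int(np.argmax(logits))]
    else:
        # Soft merging: weighted average of all expert parameters.
        probs = softmax(logits)
        w = sum(p * e for p, e in zip(probs, experts))
    return float(np.sum((x @ w - target) ** 2))

def finite_diff_grad(logits, eps=1e-4, **kw):
    # Central finite differences with respect to each router logit.
    grad = np.zeros_like(logits)
    for i in range(len(logits)):
        step = np.zeros_like(logits)
        step[i] = eps
        grad[i] = (loss(logits + step, **kw) - loss(logits - step, **kw)) / (2 * eps)
    return grad

rng = np.random.default_rng(0)
experts = [rng.normal(size=(3, 2)) for _ in range(3)]  # toy linear experts
x = rng.normal(size=(4, 3))
target = rng.normal(size=(4, 2))
logits = np.array([0.5, -0.2, 0.1])
setup = dict(experts=experts, x=x, target=target)

# Small nudges never change the argmax, so the hard-routing loss is flat in the logits.
hard_grad = finite_diff_grad(logits, hard=True, **setup)
# The soft merge changes smoothly with the logits, so the router gets a usable signal.
soft_grad = finite_diff_grad(logits, hard=False, **setup)
print("hard routing:", hard_grad)
print("soft merging:", soft_grad)
```

With hard routing the router receives no learning signal from the loss, which is why such models need a separate gradient estimator; with soft merging, plain backpropagation is enough.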

The authors evaluate SMEAR on image classification and natural language understanding tasks, and find that it outperforms both models that learn sparse routing through gradient estimation and models that use non-learned heuristic routing based on metadata. They also provide qualitative analysis demonstrating that the experts learned via SMEAR exhibit a significant degree of specialization, suggesting that the model is effectively learning to modularize its processing.

Critical Analysis

One potential limitation of the SMEAR approach is that by using a single merged expert, it may lose some of the benefits of having truly independent expert subnetworks. The authors acknowledge this, noting that the experts learned via SMEAR exhibit some degree of specialization, but it's unclear how this compares to the level of modularity achieved by models that route to discrete experts.
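One way to see what is given up is to compare merging in parameter space with a conventional mixture that averages the outputs of independently applied experts. The sketch below is a hypothetical toy comparison, assuming fixed routing weights `probs` and small random experts; it is not drawn from the paper's experiments. For a single linear layer the two coincide, but for a two-layer ReLU expert they generally do not.

```python
import numpy as np

rng = np.random.default_rng(0)
probs = np.array([0.7, 0.3])  # assumed fixed routing weights for two experts
x = rng.normal(size=(4, 3))

# Case 1: each expert is a single linear map.
linear_experts = [rng.normal(size=(3, 3)) for _ in probs]
merged_lin = x @ sum(p * w for p, w in zip(probs, linear_experts))
ensemble_lin = sum(p * (x @ w) for p, w in zip(probs, linear_experts))

# Case 2: each expert is a two-layer ReLU network.
def mlp(x, w1, w2):
    return np.maximum(x @ w1, 0.0) @ w2

mlp_experts = [(rng.normal(size=(3, 5)), rng.normal(size=(5, 3))) for _ in probs]
w1 = sum(p * e[0] for p, e in zip(probs, mlp_experts))
w2 = sum(p * e[1] for p, e in zip(probs, mlp_experts))
merged_mlp = mlp(x, w1, w2)                                         # one pass, merged weights
ensemble_mlp = sum(p * mlp(x, *e) for p, e in zip(probs, mlp_experts))  # one pass per expert

print("linear experts agree:", np.allclose(merged_lin, ensemble_lin))
print("ReLU experts agree:  ", np.allclose(merged_mlp, ensemble_mlp))
```

The merged expert is therefore a different function from the mixture of the individual experts, which is the sense in which the experts are no longer fully independent subnetworks.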

Additionally, the authors focus their evaluation on relatively simple settings, namely image classification and natural language understanding. It would be interesting to see how SMEAR performs on more complex, multi-task problems, where the potential benefits of modularity might be more pronounced. LocMoE and MVMOE, for example, have explored the use of mixture-of-experts models in more challenging multi-task settings.

Finally, the authors note that the experts learned via SMEAR exhibit a "significant amount of specialization," but they do not provide a clear, quantitative measure of this specialization. A more rigorous analysis of the degree of modularity and specialization achieved by SMEAR, perhaps in comparison to other routing strategies, could help better elucidate the strengths and limitations of this approach.

Conclusion

The SMEAR approach introduced in this paper represents an interesting step forward in the development of sparsely activated neural networks with conditional computation. By avoiding discrete routing decisions and using a single merged expert, SMEAR is able to overcome some of the limitations of previous approaches, delivering improved performance on a range of tasks.

While SMEAR may not achieve the same level of modularity as models that route to discrete experts, its simplicity and ease of training make it a potentially valuable tool in the ongoing effort to create more efficient and adaptive neural network architectures. As the field continues to explore the benefits of mixture-of-experts models and conditional computation, approaches like SMEAR could play an important role in pushing the boundaries of what's possible.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Soft Merging of Experts with Adaptive Routing

Mohammed Muqeeth, Haokun Liu, Colin Raffel

Sparsely activated neural networks with conditional computation learn to route their inputs through different expert subnetworks, providing a form of modularity that densely activated models lack. Despite their possible benefits, models with learned routing often underperform their parameter-matched densely activated counterparts as well as models that use non-learned heuristic routing strategies. In this paper, we hypothesize that these shortcomings stem from the gradient estimation techniques used to train sparsely activated models that use non-differentiable discrete routing decisions. To address this issue, we introduce Soft Merging of Experts with Adaptive Routing (SMEAR), which avoids discrete routing by using a single merged expert constructed via a weighted average of all of the experts' parameters. By routing activations through a single merged expert, SMEAR does not incur a significant increase in computational costs and enables standard gradient-based training. We empirically validate that models using SMEAR outperform models that route based on metadata or learn sparse routing through gradient estimation. Furthermore, we provide qualitative analysis demonstrating that the experts learned via SMEAR exhibit a significant amount of specialization. All of the code used in our experiments is publicly available.

5/14/2024

Routers in Vision Mixture of Experts: An Empirical Study

Tianlin Liu, Mathieu Blondel, Carlos Riquelme, Joan Puigcerver

Mixture-of-Experts (MoE) models are a promising way to scale up model capacity without significantly increasing computational cost. A key component of MoEs is the router, which decides which subset of parameters (experts) process which feature embeddings (tokens). In this paper, we present a comprehensive study of routers in MoEs for computer vision tasks. We introduce a unified MoE formulation that subsumes different MoEs with two parametric routing tensors. This formulation covers both sparse MoE, which uses a binary or hard assignment between experts and tokens, and soft MoE, which uses a soft assignment between experts and weighted combinations of tokens. Routers for sparse MoEs can be further grouped into two variants: Token Choice, which matches experts to each token, and Expert Choice, which matches tokens to each expert. We conduct head-to-head experiments with 6 different routers, including existing routers from prior work and new ones we introduce. We show that (i) many routers originally developed for language modeling can be adapted to perform strongly in vision tasks, (ii) in sparse MoE, Expert Choice routers generally outperform Token Choice routers, and (iii) soft MoEs generally outperform sparse MoEs with a fixed compute budget. These results provide new insights regarding the crucial role of routers in vision MoE models.

4/22/2024

A Survey on Model MoErging: Recycling and Routing Among Specialized Experts for Collaborative Learning

Prateek Yadav, Colin Raffel, Mohammed Muqeeth, Lucas Caccia, Haokun Liu, Tianlong Chen, Mohit Bansal, Leshem Choshen, Alessandro Sordoni

The availability of performant pre-trained models has led to a proliferation of fine-tuned expert models that are specialized to a particular domain or task. Model MoErging methods aim to recycle expert models to create an aggregate system with improved performance or generalization. A key component of MoErging methods is the creation of a router that decides which expert model(s) to use for a particular input or application. The promise, effectiveness, and large design space of MoErging has spurred the development of many new methods over the past few years. This rapid pace of development has made it challenging to compare different MoErging methods, which are rarely compared to one another and are often validated in different experimental setups. To remedy such gaps, we present a comprehensive survey of MoErging methods that includes a novel taxonomy for cataloging key design choices and clarifying suitable applications for each method. Apart from surveying MoErging research, we inventory software tools and applications that make use of MoErging. We additionally discuss related fields of study such as model merging, multitask learning, and mixture-of-experts models. Taken as a whole, our survey provides a unified overview of existing MoErging methods and creates a solid foundation for future work in this burgeoning field.

8/14/2024

Lory: Fully Differentiable Mixture-of-Experts for Autoregressive Language Model Pre-training

Zexuan Zhong, Mengzhou Xia, Danqi Chen, Mike Lewis

Mixture-of-experts (MoE) models facilitate efficient scaling; however, training the router network introduces the challenge of optimizing a non-differentiable, discrete objective. Recently, a fully-differentiable MoE architecture, SMEAR, was proposed (Muqeeth et al., 2023), which softly merges experts in the parameter space; nevertheless, its effectiveness was only demonstrated in downstream fine-tuning on classification tasks. In this paper, we present Lory, the first approach that scales such architectures to autoregressive language model pre-training. Lory introduces two key techniques: (1) a causal segment routing strategy that achieves high efficiency for expert merging operations while preserving the autoregressive nature of language models; (2) a similarity-based data batching method that encourages expert specialization by grouping similar documents in training instances. We pre-train a series of Lory models on 150B tokens from scratch, with up to 32 experts and 30B (1.5B active) parameters. Experimental results show significant performance gains over parameter-matched dense models on both perplexity (+13.9%) and a variety of downstream tasks (+1.5%-11.1%). Despite segment-level routing, Lory models achieve competitive performance compared to state-of-the-art MoE models with token-level routing. We further demonstrate that the trained experts in Lory capture domain-level specialization without supervision. Our work highlights the potential of fully-differentiable MoE architectures for language model pre-training and advocates future research in this area.

8/20/2024