Filtered not Mixed: Stochastic Filtering-Based Online Gating for Mixture of Large Language Models

2406.02969

Published 6/6/2024 by Raeid Saqur, Anastasis Kratsios, Florian Krach, Yannick Limmer, Jacob-Junqi Tian, John Willes, Blanka Horvath, Frank Rudzicz

cs.LG cs.AI cs.CL

Abstract

We propose MoE-F -- a formalised mechanism for combining $N$ pre-trained expert Large Language Models (LLMs) in online time-series prediction tasks by adaptively forecasting the best weighting of LLM predictions at every time step. Our mechanism leverages the conditional information in each expert's running performance to forecast the best combination of LLMs for predicting the time series in its next step. Diverging from static (learned) Mixture of Experts (MoE) methods, MoE-F employs time-adaptive stochastic filtering techniques to combine experts. By framing the expert selection problem as a finite state-space, continuous-time Hidden Markov model (HMM), we can leverage the Wonham-Shiryaev filter. Our approach first constructs $N$ parallel filters corresponding to each of the $N$ individual LLMs. Each filter proposes its best combination of LLMs, given the information that they have access to. Subsequently, the $N$ filter outputs are aggregated to optimize a lower bound for the loss of the aggregated LLMs, which can be optimized in closed-form, thus generating our ensemble predictor. Our contributions here are: (I) the MoE-F algorithm -- deployable as a plug-and-play filtering harness, (II) theoretical optimality guarantees of the proposed filtering-based gating algorithm, and (III) empirical evaluation and ablative results using state-of-the-art foundational and MoE LLMs on a real-world Financial Market Movement task where MoE-F attains a remarkable 17% absolute and 48.5% relative F1 measure improvement over the next best performing individual LLM expert.

Overview

The paper introduces MoE-F, a stochastic filtering-based online gating mechanism for combining $N$ pre-trained expert large language models (LLMs) in online time-series prediction tasks.
Unlike static (learned) mixture of experts (MoE) methods, MoE-F adaptively forecasts the best weighting of the experts' predictions at every time step, using each expert's running performance.
The authors evaluate MoE-F on a real-world financial market movement task, reporting a 17% absolute and 48.5% relative F1 improvement over the next best performing individual LLM expert.

Plain English Explanation

When working with large language models (LLMs), it is often beneficial to combine the strengths of multiple models to get better overall performance. Related work, such as the Pre-gated MoE paper on algorithm-system co-design and the FuseMoE paper on mixture of experts for transformers, has explored different approaches to this.

The key idea of the current paper is to use "stochastic filtering" to decide, at every time step, how much weight each LLM's prediction should get, rather than relying on a fixed, learned mixing rule. This method, called MoE-F, aims to address the limitations of static mixture of experts (MoE) techniques.

Imagine you have a group of experts (the LLMs) that you can consult at each step of a forecasting task. Instead of always weighting their answers the same way, MoE-F tracks how well each expert has been doing and uses that running record to forecast which combination of experts is most likely to be right at the next step. This allows the system to lean on whichever experts are currently performing well.

The authors report that, on a real-world financial market movement task, MoE-F improves the F1 measure by 17% in absolute terms and 48.5% in relative terms over the next best performing individual LLM expert. This suggests that time-adaptive gating can be a powerful way to combine the capabilities of multiple large language models.

Technical Explanation

The paper introduces MoE-F, a mechanism for combining $N$ pre-trained expert large language models (LLMs) in online time-series prediction tasks.

Static (learned) mixture of experts (MoE) approaches fix how the individual model outputs are weighted, which can be suboptimal when the relative quality of the experts changes over time. In contrast, MoE-F uses a stochastic filtering-based online gating mechanism that re-forecasts the best weighting of the LLM predictions at every time step.

MoE-F frames expert selection as a finite state-space, continuous-time hidden Markov model (HMM), which allows it to apply the Wonham-Shiryaev filter. It first constructs $N$ parallel filters, one for each of the $N$ LLMs, and each filter proposes its best combination of LLMs given the information it has access to. The $N$ filter outputs are then aggregated by optimizing, in closed form, a lower bound for the loss of the aggregated LLMs, which yields the ensemble predictor.
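
The filtering-based gating idea can be illustrated with a deliberately simplified sketch. It is not the paper's algorithm: the abstract describes a continuous-time Wonham-Shiryaev filter run as $N$ parallel filters with a closed-form aggregation step, whereas the code below assumes a single discrete-time hidden Markov model filter whose hidden state is "which expert is currently best". The function names, the sticky transition matrix, the squared-error loss, and the `exp(-loss / temperature)` likelihood are all illustrative assumptions.

```python
import math


def sticky_transition(n, stay=0.9):
    """Hypothetical N x N row-stochastic matrix: the best expert tends to stay best."""
    off = (1.0 - stay) / (n - 1)
    return [[stay if i == j else off for j in range(n)] for i in range(n)]


def filter_step(belief, transition, losses, temperature=1.0):
    """One predict-update step of a discrete-time HMM filter over N experts."""
    n = len(belief)
    # Predict: push the current belief through the Markov transition matrix.
    predicted = [sum(belief[i] * transition[i][j] for i in range(n)) for j in range(n)]
    # Update: assume an observation likelihood proportional to exp(-loss / temperature).
    unnorm = [predicted[j] * math.exp(-losses[j] / temperature) for j in range(n)]
    total = sum(unnorm)
    return [u / total for u in unnorm]


def run_gating(expert_preds, targets, stay=0.9, temperature=1.0):
    """Combine expert predictions online, re-weighting after each revealed target."""
    n = len(expert_preds[0])
    weights = [1.0 / n] * n  # start from a uniform belief over experts
    transition = sticky_transition(n, stay)
    outputs = []
    for preds, y in zip(expert_preds, targets):
        # Forecast with the current weights before the target y is revealed.
        outputs.append(sum(w * p for w, p in zip(weights, preds)))
        # Score each expert on the revealed target, then update the belief.
        losses = [(p - y) ** 2 for p in preds]
        weights = filter_step(weights, transition, losses, temperature)
    return outputs, weights


if __name__ == "__main__":
    # Toy data: expert 0 is always right, expert 1 always wrong, expert 2 in between.
    preds = [[1.0, 0.0, 0.5]] * 20
    targets = [1.0] * 20
    outputs, weights = run_gating(preds, targets)
    print("final weights:", [round(w, 3) for w in weights])
```

In this toy run the weight on the consistently accurate expert grows over time, while the sticky transition matrix keeps the belief from collapsing entirely onto one expert, so the gate can still react if the best expert changes later.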

The authors evaluate MoE-F using state-of-the-art foundational and MoE LLMs on a real-world financial market movement task, and they include ablation results. They report a 17% absolute and 48.5% relative F1 improvement over the next best performing individual LLM expert, alongside theoretical optimality guarantees for the filtering-based gating algorithm.
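
The abstract quotes a 17% absolute and a 48.5% relative F1 improvement over the next best expert. As a rough consistency check, the two figures together imply approximate F1 scores for both systems. The sketch below assumes that "17% absolute" means 17 percentage points of F1, that both figures are measured against the same baseline, and that the published numbers are rounded; the function name is hypothetical and the results are estimates, not values taken from the paper.

```python
def implied_scores(absolute_gain, relative_gain):
    """Back out baseline and improved scores from absolute and relative gains.

    Uses: absolute_gain = relative_gain * baseline.
    """
    baseline = absolute_gain / relative_gain
    improved = baseline + absolute_gain
    return baseline, improved


# Figures from the abstract, expressed as fractions.
baseline_f1, moe_f_f1 = implied_scores(absolute_gain=0.17, relative_gain=0.485)
print(f"implied baseline F1 ~ {baseline_f1:.2f}, implied MoE-F F1 ~ {moe_f_f1:.2f}")
```

Under these assumptions the next best expert would sit at an F1 of roughly 0.35 and MoE-F at roughly 0.52.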

The LocMoE paper on low-overhead MoE for large language models and the paper on inference-optimal mixture of experts for large language models have also explored efficient ways to combine experts, but MoE-F takes a different perspective by adapting the gating online through stochastic filtering rather than relying on a static, learned gate.

Critical Analysis

The MoE-F approach presented in the paper offers a promising alternative to static MoE techniques for combining large language models. By re-weighting the experts at every time step based on their running performance, MoE-F achieves a substantially higher F1 score than the next best individual expert on the reported financial task.

One potential limitation of the MoE-F approach is that it may be more computationally expensive than simply averaging the outputs of multiple LLMs, as it requires maintaining $N$ parallel filters and aggregating their outputs at every time step. The abstract does not quantify this computational overhead relative to other MoE methods.

Additionally, the empirical evaluation highlighted in the abstract focuses on a single financial market movement task, and it would be interesting to see how the approach performs on a wider range of applications, including other time-series or domain-specific tasks. The paper on prompt-prompted mixture of experts for efficient LLM generation explores a related adaptive selection approach, and a comparative analysis between MoE-F and this technique could provide further insights.

Overall, MoE-F presents a compelling alternative to static MoE methods for combining large language models, and the results reported in the paper suggest that it is a promising direction for further research and development.

Conclusion

The paper introduces MoE-F, a stochastic filtering-based online gating method for combining multiple pre-trained large language models (LLMs) in online time-series prediction tasks. MoE-F aims to address the limitations of static mixture of experts (MoE) techniques by adaptively forecasting the best weighting of the experts' predictions at every time step.

The authors demonstrate the effectiveness of MoE-F on a real-world financial market movement task, where it attains a 17% absolute and 48.5% relative F1 improvement over the next best performing individual LLM expert, and they support the method with theoretical optimality guarantees. This suggests that MoE-F can be a powerful way to leverage the strengths of individual LLMs.

The paper's findings contribute to the ongoing research on efficient and effective ways to combine multiple large language models, building on previous work in this area. As the field continues to evolve, techniques like MoE-F may play an important role in developing more flexible and capable AI systems for online prediction tasks.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

On Least Square Estimation in Softmax Gating Mixture of Experts

Huy Nguyen, Nhat Ho, Alessandro Rinaldo

Mixture of experts (MoE) model is a statistical machine learning design that aggregates multiple expert networks using a softmax gating function in order to form a more intricate and expressive model. Despite being commonly used in several applications owing to their scalability, the mathematical and statistical properties of MoE models are complex and difficult to analyze. As a result, previous theoretical works have primarily focused on probabilistic MoE models by imposing the impractical assumption that the data are generated from a Gaussian MoE model. In this work, we investigate the performance of the least squares estimators (LSE) under a deterministic MoE model where the data are sampled according to a regression model, a setting that has remained largely unexplored. We establish a condition called strong identifiability to characterize the convergence behavior of various types of expert functions. We demonstrate that the rates for estimating strongly identifiable experts, namely the widely used feed-forward networks with activation functions $\mathrm{sigmoid}(\cdot)$ and $\tanh(\cdot)$, are substantially faster than those of polynomial experts, which we show to exhibit a surprising slow estimation rate. Our findings have important practical implications for expert selection.

6/26/2024

stat.ML cs.LG

Pre-gated MoE: An Algorithm-System Co-Design for Fast and Scalable Mixture-of-Expert Inference

Ranggi Hwang, Jianyu Wei, Shijie Cao, Changho Hwang, Xiaohu Tang, Ting Cao, Mao Yang

Large language models (LLMs) based on transformers have made significant strides in recent years, the success of which is driven by scaling up their model size. Despite their high algorithmic performance, the computational and memory requirements of LLMs present unprecedented challenges. To tackle the high compute requirements of LLMs, the Mixture-of-Experts (MoE) architecture was introduced which is able to scale its model size without proportionally scaling up its computational requirements. Unfortunately, MoE's high memory demands and dynamic activation of sparse experts restrict its applicability to real-world problems. Previous solutions that offload MoE's memory-hungry expert parameters to CPU memory fall short because the latency to migrate activated experts from CPU to GPU incurs high performance overhead. Our proposed Pre-gated MoE system effectively tackles the compute and memory challenges of conventional MoE architectures using our algorithm-system co-design. Pre-gated MoE employs our novel pre-gating function which alleviates the dynamic nature of sparse expert activation, allowing our proposed system to address the large memory footprint of MoEs while also achieving high performance. We demonstrate that Pre-gated MoE is able to improve performance, reduce GPU memory consumption, while also maintaining the same level of model quality. These features allow our Pre-gated MoE system to cost-effectively deploy large-scale LLMs using just a single GPU with high performance.

4/30/2024

cs.LG cs.AI cs.AR

A Closer Look into Mixture-of-Experts in Large Language Models

Ka Man Lo, Zeyu Huang, Zihan Qiu, Zili Wang, Jie Fu

Mixture-of-experts (MoE) is gaining increasing attention due to its unique properties and remarkable performance, especially for language tasks. By sparsely activating a subset of parameters for each token, MoE architecture could increase the model size without sacrificing computational efficiency, achieving a better trade-off between performance and training costs. However, the underlying mechanism of MoE still lacks further exploration, and its modularization degree remains questionable. In this paper, we make an initial attempt to understand the inner workings of MoE-based large language models. Concretely, we comprehensively study the parametric and behavioral features of three recent MoE-based models and reveal some intriguing observations, including (1) Neurons act like fine-grained experts. (2) The router of MoE usually selects experts with larger output norms. (3) The expert diversity increases as the layer increases, while the last layer is an outlier. Based on the observations, we also provide suggestions for a broad spectrum of MoE practitioners, such as router design and expert allocation. We hope this work could shed light on future research on the MoE framework and other modular architectures. Code is available at https://github.com/kamanphoebe/Look-into-MoEs.

6/27/2024

cs.CL cs.LG

LLaMA-MoE: Building Mixture-of-Experts from LLaMA with Continual Pre-training

Tong Zhu, Xiaoye Qu, Daize Dong, Jiacheng Ruan, Jingqi Tong, Conghui He, Yu Cheng

Mixture-of-Experts (MoE) has gained increasing popularity as a promising framework for scaling up large language models (LLMs). However, training MoE from scratch in a large-scale setting still suffers from data-hungry and instability problems. Motivated by this limit, we investigate building MoE models from existing dense large language models. Specifically, based on the well-known LLaMA-2 7B model, we obtain an MoE model by: (1) Expert Construction, which partitions the parameters of original Feed-Forward Networks (FFNs) into multiple experts; (2) Continual Pre-training, which further trains the transformed MoE model and additional gate networks. In this paper, we comprehensively explore different methods for expert construction and various data sampling strategies for continual pre-training. After these stages, our LLaMA-MoE models could maintain language abilities and route the input tokens to specific experts with part of the parameters activated. Empirically, by training 200B tokens, LLaMA-MoE-3.5B models significantly outperform dense models that contain similar activation parameters. The source codes and models are available at https://github.com/pjlab-sys4nlp/llama-moe .

6/26/2024

cs.CL