State Soup: In-Context Skill Learning, Retrieval and Mixing

2406.08423

Published 6/13/2024 by Maciej Pióro, Maciej Wołczyk, Razvan Pascanu, Johannes von Oswald, João Sacramento

Abstract

A new breed of gated-linear recurrent neural networks has reached state-of-the-art performance on a range of sequence modeling problems. Such models naturally handle long sequences efficiently, as the cost of processing a new input is independent of sequence length. Here, we explore another advantage of these stateful sequence models, inspired by the success of model merging through parameter interpolation. Building on parallels between fine-tuning and in-context learning, we investigate whether we can treat internal states as task vectors that can be stored, retrieved, and then linearly combined, exploiting the linearity of recurrence. We study this form of fast model merging on Mamba-2.8b, a pretrained recurrent model, and present preliminary evidence that simple linear state interpolation methods suffice to improve next-token perplexity as well as downstream in-context learning task performance.

Overview

The paper "State Soup: In-Context Skill Learning, Retrieval and Mixing" explores whether the internal states of a pretrained recurrent language model can be treated as task vectors that are stored, retrieved, and linearly combined.
The approach, called "state soup", blends states obtained from different contexts, inspired by the success of model merging through parameter interpolation.
The authors study it on Mamba-2.8b and report preliminary evidence that simple linear state interpolation improves next-token perplexity and downstream in-context learning performance.

Plain English Explanation

The paper introduces the concept of "state soup": the idea of blending the internal states of a recurrent AI model. Such a model compresses everything it has read so far into a fixed-size state, so the state left behind after reading examples of a task acts as a compact record of that skill. Rather than treating this state as a throwaway by-product, the researchers want to save it and reuse it in context as needed.

The core idea is that these saved states can be stored, retrieved, and linearly combined, so that the "ingredients" picked up from different contexts can be mixed to tackle a new input. This is akin to a chef having a diverse pantry of ingredients that they can creatively combine to prepare any dish, rather than being limited to a fixed menu.

By enabling this flexible, in-context skill learning and mixing, the researchers hope to move towards AI agents that can adapt to unpredictable situations and flexibly apply their capabilities, rather than being constrained to predefined tasks. This could have broad implications for the development of more versatile and generally capable AI systems.

Technical Explanation

The paper proposes "state soup": treating the internal states of a recurrent model as task vectors. A state is produced by running the model over a context (in-context skill learning), stored, later retrieved (skill retrieval), and linearly combined with other states (skill mixing).
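The store-and-retrieve step can be pictured with a small sketch. Everything here is an assumption made for illustration: the `StateLibrary` class, the use of a key vector per stored state, cosine-similarity lookup, and uniform averaging of the top-k hits are hypothetical choices, not the retrieval method described in the paper.

```python
import numpy as np


class StateLibrary:
    """Hypothetical store of (key, state) pairs; not an API from the paper."""

    def __init__(self):
        self.keys = []
        self.states = []

    def add(self, key, state):
        # key: a vector describing the context; state: the saved recurrent state
        self.keys.append(np.asarray(key, dtype=float))
        self.states.append(np.asarray(state, dtype=float))

    def retrieve(self, query_key, k=2):
        """Return indices of the k stored keys most cosine-similar to the query."""
        keys = np.stack(self.keys)
        query = np.asarray(query_key, dtype=float)
        sims = keys @ query / (np.linalg.norm(keys, axis=1) * np.linalg.norm(query))
        return np.argsort(-sims)[:k]

    def soup(self, query_key, k=2):
        """Uniformly average the k retrieved states (one simple mixing rule)."""
        idx = self.retrieve(query_key, k)
        return np.mean([self.states[i] for i in idx], axis=0)


# Toy usage: three stored "skills" with 2-d keys and 3-d states.
library = StateLibrary()
library.add([1.0, 0.0], np.full(3, 1.0))
library.add([0.0, 1.0], np.full(3, 3.0))
library.add([-1.0, 0.0], np.full(3, 100.0))
mixed = library.soup([0.9, 0.1], k=2)  # averages the first two states
```

In a real setting the states would be the recurrent states of the pretrained model rather than toy vectors, and how to build the keys is an open design choice.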

Gated-linear recurrent models such as Mamba, a selective state space model, summarize a context in a fixed-size state and process each new input at a cost independent of sequence length. This makes it practical to keep a collection of such states and to selectively retrieve and combine them when a new input arrives. The mixing step exploits the linearity of the recurrence: states are combined by simple linear interpolation.
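The linearity that state mixing relies on can be checked on a toy model. The sketch below is a single-layer diagonal gated-linear recurrence written for illustration, not Mamba itself; the function names, shapes, and random weights are all assumptions. Because the gate depends only on the input, the final state is an affine function of the initial state, so continuing from an interpolated state gives the same result as interpolating the two continued states.

```python
import numpy as np


def run_recurrence(inputs, h0, w_gate, w_in):
    """Toy recurrence: h_t = a_t * h_{t-1} + (1 - a_t) * (w_in @ x_t)."""
    h = h0.copy()
    for x in inputs:
        a = 1.0 / (1.0 + np.exp(-(w_gate @ x)))  # gate depends on the input only
        h = a * h + (1.0 - a) * (w_in @ x)
    return h


def mix_states(states, weights):
    """Linearly combine states with weights normalized to sum to one."""
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()
    return np.tensordot(weights, np.stack(states), axes=1)


rng = np.random.default_rng(0)
d_in, d_state = 4, 8
w_gate = rng.normal(size=(d_state, d_in))
w_in = rng.normal(size=(d_state, d_in))

# "In-context learning": process two different contexts from a zero state.
context_a = rng.normal(size=(10, d_in))
context_b = rng.normal(size=(10, d_in))
state_a = run_recurrence(context_a, np.zeros(d_state), w_gate, w_in)
state_b = run_recurrence(context_b, np.zeros(d_state), w_gate, w_in)

# Mix the stored states, then continue on a shared query sequence.
query = rng.normal(size=(5, d_in))
soup = mix_states([state_a, state_b], [0.5, 0.5])
from_soup = run_recurrence(query, soup, w_gate, w_in)
from_parts = mix_states(
    [run_recurrence(query, state_a, w_gate, w_in),
     run_recurrence(query, state_b, w_gate, w_in)],
    [0.5, 0.5],
)
```

Here `from_soup` and `from_parts` agree up to floating-point error. In a deep model the inputs to later layers depend on the states of earlier layers, so this exact equality should be expected only within a single linear recurrence, not end to end.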

In experiments on Mamba-2.8b, a pretrained recurrent model, the researchers present preliminary evidence that simple linear state interpolation methods suffice to improve next-token perplexity as well as downstream in-context learning task performance.

Critical Analysis

The paper presents a compelling vision for flexible, context-aware AI agents that can dynamically combine diverse skills and knowledge. However, the practicality and scalability of the "state soup" approach remain to be thoroughly tested.

The researchers present preliminary evidence on a single pretrained model, Mamba-2.8b, so the real-world applicability and robustness of the approach will need to be further explored. Integrating state storage, retrieval, and mixing into practical AI systems may introduce significant engineering challenges.

Additionally, the paper does not address potential issues around the interpretability and transparency of the skill blending process. As AI systems become more complex and capable of dynamically combining diverse capabilities, ensuring that their decision-making and behavior remains understandable and accountable will be crucial.

Conclusion

The "State Soup" paper introduces an innovative framework for enabling AI agents to flexibly learn, retrieve, and combine diverse skills and knowledge in context. By moving beyond rigid, task-specific models, this research points towards a future of more adaptable and generally capable AI systems.

While the technical approach shows promise, significant work remains to translate these ideas into practical, scalable, and accountable AI solutions. Nonetheless, the "state soup" concept represents an exciting step towards a new generation of AI agents that can dynamically apply their capabilities to tackle complex, unpredictable challenges.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

State-Space Modeling in Long Sequence Processing: A Survey on Recurrence in the Transformer Era

Matteo Tiezzi, Michele Casoni, Alessandro Betti, Marco Gori, Stefano Melacci

Effectively learning from sequential data is a longstanding goal of Artificial Intelligence, especially in the case of long sequences. From the dawn of Machine Learning, several researchers engaged in the search of algorithms and architectures capable of processing sequences of patterns, retaining information about the past inputs while still leveraging the upcoming data, without losing precious long-term dependencies and correlations. While such an ultimate goal is inspired by the human hallmark of continuous real-time processing of sensory information, several solutions simplified the learning paradigm by artificially limiting the processed context or dealing with sequences of limited length, given in advance. These solutions were further emphasized by the large ubiquity of Transformers, that have initially shaded the role of Recurrent Neural Nets. However, recurrent networks are facing a strong recent revival due to the growing popularity of (deep) State-Space models and novel instances of large-context Transformers, which are both based on recurrent computations to go beyond several limits of currently ubiquitous technologies. In fact, the fast development of Large Language Models enhanced the interest in efficient solutions to process data over time. This survey provides an in-depth summary of the latest approaches that are based on recurrent models for sequential data processing. A complete taxonomy over the latest trends in architectural and algorithmic solutions is reported and discussed, guiding researchers in this appealing research field. The emerging picture suggests that there is room for thinking of novel routes, constituted by learning algorithms which depart from the standard Backpropagation Through Time, towards a more realistic scenario where patterns are effectively processed online, leveraging local-forward computations, opening to further research on this topic.

6/14/2024

cs.LG

Samba: Simple Hybrid State Space Models for Efficient Unlimited Context Language Modeling

Liliang Ren, Yang Liu, Yadong Lu, Yelong Shen, Chen Liang, Weizhu Chen

Efficiently modeling sequences with infinite context length has been a long-standing problem. Past works suffer from either the quadratic computation complexity or the limited extrapolation ability on length generalization. In this work, we present Samba, a simple hybrid architecture that layer-wise combines Mamba, a selective State Space Model (SSM), with Sliding Window Attention (SWA). Samba selectively compresses a given sequence into recurrent hidden states while still maintaining the ability to precisely recall memories with the attention mechanism. We scale Samba up to 3.8B parameters with 3.2T training tokens and show that Samba substantially outperforms the state-of-the-art models based on pure attention or SSMs on a wide range of benchmarks. When trained on 4K length sequences, Samba can be efficiently extrapolated to 256K context length with perfect memory recall and show improved token predictions up to 1M context length. As a linear-time sequence model, Samba enjoys a 3.73x higher throughput compared to Transformers with grouped-query attention when processing user prompts of 128K length, and 3.64x speedup when generating 64K tokens with unlimited streaming. A sample implementation of Samba is publicly available in https://github.com/microsoft/Samba.

6/12/2024

cs.CL cs.LG

Universal In-Context Approximation By Prompting Fully Recurrent Models

Aleksandar Petrov, Tom A. Lamb, Alasdair Paren, Philip H. S. Torr, Adel Bibi

Zero-shot and in-context learning enable solving tasks without model fine-tuning, making them essential for developing generative model solutions. Therefore, it is crucial to understand whether a pretrained model can be prompted to approximate any function, i.e., whether it is a universal in-context approximator. While it was recently shown that transformer models do possess this property, these results rely on their attention mechanism. Hence, these findings do not apply to fully recurrent architectures like RNNs, LSTMs, and the increasingly popular SSMs. We demonstrate that RNNs, LSTMs, GRUs, Linear RNNs, and linear gated architectures such as Mamba and Hawk/Griffin can also serve as universal in-context approximators. To streamline our argument, we introduce a programming language called LSRL that compiles to these fully recurrent architectures. LSRL may be of independent interest for further studies of fully recurrent models, such as constructing interpretability benchmarks. We also study the role of multiplicative gating and observe that architectures incorporating such gating (e.g., LSTMs, GRUs, Hawk/Griffin) can implement certain operations more stably, making them more viable candidates for practical in-context universal approximation.

6/4/2024

cs.LG cs.AI cs.CL

Mamba: Linear-Time Sequence Modeling with Selective State Spaces

Albert Gu, Tri Dao

Foundation models, now powering most of the exciting applications in deep learning, are almost universally based on the Transformer architecture and its core attention module. Many subquadratic-time architectures such as linear attention, gated convolution and recurrent models, and structured state space models (SSMs) have been developed to address Transformers' computational inefficiency on long sequences, but they have not performed as well as attention on important modalities such as language. We identify that a key weakness of such models is their inability to perform content-based reasoning, and make several improvements. First, simply letting the SSM parameters be functions of the input addresses their weakness with discrete modalities, allowing the model to selectively propagate or forget information along the sequence length dimension depending on the current token. Second, even though this change prevents the use of efficient convolutions, we design a hardware-aware parallel algorithm in recurrent mode. We integrate these selective SSMs into a simplified end-to-end neural network architecture without attention or even MLP blocks (Mamba). Mamba enjoys fast inference (5$\times$ higher throughput than Transformers) and linear scaling in sequence length, and its performance improves on real data up to million-length sequences. As a general sequence model backbone, Mamba achieves state-of-the-art performance across several modalities such as language, audio, and genomics. On language modeling, our Mamba-3B model outperforms Transformers of the same size and matches Transformers twice its size, both in pretraining and downstream evaluation.

6/3/2024

cs.LG cs.AI