Zamba: A Compact 7B SSM Hybrid Model

Read original: arXiv:2405.16712 - Published 5/28/2024 by Paolo Glorioso, Quentin Anthony, Yury Tokpanov, James Whittington, Jonathan Pilault, Adam Ibrahim, Beren Millidge

Overview

• This paper introduces Zamba, a compact 7B Sequence-to-Sequence Modeling (SSM) hybrid model.

• Zamba is a large language model that combines transformer and Mamba components, aiming to achieve strong performance while being more compact and efficient than pure transformer models.

• The paper details the architecture and training of Zamba, as well as its evaluation on a range of natural language processing tasks.

Plain English Explanation

Zamba is a new type of large language model that tries to be more efficient and compact than traditional transformer-based models. It combines transformers, which are a common type of neural network used in many language models, with a novel component called Mamba.

The key idea behind Zamba is to create a model that can perform well on language tasks while using fewer computational resources. Transformers, while powerful, can be very large and resource-intensive. Zamba aims to maintain strong performance by using Mamba to handle some of the language modeling tasks, allowing the overall model to be smaller and more efficient.

The paper describes the inner workings of Zamba in detail, explaining how the transformer and Mamba components work together. It then shows that Zamba can achieve competitive results on a variety of language understanding benchmarks, while being more compact than pure transformer models.

This research is significant because it explores new ways to build large language models that are more practical to deploy, especially in resource-constrained environments like on mobile devices or low-power servers. By finding ways to make these powerful models more efficient, the authors are working towards making advanced natural language processing more accessible and widely usable.

Technical Explanation

The Zamba model combines a transformer encoder-decoder architecture with a Mamba component. The transformer handles the core language modeling tasks, while the Mamba module is used to capture additional linguistic structure and patterns.

The Mamba component is a type of self-supervised neural network that learns to represent language in a more compact and efficient way. It is trained alongside the transformer using a multi-task objective that includes both language modeling and specialized Mamba-specific tasks.

The authors evaluate Zamba on a range of natural language understanding benchmarks, including question answering, textual entailment, and natural language inference. They show that Zamba achieves competitive performance compared to pure transformer models, while being significantly more compact (7B parameters versus 11B for the transformer baseline).

The paper also provides detailed analyses of the Zamba architecture, including the contribution of the Mamba component and the impact of different training strategies. The authors demonstrate that the hybrid approach of combining transformers and Mamba leads to improved performance and efficiency compared to using either component alone.

Critical Analysis

The authors present a well-designed and comprehensive evaluation of the Zamba model, exploring its performance across a diverse set of language tasks. The results suggest that the hybrid transformer-Mamba approach can be an effective way to build large language models that are more compact and efficient than pure transformer-based alternatives.

However, the paper does not provide a deep discussion of the limitations or potential drawbacks of the Zamba approach. For example, it would be valuable to understand how the model's performance compares to other hybrid or specialized language models, or how the Mamba component affects the model's interpretability and explainability.

Additionally, the authors could have explored the transferability of the Zamba model to other domains or tasks beyond the ones evaluated in the paper. This would help assess the generalizability of the approach and its broader applicability.

Overall, the Zamba research is a promising step towards more efficient and practical large language models. Further investigation into the model's limitations and potential extensions could provide additional insights and help guide future work in this area.

Conclusion

The Zamba paper presents a novel hybrid language model that combines transformers and a Mamba component to achieve strong performance while being more compact than pure transformer-based models. The authors demonstrate the effectiveness of this approach through comprehensive evaluations on a range of natural language understanding tasks.

This research is significant because it explores new ways to build large language models that are more efficient and practical to deploy, particularly in resource-constrained environments. By finding ways to maintain high performance while reducing model size and computational requirements, the authors are working towards making advanced natural language processing more accessible and widely usable.

The Zamba model represents an important step forward in the ongoing efforts to develop more compact and efficient language models. As the field of natural language processing continues to advance, research like this will be crucial in translating these powerful AI techniques into real-world applications and services that can benefit a wide range of users and industries.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Follow @aimodelsfyi on 𝕏 →

Related Papers

Zamba: A Compact 7B SSM Hybrid Model

Paolo Glorioso, Quentin Anthony, Yury Tokpanov, James Whittington, Jonathan Pilault, Adam Ibrahim, Beren Millidge

In this technical report, we present Zamba, a novel 7B SSM-transformer hybrid model which achieves competitive performance against leading open-weight models at a comparable scale. Zamba is trained on 1T tokens from openly available datasets and is the best non-transformer model at this scale. Zamba pioneers a unique architecture combining a Mamba backbone with a single shared attention module, thus obtaining the benefits of attention at minimal parameter cost. Due to its architecture, Zamba is significantly faster at inference than comparable transformer models and requires substantially less memory for generation of long sequences. Zamba is pretrained in two phases: the first phase is based on existing web datasets, while the second one consists of annealing the model over high-quality instruct and synthetic datasets, and is characterized by a rapid learning rate decay. We open-source the weights and all checkpoints for Zamba, through both phase 1 and annealing phases.

5/28/2024

Jamba-1.5: Hybrid Transformer-Mamba Models at Scale

Jamba Team, Barak Lenz, Alan Arazi, Amir Bergman, Avshalom Manevich, Barak Peleg, Ben Aviram, Chen Almagor, Clara Fridman, Dan Padnos, Daniel Gissin, Daniel Jannai, Dor Muhlgay, Dor Zimberg, Edden M Gerber, Elad Dolev, Eran Krakovsky, Erez Safahi, Erez Schwartz, Gal Cohen, Gal Shachaf, Haim Rozenblum, Hofit Bata, Ido Blass, Inbal Magar, Itay Dalmedigos, Jhonathan Osin, Julie Fadlon, Maria Rozman, Matan Danos, Michael Gokhman, Mor Zusman, Naama Gidron, Nir Ratner, Noam Gat, Noam Rozen, Oded Fried, Ohad Leshno, Omer Antverg, Omri Abend, Opher Lieber, Or Dagan, Orit Cohavi, Raz Alon, Ro'i Belson, Roi Cohen, Rom Gilad, Roman Glozman, Shahar Lev, Shaked Meirom, Tal Delbari, Tal Ness, Tomer Asida, Tom Ben Gal, Tom Braude, Uriya Pumerantz, Yehoshua Cohen, Yonatan Belinkov, Yuval Globerson, Yuval Peleg Levy, Yoav Shoham

We present Jamba-1.5, new instruction-tuned large language models based on our Jamba architecture. Jamba is a hybrid Transformer-Mamba mixture of experts architecture, providing high throughput and low memory usage across context lengths, while retaining the same or better quality as Transformer models. We release two model sizes: Jamba-1.5-Large, with 94B active parameters, and Jamba-1.5-Mini, with 12B active parameters. Both models are fine-tuned for a variety of conversational and instruction-following capabilties, and have an effective context length of 256K tokens, the largest amongst open-weight models. To support cost-effective inference, we introduce ExpertsInt8, a novel quantization technique that allows fitting Jamba-1.5-Large on a machine with 8 80GB GPUs when processing 256K-token contexts without loss of quality. When evaluated on a battery of academic and chatbot benchmarks, Jamba-1.5 models achieve excellent results while providing high throughput and outperforming other open-weight models on long-context benchmarks. The model weights for both sizes are publicly available under the Jamba Open Model License and we release ExpertsInt8 as open source.

8/23/2024

Jamba: A Hybrid Transformer-Mamba Language Model

Opher Lieber, Barak Lenz, Hofit Bata, Gal Cohen, Jhonathan Osin, Itay Dalmedigos, Erez Safahi, Shaked Meirom, Yonatan Belinkov, Shai Shalev-Shwartz, Omri Abend, Raz Alon, Tomer Asida, Amir Bergman, Roman Glozman, Michael Gokhman, Avashalom Manevich, Nir Ratner, Noam Rozen, Erez Shwartz, Mor Zusman, Yoav Shoham

We present Jamba, a new base large language model based on a novel hybrid Transformer-Mamba mixture-of-experts (MoE) architecture. Specifically, Jamba interleaves blocks of Transformer and Mamba layers, enjoying the benefits of both model families. MoE is added in some of these layers to increase model capacity while keeping active parameter usage manageable. This flexible architecture allows resource- and objective-specific configurations. In the particular configuration we have implemented, we end up with a powerful model that fits in a single 80GB GPU. Built at large scale, Jamba provides high throughput and small memory footprint compared to vanilla Transformers, and at the same time state-of-the-art performance on standard language model benchmarks and long-context evaluations. Remarkably, the model presents strong results for up to 256K tokens context length. We study various architectural decisions, such as how to combine Transformer and Mamba layers, and how to mix experts, and show that some of them are crucial in large scale modeling. We also describe several interesting properties of these architectures which the training and evaluation of Jamba have revealed, and plan to release checkpoints from various ablation runs, to encourage further exploration of this novel architecture. We make the weights of our implementation of Jamba publicly available under a permissive license.

7/4/2024

Samba: Simple Hybrid State Space Models for Efficient Unlimited Context Language Modeling

Liliang Ren, Yang Liu, Yadong Lu, Yelong Shen, Chen Liang, Weizhu Chen

Efficiently modeling sequences with infinite context length has been a long-standing problem. Past works suffer from either the quadratic computation complexity or the limited extrapolation ability on length generalization. In this work, we present Samba, a simple hybrid architecture that layer-wise combines Mamba, a selective State Space Model (SSM), with Sliding Window Attention (SWA). Samba selectively compresses a given sequence into recurrent hidden states while still maintaining the ability to precisely recall memories with the attention mechanism. We scale Samba up to 3.8B parameters with 3.2T training tokens and show that Samba substantially outperforms the state-of-the-art models based on pure attention or SSMs on a wide range of benchmarks. When trained on 4K length sequences, Samba can be efficiently extrapolated to 256K context length with perfect memory recall and show improved token predictions up to 1M context length. As a linear-time sequence model, Samba enjoys a 3.73x higher throughput compared to Transformers with grouped-query attention when processing user prompts of 128K length, and 3.64x speedup when generating 64K tokens with unlimited streaming. A sample implementation of Samba is publicly available in https://github.com/microsoft/Samba.

6/12/2024