Jamba-1.5: Hybrid Transformer-Mamba Models at Scale

Read original: arXiv:2408.12570 - Published 8/23/2024 by Jamba Team, Barak Lenz, Alan Arazi, Amir Bergman, Avshalom Manevich, Barak Peleg, Ben Aviram, Chen Almagor, Clara Fridman, Dan Padnos and 51 others

Overview

Jamba-1.5 is a family of instruction-tuned large language models built on Jamba, a hybrid Transformer-Mamba mixture-of-experts (MoE) architecture.
The paper describes two released sizes, Jamba-1.5-Large and Jamba-1.5-Mini, and reports high throughput, low memory usage, and an effective context length of 256K tokens.
Key aspects include the model architecture, serving considerations such as the ExpertsInt8 quantization technique, and experimental results.

Plain English Explanation

The Jamba-1.5 paper introduces language models that blend two different machine learning approaches: Transformers and Mamba. Transformers are the architecture behind many powerful language models such as GPT-3. Mamba is a state space model (SSM), a type of linear recurrent architecture that is cheaper to run on long inputs.

The researchers wanted to see if they could combine the strengths of these two approaches to create a high-performing language model that is also efficient and scalable. The resulting models are called Jamba-1.5. The key idea is to keep some Transformer attention layers for their modeling quality while using Mamba layers to raise throughput and reduce memory use, especially on long contexts.

The paper goes into the technical details of the Jamba-1.5 architecture and how it was trained and optimized for performance. It also discusses practical considerations around serving and deploying such a large-scale language model. According to the abstract, the Jamba-1.5 models achieve excellent results on academic and chatbot benchmarks, provide high throughput, and outperform other open-weight models on long-context benchmarks.

Overall, this research explores an intriguing hybrid approach that could help advance the state-of-the-art in natural language AI while making these powerful models more practical to use in real-world applications.

Technical Explanation

The Jamba-1.5 paper presents instruction-tuned models built on the hybrid Transformer-Mamba MoE architecture introduced in the earlier Jamba paper, an architecture designed for scalability and efficiency.

Model Architecture

The core idea is to combine Transformer and Mamba components within a single model. Transformers are known for their expressive power and performance on a wide range of language tasks, while Mamba layers offer higher throughput and a smaller memory footprint, particularly at long context lengths.

As described in the Jamba paper, the architecture consists of:

Transformer (attention) layers interleaved with Mamba layers in repeating blocks
Mixture-of-experts (MoE) modules added in some layers to increase model capacity while keeping active parameter usage manageable
A configurable layout that allows resource- and objective-specific trade-offs

This hybrid approach aims to leverage the benefits of both architectures: the modeling quality of Transformers and the efficiency of Mamba.
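
To make the memory argument concrete, here is a minimal sketch of how a hybrid stack might be laid out and why fewer attention layers shrink the key/value cache. Every name and number in it (LayerSpec, build_layout, kv_cache_bytes, 32 layers, one attention layer per 8, MoE on every second layer, 8 KV heads of dimension 128, 2 bytes per value) is an illustrative assumption, not a figure taken from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LayerSpec:
    mixer: str  # "attention" or "mamba"
    ffn: str    # "moe" or "mlp"


def build_layout(n_layers=32, attn_period=8, attn_offset=4,
                 moe_period=2, moe_offset=1):
    """Return one LayerSpec per layer for an interleaved hybrid stack.

    All defaults are hypothetical values chosen for illustration.
    """
    layout = []
    for i in range(n_layers):
        mixer = "attention" if i % attn_period == attn_offset else "mamba"
        ffn = "moe" if i % moe_period == moe_offset else "mlp"
        layout.append(LayerSpec(mixer, ffn))
    return layout


def kv_cache_bytes(n_attn_layers, seq_len, n_kv_heads=8, head_dim=128,
                   bytes_per_value=2):
    """KV cache size for one sequence: a K and a V tensor per attention layer."""
    return 2 * n_attn_layers * seq_len * n_kv_heads * head_dim * bytes_per_value


layout = build_layout()
n_attn = sum(spec.mixer == "attention" for spec in layout)
n_moe = sum(spec.ffn == "moe" for spec in layout)

seq_len = 256 * 1024  # a 256K-token context
hybrid = kv_cache_bytes(n_attn, seq_len)
dense = kv_cache_bytes(len(layout), seq_len)  # same depth, attention everywhere

print(f"attention layers: {n_attn} of {len(layout)}, MoE layers: {n_moe}")
print(f"KV cache: hybrid {hybrid / 2**30:.1f} GiB vs dense {dense / 2**30:.1f} GiB")
```

Under these assumptions only the attention layers contribute to the KV cache, which grows linearly with context length, so replacing most of them with Mamba layers cuts the cache proportionally. Mamba layers keep a fixed-size state instead, which this sketch does not model.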

Serving Considerations and Improvements

The paper also addresses practical deployment challenges for large language models like Jamba-1.5. To support cost-effective inference, it describes:

ExpertsInt8, a novel quantization technique that reduces the memory footprint
Serving Jamba-1.5-Large on a single machine with eight 80GB GPUs when processing 256K-token contexts, without loss of quality
An open-source release of ExpertsInt8

These serving-focused innovations help make Jamba-1.5 more scalable and practical to deploy in real-world applications.
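
The general idea behind weight quantization can be sketched as follows. This is a generic symmetric per-row INT8 scheme written in plain Python for illustration. It is not the ExpertsInt8 implementation, whose details are not given in this summary, and the function names and sample weights are hypothetical.

```python
def quantize_rows_int8(weights):
    """Symmetric per-row INT8 quantization.

    Returns (rows of ints in [-127, 127], one float scale per row).
    """
    q_rows, scales = [], []
    for row in weights:
        max_abs = max(abs(w) for w in row)
        scale = max_abs / 127 if max_abs > 0 else 1.0
        q_rows.append([max(-127, min(127, round(w / scale))) for w in row])
        scales.append(scale)
    return q_rows, scales


def dequantize_rows(q_rows, scales):
    """Map the stored integers back to approximate float weights."""
    return [[q * s for q in row] for row, s in zip(q_rows, scales)]


# Hypothetical weights; the second row is much smaller in magnitude,
# which is why each row gets its own scale.
weights = [[0.50, -1.27, 0.03], [0.002, 0.004, -0.008]]
q_rows, scales = quantize_rows_int8(weights)
restored = dequantize_rows(q_rows, scales)

max_error = max(
    abs(w - r)
    for row_w, row_r in zip(weights, restored)
    for w, r in zip(row_w, row_r)
)
print(q_rows)
print(f"max reconstruction error: {max_error:.6f}")
```

Storing one byte per weight instead of two (for 16-bit formats) roughly halves weight memory, at the cost of a rounding error bounded by half a quantization step per weight.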

Critical Analysis

The Jamba-1.5 paper presents a thoughtful and well-executed approach to scaling hybrid Transformer-Mamba language models. The experimental results demonstrate that this hybrid architecture can match or exceed the performance of state-of-the-art language models while being more efficient.

However, the paper does not delve deeply into the potential downsides or limitations of the Jamba-1.5 approach. For example, it would be valuable to understand how the hybrid model's performance compares to pure Transformer or pure Mamba baselines on specific tasks. Additionally, the paper does not discuss potential biases or fairness issues that may arise from training such a large-scale language model.

Further research could also explore the generalizability of the Jamba-1.5 approach to other domains beyond language modeling, such as multimodal tasks or specialized applications. Investigating the model's robustness to distribution shift or adversarial attacks could also provide valuable insights.

Conclusion

The Jamba-1.5 paper presents a compelling hybrid Transformer-Mamba language model that aims to combine the strengths of two popular AI architectures. The research demonstrates how to scale these hybrid models to achieve state-of-the-art performance while also addressing practical deployment considerations.

This work contributes to the ongoing efforts to develop more efficient and scalable language models that can be widely adopted in real-world applications. As the field of natural language AI continues to advance, innovative approaches like Jamba-1.5 could help push the boundaries of what is possible.

Related Papers

Jamba-1.5: Hybrid Transformer-Mamba Models at Scale

Jamba Team, Barak Lenz, Alan Arazi, Amir Bergman, Avshalom Manevich, Barak Peleg, Ben Aviram, Chen Almagor, Clara Fridman, Dan Padnos, Daniel Gissin, Daniel Jannai, Dor Muhlgay, Dor Zimberg, Edden M Gerber, Elad Dolev, Eran Krakovsky, Erez Safahi, Erez Schwartz, Gal Cohen, Gal Shachaf, Haim Rozenblum, Hofit Bata, Ido Blass, Inbal Magar, Itay Dalmedigos, Jhonathan Osin, Julie Fadlon, Maria Rozman, Matan Danos, Michael Gokhman, Mor Zusman, Naama Gidron, Nir Ratner, Noam Gat, Noam Rozen, Oded Fried, Ohad Leshno, Omer Antverg, Omri Abend, Opher Lieber, Or Dagan, Orit Cohavi, Raz Alon, Ro'i Belson, Roi Cohen, Rom Gilad, Roman Glozman, Shahar Lev, Shaked Meirom, Tal Delbari, Tal Ness, Tomer Asida, Tom Ben Gal, Tom Braude, Uriya Pumerantz, Yehoshua Cohen, Yonatan Belinkov, Yuval Globerson, Yuval Peleg Levy, Yoav Shoham

We present Jamba-1.5, new instruction-tuned large language models based on our Jamba architecture. Jamba is a hybrid Transformer-Mamba mixture of experts architecture, providing high throughput and low memory usage across context lengths, while retaining the same or better quality as Transformer models. We release two model sizes: Jamba-1.5-Large, with 94B active parameters, and Jamba-1.5-Mini, with 12B active parameters. Both models are fine-tuned for a variety of conversational and instruction-following capabilities, and have an effective context length of 256K tokens, the largest amongst open-weight models. To support cost-effective inference, we introduce ExpertsInt8, a novel quantization technique that allows fitting Jamba-1.5-Large on a machine with 8 80GB GPUs when processing 256K-token contexts without loss of quality. When evaluated on a battery of academic and chatbot benchmarks, Jamba-1.5 models achieve excellent results while providing high throughput and outperforming other open-weight models on long-context benchmarks. The model weights for both sizes are publicly available under the Jamba Open Model License and we release ExpertsInt8 as open source.

8/23/2024

Jamba: A Hybrid Transformer-Mamba Language Model

Opher Lieber, Barak Lenz, Hofit Bata, Gal Cohen, Jhonathan Osin, Itay Dalmedigos, Erez Safahi, Shaked Meirom, Yonatan Belinkov, Shai Shalev-Shwartz, Omri Abend, Raz Alon, Tomer Asida, Amir Bergman, Roman Glozman, Michael Gokhman, Avashalom Manevich, Nir Ratner, Noam Rozen, Erez Shwartz, Mor Zusman, Yoav Shoham

We present Jamba, a new base large language model based on a novel hybrid Transformer-Mamba mixture-of-experts (MoE) architecture. Specifically, Jamba interleaves blocks of Transformer and Mamba layers, enjoying the benefits of both model families. MoE is added in some of these layers to increase model capacity while keeping active parameter usage manageable. This flexible architecture allows resource- and objective-specific configurations. In the particular configuration we have implemented, we end up with a powerful model that fits in a single 80GB GPU. Built at large scale, Jamba provides high throughput and small memory footprint compared to vanilla Transformers, and at the same time state-of-the-art performance on standard language model benchmarks and long-context evaluations. Remarkably, the model presents strong results for up to 256K tokens context length. We study various architectural decisions, such as how to combine Transformer and Mamba layers, and how to mix experts, and show that some of them are crucial in large scale modeling. We also describe several interesting properties of these architectures which the training and evaluation of Jamba have revealed, and plan to release checkpoints from various ablation runs, to encourage further exploration of this novel architecture. We make the weights of our implementation of Jamba publicly available under a permissive license.

7/4/2024

Zamba: A Compact 7B SSM Hybrid Model

Paolo Glorioso, Quentin Anthony, Yury Tokpanov, James Whittington, Jonathan Pilault, Adam Ibrahim, Beren Millidge

In this technical report, we present Zamba, a novel 7B SSM-transformer hybrid model which achieves competitive performance against leading open-weight models at a comparable scale. Zamba is trained on 1T tokens from openly available datasets and is the best non-transformer model at this scale. Zamba pioneers a unique architecture combining a Mamba backbone with a single shared attention module, thus obtaining the benefits of attention at minimal parameter cost. Due to its architecture, Zamba is significantly faster at inference than comparable transformer models and requires substantially less memory for generation of long sequences. Zamba is pretrained in two phases: the first phase is based on existing web datasets, while the second one consists of annealing the model over high-quality instruct and synthetic datasets, and is characterized by a rapid learning rate decay. We open-source the weights and all checkpoints for Zamba, through both phase 1 and annealing phases.

5/28/2024

The Mamba in the Llama: Distilling and Accelerating Hybrid Models

Junxiong Wang, Daniele Paliotta, Avner May, Alexander M. Rush, Tri Dao

Linear RNN architectures, like Mamba, can be competitive with Transformer models in language modeling while having advantageous deployment characteristics. Given the focus on training large-scale Transformer models, we consider the challenge of converting these pretrained models for deployment. We demonstrate that it is feasible to distill large Transformers into linear RNNs by reusing the linear projection weights from attention layers with academic GPU resources. The resulting hybrid model, which incorporates a quarter of the attention layers, achieves performance comparable to the original Transformer in chat benchmarks and outperforms open-source hybrid Mamba models trained from scratch with trillions of tokens in both chat benchmarks and general benchmarks. Moreover, we introduce a hardware-aware speculative decoding algorithm that accelerates the inference speed of Mamba and hybrid models. Overall we show how, with limited computation resources, we can remove many of the original attention layers and generate from the resulting model more efficiently. Our top-performing model, distilled from Llama3-8B-Instruct, achieves a 29.61 length-controlled win rate on AlpacaEval 2 against GPT-4 and 7.35 on MT-Bench, surpassing the best instruction-tuned linear RNN model.

8/28/2024