Parallelizing Linear Transformers with the Delta Rule over Sequence Length

2406.06484

Published 6/11/2024 by Songlin Yang, Bailin Wang, Yu Zhang, Yikang Shen, Yoon Kim

🤷

Abstract

Transformers with linear attention (i.e., linear transformers) and state-space models have recently been suggested as a viable linear-time alternative to transformers with softmax attention. However, these models still underperform transformers especially on tasks that require in-context retrieval. While more expressive variants of linear transformers which replace the additive outer-product update in linear transformers with the delta rule have been found to be more effective at associative recall, existing algorithms for training such models do not parallelize over sequence length and are thus inefficient to train on modern hardware. This work describes a hardware-efficient algorithm for training linear transformers with the delta rule, which exploits a memory-efficient representation for computing products of Householder matrices. This algorithm allows us to scale up DeltaNet to standard language modeling settings. We train a 1.3B model for 100B tokens and find that it outperforms recent linear-time baselines such as Mamba and GLA in terms of perplexity and zero-shot performance on downstream tasks (including on tasks that focus on recall). We also experiment with two hybrid models which combine DeltaNet layers with (1) sliding-window attention layers every other layer or (2) two global attention layers, and find that these hybrid models outperform strong transformer baselines.

Create account to get full access

Overview

This paper proposes a method for parallelizing linear transformer models, which are a type of neural network architecture used for processing sequential data like text.
The key idea is to use the "delta rule," a machine learning technique, to efficiently compute the self-attention mechanism in linear transformers over the sequence length.
This approach aims to improve the computational efficiency of training and inference for large language models like those used in Gated Linear Attention Transformers, Small-E Small Language Model, and Symmetric Dot-Product Attention.

Plain English Explanation

Neural networks like transformers are powerful machine learning models that can process sequential data, such as text or speech. However, training and using these models can be computationally intensive, especially for large language models.

This paper introduces a new technique to make transformer models more efficient. The key insight is to use a machine learning method called the "delta rule" to speed up the calculation of the self-attention mechanism in linear transformer models. Self-attention is a crucial component of transformers that allows them to understand the relationships between different parts of the input sequence.

By applying the delta rule to the self-attention calculation, the researchers were able to parallelize this step, meaning it could be done more quickly on computers with multiple processors. This can significantly reduce the time and computational resources required to train and use large transformer models, which is important for making these powerful AI systems more practical and accessible.

The proposed approach builds on previous work on linear attention transformers and efficient transformer training, aiming to further improve the efficiency and practicality of these models.

Technical Explanation

The paper presents a method for parallelizing the computation of the self-attention mechanism in linear transformer models. The key idea is to use the delta rule, a machine learning technique, to efficiently compute the self-attention over the sequence length.

Specifically, the authors propose a "Parallelized Linear Transformer" (PLT) model, which uses the delta rule to update the attention weights in a parallel fashion. This avoids the need for the sequential computation required in standard self-attention mechanisms, which can be a bottleneck for large language models.

The paper provides a detailed mathematical formulation of the PLT model, showing how the delta rule can be used to update the attention weights in a parallelized manner. The authors also discuss how this approach can be integrated with other techniques for efficient transformer training, such as Gated Linear Attention Transformers and Symmetric Dot-Product Attention.

Experimental results on various language modeling benchmarks demonstrate the effectiveness of the proposed PLT model, showing significant improvements in computational efficiency compared to standard linear transformer architectures.

Critical Analysis

The paper presents a novel and promising approach to improving the efficiency of transformer models, which is an important challenge in the field of natural language processing and machine learning.

One potential limitation of the proposed method is that it relies on the delta rule, which may not be as well-suited for complex attention patterns as other attention mechanisms. Additionally, the paper does not provide a detailed analysis of the trade-offs between the computational efficiency gains and any potential impacts on model performance or robustness.

Furthermore, the paper does not address how the PLT model might perform on more diverse or challenging language tasks beyond the standard language modeling benchmarks. It would be valuable to see how the approach generalizes to other applications, such as service description logic-based contexts.

Overall, the paper makes a valuable contribution to the field of efficient transformer architectures, but further research and evaluation would be needed to fully assess the strengths, limitations, and broader applicability of the proposed approach.

Conclusion

This paper introduces a novel method for parallelizing the computation of the self-attention mechanism in linear transformer models, using the delta rule to efficiently update the attention weights. This approach aims to improve the computational efficiency of training and inference for large language models, which is a crucial challenge in the field of natural language processing.

The proposed Parallelized Linear Transformer (PLT) model demonstrates promising results on standard language modeling benchmarks, suggesting that this technique could help make powerful transformer-based AI systems more practical and accessible. While the paper raises some questions about the generalizability and potential limitations of the approach, it represents an important step forward in the ongoing effort to develop more efficient and scalable transformer architectures.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Gated Linear Attention Transformers with Hardware-Efficient Training

Songlin Yang, Bailin Wang, Yikang Shen, Rameswar Panda, Yoon Kim

Transformers with linear attention allow for efficient parallel training but can simultaneously be formulated as an RNN with 2D (matrix-valued) hidden states, thus enjoying linear-time inference complexity. However, linear attention generally underperforms ordinary softmax attention. Moreover, current implementations of linear attention lack I/O-awareness and are thus slower than highly optimized implementations of softmax attention. This work describes a hardware-efficient algorithm for linear attention that trades off memory movement against parallelizability. The resulting implementation, dubbed FLASHLINEARATTENTION, is faster than FLASHATTENTION-2 (Dao, 2023) as a standalone layer even on short sequence lengths (e.g., 1K). We then generalize this algorithm to a more expressive variant of linear attention with data-dependent gates. When used as a replacement for the standard attention layer in Transformers, the resulting gated linear attention (GLA) Transformer is found to perform competitively against the LLaMA-architecture Transformer (Touvron et al., 2023) as well recent linear-time-inference baselines such as RetNet (Sun et al., 2023a) and Mamba (Gu & Dao, 2023) on moderate-scale language modeling experiments. GLA Transformer is especially effective at length generalization, enabling a model trained on 2K to generalize to sequences longer than 20K without significant perplexity degradations. For training speed, the GLA Transformer has higher throughput than a similarly-sized Mamba model.

6/6/2024

cs.LG cs.CL

Small-E: Small Language Model with Linear Attention for Efficient Speech Synthesis

Th'eodor Lemerle, Nicolas Obin, Axel Roebel

Recent advancements in text-to-speech (TTS) powered by language models have showcased remarkable capabilities in achieving naturalness and zero-shot voice cloning. Notably, the decoder-only transformer is the prominent architecture in this domain. However, transformers face challenges stemming from their quadratic complexity in sequence length, impeding training on lengthy sequences and resource-constrained hardware. Moreover they lack specific inductive bias with regards to the monotonic nature of TTS alignments. In response, we propose to replace transformers with emerging recurrent architectures and introduce specialized cross-attention mechanisms for reducing repeating and skipping issues. Consequently our architecture can be efficiently trained on long samples and achieve state-of-the-art zero-shot voice cloning against baselines of comparable size. Our implementation and demos are available at https://github.com/theodorblackbird/lina-speech.

6/12/2024

eess.AS cs.CL cs.SD

Symmetric Dot-Product Attention for Efficient Training of BERT Language Models

Martin Courtois, Malte Ostendorff, Leonhard Hennig, Georg Rehm

Initially introduced as a machine translation model, the Transformer architecture has now become the foundation for modern deep learning architecture, with applications in a wide range of fields, from computer vision to natural language processing. Nowadays, to tackle increasingly more complex tasks, Transformer-based models are stretched to enormous sizes, requiring increasingly larger training datasets, and unsustainable amount of compute resources. The ubiquitous nature of the Transformer and its core component, the attention mechanism, are thus prime targets for efficiency research. In this work, we propose an alternative compatibility function for the self-attention mechanism introduced by the Transformer architecture. This compatibility function exploits an overlap in the learned representation of the traditional scaled dot-product attention, leading to a symmetric with pairwise coefficient dot-product attention. When applied to the pre-training of BERT-like models, this new symmetric attention mechanism reaches a score of 79.36 on the GLUE benchmark against 78.74 for the traditional implementation, leads to a reduction of 6% in the number of trainable parameters, and reduces the number of training steps required before convergence by half.

6/21/2024

cs.CL

Linearizing Large Language Models

Jean Mercat, Igor Vasiljevic, Sedrick Keh, Kushal Arora, Achal Dave, Adrien Gaidon, Thomas Kollar

Linear transformers have emerged as a subquadratic-time alternative to softmax attention and have garnered significant interest due to their fixed-size recurrent state that lowers inference cost. However, their original formulation suffers from poor scaling and underperforms compute-matched transformers. Recent linear models such as RWKV and Mamba have attempted to address these shortcomings by proposing novel time-mixing and gating architectures, but pre-training large language models requires significant data and compute investments. Thus, the search for subquadratic architectures is limited by the availability of compute and quality pre-training datasets. As a cost-effective alternative to pre-training linear transformers, we propose Scalable UPtraining for Recurrent Attention (SUPRA). We present a method to uptrain existing large pre-trained transformers into Recurrent Neural Networks (RNNs) with a modest compute budget. This allows us to leverage the strong pre-training data and performance of existing transformer LLMs, while requiring 5% of the training cost. We find that our linearization technique leads to competitive performance on standard benchmarks, but we identify persistent in-context learning and long-context modeling shortfalls for even the largest linear models. Our code and models can be found at https://github.com/TRI-ML/linear_open_lm.

5/13/2024

cs.CL