SpAtten: Efficient Sparse Attention Architecture with Cascade Token and Head Pruning

Read original: arXiv:2012.09852 - Published 7/22/2024 by Hanrui Wang, Zhekai Zhang, Song Han


Overview

The attention mechanism has become popular in natural language processing (NLP) applications, outperforming convolutional and recurrent architectures.
However, attention is computationally expensive due to its quadratic complexity, complicated data movement, and low arithmetic intensity.
Existing neural network (NN) accelerators are optimized for convolutional or recurrent models and cannot efficiently support attention.

Plain English Explanation

The attention mechanism is a technique used in natural language processing (NLP) that has shown superior performance compared to other methods like convolutional and recurrent architectures. This means it can better understand and process human language.

However, the attention mechanism has a major downside: it is computationally very intensive. Every token is compared with every other token, so the number of calculations grows quadratically as the input text gets longer. It also requires a lot of data to be moved around, which is slow, and it performs relatively few arithmetic operations per byte of data moved (low arithmetic intensity), so the hardware spends much of its time waiting on memory.
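To make the growth rate concrete, the short sketch below counts the query-key scores a single attention head computes for a sentence of a given length, following the quadratic complexity noted in the Overview. The function name `attention_score_count` is illustrative and does not come from the paper.

```python
def attention_score_count(num_tokens: int) -> int:
    """Scores computed by one attention head: one per (query, key) token pair."""
    return num_tokens * num_tokens


# Doubling the sentence length quadruples the number of scores.
for n in (128, 256, 512):
    print(n, attention_score_count(n))
```

For 128, 256, and 512 tokens this gives 16,384, 65,536, and 262,144 scores per head, which is why long inputs become expensive so quickly.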

The problem is that current hardware designed to run neural networks, known as NN accelerators, is optimized for other types of models (convolutional and recurrent), not for attention. As a result, it cannot run attention-based models efficiently.

Technical Explanation

To address these issues, the paper presents SpAtten, an algorithm-architecture co-design that leverages several techniques to make attention computations more efficient:

Token Sparsity: Inspired by the redundancy in human language, SpAtten employs a novel "cascade token pruning" method to quickly identify and remove unimportant tokens (words) from the input sentence. This reduces the overall computation required.
Head Sparsity: SpAtten also uses "cascade head pruning" to remove unessential attention "heads" (sub-components) that don't contribute much to the final output. This further reduces computation.
Quantization: SpAtten uses "progressive quantization" to first compute the attention outputs using only the most significant bits of the data. If the confidence in the result is low, it then fetches the least significant bits and recomputes, trading off some computation for reduced memory access.
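The progressive quantization idea can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the paper's implementation: keys are assumed to be stored as non-negative 8-bit integer codes, the "MSB-only" fetch is modelled by masking off the low 4 bits, and "confidence" is assumed to mean the largest attention probability compared against a threshold. All function and parameter names here are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_probs(query, keys, scale):
    """Attention probabilities of one query over all keys."""
    scores = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
    return softmax(scores)


def progressive_attention_probs(query, keys_8bit, scale, threshold, lsb_bits=4):
    """Return (probabilities, fetched_lsbs).

    Assumption: keys are non-negative 8-bit codes, and confidence is the
    maximum attention probability from the MSB-only pass.
    """
    msb_mask = ~((1 << lsb_bits) - 1)  # clears the low `lsb_bits` bits
    keys_msb = [[k & msb_mask for k in key] for key in keys_8bit]
    probs = attention_probs(query, keys_msb, scale)
    if max(probs) >= threshold:
        return probs, False  # confident: the LSBs are never fetched
    # Low confidence: fetch the LSBs as well and recompute at full precision.
    return attention_probs(query, keys_8bit, scale), True


# One key clearly dominates, so the MSB-only pass is enough.
print(progressive_attention_probs([1.0, 0.0], [[200, 3], [16, 5]], 0.05, 0.9)[1])
# The keys differ only in their low bits, so the LSBs are fetched.
print(progressive_attention_probs([1.0, 0.0], [[100, 0], [104, 0]], 0.05, 0.9)[1])
```

In the second call the two keys look identical after masking, so the MSB-only pass yields a flat distribution and the sketch falls back to full precision. That fallback is the extra computation traded for fewer memory fetches in the common, confident case.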

To efficiently implement these techniques in hardware, SpAtten includes a novel "top-k engine" that can quickly rank the importance of tokens and attention heads.
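A minimal software sketch of cascade token pruning follows. It assumes that a token's importance is the attention probability it receives, accumulated across layers; this summary does not spell out the scoring rule, so treat that as an assumption. The helper names and the toy `salience` weights are hypothetical, and `heapq.nlargest` is only a software stand-in for the hardware top-k engine, not a model of it.

```python
import heapq


def accumulate_importance(importance, probs, alive):
    """Add the attention each surviving token receives to its running score."""
    for row in probs:  # one row per query token
        for col, token in enumerate(alive):
            importance[token] += row[col]


def top_k_tokens(importance, alive, k):
    """Keep the k highest-scoring surviving tokens, in sentence order."""
    return sorted(heapq.nlargest(k, alive, key=lambda t: importance[t]))


def cascade_token_pruning(layer_probs_fn, num_tokens, keep_per_layer):
    """Prune tokens layer by layer; a pruned token never comes back."""
    alive = list(range(num_tokens))
    importance = [0.0] * num_tokens
    for layer, k in enumerate(keep_per_layer):
        probs = layer_probs_fn(layer, alive)  # rows: queries, cols: alive tokens
        accumulate_importance(importance, probs, alive)
        alive = top_k_tokens(importance, alive, min(k, len(alive)))
    return alive


# Toy stand-in for a real model: every query attends in proportion to a
# fixed, made-up salience weight per token.
salience = [0.1, 3.0, 0.2, 2.0, 0.5]


def toy_probs(layer, alive):
    total = sum(salience[t] for t in alive)
    row = [salience[t] / total for t in alive]
    return [row for _ in alive]


print(cascade_token_pruning(toy_probs, 5, keep_per_layer=[4, 2]))  # prints [1, 3]
```

The "cascade" is the fact that `alive` only ever shrinks, so later layers do less work than earlier ones. Head pruning could plausibly follow the same pattern with one running score per head instead of per token.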

Extensive experiments show that SpAtten provides significant benefits over existing solutions, including:

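10.0x average reduction in DRAM (memory) access with no accuracy loss, measured across 30 benchmarks
1.6x, 3.0x, 162x, and 347x speedup over the A3 accelerator, the MNNFast accelerator, a TITAN Xp GPU, and a Xeon CPU, respectively
1.4x, 3.2x, 1193x, and 4059x energy savings over the same four baselines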

Critical Analysis

The techniques proposed in SpAtten, such as token and head pruning, seem promising for improving the efficiency of attention mechanisms. The paper provides a thorough evaluation on a wide range of benchmarks, demonstrating substantial performance and energy gains.

However, the authors do not discuss potential limitations or caveats of their approach. For example, the effectiveness of the pruning methods may depend on the specific NLP task and dataset, and there could be cases where important information is lost. Additionally, the overhead of the top-k engine and progressive quantization is not deeply analyzed.

Further research could explore the robustness of SpAtten to different input distributions, as well as investigate ways to adaptively adjust the pruning and quantization levels based on the input characteristics. Comparisons to other recent attention optimization techniques, such as sparse attention, would also provide a more comprehensive understanding of the state-of-the-art.

Conclusion

The SpAtten paper presents an innovative approach to improving the efficiency of attention mechanisms in NLP applications. By leveraging token and head sparsity, as well as progressive quantization, the authors have demonstrated significant performance and energy gains over existing solutions.

While the techniques show promise, further research is needed to fully understand the limitations and explore ways to make the approach even more robust and adaptive. Nevertheless, the work represents an important step forward in addressing the computational challenges of attention-based models, which are becoming increasingly prevalent in modern language understanding systems.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!


Related Papers


SpAtten: Efficient Sparse Attention Architecture with Cascade Token and Head Pruning

Hanrui Wang, Zhekai Zhang, Song Han

The attention mechanism is becoming increasingly popular in Natural Language Processing (NLP) applications, showing superior performance than convolutional and recurrent architectures. However, attention becomes the computation bottleneck because of its quadratic computational complexity to input length, complicated data movement and low arithmetic intensity. Moreover, existing NN accelerators mainly focus on optimizing convolutional or recurrent models, and cannot efficiently support attention. In this paper, we present SpAtten, an efficient algorithm-architecture co-design that leverages token sparsity, head sparsity, and quantization opportunities to reduce the attention computation and memory access. Inspired by the high redundancy of human languages, we propose the novel cascade token pruning to prune away unimportant tokens in the sentence. We also propose cascade head pruning to remove unessential heads. Cascade pruning is fundamentally different from weight pruning since there is no trainable weight in the attention mechanism, and the pruned tokens and heads are selected on the fly. To efficiently support them on hardware, we design a novel top-k engine to rank token and head importance scores with high throughput. Furthermore, we propose progressive quantization that first fetches MSBs only and performs the computation; if the confidence is low, it fetches LSBs and recomputes the attention outputs, trading computation for memory reduction. Extensive experiments on 30 benchmarks show that, on average, SpAtten reduces DRAM access by 10.0x with no accuracy loss, and achieves 1.6x, 3.0x, 162x, 347x speedup, and 1.4x, 3.2x, 1193x, 4059x energy savings over A3 accelerator, MNNFast accelerator, TITAN Xp GPU, Xeon CPU, respectively.

7/22/2024

HiP Attention: Sparse Sub-Quadratic Attention with Hierarchical Attention Pruning

Heejun Lee, Geon Park, Youngwan Lee, Jina Kim, Wonyoung Jeong, Myeongjae Jeon, Sung Ju Hwang

In modern large language models (LLMs), increasing sequence lengths is a crucial challenge for enhancing their comprehension and coherence in handling complex tasks such as multi-modal question answering. However, handling long context sequences with LLMs is prohibitively costly due to the conventional attention mechanism's quadratic time and space complexity, and the context window size is limited by the GPU memory. Although recent works have proposed linear and sparse attention mechanisms to address this issue, their real-world applicability is often limited by the need to re-train pre-trained models. In response, we propose a novel approach, Hierarchically Pruned Attention (HiP), which simultaneously reduces the training and inference time complexity from $O(T^2)$ to $O(T \log T)$ and the space complexity from $O(T^2)$ to $O(T)$. To this end, we devise a dynamic sparse attention mechanism that generates an attention mask through a novel tree-search-like algorithm for a given query on the fly. HiP is training-free as it only utilizes the pre-trained attention scores to spot the positions of the top-$k$ most significant elements for each query. Moreover, it ensures that no token is overlooked, unlike the sliding window-based sub-quadratic attention methods, such as StreamingLLM. Extensive experiments on diverse real-world benchmarks demonstrate that HiP significantly reduces prompt (i.e., prefill) and decoding latency and memory usage while maintaining high generation performance with little or no degradation. As HiP allows pretrained LLMs to scale to millions of tokens on commodity GPUs with no additional engineering due to its easy plug-and-play deployment, we believe that our work will have a large practical impact, opening up the possibility to many long-context LLM applications previously infeasible.

6/17/2024

Sparser is Faster and Less is More: Efficient Sparse Attention for Long-Range Transformers

Chao Lou, Zixia Jia, Zilong Zheng, Kewei Tu

Accommodating long sequences efficiently in autoregressive Transformers, especially within an extended context window, poses significant challenges due to the quadratic computational complexity and substantial KV memory requirements inherent in self-attention mechanisms. In this work, we introduce SPARSEK Attention, a novel sparse attention mechanism designed to overcome these computational and memory obstacles while maintaining performance. Our approach integrates a scoring network and a differentiable top-k mask operator, SPARSEK, to select a constant number of KV pairs for each query, thereby enabling gradient-based optimization. As a result, SPARSEK Attention offers linear time complexity and constant memory footprint during generation. Experimental results reveal that SPARSEK Attention outperforms previous sparse attention methods and provides significant speed improvements during both training and inference, particularly in language modeling and downstream tasks. Furthermore, our method can be seamlessly integrated into pre-trained Large Language Models (LLMs) with minimal fine-tuning, offering a practical solution for effectively managing long-range dependencies in diverse applications.

6/26/2024


Data-Informed Global Sparseness in Attention Mechanisms for Deep Neural Networks

Ileana Rugina, Rumen Dangovski, Li Jing, Preslav Nakov, Marin Soljačić

Attention mechanisms play a crucial role in the neural revolution of Natural Language Processing (NLP). With the growth of attention-based models, several pruning techniques have been developed to identify and exploit sparseness, making these models more efficient. Most efforts focus on hard-coding attention patterns or pruning attention weights based on training data. We propose Attention Pruning (AP), a framework that observes attention patterns in a fixed dataset and generates a global sparseness mask. AP saves 90% of attention computation for language modeling and about 50% for machine translation and GLUE tasks, maintaining result quality. Our method reveals important distinctions between self- and cross-attention patterns, guiding future NLP research. Our framework can reduce both latency and memory requirements for any attention-based model, aiding in the development of improved models for existing or new NLP applications. We have demonstrated this with encoder and autoregressive transformer models using Triton GPU kernels and make our code publicly available at https://github.com/irugina/AP.

5/20/2024