HiP Attention: Sparse Sub-Quadratic Attention with Hierarchical Attention Pruning

2406.09827

Published 6/17/2024 by Heejun Lee, Geon Park, Youngwan Lee, Jina Kim, Wonyoung Jeong, Myeongjae Jeon, Sung Ju Hwang

HiP Attention: Sparse Sub-Quadratic Attention with Hierarchical Attention Pruning

Abstract

In modern large language models (LLMs), increasing sequence lengths is a crucial challenge for enhancing their comprehension and coherence in handling complex tasks such as multi-modal question answering. However, handling long context sequences with LLMs is prohibitively costly due to the conventional attention mechanism's quadratic time and space complexity, and the context window size is limited by the GPU memory. Although recent works have proposed linear and sparse attention mechanisms to address this issue, their real-world applicability is often limited by the need to re-train pre-trained models. In response, we propose a novel approach, Hierarchically Pruned Attention (HiP), which simultaneously reduces the training and inference time complexity from $O(T^2)$ to $O(T log T)$ and the space complexity from $O(T^2)$ to $O(T)$. To this end, we devise a dynamic sparse attention mechanism that generates an attention mask through a novel tree-search-like algorithm for a given query on the fly. HiP is training-free as it only utilizes the pre-trained attention scores to spot the positions of the top-$k$ most significant elements for each query. Moreover, it ensures that no token is overlooked, unlike the sliding window-based sub-quadratic attention methods, such as StreamingLLM. Extensive experiments on diverse real-world benchmarks demonstrate that HiP significantly reduces prompt (i.e., prefill) and decoding latency and memory usage while maintaining high generation performance with little or no degradation. As HiP allows pretrained LLMs to scale to millions of tokens on commodity GPUs with no additional engineering due to its easy plug-and-play deployment, we believe that our work will have a large practical impact, opening up the possibility to many long-context LLM applications previously infeasible.

Create account to get full access

Overview

This paper introduces HiP Attention, a sparse sub-quadratic attention mechanism that uses hierarchical attention pruning to reduce computational complexity.
HiP Attention aims to address the high computational cost of standard attention mechanisms, which scale quadratically with sequence length.
The proposed approach leverages a hierarchical attention pruning strategy to selectively compute attention scores only for the most relevant token pairs, achieving sub-quadratic complexity.

Plain English Explanation

The core idea behind HiP Attention is to reduce the computational burden of the attention mechanism, which is a key component in many deep learning models. Attention allows the model to focus on the most relevant parts of the input when generating an output, but the standard attention mechanism has a major drawback: its computational cost scales quadratically with the length of the input sequence.

HiP Attention tackles this issue by being "sparse" - it only computes attention scores for a subset of the token pairs, rather than all of them. This is achieved through a hierarchical attention pruning strategy, where the model first identifies the most important high-level connections, and then progressively refines the attention at lower levels of the hierarchy. By selectively computing attention, HiP Attention can achieve sub-quadratic complexity, making it more efficient to use in large-scale models and on long input sequences.

The key benefit of HiP Attention is that it can maintain the performance of standard attention mechanisms while dramatically reducing the computational resources required. This could enable the use of attention-based models in more memory-constrained or real-time applications, or allow for scaling up model size and complexity without incurring prohibitive computational costs.

Technical Explanation

HiP Attention introduces a novel attention mechanism that combines a hierarchical attention pruning strategy with a sparse attention computation. The hierarchical pruning first identifies the most important high-level attention connections, and then progressively refines the attention at lower levels of the hierarchy. This allows the model to selectively compute attention scores only for the most relevant token pairs, achieving sub-quadratic complexity.

The hierarchical pruning is implemented using a top-down attention masking approach, where the model first computes a coarse attention map, and then iteratively refines this map by computing attention at finer granularities. This process is guided by learnable attention pruning thresholds that determine which attention scores should be computed at each level of the hierarchy.

The sparse attention computation is enabled by using a sparse matrix multiplication operation to compute the attention scores. This avoids the need to compute attention for all token pairs, further reducing the computational cost.

The authors evaluate HiP Attention on a variety of tasks, including language modeling, machine translation, and image classification, and demonstrate that it can match the performance of standard attention mechanisms while achieving significant speedups, especially on long input sequences.

Critical Analysis

The key innovation of HiP Attention is the hierarchical attention pruning strategy, which allows the model to selectively compute attention scores and achieve sub-quadratic complexity. This is a promising approach to address the high computational cost of attention, which has been a longstanding challenge in the field.

One potential limitation of the approach is that the hierarchical pruning may not be able to capture all the nuanced relationships between tokens, as it relies on a top-down pruning strategy. There may be cases where important lower-level connections are missed by the coarse attention maps. The authors acknowledge this and suggest that further research is needed to explore more sophisticated pruning strategies.

Additionally, the performance of HiP Attention may be sensitive to the choice of pruning thresholds, which are learned by the model. It's not clear how well the approach will generalize to different tasks and datasets, and how much tuning of these hyperparameters may be required.

Overall, HiP Attention is a compelling approach that demonstrates the potential for sparse and hierarchical attention mechanisms to improve the efficiency of attention-based models. The authors have made a valuable contribution to the ongoing research on improving the scalability and efficiency of attention-based deep learning models.

Conclusion

The HiP Attention mechanism introduced in this paper represents a significant step forward in addressing the high computational cost of attention, a critical component in many state-of-the-art deep learning models. By leveraging a hierarchical attention pruning strategy and sparse attention computation, HiP Attention can achieve sub-quadratic complexity, making it more efficient to use in large-scale models and on long input sequences.

The authors have demonstrated the effectiveness of HiP Attention across a range of tasks, suggesting that it could have broad applicability in the field of deep learning. If further research can address the potential limitations of the hierarchical pruning strategy, HiP Attention could enable the development of more efficient and scalable attention-based models, with implications for a wide variety of applications, from natural language processing to computer vision.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Near-Lossless Acceleration of Long Context LLM Inference with Adaptive Structured Sparse Attention

Qianchao Zhu, Jiangfei Duan, Chang Chen, Siran Liu, Xiuhong Li, Guanyu Feng, Xin Lv, Huanqi Cao, Xiao Chuanfu, Xingcheng Zhang, Dahua Lin, Chao Yang

Large language models (LLMs) now support extremely long context windows, but the quadratic complexity of vanilla attention results in significantly long Time-to-First-Token (TTFT) latency. Existing approaches to address this complexity require additional pretraining or finetuning, and often sacrifice model accuracy. In this paper, we first provide both theoretical and empirical foundations for near-lossless sparse attention. We find dynamically capturing head-specific sparse patterns at runtime with low overhead is crucial. To address this, we propose SampleAttention, an adaptive structured and near-lossless sparse attention. Leveraging observed significant sparse patterns, SampleAttention attends to a fixed percentage of adjacent tokens to capture local window patterns, and employs a two-stage query-guided key-value filtering approach, which adaptively select a minimum set of key-values with low overhead, to capture column stripe patterns. Comprehensive evaluations show that SampleAttention can seamlessly replace vanilla attention in off-the-shelf LLMs with nearly no accuracy loss, and reduces TTFT by up to $2.42times$ compared with FlashAttention.

7/1/2024

cs.CL cs.AI cs.LG

Sparser is Faster and Less is More: Efficient Sparse Attention for Long-Range Transformers

Chao Lou, Zixia Jia, Zilong Zheng, Kewei Tu

Accommodating long sequences efficiently in autoregressive Transformers, especially within an extended context window, poses significant challenges due to the quadratic computational complexity and substantial KV memory requirements inherent in self-attention mechanisms. In this work, we introduce SPARSEK Attention, a novel sparse attention mechanism designed to overcome these computational and memory obstacles while maintaining performance. Our approach integrates a scoring network and a differentiable top-k mask operator, SPARSEK, to select a constant number of KV pairs for each query, thereby enabling gradient-based optimization. As a result, SPARSEK Attention offers linear time complexity and constant memory footprint during generation. Experimental results reveal that SPARSEK Attention outperforms previous sparse attention methods and provides significant speed improvements during both training and inference, particularly in language modeling and downstream tasks. Furthermore, our method can be seamlessly integrated into pre-trained Large Language Models (LLMs) with minimal fine-tuning, offering a practical solution for effectively managing long-range dependencies in diverse applications.

6/26/2024

cs.CL cs.LG

Lean Attention: Hardware-Aware Scalable Attention Mechanism for the Decode-Phase of Transformers

Rya Sanovar, Srikant Bharadwaj, Renee St. Amant, Victor Ruhle, Saravan Rajmohan

Transformer-based models have emerged as one of the most widely used architectures for natural language processing, natural language generation, and image generation. The size of the state-of-the-art models has increased steadily reaching billions of parameters. These huge models are memory hungry and incur significant inference latency even on cutting edge AI-accelerators, such as GPUs. Specifically, the time and memory complexity of the attention operation is quadratic in terms of the total context length, i.e., prompt and output tokens. Thus, several optimizations such as key-value tensor caching and FlashAttention computation have been proposed to deliver the low latency demands of applications relying on such large models. However, these techniques do not cater to the computationally distinct nature of different phases during inference. To that end, we propose LeanAttention, a scalable technique of computing self-attention for the token-generation phase (decode-phase) of decoder-only transformer models. LeanAttention enables scaling the attention mechanism implementation for the challenging case of long context lengths by re-designing the execution flow for the decode-phase. We identify that the associative property of online softmax can be treated as a reduction operation thus allowing us to parallelize the attention computation over these large context lengths. We extend the stream-K style reduction of tiled calculation to self-attention to enable parallel computation resulting in an average of 2.6x attention execution speedup over FlashAttention-2 and up to 8.33x speedup for 512k context lengths.

5/20/2024

cs.AR cs.LG

🤿

Data-Informed Global Sparseness in Attention Mechanisms for Deep Neural Networks

Ileana Rugina, Rumen Dangovski, Li Jing, Preslav Nakov, Marin Soljav{c}i'c

Attention mechanisms play a crucial role in the neural revolution of Natural Language Processing (NLP). With the growth of attention-based models, several pruning techniques have been developed to identify and exploit sparseness, making these models more efficient. Most efforts focus on hard-coding attention patterns or pruning attention weights based on training data. We propose Attention Pruning (AP), a framework that observes attention patterns in a fixed dataset and generates a global sparseness mask. AP saves 90% of attention computation for language modeling and about 50% for machine translation and GLUE tasks, maintaining result quality. Our method reveals important distinctions between self- and cross-attention patterns, guiding future NLP research. Our framework can reduce both latency and memory requirements for any attention-based model, aiding in the development of improved models for existing or new NLP applications. We have demonstrated this with encoder and autoregressive transformer models using Triton GPU kernels and make our code publicly available at https://github.com/irugina/AP.

5/20/2024

cs.CL