RetrievalAttention: Accelerating Long-Context LLM Inference via Vector Retrieval

Read original: arXiv:2409.10516 - Published 9/19/2024 by Di Liu, Meng Chen, Baotong Lu, Huiqiang Jiang, Zhenhua Han, Qianxi Zhang, Qi Chen, Chengruidong Zhang, Bailu Ding, Kai Zhang and 4 others

Overview

This paper introduces RetrievalAttention, a training-free technique that accelerates large language model (LLM) inference on long-context inputs and reduces GPU memory consumption.
The key idea is to keep the key-value (KV) vectors in CPU memory behind an approximate nearest neighbor search (ANNS) index and to retrieve only the keys most relevant to each query during generation, instead of attending to the entire context.
The authors report that RetrievalAttention needs to access only 1-3% of the data while maintaining high model accuracy, and that an 8B-parameter model can serve 128K tokens on a single NVIDIA RTX 4090 (24GB), generating one token in 0.188 seconds.

Plain English Explanation

The paper presents a new approach called RetrievalAttention that helps large language models (LLMs) process long texts more efficiently. LLMs are powerful AI models that can generate human-like text, but they become slow and memory-hungry on very long input documents or conversations, because the cost of attention grows quickly with context length.

The key idea behind RetrievalAttention is to selectively focus the model's attention on the most relevant parts of the input, rather than considering the entire context. This is done by using vector retrieval - a technique that can quickly find the parts of the input that are most relevant to the current output being generated.

By only paying attention to the most important parts of the input, RetrievalAttention can significantly speed up the model's inference process without a major impact on its performance. This makes it easier to use LLMs for tasks that require processing long documents or conversations, like summarization or question answering.

Technical Explanation

The paper introduces RetrievalAttention, a training-free attention method that accelerates LLM inference on long-context inputs. It exploits the dynamic sparsity of attention: for any given query, only a small fraction of keys receive significant attention weight, so vector retrieval can select those tokens instead of scoring the entire context at each step.

Specifically, the authors propose a two-stage attention process:

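Retrieval Stage: An approximate nearest neighbor search (ANNS) index built over the key vectors, which are kept in CPU memory, is searched with the current query vector to find the most relevant tokens. Because query and key vectors follow different distributions, off-the-shelf indexes are often ineffective for this search, so the authors design an attention-aware vector search algorithm that adapts to the query distribution.
Attention Stage: The attention computation then runs only over the retrieved key-value pairs (reported as 1-3% of the data), rather than the full input sequence.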
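
The two stages can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the retrieval stage here is an exact brute-force top-k search by inner product, standing in for the ANNS index the paper uses, and all function names (`retrieve_top_k`, `attend`, `retrieval_attention`) and the synthetic data are hypothetical.

```python
import math
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys, values, indices):
    """Scaled dot-product attention restricted to the given token indices."""
    scale = 1.0 / math.sqrt(len(query))
    weights = softmax([dot(query, keys[i]) * scale for i in indices])
    dim = len(values[0])
    return [sum(w * values[i][d] for w, i in zip(weights, indices))
            for d in range(dim)]


def retrieve_top_k(query, keys, k):
    """Retrieval stage (stand-in): exact top-k keys by inner product.
    A real system would query an ANNS index instead of sorting all keys."""
    order = sorted(range(len(keys)), key=lambda i: dot(query, keys[i]),
                   reverse=True)
    return order[:k]


def retrieval_attention(query, keys, values, k):
    """Attention stage: attend only to the k retrieved tokens."""
    return attend(query, keys, values, retrieve_top_k(query, keys, k))


# Synthetic data: 2000 random tokens, 5 of which are strongly aligned
# with the query, so the attention distribution is sparse.
random.seed(0)
dim, n_tokens = 16, 2000
keys = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(n_tokens)]
values = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(n_tokens)]
query = [1.0] * dim
for i in random.sample(range(n_tokens), 5):
    keys[i] = [5.0] * dim

full = attend(query, keys, values, range(n_tokens))       # all tokens
sparse = retrieval_attention(query, keys, values, k=40)   # 2% of tokens
max_err = max(abs(a - b) for a, b in zip(full, sparse))
print(f"max abs difference vs. full attention: {max_err:.2e}")
```

When the attention weights are concentrated on a few keys, as in this synthetic setup, attending to the top 2% of tokens reproduces the full-attention output almost exactly. The sketch says nothing about how well an approximate index recovers those keys in practice, which is the part the paper's attention-aware search addresses.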

The authors demonstrate that RetrievalAttention can provide significant speedups on various long-context tasks, such as summarization and question answering, without a major impact on model performance. They compare RetrievalAttention to several baseline attention mechanisms and show its advantages in terms of both inference time and memory usage.
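
To give a rough sense of why memory usage matters at this scale, the following back-of-the-envelope calculation estimates the KV cache size for a 128K-token context. The model configuration (32 layers, 8 KV heads, head dimension 128, 16-bit values) is an assumption chosen as typical of an 8B-parameter model with grouped-query attention; the paper's abstract does not state these numbers, and `kv_cache_bytes` is a hypothetical helper.

```python
def kv_cache_bytes(n_tokens, n_layers, n_kv_heads, head_dim, bytes_per_value=2):
    """Bytes needed to cache keys and values for n_tokens."""
    # Factor 2: one key vector and one value vector
    # per token, per layer, per KV head.
    return 2 * n_tokens * n_layers * n_kv_heads * head_dim * bytes_per_value


# Assumed configuration (illustrative only, not taken from the paper).
cfg = dict(n_layers=32, n_kv_heads=8, head_dim=128)

total_bytes = kv_cache_bytes(128 * 1024, **cfg)
print(f"full KV cache:   {total_bytes / 2**30:.1f} GiB")         # 16.0 GiB
print(f"3% of the cache: {0.03 * total_bytes / 2**30:.2f} GiB")  # 0.48 GiB
```

Under these assumptions the full cache alone takes about 16 GiB, which together with 16-bit weights for 8B parameters would not fit in a 24GB GPU. Keeping the KV vectors in CPU memory and touching only a few percent of them per step is what makes the single-GPU setup described in the abstract plausible.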

Critical Analysis

The RetrievalAttention approach presented in this paper is a promising technique for accelerating LLM inference on long-context inputs. The authors' experiments show that it can provide substantial speedups without significantly impacting model performance.

However, the paper does not explore some potential limitations or areas for further research:

Robustness: The authors do not investigate how RetrievalAttention might perform on more diverse or challenging long-context tasks, or how it would handle noisy or adversarial inputs.
Generalization: The experiments focus on a relatively narrow set of tasks and datasets. It would be valuable to see how RetrievalAttention generalizes to a broader range of long-context applications.
Computational Overhead: While RetrievalAttention reduces the overall inference time, the additional computational cost of the retrieval stage is not explored in detail. This could be an important consideration for real-world deployment.

Overall, the RetrievalAttention technique is a promising step towards making LLMs more efficient and scalable for long-context applications. Further research exploring its robustness, generalization, and practical implementation trade-offs would help strengthen the conclusions and potential impact of this work.

Conclusion

The RetrievalAttention technique presented in this paper offers a novel approach to accelerating the inference of large language models on long-context inputs. By using vector retrieval to selectively attend to the most relevant parts of the input, the authors demonstrate significant speedups without major performance degradation.

This work represents a step towards making LLMs more practical and scalable for real-world applications that involve processing long documents, conversations, or other extended contexts. As language models continue to grow in size and context length, techniques like RetrievalAttention are likely to matter for deploying them efficiently on limited hardware.

The authors have laid the groundwork for further research exploring the robustness, generalization, and practical implementation of this approach. Continued advancements in this direction could unlock new possibilities for LLMs to tackle increasingly complex and long-form language understanding and generation tasks.



Related Papers

RetrievalAttention: Accelerating Long-Context LLM Inference via Vector Retrieval

Di Liu, Meng Chen, Baotong Lu, Huiqiang Jiang, Zhenhua Han, Qianxi Zhang, Qi Chen, Chengruidong Zhang, Bailu Ding, Kai Zhang, Chen Chen, Fan Yang, Yuqing Yang, Lili Qiu

Transformer-based Large Language Models (LLMs) have become increasingly important. However, due to the quadratic time complexity of attention computation, scaling LLMs to longer contexts incurs extremely slow inference latency and high GPU memory consumption for caching key-value (KV) vectors. This paper proposes RetrievalAttention, a training-free approach to both accelerate attention computation and reduce GPU memory consumption. By leveraging the dynamic sparsity of attention mechanism, RetrievalAttention proposes to use approximate nearest neighbor search (ANNS) indexes for KV vectors in CPU memory and retrieves the most relevant ones with vector search during generation. Unfortunately, we observe that the off-the-shelf ANNS indexes are often ineffective for such retrieval tasks due to the out-of-distribution (OOD) between query vectors and key vectors in attention mechanism. RetrievalAttention addresses the OOD challenge by designing an attention-aware vector search algorithm that can adapt to the distribution of query vectors. Our evaluation shows that RetrievalAttention only needs to access 1-3% of data while maintaining high model accuracy. This leads to significant reduction in the inference cost of long-context LLMs with much lower GPU memory footprint. In particular, RetrievalAttention only needs a single NVIDIA RTX 4090 (24GB) for serving 128K tokens in LLMs with 8B parameters, which is capable of generating one token in 0.188 seconds.

9/19/2024

Near-Lossless Acceleration of Long Context LLM Inference with Adaptive Structured Sparse Attention

Qianchao Zhu, Jiangfei Duan, Chang Chen, Siran Liu, Xiuhong Li, Guanyu Feng, Xin Lv, Huanqi Cao, Xiao Chuanfu, Xingcheng Zhang, Dahua Lin, Chao Yang

Large language models (LLMs) now support extremely long context windows, but the quadratic complexity of vanilla attention results in significantly long Time-to-First-Token (TTFT) latency. Existing approaches to address this complexity require additional pretraining or finetuning, and often sacrifice model accuracy. In this paper, we first provide both theoretical and empirical foundations for near-lossless sparse attention. We find dynamically capturing head-specific sparse patterns at runtime with low overhead is crucial. To address this, we propose SampleAttention, an adaptive structured and near-lossless sparse attention. Leveraging observed significant sparse patterns, SampleAttention attends to a fixed percentage of adjacent tokens to capture local window patterns, and employs a two-stage query-guided key-value filtering approach, which adaptively select a minimum set of key-values with low overhead, to capture column stripe patterns. Comprehensive evaluations show that SampleAttention can seamlessly replace vanilla attention in off-the-shelf LLMs with nearly no accuracy loss, and reduces TTFT by up to $2.42\times$ compared with FlashAttention.

7/1/2024

Eigen Attention: Attention in Low-Rank Space for KV Cache Compression

Utkarsh Saxena, Gobinda Saha, Sakshi Choudhary, Kaushik Roy

Large language models (LLMs) represent a groundbreaking advancement in the domain of natural language processing due to their impressive reasoning abilities. Recently, there has been considerable interest in increasing the context lengths for these models to enhance their applicability to complex tasks. However, at long context lengths and large batch sizes, the key-value (KV) cache, which stores the attention keys and values, emerges as the new bottleneck in memory usage during inference. To address this, we propose Eigen Attention, which performs the attention operation in a low-rank space, thereby reducing the KV cache memory overhead. Our proposed approach is orthogonal to existing KV cache compression techniques and can be used synergistically with them. Through extensive experiments over OPT, MPT, and Llama model families, we demonstrate that Eigen Attention results in up to 40% reduction in KV cache sizes and up to 60% reduction in attention operation latency with minimal drop in performance.

8/13/2024

Lean Attention: Hardware-Aware Scalable Attention Mechanism for the Decode-Phase of Transformers

Rya Sanovar, Srikant Bharadwaj, Renee St. Amant, Victor Ruhle, Saravan Rajmohan

Transformer-based models have emerged as one of the most widely used architectures for natural language processing, natural language generation, and image generation. The size of the state-of-the-art models has increased steadily reaching billions of parameters. These huge models are memory hungry and incur significant inference latency even on cutting edge AI-accelerators, such as GPUs. Specifically, the time and memory complexity of the attention operation is quadratic in terms of the total context length, i.e., prompt and output tokens. Thus, several optimizations such as key-value tensor caching and FlashAttention computation have been proposed to deliver the low latency demands of applications relying on such large models. However, these techniques do not cater to the computationally distinct nature of different phases during inference. To that end, we propose LeanAttention, a scalable technique of computing self-attention for the token-generation phase (decode-phase) of decoder-only transformer models. LeanAttention enables scaling the attention mechanism implementation for the challenging case of long context lengths by re-designing the execution flow for the decode-phase. We identify that the associative property of online softmax can be treated as a reduction operation thus allowing us to parallelize the attention computation over these large context lengths. We extend the stream-K style reduction of tiled calculation to self-attention to enable parallel computation resulting in an average of 2.6x attention execution speedup over FlashAttention-2 and up to 8.33x speedup for 512k context lengths.

5/20/2024