ChunkAttention: Efficient Self-Attention with Prefix-Aware KV Cache and Two-Phase Partition

Read original: arXiv:2402.15220 - Published 8/2/2024 by Lu Ye, Ze Tao, Yong Huang, Yang Li


Overview

Self-attention is a key component of large language models (LLMs), but it is a significant source of inference latency for long sequences.
In multi-tenant LLM serving scenarios, the cost of self-attention can be reduced by exploiting system prompts that are shared across multiple requests.
This paper introduces ChunkAttention, a self-attention module that detects matching prompt prefixes and shares their key/value tensors to improve the memory utilization of the KV cache.

Plain English Explanation

Large language models (LLMs) are powerful AI systems that can generate human-like text. A crucial part of how they work is called self-attention, which allows the model to focus on the most relevant parts of the input when generating new text. However, self-attention can be computationally expensive, especially when dealing with long sequences of text.

In situations where multiple users are accessing an LLM at the same time (a "multi-tenant" scenario), the authors of this paper realized that there's often overlap in the starting part (the "prefix") of the text prompts that users provide to the model. By detecting these shared prefixes, the system can avoid redundant computations and memory usage, ultimately speeding up the self-attention process.

The authors introduce a new technique called ChunkAttention that breaks up the key and value tensors (important internal representations used in self-attention) into smaller "chunks" and organizes them in a special data structure called a prefix tree. This allows the system to quickly identify matching prefixes across multiple requests and reuse the corresponding key and value tensors, rather than recalculating them from scratch.
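The chunk-and-prefix-tree idea can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `ChunkNode`, `PrefixTreeKVCache`, and `CHUNK_SIZE` are hypothetical, the chunk size of 4 is arbitrary, and the key/value tensors are replaced by simple placeholders.

```python
from dataclasses import dataclass, field

CHUNK_SIZE = 4  # tokens per chunk; an illustrative value, not taken from the paper


@dataclass
class ChunkNode:
    """One chunk of the cache: up to CHUNK_SIZE tokens plus K/V placeholders."""
    tokens: tuple
    kv: list  # stand-in for the real key/value tensors of these tokens
    children: dict = field(default_factory=dict)
    ref_count: int = 0  # how many inserted sequences pass through this chunk


class PrefixTreeKVCache:
    def __init__(self):
        self.root = ChunkNode(tokens=(), kv=[])
        self.chunks_allocated = 0

    def insert(self, token_ids):
        """Walk the tree chunk by chunk, reusing matches and allocating the rest.

        Returns the list of chunk nodes that hold this sequence's K/V entries.
        """
        node, path = self.root, []
        for start in range(0, len(token_ids), CHUNK_SIZE):
            key = tuple(token_ids[start:start + CHUNK_SIZE])
            child = node.children.get(key)
            if child is None:
                # No match: a real system would compute K/V for these tokens here.
                child = ChunkNode(tokens=key, kv=[("k", t, "v", t) for t in key])
                node.children[key] = child
                self.chunks_allocated += 1
            child.ref_count += 1
            path.append(child)
            node = child
        return path


# Two requests that share an 8-token system prompt (two full chunks).
system_prompt = list(range(8))
cache = PrefixTreeKVCache()
path_a = cache.insert(system_prompt + [100, 101, 102])
path_b = cache.insert(system_prompt + [200, 201])

# Each request spans 3 chunks, but only 4 chunks are stored instead of 6,
# because the two system-prompt chunks are the same objects in both paths.
print(cache.chunks_allocated, path_a[0] is path_b[0])
```

Matching is done along a path from the root, so a chunk is reused only when everything before it also matches. That mirrors the requirement that cached key/value entries are valid only for an identical prefix.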

Technical Explanation

The paper presents ChunkAttention, a novel self-attention module designed to optimize memory usage and inference latency in multi-tenant LLM serving scenarios. The key innovation is the use of a prefix-aware self-attention mechanism that can detect shared prompt prefixes across multiple requests and efficiently share the corresponding key/value tensors.

To achieve this, ChunkAttention breaks the monolithic key/value tensors into smaller chunks and structures them into an auxiliary prefix tree. This prefix tree-based KV cache enables an efficient self-attention kernel, where a two-phase partition algorithm is used to improve data locality during the self-attention computation.
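This summary does not describe the kernel's internals, so the following is only one plausible reading of a two-phase partition, written in plain Python for clarity. It assumes a first phase that visits each shared chunk once for the whole batch and a second phase in which each sequence processes its own chunks, with partial results combined by the standard max-and-sum softmax rescaling. The function names, the phase labels in the comments, and the merge rule are assumptions for illustration.

```python
import math


def partial_attention(q, keys, values):
    """Attention of one query over one chunk, kept in unnormalised form.

    Returns (m, s, o): the max score, the sum of exp(score - m), and the
    exp-weighted sum of values. These three numbers allow later merging.
    """
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
    m = max(scores)
    weights = [math.exp(sc - m) for sc in scores]
    dim = len(values[0])
    o = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return m, sum(weights), o


def merge(a, b):
    """Combine two partial results as if their chunks were processed together."""
    (m_a, s_a, o_a), (m_b, s_b, o_b) = a, b
    m = max(m_a, m_b)
    c_a, c_b = math.exp(m_a - m), math.exp(m_b - m)
    return m, c_a * s_a + c_b * s_b, [c_a * x + c_b * y for x, y in zip(o_a, o_b)]


def two_phase_attention(queries, shared_chunks, private_chunks):
    """queries[i] is the current query of sequence i.

    shared_chunks: list of (keys, values) used by every sequence.
    private_chunks[i]: list of (keys, values) used only by sequence i.
    Assumes every sequence has at least one chunk in total.
    """
    partial = [None] * len(queries)
    # Phase 1 (assumed): iterate over shared chunks, applying each one to all
    # queries so its K/V data is touched once for the whole batch.
    for keys, values in shared_chunks:
        for i, q in enumerate(queries):
            p = partial_attention(q, keys, values)
            partial[i] = p if partial[i] is None else merge(partial[i], p)
    # Phase 2 (assumed): each sequence walks its own chunks and merges them in.
    outputs = []
    for i, q in enumerate(queries):
        acc = partial[i]
        for keys, values in private_chunks[i]:
            p = partial_attention(q, keys, values)
            acc = p if acc is None else merge(acc, p)
        _, s, o = acc
        outputs.append([x / s for x in o])
    return outputs


# Tiny example: one shared chunk, two sequences with different private chunks.
shared = [([[1.0, 0.0], [0.0, 1.0]], [[1.0, 2.0], [3.0, 4.0]])]
private = [
    [([[1.0, 1.0]], [[5.0, 6.0]])],
    [([[0.5, -1.0], [2.0, 0.0]], [[7.0, 8.0], [9.0, 10.0]])],
]
queries = [[0.3, -0.2], [1.0, 0.5]]
out = two_phase_attention(queries, shared, private)
```

Because the merge only rescales and adds, the result for each sequence equals ordinary softmax attention over the concatenation of its shared and private chunks. The partition changes the order in which memory is visited, not the output.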

The experiments show that ChunkAttention can speed up the self-attention kernel by 3.2-4.8 times compared to the state-of-the-art implementation, for system prompts ranging from 1024 to 4096 tokens in length. This improvement comes from the prefix-aware design, which exploits the shared prompt prefixes commonly found in multi-tenant LLM serving scenarios.

Critical Analysis

The paper presents a well-designed and thoroughly evaluated solution to a practical challenge faced in LLM serving. The key contribution, the ChunkAttention module, is a clever and effective way to optimize self-attention computation by exploiting the shared prompt prefixes across multiple requests.

One potential limitation of the approach is that it may be less effective in scenarios where the prompt prefixes are highly diverse or dynamic, reducing the opportunities for reusing key/value tensors. The authors acknowledge this and suggest that further research could explore adaptive techniques to handle more varied prompt patterns.

Additionally, while the performance improvements are significant, the paper does not provide a detailed analysis of the memory usage tradeoffs or the impact on overall model latency and throughput in a production-like setting. Further investigation into these aspects would help fully assess the practical benefits and trade-offs of ChunkAttention.

Overall, the ChunkAttention approach represents an important step forward in optimizing the computational and memory costs of self-attention in LLMs, with promising implications for improving the efficiency and scalability of large-scale language model serving.

Conclusion

This paper introduces ChunkAttention, a self-attention module that can significantly improve the efficiency of large language models (LLMs) in multi-tenant serving scenarios. By exploiting prompt prefixes shared across multiple requests, ChunkAttention reduces the memory and computation that self-attention requires, speeding up the self-attention kernel by up to 4.8x compared to the state-of-the-art implementation.

The key ideas behind ChunkAttention, such as the prefix tree-based KV cache and the two-phase partition algorithm, demonstrate the potential for creative architectural solutions to address the challenges of efficiently serving large language models at scale. As LLMs continue to grow in size and complexity, innovations like ChunkAttention will be crucial for making these powerful AI systems more practical and accessible for a wide range of real-world applications.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!


Related Papers


ChunkAttention: Efficient Self-Attention with Prefix-Aware KV Cache and Two-Phase Partition

Lu Ye, Ze Tao, Yong Huang, Yang Li

Self-attention is an essential component of large language models (LLM) but a significant source of inference latency for long sequences. In multi-tenant LLM serving scenarios, the compute and memory operation cost of self-attention can be optimized by using the probability that multiple LLM requests have shared system prompts in prefixes. In this paper, we introduce ChunkAttention, a prefix-aware self-attention module that can detect matching prompt prefixes across multiple requests and share their key/value tensors in memory at runtime to improve the memory utilization of KV cache. This is achieved by breaking monolithic key/value tensors into smaller chunks and structuring them into the auxiliary prefix tree. Consequently, on top of the prefix-tree based KV cache, we design an efficient self-attention kernel, where a two-phase partition algorithm is implemented to improve the data locality during self-attention computation in the presence of shared system prompts. Experiments show that ChunkAttention can speed up the self-attention kernel by 3.2-4.8× compared to the state-of-the-art implementation, with the length of the system prompt ranging from 1024 to 4096.

8/2/2024

Eigen Attention: Attention in Low-Rank Space for KV Cache Compression

Utkarsh Saxena, Gobinda Saha, Sakshi Choudhary, Kaushik Roy

Large language models (LLMs) represent a groundbreaking advancement in the domain of natural language processing due to their impressive reasoning abilities. Recently, there has been considerable interest in increasing the context lengths for these models to enhance their applicability to complex tasks. However, at long context lengths and large batch sizes, the key-value (KV) cache, which stores the attention keys and values, emerges as the new bottleneck in memory usage during inference. To address this, we propose Eigen Attention, which performs the attention operation in a low-rank space, thereby reducing the KV cache memory overhead. Our proposed approach is orthogonal to existing KV cache compression techniques and can be used synergistically with them. Through extensive experiments over OPT, MPT, and Llama model families, we demonstrate that Eigen Attention results in up to 40% reduction in KV cache sizes and up to 60% reduction in attention operation latency with minimal drop in performance.

8/13/2024

AttentionStore: Cost-effective Attention Reuse across Multi-turn Conversations in Large Language Model Serving

Bin Gao, Zhuomin He, Puru Sharma, Qingxuan Kang, Djordje Jevdjic, Junbo Deng, Xingkun Yang, Zhou Yu, Pengfei Zuo

Interacting with humans through multi-turn conversations is a fundamental feature of large language models (LLMs). However, existing LLM serving engines executing multi-turn conversations are inefficient due to the need to repeatedly compute the key-value (KV) caches of historical tokens, incurring high serving costs. To address the problem, this paper proposes CachedAttention, a new attention mechanism that enables reuse of KV caches across multi-turn conversations, significantly reducing the repetitive computation overheads. CachedAttention maintains a hierarchical KV caching system that leverages cost-effective memory/storage mediums to save KV caches for all requests. To reduce KV cache access overheads from slow mediums, CachedAttention employs layer-wise pre-loading and asynchronous saving schemes to overlap the KV cache access with the GPU computation. To ensure that the KV caches to be accessed are placed in the fastest hierarchy, CachedAttention employs scheduler-aware fetching and eviction schemes to consciously place the KV caches in different layers based on the hints from the inference job scheduler. To avoid the invalidation of the saved KV caches incurred by context window overflow, CachedAttention enables the saved KV caches to remain valid via decoupling the positional encoding and effectively truncating the KV caches. Extensive experimental results demonstrate that CachedAttention significantly decreases the time to the first token (TTFT) by up to 87%, improves the prompt prefilling throughput by up to 7.8× for multi-turn conversations, and reduces the end-to-end inference cost by up to 70%.

7/2/2024

Beyond KV Caching: Shared Attention for Efficient LLMs

Bingli Liao, Danilo Vasconcellos Vargas

The efficiency of large language models (LLMs) remains a critical challenge, particularly in contexts where computational resources are limited. Traditional attention mechanisms in these models, while powerful, require significant computational and memory resources due to the necessity of recalculating and storing attention weights across different layers. This paper introduces a novel Shared Attention (SA) mechanism, designed to enhance the efficiency of LLMs by directly sharing computed attention weights across multiple layers. Unlike previous methods that focus on sharing intermediate Key-Value (KV) caches, our approach utilizes the isotropic tendencies of attention distributions observed in advanced LLMs post-pretraining to reduce both the computational flops and the size of the KV cache required during inference. We empirically demonstrate that implementing SA across various LLMs results in minimal accuracy loss on standard benchmarks. Our findings suggest that SA not only conserves computational resources but also maintains robust model performance, thereby facilitating the deployment of more efficient LLMs in resource-constrained environments.

7/19/2024