KV-Compress: Paged KV-Cache Compression with Variable Compression Rates per Attention Head

Read original: arXiv:2410.00161 - Published 10/2/2024 by Isaac Rehg

Overview

The paper presents KV-Compress, a method for compressing the key-value (KV) cache that large language models maintain during inference.
KV-Compress allows the compression rate to vary per attention head, removing more KVs from heads where they are under-utilized and fewer from heads that rely on them.
The method evicts contiguous KV blocks within a PagedAttention framework, so the physical memory freed tracks the theoretical compression rate instead of being lost to fragmentation.

Plain English Explanation

KV-Compress: Paged KV-Cache Compression with Variable Compression Rates per Attention Head is a new technique for compressing the key-value (KV) caches used in attention-based machine learning models.

Attention-based models, like those used in large language models, maintain a KV cache that stores the keys and values of tokens already processed, so they do not have to be recomputed at every generation step. The memory this cache needs grows with the context length, which limits how many long-context requests can be served at once, so finding ways to compress it efficiently is important.

KV-Compress addresses this by allowing the compression rate to vary across different attention heads in the model. Attention heads are the individual components that focus on different parts of the input when computing the model's output. Some heads make little use of many of their cached entries, so KV-Compress removes more entries from those heads and fewer from the heads that rely on theirs.

The technique operates on a paged KV cache, in which the cache is stored in fixed-size blocks rather than in one contiguous allocation. KV-Compress evicts whole blocks, so the memory freed from each head can be reclaimed instead of being left behind as unusable gaps.

Overall, KV-Compress selectively compresses the KV cache in attention-based models, allowing for significant memory savings with little loss in the model's performance at moderate compression rates.

Technical Explanation

KV-Compress: Paged KV-Cache Compression with Variable Compression Rates per Attention Head introduces a new compression technique for the key-value (KV) caches used in attention-based models.

Attention-based models, such as those used in large language models, maintain a KV cache that stores the keys and values of previously processed tokens. The memory allocated to this cache scales with context length, which limits the number of long-context requests that can be served concurrently under a given memory budget, so compressing it efficiently is important.
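To make the scale concrete, the following is a minimal sketch of the usual KV cache size arithmetic. The model shape (32 layers, 8 KV heads, head dimension 128, 2-byte elements) is an assumption chosen to resemble a Llama-3.1-8B-class model, and the function name is hypothetical; none of these figures come from the paper.

```python
def kv_bytes_per_token(num_layers: int, num_kv_heads: int,
                       head_dim: int, bytes_per_elem: int) -> int:
    """Bytes of KV cache added by one token.

    Each layer stores one key vector and one value vector per KV head,
    hence the leading factor of 2.
    """
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem


# Assumed model shape (illustrative, not taken from the paper).
per_token = kv_bytes_per_token(num_layers=32, num_kv_heads=8,
                               head_dim=128, bytes_per_elem=2)

# A single 128k-token context.
context_len = 128 * 1024
per_sequence_gib = per_token * context_len / 2**30

print(f"{per_token // 1024} KiB per token")
print(f"{per_sequence_gib:.1f} GiB per 128k-token sequence")
```

Under these assumptions a single 128k-token request needs about 16 GiB of cache, which is why the number of concurrent long-context requests is bounded by memory rather than compute.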

The key insight behind KV-Compress is that attention heads do not use their caches equally. Compression removes under-utilized KVs from each head's cache, and a higher theoretical compression rate can be achieved when the number of removed KVs is allowed to vary across heads instead of being the same for every head.
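The effect of letting eviction counts vary per head can be sketched with a toy selection rule. This is an illustration only: the scores below are made up, `per_head_keep_counts` is a hypothetical helper, and a plain global top-k stands in for whatever importance metric the paper actually uses.

```python
import heapq


def per_head_keep_counts(scores, total_keep):
    """Keep the `total_keep` highest-scoring KVs across ALL heads.

    scores[h][t] is an assumed importance score for token t in head h.
    Returns how many KVs each head retains.
    """
    flat = [(s, h, t) for h, row in enumerate(scores) for t, s in enumerate(row)]
    kept = heapq.nlargest(total_keep, flat)
    counts = [0] * len(scores)
    for _, h, _ in kept:
        counts[h] += 1
    return counts


# Three heads, four cached tokens each (made-up scores).
scores = [
    [0.9, 0.8, 0.7, 0.6],    # head 0 uses its whole cache
    [0.5, 0.1, 0.1, 0.1],    # head 1 relies on one token
    [0.4, 0.3, 0.05, 0.05],  # head 2 relies on one or two tokens
]

# Same overall budget (6 of 12 KVs) as keeping 2 per head uniformly.
counts = per_head_keep_counts(scores, total_keep=6)
print(counts)
```

With a uniform policy every head would keep two entries, forcing head 0 to drop entries it uses heavily while head 1 keeps entries it barely uses. The global selection keeps the same total number of KVs but distributes them unevenly across heads.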

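Applying variable per-head eviction within existing inference frameworks adds fragmentation, so the theoretical compression rate is not realized in physical memory. KV-Compress addresses this by evicting contiguous KV blocks within a PagedAttention framework, in which the cache is stored in fixed-size blocks. Freed blocks can be returned to the allocator, so the memory footprint of the cache falls in proportion to the theoretical compression rate.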
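The fragmentation argument can be illustrated with simple block accounting. The sketch below is a simplified model, not vLLM's allocator: the block size of 16, the function names, and the "shared table" layout (every head of a sequence keeps the same number of blocks) are assumptions made for illustration. Memory is counted in head-blocks, meaning one block's worth of KVs for one head.

```python
import math

BLOCK_SIZE = 16  # assumed number of KVs per block


def blocks_with_per_head_tables(kept_per_head, block_size=BLOCK_SIZE):
    """Each head has its own block table and keeps only the blocks it needs."""
    return sum(math.ceil(n / block_size) for n in kept_per_head)


def blocks_with_shared_table(kept_per_head, block_size=BLOCK_SIZE):
    """All heads share one table, so every head is sized for the longest one."""
    longest = max(kept_per_head)
    return len(kept_per_head) * math.ceil(longest / block_size)


# Four heads that each started with 1024 cached KVs (made-up retention counts).
original = [1024, 1024, 1024, 1024]
kept = [1024, 64, 64, 128]

full = blocks_with_per_head_tables(original)
per_head = blocks_with_per_head_tables(kept)
shared = blocks_with_shared_table(kept)

print(f"theoretical rate: {sum(original) / sum(kept):.1f}x")
print(f"per-head tables:  {full / per_head:.1f}x")
print(f"shared table:     {full / shared:.1f}x")
```

In this toy case the theoretical rate is 3.2x. Per-head block accounting realizes all of it, while the shared layout realizes none, because the one head that kept its full cache dictates the allocation for every other head.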

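The author evaluates KV-Compress on LongBench, reporting state-of-the-art performance for both Mistral-7B-Instruct-v0.2 and Llama-3.1-8B-Instruct while lowering the total number of compressed KVs by 4x compared with prior methods. Evaluations on Llama-3.1-8B-Instruct and Llama-3.1-70B-Instruct-FP8 reach compression rates of up to 8x with negligible impact on performance, and up to 64x while retaining over 90% of full-cache performance on all but three of the suite's subsets. An integration with vLLM increases total throughput by up to 5.18x by enabling larger decoding batches.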
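The link between compression rate and batch size is simple arithmetic, sketched below. The 40 GiB cache budget and 16 GiB per-sequence figure are hypothetical, and the function is an upper bound only: it ignores the memory needed before a sequence is compressed and every other runtime overhead, which is one reason a measured throughput gain can be smaller than the compression rate.

```python
def max_concurrent_sequences(budget_gib: int, per_seq_gib: int,
                             compression_rate: int) -> int:
    """Upper bound on sequences that fit once each cache shrinks by the rate."""
    return (budget_gib * compression_rate) // per_seq_gib


# Hypothetical deployment: 40 GiB reserved for KV cache, 16 GiB per
# uncompressed long-context sequence.
for rate in (1, 8, 64):
    fits = max_concurrent_sequences(budget_gib=40, per_seq_gib=16,
                                    compression_rate=rate)
    print(f"{rate:>2}x compression -> up to {fits} concurrent sequences")
```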

Critical Analysis

The paper presents a well-designed and thorough evaluation of the KV-Compress technique, including comparisons to other KV cache compression methods and analysis of the trade-offs between compression rate and model performance.

One potential limitation is that the technique relies on the heterogeneity of attention heads, which may not be present in all attention-based models. The author acknowledges this and suggests that the approach could be extended to other types of model components beyond attention heads.

Additionally, the headline systems result is total throughput in a vLLM integration, which improves because compression lets larger decoding batches fit in memory. The computational overhead of the compression step itself, and its effect on per-request latency, is an important factor in real-world deployment scenarios and deserves the same scrutiny.

Overall, KV-Compress appears to be a promising technique for efficiently compressing the memory-intensive KV caches in attention-based models, and the paper provides a solid foundation for further research and development in this area.

Conclusion

KV-Compress: Paged KV-Cache Compression with Variable Compression Rates per Attention Head presents a novel compression technique for the key-value (KV) caches used in attention-based machine learning models. The method, called KV-Compress, takes advantage of the heterogeneity of attention heads to apply variable compression rates, with higher compression in less important areas and lower compression in more important areas.

By evicting whole blocks within a paged KV cache, KV-Compress reduces the cache's physical memory footprint in proportion to the theoretical compression rate. The author reports compression rates of up to 8x with negligible impact on LongBench performance, and an increase in total throughput of up to 5.18x when the method is integrated with vLLM.

This work represents an important step forward in the efficient deployment of large, attention-based models, which are increasingly crucial for a wide range of AI applications. The insights and techniques presented in this paper could have far-reaching implications for the field of machine learning, paving the way for more memory-efficient and scalable model architectures.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

KV-Compress: Paged KV-Cache Compression with Variable Compression Rates per Attention Head

Isaac Rehg

Context lengths of Large Language Models (LLMs) have exploded in recent years, with 128k-token context becoming a standard and million-token context becoming a reality. Efficiently supporting long-context inference remains challenging as the memory that must be allocated in key-value (KV) cache for a generation scales with its context length, limiting the number of long-context requests that can be served concurrently under a given memory budget. KV cache compression can mitigate this issue by removing under-utilized KVs from each attention head's cache and reducing its memory footprint. Higher theoretical compression rates can be achieved when the number of removed KVs varies across attention heads, but application of such a strategy within existing inference frameworks adds fragmentation and cannot realize the theoretical compression rates in physical memory. We introduce KV-Compress, a novel compression method that evicts contiguous KV blocks within a PagedAttention framework, reducing the memory footprint of the KV cache proportionally to this theoretical compression rate. Our method achieves state-of-the-art performance on LongBench for both Mistral-7B-Instruct-v0.2 and Llama-3.1-8B-Instruct while lowering the total number of compressed KVs by 4x compared with prior methods. Evaluations on Llama-3.1-8B-Instruct and Llama-3.1-70B-Instruct-FP8 achieve compression rates up to 8x with negligible impact on performance, and up to 64x while retaining over 90% of full-cache performance for all but three of the suite's subsets. We benchmark an integration of our method with vLLM that increases total throughput by up to 5.18x by enabling larger decoding batches.

10/2/2024

RazorAttention: Efficient KV Cache Compression Through Retrieval Heads

Hanlin Tang, Yang Lin, Jing Lin, Qingsen Han, Shikuan Hong, Yiwu Yao, Gongyi Wang

The memory and computational demands of Key-Value (KV) cache present significant challenges for deploying long-context language models. Previous approaches attempt to mitigate this issue by selectively dropping tokens, which irreversibly erases critical information that might be needed for future queries. In this paper, we propose a novel compression technique for KV cache that preserves all token information. Our investigation reveals that: i) Most attention heads primarily focus on the local context; ii) Only a few heads, denoted as retrieval heads, can essentially pay attention to all input tokens. These key observations motivate us to use separate caching strategy for attention heads. Therefore, we propose RazorAttention, a training-free KV cache compression algorithm, which maintains a full cache for these crucial retrieval heads and discards the remote tokens in non-retrieval heads. Furthermore, we introduce a novel mechanism involving a compensation token to further recover the information in the dropped tokens. Extensive evaluations across a diverse set of large language models (LLMs) demonstrate that RazorAttention achieves a reduction in KV cache size by over 70% without noticeable impacts on performance. Additionally, RazorAttention is compatible with FlashAttention, rendering it an efficient and plug-and-play solution that enhances LLM inference efficiency without overhead or retraining of the original model.

7/24/2024

KV Cache Compression, But What Must We Give in Return? A Comprehensive Benchmark of Long Context Capable Approaches

Jiayi Yuan, Hongyi Liu, Shaochen (Henry) Zhong, Yu-Neng Chuang, Songchen Li, Guanchu Wang, Duy Le, Hongye Jin, Vipin Chaudhary, Zhaozhuo Xu, Zirui Liu, Xia Hu

Long context capability is a crucial competency for large language models (LLMs) as it mitigates the human struggle to digest long-form texts. This capability enables complex task-solving scenarios such as book summarization, code assistance, and many more tasks that are traditionally manpower-intensive. However, transformer-based LLMs face significant challenges with long context input due to the growing size of the KV cache and the intrinsic complexity of attending to extended inputs; where multiple schools of efficiency-driven approaches -- such as KV cache quantization, token dropping, prompt compression, linear-time sequence models, and hybrid architectures -- have been proposed to produce efficient yet long context-capable models. Despite these advancements, no existing work has comprehensively benchmarked these methods in a reasonably aligned environment. In this work, we fill this gap by providing a taxonomy of current methods and evaluating 10+ state-of-the-art approaches across seven categories of long context tasks. Our work reveals numerous previously unknown phenomena and offers insights -- as well as a friendly workbench -- for the future development of long context-capable LLMs. The source code will be available at https://github.com/henryzhongsc/longctx_bench

7/2/2024

MiniCache: KV Cache Compression in Depth Dimension for Large Language Models

Akide Liu, Jing Liu, Zizheng Pan, Yefei He, Gholamreza Haffari, Bohan Zhuang

A critical approach for efficiently deploying computationally demanding large language models (LLMs) is Key-Value (KV) caching. The KV cache stores key-value states of previously generated tokens, significantly reducing the need for repetitive computations and thereby lowering latency in autoregressive generation. However, the size of the KV cache grows linearly with sequence length, posing challenges for applications requiring long context input and extensive sequence generation. In this paper, we present a simple yet effective approach, called MiniCache, to compress the KV cache across layers from a novel depth perspective, significantly reducing the memory footprint for LLM inference. Our approach is based on the observation that KV cache states exhibit high similarity between the adjacent layers in the middle-to-deep portion of LLMs. To facilitate merging, we propose disentangling the states into the magnitude and direction components, interpolating the directions of the state vectors while preserving their lengths unchanged. Furthermore, we introduce a token retention strategy to keep highly distinct state pairs unmerged, thus preserving the information with minimal additional storage overhead. Our MiniCache is training-free and general, complementing existing KV cache compression strategies, such as quantization and sparsity. We conduct a comprehensive evaluation of MiniCache utilizing various models including LLaMA-2, LLaMA-3, Phi-3, Mistral, and Mixtral across multiple benchmarks, demonstrating its exceptional performance in achieving superior compression ratios and high throughput. On the ShareGPT dataset, LLaMA-2-7B with 4-bit MiniCache achieves a remarkable compression ratio of up to 5.02x, enhances inference throughput by approximately 5x, and reduces the memory footprint by 41% compared to the FP16 full cache baseline, all while maintaining near-lossless performance.

9/10/2024