Dynamic Memory Compression: Retrofitting LLMs for Accelerated Inference

Read original: arXiv:2403.09636 - Published 7/24/2024 by Piotr Nawrot, Adrian Łańcucki, Marcin Chochowski, David Tarjan, Edoardo M. Ponti

Overview

The paper presents Dynamic Memory Compression (DMC), a method that accelerates large language model (LLM) inference by compressing the key-value (KV) cache online, at inference time.
During generation, a Transformer must keep cached key-value representations for all past tokens, and this cache grows linearly with sequence length and batch size. DMC shrinks it, with the model learning to apply different compression ratios in different heads and layers.
The authors retrofit pre-trained Llama 2 models (7B, 13B and 70B) with DMC and report up to a 7x throughput increase during auto-regressive inference on an NVIDIA H100 GPU, while preserving the original downstream performance at up to 4x cache compression.

Plain English Explanation

Dynamic Memory Compression (DMC) is a technique that helps large language models (LLMs) run faster and use less memory during inference. LLMs such as Llama 2 are AI models that can generate human-like text, answer questions, and perform other language-related tasks.

The key insight behind DMC is that LLMs spend a lot of memory on a cache of "key-value" pairs for every token they have already processed. The self-attention mechanism reads this cache to relate each new token to its context, and the cache keeps growing as the text gets longer. DMC compresses the cache on the fly without significantly affecting the model's accuracy.

According to the paper, DMC can compress the key-value cache by up to 4 times while preserving the original model's downstream performance, and it can raise generation throughput by up to 7 times on an NVIDIA H100 GPU. The freed memory can also be used to fit longer contexts or larger batches, which matters for real-world applications where fast and efficient inference is crucial, such as chatbots, language translation, and content generation.

Technical Explanation

At the core of LLMs is the multi-head self-attention mechanism, which lets the model relate each token to the rest of the input. During auto-regressive generation, the model stores the key and value representations of all past tokens in a cache so they do not have to be recomputed. The size of this key-value (KV) cache scales linearly with the sequence length and the batch size, so it takes up a significant amount of memory.
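To make the memory cost concrete, the size of a key-value cache can be estimated with simple arithmetic. The sketch below is illustrative only: the function name `kv_cache_bytes`, the `compression_ratio` parameter, and the model dimensions (32 layers, 32 KV heads of dimension 128, 16-bit values) are assumptions chosen for the example, not figures taken from the paper.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, batch_size,
                   bytes_per_value=2, compression_ratio=1.0):
    """Approximate KV cache size in bytes.

    The leading factor 2 accounts for storing both keys and values.
    compression_ratio is a hypothetical knob: 4.0 means the cache holds
    one entry per four tokens on average.
    """
    entries_per_sequence = seq_len / compression_ratio
    return (2 * num_layers * num_kv_heads * head_dim
            * entries_per_sequence * batch_size * bytes_per_value)


# Assumed configuration: 32 layers, 32 KV heads, head dimension 128,
# 16-bit values, 4096-token sequences, batch size 8.
full = kv_cache_bytes(32, 32, 128, seq_len=4096, batch_size=8)
compressed = kv_cache_bytes(32, 32, 128, seq_len=4096, batch_size=8,
                            compression_ratio=4.0)

print(f"full cache:    {full / 2**30:.1f} GiB")        # full cache:    16.0 GiB
print(f"4x compressed: {compressed / 2**30:.1f} GiB")  # 4x compressed: 4.0 GiB
```

Because the estimate is linear in both `seq_len` and `batch_size`, the memory saved by compression can be spent on longer contexts or larger batches within the same budget.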

DMC compresses this cache online during inference, reducing the memory footprint without significantly impacting the model's accuracy. Two aspects of the method stand out:

Learned append-or-merge decisions: At each generation step, DMC decides whether to append the new key-value pair to the cache as a separate entry or to merge it into the most recent entry through a weighted average. Every merge saves one cache slot.
Head- and layer-specific compression ratios: The model learns to apply different compression ratios in different heads and layers, rather than a single fixed ratio everywhere. This behavior is acquired through continued pre-training on a small fraction of the original data, without adding any extra parameters.
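One way to picture online key-value cache compression is a cache that, for each new token, either appends a new entry or merges the token into the most recent entry. The toy class below is a sketch of that idea for a single attention head, not the authors' implementation: the class name `CompressedKVCache`, its `update` method, and the `merge` and `weight` arguments are all hypothetical. In particular, the merge decision is passed in by the caller here, whereas in a trained model it would come from the model itself.

```python
class CompressedKVCache:
    """Toy single-head cache that appends or merges incoming key-value pairs."""

    def __init__(self):
        self.keys = []     # one key vector (list of floats) per cache slot
        self.values = []   # one value vector per cache slot
        self.weights = []  # total weight accumulated in each slot

    def update(self, key, value, merge, weight=1.0):
        """Add one token's key and value to the cache.

        merge=False appends a new slot. merge=True folds the token into the
        last slot as a weighted running average, so the cache does not grow.
        """
        if merge and self.keys:
            old = self.weights[-1]
            total = old + weight
            self.keys[-1] = [(old * a + weight * b) / total
                             for a, b in zip(self.keys[-1], key)]
            self.values[-1] = [(old * a + weight * b) / total
                               for a, b in zip(self.values[-1], value)]
            self.weights[-1] = total
        else:
            self.keys.append(list(key))
            self.values.append(list(value))
            self.weights.append(weight)

    def __len__(self):
        return len(self.keys)


# Four tokens with made-up keys and values; every second token is merged.
tokens = [[1.0, 0.0], [3.0, 2.0], [0.0, 4.0], [2.0, 0.0]]
decisions = [False, True, False, True]  # assumed merge decisions

cache = CompressedKVCache()
for vec, merge in zip(tokens, decisions):
    cache.update(key=vec, value=vec, merge=merge)

print(len(cache))   # 2 slots for 4 tokens, i.e. 2x compression
print(cache.keys)   # [[2.0, 1.0], [1.0, 2.0]]
```

Running a separate cache like this per head and per layer, each with its own merge decisions, is what allows the compression ratio to differ across heads and layers.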

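The paper presents experiments on Llama 2 models (7B, 13B and 70B), reporting up to a 7x throughput increase during auto-regressive inference on an NVIDIA H100 GPU. With up to 4x cache compression, DMC preserves the original downstream performance and outperforms both up-trained grouped-query attention (GQA) and key-value eviction policies (H₂O, TOVA).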

Critical Analysis

The paper provides a thorough technical explanation of the DMC technique and its effectiveness in accelerating LLM inference. However, the authors do not fully address the potential limitations or caveats of their approach.

For example, it would be useful to know more about the impact of DMC on the model's ability to capture long-range dependencies and on its performance on more complex language tasks, such as multi-turn dialogues or open-ended generation. The authors do show that DMC can be combined with grouped-query attention (GQA) for compounded gains, but its interaction with other optimization techniques, such as model pruning or weight quantization, deserves further study.

Furthermore, DMC is not training-free: retrofitting a model requires continued pre-training, even if only on a small fraction of the original data. It would be valuable to understand how this additional training and the dynamic compression of key-value pairs affect the model's learning and generalization capabilities.

Conclusion

The Dynamic Memory Compression (DMC) technique presented in this paper offers a promising approach to accelerating the inference of large language models (LLMs) while significantly reducing their memory usage. By learning to compress the key-value cache of the multi-head self-attention mechanism at different rates in different heads and layers, DMC achieves up to a 7x throughput increase and preserves the original downstream performance at up to 4x cache compression.

This innovation has the potential to make LLMs more accessible and practical for a wider range of real-world applications, where fast and efficient inference is crucial. As the field of natural language processing continues to advance, techniques like DMC will play an important role in making these powerful AI models more deployable and scalable.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Dynamic Memory Compression: Retrofitting LLMs for Accelerated Inference

Piotr Nawrot, Adrian Łańcucki, Marcin Chochowski, David Tarjan, Edoardo M. Ponti

Transformers have emerged as the backbone of large language models (LLMs). However, generation remains inefficient due to the need to store in memory a cache of key-value representations for past tokens, whose size scales linearly with the input sequence length and batch size. As a solution, we propose Dynamic Memory Compression (DMC), a method for online key-value cache compression at inference time. Most importantly, the model learns to apply different compression ratios in different heads and layers. We retrofit pre-trained LLMs such as Llama 2 (7B, 13B and 70B) into DMC Transformers, achieving up to 7x throughput increase during auto-regressive inference on an NVIDIA H100 GPU. DMC is applied via continued pre-training on a negligible percentage of the original data without adding any extra parameters. DMC preserves the original downstream performance with up to 4x cache compression, outperforming up-trained grouped-query attention (GQA) and key-value eviction policies (H₂O, TOVA). GQA and DMC can be even combined to obtain compounded gains. Hence, DMC can serve as a drop-in replacement for KV caching in existing LLMs to fit longer contexts and larger batches within any given memory budget.

7/24/2024

Effectively Compress KV Heads for LLM

Hao Yu, Zelan Yang, Shen Li, Yong Li, Jianxin Wu

The advent of pre-trained large language models (LLMs) has revolutionized various natural language processing tasks. These models predominantly employ an auto-regressive decoding mechanism that utilizes Key-Value (KV) caches to eliminate redundant calculations for previous tokens. Nevertheless, as context lengths and batch sizes increase, the linear expansion in memory footprint of KV caches becomes a key bottleneck of LLM deployment, which decreases generation speeds significantly. To mitigate this issue, previous techniques like multi-query attention (MQA) and grouped-query attention (GQA) have been developed, in order to reduce KV heads to accelerate inference with comparable accuracy to multi-head attention (MHA). Despite their effectiveness, existing strategies for compressing MHA often overlook the intrinsic properties of the KV caches. In this work, we explore the low-rank characteristics of the KV caches and propose a novel approach for compressing KV heads. In particular, we carefully optimize the MHA-to-GQA transformation to minimize compression error, and to remain compatible with rotary position embeddings (RoPE), we also introduce specialized strategies for key caches with RoPE. We demonstrate that our method can compress half or even three-quarters of KV heads while maintaining performance comparable to the original LLMs, which presents a promising direction for more efficient LLM deployment in resource-constrained environments.

6/12/2024

MiniCache: KV Cache Compression in Depth Dimension for Large Language Models

Akide Liu, Jing Liu, Zizheng Pan, Yefei He, Gholamreza Haffari, Bohan Zhuang

A critical approach for efficiently deploying computationally demanding large language models (LLMs) is Key-Value (KV) caching. The KV cache stores key-value states of previously generated tokens, significantly reducing the need for repetitive computations and thereby lowering latency in autoregressive generation. However, the size of the KV cache grows linearly with sequence length, posing challenges for applications requiring long context input and extensive sequence generation. In this paper, we present a simple yet effective approach, called MiniCache, to compress the KV cache across layers from a novel depth perspective, significantly reducing the memory footprint for LLM inference. Our approach is based on the observation that KV cache states exhibit high similarity between the adjacent layers in the middle-to-deep portion of LLMs. To facilitate merging, we propose disentangling the states into the magnitude and direction components, interpolating the directions of the state vectors while preserving their lengths unchanged. Furthermore, we introduce a token retention strategy to keep highly distinct state pairs unmerged, thus preserving the information with minimal additional storage overhead. Our MiniCache is training-free and general, complementing existing KV cache compression strategies, such as quantization and sparsity. We conduct a comprehensive evaluation of MiniCache utilizing various models including LLaMA-2, LLaMA-3, Phi-3, Mistral, and Mixtral across multiple benchmarks, demonstrating its exceptional performance in achieving superior compression ratios and high throughput. On the ShareGPT dataset, LLaMA-2-7B with 4-bit MiniCache achieves a remarkable compression ratio of up to 5.02x, enhances inference throughput by approximately 5x, and reduces the memory footprint by 41% compared to the FP16 full cache baseline, all while maintaining near-lossless performance.

9/10/2024

Contemporary Model Compression on Large Language Models Inference

Dong Liu

Large Language Models (LLMs) have revolutionized natural language processing by achieving state-of-the-art results across a variety of tasks. However, the computational demands of LLM inference, including high memory consumption and slow processing speeds, pose significant challenges for real-world applications, particularly on resource-constrained devices. Efficient inference is crucial for scaling the deployment of LLMs to a broader range of platforms, including mobile and edge devices. This survey explores contemporary techniques in model compression that address these challenges by reducing the size and computational requirements of LLMs while maintaining their performance. We focus on model-level compression methods, including quantization, knowledge distillation, and pruning, as well as system-level optimizations like KV cache efficient design. Each of these methodologies offers a unique approach to optimizing LLMs, from reducing numerical precision to transferring knowledge between models and structurally simplifying neural networks. Additionally, we discuss emerging trends in system-level design that further enhance the efficiency of LLM inference. This survey aims to provide a comprehensive overview of current advancements in model compression and their potential to make LLMs more accessible and practical for diverse applications.

9/4/2024