Prepacking: A Simple Method for Fast Prefilling and Increased Throughput in Large Language Models

2404.09529

Published 4/16/2024 by Siyan Zhao, Daniel Israel, Guy Van den Broeck, Aditya Grover

Prepacking: A Simple Method for Fast Prefilling and Increased Throughput in Large Language Models

Abstract

During inference for transformer-based large language models (LLM), prefilling is the computation of the key-value (KV) cache for input tokens in the prompt prior to autoregressive generation. For longer input prompt lengths, prefilling will incur a significant overhead on decoding time. In this work, we highlight the following pitfall of prefilling: for batches containing high-varying prompt lengths, significant computation is wasted by the standard practice of padding sequences to the maximum length. As LLMs increasingly support longer context lengths, potentially up to 10 million tokens, variations in prompt lengths within a batch become more pronounced. To address this, we propose Prepacking, a simple yet effective method to optimize prefilling computation. To avoid redundant computation on pad tokens, prepacking combines prompts of varying lengths into a sequence and packs multiple sequences into a compact batch using a bin-packing algorithm. It then modifies the attention mask and positional encoding to compute multiple prefilled KV-caches for multiple prompts within a single sequence. On standard curated dataset containing prompts with varying lengths, we obtain a significant speed and memory efficiency improvements as compared to the default padding-based prefilling computation within Huggingface across a range of base model configurations and inference serving scenarios.

Get summaries of the top AI research delivered straight to your inbox:

Overview

This paper presents a simple technique called "prepacking" to improve the throughput and efficiency of large language models (LLMs).
Prepacking involves pre-filling the input buffer of the LLM with relevant context, reducing the time required to process each new query.
The authors demonstrate that prepacking can significantly increase the throughput of LLMs without compromising their performance on downstream tasks.

Plain English Explanation

Large language models (LLMs) like GPT-3 have become powerful tools for a wide range of natural language processing tasks. However, these models can be computationally intensive and slow to process each new input.

The key idea behind "prepacking" is to pre-fill the input buffer of the LLM with relevant context before the model receives a new query. This allows the model to start processing the new input immediately, without having to spend time building up the necessary context from scratch.

For example, if you were using an LLM to answer questions about a specific topic, you could pre-fill the input buffer with background information about that topic. Then, when a new question comes in, the model can quickly generate a response by combining the pre-filled context with the new information.

By reducing the time required to process each input, prepacking can significantly boost the throughput of LLMs, allowing them to handle more queries in the same amount of time. The authors show that this technique can improve throughput by up to 50% without negatively impacting the model's performance on downstream tasks.

Technical Explanation

The paper begins by outlining the Transformer architecture that underpins most modern LLMs. The authors then introduce the prepacking technique, which involves pre-filling the input buffer of the Transformer with relevant context before processing a new query.

To evaluate the effectiveness of prepacking, the researchers conduct experiments on several benchmark tasks, including language modeling, question answering, and text generation. They compare the performance and throughput of LLMs with and without prepacking, and find that prepacking can significantly improve throughput without compromising the model's accuracy on these tasks.

The authors attribute the performance gains to the reduced time required to process each input, as the model no longer needs to spend time building up the necessary context from scratch. They also explore the impact of different prepacking strategies, such as using fixed vs. dynamic context, and find that the optimal approach depends on the specific task and application.

Critical Analysis

The prepacking technique presented in this paper is a relatively simple yet effective way to improve the efficiency of LLMs. The authors provide a clear and well-designed experimental setup to evaluate the benefits of prepacking, and the results are compelling.

However, the paper does not address some potential limitations or areas for further research. For example, it's unclear how prepacking would scale to more complex or open-ended tasks, where the relevant context may be harder to anticipate and pre-fill. Additionally, the authors do not explore the trade-offs between the benefits of prepacking and the additional memory or computational resources required to implement it.

Despite these limitations, the prepacking technique represents a valuable contribution to the ongoing efforts to make LLMs more efficient and practical for real-world applications. The authors' clear and accessible presentation of the technique, along with the promising experimental results, make this paper a useful resource for researchers and practitioners working in this space.

Conclusion

The "prepacking" technique presented in this paper offers a simple and effective way to improve the throughput and efficiency of large language models. By pre-filling the input buffer with relevant context, the model can start processing new queries immediately, without the overhead of building up the necessary context from scratch.

The authors demonstrate that prepacking can increase throughput by up to 50% without compromising the model's performance on a range of benchmark tasks. While the technique has some limitations, it represents a valuable contribution to the ongoing efforts to make LLMs more practical and deployable in real-world applications.

Overall, this paper provides a clear and accessible introduction to the prepacking technique, making it a useful resource for researchers and practitioners working to improve the efficiency and performance of large language models.

Related Papers

Fewer Truncations Improve Language Modeling

Hantian Ding, Zijian Wang, Giovanni Paolini, Varun Kumar, Anoop Deoras, Dan Roth, Stefano Soatto

In large language model training, input documents are typically concatenated together and then split into sequences of equal length to avoid padding tokens. Despite its efficiency, the concatenation approach compromises data integrity -- it inevitably breaks many documents into incomplete pieces, leading to excessive truncations that hinder the model from learning to compose logically coherent and factually consistent content that is grounded on the complete context. To address the issue, we propose Best-fit Packing, a scalable and efficient method that packs documents into training sequences through length-aware combinatorial optimization. Our method completely eliminates unnecessary truncations while retaining the same training efficiency as concatenation. Empirical results from both text and code pre-training show that our method achieves superior performance (e.g., relatively +4.7% on reading comprehension; +16.8% in context following; and +9.2% on program synthesis), and reduces closed-domain hallucination effectively by up to 58.3%.

5/3/2024

cs.CL cs.AI cs.LG

🤯

Prompt Cache: Modular Attention Reuse for Low-Latency Inference

In Gim, Guojun Chen, Seung-seob Lee, Nikhil Sarda, Anurag Khandelwal, Lin Zhong

We present Prompt Cache, an approach for accelerating inference for large language models (LLM) by reusing attention states across different LLM prompts. Many input prompts have overlapping text segments, such as system messages, prompt templates, and documents provided for context. Our key insight is that by precomputing and storing the attention states of these frequently occurring text segments on the inference server, we can efficiently reuse them when these segments appear in user prompts. Prompt Cache employs a schema to explicitly define such reusable text segments, called prompt modules. The schema ensures positional accuracy during attention state reuse and provides users with an interface to access cached states in their prompt. Using a prototype implementation, we evaluate Prompt Cache across several LLMs. We show that Prompt Cache significantly reduce latency in time-to-first-token, especially for longer prompts such as document-based question answering and recommendations. The improvements range from 8x for GPU-based inference to 60x for CPU-based inference, all while maintaining output accuracy and without the need for model parameter modifications.

4/26/2024

cs.CL cs.AI

🛸

SnapKV: LLM Knows What You are Looking for Before Generation

Yuhong Li, Yingbing Huang, Bowen Yang, Bharat Venkitesh, Acyr Locatelli, Hanchen Ye, Tianle Cai, Patrick Lewis, Deming Chen

Large Language Models (LLMs) have made remarkable progress in processing extensive contexts, with the Key-Value (KV) cache playing a vital role in enhancing their performance. However, the growth of the KV cache in response to increasing input length poses challenges to memory and time efficiency. To address this problem, this paper introduces SnapKV, an innovative and fine-tuning-free approach that efficiently minimizes KV cache size while still delivering comparable performance in real-world applications. We discover that each attention head in the model consistently focuses on specific prompt attention features during generation. Meanwhile, this robust pattern can be obtained from an `observation' window located at the end of the prompts. Drawing on this insight, SnapKV automatically compresses KV caches by selecting clustered important KV positions for each attention head. Our approach significantly reduces the growing computational overhead and memory footprint when processing long input sequences. Specifically, SnapKV achieves a consistent decoding speed with a 3.6x increase in generation speed and an 8.2x enhancement in memory efficiency compared to baseline when processing inputs of 16K tokens. At the same time, it maintains comparable performance to baseline models across 16 long sequence datasets. Moreover, SnapKV can process up to 380K context tokens on a single A100-80GB GPU using HuggingFace implementation with minor changes, exhibiting only a negligible accuracy drop in the Needle-in-a-Haystack test. Further comprehensive studies suggest SnapKV's potential for practical applications.

4/24/2024

cs.CL cs.AI

Think before you speak: Training Language Models With Pause Tokens

Sachin Goyal, Ziwei Ji, Ankit Singh Rawat, Aditya Krishna Menon, Sanjiv Kumar, Vaishnavh Nagarajan

Language models generate responses by producing a series of tokens in immediate succession: the $(K+1)^{th}$ token is an outcome of manipulating $K$ hidden vectors per layer, one vector per preceding token. What if instead we were to let the model manipulate say, $K+10$ hidden vectors, before it outputs the $(K+1)^{th}$ token? We operationalize this idea by performing training and inference on language models with a (learnable) $textit{pause}$ token, a sequence of which is appended to the input prefix. We then delay extracting the model's outputs until the last pause token is seen, thereby allowing the model to process extra computation before committing to an answer. We empirically evaluate $textit{pause-training}$ on decoder-only models of 1B and 130M parameters with causal pretraining on C4, and on downstream tasks covering reasoning, question-answering, general understanding and fact recall. Our main finding is that inference-time delays show gains when the model is both pre-trained and finetuned with delays. For the 1B model, we witness gains on 8 of 9 tasks, most prominently, a gain of $18%$ EM score on the QA task of SQuAD, $8%$ on CommonSenseQA and $1%$ accuracy on the reasoning task of GSM8k. Our work raises a range of conceptual and practical future research questions on making delayed next-token prediction a widely applicable new paradigm.

4/23/2024

cs.CL cs.AI cs.LG