InstInfer: In-Storage Attention Offloading for Cost-Effective Long-Context LLM Inference

Read original: arXiv:2409.04992 - Published 9/10/2024 by Xiurui Pan, Endian Li, Qiao Li, Shengwen Liang, Yizhou Shan, Ke Zhou, Yingwei Luo, Xiaolin Wang, Jie Zhang

Overview

Targets cost-effective long-context LLM inference, where the key-value (KV) cache outgrows GPU memory
Proposes InstInfer, a system that offloads decoding-phase attention and the KV cache to Computational Storage Drives (CSDs)
Reports up to 11.1× higher long-sequence throughput than existing SSD-based solutions such as FlexGen (13B model, NVIDIA A6000 GPU)

Plain English Explanation

InstInfer is a technique for running large language models (LLMs) more efficiently on long input contexts. LLMs can generate human-like text, but serving them is expensive: as the context grows, so does the intermediate state the model must keep around, and that state normally has to fit in scarce GPU memory.

The key idea behind InstInfer is to move the attention computation of the decoding phase, together with the key-value (KV) cache it reads, into the storage device itself (a Computational Storage Drive, or CSD) instead of keeping it on the GPU. Systems that merely park the KV cache in host memory or on SSDs must pull it back over PCIe during inference, and that limited bandwidth becomes the bottleneck. Computing attention where the data lives avoids most of that traffic. The authors report that InstInfer substantially improves throughput over existing SSD-based approaches for long input contexts.
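To give a sense of scale, the sketch below estimates the size of the KV cache that the paper's abstract identifies as the memory bottleneck. The model dimensions (40 layers, hidden size 5120, 16-bit values, standard multi-head attention) are assumptions chosen to resemble a typical 13B-parameter decoder; they are not figures taken from the paper, and `kv_cache_bytes` is a hypothetical helper.

```python
def kv_cache_bytes(num_layers, hidden_size, seq_len, batch_size, bytes_per_value=2):
    """Estimate KV cache size: one key and one value vector per token, per layer."""
    per_token = 2 * num_layers * hidden_size * bytes_per_value  # 2 = key + value
    return per_token * seq_len * batch_size


# Assumed dimensions resembling a 13B decoder (hypothetical, not from the paper).
size = kv_cache_bytes(num_layers=40, hidden_size=5120, seq_len=32_768, batch_size=8)
print(f"{size / 2**30:.1f} GiB")
```

Under these assumptions the estimate comes to 200 GiB, several times the 48 GB of VRAM on an NVIDIA A6000, which is why the cache has to live somewhere other than the GPU.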

Technical Explanation

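The paper first explains the challenge of long-context LLM inference: the KV cache read by the attention mechanism grows with context length and batch size and strains GPU VRAM. Existing cost-effective systems spill the cache to host memory or SSDs, but the intensive KV cache accesses are then throttled by limited PCIe bandwidth. InstInfer instead offloads decoding-phase attention and the KV cache to CSDs. A dedicated flash-aware in-storage attention engine, paired with KV cache management mechanisms, exploits the drives' high internal bandwidth, and optimized peer-to-peer (P2P) transmission between the GPU and the CSDs further reduces data migration overhead.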
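The following is a minimal sketch of what one decoding step of attention computes and why running it next to the cached data reduces traffic. It is plain single-head scaled dot-product attention, not the paper's in-storage engine. The function names and the traffic model (send the new token's query, key, and value; receive one output vector) are assumptions made for illustration.

```python
import math


def decode_attention(query, keys, values):
    """Single-head attention for one new token over all cached tokens."""
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
    # Numerically stable softmax over the cached positions.
    peak = max(scores)
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    # Weighted sum of the cached value vectors.
    return [
        sum(w * value[d] for w, value in zip(weights, values)) / total
        for d in range(len(values[0]))
    ]


def bytes_moved_per_step(seq_len, head_dim, bytes_per_value=2):
    """Illustrative link traffic for one head and one decode step."""
    # Attention on the GPU: every cached key and value crosses the link.
    fetch_cache = 2 * seq_len * head_dim * bytes_per_value
    # Attention in storage (assumed model): send q, new k, new v; receive output.
    offloaded = 4 * head_dim * bytes_per_value
    return fetch_cache, offloaded


fetch, offload = bytes_moved_per_step(seq_len=32_768, head_dim=128)
print(f"fetch cache: {fetch} bytes, offloaded: {offload} bytes")
```

The point of the comparison is that the first quantity grows linearly with sequence length while the second does not, so the benefit of computing attention near the data increases as contexts get longer.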

The authors compare InstInfer with existing approaches for running LLMs on long input contexts. For a 13B model on an NVIDIA A6000 GPU, they report up to 11.1× higher throughput for long-sequence inference than existing SSD-based solutions such as FlexGen.

Critical Analysis

The paper presents a promising approach to improving the efficiency of LLM inference, particularly for long input contexts. The authors have carefully designed and evaluated their InstInfer technique, providing empirical evidence of its benefits.

One potential limitation is that the research focuses primarily on the attention mechanism and does not address other parts of LLM inference that may also affect performance and cost, such as the feedforward layers. Additionally, the paper does not explore the tradeoffs between different hardware configurations or the impact of varying input sizes and model complexities.

Further research could investigate the performance of InstInfer on a wider range of LLM architectures and input scenarios, as well as explore the possibility of extending the in-storage offloading concept to other components of the LLM inference process. Consideration of practical deployment challenges, such as integration with existing storage systems and potential security implications, could also be beneficial.

Conclusion

InstInfer aims to improve the efficiency of running large language models (LLMs) on long input contexts. By offloading decoding-phase attention and the KV cache to the storage system, it avoids most of the KV cache traffic over PCIe and can significantly raise throughput compared to existing SSD-based approaches.

The research demonstrates the potential benefits of rethinking the traditional LLM inference pipeline and leveraging the capabilities of storage systems to optimize performance and cost. As the demand for LLM-powered applications continues to grow, techniques like InstInfer may play an important role in making these models more accessible and cost-effective for a wide range of use cases.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

InstInfer: In-Storage Attention Offloading for Cost-Effective Long-Context LLM Inference

Xiurui Pan, Endian Li, Qiao Li, Shengwen Liang, Yizhou Shan, Ke Zhou, Yingwei Luo, Xiaolin Wang, Jie Zhang

The widespread adoption of Large Language Models (LLMs) marks a significant milestone in generative AI. Nevertheless, the increasing context length and batch size in offline LLM inference escalate the memory requirement of the key-value (KV) cache, which imposes a huge burden on the GPU VRAM, especially for resource-constrained scenarios (e.g., edge computing and personal devices). Several cost-effective solutions leverage host memory or SSDs to reduce storage costs for offline inference scenarios and improve the throughput. Nevertheless, they suffer from significant performance penalties imposed by intensive KV cache accesses due to limited PCIe bandwidth. To address these issues, we propose InstInfer, a novel LLM inference system that offloads the most performance-critical computation (i.e., attention in decoding phase) and data (i.e., KV cache) parts to Computational Storage Drives (CSDs), which minimizes the enormous KV transfer overheads. InstInfer designs a dedicated flash-aware in-storage attention engine with KV cache management mechanisms to exploit the high internal bandwidths of CSDs instead of being limited by the PCIe bandwidth. The optimized P2P transmission between GPU and CSDs further reduces data migration overheads. Experimental results demonstrate that for a 13B model using an NVIDIA A6000 GPU, InstInfer improves throughput for long-sequence inference by up to 11.1×, compared to existing SSD-based solutions such as FlexGen.

9/10/2024

Efficient LLM inference solution on Intel GPU

Hui Wu, Yi Gan, Feng Yuan, Jing Ma, Wei Zhu, Yutao Xu, Hong Zhu, Yuhua Zhu, Xiaoli Liu, Jinghui Gu, Peng Zhao

Transformer-based Large Language Models (LLMs) have been widely used in many fields, and the efficiency of LLM inference has become a hot topic in real applications. However, LLMs usually have complex model structures with massive numbers of operations and perform inference in the auto-regressive mode, making it a challenging task to design a system with high efficiency. In this paper, we propose an efficient LLM inference solution with low latency and high throughput. Firstly, we simplify the LLM decoder layer by fusing data movement and element-wise operations to reduce the memory access frequency and lower system latency. We also propose a segment KV cache policy to keep key/value of the request and response tokens in separate physical memory for effective device memory management, helping enlarge the runtime batch size and improve system throughput. A customized Scaled-Dot-Product-Attention kernel is designed to match our fusion policy based on the segment KV cache solution. We implement our LLM inference solution on Intel GPU and publish it publicly. Compared with the standard HuggingFace implementation, the proposed solution achieves up to 7x lower token latency and 27x higher throughput for some popular LLMs on Intel GPU.

6/26/2024

InfiniGen: Efficient Generative Inference of Large Language Models with Dynamic KV Cache Management

Wonbeom Lee, Jungi Lee, Junghwan Seo, Jaewoong Sim

Transformer-based large language models (LLMs) demonstrate impressive performance across various natural language processing tasks. Serving LLM inference for generating long contents, however, poses a challenge due to the enormous memory footprint of the transient state, known as the key-value (KV) cache, which scales with the sequence length and batch size. In this paper, we present InfiniGen, a novel KV cache management framework tailored for long-text generation, which synergistically works with modern offloading-based inference systems. InfiniGen leverages the key insight that a few important tokens that are essential for computing the subsequent attention layer in the Transformer can be speculated by performing a minimal rehearsal with the inputs of the current layer and part of the query weight and key cache of the subsequent layer. This allows us to prefetch only the essential KV cache entries (without fetching them all), thereby mitigating the fetch overhead from the host memory in offloading-based LLM serving systems. Our evaluation on several representative LLMs shows that InfiniGen improves the overall performance of a modern offloading-based system by up to 3.00x compared to prior KV cache management methods while offering substantially better model accuracy.

7/1/2024

Inf-MLLM: Efficient Streaming Inference of Multimodal Large Language Models on a Single GPU

Zhenyu Ning, Jieru Zhao, Qihao Jin, Wenchao Ding, Minyi Guo

Multimodal Large Language Models (MLLMs) are distinguished by their multimodal comprehensive ability and widely used in many real-world applications including GPT-4o, autonomous driving and robotics. Despite their impressive performance, the multimodal inputs always incur long contexts. Inference under long context requires caching massive Key and Value states (KV cache) of previous tokens, which introduces high latency and excessive memory consumption. For this reason, it is challenging to deploy streaming inference of MLLMs on edge devices, which largely constrains the power and usage of MLLMs in real-world applications. In this paper, we introduce Inf-MLLM, an efficient inference framework for MLLMs, which enables streaming inference of MLLMs on a single GPU with infinite context. Inf-MLLM is based on our key observation of the attention pattern in both LLMs and MLLMs called attention saddles. Thanks to the newly discovered attention pattern, Inf-MLLM maintains a size-constrained KV cache by dynamically caching recent tokens and relevant tokens. Furthermore, Inf-MLLM proposes attention bias, a novel approach to enable MLLMs to capture long-term dependency. We show that Inf-MLLM enables multiple LLMs and MLLMs to achieve stable performance over 4M-token long texts and multi-round conversations with 1-hour-long videos on a single GPU. In addition, Inf-MLLM exhibits better streaming reasoning quality than existing methods such as StreamingLLM and a 2x speedup over H2O.

9/17/2024