Splitwise: Efficient generative LLM inference using phase splitting

Read original: arXiv:2311.18677 - Published 5/21/2024 by Pratyush Patel, Esha Choukse, Chaojie Zhang, Aashaka Shah, Íñigo Goiri, Saeed Maleki, Ricardo Bianchini

Overview

This paper proposes a novel approach called "Splitwise" for efficient generative inference on large language models (LLMs).
The key idea is to split the two phases of an LLM inference request, compute-intensive prompt computation and memory-intensive token generation, onto separate machines so that each phase runs on hardware suited to it.
The authors use Splitwise to design inference clusters optimized for throughput, cost, and power, reporting 1.4x higher throughput at 20% lower cost than current designs, or 2.35x more throughput with the same cost and power budgets.

Plain English Explanation

Large language models (LLMs) have become incredibly powerful, but running them can be computationally expensive and time-consuming. The Splitwise approach introduced in this paper aims to make LLM inference more efficient.

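The main idea is to split each inference request into its two phases. In the first phase, prompt computation, the model processes the whole input prompt, which is compute-intensive. In the second phase, token generation, it produces the output one token at a time, which is memory-intensive and leaves much of a modern GPU's compute capability unused. Instead of running both phases on the same machine, Splitwise runs them on separate machines, so each phase can use hardware suited to it and be provisioned independently.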

For example, imagine you ask an LLM to write a long piece of text. Traditionally, the same machine would read your prompt and then generate the whole response. With Splitwise, one machine handles the prompt, hands the resulting state to a second machine over a fast interconnect, and the second machine generates the response. Because token generation does not need the compute capability of the latest GPUs, that second machine can run at lower power and cost.

By adopting this phase-splitting approach, the authors design clusters that achieve 1.4x higher throughput at 20% lower cost than current designs, or 2.35x more throughput with the same cost and power budgets. This could make large-scale LLM deployments cheaper and less power-hungry to operate.

Technical Explanation

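The key innovation introduced in the Splitwise paper is the concept of phase splitting for efficient generative LLM inference. Based on their characterization, the authors find two main phases in an LLM inference request: a compute-intensive prompt computation and a memory-intensive token generation, each with distinct latency, throughput, memory, and power characteristics. They argue that running both phases on the same machines is suboptimal because, despite state-of-the-art batching and scheduling, token generation underutilizes compute resources.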
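The paper's abstract describes prompt computation as compute-intensive and token generation as memory-intensive. One rough way to see why is to count how many floating-point operations a dense layer performs per byte of weights it reads. The sketch below uses a deliberately simplified model that is an assumption of this summary, not taken from the paper: a single square weight matrix, 2-byte weights, no batching across requests, and activation traffic ignored.

```python
def arithmetic_intensity(tokens_per_pass: int, bytes_per_weight: int = 2) -> float:
    """FLOPs per byte of weights read for one dense d x d layer.

    Simplified model (assumption): the weight matrix is read once per
    forward pass and applied to every token in that pass.
      FLOPs = 2 * tokens * d * d
      bytes = d * d * bytes_per_weight
    The layer width d cancels out of the ratio.
    """
    return 2 * tokens_per_pass / bytes_per_weight


# Prompt computation: all prompt tokens go through the layer in one pass.
prompt_intensity = arithmetic_intensity(tokens_per_pass=1024)

# Token generation: each step pushes a single new token through the layer.
decode_intensity = arithmetic_intensity(tokens_per_pass=1)

print(f"prompt: {prompt_intensity} FLOPs/byte, decode: {decode_intensity} FLOPs/byte")
```

Under these assumptions the prompt pass reuses each weight a thousand times more than a generation step does, which is consistent with the abstract's observation that token generation underutilizes compute.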

To address this, Splitwise splits the two phases onto separate machines. The prompt machine processes the input prompt, and the resulting state is transferred to a token machine, which generates the output tokens. The authors implement and optimize this state transfer using the fast back-plane interconnects available in today's GPU clusters.
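The hand-off between phases described in the paper's abstract (prompt computation on one machine, state transfer, token generation on another) can be sketched as a toy program. Every class and function name here is hypothetical, the "model" is a placeholder that emits dummy tokens, and the state transfer is just a Python object being passed along; this illustrates the control flow only, not the authors' implementation.

```python
from dataclasses import dataclass, field


@dataclass
class RequestState:
    """State handed from the prompt machine to the token machine (hypothetical)."""
    prompt_tokens: list[str]
    cache: list[str] = field(default_factory=list)  # stands in for per-token model state


class PromptMachine:
    def run_prompt(self, prompt: str) -> RequestState:
        tokens = prompt.split()
        # Prompt computation: all input tokens are processed in a single pass.
        return RequestState(prompt_tokens=tokens, cache=list(tokens))


class TokenMachine:
    def generate(self, state: RequestState, max_new_tokens: int) -> list[str]:
        output = []
        for i in range(max_new_tokens):
            # Token generation: one token per step; each step would read the
            # whole cache. The token itself is a placeholder, not a model output.
            token = f"tok{i}"
            state.cache.append(token)
            output.append(token)
        return output


def serve(prompt: str, prompt_machine: PromptMachine,
          token_machine: TokenMachine, max_new_tokens: int = 3) -> list[str]:
    state = prompt_machine.run_prompt(prompt)              # phase 1
    # In a real cluster the state would cross the interconnect here.
    return token_machine.generate(state, max_new_tokens)   # phase 2


result = serve("explain phase splitting", PromptMachine(), TokenMachine())
print(result)
```

Keeping the two roles behind separate objects mirrors the point of the design: each side could be backed by a different machine type without changing the request flow.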

Separating the phases lets each run on hardware well suited to it and lets resources be provisioned independently per phase. In particular, token generation does not require the compute capability of the latest GPUs and can run with lower power and cost. Clusters can use the same or different types of machines for the two phases.
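To illustrate what provisioning each phase independently could look like, here is a minimal capacity sketch based on Little's law (average requests in flight equals arrival rate times time spent in the phase). The function name and every number below are hypothetical placeholders, not measurements from the paper, and the paper's own cluster design method may differ.

```python
import math


def machines_needed(requests_per_s: float, seconds_per_request: float,
                    concurrent_requests_per_machine: int) -> int:
    """Machines required for one phase, sized with Little's law."""
    in_flight = requests_per_s * seconds_per_request
    return math.ceil(in_flight / concurrent_requests_per_machine)


# Hypothetical workload: 20 requests/s, short prompt phase, long generation phase.
prompt_machines = machines_needed(
    requests_per_s=20, seconds_per_request=0.5, concurrent_requests_per_machine=2)
token_machines = machines_needed(
    requests_per_s=20, seconds_per_request=8.0, concurrent_requests_per_machine=16)

print(prompt_machines, token_machines)
```

Because the two pools are sized separately, a shift in the workload (for example, longer outputs) changes only the token pool, and that pool can use a different machine type from the prompt pool.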

The authors use Splitwise to design LLM inference clusters optimized for three objectives: throughput, cost, and power. They report 1.4x higher throughput at 20% lower cost than current designs, or alternatively 2.35x more throughput with the same cost and power budgets.
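The abstract reports two headline figures: 1.4x higher throughput at 20% lower cost, and 2.35x more throughput at the same cost and power. A short calculation puts both on one scale, throughput per unit cost relative to the baseline. The helper name is made up for this illustration, and the combined ratio is derived here rather than quoted from the paper.

```python
def throughput_per_cost(throughput_ratio: float, cost_ratio: float) -> float:
    """Relative throughput per unit cost versus a baseline of 1.0."""
    return throughput_ratio / cost_ratio


# 1.4x throughput at 20% lower cost (cost ratio 0.8).
cheaper_design = throughput_per_cost(1.4, 0.8)

# 2.35x throughput at the same cost (cost ratio 1.0).
iso_cost_design = throughput_per_cost(2.35, 1.0)

print(f"{cheaper_design:.2f}x and {iso_cost_design:.2f}x throughput per unit cost")
```

The first configuration works out to roughly 1.75x throughput per unit cost, so the two reported design points sit at different places on the cost-versus-throughput trade-off rather than being the same result stated twice.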

Critical Analysis

The Splitwise approach presents a promising solution for improving the efficiency of generative LLM inference. By running prompt computation and token generation on separate machines, the authors show that clusters can use hardware better and achieve significant throughput, cost, and power gains.

However, splitting a request across machines introduces state-transfer overhead between the prompt and token-generation machines. The authors optimize this transfer using fast back-plane interconnects, but the approach depends on such interconnects being available, which could limit the benefits in clusters that lack them.
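A back-of-the-envelope estimate shows how the cost of moving state between machines scales. The sketch assumes the transferred state is a standard multi-head attention key/value cache (the abstract only says "state"), and the model dimensions and the 50 GB/s link bandwidth are hypothetical values chosen for illustration, not figures from the paper.

```python
def kv_cache_bytes(prompt_tokens: int, layers: int, hidden_size: int,
                   bytes_per_value: int = 2) -> int:
    """Size of the key/value cache after the prompt phase.

    Assumes two tensors (keys and values) per layer, each of shape
    prompt_tokens x hidden_size, stored in 2-byte precision by default.
    """
    return 2 * layers * prompt_tokens * hidden_size * bytes_per_value


def transfer_seconds(num_bytes: int, bandwidth_bytes_per_s: float) -> float:
    """Idealized transfer time: size divided by link bandwidth, no latency term."""
    return num_bytes / bandwidth_bytes_per_s


# Hypothetical model: 32 layers, hidden size 4096, 1000-token prompt.
size = kv_cache_bytes(prompt_tokens=1000, layers=32, hidden_size=4096)
seconds = transfer_seconds(size, bandwidth_bytes_per_s=50e9)

print(f"{size / 1e6:.1f} MB, about {seconds * 1e3:.1f} ms at 50 GB/s")
```

The state grows linearly with prompt length, so the same estimate on a much slower link would yield a proportionally longer transfer, which is why the availability of a fast interconnect matters.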

Additionally, it would be valuable to understand how Splitwise performs across a wider range of workloads, since the best balance between prompt and token-generation machines presumably depends on the mix of prompt and output lengths. Further research is needed to assess the generalizability of the Splitwise approach and its applicability to different LLM architectures and use cases.

Output quality is less of a concern than it might first appear: as described in the abstract, phase splitting changes where each phase runs, not what the model computes. The more relevant question is per-request latency. The reported gains are in cluster throughput, cost, and power, and latency-sensitive applications would need the state transfer to stay small relative to the time spent generating tokens.

Despite these potential limitations, the Splitwise paper presents a compelling and innovative approach to improving the efficiency of generative LLM inference. As the demand for LLM-powered applications continues to grow, solutions like Splitwise could play a crucial role in making these models more accessible and practical by lowering the cost and power of serving them at scale.

Conclusion

The Splitwise paper introduces a novel approach for efficient generative inference on large language models (LLMs). By splitting prompt computation and token generation onto separate machines, the authors design clusters with 1.4x higher throughput at 20% lower cost than current designs, or 2.35x more throughput with the same cost and power budgets.

This phase-splitting approach could have far-reaching implications for the large-scale deployment of LLMs, which today relies on complex, expensive, and power-hungry AI accelerators, most commonly GPUs. By making LLM serving more efficient, the Splitwise technique could contribute to the broader adoption of these models in real-world use cases.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Splitwise: Efficient generative LLM inference using phase splitting

Pratyush Patel, Esha Choukse, Chaojie Zhang, Aashaka Shah, Íñigo Goiri, Saeed Maleki, Ricardo Bianchini

Recent innovations in generative large language models (LLMs) have made their applications and use-cases ubiquitous. This has led to large-scale deployments of these models, using complex, expensive, and power-hungry AI accelerators, most commonly GPUs. These developments make LLM inference efficiency an important challenge. Based on our extensive characterization, we find that there are two main phases during an LLM inference request: a compute-intensive prompt computation, and a memory-intensive token generation, each with distinct latency, throughput, memory, and power characteristics. Despite state-of-the-art batching and scheduling, the token generation phase underutilizes compute resources. Specifically, unlike compute-intensive prompt computation phases, token generation phases do not require the compute capability of the latest GPUs, and can be run with lower power and cost. With Splitwise, we propose splitting the two phases of an LLM inference request onto separate machines. This allows us to use hardware that is well-suited for each phase, and provision resources independently per phase. However, splitting an inference request across machines requires state transfer from the machine running prompt computation over to the machine generating tokens. We implement and optimize this state transfer using the fast back-plane interconnects available in today's GPU clusters. We use the Splitwise technique to design LLM inference clusters using the same or different types of machines for the prompt computation and token generation phases. Our clusters are optimized for three key objectives: throughput, cost, and power. In particular, we show that we can achieve 1.4x higher throughput at 20% lower cost than current designs. Alternatively, we can achieve 2.35x more throughput with the same cost and power budgets.

5/21/2024

Distributed Inference Performance Optimization for LLMs on CPUs

Pujiang He, Shan Zhou, Changqing Li, Wenhuan Huang, Weifei Yu, Duyi Wang, Chen Meng, Sheng Gui

Large language models (LLMs) hold tremendous potential for addressing numerous real-world challenges, yet they typically demand significant computational resources and memory. Deploying LLMs onto a resource-limited hardware device with restricted memory capacity presents considerable challenges. Distributed computing emerges as a prevalent strategy to mitigate single-node memory constraints and expedite LLM inference performance. To reduce the hardware limitation burden, we propose an efficient distributed inference optimization solution for LLMs on CPUs. We conduct experiments with the proposed solution on 5th Gen Intel Xeon Scalable Processors, and the results show that the time per output token for an LLM with 72B parameters is 140 ms/token, much faster than the average human reading speed of about 200 ms per token.

7/2/2024

Performance Modeling and Workload Analysis of Distributed Large Language Model Training and Inference

Joyjit Kundu, Wenzhe Guo, Ali BanaGozar, Udari De Alwis, Sourav Sengupta, Puneet Gupta, Arindam Mallik

Aligning future system design with the ever-increasing compute needs of large language models (LLMs) is undoubtedly an important problem in today's world. Here, we propose a general performance modeling methodology and workload analysis of distributed LLM training and inference through an analytical framework that accurately considers compute, memory sub-system, network, and various parallelization strategies (model parallel, data parallel, pipeline parallel, and sequence parallel). We validate our performance predictions with published data from literature and relevant industry vendors (e.g., NVIDIA). For distributed training, we investigate the memory footprint of LLMs for different activation re-computation methods, dissect the key factors behind the massive performance gain from A100 to B200 (~35x speed-up closely following NVIDIA's scaling trend), and further run a design space exploration at different technology nodes (12 nm to 1 nm) to study the impact of logic, memory, and network scaling on the performance. For inference, we analyze the compute versus memory boundedness of different operations at a matrix-multiply level for different GPU systems and further explore the impact of DRAM memory technology scaling on inference latency. Utilizing our modeling framework, we reveal the evolution of performance bottlenecks for both LLM training and inference with technology scaling, thus, providing insights to design future systems for LLM training and inference.

7/23/2024

Efficient LLM inference solution on Intel GPU

Hui Wu, Yi Gan, Feng Yuan, Jing Ma, Wei Zhu, Yutao Xu, Hong Zhu, Yuhua Zhu, Xiaoli Liu, Jinghui Gu, Peng Zhao

Transformer-based Large Language Models (LLMs) have been widely used in many fields, and the efficiency of LLM inference has become a hot topic in real applications. However, LLMs usually have complicated model structures with massive operations and perform inference in the auto-regressive mode, making it a challenging task to design a system with high efficiency. In this paper, we propose an efficient LLM inference solution with low latency and high throughput. Firstly, we simplify the LLM decoder layer by fusing data movement and element-wise operations to reduce the memory access frequency and lower system latency. We also propose a segment KV cache policy to keep key/value of the request and response tokens in separate physical memory for effective device memory management, helping enlarge the runtime batch size and improve system throughput. A customized Scaled-Dot-Product-Attention kernel is designed to match our fusion policy based on the segment KV cache solution. We implement our LLM inference solution on Intel GPU and publish it publicly. Compared with the standard HuggingFace implementation, the proposed solution achieves up to 7x lower token latency and 27x higher throughput for some popular LLMs on Intel GPU.

6/26/2024