DistServe: Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving

Read original: arXiv:2401.09670 - Published 6/7/2024 by Yinmin Zhong, Shengyu Liu, Junda Chen, Jianbo Hu, Yibo Zhu, Xuanzhe Liu, Xin Jin, Hao Zhang

DistServe: Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving

Overview

The paper "DistServe: Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving" presents a novel approach to efficiently serving large language models (LLMs).
Key ideas include disaggregating the inference process into separate prefill and decoding stages, and optimizing resource utilization to improve overall system throughput.
The proposed system, DistServe, aims to overcome limitations of existing LLM serving approaches and achieve higher goodput, which measures actual useful work done per unit of computing resources.

Plain English Explanation

Large language models (LLMs) are powerful AI systems that can generate human-like text, answer questions, and perform various language tasks. However, serving these LLMs efficiently at scale is a significant challenge. DistServe tackles this problem by splitting the inference process into two distinct stages: prefill and decoding.

The prefill stage prepares the initial input for the LLM, while the decoding stage generates the final output. By separating these steps, DistServe can optimize resource utilization and improve the overall throughput of the system, a metric known as "goodput." This is important because it means the system can do more useful work with the same computing resources, making it more efficient and cost-effective.

DistServe also disaggregates the prefill and decoding stages, allowing them to be executed on separate hardware resources. This flexibility enables better load balancing and multiplexing of different requests, further enhancing the system's efficiency.

Technical Explanation

The key innovation in DistServe is the disaggregation of the LLM inference process into prefill and decoding stages. The prefill stage prepares the initial input for the LLM, which can involve tasks like tokenization, embedding, and context retrieval. The decoding stage then generates the final output using the prepared input.

By separating these steps, DistServe can optimize resource utilization and improve overall system throughput. The prefill and decoding stages can be executed on separate hardware resources, allowing for better load balancing and multiplexing of different requests. This flexibility helps to bridge the distribution gap between the prefill and decoding workloads, leading to higher goodput.

To further enhance efficiency, DistServe employs techniques like dynamic batching and early exit to optimize resource utilization. The system also leverages precomputed prefill results to reduce latency and improve overall responsiveness.

Critical Analysis

The DistServe approach presents a promising solution to the challenge of efficiently serving LLMs at scale. By disaggregating the inference process, the system can better optimize resource utilization and achieve higher goodput. However, the paper does not address potential limitations or trade-offs of this approach.

One area for further research could be the impact of the prefill and decoding separation on the overall quality of the LLM's outputs. It's possible that the distribution gap between the two stages could introduce some degradation in performance, which would need to be carefully evaluated.

Additionally, the paper does not discuss the overhead or complexity introduced by the disaggregation and dynamic resource allocation mechanisms. Implementing these features may increase system complexity and introduce additional operational challenges.

Conclusion

The DistServe paper introduces an innovative approach to serving large language models more efficiently. By disaggregating the inference process into prefill and decoding stages, the system can optimize resource utilization and improve overall throughput, a crucial metric for real-world deployment of LLMs. While the technical details are complex, the core idea of separating and optimizing different components of the inference process is a promising direction for enhancing the scalability and cost-effectiveness of large language model serving.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Follow @aimodelsfyi on 𝕏 →

Related Papers

DistServe: Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving

Yinmin Zhong, Shengyu Liu, Junda Chen, Jianbo Hu, Yibo Zhu, Xuanzhe Liu, Xin Jin, Hao Zhang

DistServe improves the performance of large language models (LLMs) serving by disaggregating the prefill and decoding computation. Existing LLM serving systems colocate the two phases and batch the computation of prefill and decoding across all users and requests. We find that this strategy not only leads to strong prefill-decoding interferences but also couples the resource allocation and parallelism plans for both phases. LLM applications often emphasize individual latency for each phase: time to first token (TTFT) for the prefill phase and time per output token (TPOT) of each request for the decoding phase. In the presence of stringent latency requirements, existing systems have to prioritize one latency over the other, or over-provision compute resources to meet both. DistServe assigns prefill and decoding computation to different GPUs, hence eliminating prefill-decoding interferences. Given the application's TTFT and TPOT requirements, DistServe co-optimizes the resource allocation and parallelism strategy tailored for each phase. DistServe also places the two phases according to the serving cluster's bandwidth to minimize the communication caused by disaggregation. As a result, DistServe significantly improves LLM serving performance in terms of the maximum rate that can be served within both TTFT and TPOT constraints on each GPU. Our evaluations show that on various popular LLMs, applications, and latency requirements, DistServe can serve 7.4x more requests or 12.6x tighter SLO, compared to state-of-the-art systems, while staying within latency constraints for > 90% of requests.

6/7/2024

P/D-Serve: Serving Disaggregated Large Language Model at Scale

Yibo Jin, Tao Wang, Huimin Lin, Mingyang Song, Peiyang Li, Yipeng Ma, Yicheng Shan, Zhengfan Yuan, Cailong Li, Yajing Sun, Tiandeng Wu, Xing Chu, Ruizhi Huan, Li Ma, Xiao You, Wenting Zhou, Yunpeng Ye, Wen Liu, Xiangkun Xu, Yongsheng Zhang, Tiantian Dong, Jiawei Zhu, Zhe Wang, Xijian Ju, Jianxun Song, Haoliang Cheng, Xiaojing Li, Jiandong Ding, Hefei Guo, Zhengyong Zhang

Serving disaggregated large language models (LLMs) over tens of thousands of xPU devices (GPUs or NPUs) with reliable performance faces multiple challenges. 1) Ignoring the diversity (various prefixes and tidal requests), treating all the prompts in a mixed pool is inadequate. To facilitate the similarity per scenario and minimize the inner mismatch on P/D (prefill and decoding) processing, fine-grained organization is required, dynamically adjusting P/D ratios for better performance. 2) Due to inaccurate estimation on workload (queue status or maintained connections), the global scheduler easily incurs unnecessary timeouts in prefill. 3) Block-fixed device-to-device (D2D) KVCache transfer over cluster-level RDMA (remote direct memory access) fails to achieve desired D2D utilization as expected. To overcome previous problems, this paper proposes an end-to-end system P/D-Serve, complying with the paradigm of MLOps (machine learning operations), which models end-to-end (E2E) P/D performance and enables: 1) fine-grained P/D organization, mapping the service with RoCE (RDMA over converged ethernet) as needed, to facilitate similar processing and dynamic adjustments on P/D ratios; 2) on-demand forwarding upon rejections for idle prefill, decoupling the scheduler from regular inaccurate reports and local queues, to avoid timeouts in prefill; and 3) efficient KVCache transfer via optimized D2D access. P/D-Serve is implemented upon Ascend and MindSpore, has been deployed over tens of thousands of NPUs for more than eight months in commercial use, and further achieves 60%, 42% and 46% improvements on E2E throughput, time-to-first-token (TTFT) SLO (service level objective) and D2D transfer time. As the E2E system with optimizations, P/D-Serve achieves 6.7x increase on throughput, compared with aggregated LLMs.

8/16/2024

Taming Throughput-Latency Tradeoff in LLM Inference with Sarathi-Serve

Amey Agrawal, Nitin Kedia, Ashish Panwar, Jayashree Mohan, Nipun Kwatra, Bhargav S. Gulavani, Alexey Tumanov, Ramachandran Ramjee

Each LLM serving request goes through two phases. The first is prefill which processes the entire input prompt and produces the first output token and the second is decode which generates the rest of output tokens, one-at-a-time. Prefill iterations have high latency but saturate GPU compute due to parallel processing of the input prompt. In contrast, decode iterations have low latency but also low compute utilization because a decode iteration processes only a single token per request. This makes batching highly effective for decodes and consequently for overall throughput. However, batching multiple requests leads to an interleaving of prefill and decode iterations which makes it challenging to achieve both high throughput and low latency. We introduce an efficient LLM inference scheduler, Sarathi-Serve, to address this throughput-latency tradeoff. Sarathi-Serve introduces chunked-prefills which splits a prefill request into near equal sized chunks and creates stall-free schedules that adds new requests in a batch without pausing ongoing decodes. Stall-free scheduling unlocks the opportunity to improve throughput with large batch sizes while minimizing the effect of batching on latency. Furthermore, uniform batches in Sarathi-Serve ameliorate the imbalance between iterations resulting in minimal pipeline bubbles. Our techniques yield significant improvements in inference performance across models and hardware under tail latency constraints. For Mistral-7B on single A100 GPUs, we achieve 2.6x higher serving capacity and up to 3.7x higher serving capacity for the Yi-34B model on two A100 GPUs as compared to vLLM. When used with pipeline parallelism on Falcon-180B, Sarathi-Serve provides up to 5.6x gain in the end-to-end serving capacity. The source code for Sarathi-Serve is available at https://github.com/microsoft/sarathi-serve.

6/19/2024

LoongServe: Efficiently Serving Long-context Large Language Models with Elastic Sequence Parallelism

Bingyang Wu, Shengyu Liu, Yinmin Zhong, Peng Sun, Xuanzhe Liu, Xin Jin

The context window of large language models (LLMs) is rapidly increasing, leading to a huge variance in resource usage between different requests as well as between different phases of the same request. Restricted by static parallelism strategies, existing LLM serving systems cannot efficiently utilize the underlying resources to serve variable-length requests in different phases. To address this problem, we propose a new parallelism paradigm, elastic sequence parallelism (ESP), to elastically adapt to the variance between different requests and phases. Based on ESP, we design and build LoongServe, an LLM serving system that (1) improves computation efficiency by elastically adjusting the degree of parallelism in real-time, (2) improves communication efficiency by reducing key-value cache migration overhead and overlapping partial decoding communication with computation, and (3) improves GPU memory efficiency by reducing key-value cache fragmentation across instances. Our evaluation under diverse real-world datasets shows that LoongServe improves the maximum throughput by up to 3.85$times$ compared to the chunked prefill and 5.81$times$ compared to the prefill-decoding disaggregation.

4/16/2024