P/D-Serve: Serving Disaggregated Large Language Model at Scale

Read original: arXiv:2408.08147 - Published 8/16/2024 by Yibo Jin, Tao Wang, Huimin Lin, Mingyang Song, Peiyang Li, Yipeng Ma, Yicheng Shan, Zhengfan Yuan, Cailong Li, Yajing Sun and 20 others

Overview

The paper presents P/D-Serve, a system for serving disaggregated large language models at scale.
P/D-Serve aims to improve the efficiency and performance of serving large language models by running the two phases of inference, prefill (processing the prompt) and decoding (generating output tokens), on separate instances.
The system targets high throughput and a low time-to-first-token (TTFT) in production, and the authors report that it has been deployed on tens of thousands of NPUs.

Plain English Explanation

P/D-Serve: Serving Disaggregated Large Language Model at Scale describes a new approach for running large language models, which are complex AI systems that can generate human-like text. Traditional ways of running these models often struggle to handle the high demand and fast response times required for real-world applications.

The key idea behind P/D-Serve is to split up the two phases involved in running a large language model. Normally, prefill (reading the whole prompt and building the model's internal state) and decoding (generating the reply one token at a time) happen together on the same devices. P/D-Serve places these phases on separate groups of devices that can be optimized and scaled independently.

This disaggregated design allows P/D-Serve to better manage the resources and workload needed to serve large language models. It can dynamically allocate computing power based on demand, and avoid bottlenecks that plague more monolithic approaches. The result is a system that can handle high throughput and low latency, which is crucial for real-time applications like chatbots or virtual assistants.

Technical Explanation

P/D-Serve achieves this by splitting the serving process into two main components:

Prefill (P): Processes the entire prompt in one pass, builds the key-value cache (KVCache), and produces the first output token.
Decoding (D): Receives the KVCache from a prefill instance and generates the remaining output tokens one at a time.

These components run on separate instances and are scaled independently based on the workload. The ratio of prefill to decoding instances is adjusted dynamically per scenario, so that, for example, a workload dominated by long prompts can be given more prefill capacity.
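
The handoff between the prefill and decoding phases can be illustrated with a toy sketch. This is not the paper's implementation: the class names, the `transfer` function, and the arithmetic stand-in for the model are all hypothetical, and a real KVCache holds tensors that move between devices rather than a list of integers.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class KVCache:
    """Stand-in for per-token key/value state (hypothetical)."""
    entries: List[int]


def toy_next_token(cache: KVCache) -> int:
    # Hypothetical deterministic "model": the next token depends only on the cache.
    return (sum(cache.entries) + len(cache.entries)) % 100


class PrefillInstance:
    def run(self, prompt_tokens: List[int]) -> Tuple[KVCache, int]:
        # Prefill handles the whole prompt in one pass and emits the first token.
        cache = KVCache(entries=list(prompt_tokens))
        return cache, toy_next_token(cache)


class DecodeInstance:
    def run(self, cache: KVCache, first_token: int, max_new_tokens: int) -> List[int]:
        # Decoding extends the received cache one token at a time.
        output = [first_token]
        cache.entries.append(first_token)
        while len(output) < max_new_tokens:
            token = toy_next_token(cache)
            output.append(token)
            cache.entries.append(token)
        return output


def transfer(cache: KVCache) -> KVCache:
    # Stand-in for the device-to-device KVCache transfer; here it is just a copy.
    return KVCache(entries=list(cache.entries))


def serve(prompt_tokens: List[int], max_new_tokens: int) -> List[int]:
    cache, first_token = PrefillInstance().run(prompt_tokens)
    return DecodeInstance().run(transfer(cache), first_token, max_new_tokens)
```

Because the only state shared between the two phases is the transferred cache, the number of prefill and decoding instances can be chosen separately, which is what makes an adjustable P/D ratio possible.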

P/D-Serve also introduces several optimizations to improve efficiency, such as:

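Fine-grained P/D Organization: Grouping prefill and decoding instances per scenario, rather than serving all prompts from one mixed pool, so that similar requests (for example, those with shared prefixes) are processed together and P/D ratios can be adjusted dynamically.
On-demand Forwarding: Forwarding a request onward when a prefill instance rejects it, so that it reaches an idle prefill instance without relying on the global scheduler's inaccurate workload estimates, which avoids timeouts in prefill.
Efficient KVCache Transfer: Optimizing device-to-device (D2D) access over RDMA, in place of block-fixed KVCache transfer, which fails to reach the desired D2D utilization.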
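
The paper's abstract describes on-demand forwarding upon rejections for idle prefill. A minimal sketch of that idea follows, assuming each prefill instance can accept or reject an offered request based on its own current state. The names `PrefillInstance`, `try_accept`, and `forward_on_demand` are hypothetical, and the paper's actual protocol is not specified in this summary.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class PrefillInstance:
    name: str
    busy: bool = False

    def try_accept(self, request_id: str) -> bool:
        # The instance decides from its real state at the moment of the offer,
        # not from a possibly stale report held by a scheduler.
        if self.busy:
            return False
        self.busy = True
        return True

    def finish(self) -> None:
        # Mark the instance idle again once its prefill work is done.
        self.busy = False


def forward_on_demand(
    request_id: str, instances: List[PrefillInstance], max_attempts: int
) -> Optional[str]:
    # Offer the request to candidates in order; a rejection just moves on.
    for instance in instances[:max_attempts]:
        if instance.try_accept(request_id):
            return instance.name
    return None  # Every candidate tried was busy; the caller may retry later.
```

In this sketch the gateway keeps no queue-length estimates at all, so a wrong estimate cannot send a request to an overloaded instance where it would time out.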

With these optimizations, the authors report improvements of 60%, 42% and 46% in end-to-end (E2E) throughput, TTFT service level objective (SLO) and D2D transfer time, respectively, and a 6.7x increase in throughput compared with aggregated LLM serving.

Critical Analysis

The paper provides a thorough technical explanation of the P/D-Serve system and its key innovations. However, it does not delve deeply into the potential limitations or areas for further research.

One potential concern is the complexity of the system - by introducing additional components and decoupling the serving process, P/D-Serve may increase the overall system complexity and make it more challenging to manage and maintain. The paper does not address how this added complexity might impact real-world deployments.

Additionally, the performance improvements reported in the paper are impressive, but it would be valuable to understand how the system might scale as the language models grow in size and complexity. The paper does not discuss the long-term viability of the approach as the field of large language models continues to evolve.

Further research could also explore the generalizability of the P/D-Serve approach - how well would it work for serving other types of large-scale AI models, beyond just language models? Investigating the broader applicability of the design principles could expand the impact of this work.

Conclusion

P/D-Serve presents a novel and promising approach for serving large language models at scale. By running prefill and decoding on separate instances and optimizing how requests and KVCache move between them, the system can meet the high throughput and low latency requirements of real-world applications.

While the paper provides a solid technical explanation of the system, further research is needed to address potential limitations, such as system complexity and long-term scalability. Exploring the generalizability of the approach to other types of large-scale AI models could also expand the impact of this work.

Overall, P/D-Serve represents an important step forward in making large language models more accessible and practical for a wide range of use cases.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

P/D-Serve: Serving Disaggregated Large Language Model at Scale

Yibo Jin, Tao Wang, Huimin Lin, Mingyang Song, Peiyang Li, Yipeng Ma, Yicheng Shan, Zhengfan Yuan, Cailong Li, Yajing Sun, Tiandeng Wu, Xing Chu, Ruizhi Huan, Li Ma, Xiao You, Wenting Zhou, Yunpeng Ye, Wen Liu, Xiangkun Xu, Yongsheng Zhang, Tiantian Dong, Jiawei Zhu, Zhe Wang, Xijian Ju, Jianxun Song, Haoliang Cheng, Xiaojing Li, Jiandong Ding, Hefei Guo, Zhengyong Zhang

Serving disaggregated large language models (LLMs) over tens of thousands of xPU devices (GPUs or NPUs) with reliable performance faces multiple challenges. 1) Ignoring the diversity (various prefixes and tidal requests), treating all the prompts in a mixed pool is inadequate. To facilitate the similarity per scenario and minimize the inner mismatch on P/D (prefill and decoding) processing, fine-grained organization is required, dynamically adjusting P/D ratios for better performance. 2) Due to inaccurate estimation on workload (queue status or maintained connections), the global scheduler easily incurs unnecessary timeouts in prefill. 3) Block-fixed device-to-device (D2D) KVCache transfer over cluster-level RDMA (remote direct memory access) fails to achieve desired D2D utilization as expected. To overcome previous problems, this paper proposes an end-to-end system P/D-Serve, complying with the paradigm of MLOps (machine learning operations), which models end-to-end (E2E) P/D performance and enables: 1) fine-grained P/D organization, mapping the service with RoCE (RDMA over converged ethernet) as needed, to facilitate similar processing and dynamic adjustments on P/D ratios; 2) on-demand forwarding upon rejections for idle prefill, decoupling the scheduler from regular inaccurate reports and local queues, to avoid timeouts in prefill; and 3) efficient KVCache transfer via optimized D2D access. P/D-Serve is implemented upon Ascend and MindSpore, has been deployed over tens of thousands of NPUs for more than eight months in commercial use, and further achieves 60%, 42% and 46% improvements on E2E throughput, time-to-first-token (TTFT) SLO (service level objective) and D2D transfer time. As the E2E system with optimizations, P/D-Serve achieves 6.7x increase on throughput, compared with aggregated LLMs.

8/16/2024

DistServe: Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving

Yinmin Zhong, Shengyu Liu, Junda Chen, Jianbo Hu, Yibo Zhu, Xuanzhe Liu, Xin Jin, Hao Zhang

DistServe improves the performance of large language models (LLMs) serving by disaggregating the prefill and decoding computation. Existing LLM serving systems colocate the two phases and batch the computation of prefill and decoding across all users and requests. We find that this strategy not only leads to strong prefill-decoding interferences but also couples the resource allocation and parallelism plans for both phases. LLM applications often emphasize individual latency for each phase: time to first token (TTFT) for the prefill phase and time per output token (TPOT) of each request for the decoding phase. In the presence of stringent latency requirements, existing systems have to prioritize one latency over the other, or over-provision compute resources to meet both. DistServe assigns prefill and decoding computation to different GPUs, hence eliminating prefill-decoding interferences. Given the application's TTFT and TPOT requirements, DistServe co-optimizes the resource allocation and parallelism strategy tailored for each phase. DistServe also places the two phases according to the serving cluster's bandwidth to minimize the communication caused by disaggregation. As a result, DistServe significantly improves LLM serving performance in terms of the maximum rate that can be served within both TTFT and TPOT constraints on each GPU. Our evaluations show that on various popular LLMs, applications, and latency requirements, DistServe can serve 7.4x more requests or 12.6x tighter SLO, compared to state-of-the-art systems, while staying within latency constraints for > 90% of requests.

6/7/2024

LoongServe: Efficiently Serving Long-context Large Language Models with Elastic Sequence Parallelism

Bingyang Wu, Shengyu Liu, Yinmin Zhong, Peng Sun, Xuanzhe Liu, Xin Jin

The context window of large language models (LLMs) is rapidly increasing, leading to a huge variance in resource usage between different requests as well as between different phases of the same request. Restricted by static parallelism strategies, existing LLM serving systems cannot efficiently utilize the underlying resources to serve variable-length requests in different phases. To address this problem, we propose a new parallelism paradigm, elastic sequence parallelism (ESP), to elastically adapt to the variance between different requests and phases. Based on ESP, we design and build LoongServe, an LLM serving system that (1) improves computation efficiency by elastically adjusting the degree of parallelism in real-time, (2) improves communication efficiency by reducing key-value cache migration overhead and overlapping partial decoding communication with computation, and (3) improves GPU memory efficiency by reducing key-value cache fragmentation across instances. Our evaluation under diverse real-world datasets shows that LoongServe improves the maximum throughput by up to 3.85× compared to the chunked prefill and 5.81× compared to the prefill-decoding disaggregation.

4/16/2024

Preble: Efficient Distributed Prompt Scheduling for LLM Serving

Vikranth Srivatsa, Zijian He, Reyna Abhyankar, Dongming Li, Yiying Zhang

Prompts to large language models (LLMs) have evolved beyond simple user questions. For LLMs to solve complex problems, today's practices include domain-specific instructions, illustration of tool usages, and long context, such as textbook chapters in prompts. As such, many parts of prompts are repetitive across requests, and their attention computation results can be reused. However, today's LLM serving systems treat every request in isolation, missing the opportunity of computation reuse. This paper proposes Preble, the first distributed LLM serving platform that targets and optimizes for prompt sharing. We perform a study on five popular LLM workloads. Based on our study results, we designed a distributed scheduling system that co-optimizes computation reuse and load balancing. Our evaluation of Preble on two to 8 GPUs with real workloads and request arrival patterns on two open-source LLM models shows that Preble outperforms the state-of-the-art average latency by 1.5X to 14.5X and p99 by 2X to 10X.

7/2/2024