Llumnix: Dynamic Scheduling for Large Language Model Serving

2406.03243

Published 6/7/2024 by Biao Sun, Ziming Huang, Hanyu Zhao, Wencong Xiao, Xinyi Zhang, Yong Li, Wei Lin

Llumnix: Dynamic Scheduling for Large Language Model Serving

Abstract

Inference serving for large language models (LLMs) is the key to unleashing their potential in people's daily lives. However, efficient LLM serving remains challenging today because the requests are inherently heterogeneous and unpredictable in terms of resource and latency requirements, as a result of the diverse applications and the dynamic execution nature of LLMs. Existing systems are fundamentally limited in handling these characteristics and cause problems such as severe queuing delays, poor tail latencies, and SLO violations. We introduce Llumnix, an LLM serving system that reacts to such heterogeneous and unpredictable requests by runtime rescheduling across multiple model instances. Similar to context switching across CPU cores in modern operating systems, Llumnix reschedules requests to improve load balancing and isolation, mitigate resource fragmentation, and differentiate request priorities and SLOs. Llumnix implements the rescheduling with an efficient and scalable live migration mechanism for requests and their in-memory states, and exploits it in a dynamic scheduling policy that unifies the multiple rescheduling scenarios elegantly. Our evaluations show that Llumnix improves tail latencies by an order of magnitude, accelerates high-priority requests by up to 1.5x, and delivers up to 36% cost savings while achieving similar tail latencies, compared against state-of-the-art LLM serving systems. Llumnix is publicly available at https://github.com/AlibabaPAI/llumnix.

Create account to get full access

Overview

This paper presents Llumnix, a dynamic scheduling system for efficiently serving large language models (LLMs) at scale.
Llumnix addresses the challenges of serving LLMs, which can be computationally intensive and require careful resource management.
The system uses a novel scheduling algorithm to dynamically allocate resources and optimize performance across a fleet of LLM instances.

Plain English Explanation

Llumnix is a new system designed to help efficiently deliver large language models (LLMs) to many users at the same time. LLMs are very powerful AI models that can understand and generate human-like text, but running them requires a lot of computing power.

Llumnix uses a smart scheduling algorithm to dynamically manage the available computing resources. It can quickly decide how to allocate the right amount of computing power to each user's request for an LLM, in order to provide the best performance while also using the resources efficiently. This helps ensure that LLMs can be served reliably and at scale, even when there is high demand.

The key idea behind Llumnix is to continuously monitor the system and make intelligent decisions about how to distribute the computing power. This allows it to adapt in real-time to changes in user demand or available resources. By doing this, Llumnix can provide a better experience for users while also using the underlying hardware more efficiently.

Technical Explanation

Llumnix is a dynamic scheduling system designed to efficiently serve large language models (LLMs) at scale. It addresses the challenges of LLM serving, which can be computationally intensive and require careful resource management.

Llumnix uses a novel scheduling algorithm to dynamically allocate resources and optimize performance across a fleet of LLM instances. The system continuously monitors the state of the system, including user demand and available compute resources, and makes intelligent decisions about how to distribute the workload.

The Llumnix architecture consists of a central scheduler that communicates with a set of LLM model servers. The scheduler tracks the status of each server and uses a custom scheduling algorithm to dispatch incoming requests to the appropriate server. This algorithm considers factors like server load, model version, and request priority to maximize overall system throughput and latency.

Llumnix also employs several optimization techniques to improve efficiency, such as link to PerLLM paper for personalized inference scheduling and link to MuxServe paper for flexible model multiplexing. The system is designed to be highly scalable and can dynamically add or remove LLM servers to meet changing demand.

The authors evaluate Llumnix on a variety of workloads and show that it can achieve significant performance improvements over static scheduling approaches, while also reducing overall system cost. The system demonstrates the benefits of dynamic, intelligent scheduling for large-scale LLM serving.

Critical Analysis

The Llumnix paper presents a well-designed and thoroughly evaluated system for dynamic LLM serving. The authors have identified a critical challenge in the field and developed a novel solution to address it.

One potential limitation of the Llumnix approach is its reliance on a centralized scheduler. While this design allows for global optimization, it could also introduce a single point of failure or bottleneck. The authors acknowledge this concern and suggest exploring distributed scheduling approaches, such as those described in the link to Helix paper, as an area for future research.

Additionally, the Llumnix system does not currently address multi-tenancy or resource isolation, which are important considerations for production deployments. The link to BlockLLM paper provides an example of how these concerns can be addressed in LLM serving systems.

Finally, the authors note that the Llumnix scheduling algorithm relies on accurate forecasting of user demand and resource availability. In practice, these predictions may not always be perfect, which could impact the system's ability to optimize performance. Exploring link to Automated Conversion paper for automatically converting static to dynamic schedulers may help address this challenge.

Overall, the Llumnix paper presents a compelling approach to the problem of large-scale LLM serving, and the authors have identified several promising directions for future research and development.

Conclusion

The Llumnix paper introduces a novel dynamic scheduling system for efficiently serving large language models (LLMs) at scale. By continuously monitoring the state of the system and making intelligent scheduling decisions, Llumnix can optimize performance and resource utilization, addressing the key challenges of LLM serving.

The authors have demonstrated the effectiveness of the Llumnix approach through extensive evaluation, showing significant improvements over static scheduling methods. While the current system has some limitations, the authors have identified several promising directions for future research, including distributed scheduling, multi-tenancy, and more advanced demand forecasting.

The Llumnix paper represents an important contribution to the field of large-scale LLM serving, and the ideas and techniques presented have the potential to enable more reliable and cost-effective delivery of these powerful AI models to a wide range of users and applications.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

🤯

PerLLM: Personalized Inference Scheduling with Edge-Cloud Collaboration for Diverse LLM Services

Zheming Yang, Yuanhao Yang, Chang Zhao, Qi Guo, Wenkai He, Wen Ji

With the rapid growth in the number of large language model (LLM) users, it is difficult for bandwidth-constrained cloud servers to simultaneously process massive LLM services in real-time. Recently, edge-cloud infrastructures have been used to improve the processing efficiency of large-scale LLM services. However, the diversity of task requirements and the dynamics of resources pose great challenges to inference scheduling, leading to the wastage of many resources. In this paper, we present PerLLM, a personalized inference scheduling framework with edge-cloud collaboration designed for diverse LLM services. For the complexity of multiple constraints and the decision-making process of edge-cloud collaboration, we integrate the upper confidence bound algorithm based on the constraint satisfaction mechanism in PerLLM. For diverse LLM services, PerLLM can optimize service scheduling and resource allocation solutions within the edge-cloud infrastructure to meet processing time requirements while minimizing energy costs. Experimental results from different model deployments show that PerLLM can effectively meet the processing time requirements of personalized services. Compared to other methods, PerLLM achieves 2.2x, 2.1x, and 1.6x throughput and reduces the energy cost by more than 50%.

5/24/2024

cs.DC cs.NI

🤔

MuxServe: Flexible Multiplexing for Efficient Multiple LLM Serving

Jiangfei Duan, Runyu Lu, Haojie Duanmu, Xiuhong Li, Xingcheng Zhang, Dahua Lin, Ion Stoica, Hao Zhang

Large language models (LLMs) have demonstrated remarkable performance, and organizations are racing to serve LLMs of varying sizes as endpoints for use-cases like chat, programming and search. However, efficiently serving multiple LLMs poses significant challenges for existing approaches due to varying popularity of LLMs. In the paper, we present MuxServe, a flexible spatial-temporal multiplexing system for efficient multiple LLM serving. The key insight behind is to colocate LLMs considering their popularity to multiplex memory resources, and leverage the characteristics of prefill and decoding phases to separate and flexibly colocate them to multiplex computation resources. MuxServe formally formulates the multiplexing problem, and proposes a novel placement algorithm and adaptive batch scheduling strategy to identify optimal colocations and maximize utilization. MuxServe designs a unified resource manager to enable flexible and efficient multiplexing. Evaluation results show that MuxServe can achieves up to $1.8times$ higher throughput or processes $2.9times$ more requests within $99%$ SLO attainment. The code is available at: url{https://github.com/hao-ai-lab/MuxServe}.

6/14/2024

cs.DC

Slice-Level Scheduling for High Throughput and Load Balanced LLM Serving

Ke Cheng, Wen Hu, Zhi Wang, Hongen Peng, Jianguo Li, Sheng Zhang

Large language models (LLMs) iteratively generate text token by token, with memory usage increasing with the length of generated token sequences. The unpredictability of generation lengths makes it difficult to estimate the time and memory needed to process requests, posing a challenge for effective request scheduling. Conventional sequence-level scheduling (SLS) serves requests in a first-come first-served (FCFS) manner with static batching where requests with short generation lengths are delayed until those with long ones have finished generation, which hurts computational efficiency. Besides, to avoid out-of-memory (OOM) errors, SLS batches requests with a small batch size, which limits throughput. Recently proposed iteration-level scheduling (ILS) enhances computational efficiency with continuous batching to return completed requests timely and dynamically add new requests for processing. However, many ILS schedulers limit the number of parallel-processing requests to avoid OOM errors while achieving a fast inference speed, which compromises throughput. Moreover, existing SLS and ILS schedulers fail to balance the workload across multiple deployed LLM instances. To tackle these challenges, we propose slice-level scheduling (SCLS). By splitting the predefined maximal generation length limit into slices and serving batches slice by slice, it provides a precise range of serving time and memory usage for batched requests, laying the foundation for effective scheduling. Experiments confirm that compared with SLS and ILS schedulers, SCLS can improve throughput by up to 315.8% and greatly mitigate load imbalance with proposed batching and offloading algorithms.

6/21/2024

cs.DC

Helix: Distributed Serving of Large Language Models via Max-Flow on Heterogeneous GPUs

Yixuan Mei, Yonghao Zhuang, Xupeng Miao, Juncheng Yang, Zhihao Jia, Rashmi Vinayak

This paper introduces Helix, a distributed system for high-throughput, low-latency large language model (LLM) serving on heterogeneous GPU clusters. A key idea behind Helix is to formulate inference computation of LLMs over heterogeneous GPUs and network connections as a max-flow problem for a directed, weighted graph, whose nodes represent GPU instances and edges capture both GPU and network heterogeneity through their capacities. Helix then uses a mixed integer linear programming (MILP) algorithm to discover highly optimized strategies to serve LLMs. This approach allows Helix to jointly optimize model placement and request scheduling, two highly entangled tasks in heterogeneous LLM serving. Our evaluation on several heterogeneous cluster settings ranging from 24 to 42 GPU nodes shows that Helix improves serving throughput by up to 2.7$times$ and reduces prompting and decoding latency by up to 2.8$times$ and 1.3$times$, respectively, compared to best existing approaches.

6/4/2024

cs.DC cs.LG