Efficient Interactive LLM Serving with Proxy Model-based Sequence Length Prediction

Read original: arXiv:2404.08509 - Published 4/15/2024 by Haoran Qiu, Weichao Mao, Archit Patke, Shengkun Cui, Saurabh Jha, Chen Wang, Hubertus Franke, Zbigniew T. Kalbarczyk, Tamer Başar, Ravishankar K. Iyer

Overview

Proposes a novel approach for efficient serving of large language models (LLMs) in interactive settings
Introduces a proxy model-based sequence length prediction technique to optimize resource allocation and reduce latency
Reports 30.5-39.6% lower average job completion times and 2.2-3.6x higher throughput than first-come-first-serve (FCFS) schedulers

Plain English Explanation

This research paper addresses the challenge of efficiently serving large language models (LLMs) in interactive settings, where users expect fast responses. The key idea is to use a proxy model to predict the sequence length of the LLM's output, which allows the system to better allocate resources and reduce latency.

Traditionally, LLM serving systems have struggled with the unpredictable output lengths of these models, leading to inefficient resource utilization and long response times. The researchers behind this paper have developed a technique that uses a smaller, faster proxy model to estimate the output sequence length before running the actual LLM. This information is then used to optimize the resource allocation and scheduling, resulting in a significant improvement in serving throughput and response time.

By addressing the challenges of LLM serving, this research can help make these powerful models more accessible and usable in real-time interactive applications, such as chatbots, virtual assistants, and creative writing tools.

Technical Explanation

The researchers propose a proxy model-based sequence length prediction technique to improve the efficiency of LLM serving. The key components of their approach include:

Proxy Model: The researchers train a separate, smaller model to estimate the output sequence length of the target LLM. This proxy model is designed to be faster and more resource-efficient than the full LLM.
Resource Allocation: Using the predicted sequence length from the proxy model, the system can allocate appropriate computational resources (e.g., CPU, memory, GPU) for the LLM inference task, reducing the likelihood of over- or under-provisioning.
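Scheduling: The system uses the predicted sequence length to order pending requests with a speculative shortest-job-first (SSJF) policy instead of first-come-first-serve (FCFS), so that short requests are not stuck behind long ones (head-of-line blocking). This lowers average latency and makes better use of available resources.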
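
The scheduling idea above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a single server, no batching, all requests arriving at once, and one time unit per generated token. The names `Job`, `toy_length_predictor`, and `ssjf_order` are hypothetical, and the predictor is a crude word-count heuristic standing in for a trained proxy model.

```python
import heapq
from dataclasses import dataclass, field


@dataclass(order=True)
class Job:
    predicted_len: int                     # priority key: the proxy's length guess
    arrival_idx: int                       # tie-breaker: earlier arrival goes first
    prompt: str = field(compare=False)
    true_len: int = field(compare=False)   # hidden from the scheduler; only used to simulate run time


def toy_length_predictor(prompt: str) -> int:
    """Hypothetical stand-in for the proxy model: longer prompt, longer predicted output."""
    return 10 * len(prompt.split())


def ssjf_order(jobs):
    """Serve jobs in increasing order of predicted output length."""
    heap = list(jobs)
    heapq.heapify(heap)
    return [heapq.heappop(heap) for _ in range(len(heap))]


def average_completion_time(order):
    """Average finish time when jobs run one after another in the given order."""
    clock, total = 0, 0
    for job in order:
        clock += job.true_len
        total += clock
    return total / len(order)


# (prompt, true output length in tokens); the lengths are made up for illustration.
requests = [
    ("write a long detailed essay about distributed systems history", 400),
    ("hi", 5),
    ("translate hello", 10),
]
jobs = [
    Job(toy_length_predictor(prompt), idx, prompt, true_len)
    for idx, (prompt, true_len) in enumerate(requests)
]

fcfs_avg = average_completion_time(jobs)        # arrival order
ssjf = ssjf_order(jobs)
ssjf_avg = average_completion_time(ssjf)        # predicted-shortest first

print(f"FCFS average completion time: {fcfs_avg:.1f}")
print(f"SSJF average completion time: {ssjf_avg:.1f}")
```

In this toy workload the long essay request arrives first, so under FCFS the two short requests wait behind it; ordering by the predicted length moves them ahead and lowers the average completion time. The scheduler never reads `true_len`, which mirrors the speculative nature of the policy: it acts only on the proxy's guess.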

The researchers evaluate their approach on real-world datasets and production workload traces. They report that SSJF reduces average job completion times by 30.5-39.6% and increases throughput by 2.2-3.6x compared to FCFS schedulers, across no batching, dynamic batching, and continuous batching settings.

Critical Analysis

The researchers have thoroughly evaluated their approach and highlighted its benefits, but there are a few potential limitations and areas for further research:

Proxy Model Accuracy: The performance of the overall system is heavily dependent on the accuracy of the proxy model in predicting the LLM's output sequence length. Further research could explore techniques to improve the proxy model's accuracy, such as incorporating additional features or using more advanced architectures.
Generalization: The study focused on a limited set of LLMs, and it's unclear how well the approach would generalize to a wider range of models, especially those with different architectural characteristics or training objectives.
Real-world Deployment: The paper evaluates the approach on real-world datasets and production workload traces, but additional research would be needed to assess its performance and practicality inside live, production-level LLM serving systems.
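
The dependence on proxy model accuracy noted above can be explored with a small simulation. This is a sketch under stated assumptions, not a reproduction of the paper's experiments: job lengths are random integers, all jobs arrive at once on a single server, and prediction error is modeled as a uniform relative error (`rel_error`, a hypothetical parameter). The function names are illustrative.

```python
import random


def average_completion_time(lengths_in_service_order):
    """Average finish time when jobs of the given lengths run back to back."""
    clock, total = 0.0, 0.0
    for length in lengths_in_service_order:
        clock += length
        total += clock
    return total / len(lengths_in_service_order)


def schedule_by_prediction(true_lengths, rel_error, rng):
    """Order jobs by a noisy length estimate and return their true lengths in that order.

    Assumed error model: predicted = true * (1 + u), with u uniform in [-rel_error, rel_error].
    """
    predicted = [t * (1.0 + rng.uniform(-rel_error, rel_error)) for t in true_lengths]
    order = sorted(range(len(true_lengths)), key=lambda i: predicted[i])
    return [true_lengths[i] for i in order]


rng = random.Random(0)  # fixed seed so the run is repeatable
true_lengths = [rng.randint(1, 500) for _ in range(200)]

fcfs_avg = average_completion_time(true_lengths)            # arrival order
oracle_avg = average_completion_time(sorted(true_lengths))  # perfect predictions
noisy_avgs = {
    err: average_completion_time(schedule_by_prediction(true_lengths, err, rng))
    for err in (0.0, 0.25, 0.5, 1.0)
}

print(f"FCFS:   {fcfs_avg:.1f}")
print(f"Oracle: {oracle_avg:.1f}")
for err, avg in noisy_avgs.items():
    print(f"rel_error={err:.2f}: {avg:.1f}")
```

With zero error the noisy schedule coincides with the oracle shortest-job-first order, which is the best achievable average in this simplified setting. Larger errors scramble the order and push the average back up, which is one way to see why predictor quality matters for this kind of scheduler.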

Overall, this research presents a promising approach to address the challenges of efficient LLM serving, which is crucial for the widespread adoption and practical use of these powerful models in interactive applications.

Conclusion

The proposed proxy model-based sequence length prediction technique addresses the problem of efficiently serving large language models in interactive settings. By using a faster, more resource-efficient proxy model to predict the output sequence length, the system can schedule requests and allocate resources more effectively, improving serving throughput and response time. This makes LLMs more practical for real-time applications such as chatbots, virtual assistants, and creative writing tools.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Efficient Interactive LLM Serving with Proxy Model-based Sequence Length Prediction

Haoran Qiu, Weichao Mao, Archit Patke, Shengkun Cui, Saurabh Jha, Chen Wang, Hubertus Franke, Zbigniew T. Kalbarczyk, Tamer Başar, Ravishankar K. Iyer

Large language models (LLMs) have been driving a new wave of interactive AI applications across numerous domains. However, efficiently serving LLM inference requests is challenging due to their unpredictable execution times originating from the autoregressive nature of generative models. Existing LLM serving systems exploit first-come-first-serve (FCFS) scheduling, suffering from head-of-line blocking issues. To address the non-deterministic nature of LLMs and enable efficient interactive LLM serving, we present a speculative shortest-job-first (SSJF) scheduler that uses a light proxy model to predict LLM output sequence lengths. Our open-source SSJF implementation does not require changes to memory management or batching strategies. Evaluations on real-world datasets and production workload traces show that SSJF reduces average job completion times by 30.5-39.6% and increases throughput by 2.2-3.6x compared to FCFS schedulers, across no batching, dynamic batching, and continuous batching settings.

4/15/2024

Efficient LLM Scheduling by Learning to Rank

Yichao Fu, Siqi Zhu, Runlong Su, Aurick Qiao, Ion Stoica, Hao Zhang

In Large Language Model (LLM) inference, the output length of an LLM request is typically regarded as not known a priori. Consequently, most LLM serving systems employ a simple First-come-first-serve (FCFS) scheduling strategy, leading to Head-Of-Line (HOL) blocking and reduced throughput and service quality. In this paper, we reexamine this assumption -- we show that, although predicting the exact generation length of each request is infeasible, it is possible to predict the relative ranks of output lengths in a batch of requests, using learning to rank. The ranking information offers valuable guidance for scheduling requests. Building on this insight, we develop a novel scheduler for LLM inference and serving that can approximate the shortest-job-first (SJF) schedule better than existing approaches. We integrate this scheduler with the state-of-the-art LLM serving system and show significant performance improvement in several important applications: 2.8x lower latency in chatbot serving and 6.5x higher throughput in synthetic data generation. Our code is available at https://github.com/hao-ai-lab/vllm-ltr.git

8/29/2024

Fast Distributed Inference Serving for Large Language Models

Bingyang Wu, Yinmin Zhong, Zili Zhang, Shengyu Liu, Fangyue Liu, Yuanhang Sun, Gang Huang, Xuanzhe Liu, Xin Jin

Large language models (LLMs) power a new generation of interactive AI applications exemplified by ChatGPT. The interactive nature of these applications demands low latency for LLM inference. Existing LLM serving systems use run-to-completion processing for inference jobs, which suffers from head-of-line blocking and long latency. We present FastServe, a distributed inference serving system for LLMs. FastServe exploits the autoregressive pattern of LLM inference to enable preemption at the granularity of each output token. FastServe uses preemptive scheduling to minimize latency with a novel skip-join Multi-Level Feedback Queue scheduler. Based on the new semi-information-agnostic setting of LLM inference, the scheduler leverages the input length information to assign an appropriate initial queue for each arrival job to join. The higher priority queues than the joined queue are skipped to reduce demotions. We design an efficient GPU memory management mechanism that proactively offloads and uploads intermediate state between GPU memory and host memory for LLM inference. We build a system prototype of FastServe and experimental results show that compared to the state-of-the-art solution vLLM, FastServe improves the throughput by up to 31.4x and 17.9x under the same average and tail latency requirements, respectively.

9/26/2024

Slice-Level Scheduling for High Throughput and Load Balanced LLM Serving

Ke Cheng, Wen Hu, Zhi Wang, Hongen Peng, Jianguo Li, Sheng Zhang

Large language models (LLMs) iteratively generate text token by token, with memory usage increasing with the length of generated token sequences. The unpredictability of generation lengths makes it difficult to estimate the time and memory needed to process requests, posing a challenge for effective request scheduling. Conventional sequence-level scheduling (SLS) serves requests in a first-come first-served (FCFS) manner with static batching where requests with short generation lengths are delayed until those with long ones have finished generation, which hurts computational efficiency. Besides, to avoid out-of-memory (OOM) errors, SLS batches requests with a small batch size, which limits throughput. Recently proposed iteration-level scheduling (ILS) enhances computational efficiency with continuous batching to return completed requests timely and dynamically add new requests for processing. However, many ILS schedulers limit the number of parallel-processing requests to avoid OOM errors while achieving a fast inference speed, which compromises throughput. Moreover, existing SLS and ILS schedulers fail to balance the workload across multiple deployed LLM instances. To tackle these challenges, we propose slice-level scheduling (SCLS). By splitting the predefined maximal generation length limit into slices and serving batches slice by slice, it provides a precise range of serving time and memory usage for batched requests, laying the foundation for effective scheduling. Experiments confirm that compared with SLS and ILS schedulers, SCLS can improve throughput by up to 315.8% and greatly mitigate load imbalance with proposed batching and offloading algorithms.

6/21/2024