Preble: Efficient Distributed Prompt Scheduling for LLM Serving

Read original: arXiv:2407.00023 - Published 7/2/2024 by Vikranth Srivatsa, Zijian He, Reyna Abhyankar, Dongming Li, Yiying Zhang

Overview

  • Large language models (LLMs) are now used to solve complex problems, with prompts that include domain-specific instructions, illustrations of tool usage, and long context such as textbook chapters.
  • Many parts of these prompts repeat across requests, so their attention computation results can be reused.
  • However, current LLM serving systems treat each request in isolation, missing the opportunity for computation reuse.
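
To make the reuse opportunity concrete, the sketch below measures how many leading tokens two requests share. It is an illustration only: the `tokenize` helper uses whitespace splitting as a stand-in for a real model tokenizer, and the example prompts and function names are hypothetical, not taken from the paper.

```python
from typing import List


def tokenize(text: str) -> List[str]:
    # Assumption: whitespace splitting stands in for a real model tokenizer.
    return text.split()


def shared_prefix_len(a: List[str], b: List[str]) -> int:
    """Count the leading tokens that two token sequences have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


# Hypothetical requests that share the same system instructions.
system = "You are a tax assistant. Use the calculator tool when needed."
req1 = tokenize(system + " What is 15% of 200?")
req2 = tokenize(system + " Explain a marginal tax rate.")

shared = shared_prefix_len(req1, req2)
# Fraction of the second request whose prefix matches the first request.
reusable_fraction = shared / len(req2)
print(f"{shared} of {len(req2)} tokens are a shared prefix ({reusable_fraction:.0%})")
```

In a serving system, the attention state computed for the shared prefix of the first request is what a later request could reuse instead of recomputing it.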

Plain English Explanation

The paper introduces a new system called Preble that aims to improve the efficiency of serving large language models (LLMs) by taking advantage of the repetitive parts in the prompts used to query these models.

Prompts to LLMs have become more complex over time, often including domain-specific instructions, illustrations of tool usage, and even long passages of text like textbook chapters. While these prompts may have many similar elements across different requests, current LLM serving systems treat each request in isolation, missing the opportunity to reuse the computation done for those repetitive parts.

Preble is designed to address this problem. The paper describes it as the first distributed LLM serving platform that targets and optimizes for prompt sharing. The researchers studied five popular LLM workloads and used the insights to design a distributed scheduling system that co-optimizes computation reuse and load balancing.

Technical Explanation

The paper proposes Preble, a distributed LLM serving platform that targets and optimizes for prompt sharing. The authors first conducted a study on five popular LLM workloads to understand the characteristics of real-world prompts. Based on the study results, they designed a distributed scheduling system that co-optimizes computation reuse and load balancing.
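
The trade-off between computation reuse and load balancing can be sketched with a simple cost function. This is not Preble's actual algorithm, which the summary does not specify. It is a minimal sketch under assumed names (`Worker`, `pick_worker`, `load_weight`), where each worker's cost is the number of tokens it would still have to compute plus a weighted measure of the work already queued on it.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

Tokens = Tuple[str, ...]


@dataclass
class Worker:
    # Hypothetical worker model: which prefixes it has cached and how busy it is.
    name: str
    cached_prefixes: List[Tokens] = field(default_factory=list)
    queued_tokens: int = 0


def longest_cached_prefix(worker: Worker, tokens: Tokens) -> int:
    """Length of the longest cached prefix on this worker that matches the request."""
    best = 0
    for prefix in worker.cached_prefixes:
        n = 0
        for x, y in zip(prefix, tokens):
            if x != y:
                break
            n += 1
        best = max(best, n)
    return best


def pick_worker(workers: List[Worker], tokens: Tokens, load_weight: float = 1.0) -> Worker:
    """Choose the worker with the lowest (uncached tokens + weighted load) cost."""
    def cost(w: Worker) -> float:
        uncached = len(tokens) - longest_cached_prefix(w, tokens)
        return uncached + load_weight * w.queued_tokens
    return min(workers, key=cost)


def assign(workers: List[Worker], tokens: Tokens, load_weight: float = 1.0) -> Worker:
    """Assign a request, then record the added load and the newly cached prompt."""
    w = pick_worker(workers, tokens, load_weight)
    w.queued_tokens += len(tokens) - longest_cached_prefix(w, tokens)
    w.cached_prefixes.append(tokens)
    return w
```

With `load_weight` set to 0 the sketch always chases cache hits and can overload one worker. A larger weight spreads requests out at the price of recomputing shared prefixes, which is the tension a co-optimizing scheduler has to resolve.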

The key elements of Preble's architecture and evaluation are:

  1. Prompt Parsing and Caching: Preble parses the incoming prompts to identify repetitive components, which are then cached for reuse across requests.
  2. Distributed Scheduling: Preble's scheduling system assigns requests to workers in a way that maximizes computation reuse while maintaining good load balancing.
  3. Evaluation: The authors evaluated Preble on two open-source LLMs using real workloads and request arrival patterns, running on 2 to 8 GPUs. Preble outperformed the state-of-the-art baseline, improving average latency by 1.5X to 14.5X and p99 latency by 2X to 10X.
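
For readers unfamiliar with the metrics in the evaluation item above, the sketch below shows one common way to compute average latency, p99 latency, and an "X" improvement factor. The nearest-rank percentile definition is an assumption, since the summary does not say which definition the authors used, and the sample latencies are illustrative values, not data from the paper.

```python
import math
from statistics import mean
from typing import Sequence


def percentile(values: Sequence[float], p: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least p% of samples at or below it."""
    if not values:
        raise ValueError("values must be non-empty")
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def speedup(baseline: float, candidate: float) -> float:
    """Improvement factor, where 2.0 means the candidate latency is half the baseline."""
    return baseline / candidate


# Illustrative per-request latencies in seconds (made up for this example).
baseline_latencies = [1.0] * 98 + [5.0, 9.0]
candidate_latencies = [0.5] * 98 + [2.0, 3.0]

avg_gain = speedup(mean(baseline_latencies), mean(candidate_latencies))
p99_gain = speedup(percentile(baseline_latencies, 99), percentile(candidate_latencies, 99))
print(f"average latency improvement: {avg_gain:.2f}X, p99 improvement: {p99_gain:.2f}X")
```

The p99 figure captures the slowest 1% of requests, so a system can improve average latency substantially while improving tail latency by a different factor, which is why the paper reports both.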

The paper builds on related work in areas such as efficient multi-prompt evaluation, modular attention reuse, and efficient serving of long-context large language models [https://aimodels.fyi/papers/arxiv/loongserve-efficiently-serving-long-context-large-language].

Critical Analysis

The paper presents a compelling solution to the problem of inefficient LLM serving, with Preble demonstrating significant performance improvements over the state-of-the-art. However, the authors acknowledge some limitations and areas for further research:

  1. Prompt Complexity: The study focused on five popular LLM workloads, but the diversity and complexity of real-world prompts may be broader, requiring further evaluation.
  2. Scalability: While Preble shows good performance on 2 to 8 GPUs, its scalability to larger deployments with more compute resources needs to be explored.
  3. Generalization: The paper evaluates Preble on two open-source LLMs. Its effectiveness on other LLM architectures and future models remains to be seen.

Additionally, the approach depends on accurately identifying the repetitive components of prompts, which may not be trivial, especially for more complex prompts. The paper could also have discussed the potential security and privacy implications of caching and reusing computation results across user requests.

Conclusion

The Preble system presents a promising approach to improving the efficiency of serving large language models by taking advantage of the repetitive components in prompts. By designing a distributed scheduling system that co-optimizes computation reuse and load balancing, the researchers were able to achieve significant performance improvements over the state-of-the-art.

While the paper highlights some limitations and areas for further research, the core idea of Preble is a valuable contribution to the field of LLM serving, and the results suggest that there is substantial room for optimization beyond the current isolated request handling approaches. As LLMs continue to grow in complexity and importance, systems like Preble will become increasingly crucial for enabling their efficient and scalable deployment.



This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!


Related Papers

Preble: Efficient Distributed Prompt Scheduling for LLM Serving

Vikranth Srivatsa, Zijian He, Reyna Abhyankar, Dongming Li, Yiying Zhang

Prompts to large language models (LLMs) have evolved beyond simple user questions. For LLMs to solve complex problems, today's practices include domain-specific instructions, illustration of tool usages, and long context, such as textbook chapters in prompts. As such, many parts of prompts are repetitive across requests, and their attention computation results can be reused. However, today's LLM serving systems treat every request in isolation, missing the opportunity of computation reuse. This paper proposes Preble, the first distributed LLM serving platform that targets and optimizes for prompt sharing. We perform a study on five popular LLM workloads. Based on our study results, we designed a distributed scheduling system that co-optimizes computation reuse and load balancing. Our evaluation of Preble on 2 to 8 GPUs with real workloads and request arrival patterns on two open-source LLM models shows that Preble outperforms the state-of-the-art average latency by 1.5X to 14.5X and p99 by 2X to 10X.

7/2/2024

P/D-Serve: Serving Disaggregated Large Language Model at Scale

Yibo Jin, Tao Wang, Huimin Lin, Mingyang Song, Peiyang Li, Yipeng Ma, Yicheng Shan, Zhengfan Yuan, Cailong Li, Yajing Sun, Tiandeng Wu, Xing Chu, Ruizhi Huan, Li Ma, Xiao You, Wenting Zhou, Yunpeng Ye, Wen Liu, Xiangkun Xu, Yongsheng Zhang, Tiantian Dong, Jiawei Zhu, Zhe Wang, Xijian Ju, Jianxun Song, Haoliang Cheng, Xiaojing Li, Jiandong Ding, Hefei Guo, Zhengyong Zhang

Serving disaggregated large language models (LLMs) over tens of thousands of xPU devices (GPUs or NPUs) with reliable performance faces multiple challenges. 1) Ignoring the diversity (various prefixes and tidal requests), treating all the prompts in a mixed pool is inadequate. To facilitate the similarity per scenario and minimize the inner mismatch on P/D (prefill and decoding) processing, fine-grained organization is required, dynamically adjusting P/D ratios for better performance. 2) Due to inaccurate estimation on workload (queue status or maintained connections), the global scheduler easily incurs unnecessary timeouts in prefill. 3) Block-fixed device-to-device (D2D) KVCache transfer over cluster-level RDMA (remote direct memory access) fails to achieve desired D2D utilization as expected. To overcome previous problems, this paper proposes an end-to-end system P/D-Serve, complying with the paradigm of MLOps (machine learning operations), which models end-to-end (E2E) P/D performance and enables: 1) fine-grained P/D organization, mapping the service with RoCE (RDMA over converged ethernet) as needed, to facilitate similar processing and dynamic adjustments on P/D ratios; 2) on-demand forwarding upon rejections for idle prefill, decoupling the scheduler from regular inaccurate reports and local queues, to avoid timeouts in prefill; and 3) efficient KVCache transfer via optimized D2D access. P/D-Serve is implemented upon Ascend and MindSpore, has been deployed over tens of thousands of NPUs for more than eight months in commercial use, and further achieves 60%, 42% and 46% improvements on E2E throughput, time-to-first-token (TTFT) SLO (service level objective) and D2D transfer time. As the E2E system with optimizations, P/D-Serve achieves 6.7x increase on throughput, compared with aggregated LLMs.

8/16/2024

PerLLM: Personalized Inference Scheduling with Edge-Cloud Collaboration for Diverse LLM Services

Zheming Yang, Yuanhao Yang, Chang Zhao, Qi Guo, Wenkai He, Wen Ji

With the rapid growth in the number of large language model (LLM) users, it is difficult for bandwidth-constrained cloud servers to simultaneously process massive LLM services in real-time. Recently, edge-cloud infrastructures have been used to improve the processing efficiency of large-scale LLM services. However, the diversity of task requirements and the dynamics of resources pose great challenges to inference scheduling, leading to the wastage of many resources. In this paper, we present PerLLM, a personalized inference scheduling framework with edge-cloud collaboration designed for diverse LLM services. For the complexity of multiple constraints and the decision-making process of edge-cloud collaboration, we integrate the upper confidence bound algorithm based on the constraint satisfaction mechanism in PerLLM. For diverse LLM services, PerLLM can optimize service scheduling and resource allocation solutions within the edge-cloud infrastructure to meet processing time requirements while minimizing energy costs. Experimental results from different model deployments show that PerLLM can effectively meet the processing time requirements of personalized services. Compared to other methods, PerLLM achieves 2.2x, 2.1x, and 1.6x throughput and reduces the energy cost by more than 50%.

5/24/2024

Efficient multi-prompt evaluation of LLMs

Felipe Maia Polo, Ronald Xu, Lucas Weber, Mírian Silva, Onkar Bhardwaj, Leshem Choshen, Allysson Flavio Melo de Oliveira, Yuekai Sun, Mikhail Yurochkin

Most popular benchmarks for comparing LLMs rely on a limited set of prompt templates, which may not fully capture the LLMs' abilities and can affect the reproducibility of results on leaderboards. Many recent works empirically verify prompt sensitivity and advocate for changes in LLM evaluation. In this paper, we consider the problem of estimating the performance distribution across many prompt variants instead of finding a single prompt to evaluate with. We introduce PromptEval, a method for estimating performance across a large set of prompts borrowing strength across prompts and examples to produce accurate estimates under practical evaluation budgets. The resulting distribution can be used to obtain performance quantiles to construct various robust performance metrics (e.g., top 95% quantile or median). We prove that PromptEval consistently estimates the performance distribution and demonstrate its efficacy empirically on three prominent LLM benchmarks: MMLU, BIG-bench Hard, and LMentry. For example, PromptEval can accurately estimate performance quantiles across 100 prompt templates on MMLU with a budget equivalent to two single-prompt evaluations. Our code and data can be found at https://github.com/felipemaiapolo/prompt-eval.

6/11/2024