Intelligent Router for LLM Workloads: Improving Performance Through Workload-Aware Scheduling

Read original: arXiv:2408.13510 - Published 8/27/2024 by Kunal Jain, Anjaly Parayil, Ankur Mallick, Esha Choukse, Xiaoting Qin, Jue Zhang, Íñigo Goiri, Rujia Wang, Chetan Bansal, Victor Ruhle and 3 others

Overview

Large language model (LLM) workloads have distinct prefill and decode phases with different compute and memory requirements
Existing scheduling algorithms treat LLM workloads as single jobs, without considering the characteristics of these two phases
This can lead to sub-optimal scheduling and increased response latency

Plain English Explanation

Large language models (LLMs) are powerful machine learning models used for tasks like natural language processing. Serving an LLM request involves two distinct phases: prefill and decode. The prefill phase, which processes the entire input prompt, is dominated by compute, while the decode phase, which generates the response one token at a time, is dominated by memory access.
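As a rough illustration of why the two phases behave differently, the sketch below tallies the work for a single request. The `Request` type and the `phase_profile` function are hypothetical names chosen for this example, and the accounting is deliberately simplified (for instance, it ignores that the first output token is produced at the end of prefill). It is not taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class Request:
    prompt_tokens: int   # known when the request arrives
    output_tokens: int   # only known once generation finishes


def phase_profile(req: Request) -> dict:
    """Toy per-phase accounting for one request (illustrative, not measured)."""
    return {
        # Prefill: a single forward pass over the whole prompt.
        "prefill_passes": 1,
        "prefill_tokens_per_pass": req.prompt_tokens,
        # Decode: roughly one forward pass per generated token.
        "decode_passes": req.output_tokens,
        "decode_tokens_per_pass": 1,
        # Cached state grows with every prompt and output token.
        "peak_cached_tokens": req.prompt_tokens + req.output_tokens,
    }


profile = phase_profile(Request(prompt_tokens=512, output_tokens=128))
```

In this toy view, prefill is one large batch of work, while decode is many small steps that each revisit the cached state, which is one way to picture the compute-versus-memory split described above.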

Existing scheduling algorithms, which decide how to distribute work across different LLM instances, treat these workloads as a single job. They don't consider the unique characteristics of the prefill and decode phases. This can lead to inefficient scheduling, causing longer response times for users.

To address this, the researchers propose a new intelligent router that uses reinforcement learning and a response-length predictor to make more informed scheduling decisions. This helps achieve over 11% lower end-to-end latency compared to existing approaches.

Technical Explanation

The researchers developed a heuristic-guided, reinforcement learning-based intelligent router for scheduling LLM workloads across a cluster of instances.

The router leverages a trainable response-length predictor to estimate the compute and memory requirements of the prefill and decode phases for each incoming query. It then uses a novel formulation to estimate the impact of mixing different workloads on the overall system performance.
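To make the idea of a trainable response-length predictor concrete, here is a minimal stand-in that learns a running mean of observed output lengths per task label. This is an assumption made for illustration only: the class name `MeanLengthPredictor`, the task labels, and the default of 128 tokens are all hypothetical, and the paper's actual predictor is a learned model that this sketch does not reproduce.

```python
from collections import defaultdict


class MeanLengthPredictor:
    """Stand-in predictor: running mean of output lengths per task label."""

    def __init__(self, default_length: float = 128.0):
        # Assumed fallback used for task labels that have not been seen yet.
        self.default_length = default_length
        self._sum = defaultdict(float)
        self._count = defaultdict(int)

    def update(self, task: str, observed_output_tokens: int) -> None:
        """Record the true output length once a response has finished."""
        self._sum[task] += observed_output_tokens
        self._count[task] += 1

    def predict(self, task: str) -> float:
        """Estimate the output length of a new query with this task label."""
        if self._count[task] == 0:
            return self.default_length
        return self._sum[task] / self._count[task]


predictor = MeanLengthPredictor()
unseen = predictor.predict("translate")   # no data yet, falls back to default
predictor.update("summarize", 100)
predictor.update("summarize", 300)
seen = predictor.predict("summarize")     # mean of the two observations
```

The point of the sketch is only the interface: the router needs some estimate of decode length before generation starts, and that estimate can be refined as completed responses are observed.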

Based on these predictions, the router intelligently schedules queries across the available LLM instances to achieve lower end-to-end latency compared to existing approaches that treat LLM workloads as monolithic jobs.
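The routing step can be pictured with a simple greedy rule, shown below. Note that this swaps the paper's reinforcement learning policy for a plain least-estimated-load heuristic, and the cost function is an assumed weighted sum rather than the authors' formulation for mixing workloads. `Instance`, `estimated_cost`, `route`, and the weights are hypothetical names and values.

```python
from dataclasses import dataclass


@dataclass
class Instance:
    name: str
    pending_prefill_tokens: float = 0.0  # prompt tokens still to be processed
    pending_decode_tokens: float = 0.0   # predicted output tokens still to generate


def estimated_cost(inst, prompt_tokens, predicted_output_tokens,
                   prefill_weight=1.0, decode_weight=1.0):
    """Assumed cost: weighted prefill and decode load after adding the query."""
    prefill_load = inst.pending_prefill_tokens + prompt_tokens
    decode_load = inst.pending_decode_tokens + predicted_output_tokens
    return prefill_weight * prefill_load + decode_weight * decode_load


def route(instances, prompt_tokens, predicted_output_tokens):
    """Greedily pick the instance with the lowest estimated cost and book the load."""
    best = min(
        instances,
        key=lambda inst: estimated_cost(inst, prompt_tokens, predicted_output_tokens),
    )
    best.pending_prefill_tokens += prompt_tokens
    best.pending_decode_tokens += predicted_output_tokens
    return best


cluster = [
    Instance("a", pending_prefill_tokens=1000, pending_decode_tokens=0),
    Instance("b", pending_prefill_tokens=0, pending_decode_tokens=200),
]
chosen = route(cluster, prompt_tokens=100, predicted_output_tokens=50)
```

Tracking prefill and decode load separately, instead of a single job count, is what distinguishes this from a scheduler that treats each request as a monolithic job. A learned policy could replace `route` while keeping the same inputs.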

Critical Analysis

The researchers acknowledge that their approach relies on accurate response-length prediction, which could be challenging for some types of LLM queries. They also note that their formulation for estimating the impact of mixed workloads may not capture all the nuances of real-world deployments.

Additionally, the evaluation was conducted on a specific cluster configuration, and the performance gains may vary depending on the hardware and scaling characteristics of the underlying LLM instances.

Further research could explore the robustness of the approach under different LLM architectures, workload patterns, and hardware configurations. Incorporating additional factors, such as energy efficiency or fairness, could also be valuable extensions to this work.

Conclusion

This research proposes an intelligent scheduling router that leverages reinforcement learning and a response-length predictor to account for the distinct characteristics of the prefill and decode phases in LLM workloads. By making more informed scheduling decisions, the router can achieve over 11% lower end-to-end latency compared to existing approaches.

This work highlights the importance of considering the unique properties of LLM workloads when designing scheduling systems, and it demonstrates the potential of data-driven and workload-aware techniques to improve the performance of large-scale LLM deployments.

Related Papers

Intelligent Router for LLM Workloads: Improving Performance Through Workload-Aware Scheduling

Kunal Jain, Anjaly Parayil, Ankur Mallick, Esha Choukse, Xiaoting Qin, Jue Zhang, Íñigo Goiri, Rujia Wang, Chetan Bansal, Victor Ruhle, Anoop Kulkarni, Steve Kofsky, Saravan Rajmohan

Large Language Model (LLM) workloads have distinct prefill and decode phases with different compute and memory requirements which should ideally be accounted for when scheduling input queries across different LLM instances in a cluster. However existing scheduling algorithms treat LLM workloads as monolithic jobs without considering the distinct characteristics of the two phases in each workload. This leads to sub-optimal scheduling and increased response latency. In this work, we propose a heuristic-guided reinforcement learning-based intelligent router for data-driven and workload-aware scheduling. Our router leverages a trainable response-length predictor, and a novel formulation for estimating the impact of mixing different workloads to schedule queries across LLM instances and achieve over 11% lower end-to-end latency than existing approaches.

8/27/2024

RouteLLM: Learning to Route LLMs with Preference Data

Isaac Ong, Amjad Almahairi, Vincent Wu, Wei-Lin Chiang, Tianhao Wu, Joseph E. Gonzalez, M Waleed Kadous, Ion Stoica

Large language models (LLMs) exhibit impressive capabilities across a wide range of tasks, yet the choice of which model to use often involves a trade-off between performance and cost. More powerful models, though effective, come with higher expenses, while less capable models are more cost-effective. To address this dilemma, we propose several efficient router models that dynamically select between a stronger and a weaker LLM during inference, aiming to optimize the balance between cost and response quality. We develop a training framework for these routers leveraging human preference data and data augmentation techniques to enhance performance. Our evaluation on widely-recognized benchmarks shows that our approach significantly reduces costs (by over 2 times in certain cases) without compromising the quality of responses. Interestingly, our router models also demonstrate significant transfer learning capabilities, maintaining their performance even when the strong and weak models are changed at test time. This highlights the potential of these routers to provide a cost-effective yet high-performance solution for deploying LLMs.

7/23/2024

Hybrid LLM: Cost-Efficient and Quality-Aware Query Routing

Dujian Ding, Ankur Mallick, Chi Wang, Robert Sim, Subhabrata Mukherjee, Victor Ruhle, Laks V. S. Lakshmanan, Ahmed Hassan Awadallah

Large language models (LLMs) excel in most NLP tasks but also require expensive cloud servers for deployment due to their size, while smaller models that can be deployed on lower cost (e.g., edge) devices, tend to lag behind in terms of response quality. Therefore in this work we propose a hybrid inference approach which combines their respective strengths to save cost and maintain quality. Our approach uses a router that assigns queries to the small or large model based on the predicted query difficulty and the desired quality level. The desired quality level can be tuned dynamically at test time to seamlessly trade quality for cost as per the scenario requirements. In experiments our approach allows us to make up to 40% fewer calls to the large model, with no drop in response quality.

4/24/2024

PerLLM: Personalized Inference Scheduling with Edge-Cloud Collaboration for Diverse LLM Services

Zheming Yang, Yuanhao Yang, Chang Zhao, Qi Guo, Wenkai He, Wen Ji

With the rapid growth in the number of large language model (LLM) users, it is difficult for bandwidth-constrained cloud servers to simultaneously process massive LLM services in real-time. Recently, edge-cloud infrastructures have been used to improve the processing efficiency of large-scale LLM services. However, the diversity of task requirements and the dynamics of resources pose great challenges to inference scheduling, leading to the wastage of many resources. In this paper, we present PerLLM, a personalized inference scheduling framework with edge-cloud collaboration designed for diverse LLM services. For the complexity of multiple constraints and the decision-making process of edge-cloud collaboration, we integrate the upper confidence bound algorithm based on the constraint satisfaction mechanism in PerLLM. For diverse LLM services, PerLLM can optimize service scheduling and resource allocation solutions within the edge-cloud infrastructure to meet processing time requirements while minimizing energy costs. Experimental results from different model deployments show that PerLLM can effectively meet the processing time requirements of personalized services. Compared to other methods, PerLLM achieves 2.2x, 2.1x, and 1.6x throughput and reduces the energy cost by more than 50%.

5/24/2024