Aladdin: Joint Placement and Scaling for SLO-Aware LLM Serving

2405.06856

Published 5/14/2024 by Chengyi Nie, Rodrigo Fonseca, Zhenhua Liu

Aladdin: Joint Placement and Scaling for SLO-Aware LLM Serving

Abstract

The demand for large language model (LLM) inference is gradually dominating the artificial intelligence workloads. Therefore, there is an urgent need for cost-efficient inference serving. Existing work focuses on single-worker optimization and lacks consideration of cluster-level management for both inference queries and computing resources. However, placing requests and managing resources without considering the query features easily causes SLO violations or resource underutilization. Providers are forced to allocate extra computing resources to guarantee user experience, leading to additional serving costs. In this paper we introduce Aladdin, a scheduler that co-adaptively places queries and scales computing resources with SLO awareness. For a stream of inference queries, Aladdin first predicts minimal computing resources and the corresponding serving workers' configuration required to fulfill the SLOs for all queries. Then, it places the queries to each serving worker according to the prefill and decode latency models of batched LLM inference to maximize each worker's utilization. Results show that Aladdin reduces the serving cost of a single model by up to 71% for the same SLO level compared with the baselines, which can be millions of dollars per year.

Create account to get full access

Background and Motivation

Aladdin: Joint Placement and Scaling for SLO-Aware LLM Serving

Overview

This paper presents Aladdin, a system that optimizes the placement and scaling of large language models (LLMs) to meet service-level objectives (SLOs) while minimizing resource usage.
Aladdin jointly considers LLM placement and scaling to efficiently serve dynamic SLOs, addressing challenges in existing approaches.
The paper evaluates Aladdin's performance and compares it to alternative strategies using real-world LLM workloads.

Plain English Explanation

Serving large language models (LLMs) like GPT-3 or BERT to users can be challenging, as these models require significant computational resources and need to meet certain performance targets, known as service-level objectives (SLOs). The Sponge and Proxy systems have tried to address this, but they have limitations.

The Aladdin system presented in this paper takes a new approach. It jointly considers where to place the LLMs (on which servers) and how to scale them (adjusting the resources they use) to meet the SLOs as efficiently as possible. This joint optimization helps Aladdin better handle dynamic changes in workload and SLOs compared to previous systems.

Technical Explanation

Aladdin is designed to efficiently serve LLMs while meeting dynamic service-level objectives (SLOs). It takes a joint approach to LLM placement and scaling, which differs from previous systems like Sponge and Proxy.

Aladdin uses a planning-based optimization algorithm to determine the optimal placement of LLMs across servers and the appropriate scaling of resources for each model. This allows Aladdin to adapt to changes in workload and SLOs more effectively than prior approaches.

The paper evaluates Aladdin's performance using real-world LLM workloads and compares it to alternative strategies. The results demonstrate that Aladdin can meet SLOs while using fewer resources than other systems.

Critical Analysis

The Aladdin paper makes a compelling case for its joint placement and scaling approach to serving LLMs. However, the paper does not address some potential limitations:

The optimization algorithm used by Aladdin may become computationally expensive as the number of LLMs and servers scales up.
The paper does not discuss how Aladdin would handle hardware failures or other unexpected events that could disrupt the placement and scaling of LLMs.
The evaluation is limited to a specific set of workloads, and it's unclear how well Aladdin would perform on a more diverse range of LLM applications.

Further research could explore ways to improve the efficiency and robustness of Aladdin's optimization, as well as testing its performance on a wider variety of real-world LLM serving scenarios.

Conclusion

The Aladdin system presented in this paper offers a promising approach to serving large language models (LLMs) more efficiently by jointly optimizing their placement and scaling to meet dynamic service-level objectives (SLOs). While the paper demonstrates the advantages of Aladdin over previous systems, there are some potential areas for improvement that could be explored in future research. Overall, Aladdin represents an important step forward in the ongoing effort to make LLM serving more practical and cost-effective for real-world applications.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Sponge: Inference Serving with Dynamic SLOs Using In-Place Vertical Scaling

Kamran Razavi, Saeid Ghafouri, Max Muhlhauser, Pooyan Jamshidi, Lin Wang

Mobile and IoT applications increasingly adopt deep learning inference to provide intelligence. Inference requests are typically sent to a cloud infrastructure over a wireless network that is highly variable, leading to the challenge of dynamic Service Level Objectives (SLOs) at the request level. This paper presents Sponge, a novel deep learning inference serving system that maximizes resource efficiency while guaranteeing dynamic SLOs. Sponge achieves its goal by applying in-place vertical scaling, dynamic batching, and request reordering. Specifically, we introduce an Integer Programming formulation to capture the resource allocation problem, providing a mathematical model of the relationship between latency, batch size, and resources. We demonstrate the potential of Sponge through a prototype implementation and preliminary experiments and discuss future works.

4/24/2024

cs.DC

🤯

PerLLM: Personalized Inference Scheduling with Edge-Cloud Collaboration for Diverse LLM Services

Zheming Yang, Yuanhao Yang, Chang Zhao, Qi Guo, Wenkai He, Wen Ji

With the rapid growth in the number of large language model (LLM) users, it is difficult for bandwidth-constrained cloud servers to simultaneously process massive LLM services in real-time. Recently, edge-cloud infrastructures have been used to improve the processing efficiency of large-scale LLM services. However, the diversity of task requirements and the dynamics of resources pose great challenges to inference scheduling, leading to the wastage of many resources. In this paper, we present PerLLM, a personalized inference scheduling framework with edge-cloud collaboration designed for diverse LLM services. For the complexity of multiple constraints and the decision-making process of edge-cloud collaboration, we integrate the upper confidence bound algorithm based on the constraint satisfaction mechanism in PerLLM. For diverse LLM services, PerLLM can optimize service scheduling and resource allocation solutions within the edge-cloud infrastructure to meet processing time requirements while minimizing energy costs. Experimental results from different model deployments show that PerLLM can effectively meet the processing time requirements of personalized services. Compared to other methods, PerLLM achieves 2.2x, 2.1x, and 1.6x throughput and reduces the energy cost by more than 50%.

5/24/2024

cs.DC cs.NI

New!One Queue Is All You Need: Resolving Head-of-Line Blocking in Large Language Model Serving

Archit Patke, Dhemath Reddy, Saurabh Jha, Haoran Qiu, Christian Pinto, Shengkun Cui, Chandra Narayanaswami, Zbigniew Kalbarczyk, Ravishankar Iyer

$ $Large language models (LLMs) have become an increasingly important workload for cloud providers catering to both enterprise and consumer applications. LLM inference requests from these applications have end-to-end latency SLOs that must be adhered to in production settings. However, existing LLM serving systems focus on optimization objectives such as request serving throughput or request execution latency rather than the end-to-end latency SLOs. Achieving end-to-end SLOs for latency-sensitive requests is challenging due to head-of-line (HOL) blocking in the request queue, which results from bursty arrival rates and insufficient resources. To address the above challenge, we propose QLM, a multi-model queue management framework for LLM serving. QLM uses stochastic programming to orchestrate the actions of multiple LLM Serving Operations (LSOs) to reduce HOL blocking and maximize SLO attainment. Specifically, QLM uses the following LSOs: model swapping, request eviction, GPU-CPU state swapping, load balancing, and warm model start. Evaluation on heterogeneous GPU devices and models with real-world LLM serving dataset shows that QLM improves SLO attainment by 40-90% and throughput by 20-400% while maintaining or improving device utilization compared to other state-of-the-art LLM serving systems.

7/2/2024

cs.DC cs.CL cs.LG

Cached Model-as-a-Resource: Provisioning Large Language Model Agents for Edge Intelligence in Space-air-ground Integrated Networks

Minrui Xu, Dusit Niyato, Hongliang Zhang, Jiawen Kang, Zehui Xiong, Shiwen Mao, Zhu Han

Edge intelligence in space-air-ground integrated networks (SAGINs) can enable worldwide network coverage beyond geographical limitations for users to access ubiquitous and low-latency intelligence services. Facing global coverage and complex environments in SAGINs, edge intelligence can provision approximate large language models (LLMs) agents for users via edge servers at ground base stations (BSs) or cloud data centers relayed by satellites. As LLMs with billions of parameters are pre-trained on vast datasets, LLM agents have few-shot learning capabilities, e.g., chain-of-thought (CoT) prompting for complex tasks, which raises a new trade-off between resource consumption and performance in SAGINs. In this paper, we propose a joint caching and inference framework for edge intelligence to provision sustainable and ubiquitous LLM agents in SAGINs. We introduce cached model-as-a-resource for offering LLMs with limited context windows and propose a novel optimization framework, i.e., joint model caching and inference, to utilize cached model resources for provisioning LLM agent services along with communication, computing, and storage resources. We design age of thought (AoT) considering the CoT prompting of LLMs, and propose a least AoT cached model replacement algorithm for optimizing the provisioning cost. We propose a deep Q-network-based modified second-bid (DQMSB) auction to incentivize network operators, which can enhance allocation efficiency by 23% while guaranteeing strategy-proofness and free from adverse selection.

6/3/2024

cs.NI eess.SP