Autothrottle: A Practical Bi-Level Approach to Resource Management for SLO-Targeted Microservices

Original paper: arXiv:2212.12180. Published 4/16/2024 by Zibo Wang, Pinghe Li, Chieh-Jan Mike Liang, Feng Wu, Francis Y. Yan


Overview

The paper presents Autothrottle, a bi-level resource management framework for cloud applications that are built from microservices and must meet latency service-level objectives (SLOs).
Its goal is to use resources efficiently without degrading the end-user experience, a task that grows harder as cloud applications adopt microservices.
The framework architecturally decouples application-level SLO feedback from per-service resource control and bridges the two through performance targets.

Plain English Explanation

Cloud applications increasingly use a microservices architecture, in which an application is split into smaller, independent services. This brings flexibility and scalability, but it also complicates resource management: operators must reason about two distinct levels of system behavior, namely the end-to-end latency of the application (how long a user's request takes to complete) and the resource usage of each individual service.

Translating between these two levels is difficult because a user request typically traverses several heterogeneous services, each contributing unevenly to the overall latency. Autothrottle addresses this with a two-level approach: an application-wide controller sets a performance target for each service, and per-service controllers try to meet those targets.
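The uneven contribution can be made concrete with a toy request path. This sketch assumes a purely sequential chain of four hypothetical services (`frontend`, `auth`, `search`, `db`) with made-up latencies; real call graphs often fan out in parallel, so end-to-end latency is not always a simple sum. Even in this simple case one service dominates, which suggests why an end-to-end SLO cannot simply be split evenly into per-service budgets.

```python
def latency_shares(per_service_ms: dict[str, float]) -> dict[str, float]:
    """Each service's fraction of end-to-end latency on a purely sequential call path."""
    total = sum(per_service_ms.values())
    return {name: ms / total for name, ms in per_service_ms.items()}


# Hypothetical services and made-up latencies (ms) for one request.
path = {"frontend": 4.0, "auth": 1.0, "search": 30.0, "db": 15.0}
end_to_end = sum(path.values())  # 50.0 ms for a sequential chain
shares = latency_shares(path)
for name, share in shares.items():
    print(f"{name}: {share:.0%}")  # prints 8%, 2%, 60%, 30% for the four services
```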

The application-wide controller uses a learning-based method to periodically adjust each service's performance target, expressed as a CPU throttle ratio. The per-service controllers then use heuristics to meet these targets by adjusting their service's CPU allocation, so that overall application latency is balanced against the resources each service consumes.
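The summary does not spell out what a CPU throttle ratio is. A common reading, assumed here, is the fraction of Linux CFS bandwidth-control periods in which a service's container used up its CPU quota and was throttled; cgroups expose this through the `nr_periods` and `nr_throttled` counters in `cpu.stat`. A minimal sketch that computes the ratio over one control interval from two snapshots of that file (the sample contents are made up):

```python
def parse_cpu_stat(text: str) -> dict[str, int]:
    """Parse the 'key value' lines of a cgroup cpu.stat file into a dict."""
    stats: dict[str, int] = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2:
            stats[parts[0]] = int(parts[1])
    return stats


def throttle_ratio(before: str, after: str) -> float:
    """Fraction of CFS periods in the interval during which the cgroup was throttled."""
    b, a = parse_cpu_stat(before), parse_cpu_stat(after)
    periods = a["nr_periods"] - b["nr_periods"]
    throttled = a["nr_throttled"] - b["nr_throttled"]
    return throttled / periods if periods > 0 else 0.0


# Made-up snapshots taken at the start and end of one control interval.
before = "usage_usec 500000\nnr_periods 1000\nnr_throttled 50\nthrottled_usec 20000\n"
after = "usage_usec 900000\nnr_periods 1100\nnr_throttled 70\nthrottled_usec 26000\n"
print(throttle_ratio(before, after))  # 0.2
```

Parsing text rather than opening a path keeps the sketch self-contained; in practice the two snapshots would be read from the service's cgroup `cpu.stat` file at the start and end of each interval.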

The key idea is to decouple application-level SLO feedback from per-service resource control, with performance targets acting as the bridge between them. This lets the framework reduce resource usage while still meeting the latency SLO that defines the desired end-user experience, a balance that cloud application operators find hard to strike as they adopt microservices.
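To illustrate how a learning-based controller can sit entirely on the application side of this split, the sketch below swaps in a simple epsilon-greedy bandit; it is not the paper's algorithm. It assumes a single throttle-ratio target shared by all services (the paper sets one per service), a reward of negative CPU cores with a fixed penalty for violating the latency SLO, and a made-up `simulate` function standing in for the real application. The controller sees only end-to-end latency and total CPU, never per-service internals.

```python
import random


class EpsilonGreedyTargetSelector:
    """Toy stand-in for a learning-based controller (hypothetical, not the paper's).

    Each arm is a candidate CPU throttle-ratio target; the reward is negative
    CPU cores, minus a fixed penalty whenever the latency SLO is violated.
    """

    def __init__(self, targets, epsilon=0.1, slo_penalty=100.0, seed=0):
        self.targets = list(targets)
        self.epsilon = epsilon
        self.slo_penalty = slo_penalty
        self.rng = random.Random(seed)  # seeded for reproducibility
        self.counts = [0] * len(self.targets)
        # Running mean reward per arm; 0.0 is optimistic since rewards are negative,
        # so every untried arm gets picked at least once.
        self.values = [0.0] * len(self.targets)

    def best_arm(self):
        return max(range(len(self.targets)), key=lambda i: self.values[i])

    def choose(self):
        # Explore a random target with probability epsilon, else exploit.
        if self.rng.random() < self.epsilon:
            return self.rng.randrange(len(self.targets))
        return self.best_arm()

    def update(self, arm, cpu_cores, latency_ms, slo_ms):
        penalty = self.slo_penalty if latency_ms > slo_ms else 0.0
        reward = -cpu_cores - penalty
        self.counts[arm] += 1
        # Incremental mean: avoids storing the reward history.
        self.values[arm] += (reward - self.values[arm]) / self.counts[arm]


def simulate(target):
    """Made-up application: a looser target needs fewer cores but raises latency."""
    cpu_cores = 20.0 * (1.0 - target)
    p99_latency_ms = 100.0 + 400.0 * target
    return cpu_cores, p99_latency_ms


SLO_MS = 200.0
selector = EpsilonGreedyTargetSelector([0.0, 0.1, 0.2, 0.3, 0.4])
for _ in range(200):  # one iteration per control interval
    arm = selector.choose()
    cpu, latency = simulate(selector.targets[arm])
    selector.update(arm, cpu, latency, SLO_MS)

print("learned target:", selector.targets[selector.best_arm()])  # learned target: 0.2
```

Because the reward penalizes SLO violations far more than extra cores, the bandit settles on the loosest target that still meets the 200 ms SLO. A practical controller would likely also need to account for changing load, which this toy ignores.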

Technical Explanation

The Autothrottle framework consists of two key components:

Application-wide controller: A learning-based controller that periodically sets a performance target, expressed as a CPU throttle ratio, for each service in the application. Its goal is to meet the end-to-end latency SLO while minimizing resource usage.
Per-service controllers: Heuristic controllers, one per service, that adjust their service's CPU allocation to attain the target set by the application-wide controller. Because each controller only tracks a local target, none of them needs to handle application-level SLO feedback directly.
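A per-service controller only has to track its local target. The sketch below is a hypothetical heuristic, not the one in the paper: it scales a service's CPU quota up when the observed throttle ratio exceeds the target and down when it falls below, using a made-up service model (`toy_throttle_ratio`) in which throttling grows with unmet CPU demand.

```python
from dataclasses import dataclass


@dataclass
class ServiceThrottleController:
    """Hypothetical per-service heuristic: steer a CPU quota so that the observed
    throttle ratio tracks the target set by the application-wide controller."""

    quota_cores: float
    gain: float = 1.0  # assumed tuning knob: larger reacts faster
    min_cores: float = 0.1
    max_cores: float = 16.0

    def step(self, observed_ratio: float, target_ratio: float) -> float:
        # Throttled more than the target -> grow the quota; less -> shrink it.
        error = observed_ratio - target_ratio
        proposed = self.quota_cores * (1.0 + self.gain * error)
        self.quota_cores = min(self.max_cores, max(self.min_cores, proposed))
        return self.quota_cores


def toy_throttle_ratio(quota_cores: float, demand_cores: float = 2.0) -> float:
    """Made-up service model: throttling grows with the unmet share of CPU demand."""
    return max(0.0, (demand_cores - quota_cores) / demand_cores)


ctrl = ServiceThrottleController(quota_cores=4.0)
for _ in range(50):  # one step per control interval
    ctrl.step(toy_throttle_ratio(ctrl.quota_cores), target_ratio=0.1)

print(round(ctrl.quota_cores, 3))  # 1.8
```

With a target of 0.1 and a demand of 2 cores, the quota settles at 1.8 cores, accepting a small, bounded amount of throttling in exchange for CPU that would otherwise sit idle. The assumed `gain` parameter controls how aggressively the quota moves; too large a value makes the loop oscillate.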

The authors evaluate Autothrottle on three microservice applications using workload traces from production scenarios. Autothrottle achieves superior CPU savings, up to 26.21% over the best-performing baseline and up to 93.84% over all baselines, while still preserving the desired end-user experience.
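Percentages like these are easy to misread. Assuming that a saving is measured relative to the baseline's CPU (the summary does not state the exact definition), the arithmetic below shows what each headline figure would imply for a hypothetical 100-core baseline: the 93.84% figure would mean needing only about 6 of those cores in the best case.

```python
def relative_cpu_savings(baseline_cores: float, autothrottle_cores: float) -> float:
    """Percent of the baseline's CPU saved (assumed definition: relative to baseline)."""
    if baseline_cores <= 0:
        raise ValueError("baseline_cores must be positive")
    return 100.0 * (baseline_cores - autothrottle_cores) / baseline_cores


def cores_after_savings(baseline_cores: float, savings_pct: float) -> float:
    """Inverse of relative_cpu_savings: CPU left after a given percentage saving."""
    return baseline_cores * (1.0 - savings_pct / 100.0)


# Hypothetical 100-core baseline, using the two headline figures from the abstract.
print(round(cores_after_savings(100.0, 26.21), 2))  # 73.79
print(round(cores_after_savings(100.0, 93.84), 2))  # 6.16
```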

Critical Analysis

The paper presents a well-designed and comprehensive approach to the challenge of resource management for cloud applications using microservices. The authors have addressed a real-world problem that cloud operators face as their applications become more complex and distributed.

One potential limitation of the research is that it focuses solely on CPU usage as the resource metric. While CPU is a crucial resource, cloud applications often need to consider other resources, such as memory, network bandwidth, and storage. Extending Autothrottle to handle multiple resource types could be an area for further research.

Additionally, the paper does not explore the impact of the machine learning model used in the application-wide controller. The performance of this component could be sensitive to the choice of model, hyperparameters, and training data. Investigating the robustness of the learning-based controller would be an important next step.

Overall, the Autothrottle framework represents a valuable contribution to the field of resource management for cloud applications, and the research has the potential to significantly improve the efficiency and performance of microservices-based systems.

Conclusion

The Autothrottle framework addresses the challenge of achieving resource efficiency while preserving end-user experience in cloud applications that use microservices. By architecturally decoupling application-level SLO feedback from per-service resource control, and bridging them through the notion of performance targets, Autothrottle can achieve significant CPU savings while maintaining the desired end-user experience. The research demonstrates the potential for innovative resource management approaches to enable the continued growth and adoption of microservices in cloud computing.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!


Related Papers

CASA: A Framework for SLO and Carbon-Aware Autoscaling and Scheduling in Serverless Cloud Computing

S. Qi, H. Moore, N. Hogade, D. Milojicic, C. Bash, S. Pasricha

Serverless computing is an emerging cloud computing paradigm that can reduce costs for cloud providers and their customers. However, serverless cloud platforms have stringent performance requirements (due to the need to execute short duration functions in a timely manner) and a growing carbon footprint. Traditional carbon-reducing techniques such as shutting down idle containers can reduce performance by increasing cold-start latencies of containers required in the future. This can cause higher violation rates of service level objectives (SLOs). Conversely, traditional latency-reduction approaches of prewarming containers or keeping them alive when not in use can improve performance but increase the associated carbon footprint of the serverless cluster platform. To strike a balance between sustainability and performance, in this paper, we propose a novel carbon- and SLO-aware framework called CASA to schedule and autoscale containers in a serverless cloud computing cluster. Experimental results indicate that CASA reduces the operational carbon footprint of a FaaS cluster by up to 2.6x while also reducing the SLO violation rate by up to 1.4x compared to the state-of-the-art.

9/4/2024

SLO-aware GPU Frequency Scaling for Energy Efficient LLM Inference Serving

Andreas Kosmas Kakolyris, Dimosthenis Masouros, Petros Vavaroutsos, Sotirios Xydis, Dimitrios Soudris

As Large Language Models (LLMs) gain traction, their reliance on power-hungry GPUs places ever-increasing energy demands, raising environmental and monetary concerns. Inference dominates LLM workloads, presenting a critical challenge for providers: minimizing energy costs under Service-Level Objectives (SLOs) that ensure optimal user experience. In this paper, we present throttLL'eM, a framework that reduces energy consumption while meeting SLOs through the use of instance and GPU frequency scaling. throttLL'eM features mechanisms that project future KV cache usage and batch size. Leveraging a Machine-Learning (ML) model that receives these projections as inputs, throttLL'eM manages performance at the iteration level to satisfy SLOs with reduced frequencies and instance sizes. We show that the proposed ML model achieves R² scores greater than 0.97 and mispredicts performance by less than 1 iteration per second on average. Experimental results on LLM inference traces show that throttLL'eM achieves up to 43.8% lower energy consumption and an energy efficiency improvement of at least 1.71× under SLOs, when compared to NVIDIA's Triton server.

8/13/2024

DRPC: Distributed Reinforcement Learning Approach for Scalable Resource Provisioning in Container-based Clusters

Haoyu Bai, Minxian Xu, Kejiang Ye, Rajkumar Buyya, Chengzhong Xu

Microservices have transformed monolithic applications into lightweight, self-contained, and isolated application components, establishing themselves as a dominant paradigm for application development and deployment in public clouds such as Google and Alibaba. Autoscaling emerges as an efficient strategy for managing resources allocated to microservices' replicas. However, the dynamic and intricate dependencies within microservice chains present challenges to the effective management of scaled microservices. Additionally, the centralized autoscaling approach can encounter scalability issues, especially in the management of large-scale microservice-based clusters. To address these challenges and enhance scalability, we propose an innovative distributed resource provisioning approach for microservices based on the Twin Delayed Deep Deterministic Policy Gradient algorithm. This approach enables effective autoscaling decisions and decentralizes responsibilities from a central node to distributed nodes. Comparative results with state-of-the-art approaches, obtained from a realistic testbed and traces, indicate that our approach reduces the average response time by 15% and the number of failed requests by 24%, validating improved scalability as the number of requests increases.

7/16/2024