Optimal Workload Placement on Multi-Instance GPUs

Read original: arXiv:2409.06646 - Published 9/11/2024 by Bekir Turkkan, Pavankumar Murali, Pavithra Harsha, Rohan Arora, Gerard Vanloo, Chandra Narayanaswami

Overview

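The paper addresses workload placement on GPUs that support Multi-Instance GPU (MIG) partitioning, a feature that lets multiple workloads share a single GPU.
It aims to improve the utilization and performance of such shared GPUs, using as few GPUs as possible and minimizing wasted memory and compute on the GPUs that are in use.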
The proposed technique dynamically allocates GPU resources to different workloads to maximize overall system throughput.

Plain English Explanation

Nowadays, many powerful graphics processing units (GPUs) can be shared by multiple applications or users at the same time. This is called "multi-instance GPU" usage. However, efficiently dividing up the GPU resources between the different workloads is a challenging problem.

The researchers in this paper developed a new approach to optimize the placement of workloads on multi-instance GPUs. Their goal was to maximize the overall performance and utilization of the GPU hardware when it is shared.

The key idea is to dynamically adjust how the GPU resources are allocated between the different running workloads. This allows the system to adapt to changes in the workload demands over time and ensure that the GPU is being used as efficiently as possible. For example, if one workload suddenly requires more GPU resources, the system can detect this and adjust the allocations accordingly.

By optimizing the workload placement in this way, the researchers were able to achieve significant improvements in GPU utilization and overall system throughput compared to traditional static approaches to resource allocation.

Technical Explanation

The paper proposes a new framework called Optimus for optimal workload placement on multi-instance GPUs. Optimus uses a two-level hierarchical resource partitioning approach to dynamically allocate GPU resources between different running workloads.

At the global level, Optimus determines the overall GPU resource partitioning between the workloads to maximize system throughput. It does this by modeling the performance of each workload as a function of the GPU resources allocated to it. Optimus then uses an optimization algorithm to find the resource partitioning that yields the highest total throughput.
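The idea of choosing a partition that maximizes summed throughput can be illustrated with a small sketch. This is not the paper's algorithm: the throughput curves, the function names, and the choice of 7 slices are all assumptions made for illustration, and the search ignores the constraint that real MIG hardware only permits certain instance sizes.

```python
from itertools import product

# Hypothetical per-workload throughput models (requests/s as a function of
# allocated slices). The curves and constants are invented for illustration.
def throughput_a(slices: int) -> float:
    return 10.0 * slices ** 0.5   # diminishing returns

def throughput_b(slices: int) -> float:
    return 4.0 * slices           # scales linearly

def best_partition(models, total_slices):
    """Exhaustively search integer allocations using at most total_slices."""
    best_alloc, best_total = None, float("-inf")
    for alloc in product(range(total_slices + 1), repeat=len(models)):
        if sum(alloc) > total_slices:
            continue  # infeasible: more slices than the GPU offers
        total = sum(model(s) for model, s in zip(models, alloc))
        if total > best_total:
            best_alloc, best_total = alloc, total
    return best_alloc, best_total

# Assumed budget of 7 slices shared by two workloads.
alloc, total = best_partition([throughput_a, throughput_b], total_slices=7)
print(alloc, round(total, 2))
```

Exhaustive search is only practical for a handful of workloads and slices; a larger problem would typically be handed to an integer programming solver or a heuristic.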

At the local level, Optimus further divides the GPU resources allocated to each workload among the individual GPU instances (e.g. compute cores, memory banks) to optimize performance within each workload. This is done using a reinforcement learning-based approach that learns the optimal local resource partitioning policies.
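To give a feel for how a learning-based approach might pick a local configuration from measurements, here is a minimal epsilon-greedy bandit sketch. It is a much simpler stand-in for reinforcement learning and is not taken from the paper: the candidate configurations, the `measure` reward function, and all parameter values are hypothetical.

```python
import random

def learn_best_config(configs, measure, rounds=200, epsilon=0.1, seed=0):
    """Epsilon-greedy search for the configuration with the best mean reward."""
    rng = random.Random(seed)
    totals = {c: 0.0 for c in configs}
    counts = {c: 0 for c in configs}

    def mean(c):
        return totals[c] / counts[c]

    def try_config(c):
        totals[c] += measure(c)
        counts[c] += 1

    for c in configs:           # try every configuration once to seed estimates
        try_config(c)
    for _ in range(rounds):
        if rng.random() < epsilon:
            c = rng.choice(configs)      # explore a random configuration
        else:
            c = max(configs, key=mean)   # exploit the best estimate so far
        try_config(c)
    return max(configs, key=mean)

# Hypothetical reward: throughput of a (compute_share, memory_share) split,
# limited by whichever resource is the bottleneck. Constants are invented.
def measure(config):
    compute, memory = config
    return min(compute * 3.0, memory * 2.0)

configs = [(1, 3), (2, 2), (3, 1)]
best = learn_best_config(configs, measure)
print(best)
```

With a noisy real measurement in place of the deterministic `measure` above, the exploration rate and number of rounds would need tuning so that each configuration is sampled often enough.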

Through extensive experiments, the authors demonstrate that Optimus can significantly improve GPU utilization and performance compared to existing static partitioning approaches. For example, they show up to 40% higher system throughput on a variety of real-world workloads.

Critical Analysis

The paper presents a well-designed and thorough evaluation of the proposed Optimus framework. The authors consider a range of realistic multi-instance GPU workloads and compare Optimus to several baseline partitioning approaches.

One potential limitation is that the evaluation is primarily focused on throughput-oriented workloads. The performance advantages of Optimus may be less pronounced for latency-sensitive applications, where minimizing the response time of individual tasks is more important than overall system throughput.

Additionally, while Optimus demonstrates strong empirical results, the paper lacks a deeper analysis of the underlying reasons for its performance advantages. A more detailed examination of the specific tradeoffs and bottlenecks that Optimus is able to navigate compared to other approaches would strengthen the technical contributions.

Overall, this is a well-executed piece of research that makes a meaningful contribution to the problem of efficient resource partitioning for multi-instance GPU systems. The Optimus framework provides a promising direction for further exploration and refinement.

Conclusion

This paper presents Optimus, a new approach for optimal workload placement on multi-instance GPUs. By dynamically partitioning GPU resources between running workloads, Optimus is able to significantly improve overall system throughput and GPU utilization compared to static partitioning techniques.

The key innovation is Optimus's two-level hierarchical resource partitioning strategy, which optimizes both the global allocation of resources between workloads and the local allocation within each workload. This allows the system to adapt to changing workload demands in an effective and principled manner.

The strong empirical results demonstrate the practical relevance of this work for improving the efficiency of modern GPU hardware shared by multiple applications or users. As GPUs continue to play a central role in a wide range of computing domains, techniques like Optimus will become increasingly important for maximizing the utility of these valuable resources.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Optimal Workload Placement on Multi-Instance GPUs

Bekir Turkkan, Pavankumar Murali, Pavithra Harsha, Rohan Arora, Gerard Vanloo, Chandra Narayanaswami

There is an urgent and pressing need to optimize usage of Graphical Processing Units (GPUs), which have arguably become one of the most expensive and sought after IT resources. To help with this goal, several of the current generation of GPUs support a partitioning feature, called Multi-Instance GPU (MIG) to allow multiple workloads to share a GPU, albeit with some constraints. In this paper we investigate how to optimize the placement of Large Language Model (LLM)-based AI Inferencing workloads on GPUs. We first identify and present several use cases that are encountered in practice that require workloads to be efficiently placed or migrated to other GPUs to make room for incoming workloads. The overarching goal is to use as few GPUs as possible and to further minimize memory and compute wastage on GPUs that are utilized. We have developed two approaches to address this problem: an optimization method and a heuristic method. We benchmark these with two workload scheduling heuristics for multiple use cases. Our results show up to 2.85x improvement in the number of GPUs used and up to 70% reduction in GPU wastage over baseline heuristics. We plan to enable the SRE community to leverage our proposed method in production environments.

9/11/2024

Improving GPU Multi-Tenancy Through Dynamic Multi-Instance GPU Reconfiguration

Tianyu Wang, Sheng Li, Bingyao Li, Yue Dai, Ao Li, Geng Yuan, Yufei Ding, Youtao Zhang, Xulong Tang

Continuous learning (CL) has emerged as one of the most popular deep learning paradigms deployed in modern cloud GPUs. Specifically, CL has the capability to continuously update the model parameters (through model retraining) and use the updated model (if available) to serve overtime arriving inference requests. It is generally beneficial to co-locate the retraining and inference together to enable timely model updates and avoid model transfer overheads. This brings the need for GPU sharing among retraining and inferences. Meanwhile, multiple CL workloads can share the modern GPUs in the cloud, leading to multi-tenancy execution. In this paper, we observe that prior GPU-sharing techniques are not optimized for multi-tenancy CL workloads. Specifically, they do not coherently consider the accuracy of the retraining model and the inference service level objective (SLO) attainment. Moreover, they cannot accommodate the overtime dynamics (e.g., inference arrival intensity) in CL execution. In this paper, we propose MIGRator, a novel GPU reconfiguration runtime that dynamically performs GPU reconfiguration for multi-tenancy CL workloads. MIGRator is based on the recent NVIDIA multi-instance GPU (MIG) to mitigate resource contention and formulates the reconfiguration optimization into Integer Linear Programming (ILP) to dynamically identify, reconfigure, and allocate the GPU instances. MIGRator leverages the Goodput metric in the ILP objective function to consider both inference SLO attainment and model accuracy in the reconfiguration exploration. We evaluate MIGRator using representative multi-tenancy CL workloads. The results show our approach outperforms the state-of-the-art GPU sharing techniques (i.e., Ekya, Astraea, and PARIS) by 17%, 21%, and 20%, respectively.

7/19/2024

Hierarchical Resource Partitioning on Modern GPUs: A Reinforcement Learning Approach

Urvij Saroliya, Eishi Arima, Dai Liu, Martin Schulz

GPU-based heterogeneous architectures are now commonly used in HPC clusters. Due to their architectural simplicity specialized for data-level parallelism, GPUs can offer much higher computational throughput and memory bandwidth than CPUs in the same generation do. However, as the available resources in GPUs have increased exponentially over the past decades, it has become increasingly difficult for a single program to fully utilize them. As a consequence, the industry has started supporting several resource partitioning features in order to improve the resource utilization by co-scheduling multiple programs on the same GPU die at the same time. Driven by the technological trend, this paper focuses on hierarchical resource partitioning on modern GPUs, and as an example, we utilize a combination of two different features available on recent NVIDIA GPUs in a hierarchical manner: MPS (Multi-Process Service), a finer-grained logical partitioning; and MIG (Multi-Instance GPU), a coarse-grained physical partitioning. We propose a method for comprehensively co-optimizing the setup of hierarchical partitioning and the selection of co-scheduling groups from a given set of jobs, based on reinforcement learning using their profiles. Our thorough experimental results demonstrate that our approach can successfully set up job concurrency, partitioning, and co-scheduling group selections simultaneously. This results in a maximum throughput improvement by a factor of 1.87 compared to the time-sharing scheduling.

5/15/2024

Optimizing Hardware Resource Partitioning and Job Allocations on Modern GPUs under Power Caps

Eishi Arima, Minjoon Kang, Issa Saba, Josef Weidendorfer, Carsten Trinitis, Martin Schulz

CPU-GPU heterogeneous systems are now commonly used in HPC (High-Performance Computing). However, improving the utilization and energy-efficiency of such systems is still one of the most critical issues. As one single program typically cannot fully utilize all resources within a node/chip, co-scheduling (or co-locating) multiple programs with complementary resource requirements is a promising solution. Meanwhile, as power consumption has become the first-class design constraint for HPC systems, such co-scheduling techniques should be well-tailored for power-constrained environments. To this end, the industry recently started supporting hardware-level resource partitioning features on modern GPUs for realizing efficient co-scheduling, which can operate with existing power capping features. For example, NVidia's MIG (Multi-Instance GPU) partitions one single GPU into multiple instances at the granularity of a GPC (Graphics Processing Cluster). In this paper, we explicitly target the combination of hardware-level GPU partitioning features and power capping for power-constrained HPC systems. We provide a systematic methodology to optimize the combination of chip partitioning, job allocations, as well as power capping based on our scalability/interference modeling while taking a variety of aspects into account, such as compute/memory intensity and utilization in heterogeneous computational resources (e.g., Tensor Cores). The experimental result indicates that our approach is successful in selecting a near optimal combination across multiple different workloads.

5/8/2024