Hierarchical Resource Partitioning on Modern GPUs: A Reinforcement Learning Approach

Read original: arXiv:2405.08754 - Published 5/15/2024 by Urvij Saroliya, Eishi Arima, Dai Liu, Martin Schulz

🏅

Overview

GPUs are now commonly used in high-performance computing (HPC) clusters due to their ability to provide much higher computational throughput and memory bandwidth compared to CPUs.
As the available resources in GPUs have increased over time, it has become challenging for a single program to fully utilize them.
The industry has started supporting resource partitioning features to improve resource utilization by co-scheduling multiple programs on the same GPU.
This paper focuses on hierarchical resource partitioning on modern GPUs, specifically using a combination of Multi-Process Service (MPS) and Multi-Instance GPU (MIG).

Plain English Explanation

GPUs are specialized computer chips that excel at processing large amounts of data in parallel. They are now commonly used in high-performance computing (HPC) clusters, which are powerful computer systems used for scientific and engineering applications. Compared to traditional CPUs, GPUs can perform many calculations simultaneously, allowing them to process data much faster.

However, as the capabilities of GPUs have grown over the years, it has become increasingly difficult for a single program or application to fully utilize all the resources available on a GPU. To address this, the industry has developed features that allow multiple programs to share the resources of a single GPU. This is known as resource partitioning.

The paper you provided focuses on a specific type of resource partitioning called hierarchical resource partitioning. This involves using two different partitioning features available on recent NVIDIA GPUs: Multi-Process Service (MPS) and Multi-Instance GPU (MIG). MPS allows for finer-grained logical partitioning of GPU resources, while MIG enables coarser-grained physical partitioning.

The researchers propose a method that combines these two partitioning features in a hierarchical manner, along with a technique to optimize the selection of programs that will be co-scheduled on the GPU. This approach aims to improve the overall utilization of GPU resources and increase the throughput of the HPC cluster.

Technical Explanation

The paper proposes a method for comprehensively co-optimizing the setup of hierarchical partitioning and the selection of co-scheduling groups from a given set of jobs, based on reinforcement learning using their profiles.

The researchers utilize a combination of two GPU partitioning features available on recent NVIDIA GPUs:

Multi-Process Service (MPS): This feature enables finer-grained logical partitioning of GPU resources, allowing multiple programs to run concurrently on the same GPU.
Multi-Instance GPU (MIG): This feature provides coarse-grained physical partitioning of the GPU, allowing it to be divided into multiple isolated instances, each of which can be assigned to a different program.

The researchers propose a method that co-optimizes the setup of these hierarchical partitioning features and the selection of co-scheduling groups from a given set of jobs. This is done using reinforcement learning, which takes into account the profiles of the jobs to make informed decisions.

The experimental results presented in the paper demonstrate that this approach can successfully set up job concurrency, partitioning, and co-scheduling group selections simultaneously. This leads to a maximum throughput improvement of up to 1.87 times compared to a traditional time-sharing scheduling approach.

Critical Analysis

The paper provides a comprehensive solution for optimizing the use of GPU resources in HPC clusters through hierarchical partitioning and job co-scheduling. The researchers have identified a relevant problem and proposed a novel approach that combines two GPU partitioning features in a hierarchical manner.

One potential limitation of the research is that it focuses on NVIDIA GPUs specifically, and the proposed method may not be directly applicable to other GPU architectures or vendors. It would be interesting to see if the approach can be generalized to work with a wider range of GPU hardware.

Additionally, the paper does not provide a detailed analysis of the computational overhead or latency introduced by the hierarchical partitioning and co-scheduling mechanisms. Further research could explore the trade-offs between the performance gains and the system overhead in different workload scenarios.

Another area for further investigation could be the impact of this approach on the energy efficiency of the HPC cluster. As power consumption and thermal management are crucial concerns in large-scale computing systems, understanding the energy implications of the proposed method would be valuable.

Overall, the paper presents a compelling solution to the challenge of improving GPU resource utilization in HPC environments. The hierarchical partitioning and co-scheduling approach demonstrated in this research could have significant implications for enhancing the efficiency and performance of GPU-based computing systems.

Conclusion

This paper addresses the challenge of effectively utilizing the increasing resources available in modern GPUs by proposing a hierarchical resource partitioning approach. The researchers leverage a combination of Multi-Process Service (MPS) and Multi-Instance GPU (MIG) features to co-optimize the setup of partitioning and the selection of co-scheduling groups.

The experimental results demonstrate that this approach can significantly improve the throughput of GPU-based HPC clusters, with a maximum improvement of 1.87 times compared to traditional time-sharing scheduling. This research highlights the importance of innovative resource management techniques in harnessing the full potential of modern GPU hardware for demanding scientific and engineering applications.

As GPU capabilities continue to advance, the insights and methods presented in this paper could contribute to the development of more efficient and versatile GPU-based computing systems, driving further progress in high-performance machine learning training, multi-resource scheduling in HPC, and distributed training across heterogeneous clusters.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Follow @aimodelsfyi on 𝕏 →

Related Papers

🏅

Hierarchical Resource Partitioning on Modern GPUs: A Reinforcement Learning Approach

Urvij Saroliya, Eishi Arima, Dai Liu, Martin Schulz

GPU-based heterogeneous architectures are now commonly used in HPC clusters. Due to their architectural simplicity specialized for data-level parallelism, GPUs can offer much higher computational throughput and memory bandwidth than CPUs in the same generation do. However, as the available resources in GPUs have increased exponentially over the past decades, it has become increasingly difficult for a single program to fully utilize them. As a consequence, the industry has started supporting several resource partitioning features in order to improve the resource utilization by co-scheduling multiple programs on the same GPU die at the same time. Driven by the technological trend, this paper focuses on hierarchical resource partitioning on modern GPUs, and as an example, we utilize a combination of two different features available on recent NVIDIA GPUs in a hierarchical manner: MPS (Multi-Process Service), a finer-grained logical partitioning; and MIG (Multi-Instance GPU), a coarse-grained physical partitioning. We propose a method for comprehensively co-optimizing the setup of hierarchical partitioning and the selection of co-scheduling groups from a given set of jobs, based on reinforcement learning using their profiles. Our thorough experimental results demonstrate that our approach can successfully set up job concurrency, partitioning, and co-scheduling group selections simultaneously. This results in a maximum throughput improvement by a factor of 1.87 compared to the time-sharing scheduling.

5/15/2024

Optimizing Hardware Resource Partitioning and Job Allocations on Modern GPUs under Power Caps

Eishi Arima, Minjoon Kang, Issa Saba, Josef Weidendorfer, Carsten Trinitis, Martin Schulz

CPU-GPU heterogeneous systems are now commonly used in HPC (High-Performance Computing). However, improving the utilization and energy-efficiency of such systems is still one of the most critical issues. As one single program typically cannot fully utilize all resources within a node/chip, co-scheduling (or co-locating) multiple programs with complementary resource requirements is a promising solution. Meanwhile, as power consumption has become the first-class design constraint for HPC systems, such co-scheduling techniques should be well-tailored for power-constrained environments. To this end, the industry recently started supporting hardware-level resource partitioning features on modern GPUs for realizing efficient co-scheduling, which can operate with existing power capping features. For example, NVidia's MIG (Multi-Instance GPU) partitions one single GPU into multiple instances at the granularity of a GPC (Graphics Processing Cluster). In this paper, we explicitly target the combination of hardware-level GPU partitioning features and power capping for power-constrained HPC systems. We provide a systematic methodology to optimize the combination of chip partitioning, job allocations, as well as power capping based on our scalability/interference modeling while taking a variety of aspects into account, such as compute/memory intensity and utilization in heterogeneous computational resources (e.g., Tensor Cores). The experimental result indicates that our approach is successful in selecting a near optimal combination across multiple different workloads.

5/8/2024

Orchestrated Co-scheduling, Resource Partitioning, and Power Capping on CPU-GPU Heterogeneous Systems via Machine Learning

Issa Saba, Eishi Arima, Dai Liu, Martin Schulz

CPU-GPU heterogeneous architectures are now commonly used in a wide variety of computing systems from mobile devices to supercomputers. Maximizing the throughput for multi-programmed workloads on such systems is indispensable as one single program typically cannot fully exploit all available resources. At the same time, power consumption is a key issue and often requires optimizing power allocations to the CPU and GPU while enforcing a total power constraint, in particular when the power/thermal requirements are strict. The result is a system-wide optimization problem with several knobs. In particular we focus on (1) co-scheduling decisions, i.e., selecting programs to co-locate in a space sharing manner; (2) resource partitioning on both CPUs and GPUs; and (3) power capping on both CPUs and GPUs. We solve this problem using predictive performance modeling using machine learning in order to coordinately optimize the above knob setups. Our experiential results using a real system show that our approach achieves up to 67% of speedup compared to a time-sharing-based scheduling with a naive power capping that evenly distributes power budgets across components.

5/8/2024

Optimal Workload Placement on Multi-Instance GPUs

Bekir Turkkan, Pavankumar Murali, Pavithra Harsha, Rohan Arora, Gerard Vanloo, Chandra Narayanaswami

There is an urgent and pressing need to optimize usage of Graphical Processing Units (GPUs), which have arguably become one of the most expensive and sought after IT resources. To help with this goal, several of the current generation of GPUs support a partitioning feature, called Multi-Instance GPU (MIG) to allow multiple workloads to share a GPU, albeit with some constraints. In this paper we investigate how to optimize the placement of Large Language Model (LLM)-based AI Inferencing workloads on GPUs. We first identify and present several use cases that are encountered in practice that require workloads to be efficiently placed or migrated to other GPUs to make room for incoming workloads. The overarching goal is to use as few GPUs as possible and to further minimize memory and compute wastage on GPUs that are utilized. We have developed two approaches to address this problem: an optimization method and a heuristic method. We benchmark these with two workload scheduling heuristics for multiple use cases. Our results show up to 2.85x improvement in the number of GPUs used and up to 70% reduction in GPU wastage over baseline heuristics. We plan to enable the SRE community to leverage our proposed method in production environments.

9/11/2024