Missile: Fine-Grained, Hardware-Level GPU Resource Isolation for Multi-Tenant DNN Inference

Read original: arXiv:2407.13996 - Published 7/30/2024 by Yongkang Zhang, Haoxuan Yu, Chenxia Han, Cheng Wang, Baotong Lu, Yang Li, Xiaowen Chu, Huaicheng Li

Overview

Colocating high-priority, latency-sensitive (LS) and low-priority, best-effort (BE) DNN inference services on the same GPU cluster can reduce total cost of ownership (TCO).
Existing GPU sharing solutions cannot avoid resource conflicts between concurrently running LS and BE tasks, so they fail to deliver both low latency for LS tasks and high throughput for BE tasks.
This paper presents Missile, a GPU sharing solution for multi-tenant DNN inference on NVIDIA GPUs that isolates resources between LS and BE tasks.

Plain English Explanation

The paper discusses a way to run different types of deep neural network (DNN) inference tasks on the same GPU cluster to save money. Some tasks, like real-time video processing, need to run quickly with low delay (latency-sensitive, or LS tasks). Other tasks, like batch data processing, don't need to be as fast (best-effort, or BE tasks).

Putting LS and BE tasks on the same GPUs could be cheaper, but existing methods can't isolate the resources used by each type of task. This leads to the LS tasks being slow and the BE tasks not running as efficiently as they could.

The paper introduces Missile, a new system that can better separate the resources used by LS and BE tasks on NVIDIA GPUs. This allows the LS tasks to run quickly while still getting good performance from the BE tasks.

Technical Explanation

Missile addresses two key bottlenecks that limit existing GPU sharing solutions:

VRAM Channel Conflicts: Through reverse engineering, Missile reveals the VRAM channel hash mapping architecture of NVIDIA GPUs and uses software-level cache coloring to eliminate channel conflicts between LS and BE tasks.
PCIe Bus Contention: Missile isolates the PCIe bus and fairly allocates PCIe bandwidth to LS and BE tasks using a completely fair scheduler.
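
To make the cache-coloring idea concrete, here is a minimal sketch of a colored page allocator. It is only an illustration under stated assumptions: `channel_of` is a toy XOR-fold hash standing in for the real VRAM channel mapping (which the paper reverse-engineers and which is not reproduced here), and `NUM_CHANNELS`, `ColoredAllocator`, and the tenant names are hypothetical.

```python
from collections import defaultdict

NUM_CHANNELS = 8                       # assumed channel count, illustrative only
BITS = NUM_CHANNELS.bit_length() - 1   # bits consumed per XOR fold (3 for 8 channels)


def channel_of(page: int) -> int:
    """Toy stand-in for a VRAM channel hash: XOR-fold the page number.

    This is NOT the real NVIDIA mapping; it only mimics the idea that a
    hash of address bits decides which channel serves a page.
    """
    h = 0
    while page:
        h ^= page & (NUM_CHANNELS - 1)
        page >>= BITS
    return h


class ColoredAllocator:
    """Hands out pages so that each tenant only touches its own channels."""

    def __init__(self, num_pages: int, colors: dict[str, set[int]]):
        self.colors = colors              # tenant -> set of channels it may use
        self.free = defaultdict(list)     # channel -> free page numbers
        for page in range(num_pages):
            self.free[channel_of(page)].append(page)

    def alloc(self, tenant: str, n: int) -> list[int]:
        pages = []
        for ch in sorted(self.colors[tenant]):
            while self.free[ch] and len(pages) < n:
                pages.append(self.free[ch].pop())
        if len(pages) < n:
            # Roll back so a failed request does not leak pages.
            for p in pages:
                self.free[channel_of(p)].append(p)
            raise MemoryError(f"not enough pages in the channels of {tenant}")
        return pages
```

For example, `ColoredAllocator(1024, {"LS": {0, 1, 2, 3}, "BE": {4, 5, 6, 7}})` gives the two tenants disjoint channel sets, so pages returned by `alloc("LS", ...)` and `alloc("BE", ...)` never map to the same channel under this toy hash. The trade-off is that each tenant can only use the memory that falls into its own colors.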

In experiments with 12 mainstream DNNs and both synthetic and real-world workloads on four GPUs, Missile is shown to:

Reduce tail latency for LS services by up to ~50% compared to state-of-the-art GPU sharing solutions.
Achieve up to 6.1x the throughput for BE jobs.
Allocate PCIe bus bandwidth to tenants on-demand for optimal performance.
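
The fair, on-demand PCIe bandwidth allocation can be pictured with a small scheduler in the style of a completely fair scheduler. This is a sketch under assumptions, not Missile's implementation: the 1 MiB `CHUNK` size, the per-tenant `weights`, and the `FairPcieScheduler` class are hypothetical. Each tenant accumulates a virtual runtime proportional to the bytes it has transferred, and the tenant with the smallest virtual runtime sends the next chunk.

```python
from collections import deque

CHUNK = 1 << 20  # assumed 1 MiB transfer granularity, illustrative only


class FairPcieScheduler:
    """CFS-style arbiter: the tenant with the least virtual runtime goes next."""

    def __init__(self, weights: dict[str, float]):
        self.weights = weights                         # tenant -> bandwidth share
        self.vruntime = {t: 0.0 for t in weights}      # weighted bytes sent so far
        self.queues = {t: deque() for t in weights}    # pending chunk sizes

    def submit(self, tenant: str, nbytes: int) -> None:
        if not self.queues[tenant]:
            # A tenant waking from idle starts at the current minimum so it
            # cannot monopolize the bus with credit saved up while idle.
            active = [self.vruntime[t] for t, q in self.queues.items() if q]
            if active:
                self.vruntime[tenant] = max(self.vruntime[tenant], min(active))
        while nbytes > 0:
            size = min(CHUNK, nbytes)
            self.queues[tenant].append(size)
            nbytes -= size

    def next_chunk(self):
        """Return (tenant, size) for the next transfer, or None if idle."""
        ready = [t for t, q in self.queues.items() if q]
        if not ready:
            return None
        tenant = min(ready, key=lambda t: self.vruntime[t])
        size = self.queues[tenant].popleft()
        self.vruntime[tenant] += size / self.weights[tenant]
        return tenant, size
```

With weights such as `{"LS": 2.0, "BE": 1.0}` and both tenants backlogged, LS receives roughly twice as many chunks as BE. When only one tenant has pending transfers, it gets every chunk, which is the on-demand behavior: unused share is not reserved.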

Critical Analysis

The paper provides a thorough technical explanation of Missile's design and evaluation. However, it does not discuss potential limitations or future research directions in depth.

Some areas that could be further explored include:

Generalizability: The evaluation covers only NVIDIA GPUs; it would be helpful to understand whether Missile's techniques carry over to other GPU architectures.
Production Feasibility: The paper does not address practical deployment challenges, such as handling dynamic workload changes or fault tolerance.
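Hardware Dependence: Missile works at the software level rather than requiring dedicated hardware partitioning, but its cache coloring depends on a reverse-engineered VRAM channel hash mapping, and it is unclear how much effort is needed to re-derive that mapping for GPUs outside the evaluated set.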

Overall, Missile presents a promising approach to improving GPU multi-tenancy, but additional research is needed to fully understand its real-world applicability and limitations.

Conclusion

This paper introduces Missile, a GPU sharing solution that isolates resources between high-priority, latency-sensitive (LS) and low-priority, best-effort (BE) DNN inference tasks. By addressing key bottlenecks such as VRAM channel conflicts and PCIe bus contention, Missile achieves low latency for LS tasks while maintaining high throughput for BE tasks.

The technical evaluation demonstrates Missile's ability to outperform state-of-the-art GPU sharing approaches. However, the paper could be strengthened by further discussion of Missile's limitations and future research directions. Overall, Missile represents an important step towards more efficient and cost-effective GPU utilization in multi-tenant environments.

This summary was produced with help from an AI and may contain inaccuracies; consult the original paper (arXiv:2407.13996) for details.

Related Papers

Missile: Fine-Grained, Hardware-Level GPU Resource Isolation for Multi-Tenant DNN Inference

Yongkang Zhang, Haoxuan Yu, Chenxia Han, Cheng Wang, Baotong Lu, Yang Li, Xiaowen Chu, Huaicheng Li

Colocating high-priority, latency-sensitive (LS) and low-priority, best-effort (BE) DNN inference services reduces the total cost of ownership (TCO) of GPU clusters. Limited by bottlenecks such as VRAM channel conflicts and PCIe bus contentions, existing GPU sharing solutions are unable to avoid resource conflicts among concurrently executing tasks, failing to achieve both low latency for LS tasks and high throughput for BE tasks. To bridge this gap, this paper presents Missile, a general GPU sharing solution for multi-tenant DNN inference on NVIDIA GPUs. Missile approximates fine-grained GPU hardware resource isolation between multiple LS and BE DNN tasks at software level. Through comprehensive reverse engineering, Missile first reveals a general VRAM channel hash mapping architecture of NVIDIA GPUs and eliminates VRAM channel conflicts using software-level cache coloring. It also isolates the PCIe bus and fairly allocates PCIe bandwidth using completely fair scheduler. We evaluate 12 mainstream DNNs with synthetic and real-world workloads on four GPUs. The results show that compared to the state-of-the-art GPU sharing solutions, Missile reduces tail latency for LS services by up to ~50%, achieves up to 6.1x BE job throughput, and allocates PCIe bus bandwidth to tenants on-demand for optimal performance.

7/30/2024

Improving GPU Multi-Tenancy Through Dynamic Multi-Instance GPU Reconfiguration

Tianyu Wang, Sheng Li, Bingyao Li, Yue Dai, Ao Li, Geng Yuan, Yufei Ding, Youtao Zhang, Xulong Tang

Continuous learning (CL) has emerged as one of the most popular deep learning paradigms deployed in modern cloud GPUs. Specifically, CL has the capability to continuously update the model parameters (through model retraining) and use the updated model (if available) to serve overtime arriving inference requests. It is generally beneficial to co-locate the retraining and inference together to enable timely model updates and avoid model transfer overheads. This brings the need for GPU sharing among retraining and inferences. Meanwhile, multiple CL workloads can share the modern GPUs in the cloud, leading to multi-tenancy execution. In this paper, we observe that prior GPU-sharing techniques are not optimized for multi-tenancy CL workloads. Specifically, they do not coherently consider the accuracy of the retraining model and the inference service level objective (SLO) attainment. Moreover, they cannot accommodate the overtime dynamics (e.g., inference arrival intensity) in CL execution. In this paper, we propose MIGRator, a novel GPU reconfiguration runtime that dynamically performs GPU reconfiguration for multi-tenancy CL workloads. MIGRator is based on the recent NVIDIA multi-instance GPU (MIG) to mitigate resource contention and formulates the reconfiguration optimization into Integer Linear Programming (ILP) to dynamically identify, reconfigure, and allocate the GPU instances. MIGRator leverages the Goodput metric in the ILP objective function to consider both inference SLO attainment and model accuracy in the reconfiguration exploration. We evaluate MIGRator using representative multi-tenancy CL workloads. The results show our approach outperforms the state-of-the-art GPU sharing techniques (i.e., Ekya, Astraea, and PARIS) by 17%, 21%, and 20%, respectively.

7/19/2024

Scheduling Deep Learning Jobs in Multi-Tenant GPU Clusters via Wise Resource Sharing

Yizhou Luo, Qiang Wang, Shaohuai Shi, Jiaxin Lai, Shuhan Qi, Jiajia Zhang, Xuan Wang

Deep learning (DL) has demonstrated significant success across diverse fields, leading to the construction of dedicated GPU accelerators within GPU clusters for high-quality training services. Efficient scheduler designs for such clusters are vital to reduce operational costs and enhance resource utilization. While recent schedulers have shown impressive performance in optimizing DL job performance and cluster utilization through periodic reallocation or selection of GPU resources, they also encounter challenges such as preemption and migration overhead, along with potential DL accuracy degradation. Nonetheless, few explore the potential benefits of GPU sharing to improve resource utilization and reduce job queuing times. Motivated by these insights, we present a job scheduling model allowing multiple jobs to share the same set of GPUs without altering job training settings. We introduce SJF-BSBF (shortest job first with best sharing benefit first), a straightforward yet effective heuristic scheduling algorithm. SJF-BSBF intelligently selects job pairs for GPU resource sharing and runtime settings (sub-batch size and scheduling time point) to optimize overall performance while ensuring DL convergence accuracy through gradient accumulation. In experiments with both physical DL workloads and trace-driven simulations, even as a preemption-free policy, SJF-BSBF reduces the average job completion time by 27-33% relative to the state-of-the-art preemptive DL schedulers. Moreover, SJF-BSBF can wisely determine the optimal resource sharing settings, such as the sharing time point and sub-batch size for gradient accumulation, outperforming the aggressive GPU sharing approach (baseline SJF-FFS policy) by up to 17% in large-scale traces.

7/19/2024

Hierarchical Resource Partitioning on Modern GPUs: A Reinforcement Learning Approach

Urvij Saroliya, Eishi Arima, Dai Liu, Martin Schulz

GPU-based heterogeneous architectures are now commonly used in HPC clusters. Due to their architectural simplicity specialized for data-level parallelism, GPUs can offer much higher computational throughput and memory bandwidth than CPUs in the same generation do. However, as the available resources in GPUs have increased exponentially over the past decades, it has become increasingly difficult for a single program to fully utilize them. As a consequence, the industry has started supporting several resource partitioning features in order to improve the resource utilization by co-scheduling multiple programs on the same GPU die at the same time. Driven by the technological trend, this paper focuses on hierarchical resource partitioning on modern GPUs, and as an example, we utilize a combination of two different features available on recent NVIDIA GPUs in a hierarchical manner: MPS (Multi-Process Service), a finer-grained logical partitioning; and MIG (Multi-Instance GPU), a coarse-grained physical partitioning. We propose a method for comprehensively co-optimizing the setup of hierarchical partitioning and the selection of co-scheduling groups from a given set of jobs, based on reinforcement learning using their profiles. Our thorough experimental results demonstrate that our approach can successfully set up job concurrency, partitioning, and co-scheduling group selections simultaneously. This results in a maximum throughput improvement by a factor of 1.87 compared to the time-sharing scheduling.

5/15/2024