Scheduling Deep Learning Jobs in Multi-Tenant GPU Clusters via Wise Resource Sharing

Read original: arXiv:2407.13088 - Published 7/19/2024 by Yizhou Luo, Qiang Wang, Shaohuai Shi, Jiaxin Lai, Shuhan Qi, Jiajia Zhang, Xuan Wang


Overview

This paper presents a new scheduling algorithm for deep learning jobs in multi-tenant GPU clusters.
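The approach, which this summary calls "Wise Resource Sharing" (WRS), schedules deep learning jobs according to their resource requirements and lets multiple jobs share the same GPUs. The paper's abstract names the underlying heuristic SJF-BSBF (shortest job first with best sharing benefit first).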
WRS is designed to address the challenges of scheduling deep learning jobs in shared GPU clusters, where multiple users or applications compete for limited GPU resources.

Plain English Explanation

The paper discusses a new way to schedule deep learning jobs in shared GPU clusters. In these clusters, multiple users or applications need to use the same, limited GPU resources. The authors' approach, called "Wise Resource Sharing" (WRS), tries to efficiently schedule these deep learning jobs while taking into account their specific resource requirements.

Deep learning models can have very different resource needs, such as the amount of GPU memory they require. WRS tries to match these requirements with the available GPU resources in the cluster, allowing multiple jobs to run concurrently and share the GPUs. This can lead to better utilization of the cluster's resources compared to simpler scheduling approaches.

Technical Explanation

The key idea behind the "Wise Resource Sharing" (WRS) algorithm is to intelligently schedule deep learning jobs in a multi-tenant GPU cluster, taking into account the resource requirements of each job.

WRS first profiles the resource needs of each incoming deep learning job, such as the amount of GPU memory required. It then uses this information to decide which jobs can run together on the same GPUs. Per the paper's abstract, the heuristic (SJF-BSBF) selects pairs of jobs for GPU sharing and sets the runtime parameters for each pair, namely the sub-batch size and the time point at which sharing starts. Gradient accumulation is used so that the jobs' training settings and convergence accuracy are preserved.
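The idea of using profiled memory needs to decide which jobs can share a GPU can be illustrated with a small sketch. This is not the paper's SJF-BSBF algorithm: it is a plain first-fit placement in shortest-job-first order, and the Job fields, the 16 GB capacity, and the limit of two jobs per GPU are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    mem_gb: float     # profiled GPU memory need (hypothetical field)
    solo_time: float  # estimated run time when alone on a GPU (hypothetical)


def pack_jobs(jobs, gpu_mem_gb, max_per_gpu=2):
    """Place jobs shortest-first; a job joins the first GPU where it fits.

    A job fits if the GPU holds fewer than max_per_gpu jobs and the
    combined memory stays within gpu_mem_gb. Otherwise a new GPU is used.
    """
    gpus = []  # each entry is the list of jobs sharing one GPU
    for job in sorted(jobs, key=lambda j: j.solo_time):
        for group in gpus:
            used = sum(j.mem_gb for j in group)
            if len(group) < max_per_gpu and used + job.mem_gb <= gpu_mem_gb:
                group.append(job)
                break
        else:
            # No existing GPU can host this job, so allocate another one.
            gpus.append([job])
    return gpus


jobs = [
    Job("A", mem_gb=10, solo_time=5),
    Job("B", mem_gb=8, solo_time=3),
    Job("C", mem_gb=6, solo_time=1),
    Job("D", mem_gb=12, solo_time=4),
]
placement = pack_jobs(jobs, gpu_mem_gb=16)
print([[j.name for j in group] for group in placement])
```

With these made-up numbers, C and B share one GPU (14 GB in total), while D and A each get their own because no remaining pairing fits in 16 GB. A real scheduler would also have to weigh the slowdown that co-located jobs impose on each other, which this sketch ignores.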

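The authors compare WRS to other scheduling approaches, both heuristic-based and learning-based. They show that WRS achieves higher GPU utilization and lower job completion times, especially as the cluster becomes more heavily loaded. The abstract reports a 27-33% reduction in average job completion time relative to state-of-the-art preemptive schedulers, and an improvement of up to 17% over an aggressive GPU sharing baseline (SJF-FFS) on large-scale traces.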
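To see why sharing can lower average job completion time, and why sharing indiscriminately can hurt, consider a toy model of two jobs on one GPU. The model is an assumption for illustration only and is not taken from the paper: while both jobs share the GPU, each runs slower by the same factor `slowdown`, and once the shorter job finishes, the other runs at full speed.

```python
def avg_jct_sequential(t_a, t_b):
    """Average completion time when the shorter job runs first, alone."""
    first, second = sorted((t_a, t_b))
    return (first + (first + second)) / 2


def avg_jct_shared(t_a, t_b, slowdown):
    """Average completion time when both jobs start together on one GPU.

    Assumed model: during sharing each job progresses at rate 1/slowdown;
    after the shorter job finishes, the longer one runs at full speed.
    """
    short, long_ = sorted((t_a, t_b))
    finish_short = short * slowdown
    # During the shared phase the longer job completed `short` units of work.
    remaining = long_ - short
    finish_long = finish_short + remaining
    return (finish_short + finish_long) / 2


print(avg_jct_sequential(4, 10))    # 9.0
print(avg_jct_shared(4, 10, 1.25))  # 8.0  (mild interference: sharing helps)
print(avg_jct_shared(4, 10, 2.0))   # 11.0 (heavy interference: sharing hurts)
```

In this simplified model, sharing only pays off when interference is mild, which is one intuition for why a scheduler would rank candidate pairs by their expected sharing benefit rather than co-locating jobs at every opportunity.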

Critical Analysis

The paper presents a promising approach for scheduling deep learning jobs in shared GPU clusters. However, there are a few potential limitations and areas for further research:

The abstract reports experiments on both physical deep learning workloads and trace-driven simulations. Testing the algorithm on live production clusters would still be valuable to better understand its performance.
The paper does not discuss how WRS would handle dynamic changes in the cluster, such as new jobs arriving or GPUs becoming unavailable. Hierarchical resource partitioning may be a useful approach to address this.
The authors mention that WRS assumes jobs are independent and do not have dependencies. Extending the algorithm to handle interdependent deep learning workflows could be a valuable future direction.

Overall, the WRS algorithm appears to be a useful contribution to the problem of scheduling deep learning jobs in shared GPU clusters, but further research and real-world validation would help solidify its practical benefits.

Conclusion

This paper presents a new scheduling algorithm called "Wise Resource Sharing" (WRS) for deep learning jobs in multi-tenant GPU clusters. WRS aims to efficiently schedule these jobs by considering their specific resource requirements and intelligently sharing the available GPU resources among them.

The authors show that WRS can outperform other scheduling approaches in terms of GPU utilization and job completion times, especially as the cluster becomes more heavily loaded. The evaluation combines physical deep learning workloads with trace-driven simulations, and the WRS algorithm represents a promising step towards better resource management in shared deep learning infrastructure.

Further research to extend WRS to handle dynamic cluster changes and interdependent deep learning workflows could help improve its practical applicability and impact in real-world environments.



Related Papers

Scheduling Deep Learning Jobs in Multi-Tenant GPU Clusters via Wise Resource Sharing

Yizhou Luo, Qiang Wang, Shaohuai Shi, Jiaxin Lai, Shuhan Qi, Jiajia Zhang, Xuan Wang

Deep learning (DL) has demonstrated significant success across diverse fields, leading to the construction of dedicated GPU accelerators within GPU clusters for high-quality training services. Efficient scheduler designs for such clusters are vital to reduce operational costs and enhance resource utilization. While recent schedulers have shown impressive performance in optimizing DL job performance and cluster utilization through periodic reallocation or selection of GPU resources, they also encounter challenges such as preemption and migration overhead, along with potential DL accuracy degradation. Nonetheless, few explore the potential benefits of GPU sharing to improve resource utilization and reduce job queuing times. Motivated by these insights, we present a job scheduling model allowing multiple jobs to share the same set of GPUs without altering job training settings. We introduce SJF-BSBF (shortest job first with best sharing benefit first), a straightforward yet effective heuristic scheduling algorithm. SJF-BSBF intelligently selects job pairs for GPU resource sharing and runtime settings (sub-batch size and scheduling time point) to optimize overall performance while ensuring DL convergence accuracy through gradient accumulation. In experiments with both physical DL workloads and trace-driven simulations, even as a preemption-free policy, SJF-BSBF reduces the average job completion time by 27-33% relative to the state-of-the-art preemptive DL schedulers. Moreover, SJF-BSBF can wisely determine the optimal resource sharing settings, such as the sharing time point and sub-batch size for gradient accumulation, outperforming the aggressive GPU sharing approach (baseline SJF-FFS policy) by up to 17% in large-scale traces.

7/19/2024


ESG: Pipeline-Conscious Efficient Scheduling of DNN Workflows on Serverless Platforms with Shareable GPUs

Xinning Hui, Yuanchao Xu, Zhishan Guo, Xipeng Shen

Recent years have witnessed increasing interest in machine learning inferences on serverless computing for its auto-scaling and cost effective properties. Existing serverless computing, however, lacks effective job scheduling methods to handle the schedule space dramatically expanded by GPU sharing, task batching, and inter-task relations. Prior solutions have dodged the issue by neglecting some important factors, leaving some large performance potential locked. This paper presents ESG, a new scheduling algorithm that directly addresses the difficulties. ESG treats sharable GPU as a first-order factor in scheduling. It employs an optimality-guided adaptive method by combining A*-search and a novel dual-blade pruning to dramatically prune the scheduling space without compromising the quality. It further introduces a novel method, dominator-based SLO distribution, to ensure the scalability of the scheduler. The results show that ESG can significantly improve the SLO hit rates 61%-80% while saving 47%-187% costs over prior work.

4/26/2024


SGPRS: Seamless GPU Partitioning Real-Time Scheduler for Periodic Deep Learning Workloads

Amir Fakhim Babaei, Thidapat Chantem

Deep Neural Networks (DNNs) are useful in many applications, including transportation, healthcare, and speech recognition. Despite various efforts to improve accuracy, few works have studied DNN in the context of real-time requirements. Coarse resource allocation and sequential execution in existing frameworks result in underutilization. In this work, we conduct GPU speedup gain analysis and propose SGPRS, the first real-time GPU scheduler considering zero configuration partition switch. The proposed scheduler not only meets more deadlines for parallel tasks but also sustains overall performance beyond the pivot point.

6/17/2024


Missile: Fine-Grained, Hardware-Level GPU Resource Isolation for Multi-Tenant DNN Inference

Yongkang Zhang, Haoxuan Yu, Chenxia Han, Cheng Wang, Baotong Lu, Yang Li, Xiaowen Chu, Huaicheng Li

Colocating high-priority, latency-sensitive (LS) and low-priority, best-effort (BE) DNN inference services reduces the total cost of ownership (TCO) of GPU clusters. Limited by bottlenecks such as VRAM channel conflicts and PCIe bus contentions, existing GPU sharing solutions are unable to avoid resource conflicts among concurrently executing tasks, failing to achieve both low latency for LS tasks and high throughput for BE tasks. To bridge this gap, this paper presents Missile, a general GPU sharing solution for multi-tenant DNN inference on NVIDIA GPUs. Missile approximates fine-grained GPU hardware resource isolation between multiple LS and BE DNN tasks at software level. Through comprehensive reverse engineering, Missile first reveals a general VRAM channel hash mapping architecture of NVIDIA GPUs and eliminates VRAM channel conflicts using software-level cache coloring. It also isolates the PCIe bus and fairly allocates PCIe bandwidth using completely fair scheduler. We evaluate 12 mainstream DNNs with synthetic and real-world workloads on four GPUs. The results show that compared to the state-of-the-art GPU sharing solutions, Missile reduces tail latency for LS services by up to ~50%, achieves up to 6.1x BE job throughput, and allocates PCIe bus bandwidth to tenants on-demand for optimal performance.

7/30/2024