Efficient Multi-Processor Scheduling in Increasingly Realistic Models

2404.15246

Published 4/24/2024 by P'al Andr'as Papp, Georg Anegg, Aikaterini Karanasiou, A. N. Yzelman

🔄

Abstract

We study the problem of efficiently scheduling a computational DAG on multiple processors. The majority of previous works have developed and compared algorithms for this problem in relatively simple models; in contrast to this, we analyze this problem in a more realistic model that captures many real-world aspects, such as communication costs, synchronization costs, and the hierarchical structure of modern processing architectures. For this we extend the well-established BSP model of parallel computing with non-uniform memory access (NUMA) effects. We then develop a range of new scheduling algorithms to minimize the scheduling cost in this more complex setting: several initialization heuristics, a hill-climbing local search method, and several approaches that formulate (and solve) the scheduling problem as an Integer Linear Program (ILP). We combine these algorithms into a single framework, and conduct experiments on a diverse set of real-world computational DAGs to show that the resulting scheduler significantly outperforms both academic and practical baselines. In particular, even without NUMA effects, our scheduler finds solutions of 24%-44% smaller cost on average than the baselines, and in case of NUMA effects, it achieves up to a factor $2.5times$ improvement compared to the baselines. Finally, we also develop a multilevel scheduling algorithm, which provides up to almost a factor $5times$ improvement in the special case when the problem is dominated by very high communication costs.

Create account to get full access

Overview

The paper focuses on the problem of efficiently scheduling a computational Directed Acyclic Graph (DAG) on multiple processors.
It extends the well-established Bulk Synchronous Parallel (BSP) model of parallel computing to capture real-world aspects like communication costs, synchronization costs, and the hierarchical structure of modern processing architectures.
The paper develops a range of new scheduling algorithms, including initialization heuristics, a hill-climbing local search method, and Integer Linear Programming (ILP) approaches, to minimize the scheduling cost in this more complex setting.
The algorithms are combined into a single framework and evaluated on a diverse set of real-world computational DAGs, showing significant improvements over academic and practical baselines.

Plain English Explanation

When you have a complex computational task that can be broken down into smaller, interdependent parts, you need to figure out the best way to schedule those parts across multiple processors to get the job done as efficiently as possible. This is a common problem in high-performance computing and data processing.

The researchers in this paper looked at this scheduling problem in a more realistic way, taking into account factors like the cost of communicating between processors and the way modern computer architectures are structured. They developed a range of new algorithms to tackle this more complex version of the scheduling problem.

These algorithms include some smart heuristics for initially dividing up the work, a method that tries to iteratively improve the schedule, and approaches that formulate the problem as a mathematical optimization problem and solve it using specialized techniques.

The researchers combined these algorithms into a single system and tested it on a variety of real-world computational tasks. They found that their system could schedule the work [object Object] than existing methods, even without considering the additional complexities of modern computer architectures. And in cases where communication costs were very high, their multilevel scheduling algorithm could achieve [object Object] of the baseline approaches.

Technical Explanation

The paper extends the well-established Bulk Synchronous Parallel (BSP) model of parallel computing to capture the non-uniform memory access (NUMA) effects found in modern processing architectures. This more realistic model accounts for communication costs and synchronization costs, which are often overlooked in simpler scheduling models.

The researchers develop several new scheduling algorithms to minimize the cost in this complex setting:

Initialization Heuristics: These use various strategies to initially divide the computational DAG across the available processors.
Hill-Climbing Local Search: This iteratively refines the initial schedule, trying to improve it step-by-step.
Integer Linear Programming (ILP) Approaches: These formulate the scheduling problem as an optimization problem and use ILP solvers to find optimal or near-optimal solutions.

The algorithms are combined into a unified scheduling framework and evaluated on a diverse set of real-world computational DAGs. Even without NUMA effects, the framework finds solutions that are [object Object] than academic and practical baselines. With NUMA effects, the improvements can be as high as [object Object].

The paper also develops a multilevel scheduling algorithm that provides [object Object] in cases where communication costs dominate the scheduling problem.

Critical Analysis

The paper provides a thorough and systematic exploration of scheduling algorithms for computational DAGs on modern multi-processor architectures. The extensions to the BSP model to account for NUMA effects and the range of algorithmic approaches developed are valuable contributions.

However, the paper does not address some potential limitations:

The experiments are conducted on a limited set of real-world DAGs, and the performance of the algorithms may vary for other types of computational workloads.
The ILP-based approaches, while powerful, may not scale well to very large scheduling problems due to the inherent complexity of solving ILP problems.
The paper does not provide a detailed analysis of the runtime and memory requirements of the different algorithms, which would be important for understanding their practical applicability.

Further research could explore:

Extending the algorithms to handle dynamic changes in the computational DAG or processor availability during execution.
Developing hyrbid approaches that combine the strengths of the different algorithmic techniques.
Investigating the impact of other real-world factors, such as processor heterogeneity or energy constraints, on the scheduling problem.

Conclusion

This paper presents a significant advancement in the field of computational DAG scheduling by developing a range of efficient algorithms that account for the realities of modern processing architectures. The results demonstrate substantial improvements over existing methods, particularly in the presence of non-uniform memory access effects and high communication costs.

The work has important implications for optimizing the performance of a wide range of high-performance computing and data processing applications. The techniques developed in this paper could be leveraged to improve the efficiency and scalability of large-scale distributed systems, scientific computing workflows, and other complex computational workloads.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Learning Interpretable Scheduling Algorithms for Data Processing Clusters

Zhibo Hu (Hye-Young), Chen Wang (Hye-Young), Helen (Hye-Young), Paik, Yanfeng Shu, Liming Zhu

Workloads in data processing clusters are often represented in the form of DAG (Directed Acyclic Graph) jobs. Scheduling DAG jobs is challenging. Simple heuristic scheduling algorithms are often adopted in practice in production data centres. There is much room for scheduling performance optimisation for cost saving. Recently, reinforcement learning approaches (like decima) have been attempted to optimise DAG job scheduling and demonstrate clear performance gain in comparison to traditional algorithms. However, reinforcement learning (RL) approaches face their own problems in real-world deployment. In particular, their black-box decision making processes and generalizability in unseen workloads may add a non-trivial burden to the cluster administrators. Moreover, adapting RL models on unseen workloads often requires significant amount of training data, which leaves edge cases run in a sub-optimal mode. To fill the gap, we propose a new method to distill a simple scheduling policy based on observations of the behaviours of a complex deep learning model. The simple model not only provides interpretability of scheduling decisions, but also adaptive to edge cases easily through tuning. We show that our method achieves high fidelity to the decisions made by deep learning models and outperforms these models when additional heuristics are taken into account.

5/30/2024

cs.DC

Differentiable Combinatorial Scheduling at Scale

Mingju Liu, Yingjie Li, Jiaqi Yin, Zhiru Zhang, Cunxi Yu

This paper addresses the complex issue of resource-constrained scheduling, an NP-hard problem that spans critical areas including chip design and high-performance computing. Traditional scheduling methods often stumble over scalability and applicability challenges. We propose a novel approach using a differentiable combinatorial scheduling framework, utilizing Gumbel-Softmax differentiable sampling technique. This new technical allows for a fully differentiable formulation of linear programming (LP) based scheduling, extending its application to a broader range of LP formulations. To encode inequality constraints for scheduling tasks, we introduce textit{constrained Gumbel Trick}, which adeptly encodes arbitrary inequality constraints. Consequently, our method facilitates an efficient and scalable scheduling via gradient descent without the need for training data. Comparative evaluations on both synthetic and real-world benchmarks highlight our capability to significantly improve the optimization efficiency of scheduling, surpassing state-of-the-art solutions offered by commercial and open-source solvers such as CPLEX, Gurobi, and CP-SAT in the majority of the designs.

6/12/2024

cs.LG cs.AI cs.AR

👁️

Scheduling of Distributed Applications on the Computing Continuum: A Survey

Narges Mehran, Dragi Kimovski, Hermann Hellwagner, Dumitru Roman, Ahmet Soylu, Radu Prodan

The demand for distributed applications has significantly increased over the past decade, with improvements in machine learning techniques fueling this growth. These applications predominantly utilize Cloud data centers for high-performance computing and Fog and Edge devices for low-latency communication for small-size machine learning model training and inference. The challenge of executing applications with different requirements on heterogeneous devices requires effective methods for solving NP-hard resource allocation and application scheduling problems. The state-of-the-art techniques primarily investigate conflicting objectives, such as the completion time, energy consumption, and economic cost of application execution on the Cloud, Fog, and Edge computing infrastructure. Therefore, in this work, we review these research works considering their objectives, methods, and evaluation tools. Based on the review, we provide a discussion on the scheduling methods in the Computing Continuum.

5/2/2024

cs.DC

New!Joint Task Allocation and Scheduling for Multi-Hop Distributed Computing

Ke Ma, Junfei Xie

The rise of the Internet of Things and edge computing has shifted computing resources closer to end-users, benefiting numerous delay-sensitive, computation-intensive applications. To speed up computation, distributed computing is a promising technique that allows parallel execution of tasks across multiple compute nodes. However, current research predominantly revolves around the master-worker paradigm, limiting resource sharing within one-hop neighborhoods. This limitation can render distributed computing ineffective in scenarios with limited nearby resources or constrained/dynamic connectivity. In this paper, we address this limitation by introducing a new distributed computing framework that extends resource sharing beyond one-hop neighborhoods through exploring layered network structures and multi-hop routing. Our framework involves transforming the network graph into a sink tree and formulating a joint optimization problem based on the layered tree structure for task allocation and scheduling. To solve this problem, we propose two exact methods that find optimal solutions and three heuristic strategies to improve efficiency and scalability. The performances of these methods are analyzed and evaluated through theoretical analyses and comprehensive simulation studies. The results demonstrate their promising performances over the traditional distributed computing and computation offloading strategies.

7/2/2024

cs.DC cs.NI