Efficient All-to-All Collective Communication Schedules for Direct-Connect Topologies

Read original: arXiv:2309.13541 - Published 4/29/2024 by Prithwish Basu, Liangyu Zhao, Jason Fantl, Siddharth Pal, Arvind Krishnamurthy, Joud Khoury

Overview

This paper presents efficient algorithms for all-to-all collective communication in direct-connect network topologies, which are commonly used in machine learning (ML) and high-performance computing (HPC) systems.
The authors propose novel scheduling techniques that can significantly improve the performance of all-to-all communication patterns, a critical component of many distributed ML and HPC applications.
The proposed algorithms are evaluated through simulations and shown to outperform existing approaches in communication latency and throughput.

Plain English Explanation

In modern ML and HPC systems, the network connecting the compute nodes plays a crucial role in overall performance. One common design is the direct-connect topology, in which each node is wired directly to a small subset of other nodes rather than reaching them through a hierarchy of switches. For suitable workloads, this can provide faster and more efficient communication than traditional hierarchical network designs.
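To make "directly connected to a subset of other nodes" concrete, here is a minimal sketch that builds one well-known direct-connect topology, a 2D torus, as an adjacency list and measures hop counts with breadth-first search. The torus is an illustrative choice and is not taken from the paper; the function names `torus_2d` and `hop_counts` are hypothetical.

```python
from collections import deque

def torus_2d(rows, cols):
    """Adjacency list of a 2D torus (illustrative topology, not the paper's).

    Node (r, c) is linked to its four wraparound neighbors.
    Assumes rows >= 3 and cols >= 3 so the four neighbors are distinct.
    """
    adj = {}
    for r in range(rows):
        for c in range(cols):
            adj[(r, c)] = [
                ((r + 1) % rows, c), ((r - 1) % rows, c),
                (r, (c + 1) % cols), (r, (c - 1) % cols),
            ]
    return adj

def hop_counts(adj, src):
    """Shortest-path hop count from src to every reachable node (BFS)."""
    dist = {src: 0}
    queue = deque([src])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

adj = torus_2d(4, 4)
dist = hop_counts(adj, (0, 0))
print(len(adj[(0, 0)]), max(dist.values()))
```

Each node has only four links, so most traffic must be relayed through intermediate nodes. That relaying is what makes scheduling on such topologies non-trivial.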

A key communication pattern in distributed ML and HPC is the all-to-all collective, where each node needs to exchange data with every other node. The authors of this paper have developed new scheduling algorithms to optimize the performance of these all-to-all collectives in direct-connect network topologies.
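The all-to-all pattern itself can be sketched independently of any network. In the snippet below, `send[src][dst]` is the chunk that node `src` holds for node `dst`; after the collective, node `dst` holds one chunk from every source. The function name `all_to_all` and the list-of-lists layout are assumptions made for illustration, and the code only models the result of the exchange, not how it is routed.

```python
def all_to_all(send):
    """Return recv where recv[dst][src] == send[src][dst].

    Purely a model of the collective's semantics: the result is the
    transpose of the per-node send buffers.
    """
    n = len(send)
    return [[send[src][dst] for src in range(n)] for dst in range(n)]

# Three nodes, each holding one labeled chunk per destination.
send = [[f"{src}->{dst}" for dst in range(3)] for src in range(3)]
recv = all_to_all(send)
print(recv[2])  # chunks that end up on node 2
```

With N nodes, N × (N − 1) chunks must cross the network, which is why the pattern stresses interconnect bandwidth as systems grow.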

Their techniques aim to minimize the time and network resources needed to complete the all-to-all exchange, a workload that can severely strain interconnect bandwidth at scale in both distributed ML and large-scale HPC applications.

The authors evaluated their algorithms through simulations and showed that they can outperform existing approaches in terms of communication latency and throughput. This suggests that their techniques could be valuable for improving the efficiency and scalability of distributed ML and HPC systems that rely on direct-connect network topologies.

Technical Explanation

The paper begins by providing background on direct-connect fabrics, which are widely used in modern ML and HPC systems due to their advantages over traditional hierarchical network designs. The authors then introduce the problem of optimizing all-to-all collective communication in these direct-connect topologies, which is a critical component of many distributed applications.

The core of the paper presents the authors' novel scheduling algorithms for all-to-all collectives. These algorithms aim to minimize the overall communication time by carefully coordinating the data transfers between nodes to avoid contention and maximize parallelism. The authors leverage the structural properties of direct-connect topologies to develop their scheduling techniques, which include both centralized and distributed approaches.
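As a rough illustration of what a contention-free schedule looks like, the sketch below simulates all-to-all on a unidirectional ring, where each node can send one chunk per step over its single outgoing link. It uses a simple farthest-first rule to pick the chunk to forward. This is not the authors' algorithm; the ring topology, the one-chunk-per-link-per-step model, and the name `ring_all_to_all` are all assumptions chosen to keep the example small.

```python
def ring_all_to_all(n):
    """Simulate all-to-all on a unidirectional ring of n nodes.

    Each link carries at most one chunk per step, so no two chunks
    contend for the same link. Returns (steps, received), where
    received[j] lists the source ids whose chunks reached node j.
    """
    # held[i]: (src, dst) chunks currently at node i and not yet delivered.
    held = [[(i, d) for d in range(n) if d != i] for i in range(n)]
    received = [[] for _ in range(n)]
    steps = 0
    while any(held):
        sends = []
        for i in range(n):
            if held[i]:
                # Farthest-first: forward the chunk with the most hops left.
                chunk = max(held[i], key=lambda c: (c[1] - i) % n)
                held[i].remove(chunk)
                sends.append(((i + 1) % n, chunk))
        # Apply all transfers at once so the step is synchronous.
        for nxt, chunk in sends:
            if chunk[1] == nxt:
                received[nxt].append(chunk[0])
            else:
                held[nxt].append(chunk)
        steps += 1
    return steps, received

steps, received = ring_all_to_all(4)
print(steps, [sorted(r) for r in received])
```

On this toy ring every link is busy in every step until the exchange finishes, which is the kind of parallelism a good schedule tries to achieve. Richer topologies offer multiple paths per pair, so choosing routes and step ordering becomes the hard part.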

The proposed algorithms are evaluated through extensive simulations, where they are compared to existing all-to-all communication schemes. The results show that the authors' techniques can achieve significant improvements in communication latency and throughput, particularly as the number of nodes in the system increases. The authors also provide insights into the scalability and robustness of their algorithms under various network conditions and communication patterns.
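One way to read throughput results of this kind is against a simple counting bound. The inequality below is a standard hop-count argument, not a formula quoted from the paper, and the symbols are introduced here for illustration: N nodes, every ordered pair exchanging m bytes, a set E of directed links each with bandwidth B, and d(u,v) the shortest-path hop count from u to v. Every chunk must cross at least d(u,v) links, and all links together move at most |E|·B bytes per unit time, so the completion time T satisfies:

```latex
T \;\ge\; \frac{m \sum_{u \ne v} d(u,v)}{|E| \, B}
```

For example, on a unidirectional ring the distances from any node sum to N(N-1)/2 and there are N links, so T is at least m·N(N-1)/(2B). A schedule that meets such a bound keeps every link busy with useful traffic for the whole exchange.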

Critical Analysis

The paper presents a thorough and well-designed study of all-to-all collective communication in direct-connect network topologies. The authors' algorithms show promising performance improvements over existing approaches, which could be highly beneficial for distributed ML and HPC applications that rely on efficient all-to-all communication.

However, the paper does not address several potential limitations and areas for further research. For example, the authors only evaluate their algorithms through simulations and do not provide any real-world experimental results. It would be valuable to see how their techniques perform on actual hardware and network configurations, as well as under more realistic communication patterns and workloads.

Additionally, the paper does not explore the trade-offs between the centralized and distributed scheduling approaches, or the sensitivity of the algorithms to factors like network congestion, node failures, or heterogeneous communication speeds. Investigating these aspects could provide a more comprehensive understanding of the strengths and weaknesses of the proposed techniques.

Finally, the authors do not discuss the computational complexity of their scheduling algorithms and how they might scale as the number of nodes in the system increases. This information would be valuable for assessing the practical feasibility and applicability of the proposed approaches in large-scale, production-level ML and HPC systems.

Conclusion

This paper presents innovative scheduling algorithms for optimizing all-to-all collective communication in direct-connect network topologies, which are widely used in modern ML and HPC systems. The authors' techniques demonstrate significant improvements in communication latency and throughput over existing approaches, suggesting they could be valuable for enhancing the performance and scalability of distributed applications that rely on efficient all-to-all communication patterns.

While the paper provides a strong technical foundation, further research is needed to address potential limitations and expand the real-world applicability of the proposed algorithms. Evaluating their performance on actual hardware, exploring trade-offs between centralized and distributed scheduling, and assessing computational complexity would all be valuable next steps in this line of research.

Overall, this work represents an important contribution to the field of high-performance distributed computing, with the potential to have a meaningful impact on the design and optimization of future ML and HPC systems.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Efficient All-to-All Collective Communication Schedules for Direct-Connect Topologies

Prithwish Basu, Liangyu Zhao, Jason Fantl, Siddharth Pal, Arvind Krishnamurthy, Joud Khoury

The all-to-all collective communications primitive is widely used in machine learning (ML) and high performance computing (HPC) workloads, and optimizing its performance is of interest to both ML and HPC communities. All-to-all is a particularly challenging workload that can severely strain the underlying interconnect bandwidth at scale. This paper takes a holistic approach to optimize the performance of all-to-all collective communications on supercomputer-scale direct-connect interconnects. We address several algorithmic and practical challenges in developing efficient and bandwidth-optimal all-to-all schedules for any topology and lowering the schedules to various runtimes and interconnect technologies. We also propose a novel topology that delivers near-optimal all-to-all performance.

4/29/2024

Efficient Direct-Connect Topologies for Collective Communications

Liangyu Zhao, Siddharth Pal, Tapan Chugh, Weiyang Wang, Jason Fantl, Prithwish Basu, Joud Khoury, Arvind Krishnamurthy

We consider the problem of distilling efficient network topologies for collective communications. We provide an algorithmic framework for constructing direct-connect topologies optimized for the latency vs. bandwidth trade-off associated with the workload. Our approach synthesizes many different topologies and schedules for a given cluster size and degree and then identifies the appropriate topology and schedule for a given workload. Our algorithms start from small, optimal base topologies and associated communication schedules and use techniques that can be iteratively applied to derive much larger topologies and schedules. Additionally, we incorporate well-studied large-scale graph topologies into our algorithmic framework by producing efficient collective schedules for them using a novel polynomial-time algorithm. Our evaluation uses multiple testbeds and large-scale simulations to demonstrate significant performance benefits from our derived topologies and schedules.

5/14/2024

Network-Offloaded Bandwidth-Optimal Broadcast and Allgather for Distributed AI

Mikhail Khalilov, Salvatore Di Girolamo, Marcin Chrapek, Rami Nudelman, Gil Bloch, Torsten Hoefler

In the Fully Sharded Data Parallel (FSDP) training pipeline, collective operations can be interleaved to maximize the communication/computation overlap. In this scenario, outstanding operations such as Allgather and Reduce-Scatter can compete for the injection bandwidth and create pipeline bubbles. To address this problem, we propose a novel bandwidth-optimal Allgather collective algorithm that leverages hardware multicast. We use multicast to build a constant-time reliable Broadcast protocol, a building block for constructing an optimal Allgather schedule. Our Allgather algorithm achieves 2x traffic reduction on a 188-node testbed. To free the host side from running the protocol, we employ SmartNIC offloading. We extract the parallelism in our Allgather algorithm and map it to a SmartNIC specialized for hiding the cost of data movement. We show that our SmartNIC-offloaded collective progress engine can scale to the next generation of 1.6 Tbit/s links.

8/27/2024

Optimizing Distributed ML Communication with Fused Computation-Collective Operations

Kishore Punniyamurthy, Khaled Hamidouche, Bradford M. Beckmann

In order to satisfy their ever-increasing capacity and compute requirements, machine learning models are distributed across multiple nodes using numerous parallelism strategies. As a result, collective communications are often on the critical path, and hiding their latency by overlapping kernel-granular communication and computation is difficult due to the absence of independent computation. In this work, we propose fusing computation with dependent collective communication by leveraging GPUs' massive parallelism and GPU-initiated communication. We have developed self-contained GPU kernels where workgroups (WGs) immediately communicate their results to remote GPUs when they complete their computation. Meanwhile, other WGs within the same kernel perform overlapping computation, maintaining high ALU utilization. We demonstrate our approach by creating three prototype fused operators (embedding + All-to-All, GEMV + AllReduce, and GEMM + All-to-All) to address the pervasive communication overheads observed in DLRM, Transformers and MoE model architectures. In order to demonstrate that our approach can be integrated into ML frameworks for wide adoption in production environments, we expose our fused operators as new PyTorch operators as well as extend the Triton framework to enable them. Our evaluations show that our approach can effectively overlap communication with computations, subsequently reducing their combined execution time compared with current collective library-based approaches. Our scale-up GEMV + AllReduce and GEMM + All-to-All implementations achieve up to 22% and 20% lower execution time, while our fused embedding + All-to-All reduces execution time by 20% and 31% for intra-node and inter-node configurations. Large scale-out simulations indicate that our approach reduces DLRM execution time by 21% for a 128-node system.

4/24/2024