Network-Offloaded Bandwidth-Optimal Broadcast and Allgather for Distributed AI

Read original: arXiv:2408.13356 - Published 8/27/2024 by Mikhail Khalilov, Salvatore Di Girolamo, Marcin Chrapek, Rami Nudelman, Gil Bloch, Torsten Hoefler

Overview

This paper proposes network-offloaded broadcast and allgather algorithms that optimize bandwidth usage for distributed AI training, motivated by Fully Sharded Data Parallel (FSDP) pipelines in which concurrent collectives compete for injection bandwidth.
The authors build the algorithms on hardware multicast and offload the protocol to a SmartNIC, freeing the host from driving the communication.
According to the abstract, the allgather algorithm achieves a 2x traffic reduction on a 188-node testbed, and the SmartNIC-offloaded progress engine is shown to scale to 1.6 Tbit/s links.

Plain English Explanation

In distributed AI systems, training models often requires sharing large amounts of data across many machines. This paper presents new broadcast and allgather algorithms to manage that communication efficiently.

The key idea is to perform some of the communication tasks directly in the network hardware, rather than relying on the CPUs or GPUs to handle it. This network offloading approach can improve overall system efficiency by reducing the computational load on the training nodes.

The broadcast and allgather algorithms optimize the way data is shared across the cluster to minimize the total bandwidth used. This can lead to faster training times compared to existing communication methods.

The researchers report a 2x reduction in allgather traffic on a 188-node testbed. By offloading the protocol to SmartNICs, they free up more of the computational resources on the individual nodes for the actual machine learning work.

Technical Explanation

The paper introduces two new algorithms for network-offloaded bandwidth-optimal broadcast and allgather in distributed AI training environments.

For the broadcast algorithm, the authors use hardware multicast to build a constant-time reliable protocol. With multicast, a sender injects its data once and the network replicates it toward the receivers, which avoids redundant transmissions from the source.
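
The paper's abstract describes a reliable broadcast built on hardware multicast. As a rough illustration of what "reliable" has to add on top of a lossy one-to-many primitive, the sketch below simulates a sender that re-multicasts any chunk some receiver is still missing. Everything here is an assumption for illustration: the function name `reliable_broadcast`, the independent-loss model, and the retransmission loop are hypothetical and are not the paper's constant-time protocol.

```python
import random


def reliable_broadcast(chunks, num_receivers, loss_rate=0.2, seed=0):
    """Simulate broadcasting `chunks` to receivers over a lossy multicast.

    Hypothetical model: each multicast reaches every receiver independently
    with probability 1 - loss_rate. The sender re-multicasts any chunk that
    at least one receiver still reports as missing.
    """
    rng = random.Random(seed)
    received = [dict() for _ in range(num_receivers)]
    multicasts_sent = 0
    pending = set(range(len(chunks)))

    while pending:
        for idx in sorted(pending):
            multicasts_sent += 1  # one injection by the sender, replicated by the network
            for r in range(num_receivers):
                if rng.random() >= loss_rate:  # this copy was not dropped
                    received[r][idx] = chunks[idx]
        # Receivers report missing chunks (a negative-acknowledgement style check).
        pending = {
            i for i in range(len(chunks))
            if any(i not in r for r in received)
        }

    buffers = [[r[i] for i in range(len(chunks))] for r in received]
    return buffers, multicasts_sent


buffers, sent = reliable_broadcast(["c0", "c1", "c2"], num_receivers=4)
```

With `loss_rate=0.0` the sender injects each chunk exactly once, which is the best case that multicast makes possible; any loss shows up as extra multicasts rather than as per-receiver unicast copies.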

The allgather algorithm uses the reliable broadcast as its building block: the authors construct a bandwidth-optimal allgather schedule from broadcasts, then extract the parallelism in that schedule and map it to a SmartNIC specialized for hiding the cost of data movement.
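
The abstract describes broadcast as a building block for the allgather schedule. A minimal sketch of that composition follows, assuming each rank simply broadcasts its own shard in turn. The helper names and the in-memory "network" are hypothetical, and the paper's actual schedule and its parallelism are not reproduced here.

```python
def broadcast(root, payload, buffers):
    """Deliver `payload` from rank `root` into slot `root` of every rank's buffer."""
    for rank_buffer in buffers:
        rank_buffer[root] = payload


def allgather_via_broadcast(shards):
    """Allgather expressed as one broadcast per rank.

    shards[i] is the data initially owned by rank i. On return, every rank's
    buffer holds all shards in rank order.
    """
    num_ranks = len(shards)
    buffers = [[None] * num_ranks for _ in range(num_ranks)]
    for root, shard in enumerate(shards):
        broadcast(root, shard, buffers)
    return buffers


shards = ["w0", "w1", "w2", "w3"]  # e.g. one parameter shard per rank
result = allgather_via_broadcast(shards)
```

After the call, every entry of `result` equals `shards`, which is the defining property of allgather. In this sketch the broadcasts run one after another; a real schedule would overlap them.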

According to the abstract, the allgather algorithm achieves a 2x traffic reduction on a 188-node testbed. The authors also report that their SmartNIC-offloaded collective progress engine can scale to the next generation of 1.6 Tbit/s links.
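
A back-of-the-envelope model helps show why multicast can cut allgather traffic. The sketch below compares per-rank bytes for a textbook ring allgather against an idealized multicast-based allgather in which each rank injects its shard once and the network replicates it. These formulas are standard idealizations that ignore retransmissions and control messages; they are not the paper's measurement methodology, and the function names are hypothetical. The rank count of 188 is chosen only to match the testbed size given in the abstract.

```python
def ring_allgather_traffic(num_ranks, shard_bytes):
    """Bytes sent and received per rank in a ring allgather.

    Each rank forwards one shard per step for num_ranks - 1 steps.
    """
    sent = (num_ranks - 1) * shard_bytes
    received = (num_ranks - 1) * shard_bytes
    return sent, received


def multicast_allgather_traffic(num_ranks, shard_bytes):
    """Bytes sent and received per rank with idealized, lossless multicast.

    Each rank injects only its own shard; the network replicates it.
    """
    sent = shard_bytes
    received = (num_ranks - 1) * shard_bytes
    return sent, received


num_ranks, shard_bytes = 188, 1 << 20  # 1 MiB shards, illustrative only
ring_sent, ring_recv = ring_allgather_traffic(num_ranks, shard_bytes)
mc_sent, mc_recv = multicast_allgather_traffic(num_ranks, shard_bytes)

# Ratio of total per-rank NIC traffic (sent + received) between the two models.
total_ratio = (ring_sent + ring_recv) / (mc_sent + mc_recv)
# Ratio of injected (sent) bytes, the resource that concurrent collectives compete for.
injection_ratio = ring_sent / mc_sent
```

Under these assumptions the received bytes are identical in both models, so the saving comes entirely from the send side: injected bytes per rank drop from `(num_ranks - 1) * shard_bytes` to `shard_bytes`, and total per-rank traffic falls by a factor that approaches 2 as the rank count grows.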

Offloading matters because, in FSDP training, outstanding collectives such as allgather and reduce-scatter can compete for injection bandwidth and create pipeline bubbles. Sending less traffic per allgather and running the protocol on the SmartNIC, rather than on the training nodes' CPUs and GPUs, leaves the hosts free to focus on the machine learning workload.

Critical Analysis

The paper provides a thorough evaluation of the proposed algorithms, including comparisons to state-of-the-art communication primitives under various cluster configurations and workloads. The results convincingly demonstrate the benefits of network offloading for improving bandwidth efficiency and overall training throughput.

However, the authors do not discuss some potential limitations or caveats of their approach. For example, the reliance on specialized network hardware to implement the algorithms may limit their applicability in environments without such capabilities. Additionally, the impact of potential network congestion or failures on the performance of the algorithms is not explored.

Further research could investigate the robustness of the network-offloaded algorithms under more diverse network conditions, as well as their generalization to other types of collective communication patterns beyond broadcast and allgather. Exploring the integration of these techniques with emerging network technologies, such as in-network computing or programmable switches, could also be a fruitful direction.

Conclusion

This paper presents novel network-offloaded algorithms for bandwidth-optimal broadcast and allgather in distributed AI training. By performing key communication tasks directly in the network hardware, the authors demonstrate significant improvements in bandwidth utilization and overall training throughput compared to traditional software-based approaches.

The ability to offload communication to the network infrastructure is a promising direction for enhancing the efficiency of large-scale distributed ML systems. As AI models and datasets continue to grow, optimizing the underlying data movement and coordination primitives will be crucial for unlocking the full potential of distributed training.

The concepts and techniques introduced in this paper could have broader applicability beyond just distributed AI, potentially benefiting other fields that rely on efficient collective communication patterns, such as high-performance computing and data-intensive applications.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Network-Offloaded Bandwidth-Optimal Broadcast and Allgather for Distributed AI

Mikhail Khalilov, Salvatore Di Girolamo, Marcin Chrapek, Rami Nudelman, Gil Bloch, Torsten Hoefler

In the Fully Sharded Data Parallel (FSDP) training pipeline, collective operations can be interleaved to maximize the communication/computation overlap. In this scenario, outstanding operations such as Allgather and Reduce-Scatter can compete for the injection bandwidth and create pipeline bubbles. To address this problem, we propose a novel bandwidth-optimal Allgather collective algorithm that leverages hardware multicast. We use multicast to build a constant-time reliable Broadcast protocol, a building block for constructing an optimal Allgather schedule. Our Allgather algorithm achieves 2x traffic reduction on a 188-node testbed. To free the host side from running the protocol, we employ SmartNIC offloading. We extract the parallelism in our Allgather algorithm and map it to a SmartNIC specialized for hiding the cost of data movement. We show that our SmartNIC-offloaded collective progress engine can scale to the next generation of 1.6 Tbit/s links.

8/27/2024

Efficient All-to-All Collective Communication Schedules for Direct-Connect Topologies

Prithwish Basu, Liangyu Zhao, Jason Fantl, Siddharth Pal, Arvind Krishnamurthy, Joud Khoury

The all-to-all collective communications primitive is widely used in machine learning (ML) and high performance computing (HPC) workloads, and optimizing its performance is of interest to both ML and HPC communities. All-to-all is a particularly challenging workload that can severely strain the underlying interconnect bandwidth at scale. This paper takes a holistic approach to optimize the performance of all-to-all collective communications on supercomputer-scale direct-connect interconnects. We address several algorithmic and practical challenges in developing efficient and bandwidth-optimal all-to-all schedules for any topology and lowering the schedules to various runtimes and interconnect technologies. We also propose a novel topology that delivers near-optimal all-to-all performance.

4/29/2024

Optimizing Distributed ML Communication with Fused Computation-Collective Operations

Kishore Punniyamurthy, Khaled Hamidouche, Bradford M. Beckmann

In order to satisfy their ever-increasing capacity and compute requirements, machine learning models are distributed across multiple nodes using numerous parallelism strategies. As a result, collective communications are often on the critical path, and hiding their latency by overlapping kernel-granular communication and computation is difficult due to the absence of independent computation. In this work, we propose fusing computation with dependent collective communication by leveraging GPUs' massive parallelism and GPU-initiated communication. We have developed self-contained GPU kernels where workgroups (WGs) immediately communicate their results to remote GPUs when they complete their computation. Meanwhile, other WGs within the same kernel perform overlapping computation, maintaining high ALU utilization. We demonstrate our approach by creating three prototype fused operators (embedding + All-to-All, GEMV + AllReduce, and GEMM + All-to-All) to address the pervasive communication overheads observed in DLRM, Transformers and MoE model architectures. In order to demonstrate that our approach can be integrated into ML frameworks for wide adoption in production environments, we expose our fused operators as new PyTorch operators as well as extend the Triton framework to enable them. Our evaluations show that our approach can effectively overlap communication with computations, reducing their combined execution time compared with current collective library-based approaches. Our scale-up GEMV + AllReduce and GEMM + All-to-All implementations achieve up to 22% and 20% lower execution time, while our fused embedding + All-to-All reduces execution time by 20% and 31% for intra-node and inter-node configurations. Large scale-out simulations indicate that our approach reduces DLRM execution time by 21% for a 128-node system.

4/24/2024

Full-Stack Allreduce on Multi-Rail Networks

Enda Yu, Dezun Dong, Xiangke Liao

The high communication costs impede scalability in distributed systems. Multimodal models like Sora exacerbate this issue by requiring more resources than current networks can support. However, existing network architectures fail to address this gap. In this paper, we provide full-stack support for allreduce on multi-rail networks, aiming to overcome the scalability limitations of large-scale networks by facilitating collaborative data transfer across various networks. To achieve this, we propose the Nezha system, which integrates TCP, in-network computing protocol SHARP, and RDMA-based protocol GLEX. To maximize data transfer rates, Nezha incorporates a load balancing data allocation scheme based on cost feedback and combines exception handling to achieve reliable data transmission. Our experiments on a six-node cluster demonstrate that Nezha significantly enhances allreduce performance by 58% to 87% in homogeneous dual-rail configurations and offers considerable acceleration in heterogeneous settings, contingent on the performance variance among networks.

5/29/2024