Full-Stack Allreduce on Multi-Rail Networks

Read original: arXiv:2405.17870 - Published 5/29/2024 by Enda Yu, Dezun Dong, Xiangke Liao

Overview

This paper introduces an approach, referred to here as "Full-Stack Allreduce" and implemented in a system the authors call Nezha, that aims to improve the performance of distributed computing systems on multi-rail networks. Nezha integrates TCP, the in-network computing protocol SHARP, and the RDMA-based protocol GLEX.
The key ideas are to leverage in-network computing capabilities and to coordinate multi-rail communication to speed up Allreduce, a collective communication primitive widely used in distributed machine learning and high-performance computing.
In experiments on a six-node cluster, the authors report that Nezha improves Allreduce performance by 58% to 87% in homogeneous dual-rail configurations, with gains in heterogeneous settings that depend on the performance gap between the networks.

Plain English Explanation

In modern distributed computing systems, data needs to be shared and combined across multiple computers or "nodes" to solve complex problems. One common operation used for this is called "Allreduce", where each node contributes some data and the final result is shared back with all nodes.
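To make the semantics concrete, here is a minimal single-process sketch of what Allreduce computes, assuming summation as the reduction operator. The function name `allreduce_sum` is hypothetical and is not taken from the paper; a real system would exchange data over a network rather than loop over in-memory lists.

```python
from typing import List


def allreduce_sum(contributions: List[List[float]]) -> List[List[float]]:
    """Simulate Allreduce with a sum: every node ends up with the element-wise total."""
    length = len(contributions[0])
    if any(len(c) != length for c in contributions):
        raise ValueError("all nodes must contribute vectors of the same length")
    # Reduce step: combine the i-th element from every node.
    reduced = [sum(column) for column in zip(*contributions)]
    # "All" step: every node receives its own copy of the reduced vector.
    return [list(reduced) for _ in contributions]


# Three nodes, each holding a two-element vector (for example, local gradients).
nodes = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
result = allreduce_sum(nodes)
print(result)  # every node holds [9.0, 12.0]
```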

The Full-Stack Allreduce technique introduced in this paper tries to make this Allreduce process faster, especially in systems that use multiple network connections ("multi-rail") between nodes. The key ideas are:

Leveraging in-network computing: Rather than just blindly transferring data between nodes, the network switches themselves can be used to perform some of the Allreduce computation, reducing the overall work needed.
Coordinating multi-rail communication: When there are multiple network connections between nodes, the paper shows how to intelligently coordinate the use of these connections to maximize the parallelism and throughput of the Allreduce operation.

By combining these two innovations, the Full-Stack Allreduce approach is able to significantly speed up the critical Allreduce step compared to existing solutions. This can lead to faster training of large-scale machine learning models or more efficient execution of other distributed computing workloads.
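The in-network computing idea can be illustrated with a toy traffic model. The sketch below is an assumption-laden simplification, not the SHARP protocol: it only counts how many vector elements each node must send when hosts exchange full vectors with each other versus when a single switch sums the contributions and returns the result. Both function names are hypothetical.

```python
from typing import List, Tuple


def host_exchange(vectors: List[List[float]]) -> Tuple[List[float], int]:
    """Naive host-only scheme: each node sends its full vector to every other node."""
    n = len(vectors)
    result = [sum(column) for column in zip(*vectors)]
    sent_per_node = (n - 1) * len(vectors[0])
    return result, sent_per_node


def switch_aggregate(vectors: List[List[float]]) -> Tuple[List[float], int]:
    """Toy in-network scheme: each node sends its vector once; the switch sums and replies."""
    result = [sum(column) for column in zip(*vectors)]
    sent_per_node = len(vectors[0])
    return result, sent_per_node


# Four nodes, each contributing a four-element vector.
vectors = [[float(i)] * 4 for i in range(1, 5)]
host_result, host_sent = host_exchange(vectors)
switch_result, switch_sent = switch_aggregate(vectors)
print(host_result == switch_result, host_sent, switch_sent)
```

The naive baseline overstates the gap: bandwidth-efficient host algorithms such as ring Allreduce send roughly 2(N-1)/N of the vector per node, so the benefit of switch-side reduction in practice is smaller than this toy comparison suggests.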

Technical Explanation

The Full-Stack Allreduce technique leverages two key ideas to improve the performance of the Allreduce operation in distributed systems:

In-Network Computing: Rather than just relaying data between nodes, the network switches are used to perform partial reductions of the contributed data. This reduces the overall amount of data that needs to be transferred, speeding up the Allreduce process.
Multi-Rail Coordination: When multiple network connections ("rails") exist between nodes, the system splits the Allreduce workload across them using a load-balancing data allocation scheme driven by cost feedback, combined with exception handling for reliable transmission. This allows for better utilization of the available network bandwidth.
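One way to picture splitting work across rails based on observed cost is sketched below. This is a hypothetical illustration and not the paper's allocation scheme: it assumes each rail's weight is simply its measured throughput from the previous round, and the names `split_by_weights`, `update_weights`, and `true_rates` are invented for the example.

```python
from typing import List


def split_by_weights(n_items: int, weights: List[float]) -> List[int]:
    """Return per-rail chunk sizes proportional to the weights, summing to n_items."""
    total = sum(weights)
    sizes = [int(n_items * w / total) for w in weights]
    sizes[-1] += n_items - sum(sizes)  # give any rounding remainder to the last rail
    return sizes


def update_weights(weights: List[float], sizes: List[int], times: List[float]) -> List[float]:
    """Feedback step: set each weight to the observed throughput (items per second)."""
    return [s / t if s > 0 and t > 0 else w for w, s, t in zip(weights, sizes, times)]


# Assumed ground truth, unknown to the scheduler: rail 1 is four times faster than rail 0.
true_rates = [100.0, 400.0]
weights = [1.0, 1.0]  # start with an even split
history = []
for _ in range(2):
    sizes = split_by_weights(1000, weights)
    times = [s / r for s, r in zip(sizes, true_rates)]  # simulated transfer times
    history.append((sizes, max(times)))  # the round finishes when the slowest rail does
    weights = update_weights(weights, sizes, times)

print(history)
```

In this simulation the first round splits the data evenly and is held back by the slower rail; after one feedback step the split follows the measured rates and both rails finish together.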

The authors evaluate their system, which they call Nezha, on a six-node cluster. They report that it improves Allreduce performance by 58% to 87% in homogeneous dual-rail configurations, and that in heterogeneous settings the acceleration depends on the performance variance among the networks.

Critical Analysis

The Full-Stack Allreduce paper presents a well-motivated design for improving the performance of a critical operation in distributed computing. It uses the capabilities of modern network hardware, in particular in-network reduction and multiple rails, to optimize the Allreduce primitive.

That said, the paper does not address some potential limitations or areas for further research:

The experiments were conducted on a six-node cluster, so it's unclear how the approach would scale to larger deployments with hundreds or thousands of nodes.
The reliance on specialized network hardware (i.e., switches with in-network computing capabilities) may limit the wider applicability of the technique, as such hardware may not be available in all distributed computing environments.
The paper does not explore the impact of Full-Stack Allreduce on other collective communication primitives beyond Allreduce, or how it might interact with different distributed algorithms and workloads.

Overall, the Full-Stack Allreduce approach is a promising contribution to the field of distributed computing, but further research and evaluation would be helpful to better understand its broader applicability and limitations.

Conclusion

The Full-Stack Allreduce technique introduced in this paper offers a novel way to improve the performance of the critical Allreduce operation in distributed computing systems. By leveraging in-network computing capabilities and coordinating the use of multi-rail network connections, the authors demonstrate significant speedups compared to existing solutions.

This work has the potential to enable faster training of large-scale machine learning models and more efficient execution of other distributed computing workloads. While the paper does not address all possible limitations, it represents an important step forward in optimizing the core communication primitives that underpin modern distributed systems.

Related Papers

Full-Stack Allreduce on Multi-Rail Networks

Enda Yu, Dezun Dong, Xiangke Liao

The high communication costs impede scalability in distributed systems. Multimodal models like Sora exacerbate this issue by requiring more resources than current networks can support. However, existing network architectures fail to address this gap. In this paper, we provide full-stack support for allreduce on multi-rail networks, aiming to overcome the scalability limitations of large-scale networks by facilitating collaborative data transfer across various networks. To achieve this, we propose the Nezha system, which integrates TCP, in-network computing protocol SHARP, and RDMA-based protocol GLEX. To maximize data transfer rates, Nezha incorporates a load balancing data allocation scheme based on cost feedback and combines exception handling to achieve reliable data transmission. Our experiments on a six-node cluster demonstrate that Nezha significantly enhances allreduce performance by 58% to 87% in homogeneous dual-rail configurations and offers considerable acceleration in heterogeneous settings, contingent on the performance variance among networks.

5/29/2024

Network-Offloaded Bandwidth-Optimal Broadcast and Allgather for Distributed AI

Mikhail Khalilov, Salvatore Di Girolamo, Marcin Chrapek, Rami Nudelman, Gil Bloch, Torsten Hoefler

In the Fully Sharded Data Parallel (FSDP) training pipeline, collective operations can be interleaved to maximize the communication/computation overlap. In this scenario, outstanding operations such as Allgather and Reduce-Scatter can compete for the injection bandwidth and create pipeline bubbles. To address this problem, we propose a novel bandwidth-optimal Allgather collective algorithm that leverages hardware multicast. We use multicast to build a constant-time reliable Broadcast protocol, a building block for constructing an optimal Allgather schedule. Our Allgather algorithm achieves 2x traffic reduction on a 188-node testbed. To free the host side from running the protocol, we employ SmartNIC offloading. We extract the parallelism in our Allgather algorithm and map it to a SmartNIC specialized for hiding the cost of data movement. We show that our SmartNIC-offloaded collective progress engine can scale to the next generation of 1.6 Tbit/s links.

8/27/2024

Rail-only: A Low-Cost High-Performance Network for Training LLMs with Trillion Parameters

Weiyang Wang, Manya Ghobadi, Kayvon Shakeri, Ying Zhang, Naader Hasani

This paper presents a low-cost network architecture for training large language models (LLMs) at hyperscale. We study the optimal parallelization strategy of LLMs and propose a novel datacenter network design tailored to LLM's unique communication pattern. We show that LLM training generates sparse communication patterns in the network and, therefore, does not require any-to-any full-bisection network to complete efficiently. As a result, our design eliminates the spine layer in traditional GPU clusters. We name this design a Rail-only network and demonstrate that it achieves the same training performance while reducing the network cost by 38% to 77% and network power consumption by 37% to 75% compared to a conventional GPU datacenter. Our architecture also supports Mixture-of-Expert (MoE) models with all-to-all communication through forwarding, with only 8.2% to 11.2% completion time overhead for all-to-all traffic. We study the failure robustness of Rail-only networks and provide insights into the performance impact of different network and training parameters.

9/17/2024

Near-Optimal Wafer-Scale Reduce

Piotr Luczynski, Lukas Gianinazzi, Patrick Iff, Leighton Wilson, Daniele De Sensi, Torsten Hoefler

Efficient Reduce and AllReduce communication collectives are a critical cornerstone of high-performance computing (HPC) applications. We present the first systematic investigation of Reduce and AllReduce on the Cerebras Wafer-Scale Engine (WSE). This architecture has been shown to achieve unprecedented performance both for machine learning workloads and other computational problems like FFT. We introduce a performance model to estimate the execution time of algorithms on the WSE and validate our predictions experimentally for a wide range of input sizes. In addition to existing implementations, we design and implement several new algorithms specifically tailored to the architecture. Moreover, we establish a lower bound for the runtime of a Reduce operation on the WSE. Based on our model, we automatically generate code that achieves near-optimal performance across the whole range of input sizes. Experiments demonstrate that our new Reduce and AllReduce algorithms outperform the current vendor solution by up to 3.27x. Additionally, our model predicts performance with less than 4% error. The proposed communication collectives increase the range of HPC applications that can benefit from the high throughput of the WSE. Our model-driven methodology demonstrates a disciplined approach that can lead the way to further algorithmic advancements on wafer-scale architectures.

9/4/2024