Revisiting the Time Cost Model of AllReduce

Read original: arXiv:2409.04202 - Published 9/9/2024 by Dian Xiong, Li Chen, Youhe Jiang, Dan Li, Shuai Wang, Songtao Wang

📈

Overview

This paper evaluates a model called \model on several state-of-the-art all-reduce algorithms.
The authors implement and test the Ring all-reduce, RHD, CPS, and HCPS algorithms using Open MPI.
The experiments were conducted on a testbed to evaluate the time cost of these algorithms.

Plain English Explanation

The paper is focused on evaluating a model called \model that predicts the time cost of different all-reduce algorithms. All-reduce is a key communication primitive used in distributed machine learning, where workers need to aggregate their local updates into a global model update.

The researchers implemented several state-of-the-art all-reduce algorithms, including Ring, RHD, CPS, and HCPS, and ran experiments on a testbed to measure their time cost. They used the \model to predict the performance of these algorithms and compared the predictions to the actual measured time costs.

The goal is to better understand the factors that influence the performance of all-reduce algorithms, which is crucial for optimizing the communication in large-scale distributed machine learning systems. By accurately modeling the time cost, researchers and engineers can make more informed decisions about which algorithms to use in their systems.

Technical Explanation

The paper evaluates the \model on several state-of-the-art all-reduce algorithms, including Ring, RHD, CPS, and HCPS. The authors implement these algorithms using Open MPI v4.1.1 and run them on a testbed to measure their time cost.

The experiments were conducted on two different cluster configurations: 12 nodes and 15 nodes. The researchers compare the actual measured time costs to the predictions made by the \model, which is designed to estimate the time cost of all-reduce algorithms based on factors such as network topology, message size, and number of nodes.

The results show that the \model can accurately predict the time cost of the different all-reduce algorithms, with the predictions closely matching the observed performance on the testbed. This suggests that the \model is a useful tool for understanding and optimizing the performance of all-reduce communication in distributed machine learning systems.

Critical Analysis

The paper provides a thorough evaluation of the \model and its ability to predict the time cost of various all-reduce algorithms. The researchers have carefully designed their experiments to cover a range of cluster configurations and all-reduce algorithms, which gives confidence in the validity of their results.

However, the paper does not discuss any potential limitations or caveats of the \model or the experimental setup. For example, it would be helpful to know the hardware and network specifications of the testbed, as well as any potential sources of performance variability or noise that could impact the results.

Additionally, the paper does not explore the implications of their findings for the broader field of distributed machine learning. It would be valuable to discuss how the insights from this research could be used to inform the design of more efficient distributed training systems or to guide the selection of appropriate all-reduce algorithms for different workloads and hardware configurations.

Conclusion

This paper presents a comprehensive evaluation of the \model for predicting the time cost of all-reduce algorithms, a crucial communication primitive in distributed machine learning. The researchers implement and test several state-of-the-art all-reduce algorithms on a testbed and compare the actual performance to the \model's predictions.

The results show that the \model can accurately predict the time cost of the different all-reduce algorithms, suggesting that it is a valuable tool for understanding and optimizing the performance of distributed machine learning systems. By accurately modeling the factors that influence all-reduce performance, researchers and engineers can make more informed decisions about which algorithms to use in their systems, ultimately leading to more efficient and scalable distributed training.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Follow @aimodelsfyi on 𝕏 →

Related Papers

📈

Revisiting the Time Cost Model of AllReduce

Dian Xiong, Li Chen, Youhe Jiang, Dan Li, Shuai Wang, Songtao Wang

AllReduce is an important and popular collective communication primitive, which has been widely used in areas such as distributed machine learning and high performance computing. To design, analyze, and choose from various algorithms and implementations of AllReduce, the time cost model plays a crucial role, and the predominant one is the $(alpha,beta,gamma)$ model. In this paper, we revisit this model, and reveal that it cannot well characterize the time cost of AllReduce on modern clusters; thus must be updated. We perform extensive measurements to identify two additional terms contributing to the time cost: the incast term and the memory access term. We augment the $(alpha,beta,gamma)$ model with these two terms, and present GenModel as a result. Using GenModel, we discover two new optimalities for AllReduce algorithms, and prove that they cannot be achieved simultaneously. Finally, striking the balance between the two new optimalities, we design GenTree, an AllReduce plan generation algorithm specialized for tree-like topologies. Experiments on a real testbed with 64 GPUs show that GenTree can achieve 1.22$times$ to 1.65$times$ speed-up against NCCL. Large-scale simulations also confirm that GenTree can improve the state-of-the-art AllReduce algorithm by a factor of $1.2$ to $7.4$ in scenarios where the two new terms dominate.

9/9/2024

🧠

Near-Optimal Wafer-Scale Reduce

Piotr Luczynski, Lukas Gianinazzi, Patrick Iff, Leighton Wilson, Daniele De Sensi, Torsten Hoefler

Efficient Reduce and AllReduce communication collectives are a critical cornerstone of high-performance computing (HPC) applications. We present the first systematic investigation of Reduce and AllReduce on the Cerebras Wafer-Scale Engine (WSE). This architecture has been shown to achieve unprecedented performance both for machine learning workloads and other computational problems like FFT. We introduce a performance model to estimate the execution time of algorithms on the WSE and validate our predictions experimentally for a wide range of input sizes. In addition to existing implementations, we design and implement several new algorithms specifically tailored to the architecture. Moreover, we establish a lower bound for the runtime of a Reduce operation on the WSE. Based on our model, we automatically generate code that achieves near-optimal performance across the whole range of input sizes. Experiments demonstrate that our new Reduce and AllReduce algorithms outperform the current vendor solution by up to 3.27x. Additionally, our model predicts performance with less than 4% error. The proposed communication collectives increase the range of HPC applications that can benefit from the high throughput of the WSE. Our model-driven methodology demonstrates a disciplined approach that can lead the way to further algorithmic advancements on wafer-scale architectures.

9/4/2024

🛸

Optimizing Distributed ML Communication with Fused Computation-Collective Operations

Kishore Punniyamurthy, Khaled Hamidouche, Bradford M. Beckmann

In order to satisfy their ever increasing capacity and compute requirements, machine learning models are distributed across multiple nodes using numerous parallelism strategies. As a result, collective communications are often on the critical path, and hiding their latency by overlapping kernel-granular communication and computation is difficult due to the absence of independent computation. In this work, we propose fusing computation with dependent collective communication by leveraging GPUs' massive parallelism and GPU-initiated communication. We have developed self-contained GPU kernels where workgroups (WGs) immediately communicate their results to remote GPUs when they complete their computation. Meanwhile, other WGs within the same kernel perform overlapping computation, maintaining high ALU utilization. We demonstrate our approach by creating three prototype fused operators (embedding + All-to-All, GEMV + AllReduce, and GEMM + All-to-All) to address the pervasive communication overheads observed in DLRM, Transformers and MoE model architectures. In order to demonstrate that our approach can be integrated into ML frameworks for wide adoption in production environments, we expose our fused operators as new PyTorch operators as well as extend the Triton framework to enable them. Our evaluations show that our approach can effectively overlap communication with computations, subsequently reducing their combined execution time than the current collective library-based approaches. Our scale-up GEMV + AllReduce and GEMM + All-to-All implementations achieve up to 22% and 20% lower execution time, while our fused embedding + All-to-All reduces execution time by 20% and 31% for intra-node and inter-node configurations. Large scale-out simulations indicate that our approach reduces DLRM execution time by 21% for 128 node system.

4/24/2024

Full-Stack Allreduce on Multi-Rail Networks

Enda Yu, Dezun Dong, Xiangke Liao

The high communication costs impede scalability in distributed systems. Multimodal models like Sora exacerbate this issue by requiring more resources than current networks can support. However, existing network architectures fail to address this gap. In this paper, we provide full-stack support for allreduce on multi-rail networks, aiming to overcome the scalability limitations of large-scale networks by facilitating collaborative data transfer across various networks. To achieve this, we propose the Nezha system, which integrates TCP, in-network computing protocol SHARP, and RDMA-based protocol GLEX. To maximize data transfer rates, Nezha incorporates a load balancing data allocation scheme based on cost feedback and combines exception handling to achieve reliable data transmission. Our experiments on a six-node cluster demonstrate that Nezha significantly enhances allreduce performance by 58% to 87% in homogeneous dual-rail configurations and offers considerable acceleration in heterogeneous settings, contingent on the performance variance among networks.

5/29/2024