Near-Optimal Wafer-Scale Reduce

Read original: arXiv:2404.15888 - Published 9/4/2024 by Piotr Luczynski, Lukas Gianinazzi, Patrick Iff, Leighton Wilson, Daniele De Sensi, Torsten Hoefler

🧠

Overview

This paper presents the first systematic investigation of efficient Reduce and AllReduce communication collectives on the Cerebras Wafer-Scale Engine (WSE) architecture.
The WSE has been shown to achieve unprecedented performance for both machine learning workloads and other computational problems like Fast Fourier Transforms (FFTs).
The authors introduce a performance model to estimate the execution time of algorithms on the WSE and validate their predictions experimentally across a wide range of input sizes.
They design and implement several new algorithms tailored to the WSE architecture, in addition to existing implementations.
The authors establish a lower bound for the runtime of a Reduce operation on the WSE and automatically generate code that achieves near-optimal performance across the entire range of input sizes.
Experimental results demonstrate that their new Reduce and AllReduce algorithms outperform the current vendor solution by up to 3.27x, and their model predicts performance with less than 4% error.

Plain English Explanation

The paper focuses on improving the efficiency of two key communication operations, Reduce and AllReduce, on the Cerebras Wafer-Scale Engine (WSE) architecture. The WSE is a powerful computing system that has shown impressive performance for both machine learning tasks and other computational problems.

The researchers developed a model to predict how long these communication operations will take on the WSE for different input sizes. They then used this model to design and implement new algorithms that are specifically tailored to the WSE's architecture. These new algorithms were able to outperform the existing solutions by a significant margin, up to 3.27 times faster.

The authors also established a mathematical lower bound for the minimum time required to perform a Reduce operation on the WSE. This helps provide a clear target for the best possible performance that can be achieved.

Overall, this work improves the efficiency of critical communication operations on the WSE, which will enable a wider range of high-performance computing (HPC) applications to take advantage of the WSE's high-throughput capabilities. The researchers' model-driven methodology demonstrates a disciplined approach that could lead to further algorithmic advancements on wafer-scale architectures like the WSE.

Technical Explanation

The paper presents a systematic investigation of efficient Reduce and AllReduce communication collectives on the Cerebras Wafer-Scale Engine (WSE). The WSE has been shown to achieve unprecedented performance for both machine learning workloads and other computational problems like Fast Fourier Transforms (FFTs).

The authors introduce a performance model to estimate the execution time of algorithms on the WSE and validate their predictions experimentally across a wide range of input sizes. In addition to existing implementations, they design and implement several new algorithms specifically tailored to the WSE architecture. Moreover, they establish a lower bound for the runtime of a Reduce operation on the WSE.

Based on their model, the researchers automatically generate code that achieves near-optimal performance across the entire range of input sizes. Experiments demonstrate that their new Reduce and AllReduce algorithms outperform the current vendor solution by up to 3.27x. Additionally, their model predicts performance with less than 4% error.

The proposed communication collectives increase the range of HPC applications that can benefit from the high throughput of the WSE. The authors' model-driven methodology demonstrates a disciplined approach that can lead the way to further algorithmic advancements on wafer-scale architectures.

Critical Analysis

The paper provides a comprehensive and systematic investigation of Reduce and AllReduce communication collectives on the Cerebras WSE, which is an important contribution to the field of high-performance computing. The authors' performance model and their ability to achieve near-optimal performance across a wide range of input sizes is particularly noteworthy.

However, the paper does not mention any potential limitations or caveats of the WSE architecture or the proposed algorithms. For example, it would be useful to understand the power and energy efficiency of the WSE for these communication-intensive workloads, as well as any potential scalability or memory constraints.

Additionally, the paper focuses solely on the WSE and does not compare the performance of the proposed algorithms to other state-of-the-art solutions on different hardware platforms, such as GPU-based communication collectives or communication-computation fusion techniques. A more comprehensive comparison would help readers understand the relative strengths and weaknesses of the WSE for these communication-centric workloads.

Overall, the paper presents a significant contribution to the field, but further research is needed to fully understand the capabilities and limitations of the WSE for a wider range of high-performance computing applications.

Conclusion

This paper introduces the first systematic investigation of efficient Reduce and AllReduce communication collectives on the Cerebras Wafer-Scale Engine (WSE) architecture. The authors develop a performance model to estimate the execution time of algorithms on the WSE and use it to design and implement several new algorithms tailored to the WSE's unique architecture.

Experimental results demonstrate that the new Reduce and AllReduce algorithms outperform the current vendor solution by up to 3.27x, while the performance model predicts results with less than 4% error. This work significantly improves the efficiency of critical communication operations on the WSE, enabling a broader range of high-performance computing applications to benefit from the system's high-throughput capabilities.

The authors' model-driven methodology provides a disciplined approach that can inform further algorithmic advancements on wafer-scale architectures like the WSE. As the field of high-performance computing continues to evolve, innovations like those presented in this paper will be crucial for unlocking the full potential of emerging hardware platforms.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Follow @aimodelsfyi on 𝕏 →

Related Papers

🧠

Near-Optimal Wafer-Scale Reduce

Piotr Luczynski, Lukas Gianinazzi, Patrick Iff, Leighton Wilson, Daniele De Sensi, Torsten Hoefler

Efficient Reduce and AllReduce communication collectives are a critical cornerstone of high-performance computing (HPC) applications. We present the first systematic investigation of Reduce and AllReduce on the Cerebras Wafer-Scale Engine (WSE). This architecture has been shown to achieve unprecedented performance both for machine learning workloads and other computational problems like FFT. We introduce a performance model to estimate the execution time of algorithms on the WSE and validate our predictions experimentally for a wide range of input sizes. In addition to existing implementations, we design and implement several new algorithms specifically tailored to the architecture. Moreover, we establish a lower bound for the runtime of a Reduce operation on the WSE. Based on our model, we automatically generate code that achieves near-optimal performance across the whole range of input sizes. Experiments demonstrate that our new Reduce and AllReduce algorithms outperform the current vendor solution by up to 3.27x. Additionally, our model predicts performance with less than 4% error. The proposed communication collectives increase the range of HPC applications that can benefit from the high throughput of the WSE. Our model-driven methodology demonstrates a disciplined approach that can lead the way to further algorithmic advancements on wafer-scale architectures.

9/4/2024

Tascade: Hardware Support for Atomic-free, Asynchronous and Efficient Reduction Trees

Marcelo Orenes-Vera, Esin Tureci, David Wentzlaff, Margaret Martonosi

Graph search and sparse data-structure traversal workloads contain challenging irregular memory patterns on global data structures that need to be modified atomically. Distributed processing of these workloads has relied on server threads operating on their own data copies that are merged upon global synchronization. As parallelism increases within each server, the communication challenges that arose in distributed systems a decade ago are now being encountered within large manycore servers. Prior work has achieved scalability for sparse applications up to thousands of PUs on-chip, but does not scale further due to increasing communication distances and load-imbalance across PUs. To address these challenges we propose Tascade, a hardware-software co-design that offers support for storage-efficient data-private reductions as well as asynchronous and opportunistic reduction trees. Tascade introduces an execution model along with supporting hardware design that allows coalescing of data updates regionally and merges the data from these regions through cascaded updates. Together, Tascade innovations minimize communication and increase work balance in task-based parallelization schemes and scales up to a million PUs. We evaluate six applications and four datasets to provide a detailed analysis of Tascade's performance, power, and traffic-reduction gains over prior work. Our parallelization of Breadth-First-Search with RMAT-26 across a million PUs -- the largest of the literature -- reaches over 7600 GTEPS.

4/23/2024

🛸

Optimizing Distributed ML Communication with Fused Computation-Collective Operations

Kishore Punniyamurthy, Khaled Hamidouche, Bradford M. Beckmann

In order to satisfy their ever increasing capacity and compute requirements, machine learning models are distributed across multiple nodes using numerous parallelism strategies. As a result, collective communications are often on the critical path, and hiding their latency by overlapping kernel-granular communication and computation is difficult due to the absence of independent computation. In this work, we propose fusing computation with dependent collective communication by leveraging GPUs' massive parallelism and GPU-initiated communication. We have developed self-contained GPU kernels where workgroups (WGs) immediately communicate their results to remote GPUs when they complete their computation. Meanwhile, other WGs within the same kernel perform overlapping computation, maintaining high ALU utilization. We demonstrate our approach by creating three prototype fused operators (embedding + All-to-All, GEMV + AllReduce, and GEMM + All-to-All) to address the pervasive communication overheads observed in DLRM, Transformers and MoE model architectures. In order to demonstrate that our approach can be integrated into ML frameworks for wide adoption in production environments, we expose our fused operators as new PyTorch operators as well as extend the Triton framework to enable them. Our evaluations show that our approach can effectively overlap communication with computations, subsequently reducing their combined execution time than the current collective library-based approaches. Our scale-up GEMV + AllReduce and GEMM + All-to-All implementations achieve up to 22% and 20% lower execution time, while our fused embedding + All-to-All reduces execution time by 20% and 31% for intra-node and inter-node configurations. Large scale-out simulations indicate that our approach reduces DLRM execution time by 21% for 128 node system.

4/24/2024

📈

Revisiting the Time Cost Model of AllReduce

Dian Xiong, Li Chen, Youhe Jiang, Dan Li, Shuai Wang, Songtao Wang

AllReduce is an important and popular collective communication primitive, which has been widely used in areas such as distributed machine learning and high performance computing. To design, analyze, and choose from various algorithms and implementations of AllReduce, the time cost model plays a crucial role, and the predominant one is the $(alpha,beta,gamma)$ model. In this paper, we revisit this model, and reveal that it cannot well characterize the time cost of AllReduce on modern clusters; thus must be updated. We perform extensive measurements to identify two additional terms contributing to the time cost: the incast term and the memory access term. We augment the $(alpha,beta,gamma)$ model with these two terms, and present GenModel as a result. Using GenModel, we discover two new optimalities for AllReduce algorithms, and prove that they cannot be achieved simultaneously. Finally, striking the balance between the two new optimalities, we design GenTree, an AllReduce plan generation algorithm specialized for tree-like topologies. Experiments on a real testbed with 64 GPUs show that GenTree can achieve 1.22$times$ to 1.65$times$ speed-up against NCCL. Large-scale simulations also confirm that GenTree can improve the state-of-the-art AllReduce algorithm by a factor of $1.2$ to $7.4$ in scenarios where the two new terms dominate.

9/9/2024