ForestColl: Efficient Collective Communications on Heterogeneous Network Fabrics

Read original: arXiv:2402.06787 - Published 9/24/2024 by Liangyu Zhao, Saeed Maleki, Aashaka Shah, Ziyue Yang, Hossein Pourreza, Arvind Krishnamurthy

Overview

As deep learning models continue to grow larger, collective communication between accelerators (such as allreduce) becomes a significant performance bottleneck.
Designing efficient communication schedules is challenging due to the diversity and heterogeneity of modern network fabrics.
This paper presents ForestColl, a tool that generates high-performance communication schedules for any network topology.

Plain English Explanation

ForestColl: Efficient Collective Communications on Heterogeneous Network Fabrics is a research paper that tackles the challenge of optimizing communication between the accelerators (such as GPUs) used to train large deep learning models.

As these models continue to grow in size, collective communication operations such as allreduce have become a major performance bottleneck. At the same time, the network fabrics connecting the accelerators are highly diverse and heterogeneous, which makes it difficult to design communication schedules that use them efficiently.

The researchers developed a tool called ForestColl that can generate high-performing communication schedules for any network topology. ForestColl constructs broadcast/aggregation spanning trees as the communication schedule, which allows it to achieve theoretically optimal throughput. The schedule generation process is highly scalable and runs in strongly polynomial time, making it practical for large-scale systems.

The key advantage of ForestColl is that it supports a wide range of network fabrics, including both switching fabrics and direct connections between accelerators. This flexibility is important because modern data centers and AI training clusters often have a mix of different network technologies.

Technical Explanation

ForestColl generates communication schedules by constructing broadcast/aggregation spanning trees. This approach allows it to achieve theoretically optimal throughput for collective communication operations, such as allreduce, which are essential for training large deep learning models.
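
To make the idea of a spanning tree serving as a communication schedule concrete, here is a minimal Python sketch of an allreduce performed over a single tree: values are aggregated from the leaves up to a root, then the total is broadcast back down. This is an illustration only and not ForestColl's algorithm. The function names, the BFS tree construction, and the four-node topology are all assumptions made for the example; the paper's contribution lies in how the trees are chosen so that, together, they reach optimal throughput.

```python
from collections import deque


def bfs_spanning_tree(adjacency, root):
    """Return {node: parent} for a BFS spanning tree rooted at `root`.

    BFS is used only for simplicity; it is NOT how ForestColl picks trees.
    """
    parent = {root: None}
    queue = deque([root])
    while queue:
        node = queue.popleft()
        for neighbor in adjacency[node]:
            if neighbor not in parent:
                parent[neighbor] = node
                queue.append(neighbor)
    return parent


def tree_allreduce(adjacency, values, root):
    """Sum `values` across all nodes using one aggregation/broadcast tree.

    Returns (result, sends): `result` maps each node to the global sum and
    `sends` lists the (phase, src, dst) transfers in a valid order.
    """
    parent = bfs_spanning_tree(adjacency, root)
    if len(parent) != len(adjacency):
        raise ValueError("topology is not connected")

    order = list(parent)      # BFS order: every parent precedes its children
    partial = dict(values)    # running partial sums, one per node
    sends = []

    # Aggregation phase: walk the tree bottom-up, children send to parents.
    for node in reversed(order):
        up = parent[node]
        if up is not None:
            partial[up] += partial[node]
            sends.append(("reduce", node, up))

    # Broadcast phase: walk the tree top-down, parents send to children.
    result = {root: partial[root]}
    for node in order:
        up = parent[node]
        if up is not None:
            result[node] = result[up]
            sends.append(("broadcast", up, node))
    return result, sends


# Hypothetical 4-accelerator ring: 0-1, 1-3, 3-2, 2-0.
ring = {0: [1, 2], 1: [0, 3], 2: [0, 3], 3: [1, 2]}
inputs = {0: 1, 1: 2, 2: 3, 3: 4}
result, sends = tree_allreduce(ring, inputs, root=0)
```

With N nodes, a single tree needs N - 1 transfers per phase. A single tree also leaves some links idle (here, the 2-3 link is never used), which is one reason a schedule built from several trees can use the fabric better than one tree alone.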

The researchers evaluated ForestColl on multi-box AMD MI250 and NVIDIA DGX A100 platforms. They found that ForestColl's schedules delivered up to 130% higher performance compared to the vendors' own optimized communication libraries, RCCL and NCCL. ForestColl also achieved a 20% speedup in large language model (LLM) training.

Furthermore, ForestColl outperforms other state-of-the-art schedule generation techniques. It can produce up to 61% more efficient schedules and generates them orders of magnitude faster.
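
One way to reason about how efficient a schedule is, and what "theoretically optimal throughput" can mean, is to compare its completion time against a bandwidth lower bound. The sketch below computes a generic cut-based lower bound for allgather: for any group of nodes, all data that starts outside the group must cross the links entering it. The function name, the link table, and the brute-force search over cuts are assumptions for illustration; this is not the paper's formulation, and the exhaustive search is exponential, unlike ForestColl's strongly polynomial schedule generation. It also assumes every node is an accelerator holding an equal shard, with no switches.

```python
from itertools import combinations


def allgather_time_lower_bound(nodes, links, total_bytes):
    """Brute-force a cut-based lower bound on allgather completion time.

    nodes:       list of accelerator ids, each starting with total_bytes / N
    links:       {(src, dst): bandwidth} for directed links (bytes per second)
    total_bytes: size of the fully gathered buffer

    Returns (seconds, bottleneck_set). Feasible only for tiny topologies.
    """
    n = len(nodes)
    best_time, bottleneck = 0.0, None
    for size in range(1, n):
        for subset in combinations(nodes, size):
            inside = set(subset)
            # Total bandwidth of links that enter the group from outside.
            ingress = sum(
                bw for (src, dst), bw in links.items()
                if src not in inside and dst in inside
            )
            if ingress == 0:
                return float("inf"), inside  # group is unreachable
            # Shards that originate outside the group must all cross the cut.
            must_enter = total_bytes * (n - size) / n
            time_needed = must_enter / ingress
            if time_needed > best_time:
                best_time, bottleneck = time_needed, inside
    return best_time, bottleneck


# Hypothetical heterogeneous fabric: two 2-GPU boxes with fast links inside
# each box (100 units/s) and one slow link between boxes (10 units/s).
nodes = [0, 1, 2, 3]
links = {
    (0, 1): 100, (1, 0): 100,   # box A
    (2, 3): 100, (3, 2): 100,   # box B
    (0, 2): 10, (2, 0): 10,     # inter-box link
}
bound, cut = allgather_time_lower_bound(nodes, links, total_bytes=400)
```

In this example the slow inter-box link dominates: half of the data (200 units) must enter each box at 10 units per second, so no schedule can finish in under 20 seconds regardless of how fast the intra-box links are.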

Critical Analysis

The paper provides a thorough evaluation of ForestColl's performance on real-world hardware, demonstrating its significant advantages over existing solutions. However, the authors do not delve into potential limitations or areas for further research.

One aspect that could be explored further is the impact of network heterogeneity on ForestColl's performance. While the paper claims that ForestColl supports a wide range of network fabrics, it would be helpful to understand how the tool performs in scenarios with more diverse and dynamic network conditions.

Additionally, the paper could have examined the trade-off between schedule generation time and the quality of the generated schedules. In some cases, a slightly less efficient schedule that is generated quickly may be preferable to an optimal schedule that takes much longer to compute.

Conclusion

ForestColl is a tool that addresses a critical performance bottleneck in training large deep learning models by generating efficient communication schedules for diverse network fabrics. Its ability to outperform vendors' own optimized libraries by up to 130% and to deliver a 20% speedup in large language model training makes it a promising approach for improving the scalability of modern AI systems.

While the paper provides a strong technical foundation, further research could explore the tool's behavior under more complex and dynamic network conditions, as well as the trade-offs between schedule generation time and quality. Nonetheless, ForestColl represents an important step forward in optimizing collective communication for the next generation of large-scale deep learning models.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

ForestColl: Efficient Collective Communications on Heterogeneous Network Fabrics

Liangyu Zhao, Saeed Maleki, Aashaka Shah, Ziyue Yang, Hossein Pourreza, Arvind Krishnamurthy

As modern DNN models grow ever larger, collective communications between the accelerators (allreduce, etc.) emerge as a significant performance bottleneck. Designing efficient communication schedules is challenging, given today's highly diverse and heterogeneous network fabrics. In this paper, we present ForestColl, a tool that generates performant schedules for any network topology. ForestColl constructs broadcast/aggregation spanning trees as the communication schedule, achieving theoretically optimal throughput. Its schedule generation runs in strongly polynomial time and is highly scalable. ForestColl supports any network fabric, including both switching fabrics and direct connections. We evaluated ForestColl on multi-box AMD MI250 and NVIDIA DGX A100 platforms. ForestColl's schedules delivered up to 130% higher performance compared to the vendors' own optimized communication libraries, RCCL and NCCL, and achieved a 20% speedup in LLM training. ForestColl also outperforms other state-of-the-art schedule generation techniques with both up to 61% more efficient generated schedules and orders of magnitude faster schedule generation speed.

9/24/2024

Efficient All-to-All Collective Communication Schedules for Direct-Connect Topologies

Prithwish Basu, Liangyu Zhao, Jason Fantl, Siddharth Pal, Arvind Krishnamurthy, Joud Khoury

The all-to-all collective communications primitive is widely used in machine learning (ML) and high performance computing (HPC) workloads, and optimizing its performance is of interest to both ML and HPC communities. All-to-all is a particularly challenging workload that can severely strain the underlying interconnect bandwidth at scale. This paper takes a holistic approach to optimize the performance of all-to-all collective communications on supercomputer-scale direct-connect interconnects. We address several algorithmic and practical challenges in developing efficient and bandwidth-optimal all-to-all schedules for any topology and lowering the schedules to various runtimes and interconnect technologies. We also propose a novel topology that delivers near-optimal all-to-all performance.

4/29/2024

Efficient Direct-Connect Topologies for Collective Communications

Liangyu Zhao, Siddharth Pal, Tapan Chugh, Weiyang Wang, Jason Fantl, Prithwish Basu, Joud Khoury, Arvind Krishnamurthy

We consider the problem of distilling efficient network topologies for collective communications. We provide an algorithmic framework for constructing direct-connect topologies optimized for the latency vs. bandwidth trade-off associated with the workload. Our approach synthesizes many different topologies and schedules for a given cluster size and degree and then identifies the appropriate topology and schedule for a given workload. Our algorithms start from small, optimal base topologies and associated communication schedules and use techniques that can be iteratively applied to derive much larger topologies and schedules. Additionally, we incorporate well-studied large-scale graph topologies into our algorithmic framework by producing efficient collective schedules for them using a novel polynomial-time algorithm. Our evaluation uses multiple testbeds and large-scale simulations to demonstrate significant performance benefits from our derived topologies and schedules.

5/14/2024

Network-Offloaded Bandwidth-Optimal Broadcast and Allgather for Distributed AI

Mikhail Khalilov, Salvatore Di Girolamo, Marcin Chrapek, Rami Nudelman, Gil Bloch, Torsten Hoefler

In the Fully Sharded Data Parallel (FSDP) training pipeline, collective operations can be interleaved to maximize the communication/computation overlap. In this scenario, outstanding operations such as Allgather and Reduce-Scatter can compete for the injection bandwidth and create pipeline bubbles. To address this problem, we propose a novel bandwidth-optimal Allgather collective algorithm that leverages hardware multicast. We use multicast to build a constant-time reliable Broadcast protocol, a building block for constructing an optimal Allgather schedule. Our Allgather algorithm achieves 2x traffic reduction on a 188-node testbed. To free the host side from running the protocol, we employ SmartNIC offloading. We extract the parallelism in our Allgather algorithm and map it to a SmartNIC specialized for hiding the cost of data movement. We show that our SmartNIC-offloaded collective progress engine can scale to the next generation of 1.6 Tbit/s links.

8/27/2024