On Optimizing the Communication of Model Parallelism

Read original: arXiv:2211.05322 - Published 8/20/2024 by Yonghao Zhuang, Hexu Zhao, Lianmin Zheng, Zhuohan Li, Eric P. Xing, Qirong Ho, Joseph E. Gonzalez, Ion Stoica, Hao Zhang

📈

Overview

Explores a new communication pattern in large-scale model-parallel deep learning called "cross-mesh resharding"
This pattern emerges when combining intra-operator and inter-operator parallelism to support large models on large clusters
Formalizes cross-mesh resharding as a many-to-many multicast communication problem
Proposes two contributions to address this: an efficient broadcast-based communication system and an overlapping-friendly pipeline schedule
Evaluates the system on microbenchmarks and end-to-end training of large models like GPT-3 and U-Transformer

Plain English Explanation

When training very large deep learning models, researchers often use a technique called "model parallelism" to split the model across multiple devices (like GPUs). This allows the model to be trained more quickly by dividing the work.

However, this introduces a new communication challenge - the different parts of the model need to share data with each other during training. The authors refer to this as "cross-mesh resharding," where a tensor (a multi-dimensional array of data) needs to be sent from one set of devices to another, potentially with a different layout.

The authors show that existing approaches for this communication are either inefficient or don't work well with different network topologies or tensor layouts. To address this, they propose two key contributions:

An efficient broadcast-based communication system that can quickly move data between device meshes.
An overlapping-friendly pipeline schedule that hides some of the communication latency by overlapping it with computation.

When evaluated on benchmark tests and training large models like GPT-3 and U-Transformer, this overall system outperforms existing approaches by up to 10x and can improve training throughput by 10-50%.

Technical Explanation

The paper formalizes the cross-mesh resharding problem as a many-to-many multicast communication challenge. In this scenario, a tensor needs to be sent from a source device mesh to a destination device mesh, where the tensor may be distributed differently across the devices.

The authors propose two key technical contributions to address this problem:

Broadcast-based Communication System: Rather than using a more traditional point-to-point communication approach, the authors develop an efficient broadcast-based system. This allows the source devices to simultaneously send the tensor to all destination devices, leveraging the underlying network topology.
Overlapping-friendly Pipeline Schedule: To further improve performance, the authors design a pipeline schedule that overlaps communication and computation. This hides some of the communication latency by executing other useful work while the data is being transferred.

The paper evaluates this system on both microbenchmarks and end-to-end training of large models. On the microbenchmarks, the authors show their approach outperforms existing techniques by up to 10x across various tensor and mesh layouts.

When applied to training GPT-3 and U-Transformer, the authors report throughput improvements of 10% and 50% respectively compared to prior methods.

Critical Analysis

The paper presents a well-designed and thorough solution to the cross-mesh resharding problem in large-scale model-parallel deep learning. The formalization of the problem as a many-to-many multicast communication challenge is a helpful framing, and the two technical contributions seem well-justified and effective based on the experimental results.

That said, the paper does not extensively explore the limitations or potential issues with the proposed approach. For example, it would be valuable to understand how the performance scales as the model size, number of devices, or network topology increases. The authors also do not discuss potential issues around fault tolerance, load balancing, or adaptability to different hardware configurations.

Additionally, while the paper demonstrates strong empirical results, there is limited analysis of the underlying reasons for the performance improvements. A deeper dive into the mechanics of the broadcast-based communication system and pipeline schedule could provide more insight into when and why this approach is most beneficial.

Overall, the paper makes an important contribution in addressing a key challenge in large-scale distributed deep learning training. However, further research is needed to fully understand the strengths, weaknesses, and broader applicability of the proposed techniques.

Conclusion

This paper tackles a critical communication challenge in large-scale model-parallel deep learning, known as "cross-mesh resharding." The authors formalize this as a many-to-many multicast problem and propose two key technical contributions: an efficient broadcast-based communication system and an overlapping-friendly pipeline schedule.

Experimental results on microbenchmarks and end-to-end training of large models like GPT-3 and U-Transformer show this approach can significantly outperform existing methods, improving training throughput by up to 50%. This work represents an important step forward in enabling efficient distributed training of ever-larger deep learning models.

However, the paper leaves room for further exploration of the limitations, underlying mechanics, and broader applicability of the proposed techniques. Continued research in this area is important to support the ongoing growth and scalability of model-parallel deep learning systems.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Follow @aimodelsfyi on 𝕏 →

Related Papers

📈

On Optimizing the Communication of Model Parallelism

Yonghao Zhuang, Hexu Zhao, Lianmin Zheng, Zhuohan Li, Eric P. Xing, Qirong Ho, Joseph E. Gonzalez, Ion Stoica, Hao Zhang

We study a novel and important communication pattern in large-scale model-parallel deep learning (DL), which we call cross-mesh resharding. This pattern emerges when the two paradigms of model parallelism - intra-operator and inter-operator parallelism - are combined to support large models on large clusters. In cross-mesh resharding, a sharded tensor needs to be sent from a source device mesh to a destination device mesh, on which the tensor may be distributed with the same or different layouts. We formalize this as a many-to-many multicast communication problem, and show that existing approaches either are sub-optimal or do not generalize to different network topologies or tensor layouts, which result from different model architectures and parallelism strategies. We then propose two contributions to address cross-mesh resharding: an efficient broadcast-based communication system, and an overlapping-friendly pipeline schedule. On microbenchmarks, our overall system outperforms existing ones by up to 10x across various tensor and mesh layouts. On end-to-end training of two large models, GPT-3 and U-Transformer, we improve throughput by 10% and 50%, respectively.

8/20/2024

Communication Optimization for Distributed Training: Architecture, Advances, and Opportunities

Yunze Wei, Tianshuo Hu, Cong Liang, Yong Cui

The past few years have witnessed the flourishing of large-scale deep neural network models with ever-growing parameter numbers. Training such large-scale models typically requires massive memory and computing resources, necessitating distributed training. As GPU performance has rapidly evolved in recent years, computation time has shrunk, making communication a larger portion of the overall training time. Consequently, optimizing communication for distributed training has become crucial. In this article, we briefly introduce the general architecture of distributed deep neural network training and analyze relationships among Parallelization Strategy, Collective Communication Library, and Network from the perspective of communication optimization, which forms a three-layer paradigm. We then review current representative research advances within this three-layer paradigm. We find that layers in the current three-layer paradigm are relatively independent and there is a rich design space for cross-layer collaborative optimization in distributed training scenarios. Therefore, we advocate Vertical and Horizontal co-designs which extend the three-layer paradigm to a five-layer paradigm. We also advocate Intra-Inter and Host-Net co-designs to further utilize the potential of heterogeneous resources. We hope this article can shed some light on future research on communication optimization for distributed training.

8/30/2024

🔍

A 4D Hybrid Algorithm to Scale Parallel Training to Thousands of GPUs

Siddharth Singh, Prajwal Singhania, Aditya K. Ranjan, Zack Sating, Abhinav Bhatele

Heavy communication, in particular, collective operations, can become a critical performance bottleneck in scaling the training of billion-parameter neural networks to large-scale parallel systems. This paper introduces a four-dimensional (4D) approach to optimize communication in parallel training. This 4D approach is a hybrid of 3D tensor and data parallelism, and is implemented in the AxoNN framework. In addition, we employ two key strategies to further minimize communication overheads. First, we aggressively overlap expensive collective operations (reduce-scatter, all-gather, and all-reduce) with computation. Second, we develop an analytical model to identify high-performing configurations within the large search space defined by our 4D algorithm. This model empowers practitioners by simplifying the tuning process for their specific training workloads. When training an 80-billion parameter GPT on 1024 GPUs of Perlmutter, AxoNN surpasses Megatron-LM, a state-of-the-art framework, by a significant 26%. Additionally, it achieves a significantly high 57% of the theoretical peak FLOP/s or 182 PFLOP/s in total.

5/15/2024

Demystifying the Communication Characteristics for Distributed Transformer Models

Quentin Anthony, Benjamin Michalowicz, Jacob Hatef, Lang Xu, Mustafa Abduljabbar, Aamir Shafi, Hari Subramoni, Dhabaleswar Panda

Deep learning (DL) models based on the transformer architecture have revolutionized many DL applications such as large language models (LLMs), vision transformers, audio generation, and time series prediction. Much of this progress has been fueled by distributed training, yet distributed communication remains a substantial bottleneck to training progress. This paper examines the communication behavior of transformer models - that is, how different parallelism schemes used in multi-node/multi-GPU DL Training communicate data in the context of transformers. We use GPT-based language models as a case study of the transformer architecture due to their ubiquity. We validate the empirical results obtained from our communication logs using analytical models. At a high level, our analysis reveals a need to optimize small message point-to-point communication further, correlations between sequence length, per-GPU throughput, model size, and optimizations used, and where to potentially guide further optimizations in framework and HPC middleware design and optimization.

8/20/2024