AutoDDL: Automatic Distributed Deep Learning with Near-Optimal Bandwidth Cost

Read original: arXiv:2301.06813 - Published 5/6/2024 by Jinfan Chen, Shigang Li, Ran Gun, Jinhui Yuan, Torsten Hoefler

🤿

Overview

The paper proposes a distributed training framework called AutoDDL that automatically explores and exploits new parallelization schemes to efficiently train large-scale deep learning models on distributed systems.
AutoDDL uses OneFlow's Split, Broadcast, and Partial Sum (SBP) abstraction to facilitate the description and implementation of different parallelization schemes.
AutoDDL combines an analytical performance model with a customized Coordinate Descent algorithm to significantly reduce the overhead of searching for optimal parallelization schemes.
Experiments on Multi-Node-Single-GPU and Multi-Node-Multi-GPU systems using Transformer and VGG models show that AutoDDL can reduce the end-to-end training time by up to 31.1% and 71.5%, respectively, compared to expert-optimized implementations.

Plain English Explanation

Training large-scale deep learning models on distributed systems is a complex task that requires carefully balancing different types of parallelism, such as data, operator, and pipeline parallelism. This can be a significant burden on machine learning practitioners. To address this, the researchers have developed a system called AutoDDL that can automatically find the best way to split up and distribute the training process across multiple machines and GPUs.

AutoDDL uses a set of building blocks called Split, Broadcast, and Partial Sum (SBP) to describe and implement different parallelization schemes. It then uses an analytical performance model and a specialized optimization algorithm to quickly identify the most efficient scheme for a given deep learning model and hardware setup. This allows AutoDDL to outperform the training times of expert-optimized implementations by a substantial margin, up to 31.1% for a Transformer model and 71.5% for a VGG model.

The key innovation of AutoDDL is its ability to automatically explore and exploit parallelization schemes, relieving machine learning researchers and engineers of the burden of manually designing and tuning the distributed training process. This can lead to significant time savings and enable the efficient training of ever-larger deep learning models on distributed hardware.

Technical Explanation

The paper proposes a distributed training framework called AutoDDL that automatically explores and exploits new parallelization schemes to efficiently train large-scale deep learning models on distributed systems. AutoDDL utilizes OneFlow's Split, Broadcast, and Partial Sum (SBP) abstraction to facilitate the description and implementation of different parallelization schemes.

To reduce the overhead of searching for optimal parallelization schemes, AutoDDL combines an analytical performance model with a customized Coordinate Descent algorithm. The performance model estimates the bandwidth cost of different parallelization schemes, and the Coordinate Descent algorithm iteratively refines the scheme to minimize the bandwidth cost.

The researchers evaluate AutoDDL on Multi-Node-Single-GPU and Multi-Node-Multi-GPU systems using various deep learning models, including Transformer and VGG. Compared to expert-optimized implementations, AutoDDL reduces the end-to-end training time by up to 31.1% and 10% for Transformer, and up to 17.7% and 71.5% for VGG, on the two parallel systems, respectively.

Critical Analysis

The paper presents a promising approach to the challenging problem of efficiently training large-scale deep learning models on distributed systems. The key strength of AutoDDL is its ability to automatically explore and optimize parallelization schemes, which can significantly reduce the burden on machine learning practitioners.

However, the paper does not provide a comprehensive analysis of the limitations and potential issues with the proposed framework. For example, the performance model used by AutoDDL may not accurately capture all the complexities of real-world distributed training scenarios, such as the impact of network latency, resource contention, or hardware heterogeneity. Additionally, the paper does not discuss the scalability of AutoDDL to larger, more complex models or distributed systems with a higher number of nodes.

Further research could explore ways to enhance the performance model, improve the optimization algorithm, and investigate the generalization of AutoDDL to a wider range of deep learning models and hardware configurations. Additionally, it would be valuable to compare AutoDDL's performance to other distributed training frameworks, such as GAD, ANTDT, or Sampling-based Distributed Training, to better understand its relative strengths and weaknesses.

Conclusion

The AutoDDL framework proposed in this paper represents a significant advancement in the field of distributed deep learning training. By automatically exploring and exploiting parallelization schemes, AutoDDL can substantially reduce the end-to-end training time of large-scale models, making it a valuable tool for machine learning researchers and engineers working on computationally intensive deep learning tasks.

The paper's key contribution is the development of a system that can seamlessly handle the complex trade-offs involved in distributed training, freeing practitioners from the burden of manual optimization. While the paper does not address all potential limitations, the results demonstrate the promising potential of AutoDDL and highlight the importance of continued research in this area to enable the efficient training of ever-larger and more sophisticated deep learning models.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Follow @aimodelsfyi on 𝕏 →

Related Papers

🤿

AutoDDL: Automatic Distributed Deep Learning with Near-Optimal Bandwidth Cost

Jinfan Chen, Shigang Li, Ran Gun, Jinhui Yuan, Torsten Hoefler

Recent advances in deep learning are driven by the growing scale of computation, data, and models. However, efficiently training large-scale models on distributed systems requires an intricate combination of data, operator, and pipeline parallelism, which exerts heavy burden on machine learning practitioners. To this end, we propose AutoDDL, a distributed training framework that automatically explores and exploits new parallelization schemes with near-optimal bandwidth cost. AutoDDL facilitates the description and implementation of different schemes by utilizing OneFlow's Split, Broadcast, and Partial Sum (SBP) abstraction. AutoDDL is equipped with an analytical performance model combined with a customized Coordinate Descent algorithm, which significantly reduces the scheme searching overhead. We conduct evaluations on Multi-Node-Single-GPU and Multi-Node-Multi-GPU machines using different models, including VGG and Transformer. Compared to the expert-optimized implementations, AutoDDL reduces the end-to-end training time by up to 31.1% and 10% for Transformer and up to 17.7% and 71.5% for VGG on the two parallel systems, respectively.

5/6/2024

🤿

Automated Deep Neural Network Inference Partitioning for Distributed Embedded Systems

Fabian Kress, El Mahdi El Annabi, Tim Hotfilter, Julian Hoefer, Tanja Harbaum, Juergen Becker

Distributed systems can be found in various applications, e.g., in robotics or autonomous driving, to achieve higher flexibility and robustness. Thereby, data flow centric applications such as Deep Neural Network (DNN) inference benefit from partitioning the workload over multiple compute nodes in terms of performance and energy-efficiency. However, mapping large models on distributed embedded systems is a complex task, due to low latency and high throughput requirements combined with strict energy and memory constraints. In this paper, we present a novel approach for hardware-aware layer scheduling of DNN inference in distributed embedded systems. Therefore, our proposed framework uses a graph-based algorithm to automatically find beneficial partitioning points in a given DNN. Each of these is evaluated based on several essential system metrics such as accuracy and memory utilization, while considering the respective system constraints. We demonstrate our approach in terms of the impact of inference partitioning on various performance metrics of six different DNNs. As an example, we can achieve a 47.5 % throughput increase for EfficientNet-B0 inference partitioned onto two platforms while observing high energy-efficiency.

7/1/2024

AutoScale: Automatic Prediction of Compute-optimal Data Composition for Training LLMs

Feiyang Kang, Yifan Sun, Bingbing Wen, Si Chen, Dawn Song, Rafid Mahmood, Ruoxi Jia

To ensure performance on a diverse set of downstream tasks, LLMs are pretrained via data mixtures over different domains. In this work, we demonstrate that the optimal data composition for a fixed compute budget varies depending on the scale of the training data, suggesting that the common practice of empirically determining an optimal composition using small-scale experiments will not yield the optimal data mixtures when scaling up to the final model. To address this challenge, we propose *AutoScale*, an automated tool that finds a compute-optimal data composition for training at any desired target scale. AutoScale first determines the optimal composition at a small scale using a novel bilevel optimization framework, Direct Data Optimization (*DDO*), and then fits a predictor to estimate the optimal composition at larger scales. The predictor's design is inspired by our theoretical analysis of scaling laws related to data composition, which could be of independent interest. In empirical studies with pre-training 774M Decoder-only LMs (GPT-2 Large) on RedPajama dataset, AutoScale decreases validation perplexity at least 25% faster than any baseline with up to 38% speed up compared to without reweighting, achieving the best overall performance across downstream tasks. On pre-training Encoder-only LMs (BERT) with masked language modeling, DDO is shown to decrease loss on all domains while visibly improving average task performance on GLUE benchmark by 8.7% and on large-scale QA dataset (SQuAD) by 5.9% compared with without reweighting. AutoScale speeds up training by up to 28%. Our codes are open-sourced.

7/30/2024

Communication-Efficient Distributed Deep Learning via Federated Dynamic Averaging

Michail Theologitis, Georgios Frangias, Georgios Anestis, Vasilis Samoladas, Antonios Deligiannakis

Driven by the ever-growing volume and decentralized nature of data, coupled with the need to harness this data and generate knowledge from it, has led to the extensive use of distributed deep learning (DDL) techniques for training. These techniques rely on local training that is performed at the distributed nodes based on locally collected data, followed by a periodic synchronization process that combines these models to create a global model. However, frequent synchronization of DL models, encompassing millions to many billions of parameters, creates a communication bottleneck, severely hindering scalability. Worse yet, DDL algorithms typically waste valuable bandwidth, and make themselves less practical in bandwidth-constrained federated settings, by relying on overly simplistic, periodic, and rigid synchronization schedules. These drawbacks also have a direct impact on the time required for the training process, necessitating excessive time for data communication. To address these shortcomings, we propose Federated Dynamic Averaging (FDA), a communication-efficient DDL strategy that dynamically triggers synchronization based on the value of the model variance. In essence, the costly synchronization step is triggered only if the local models, which are initialized from a common global model after each synchronization, have significantly diverged. This decision is facilitated by the communication of a small local state from each distributed node/worker. Through extensive experiments across a wide range of learning tasks we demonstrate that FDA reduces communication cost by orders of magnitude, compared to both traditional and cutting-edge communication-efficient algorithms. Additionally, we show that FDA maintains robust performance across diverse data heterogeneity settings.

6/7/2024