ZeroPP: Unleashing Exceptional Parallelism Efficiency through Tensor-Parallelism-Free Methodology

Read original: arXiv:2402.03791 - Published 5/27/2024 by Ding Tang, Lijuan Jiang, Jiecheng Zhou, Minxi Jin, Hengjie Li, Xingcheng Zhang, Zhilin Pei, Jidong Zhai


Overview

This paper introduces a new approach called "Adaptive Blockwise Task-interleaved Pipeline Parallelism" for improving the parallelization of large language models and other data-intensive machine learning models.
The key idea is to adaptively adjust the block size and interleaving pattern of pipeline parallelism to better match the varying computational and memory requirements of different model layers.
This allows for more efficient utilization of compute resources and can lead to significant speedups compared to traditional data parallelism and pipeline parallelism approaches.

Plain English Explanation

Training modern large language models and other advanced machine learning models is computationally intensive and demands large amounts of data and processing power. To speed up training, researchers often use parallelization techniques such as data parallelism and pipeline parallelism.

However, these traditional approaches have limitations. Data parallelism keeps a full copy of the model on every GPU, so it becomes inefficient, or outright infeasible, once the model's memory requirements exceed the available GPU memory. Pipeline parallelism assumes that compute and memory requirements are evenly distributed across the model layers, which is often not the case in practice.
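Both limitations can be made concrete with a little arithmetic. The sketch below is illustrative only: the 16 bytes per parameter figure assumes mixed-precision Adam training, the step-time formula is a simplified model of a synchronous pipeline, and the function names and numbers are hypothetical rather than taken from the paper.

```python
from typing import Sequence

# Assumption: mixed-precision Adam keeps roughly 16 bytes of model state per
# parameter (fp16 weights and gradients, fp32 master weights, momentum, variance).
BYTES_PER_PARAM = 16


def fits_with_data_parallelism(num_params: float, gpu_memory_gb: float) -> bool:
    """Plain data parallelism stores the full model state on every GPU."""
    needed_gb = num_params * BYTES_PER_PARAM / 1e9
    return needed_gb <= gpu_memory_gb


def pipeline_step_time(stage_times: Sequence[float], num_microbatches: int) -> float:
    """Simplified step time of a synchronous pipeline.

    The first micro-batch passes through every stage once; each further
    micro-batch is paced by the slowest stage.
    """
    return sum(stage_times) + (num_microbatches - 1) * max(stage_times)


# Same total work (4.0 time units per micro-batch), different balance.
balanced = pipeline_step_time([1.0, 1.0, 1.0, 1.0], num_microbatches=8)    # 11.0
unbalanced = pipeline_step_time([0.5, 0.5, 0.5, 2.5], num_microbatches=8)  # 21.5
```

Under these assumptions, a 20-billion-parameter model needs about 320 GB of model state and does not fit on an 80 GB GPU, and the unbalanced pipeline takes almost twice as long as the balanced one even though both do the same total work.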

The "Adaptive Blockwise Task-interleaved Pipeline Parallelism" approach introduced in this paper tries to address these issues. The key idea is to dynamically adjust the block size and interleaving pattern of the pipeline parallelism based on the varying compute and memory requirements of the different model layers.

This allows the system to better utilize the available hardware resources, leading to significant speedups in the training process compared to traditional parallelization methods. It's a bit like having a team of workers where you can flexibly adjust how they work together on different tasks to maximize efficiency, rather than forcing them to work in a rigid, one-size-fits-all way.

Technical Explanation

The paper presents a new parallelization technique called "Adaptive Blockwise Task-interleaved Pipeline Parallelism" that aims to improve on traditional data parallelism and pipeline parallelism approaches.

The key innovation is the adaptive adjustment of the block size and interleaving pattern in the pipeline parallelism. Normally, pipeline parallelism assumes that the compute and memory requirements are evenly distributed across the model layers, but in practice this is often not the case.
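The summary does not spell out what an interleaving pattern looks like. As a rough point of reference, the sketch below shows the round-robin interleaved layout popularized by Megatron-LM, in which each device holds several non-contiguous chunks of layers. It is an assumption that the paper's scheme resembles this; the function names are hypothetical, and the bubble formula is the standard estimate for that schedule (idle time relative to ideal compute time), not a result from this paper.

```python
from typing import Dict, List


def interleaved_assignment(
    num_layers: int, num_devices: int, chunks_per_device: int
) -> Dict[int, List[List[int]]]:
    """Assign layer chunks to devices round-robin (device d gets chunks d, d+p, ...)."""
    total_chunks = num_devices * chunks_per_device
    if num_layers % total_chunks != 0:
        raise ValueError("num_layers must be divisible by the number of chunks")
    layers_per_chunk = num_layers // total_chunks

    assignment: Dict[int, List[List[int]]] = {d: [] for d in range(num_devices)}
    for chunk in range(total_chunks):
        device = chunk % num_devices
        start = chunk * layers_per_chunk
        assignment[device].append(list(range(start, start + layers_per_chunk)))
    return assignment


def bubble_fraction(
    num_devices: int, num_microbatches: int, chunks_per_device: int = 1
) -> float:
    """Idle (bubble) time divided by ideal compute time for this schedule."""
    return (num_devices - 1) / (chunks_per_device * num_microbatches)


layout = interleaved_assignment(num_layers=8, num_devices=2, chunks_per_device=2)
# layout == {0: [[0, 1], [4, 5]], 1: [[2, 3], [6, 7]]}
```

With 4 devices and 8 micro-batches, the estimated bubble fraction drops from 0.375 with one chunk per device to 0.1875 with two, at the cost of more communication between devices.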

The proposed method addresses this by dynamically adjusting the block size and interleaving pattern to better match the varying requirements of the different layers. This allows for more efficient utilization of the available compute resources, leading to significant speedups in the training process compared to traditional parallelization approaches.
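One way to picture matching block sizes to uneven layers is as a partitioning problem: split the layers into contiguous blocks so that the most expensive block is as cheap as possible. The sketch below solves that problem with a small dynamic program. It is a generic illustration under that assumption, not the authors' algorithm, and the per-layer costs are made up.

```python
from itertools import accumulate
from typing import List, Sequence, Tuple


def balanced_partition(costs: Sequence[float], num_stages: int) -> List[Tuple[int, int]]:
    """Split layers into contiguous blocks, minimizing the largest block cost.

    Returns (start, end) index pairs with end exclusive, one per stage.
    """
    n = len(costs)
    if not 1 <= num_stages <= n:
        raise ValueError("num_stages must be between 1 and the number of layers")
    prefix = [0.0] + list(accumulate(costs))
    inf = float("inf")

    # best[k][i]: smallest achievable bottleneck when the first i layers form k blocks.
    best = [[inf] * (n + 1) for _ in range(num_stages + 1)]
    cut = [[0] * (n + 1) for _ in range(num_stages + 1)]
    best[0][0] = 0.0
    for k in range(1, num_stages + 1):
        for i in range(k, n + 1):
            for j in range(k - 1, i):
                candidate = max(best[k - 1][j], prefix[i] - prefix[j])
                if candidate < best[k][i]:
                    best[k][i] = candidate
                    cut[k][i] = j

    # Walk the recorded cut points backwards to recover the blocks.
    blocks: List[Tuple[int, int]] = []
    end = n
    for k in range(num_stages, 0, -1):
        start = cut[k][end]
        blocks.append((start, end))
        end = start
    blocks.reverse()
    return blocks


layer_costs = [1, 1, 1, 1, 4, 4]  # hypothetical per-layer costs
blocks = balanced_partition(layer_costs, num_stages=3)
# blocks == [(0, 4), (4, 5), (5, 6)]
```

For these costs, splitting by equal layer count (two layers per stage) gives stage costs of 2, 5, and 8, whereas the cost-aware split gives 4, 4, and 4, halving the bottleneck stage.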

The authors evaluate their technique on large language models like GPT-3 and T5, as well as other data-intensive models like Transformer-XL and BERT. They show that their "Adaptive Blockwise Task-interleaved Pipeline Parallelism" can achieve up to 2.8x speedup over data parallelism and up to 1.6x speedup over traditional pipeline parallelism.

Critical Analysis

The paper presents a well-designed and thorough evaluation of the proposed "Adaptive Blockwise Task-interleaved Pipeline Parallelism" approach. The authors carefully consider the limitations of existing parallelization techniques and provide a principled solution that addresses these shortcomings.

One potential area for further research could be to explore the interaction between the adaptive block size/interleaving and the overall model architecture. The authors mention that their technique works best for models with varying compute/memory requirements across layers, but it would be interesting to see how it performs on more homogeneous architectures.

Additionally, the paper focuses on training large language models and other data-intensive models, but it's unclear how the technique would scale to even larger models or different domains. Evaluating the approach on a wider range of model types and sizes could provide additional insights.

Overall, this is a well-executed piece of research that makes a meaningful contribution to the field of distributed machine learning. The "Adaptive Blockwise Task-interleaved Pipeline Parallelism" approach represents an important step forward in making large-scale model training more efficient and accessible.

Conclusion

This paper introduces a novel parallelization technique called "Adaptive Blockwise Task-interleaved Pipeline Parallelism" that aims to improve on traditional data parallelism and pipeline parallelism approaches.

The key innovation is the dynamic adjustment of the block size and interleaving pattern in the pipeline parallelism to better match the varying compute and memory requirements of different model layers. This allows for more efficient utilization of available hardware resources, leading to significant speedups in the training process.

Evaluated on large language models and other data-intensive models, the proposed technique demonstrates up to 2.8x speedup over data parallelism and up to 1.6x speedup over traditional pipeline parallelism. This represents an important advancement in making large-scale machine learning more scalable and accessible.

While the paper focuses on training large language models, the underlying principles of the "Adaptive Blockwise Task-interleaved Pipeline Parallelism" approach could potentially be applied to a wider range of data-intensive models and domains, such as scientific computing and drug discovery.



Related Papers

ZeroPP: Unleashing Exceptional Parallelism Efficiency through Tensor-Parallelism-Free Methodology

Ding Tang, Lijuan Jiang, Jiecheng Zhou, Minxi Jin, Hengjie Li, Xingcheng Zhang, Zhilin Pei, Jidong Zhai

Large-scale models rely heavily on 3D parallelism for distributed training, which utilizes tensor parallelism (TP) as the intra-operator parallelism to partition model states across GPUs. However, TP introduces significant communication overheads and complexity in modifying single-GPU code. In this paper, we propose a TP-free distributed framework ZeroPP, which leverages the hybrid of scalable inter-operator pipeline parallelism and intra-operator fully sharded data parallelism to train models at scale, reducing memory consumption and enabling high training efficiency. Through extensive experimentation, we demonstrate that ZeroPP achieves significant performance gains of up to 33% compared to conventional 3D parallelism while maintaining comparable GPU memory consumption.

5/27/2024

Accelerating Large Language Model Training with Hybrid GPU-based Compression

Lang Xu, Quentin Anthony, Qinghua Zhou, Nawras Alnaasan, Radha R. Gulhane, Aamir Shafi, Hari Subramoni, Dhabaleswar K. Panda

Data Parallelism (DP), Tensor Parallelism (TP), and Pipeline Parallelism (PP) are the three strategies widely adopted to enable fast and efficient Large Language Model (LLM) training. However, these approaches rely on data-intensive communication routines to collect, aggregate, and re-distribute gradients, activations, and other important model information, which pose significant overhead. Co-designed with GPU-based compression libraries, MPI libraries have been proven to reduce message size significantly, and leverage interconnect bandwidth, thus increasing training efficiency while maintaining acceptable accuracy. In this work, we investigate the efficacy of compression-assisted MPI collectives under the context of distributed LLM training using 3D parallelism and ZeRO optimizations. We scaled up to 192 V100 GPUs on the Lassen supercomputer. First, we enabled a naive compression scheme across all collectives and observed a 22.5% increase in TFLOPS per GPU and a 23.6% increase in samples per second for GPT-NeoX-20B training. Nonetheless, such a strategy ignores the sparsity discrepancy among messages communicated in each parallelism degree, thus introducing more errors and causing degradation in training loss. Therefore, we incorporated hybrid compression settings toward each parallel dimension and adjusted the compression intensity accordingly. Given their low-rank structure (arXiv:2301.02654), we apply aggressive compression on gradients when performing DP All-reduce. We adopt milder compression to preserve precision while communicating activations, optimizer states, and model parameters in TP and PP. Using the adjusted hybrid compression scheme, we demonstrate a 17.3% increase in TFLOPS per GPU and a 12.7% increase in samples per second while reaching baseline loss convergence.

9/5/2024


A 4D Hybrid Algorithm to Scale Parallel Training to Thousands of GPUs

Siddharth Singh, Prajwal Singhania, Aditya K. Ranjan, Zack Sating, Abhinav Bhatele

Heavy communication, in particular, collective operations, can become a critical performance bottleneck in scaling the training of billion-parameter neural networks to large-scale parallel systems. This paper introduces a four-dimensional (4D) approach to optimize communication in parallel training. This 4D approach is a hybrid of 3D tensor and data parallelism, and is implemented in the AxoNN framework. In addition, we employ two key strategies to further minimize communication overheads. First, we aggressively overlap expensive collective operations (reduce-scatter, all-gather, and all-reduce) with computation. Second, we develop an analytical model to identify high-performing configurations within the large search space defined by our 4D algorithm. This model empowers practitioners by simplifying the tuning process for their specific training workloads. When training an 80-billion parameter GPT on 1024 GPUs of Perlmutter, AxoNN surpasses Megatron-LM, a state-of-the-art framework, by a significant 26%. Additionally, it achieves a significantly high 57% of the theoretical peak FLOP/s or 182 PFLOP/s in total.

5/15/2024


A Unified Sequence Parallelism Approach for Long Context Generative AI

Jiarui Fang, Shangchun Zhao

Sequence parallelism (SP), which divides the sequence dimension of input tensors across multiple computational devices, is becoming key to unlocking the long-context capabilities of generative AI models. This paper investigates the state-of-the-art SP approaches, i.e. DeepSpeed-Ulysses and Ring-Attention, and proposes a unified SP approach, which is more robust to transformer model architectures and network hardware topology. This paper compares the communication and memory cost of SP and existing parallelism, including data/tensor/zero/pipeline parallelism, and discusses the best practices for designing hybrid 4D parallelism involving SP. We achieved 47% MFU on two 8xA800 nodes using SP for the LLAMA3-8B model training using sequence length 208K. Our code is publicly available at https://github.com/feifeibear/long-context-attention.

5/24/2024