A Unified Sequence Parallelism Approach for Long Context Generative AI

Read original: arXiv:2405.07719 - Published 5/24/2024 by Jiarui Fang, Shangchun Zhao

🤖

Overview

This paper investigates the state-of-the-art sequence parallelism (SP) approaches, such as DeepSpeed-Ulysses and Ring-Attention.
It proposes a unified SP approach that is more robust to transformer model architectures and network hardware topology.
The paper compares the communication and memory cost of SP and existing parallelism techniques, including data/tensor/zero/expert/pipeline parallelism.
It discusses best practices for designing hybrid 4D parallelism involving SP.
The authors achieved 86% MFU (Micro-Factored Utilization) on two 8xA800 nodes using SP for a sequence length of 208K for the LLAMA3-8B model.

Plain English Explanation

Sequence parallelism (SP) is a technique that can help unlock the long-context capabilities of generative AI models. It works by dividing the sequence dimension of input tensors (the data used to train the model) across multiple computational devices, like GPUs or CPUs.

This paper looks at the current best ways to do SP, called DeepSpeed-Ulysses and Ring-Attention. It then proposes a new, more flexible approach that works better with different transformer model designs and computer hardware setups.

The paper also compares how much communication and memory is needed for SP versus other parallelism techniques, like data parallelism, tensor parallelism, and pipeline parallelism. It provides guidance on how to best combine these different parallelism approaches to get the most efficient performance.

The researchers were able to achieve 86% efficiency (MFU) on a large language model with a very long sequence length of 208,000 using their SP approach on two powerful 8-GPU nodes. This is a significant improvement and shows the potential of SP for training large, long-context AI models.

Technical Explanation

The paper investigates the state-of-the-art sequence parallelism (SP) approaches, namely DeepSpeed-Ulysses and Ring-Attention, and proposes a unified SP approach. This unified approach is designed to be more robust to different transformer model architectures and network hardware topologies.

The authors compare the communication and memory costs of SP against existing parallelism techniques, including data parallelism, tensor parallelism, zero parallelism, expert parallelism, and pipeline parallelism. They discuss the best practices for designing hybrid 4D parallelism (a combination of these approaches) that incorporates SP.

In their experiments, the researchers achieved 86% MFU (Micro-Factored Utilization) on two 8xA800 nodes using SP for a sequence length of 208K for the LLAMA3-8B model. This demonstrates the effectiveness of their unified SP approach in efficiently training large language models with long input sequences.

Critical Analysis

The paper provides a thorough investigation of sequence parallelism and its potential benefits for training large generative AI models. The proposed unified SP approach seems promising, as it aims to be more robust to different model architectures and hardware setups compared to existing SP methods.

However, the paper does not delve into the potential limitations or caveats of this approach. It would be helpful to understand any scenarios or model types where the unified SP method may not perform as well, or any potential trade-offs that users need to consider when implementing it.

Additionally, the paper could have provided more details on the specific hardware and software configurations used in the experiments, as well as the benchmarking methodologies employed. This would allow readers to better understand the context and reproducibility of the reported results.

Overall, the research presented in this paper is a valuable contribution to the field of large language model training. The authors' work on sequence parallelism and hybrid parallelism strategies could have significant implications for the development of more efficient and capable generative AI systems in the future.

Conclusion

This paper introduces a unified sequence parallelism (SP) approach that aims to be more robust to different transformer model architectures and network hardware topologies. By dividing the sequence dimension of input tensors across multiple devices, SP can help unlock the long-context capabilities of large generative AI models.

The researchers compare SP to other parallelism techniques, such as data, tensor, and pipeline parallelism, and provide guidance on designing hybrid parallelism strategies. They demonstrate the effectiveness of their unified SP approach by achieving 86% efficiency on a large language model with a sequence length of 208,000, a significant improvement over previous methods.

The work presented in this paper represents an important step forward in the development of efficient training techniques for large, long-context AI models. As the field of generative AI continues to advance, innovations like this in sequence parallelism and long-sequence modeling will be crucial for pushing the boundaries of what these models can achieve.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Follow @aimodelsfyi on 𝕏 →

Related Papers

🤖

A Unified Sequence Parallelism Approach for Long Context Generative AI

Jiarui Fang, Shangchun Zhao

Sequence parallelism (SP), which divides the sequence dimension of input tensors across multiple computational devices, is becoming key to unlocking the long-context capabilities of generative AI models. This paper investigates the state-of-the-art SP approaches, i.e. DeepSpeed-Ulysses and Ring-Attention, and proposes a unified SP approach, which is more robust to transformer model architectures and network hardware topology. This paper compares the communication and memory cost of SP and existing parallelism, including data/tensor/zero/pipeline parallelism, and discusses the best practices for designing hybrid 4D parallelism involving SP. We achieved 47% MFU on two 8xA800 nodes using SP for the LLAMA3-8B model training using sequence length 208K. Our code is publicly available at https://github.com/feifeibear/long-context-attention.

5/24/2024

Linear Attention Sequence Parallelism

Weigao Sun, Zhen Qin, Dong Li, Xuyang Shen, Yu Qiao, Yiran Zhong

Sequence Parallel (SP) serves as a prevalent strategy to handle long sequences that exceed the memory limit of a single GPU. However, existing SP methods do not take advantage of linear attention features, resulting in sub-optimal parallelism efficiency and usability for linear attention-based language models. In this paper, we introduce Linear Attention Sequence Parallel (LASP), an efficient SP method tailored to linear attention-based language models. Specifically, we design an efficient point-to-point communication mechanism to leverage the right-product kernel trick of linear attention, which sharply decreases the communication overhead of SP. We also enhance the practical efficiency of LASP by performing kernel fusion and intermediate state caching, making the implementation of LASP hardware-friendly on GPU clusters. Furthermore, we meticulously ensure the compatibility of sequence-level LASP with all types of batch-level data parallel methods, which is vital for distributed training on large clusters with long sequences and large batches. We conduct extensive experiments on two linear attention-based models with varying sequence lengths and GPU cluster sizes. LASP scales sequence length up to 4096K using 128 A100 80G GPUs on 1B models, which is 8 times longer than existing SP methods while being significantly faster. The code is available at https://github.com/OpenNLPLab/LASP.

4/4/2024

DSP: Dynamic Sequence Parallelism for Multi-Dimensional Transformers

Xuanlei Zhao, Shenggan Cheng, Chang Chen, Zangwei Zheng, Ziming Liu, Zheming Yang, Yang You

Scaling multi-dimensional transformers to long sequences is indispensable across various domains. However, the challenges of large memory requirements and slow speeds of such sequences necessitate sequence parallelism. All existing approaches fall under the category of embedded sequence parallelism, which are limited to shard along a single sequence dimension, thereby introducing significant communication overhead. However, the nature of multi-dimensional transformers involves independent calculations across multiple sequence dimensions. To this end, we propose Dynamic Sequence Parallelism (DSP) as a novel abstraction of sequence parallelism. DSP dynamically switches the parallel dimension among all sequences according to the computation stage with efficient resharding strategy. DSP offers significant reductions in communication costs, adaptability across modules, and ease of implementation with minimal constraints. Experimental evaluations demonstrate DSP's superiority over state-of-the-art embedded sequence parallelism methods by remarkable throughput improvements ranging from 32.2% to 10x, with less than 25% communication volume.

8/27/2024

LoongTrain: Efficient Training of Long-Sequence LLMs with Head-Context Parallelism

Diandian Gu, Peng Sun, Qinghao Hu, Ting Huang, Xun Chen, Yingtong Xiong, Guoteng Wang, Qiaoling Chen, Shangchun Zhao, Jiarui Fang, Yonggang Wen, Tianwei Zhang, Xin Jin, Xuanzhe Liu

Efficiently training LLMs with long sequences is important yet challenged by the massive computation and memory requirements. Sequence parallelism has been proposed to tackle these problems, but existing methods suffer from scalability or efficiency issues. We propose LoongTrain, a novel system to efficiently train LLMs with long sequences at scale. The core of LoongTrain is the 2D-Attention mechanism, which combines both head-parallel and context-parallel techniques to break the scalability constraints while maintaining efficiency. We introduce Double-Ring-Attention and analyze the performance of device placement strategies to further speed up training. We implement LoongTrain with the hybrid ZeRO and Selective Checkpoint++ techniques. Experiment results show that LoongTrain outperforms state-of-the-art baselines, i.e., DeepSpeed-Ulysses and Megatron Context Parallelism, in both end-to-end training speed and scalability, and improves Model FLOPs Utilization (MFU) by up to 2.88x.

6/27/2024