LoongTrain: Efficient Training of Long-Sequence LLMs with Head-Context Parallelism

arXiv:2406.18485 - Published 6/27/2024 by Diandian Gu, Peng Sun, Qinghao Hu, Ting Huang, Xun Chen, Yingtong Xiong, Guoteng Wang, Qiaoling Chen, Shangchun Zhao, Jiarui Fang, Yonggang Wen, Tianwei Zhang, Xin Jin, Xuanzhe Liu


Overview

Introduces LoongTrain, a system for efficiently training large language models (LLMs) on long sequences
Combines head parallelism and context parallelism in a 2D-Attention mechanism to overcome the scalability limits of existing sequence parallelism
Reports faster end-to-end training and better scalability than DeepSpeed-Ulysses and Megatron Context Parallelism, with Model FLOPs Utilization (MFU) improved by up to 2.88x

Plain English Explanation

LoongTrain is a new way to train large language models, which are AI systems that can understand and generate human-like text. These models are often trained on huge amounts of text data, which can take a very long time. LoongTrain aims to make this training process more efficient, especially for models that need to understand long sequences of text.

The key idea behind LoongTrain is to split the training of the model into different parts, and have these parts work together in parallel. This allows the model to learn from long sequences of text more quickly. LoongTrain does this by dividing the model's "attention" mechanism, which is how it understands the relationships between different parts of the text, into separate chunks that can be processed simultaneously.

By using this head-context parallelism approach, LoongTrain trains large language models on long sequences faster than the baseline systems it was compared against. This could be especially useful for real-world applications that require understanding long-form text, such as LLMs that need to be scaled to, or served with, long contexts.

Technical Explanation

The LoongTrain approach builds on previous work in distributed training and sequence parallelism for training LLMs on long sequences. It introduces a novel head-context parallelism technique that allows the model's attention mechanism to be split and processed in parallel across multiple devices.
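To make the "split and processed in parallel" idea concrete, here is a minimal pure-Python sketch of the generic building block behind ring-style context parallelism: attention for one query is computed separately over chunks of the keys and values, and the partial results are merged with a rescaling step. This is not LoongTrain's implementation; the function names (`attention_partial`, `merge`, `attention_chunked`) are hypothetical, and each chunk merely stands in for the slice of the sequence that one device would hold.

```python
import math

def attention_full(q, keys, values):
    """Reference: softmax attention for one query over all keys/values."""
    scores = [sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total
            for d in range(dim)]

def attention_partial(q, keys, values):
    """Unnormalized attention over one chunk.

    Returns (local max score, sum of weights, weighted sum of values).
    """
    scores = [sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    dim = len(values[0])
    acc = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return m, sum(weights), acc

def merge(a, b):
    """Combine two partial results by rescaling both to a shared maximum."""
    m_a, l_a, acc_a = a
    m_b, l_b, acc_b = b
    m = max(m_a, m_b)
    scale_a, scale_b = math.exp(m_a - m), math.exp(m_b - m)
    total = l_a * scale_a + l_b * scale_b
    acc = [x * scale_a + y * scale_b for x, y in zip(acc_a, acc_b)]
    return m, total, acc

def attention_chunked(q, keys, values, num_chunks):
    """Attention computed chunk by chunk (assumption: one chunk per device)."""
    size = math.ceil(len(keys) / num_chunks)
    partials = [attention_partial(q, keys[i:i + size], values[i:i + size])
                for i in range(0, len(keys), size)]
    state = partials[0]
    for part in partials[1:]:
        state = merge(state, part)
    _, total, acc = state
    return [x / total for x in acc]

# Example inputs (made up for illustration).
q = [0.5, -1.0]
keys = [[1.0, 0.0], [0.0, 1.0], [2.0, -1.0], [-1.0, 0.5], [0.3, 0.3], [1.5, 1.5]]
values = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0], [9.0, 10.0], [11.0, 12.0]]
full = attention_full(q, keys, values)
chunked = attention_chunked(q, keys, values, num_chunks=3)
```

Because the merge step is exact, `chunked` agrees with `full` up to floating-point rounding, which is why the sequence can be divided across devices without changing the result of attention.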

Specifically, LoongTrain's 2D-Attention mechanism partitions the attention computation along two dimensions: the attention heads (each of which learns to focus on different relationships in the input) and the context, meaning the positions in the input sequence. Spreading both dimensions across multiple GPUs lets the system process different heads and different parts of the sequence simultaneously. The paper also introduces Double-Ring-Attention and analyzes device placement strategies to speed up training further.
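The two-dimensional split can be pictured as a grid of devices, where one axis indexes groups of attention heads and the other indexes chunks of the sequence. The sketch below is an illustration under stated assumptions, not the paper's code: the function name `partition_2d`, the row-major mapping from rank to grid position, and the requirement that sizes divide evenly are all choices made here for simplicity. It does not model the device placement strategies the paper analyzes.

```python
def partition_2d(num_heads, seq_len, head_parallel, context_parallel):
    """Assign each device rank a group of heads and a chunk of the sequence.

    Assumption: ranks are laid out row-major on a
    (head_parallel x context_parallel) grid.
    """
    if num_heads % head_parallel or seq_len % context_parallel:
        raise ValueError("sizes must divide evenly in this sketch")
    heads_per_group = num_heads // head_parallel
    tokens_per_chunk = seq_len // context_parallel
    layout = {}
    for rank in range(head_parallel * context_parallel):
        # Row index selects the head group, column index the sequence chunk.
        hp_idx, cp_idx = divmod(rank, context_parallel)
        heads = range(hp_idx * heads_per_group, (hp_idx + 1) * heads_per_group)
        tokens = range(cp_idx * tokens_per_chunk, (cp_idx + 1) * tokens_per_chunk)
        layout[rank] = (heads, tokens)
    return layout

# Example: 8 heads and 1,024 tokens on a 2 x 4 grid of 8 devices.
layout = partition_2d(num_heads=8, seq_len=1024, head_parallel=2, context_parallel=4)
heads, tokens = layout[5]
print(f"rank 5: heads {heads.start}-{heads.stop - 1}, "
      f"tokens {tokens.start}-{tokens.stop - 1}")
```

With a head-only split, the number of devices cannot exceed the number of heads; adding the second axis multiplies the usable device count by the number of sequence chunks, which is one way to read the abstract's claim that combining the two techniques breaks the scalability constraint.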

The authors evaluate LoongTrain against state-of-the-art baselines, namely DeepSpeed-Ulysses and Megatron Context Parallelism. They report that LoongTrain outperforms both in end-to-end training speed and scalability, and that it improves Model FLOPs Utilization (MFU) by up to 2.88x.

Critical Analysis

The LoongTrain paper presents a promising approach for efficiently training large language models on long sequences of text. The head-context parallelism technique is a novel contribution that builds on previous work in distributed training and sequence parallelism.

One potential limitation of the approach is that it may be specific to the attention mechanism used in transformer-based models, and may not generalize as well to other model architectures. Additionally, the paper does not explore the impact of the parallelization strategy on model convergence or the quality of the learned representations.

Further research could investigate the scalability of LoongTrain to even larger models and datasets, as well as its performance on a wider range of long-context language tasks. Comparisons to other recently proposed approaches, such as Megalodon, could also provide valuable insights.

Overall, LoongTrain represents an important step forward in addressing the computational challenges of training large language models on long sequences of text, with potential applications in long-context scaling and long-context serving.

Conclusion

The LoongTrain paper introduces a novel approach for efficiently training large language models on long sequences of text. By combining head and context parallelism, LoongTrain significantly speeds up training compared with the baselines it was evaluated against.

This research represents an important contribution to the ongoing efforts to scale up language models and make them more practical for real-world applications that require understanding long-form text. The head-context parallelism technique could also have broader implications for the design of efficient training algorithms for other types of neural networks.

Overall, LoongTrain is a promising step forward in the field of large language model training, with the potential to unlock new capabilities and applications for these powerful AI systems.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!


Related Papers

LoongTrain: Efficient Training of Long-Sequence LLMs with Head-Context Parallelism

Diandian Gu, Peng Sun, Qinghao Hu, Ting Huang, Xun Chen, Yingtong Xiong, Guoteng Wang, Qiaoling Chen, Shangchun Zhao, Jiarui Fang, Yonggang Wen, Tianwei Zhang, Xin Jin, Xuanzhe Liu

Efficiently training LLMs with long sequences is important yet challenged by the massive computation and memory requirements. Sequence parallelism has been proposed to tackle these problems, but existing methods suffer from scalability or efficiency issues. We propose LoongTrain, a novel system to efficiently train LLMs with long sequences at scale. The core of LoongTrain is the 2D-Attention mechanism, which combines both head-parallel and context-parallel techniques to break the scalability constraints while maintaining efficiency. We introduce Double-Ring-Attention and analyze the performance of device placement strategies to further speed up training. We implement LoongTrain with the hybrid ZeRO and Selective Checkpoint++ techniques. Experiment results show that LoongTrain outperforms state-of-the-art baselines, i.e., DeepSpeed-Ulysses and Megatron Context Parallelism, in both end-to-end training speed and scalability, and improves Model FLOPs Utilization (MFU) by up to 2.88x.

6/27/2024


InfLLM: Training-Free Long-Context Extrapolation for LLMs with an Efficient Context Memory

Chaojun Xiao, Pengle Zhang, Xu Han, Guangxuan Xiao, Yankai Lin, Zhengyan Zhang, Zhiyuan Liu, Maosong Sun

Large language models (LLMs) have emerged as a cornerstone in real-world applications with lengthy streaming inputs (e.g., LLM-driven agents). However, existing LLMs, pre-trained on sequences with a restricted maximum length, cannot process longer sequences due to the out-of-domain and distraction issues. Common solutions often involve continual pre-training on longer sequences, which will introduce expensive computational overhead and uncontrollable change in model capabilities. In this paper, we unveil the intrinsic capacity of LLMs for understanding extremely long sequences without any fine-tuning. To this end, we introduce a training-free memory-based method, InfLLM. Specifically, InfLLM stores distant contexts into additional memory units and employs an efficient mechanism to lookup token-relevant units for attention computation. Thereby, InfLLM allows LLMs to efficiently process long sequences with a limited context window and well capture long-distance dependencies. Without any training, InfLLM enables LLMs that are pre-trained on sequences consisting of a few thousand tokens to achieve comparable performance with competitive baselines that continually train these LLMs on long sequences. Even when the sequence length is scaled to 1,024K, InfLLM still effectively captures long-distance dependencies. Our code can be found at https://github.com/thunlp/InfLLM.

5/29/2024

Efficiently Training 7B LLM with 1 Million Sequence Length on 8 GPUs

Pinxue Zhao, Hailin Zhang, Fangcheng Fu, Xiaonan Nie, Qibin Liu, Fang Yang, Yuanbo Peng, Dian Jiao, Shuaipeng Li, Jinbao Xue, Yangyu Tao, Bin Cui

Nowadays, Large Language Models (LLMs) have been trained using extended context lengths to foster more creative applications. However, long context training poses great challenges considering the constraint of GPU memory. It not only leads to substantial activation memory consumption during training, but also incurs considerable memory fragmentation. To facilitate long context training, existing frameworks have adopted strategies such as recomputation and various forms of parallelisms. Nevertheless, these techniques rely on redundant computation or extensive communication, resulting in low Model FLOPS Utilization (MFU). In this paper, we propose MEMO, a novel LLM training framework designed for fine-grained activation memory management. Given the quadratic scaling of computation and linear scaling of memory with sequence lengths when using FlashAttention, we offload memory-consuming activations to CPU memory after each layer's forward pass and fetch them during the backward pass. To maximize the swapping of activations without hindering computation, and to avoid exhausting limited CPU memory, we implement a token-wise activation recomputation and swapping mechanism. Furthermore, we tackle the memory fragmentation issue by employing a bi-level Mixed Integer Programming (MIP) approach, optimizing the reuse of memory across transformer layers. Empirical results demonstrate that MEMO achieves an average of 2.42x and 2.26x MFU compared to Megatron-LM and DeepSpeed, respectively. This improvement is attributed to MEMO's ability to minimize memory fragmentation, reduce recomputation and intensive communication, and circumvent the delays associated with the memory reorganization process due to fragmentation. By leveraging fine-grained activation memory management, MEMO facilitates efficient training of 7B LLM with 1 million sequence length on just 8 A800 GPUs, achieving an MFU of 52.30%.

7/18/2024


A Unified Sequence Parallelism Approach for Long Context Generative AI

Jiarui Fang, Shangchun Zhao

Sequence parallelism (SP), which divides the sequence dimension of input tensors across multiple computational devices, is becoming key to unlocking the long-context capabilities of generative AI models. This paper investigates the state-of-the-art SP approaches, i.e. DeepSpeed-Ulysses and Ring-Attention, and proposes a unified SP approach, which is more robust to transformer model architectures and network hardware topology. This paper compares the communication and memory cost of SP and existing parallelism, including data/tensor/zero/pipeline parallelism, and discusses the best practices for designing hybrid 4D parallelism involving SP. We achieved 47% MFU on two 8xA800 nodes using SP for the LLAMA3-8B model training using sequence length 208K. Our code is publicly available at https://github.com/feifeibear/long-context-attention.

5/24/2024