Dataset Decomposition: Faster LLM Training with Variable Sequence Length Curriculum

Read original: arXiv:2405.13226 - Published 5/24/2024 by Hadi Pouransari, Chun-Liang Li, Jen-Hao Rick Chang, Pavan Kumar Anasosalu Vasu, Cem Koc, Vaishaal Shankar, Oncel Tuzel

Overview

Large language models (LLMs) are commonly trained on datasets of fixed-length token sequences.
These datasets are created by randomly concatenating documents and then chunking the result into sequences of a predetermined length.
This method can lead to cross-document attention within a sequence, which is neither a useful learning signal nor computationally efficient.
Training on long sequences is also computationally prohibitive due to the quadratic cost of attention.
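
The concat-and-chunk baseline described in the points above can be sketched in a few lines. This is a minimal illustration on toy token lists, not the authors' pipeline; the helper names `concat_and_chunk` and `count_mixed_chunks` are hypothetical, and dropping the trailing remainder is an assumption of this sketch.

```python
import random

def concat_and_chunk(docs, seq_len, seed=0):
    """Shuffle documents, concatenate their tokens, and cut fixed-length chunks.

    Each token is tagged with the index of its source document so that
    chunks mixing several documents can be detected afterwards.
    """
    order = list(range(len(docs)))
    random.Random(seed).shuffle(order)
    stream = [(tok, doc_id) for doc_id in order for tok in docs[doc_id]]
    # Assumption: the trailing remainder that does not fill a chunk is dropped.
    n_chunks = len(stream) // seq_len
    return [stream[i * seq_len:(i + 1) * seq_len] for i in range(n_chunks)]

def count_mixed_chunks(chunks):
    """Count chunks that contain tokens from more than one document."""
    return sum(1 for chunk in chunks if len({doc_id for _, doc_id in chunk}) > 1)

# Four toy "documents" of 5, 3, 9, and 7 tokens (24 tokens in total).
docs = [list(range(5)), list(range(3)), list(range(9)), list(range(7))]
chunks = concat_and_chunk(docs, seq_len=8)
print(len(chunks), "chunks,", count_mixed_chunks(chunks), "span multiple documents")
```

Because document boundaries rarely line up with chunk boundaries, most chunks end up holding tokens from unrelated documents, which is where cross-document attention comes from.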

Plain English Explanation

Large AI language models are typically trained on datasets of text that have been chopped up into small, uniform-length pieces. This is done by taking many different documents, mashing them together randomly, and then cutting the resulting jumble into sequences of a fixed length.

However, this approach has some issues. When a sequence includes content from multiple documents, the model can end up "paying attention" to connections between those documents, even though that's not really what we want it to learn. This cross-document attention is both unhelpful and computationally expensive.

Additionally, training models on very long sequences of text is computationally costly, because the cost of the attention mechanism grows with the square of the sequence length.
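
To make the quadratic cost concrete, here is a rough back-of-the-envelope sketch. It counts only pairwise token interactions in attention and ignores every other part of the model, so the numbers are relative units rather than real FLOP counts; the function names are hypothetical.

```python
def attention_cost(seq_len):
    """Relative attention cost of one sequence: number of token pairs."""
    return seq_len ** 2

def cost_for_budget(total_tokens, seq_len):
    """Relative attention cost of total_tokens split into sequences of seq_len."""
    n_sequences = total_tokens // seq_len
    return n_sequences * attention_cost(seq_len)

# Same token budget, two different sequence lengths.
budget = 8192
for seq_len in (2048, 8192):
    print(seq_len, cost_for_budget(budget, seq_len))
```

For a fixed number of tokens, the attention cost in this simplified model grows linearly with the sequence length, so processing the same tokens as one 8192-token sequence costs four times as much as four 2048-token sequences.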

Technical Explanation

To address these challenges, the researchers introduce a new technique called "dataset decomposition." Instead of randomly concatenating documents, they decompose the dataset into a union of "buckets," where each bucket contains sequences of the same length and each sequence is extracted from a single document.
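
One way to picture the decomposition is the sketch below. It assumes bucket lengths are powers of two and splits each document according to the binary representation of its length; that choice, and the helper names `decompose_document` and `build_buckets`, are assumptions made for illustration rather than details taken from this summary.

```python
from collections import defaultdict

def decompose_document(tokens):
    """Split one document into pieces whose lengths are distinct powers of two.

    The lengths follow the binary representation of len(tokens), so no token
    is dropped and no piece mixes documents. Power-of-two lengths are an
    assumption of this sketch.
    """
    pieces = []
    start = 0
    remaining = len(tokens)
    while remaining > 0:
        length = 1 << (remaining.bit_length() - 1)  # largest power of two <= remaining
        pieces.append(tokens[start:start + length])
        start += length
        remaining -= length
    return pieces

def build_buckets(docs):
    """Group pieces from all documents into buckets keyed by piece length."""
    buckets = defaultdict(list)
    for doc in docs:
        for piece in decompose_document(doc):
            buckets[len(piece)].append(piece)
    return dict(buckets)

# Toy documents of 13, 6, and 8 tokens: 13 = 8 + 4 + 1, 6 = 4 + 2, 8 = 8.
docs = [list(range(13)), list(range(6)), list(range(8))]
buckets = build_buckets(docs)
for length in sorted(buckets):
    print(length, len(buckets[length]))
```

Every sequence in a bucket has the same length and comes from exactly one document, so a batch drawn from a single bucket needs neither padding nor a cross-document attention mask.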

During training, batches are drawn from these buckets with a variable sequence length and batch size, sampling from all buckets according to a curriculum, rather than using fixed-length sequences. This avoids cross-document attention and reduces the overall computational cost: the attention cost at each step is proportional to the actual lengths of the sequences being processed, rather than to a fixed maximum length.
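
A rough sketch of variable-length batch sampling follows. It assumes each step picks one bucket and sizes the batch so that the number of tokens per step stays constant; the function `sample_batch`, the toy buckets, and the uniform weights are all hypothetical choices for illustration.

```python
import random

def sample_batch(buckets, weights, tokens_per_batch, rng):
    """Pick one bucket, then draw a batch holding about tokens_per_batch tokens."""
    lengths = sorted(buckets)
    seq_len = rng.choices(lengths, weights=[weights[n] for n in lengths], k=1)[0]
    # Shorter sequences get a larger batch so tokens per step stay constant.
    batch_size = max(1, tokens_per_batch // seq_len)
    # Sampling with replacement keeps the sketch short; a real loader would
    # more likely iterate over each bucket without repeats.
    batch = [rng.choice(buckets[seq_len]) for _ in range(batch_size)]
    return seq_len, batch

# Toy buckets: ten dummy sequences each of length 4, 8, and 16.
buckets = {n: [[0] * n for _ in range(10)] for n in (4, 8, 16)}
weights = {4: 1.0, 8: 1.0, 16: 1.0}
rng = random.Random(0)
for step in range(3):
    seq_len, batch = sample_batch(buckets, weights, tokens_per_batch=16, rng=rng)
    print(step, seq_len, len(batch))
```

Each batch is a rectangular block of `batch_size` by `seq_len` tokens, so steps that draw short sequences pay a correspondingly small attention cost.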

The researchers report training an 8k-context-length, 1B-parameter model at the same computational cost as a 2k-context-length model trained with the traditional concat-and-chunk method. Experiments on a web-scale corpus show that the technique significantly improves performance on standard language evaluations and long-context benchmarks, reaches target accuracy 3x faster than the baseline, and scales effectively with dataset size.

The paper also highlights the importance of the distribution and curriculum of sequence lengths during training, which can have a non-trivial impact on model performance.
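
As an illustration of what a sequence-length curriculum could look like, the sketch below shifts sampling weight from short buckets to long buckets as training progresses. The schedule is entirely hypothetical and is not the one used in the paper; it only shows how a curriculum can be expressed as time-varying bucket weights.

```python
def curriculum_weights(lengths, progress):
    """Sampling weights over bucket lengths at training progress in [0, 1].

    At progress 0 a bucket's weight is proportional to 1 / length (short
    sequences favored); at progress 1 it is proportional to length.
    This interpolation is a hypothetical schedule for illustration only.
    """
    exponent = 2.0 * progress - 1.0  # -1 at the start, +1 at the end
    raw = [length ** exponent for length in lengths]
    total = sum(raw)
    return [w / total for w in raw]

lengths = [256, 512, 1024, 2048]
for progress in (0.0, 0.5, 1.0):
    print(progress, [round(w, 3) for w in curriculum_weights(lengths, progress)])
```

Weights like these could be recomputed every step and used to choose which bucket the next batch is drawn from.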

Critical Analysis

The researchers present a thoughtful and well-designed solution to the challenges of training large language models on long sequences of text. By decomposing the dataset and using variable-length sequences, they are able to address the issues of cross-document attention and computational efficiency.

However, one potential limitation of the approach is that it may require more careful engineering and data preparation upfront, as the dataset needs to be organized into the appropriate buckets. This could add complexity and overhead compared to the simpler concatenation-and-chunking method.

Additionally, the paper does not appear to explore the potential drawbacks of variable-length training, such as negative effects on model performance or generalization. Further research may be needed to understand the tradeoffs and edge cases.

Overall, the researchers have made a valuable contribution to the field of large language model training, and their work highlights the importance of carefully designing the training data and process to optimize both performance and efficiency.

Conclusion

The paper introduces a novel dataset decomposition technique that addresses key challenges in training large language models on long sequences of text. By using variable-length sequences and a curriculum-based sampling approach, the researchers are able to avoid cross-document attention and significantly reduce the computational cost of training.

The results show that this method leads to substantial improvements in model performance on standard language tasks and long-context benchmarks, while also scaling effectively as the dataset size increases. The work underscores the importance of thoughtful dataset design and training approaches for large language models, and provides a promising direction for future research in this area.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Dataset Decomposition: Faster LLM Training with Variable Sequence Length Curriculum

Hadi Pouransari, Chun-Liang Li, Jen-Hao Rick Chang, Pavan Kumar Anasosalu Vasu, Cem Koc, Vaishaal Shankar, Oncel Tuzel

Large language models (LLMs) are commonly trained on datasets consisting of fixed-length token sequences. These datasets are created by randomly concatenating documents of various lengths and then chunking them into sequences of a predetermined target length. However, this method of concatenation can lead to cross-document attention within a sequence, which is neither a desirable learning signal nor computationally efficient. Additionally, training on long sequences becomes computationally prohibitive due to the quadratic cost of attention. In this study, we introduce dataset decomposition, a novel variable sequence length training technique, to tackle these challenges. We decompose a dataset into a union of buckets, each containing sequences of the same size extracted from a unique document. During training, we use variable sequence length and batch size, sampling simultaneously from all buckets with a curriculum. In contrast to the concat-and-chunk baseline, which incurs a fixed attention cost at every step of training, our proposed method incurs a penalty proportional to the actual document lengths at each step, resulting in significant savings in training time. We train an 8k context-length 1B model at the same cost as a 2k context-length model trained with the baseline approach. Experiments on a web-scale corpus demonstrate that our approach significantly enhances performance on standard language evaluations and long-context benchmarks, reaching target accuracy 3x faster compared to the baseline. Our method not only enables efficient pretraining on long sequences but also scales effectively with dataset size. Lastly, we shed light on a critical yet less studied aspect of training large language models: the distribution and curriculum of sequence lengths, which results in a non-negligible difference in performance.

5/24/2024

Bucket Pre-training is All You Need

Hongtao Liu, Qiyao Peng, Qing Yang, Kai Liu, Hongyan Xu

Large language models (LLMs) have demonstrated exceptional performance across various natural language processing tasks. However, the conventional fixed-length data composition strategy for pretraining, which involves concatenating and splitting documents, can introduce noise and limit the model's ability to capture long-range dependencies. To address this, we first introduce three metrics for evaluating data composition quality: padding ratio, truncation ratio, and concatenation ratio. We further propose a multi-bucket data composition method that moves beyond the fixed-length paradigm, offering a more flexible and efficient approach to pretraining. Extensive experiments demonstrate that our proposed method can significantly improve both the efficiency and efficacy of LLM pretraining. Our approach not only reduces noise and preserves context but also accelerates training, making it a promising solution for LLM pretraining.

7/11/2024

XL3M: A Training-free Framework for LLM Length Extension Based on Segment-wise Inference

Shengnan Wang, Youhui Bai, Lin Zhang, Pingyi Zhou, Shixiong Zhao, Gong Zhang, Sen Wang, Renhai Chen, Hua Xu, Hongwei Sun

The length generalization failure problem, namely that a large language model (LLM) fails to generalize to texts longer than its maximum training length, greatly restricts the application of LLMs in scenarios with streaming long inputs. To address this problem, existing methods either require substantial costs or introduce precision loss. In this paper, we empirically find that the accuracy of the LLM's prediction is highly correlated with its certainty. Based on this, we propose an efficient training-free framework, named XL3M (meaning extra-long large language model), which enables LLMs trained on short sequences to reason over extremely long sequences without any further training or fine-tuning. Under the XL3M framework, the input context is first decomposed into multiple short sub-contexts, where each sub-context contains an independent segment and a common "question", which is a few tokens from the end of the original context. Then XL3M gives a method to measure the relevance between each segment and the "question", and constructs a concise key context by splicing all the relevant segments in chronological order. The key context is then used instead of the original context to complete the inference task. Evaluations on comprehensive benchmarks show the superiority of XL3M. Using our framework, a Llama2-7B model is able to reason over 20M-long sequences on an 8-card Huawei Ascend 910B NPU machine with 64GB memory per card.

5/29/2024

Beyond the Limits: A Survey of Techniques to Extend the Context Length in Large Language Models

Xindi Wang, Mahsa Salmani, Parsa Omidi, Xiangyu Ren, Mehdi Rezagholizadeh, Armaghan Eshaghi

Recently, large language models (LLMs) have shown remarkable capabilities including understanding context, engaging in logical reasoning, and generating responses. However, this is achieved at the expense of stringent computational and memory requirements, hindering their ability to effectively support long input sequences. This survey provides an inclusive review of the recent techniques and methods devised to extend the sequence length in LLMs, thereby enhancing their capacity for long-context understanding. In particular, we review and categorize a wide range of techniques including architectural modifications, such as modified positional encoding and altered attention mechanisms, which are designed to enhance the processing of longer sequences while avoiding a proportional increase in computational requirements. The diverse methodologies investigated in this study can be leveraged across different phases of LLMs, i.e., training, fine-tuning and inference. This enables LLMs to efficiently process extended sequences. The limitations of the current methodologies are discussed in the last section along with suggestions for future research directions, underscoring the importance of sequence length in the continued advancement of LLMs.

5/30/2024