LongSkywork: A Training Recipe for Efficiently Extending Context Length in Large Language Models

2406.00605

Published 6/4/2024 by Liang Zhao, Tianwen Wei, Liang Zeng, Cheng Cheng, Liu Yang, Peng Cheng, Lijie Wang, Chenxia Li, Xuejie Wu, Bo Zhu and 5 others

cs.CL cs.AI

LongSkywork: A Training Recipe for Efficiently Extending Context Length in Large Language Models

Abstract

We introduce LongSkywork, a long-context Large Language Model (LLM) capable of processing up to 200,000 tokens. We provide a training recipe for efficiently extending context length of LLMs. We identify that the critical element in enhancing long-context processing capability is to incorporate a long-context SFT stage following the standard SFT stage. A mere 200 iterations can convert the standard SFT model into a long-context model. To reduce the effort in collecting and annotating data for long-context language modeling, we develop two novel methods for creating synthetic data. These methods are applied during the continual pretraining phase as well as the Supervised Fine-Tuning (SFT) phase, greatly enhancing the training efficiency of our long-context LLMs. Our findings suggest that synthetic long-context SFT data can surpass the performance of data curated by humans to some extent. LongSkywork achieves outstanding performance on a variety of long-context benchmarks. In the Needle test, a benchmark for long-context information retrieval, our models achieved perfect accuracy across multiple context spans. Moreover, in realistic application scenarios, LongSkywork-13B demonstrates performance on par with Claude2.1, the leading long-context model, underscoring the effectiveness of our proposed methods.

Create account to get full access

Overview

This paper introduces a new training approach called "LongSkywork" that can efficiently extend the context length of large language models (LLMs).
LLMs typically struggle with processing long-form content due to limitations in their context length. LongSkywork aims to address this challenge.
The authors demonstrate that LongSkywork can increase the context length of the LLAMA-3 model by over 10x, without significantly impacting training time or model size.

Plain English Explanation

LongSkywork: A Training Recipe for Efficiently Extending Context Length in Large Language Models tackles an important problem faced by many large language models (LLMs) - their limited ability to process long-form content.

LLMs, such as GPT-3 and LLAMA-3, are powerful AI models that can understand and generate human-like text. However, they typically have a fixed "context length," which limits the amount of text they can consider at once. This makes it challenging for them to work with lengthy documents, long conversations, or complex tasks that require understanding extended context.

The researchers behind this paper developed a new training approach called "LongSkywork" that can significantly increase the context length of LLMs without dramatically increasing the model size or training time. By applying this technique to the LLAMA-3 model, they were able to extend its context length by over 10 times, allowing it to handle much longer inputs.

This is an important advancement, as extending the context length of LLMs has been a long-standing challenge in the field of natural language processing. The ability to work with longer context can unlock new applications for LLMs, such as summarizing lengthy documents, aligning short instructions with synthesized positions, and even training-free long-context extrapolation.

Technical Explanation

The key innovation of the LongSkywork approach is the use of a specialized training process that gradually increases the context length of the model during training. The authors start with a standard LLM, such as LLAMA-3, and then train it using a curriculum learning strategy, where the context length is gradually increased over the course of training.

This is achieved by splitting the training data into smaller chunks and training the model to accurately predict the next token in each chunk, while also conditioning on the previous chunks in the document. As training progresses, the size of the chunks is increased, allowing the model to learn to integrate longer and longer contexts.

The authors find that this approach is significantly more efficient than simply training the model on longer sequences from the beginning, as it allows the model to learn the necessary skills and representations in a more gradual and stable manner.

Through extensive experiments, the researchers demonstrate that LongSkywork can increase the context length of LLAMA-3 by over 10 times, without a substantial impact on training time or model size. This represents a significant improvement over the original model's capabilities, opening up new possibilities for applying LLMs to tasks that require understanding of long-form content.

Critical Analysis

The LongSkywork approach presented in this paper is a promising step forward in addressing the long-standing challenge of extending the context length of large language models. The authors provide a well-designed and rigorously evaluated solution that can significantly boost the capabilities of these models without overly increasing their complexity.

One potential limitation of the approach, as noted by the authors, is that it may not be as effective for models with very large context lengths to begin with. In such cases, the incremental gains from LongSkywork may be more modest. Additionally, the authors acknowledge that the training process for LongSkywork is more computationally intensive than standard LLM training, which could be a concern for researchers and practitioners with limited computational resources.

Furthermore, while the authors demonstrate the effectiveness of LongSkywork on the LLAMA-3 model, it remains to be seen how well the approach generalizes to other LLMs or to more diverse types of content and tasks. Additional research and validation across a broader range of scenarios would be helpful to further assess the versatility and robustness of this technique.

Overall, the LongSkywork approach represents a significant contribution to the field of large language models and their ability to handle long-form content. The authors have provided a well-designed solution that can expand the capabilities of these models, paving the way for new and exciting applications in natural language processing.

Conclusion

The LongSkywork training approach presented in this paper offers a promising solution to the long-standing challenge of extending the context length of large language models (LLMs). By gradually increasing the context length during training, the authors were able to significantly boost the capabilities of the LLAMA-3 model, allowing it to handle over 10 times more context without a substantial impact on training time or model size.

This advancement opens up new possibilities for applying LLMs to tasks that require understanding of long-form content, such as document summarization, long-context alignment, and training-free long-context extrapolation. While the approach may have some limitations, it represents an important step forward in the quest to develop LLMs that can seamlessly operate on extended contextual information, a critical capability for many real-world applications.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Long-context LLMs Struggle with Long In-context Learning

Tianle Li, Ge Zhang, Quy Duc Do, Xiang Yue, Wenhu Chen

Large Language Models (LLMs) have made significant strides in handling long sequences. Some models like Gemini could even to be capable of dealing with millions of tokens. However, their performance evaluation has largely been confined to metrics like perplexity and synthetic tasks, which may not fully capture their true abilities in more challenging, real-world scenarios. We introduce a benchmark (LongICLBench) for long in-context learning in extreme-label classification using six datasets with 28 to 174 classes and input lengths from 2K to 50K tokens. Our benchmark requires LLMs to comprehend the entire input to recognize the massive label spaces to make correct predictions. We evaluate on 15 long-context LLMs and find that they perform well on less challenging classification tasks with smaller label space and shorter demonstrations. However, they struggle with more challenging task like Discovery with 174 labels, suggesting a gap in their ability to process long, context-rich sequences. Further analysis reveals a bias towards labels presented later in the sequence and a need for improved reasoning over multiple pieces of information. Our study reveals that long context understanding and reasoning is still a challenging task for the existing LLMs. We believe LongICLBench could serve as a more realistic evaluation for the future long-context LLMs.

6/13/2024

cs.CL cs.AI

💬

Beyond the Limits: A Survey of Techniques to Extend the Context Length in Large Language Models

Xindi Wang, Mahsa Salmani, Parsa Omidi, Xiangyu Ren, Mehdi Rezagholizadeh, Armaghan Eshaghi

Recently, large language models (LLMs) have shown remarkable capabilities including understanding context, engaging in logical reasoning, and generating responses. However, this is achieved at the expense of stringent computational and memory requirements, hindering their ability to effectively support long input sequences. This survey provides an inclusive review of the recent techniques and methods devised to extend the sequence length in LLMs, thereby enhancing their capacity for long-context understanding. In particular, we review and categorize a wide range of techniques including architectural modifications, such as modified positional encoding and altered attention mechanisms, which are designed to enhance the processing of longer sequences while avoiding a proportional increase in computational requirements. The diverse methodologies investigated in this study can be leveraged across different phases of LLMs, i.e., training, fine-tuning and inference. This enables LLMs to efficiently process extended sequences. The limitations of the current methodologies is discussed in the last section along with the suggestions for future research directions, underscoring the importance of sequence length in the continued advancement of LLMs.

5/30/2024

cs.CL cs.LG

🤔

LongBench: A Bilingual, Multitask Benchmark for Long Context Understanding

Yushi Bai, Xin Lv, Jiajie Zhang, Hongchang Lyu, Jiankai Tang, Zhidian Huang, Zhengxiao Du, Xiao Liu, Aohan Zeng, Lei Hou, Yuxiao Dong, Jie Tang, Juanzi Li

Although large language models (LLMs) demonstrate impressive performance for many language tasks, most of them can only handle texts a few thousand tokens long, limiting their applications on longer sequence inputs, such as books, reports, and codebases. Recent works have proposed methods to improve LLMs' long context capabilities by extending context windows and more sophisticated memory mechanisms. However, comprehensive benchmarks tailored for evaluating long context understanding are lacking. In this paper, we introduce LongBench, the first bilingual, multi-task benchmark for long context understanding, enabling a more rigorous evaluation of long context understanding. LongBench comprises 21 datasets across 6 task categories in both English and Chinese, with an average length of 6,711 words (English) and 13,386 characters (Chinese). These tasks cover key long-text application areas including single-doc QA, multi-doc QA, summarization, few-shot learning, synthetic tasks, and code completion. All datasets in LongBench are standardized into a unified format, allowing for effortless automatic evaluation of LLMs. Upon comprehensive evaluation of 8 LLMs on LongBench, we find that: (1) Commercial model (GPT-3.5-Turbo-16k) outperforms other open-sourced models, but still struggles on longer contexts. (2) Scaled position embedding and fine-tuning on longer sequences lead to substantial improvement on long context understanding. (3) Context compression technique such as retrieval brings improvement for model with weak ability on long contexts, but the performance still lags behind models that have strong long context understanding capability. The code and datasets are available at https://github.com/THUDM/LongBench.

6/21/2024

cs.CL

⛏️

Extending Llama-3's Context Ten-Fold Overnight

Peitian Zhang, Ninglu Shao, Zheng Liu, Shitao Xiao, Hongjin Qian, Qiwei Ye, Zhicheng Dou

We extend the context length of Llama-3-8B-Instruct from 8K to 80K via QLoRA fine-tuning. The entire training cycle is super efficient, which takes 8 hours on one 8xA800 (80G) GPU machine. The resulted model exhibits superior performances across a broad range of evaluation tasks, such as NIHS, topic retrieval, and long-context language understanding; meanwhile, it also well preserves the original capability over short contexts. The dramatic context extension is mainly attributed to merely 3.5K synthetic training samples generated by GPT-4 , which indicates the LLMs' inherent (yet largely underestimated) potential to extend its original context length. In fact, the context length could be extended far beyond 80K with more computation resources. Therefore, the team will publicly release the entire resources (including data, model, data generation pipeline, training code) so as to facilitate the future research from the community: url{https://github.com/FlagOpen/FlagEmbedding}.

5/1/2024

cs.CL