Extending Llama-3's Context Ten-Fold Overnight

2404.19553

Published 5/1/2024 by Peitian Zhang, Ninglu Shao, Zheng Liu, Shitao Xiao, Hongjin Qian, Qiwei Ye, Zhicheng Dou

Abstract

We extend the context length of Llama-3-8B-Instruct from 8K to 80K via QLoRA fine-tuning. The entire training cycle is highly efficient, taking 8 hours on one 8xA800 (80G) GPU machine. The resulting model exhibits superior performance across a broad range of evaluation tasks, such as NIHS (needle-in-a-haystack), topic retrieval, and long-context language understanding; meanwhile, it also preserves the original capability over short contexts well. The dramatic context extension is mainly attributed to merely 3.5K synthetic training samples generated by GPT-4, which indicates the LLMs' inherent (yet largely underestimated) potential to extend their original context length. In fact, the context length could be extended far beyond 80K with more computation resources. Therefore, the team will publicly release the entire resources (including data, model, data generation pipeline, training code) so as to facilitate future research from the community: https://github.com/FlagOpen/FlagEmbedding

Overview

Extends the context length of Llama-3-8B-Instruct model from 8K to 80K via QLoRA fine-tuning
Training takes only 8 hours on a single 8xA800 (80G) GPU machine
The resulting model exhibits superior performance on a range of evaluation tasks, including long-context language understanding
Preserves original capability over short contexts
Dramatic context extension achieved with just 3.5K synthetic training samples generated by GPT-4
Suggests that large language models (LLMs) have an inherent, largely underestimated ability to extend their original context length, and that more computational resources could push it well beyond 80K

Plain English Explanation

The researchers extended the context length of a large language model called Llama-3-8B-Instruct from 8,000 tokens to 80,000 tokens. This means the model can now process and understand much longer pieces of text.

They did this by fine-tuning the model with QLoRA (Quantized Low-Rank Adaptation), an efficient technique that keeps the original model weights frozen in a compressed, quantized form and trains a small set of added low-rank adapter weights. The entire training process took only 8 hours on a single machine with eight A800 (80G) GPUs.

The resulting model performed very well on a variety of tasks that require understanding long passages of text, such as answering questions about a topic or summarizing the key points. Importantly, it also maintained its original ability to process short pieces of text effectively.

The researchers found that they could achieve this dramatic increase in context length by using just 3,500 synthetic training samples generated by an even more powerful language model, GPT-4. This suggests that large language models have a lot of untapped potential to handle longer contexts, and that with more computing power, their context length could be extended even further.

Technical Explanation

The researchers extended the context length of the Llama-3-8B-Instruct model from 8,000 tokens to 80,000 tokens using Quantized Low-Rank Adaptation (QLoRA) fine-tuning. This efficient training process took only 8 hours on a single 8xA800 (80G) GPU machine.
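The efficiency of low-rank adaptation comes down to simple arithmetic: instead of updating a full weight matrix, only two thin matrices are trained. The sketch below illustrates that count for a single linear layer. The layer size (4096 x 4096) and the rank (32) are hypothetical values chosen for illustration, not settings confirmed by this summary.

```python
def lora_param_counts(d_in: int, d_out: int, rank: int) -> tuple[int, int]:
    """Return (full, adapter) parameter counts for one linear layer.

    Full fine-tuning updates the whole d_out x d_in weight matrix.
    A low-rank adapter instead trains B (d_out x rank) and A (rank x d_in),
    while the original weight stays frozen (and, in QLoRA, quantized).
    """
    full = d_in * d_out
    adapter = rank * (d_in + d_out)
    return full, adapter


if __name__ == "__main__":
    # Hypothetical values: a 4096 x 4096 projection and rank 32.
    full, adapter = lora_param_counts(d_in=4096, d_out=4096, rank=32)
    print(f"full matrix:        {full:,} parameters")
    print(f"low-rank adapter:   {adapter:,} parameters")
    print(f"trainable fraction: {adapter / full:.4%}")
```

Under these assumed sizes the adapter holds about 1.6% of the layer's parameters, which is why optimizer state and gradients stay small even for an 8B-parameter model.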

The resulting model demonstrated superior performance across a range of evaluation tasks, including needle-in-a-haystack (NIHS) retrieval, topic retrieval, and long-context language understanding. Importantly, the model also largely preserved its original capability over short contexts.
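The abstract lists NIHS (needle-in-a-haystack) among the evaluation tasks. The general idea of such a test is to hide one short fact inside a long stretch of unrelated text and check whether the model can retrieve it. The following is a minimal sketch of that idea only; the function names, the filler text, and the substring-based scoring are all hypothetical and are not the authors' evaluation code.

```python
def build_haystack_prompt(filler: str, needle: str, depth: float, n_sentences: int) -> str:
    """Place `needle` at relative `depth` (0.0 = start, 1.0 = end) inside
    a haystack made of `n_sentences` copies of `filler`."""
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must be between 0.0 and 1.0")
    sentences = [filler] * n_sentences
    position = round(depth * n_sentences)
    sentences.insert(position, needle)
    return " ".join(sentences)


def contains_answer(model_output: str, expected: str) -> bool:
    """Crude scoring: does the expected string appear in the model output?"""
    return expected.lower() in model_output.lower()


if __name__ == "__main__":
    # Hypothetical filler and needle; a real test would use far more text.
    prompt = build_haystack_prompt(
        filler="The weather report was unremarkable.",
        needle="The secret number is 7421.",
        depth=0.5,
        n_sentences=10,
    )
    print(prompt)
    print(contains_answer("I believe the secret number is 7421.", "7421"))
```

Sweeping `depth` and `n_sentences` over a grid gives the familiar picture of retrieval accuracy by needle position and context length.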

The researchers attribute the dramatic context extension mainly to just 3,500 synthetic training samples generated by GPT-4. They take this as evidence that large language models have an inherent, largely underestimated potential to extend their original context length, and they note that the context could be pushed far beyond 80K with more computational resources.

To facilitate future research, the team plans to publicly release the entire set of resources, including the data, model, data generation pipeline, and training code, through a GitHub repository.

Critical Analysis

The researchers provide a compelling demonstration of the potential for large language models to handle significantly longer contexts than their original capabilities. By leveraging efficient fine-tuning techniques and a relatively small amount of synthetic data, they were able to extend the context length of the Llama-3-8B-Instruct model by an order of magnitude.

However, the paper does not explore the limits of this context extension or the potential challenges that may arise as context lengths continue to grow. It would be valuable to understand the computational and memory requirements, as well as any potential trade-offs in model performance, as the context length is scaled even further.
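One part of the memory question can be estimated with back-of-the-envelope arithmetic: the key-value cache that a decoder keeps during inference grows linearly with context length. The sketch below assumes the published Llama-3-8B architecture (32 layers, 8 key-value heads, head dimension 128) and a 16-bit cache, and it treats 8K and 80K as 8 x 1024 and 80 x 1024 tokens. These figures are assumptions brought in for illustration and do not come from the paper summary. The estimate covers one sequence's cache only, not model weights, activations, or training memory.

```python
def kv_cache_bytes(
    tokens: int,
    layers: int = 32,          # assumed number of decoder layers
    kv_heads: int = 8,         # assumed number of key-value heads
    head_dim: int = 128,       # assumed dimension per head
    bytes_per_value: int = 2,  # assumed 16-bit cache entries
) -> int:
    """Bytes needed to cache keys and values for one sequence."""
    # The leading 2 accounts for one key tensor and one value tensor per layer.
    return 2 * layers * kv_heads * head_dim * bytes_per_value * tokens


if __name__ == "__main__":
    for tokens in (8 * 1024, 80 * 1024):
        gib = kv_cache_bytes(tokens) / 2**30
        print(f"{tokens:>6} tokens -> {gib:.1f} GiB of KV cache")
```

Under these assumptions the cache grows from about 1 GiB at 8K tokens to about 10 GiB at 80K tokens, so a further ten-fold extension would need roughly 100 GiB for a single sequence unless the cache is compressed or offloaded.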

Additionally, the researchers' claim that LLMs have "largely underestimated" potential to extend their context length could benefit from a more nuanced discussion. While the results are impressive, it is important to consider the potential challenges and limitations that may arise as models are pushed to their boundaries.

Overall, this research represents an important step in advancing the capabilities of large language models and highlights the need for continued exploration and critical analysis in this rapidly evolving field.

Conclusion

The researchers have demonstrated a highly efficient method for extending the context length of the Llama-3-8B-Instruct model from 8,000 tokens to 80,000 tokens. This was achieved through QLoRA fine-tuning, which allowed the training process to be completed in just 8 hours on a single 8xA800 (80G) GPU machine.

The resulting model exhibited superior performance on a range of evaluation tasks that require understanding long passages of text, while also preserving its original capability over short contexts. Importantly, the researchers accomplished this dramatic context extension using a relatively small amount of synthetic training data, highlighting the inherent potential of large language models to handle longer contexts, which more computational resources could extend further still.

By publicly releasing the entire set of resources, including the data, model, and training code, the researchers are poised to facilitate further research and advancements in the field of long-context language understanding.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

LongSkywork: A Training Recipe for Efficiently Extending Context Length in Large Language Models

Liang Zhao, Tianwen Wei, Liang Zeng, Cheng Cheng, Liu Yang, Peng Cheng, Lijie Wang, Chenxia Li, Xuejie Wu, Bo Zhu, Yimeng Gan, Rui Hu, Shuicheng Yan, Han Fang, Yahui Zhou

We introduce LongSkywork, a long-context Large Language Model (LLM) capable of processing up to 200,000 tokens. We provide a training recipe for efficiently extending context length of LLMs. We identify that the critical element in enhancing long-context processing capability is to incorporate a long-context SFT stage following the standard SFT stage. A mere 200 iterations can convert the standard SFT model into a long-context model. To reduce the effort in collecting and annotating data for long-context language modeling, we develop two novel methods for creating synthetic data. These methods are applied during the continual pretraining phase as well as the Supervised Fine-Tuning (SFT) phase, greatly enhancing the training efficiency of our long-context LLMs. Our findings suggest that synthetic long-context SFT data can surpass the performance of data curated by humans to some extent. LongSkywork achieves outstanding performance on a variety of long-context benchmarks. In the Needle test, a benchmark for long-context information retrieval, our models achieved perfect accuracy across multiple context spans. Moreover, in realistic application scenarios, LongSkywork-13B demonstrates performance on par with Claude2.1, the leading long-context model, underscoring the effectiveness of our proposed methods.

6/4/2024

cs.CL cs.AI

Training-Free Long-Context Scaling of Large Language Models

Chenxin An, Fei Huang, Jun Zhang, Shansan Gong, Xipeng Qiu, Chang Zhou, Lingpeng Kong

The ability of Large Language Models (LLMs) to process and generate coherent text is markedly weakened when the number of input tokens exceeds their pretraining length. Given the expensive overhead of finetuning large-scale models with longer sequences, we propose Dual Chunk Attention (DCA), which enables Llama2 70B to support context windows of more than 100k tokens without continual training. By decomposing the attention computation for long sequences into chunk-based modules, DCA manages to effectively capture the relative positional information of tokens within the same chunk (Intra-Chunk) and across distinct chunks (Inter-Chunk), and integrates seamlessly with Flash Attention. In addition to its impressive extrapolation capability, DCA achieves performance on practical long-context tasks that is comparable to or even better than that of finetuned models. When compared with proprietary models, our training-free 70B model attains 94% of the performance of gpt-3.5-16k, indicating it is a viable open-source alternative. All code and data used in this work are released at https://github.com/HKUNLP/ChunkLlama

5/30/2024

cs.CL

LongEmbed: Extending Embedding Models for Long Context Retrieval

Dawei Zhu, Liang Wang, Nan Yang, Yifan Song, Wenhao Wu, Furu Wei, Sujian Li

Embedding models play a pivotal role in modern NLP applications such as IR and RAG. While the context limit of LLMs has been pushed beyond 1 million tokens, embedding models are still confined to a narrow context window not exceeding 8k tokens, which excludes them from application scenarios requiring long inputs such as legal contracts. This paper explores context window extension of existing embedding models, pushing the limit to 32k without requiring additional training. First, we examine the performance of current embedding models for long context retrieval on our newly constructed LongEmbed benchmark. LongEmbed comprises two synthetic tasks and four carefully chosen real-world tasks, featuring documents of varying length and dispersed target information. Benchmarking results underscore huge room for improvement in these models. Based on this, comprehensive experiments show that training-free context window extension strategies like position interpolation can effectively extend the context window of existing embedding models several-fold, regardless of whether their original context is 512 or beyond 4k. Furthermore, for models employing absolute position encoding (APE), we show the possibility of further fine-tuning to harvest notable performance gains while strictly preserving original behavior for short inputs. For models using rotary position embedding (RoPE), significant enhancements are observed when employing RoPE-specific methods, such as NTK and SelfExtend, indicating RoPE's superiority over APE for context window extension. To facilitate future research, we release E5-Base-4k and E5-RoPE-Base, along with the LongEmbed benchmark.

4/26/2024

cs.CL cs.LG

XL3M: A Training-free Framework for LLM Length Extension Based on Segment-wise Inference

Shengnan Wang, Youhui Bai, Lin Zhang, Pingyi Zhou, Shixiong Zhao, Gong Zhang, Sen Wang, Renhai Chen, Hua Xu, Hongwei Sun

The length generalization failure problem, namely that a large language model (LLM) fails to generalize to texts longer than its maximum training length, greatly restricts the application of LLMs in scenarios with streaming long inputs. To address this problem, existing methods either require substantial costs or introduce precision loss. In this paper, we empirically find that the accuracy of the LLM's prediction is highly correlated to its certainty. Based on this, we propose an efficient training-free framework, named XL3M (meaning extra-long large language model), which enables LLMs trained on short sequences to reason over extremely long sequences without any further training or fine-tuning. Under the XL3M framework, the input context is first decomposed into multiple short sub-contexts, where each sub-context contains an independent segment and a common "question", which is a few tokens from the end of the original context. Then XL3M gives a method to measure the relevance between each segment and the "question", and constructs a concise key context by splicing all the relevant segments in chronological order. The key context is then used instead of the original context to complete the inference task. Evaluations on comprehensive benchmarks show the superiority of XL3M. Using our framework, a Llama2-7B model is able to reason over 20M-long sequences on an 8-card Huawei Ascend 910B NPU machine with 64GB memory per card.

5/29/2024

cs.CL cs.AI