Extending Input Contexts of Language Models through Training on Segmented Sequences

2310.14633

Published 6/21/2024 by Petros Karypis, Julian McAuley, George Karypis

💬

Abstract

Effectively training language models on long inputs poses many technical challenges. As a cost consideration, languages models are pretrained on a fixed sequence length before being adapted to longer sequences. We explore various methods for adapting models to longer inputs by training on segmented sequences and an interpolation-based method for extending absolute positional embeddings. We develop a training procedure to extend the input context size of pretrained models with no architectural changes and no additional memory costs than training on the original input lengths. By sub-sampling segments from long inputs while maintaining their original position the model is able to learn new positional interactions. Our method benefits both models trained with absolute positional embeddings, by extending their input contexts, as well as popular relative positional embedding methods showing a reduced perplexity on sequences longer than they were trained on. We demonstrate our method can extend input contexts by a factor of 4x while improving perplexity.

Create account to get full access

Overview

Explores methods for adapting language models to handle longer input sequences, a key challenge in training such models
Proposes a training procedure to extend the input context size of pre-trained models without architectural changes or increased memory costs
Demonstrates the ability to extend input contexts by 4x while improving perplexity, a metric for language model performance

Plain English Explanation

Language models, which are AI systems trained to generate human-like text, often struggle when asked to work with long input sequences. This is because they are typically pre-trained on a fixed sequence length before being adapted to handle longer inputs.

The researchers in this paper explore various techniques to help language models better handle longer inputs. Their key innovation is a training procedure that can extend the input context size of pre-trained models without requiring changes to the model architecture or using more computer memory.

The core idea is to sub-sample segments from long input sequences while preserving the original position information. This allows the model to learn new positional interactions, effectively expanding its understanding of longer contexts.

The researchers show this method can extend the input context by 4 times, while also improving the model's perplexity - a measure of how well it can predict the next word in a sequence. This is a significant advancement, as it enables language models to handle much longer inputs without major architectural changes or increased computational costs.

Technical Explanation

The paper explores various techniques for adapting language models to handle longer input sequences. As a cost-saving measure, language models are typically pre-trained on a fixed sequence length before being fine-tuned on longer inputs.

The researchers propose a training procedure that can extend the input context size of pre-trained models without requiring architectural changes or increased memory usage. The key innovation is a method for sub-sampling segments from long input sequences while maintaining the original position information.

This allows the model to learn new positional interactions, effectively expanding its understanding of longer contexts. The researchers demonstrate this approach benefits both models trained with absolute positional embeddings and those using popular relative positional embedding methods, showing reduced perplexity on sequences longer than the model was originally trained on.

The researchers also explore an interpolation-based method for extending absolute positional embeddings to longer input contexts. Overall, the paper presents a training procedure that can extend the input context size of pre-trained models by a factor of 4x while improving performance.

Critical Analysis

The paper presents a compelling approach for adapting language models to handle longer input sequences, a significant challenge in the field. The researchers' innovation of sub-sampling segments from long inputs while preserving positional information is a clever solution that allows the model to learn new positional interactions without architectural changes or increased memory usage.

One potential limitation is that the method may not generalize as well to extremely long inputs, as the sub-sampling approach may not capture all the necessary positional information. Additionally, the paper does not explore the potential impact on downstream task performance beyond perplexity, which could be an area for further research.

Overall, the paper makes a valuable contribution to the ongoing efforts to extend the context-handling capabilities of large language models. The researchers' training procedure represents an important step forward in enabling language models to effectively process longer input sequences without incurring significant computational or memory costs.

Conclusion

This paper presents a novel training procedure that can effectively extend the input context size of pre-trained language models. By sub-sampling segments from long inputs while preserving positional information, the researchers were able to expand the models' understanding of longer sequences without requiring architectural changes or increased memory usage.

The ability to extend input contexts by a factor of 4x while improving perplexity is a significant advancement in the field of language modeling. This work can have important implications for a wide range of applications that rely on language models, from document summarization to dialogue systems, by enabling these models to better handle and reason about longer input texts.

Overall, the researchers' innovative approach represents an important contribution to the ongoing efforts to push the boundaries of what language models can achieve, particularly when it comes to processing and understanding longer forms of natural language.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

💬

Beyond the Limits: A Survey of Techniques to Extend the Context Length in Large Language Models

Xindi Wang, Mahsa Salmani, Parsa Omidi, Xiangyu Ren, Mehdi Rezagholizadeh, Armaghan Eshaghi

Recently, large language models (LLMs) have shown remarkable capabilities including understanding context, engaging in logical reasoning, and generating responses. However, this is achieved at the expense of stringent computational and memory requirements, hindering their ability to effectively support long input sequences. This survey provides an inclusive review of the recent techniques and methods devised to extend the sequence length in LLMs, thereby enhancing their capacity for long-context understanding. In particular, we review and categorize a wide range of techniques including architectural modifications, such as modified positional encoding and altered attention mechanisms, which are designed to enhance the processing of longer sequences while avoiding a proportional increase in computational requirements. The diverse methodologies investigated in this study can be leveraged across different phases of LLMs, i.e., training, fine-tuning and inference. This enables LLMs to efficiently process extended sequences. The limitations of the current methodologies is discussed in the last section along with the suggestions for future research directions, underscoring the importance of sequence length in the continued advancement of LLMs.

5/30/2024

cs.CL cs.LG

LongEmbed: Extending Embedding Models for Long Context Retrieval

Dawei Zhu, Liang Wang, Nan Yang, Yifan Song, Wenhao Wu, Furu Wei, Sujian Li

Embedding models play a pivot role in modern NLP applications such as IR and RAG. While the context limit of LLMs has been pushed beyond 1 million tokens, embedding models are still confined to a narrow context window not exceeding 8k tokens, refrained from application scenarios requiring long inputs such as legal contracts. This paper explores context window extension of existing embedding models, pushing the limit to 32k without requiring additional training. First, we examine the performance of current embedding models for long context retrieval on our newly constructed LongEmbed benchmark. LongEmbed comprises two synthetic tasks and four carefully chosen real-world tasks, featuring documents of varying length and dispersed target information. Benchmarking results underscore huge room for improvement in these models. Based on this, comprehensive experiments show that training-free context window extension strategies like position interpolation can effectively extend the context window of existing embedding models by several folds, regardless of their original context being 512 or beyond 4k. Furthermore, for models employing absolute position encoding (APE), we show the possibility of further fine-tuning to harvest notable performance gains while strictly preserving original behavior for short inputs. For models using rotary position embedding (RoPE), significant enhancements are observed when employing RoPE-specific methods, such as NTK and SelfExtend, indicating RoPE's superiority over APE for context window extension. To facilitate future research, we release E5-Base-4k and E5-RoPE-Base, along with the LongEmbed benchmark.

4/26/2024

cs.CL cs.LG

🔍

InfLLM: Training-Free Long-Context Extrapolation for LLMs with an Efficient Context Memory

Chaojun Xiao, Pengle Zhang, Xu Han, Guangxuan Xiao, Yankai Lin, Zhengyan Zhang, Zhiyuan Liu, Maosong Sun

Large language models (LLMs) have emerged as a cornerstone in real-world applications with lengthy streaming inputs (e.g., LLM-driven agents). However, existing LLMs, pre-trained on sequences with a restricted maximum length, cannot process longer sequences due to the out-of-domain and distraction issues. Common solutions often involve continual pre-training on longer sequences, which will introduce expensive computational overhead and uncontrollable change in model capabilities. In this paper, we unveil the intrinsic capacity of LLMs for understanding extremely long sequences without any fine-tuning. To this end, we introduce a training-free memory-based method, InfLLM. Specifically, InfLLM stores distant contexts into additional memory units and employs an efficient mechanism to lookup token-relevant units for attention computation. Thereby, InfLLM allows LLMs to efficiently process long sequences with a limited context window and well capture long-distance dependencies. Without any training, InfLLM enables LLMs that are pre-trained on sequences consisting of a few thousand tokens to achieve comparable performance with competitive baselines that continually train these LLMs on long sequences. Even when the sequence length is scaled to $1,024$K, InfLLM still effectively captures long-distance dependencies. Our code can be found in url{https://github.com/thunlp/InfLLM}.

5/29/2024

cs.CL cs.AI cs.LG

Long-context LLMs Struggle with Long In-context Learning

Tianle Li, Ge Zhang, Quy Duc Do, Xiang Yue, Wenhu Chen

Large Language Models (LLMs) have made significant strides in handling long sequences. Some models like Gemini could even to be capable of dealing with millions of tokens. However, their performance evaluation has largely been confined to metrics like perplexity and synthetic tasks, which may not fully capture their true abilities in more challenging, real-world scenarios. We introduce a benchmark (LongICLBench) for long in-context learning in extreme-label classification using six datasets with 28 to 174 classes and input lengths from 2K to 50K tokens. Our benchmark requires LLMs to comprehend the entire input to recognize the massive label spaces to make correct predictions. We evaluate on 15 long-context LLMs and find that they perform well on less challenging classification tasks with smaller label space and shorter demonstrations. However, they struggle with more challenging task like Discovery with 174 labels, suggesting a gap in their ability to process long, context-rich sequences. Further analysis reveals a bias towards labels presented later in the sequence and a need for improved reasoning over multiple pieces of information. Our study reveals that long context understanding and reasoning is still a challenging task for the existing LLMs. We believe LongICLBench could serve as a more realistic evaluation for the future long-context LLMs.

6/13/2024

cs.CL cs.AI