LongBench: A Bilingual, Multitask Benchmark for Long Context Understanding

2308.14508

Published 6/21/2024 by Yushi Bai, Xin Lv, Jiajie Zhang, Hongchang Lyu, Jiankai Tang, Zhidian Huang, Zhengxiao Du, Xiao Liu, Aohan Zeng, Lei Hou and 3 others

cs.CL

🤔

Abstract

Although large language models (LLMs) demonstrate impressive performance for many language tasks, most of them can only handle texts a few thousand tokens long, limiting their applications on longer sequence inputs, such as books, reports, and codebases. Recent works have proposed methods to improve LLMs' long context capabilities by extending context windows and more sophisticated memory mechanisms. However, comprehensive benchmarks tailored for evaluating long context understanding are lacking. In this paper, we introduce LongBench, the first bilingual, multi-task benchmark for long context understanding, enabling a more rigorous evaluation of long context understanding. LongBench comprises 21 datasets across 6 task categories in both English and Chinese, with an average length of 6,711 words (English) and 13,386 characters (Chinese). These tasks cover key long-text application areas including single-doc QA, multi-doc QA, summarization, few-shot learning, synthetic tasks, and code completion. All datasets in LongBench are standardized into a unified format, allowing for effortless automatic evaluation of LLMs. Upon comprehensive evaluation of 8 LLMs on LongBench, we find that: (1) Commercial model (GPT-3.5-Turbo-16k) outperforms other open-sourced models, but still struggles on longer contexts. (2) Scaled position embedding and fine-tuning on longer sequences lead to substantial improvement on long context understanding. (3) Context compression technique such as retrieval brings improvement for model with weak ability on long contexts, but the performance still lags behind models that have strong long context understanding capability. The code and datasets are available at https://github.com/THUDM/LongBench.

Create account to get full access

Overview

This paper introduces LongBench, a new benchmark for evaluating large language models' (LLMs) ability to understand long-form texts.
LongBench includes 21 datasets across 6 task categories in both English and Chinese, with an average length of 6,711 words (English) and 13,386 characters (Chinese).
The benchmark covers key long-text application areas like single-document question answering, multi-document question answering, summarization, few-shot learning, synthetic tasks, and code completion.
The authors evaluate 8 LLMs on LongBench and find that commercial models outperform open-source ones, but struggle with longer contexts, and that techniques like scaled position embedding and fine-tuning on longer sequences can substantially improve long context understanding.

Plain English Explanation

Large language models (LLMs) like GPT-3 have become incredibly powerful at understanding and generating human language. However, most of these models are limited to handling texts that are only a few thousand words long. This can be a problem for applications that involve longer documents, such as books, reports, or even computer code.

To address this, the researchers created LongBench, a new benchmark specifically designed to evaluate how well LLMs can understand long-form texts. LongBench includes a variety of tasks, like answering questions about a single document, answering questions that require information from multiple documents, summarizing long texts, and even completing code snippets.

The researchers tested 8 different LLMs on LongBench, including both commercial and open-source models. They found that the commercial model (GPT-3.5-Turbo-16k) performed the best overall, but even it struggled with the longer texts. However, the researchers also discovered that techniques like scaling the position embeddings and fine-tuning the models on longer sequences could significantly improve their long context understanding.

The key takeaway is that while LLMs are incredibly capable, they still have trouble with really long texts. But the research presented in this paper suggests that there are ways to make them better at handling these kinds of long-form inputs, which could open up new applications for these powerful language models.

Technical Explanation

The paper introduces LongBench, a new benchmark for evaluating large language models' (LLMs) ability to understand long-form texts. LongBench consists of 21 datasets across 6 task categories in both English and Chinese, with an average length of 6,711 words (English) and 13,386 characters (Chinese). These tasks cover key long-text application areas including single-document question answering, multi-document question answering, summarization, few-shot learning, synthetic tasks, and code completion.

The authors evaluate 8 LLMs on LongBench, including GPT-3, GPT-3.5-Turbo-16k, and several open-source models. They find that the commercial model (GPT-3.5-Turbo-16k) outperforms the other models, but still struggles on longer contexts. The authors also investigate techniques to improve long context understanding, including scaled position embedding and fine-tuning on longer sequences, which lead to substantial improvements.

Additionally, the authors explore the use of context compression techniques like retrieval to aid models with weaker long context capabilities. However, they find that the performance of these models still lags behind models that have strong long context understanding.

Overall, the paper provides a comprehensive benchmark for evaluating long context understanding in LLMs and offers insights into the current limitations and potential solutions for improving long-form text processing.

Critical Analysis

The LongBench benchmark presented in this paper is a valuable contribution to the field of long-form text processing with large language models. By providing a standardized set of tasks and datasets, the benchmark enables a more rigorous and comprehensive evaluation of LLMs' long context capabilities.

One potential limitation of the benchmark is the specific task categories and datasets included. While the authors have made a concerted effort to cover a wide range of long-text application areas, there may be other relevant tasks or datasets that could be added to further diversify the benchmark and provide a more complete picture of long context understanding.

Additionally, the paper does not delve deeply into the specific architectural choices or training techniques that may be most effective for improving long context understanding. The authors mention some promising approaches, such as scaled position embedding and fine-tuning on longer sequences, but a more detailed exploration of these and other potential solutions could provide more actionable insights for researchers and practitioners.

Furthermore, the authors note that even the best-performing model, GPT-3.5-Turbo-16k, still struggles with the longer contexts in LongBench. This suggests that there is still significant room for improvement in developing LLMs that can truly excel at long-form text processing. Future research could explore more advanced memory mechanisms, retrieval-augmented architectures, or other novel approaches to address this challenge.

Overall, the LongBench benchmark and the insights presented in this paper represent an important step forward in understanding and improving large language models' capabilities for long-form text understanding. By continuing to push the boundaries of what these models can achieve, researchers can unlock new possibilities for a wide range of long-text applications.

Conclusion

This paper introduces LongBench, a new benchmark for evaluating large language models' (LLMs) ability to understand long-form texts. LongBench includes a diverse set of tasks and datasets in both English and Chinese, with an average length of over 6,000 words.

The authors' evaluation of 8 LLMs on LongBench reveals that commercial models like GPT-3.5-Turbo-16k outperform open-source models, but still struggle with longer contexts. However, techniques like scaled position embedding and fine-tuning on longer sequences can substantially improve long context understanding.

The LongBench benchmark and the insights from this research represent an important step forward in developing LLMs that can effectively process and comprehend long-form texts. By pushing the boundaries of what these models can achieve, researchers can unlock new possibilities for a wide range of applications, from summarizing lengthy reports to answering complex questions across multiple documents.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

XL$^2$Bench: A Benchmark for Extremely Long Context Understanding with Long-range Dependencies

Xuanfan Ni, Hengyi Cai, Xiaochi Wei, Shuaiqiang Wang, Dawei Yin, Piji Li

Large Language Models (LLMs) have demonstrated remarkable performance across diverse tasks but are constrained by their small context window sizes. Various efforts have been proposed to expand the context window to accommodate even up to 200K input tokens. Meanwhile, building high-quality benchmarks with much longer text lengths and more demanding tasks to provide comprehensive evaluations is of immense practical interest to facilitate long context understanding research of LLMs. However, prior benchmarks create datasets that ostensibly cater to long-text comprehension by expanding the input of traditional tasks, which falls short to exhibit the unique characteristics of long-text understanding, including long dependency tasks and longer text length compatible with modern LLMs' context window size. In this paper, we introduce a benchmark for extremely long context understanding with long-range dependencies, XL$^2$Bench, which includes three scenarios: Fiction Reading, Paper Reading, and Law Reading, and four tasks of increasing complexity: Memory Retrieval, Detailed Understanding, Overall Understanding, and Open-ended Generation, covering 27 subtasks in English and Chinese. It has an average length of 100K+ words (English) and 200K+ characters (Chinese). Evaluating six leading LLMs on XL$^2$Bench, we find that their performance significantly lags behind human levels. Moreover, the observed decline in performance across both the original and enhanced datasets underscores the efficacy of our approach to mitigating data contamination.

4/9/2024

cs.CL

Long-context LLMs Struggle with Long In-context Learning

Tianle Li, Ge Zhang, Quy Duc Do, Xiang Yue, Wenhu Chen

Large Language Models (LLMs) have made significant strides in handling long sequences. Some models like Gemini could even to be capable of dealing with millions of tokens. However, their performance evaluation has largely been confined to metrics like perplexity and synthetic tasks, which may not fully capture their true abilities in more challenging, real-world scenarios. We introduce a benchmark (LongICLBench) for long in-context learning in extreme-label classification using six datasets with 28 to 174 classes and input lengths from 2K to 50K tokens. Our benchmark requires LLMs to comprehend the entire input to recognize the massive label spaces to make correct predictions. We evaluate on 15 long-context LLMs and find that they perform well on less challenging classification tasks with smaller label space and shorter demonstrations. However, they struggle with more challenging task like Discovery with 174 labels, suggesting a gap in their ability to process long, context-rich sequences. Further analysis reveals a bias towards labels presented later in the sequence and a need for improved reasoning over multiple pieces of information. Our study reveals that long context understanding and reasoning is still a challenging task for the existing LLMs. We believe LongICLBench could serve as a more realistic evaluation for the future long-context LLMs.

6/13/2024

cs.CL cs.AI

MileBench: Benchmarking MLLMs in Long Context

Dingjie Song, Shunian Chen, Guiming Hardy Chen, Fei Yu, Xiang Wan, Benyou Wang

Despite the advancements and impressive performance of Multimodal Large Language Models (MLLMs) on benchmarks, their effectiveness in real-world, long-context, and multi-image tasks is unclear due to the benchmarks' limited scope. Existing benchmarks often focus on single-image and short-text samples, and when assessing multi-image tasks, they either limit the image count or focus on specific task (e.g time-series captioning), potentially obscuring the performance challenges of MLLMs. To address these limitations, we introduce MileBench, a pioneering benchmark designed to test the MultImodal Long-contExt capabilities of MLLMs. This benchmark comprises not only multimodal long contexts, but also multiple tasks requiring both comprehension and generation. We establish two distinct evaluation sets, diagnostic and realistic, to systematically assess MLLMs' long-context adaptation capacity and their ability to complete tasks in long-context scenarios. Our experimental results, obtained from testing 22 models, revealed that while the closed-source GPT-4o outperforms others, most open-source MLLMs struggle in long-context situations. Interestingly, the performance gap tends to widen with an increase in the number of images. We strongly encourage an intensification of research efforts towards enhancing MLLMs' long-context capabilities, especially in scenarios involving multiple images.

5/16/2024

cs.CL cs.AI cs.CV cs.LG

New!MMLongBench-Doc: Benchmarking Long-context Document Understanding with Visualizations

Yubo Ma, Yuhang Zang, Liangyu Chen, Meiqi Chen, Yizhu Jiao, Xinze Li, Xinyuan Lu, Ziyu Liu, Yan Ma, Xiaoyi Dong, Pan Zhang, Liangming Pan, Yu-Gang Jiang, Jiaqi Wang, Yixin Cao, Aixin Sun

Understanding documents with rich layouts and multi-modal components is a long-standing and practical task. Recent Large Vision-Language Models (LVLMs) have made remarkable strides in various tasks, particularly in single-page document understanding (DU). However, their abilities on long-context DU remain an open problem. This work presents MMLongBench-Doc, a long-context, multi-modal benchmark comprising 1,062 expert-annotated questions. Distinct from previous datasets, it is constructed upon 130 lengthy PDF-formatted documents with an average of 49.4 pages and 20,971 textual tokens. Towards comprehensive evaluation, answers to these questions rely on pieces of evidence from (1) different sources (text, image, chart, table, and layout structure) and (2) various locations (i.e. page number). Moreover, 33.2% of the questions are cross-page questions requiring evidence across multiple pages. 22.8% of the questions are designed to be unanswerable for detecting potential hallucinations. Experiments on 14 LVLMs demonstrate that long-context DU greatly challenges current models. Notably, the best-performing model, GPT-4o, achieves an F1 score of only 42.7%, while the second-best, GPT-4V, scores 31.4%. Furthermore, 12 LVLMs (all except GPT-4o and GPT-4V) even present worse performance than their LLM counterparts which are fed with lossy-parsed OCR documents. These results validate the necessity of future research toward more capable long-context LVLMs. Project Page: https://mayubo2333.github.io/MMLongBench-Doc

7/2/2024

cs.CV cs.CL