LongIns: A Challenging Long-context Instruction-based Exam for LLMs

2406.17588

Published 6/27/2024 by Shawn Gavin, Tuney Zheng, Jiaheng Liu, Quehry Que, Noah Wang, Jian Yang, Chenchen Zhang, Wenhao Huang, Wenhu Chen, Ge Zhang

cs.CL

Abstract

The long-context capabilities of large language models (LLMs) have been a hot topic in recent years. To evaluate the performance of LLMs in different scenarios, various assessment benchmarks have emerged. However, as most of these benchmarks focus on identifying key information to answer questions, which mainly requires the retrieval ability of LLMs, these benchmarks can only partially represent the reasoning performance of LLMs over large amounts of information. Meanwhile, although LLMs often claim to have context windows of 32k, 128k, 200k, or even longer, these benchmarks fail to reveal the actual supported length of these LLMs. To address these issues, we propose the LongIns benchmark dataset, a challenging long-context instruction-based exam for LLMs, which is built on existing instruction datasets. Specifically, in LongIns, we introduce three evaluation settings: Global Instruction & Single Task (GIST), Local Instruction & Single Task (LIST), and Local Instruction & Multiple Tasks (LIMT). Based on LongIns, we perform comprehensive evaluations on existing LLMs and have the following important findings: (1) The top-performing GPT-4 with 128k context length performs poorly on the evaluation context window of 16k in our LongIns. (2) For the multi-hop reasoning ability of many existing LLMs, significant efforts are still needed under short context windows (less than 4k).

Overview

This paper introduces LongIns, a challenging long-context instruction-based exam for evaluating the performance of large language models (LLMs) on complex, multi-step tasks that require understanding and reasoning over long passages of text.
The authors argue that existing benchmarks for LLMs may not adequately capture their ability to handle long-context understanding and reasoning, which is crucial for many real-world applications.
LongIns is designed to push the boundaries of LLM capabilities by presenting them with open-ended, multi-part instructions that span several paragraphs of context, requiring models to maintain coherence and consistency across long stretches of text.

Plain English Explanation

The researchers behind this paper have created a new test called LongIns to evaluate how well large language models (LLMs) can handle complex, multi-step tasks that involve understanding and reasoning over long passages of text. Many existing benchmarks for LLMs may not fully capture their ability to work with long contexts, which is an important skill for real-world applications like question answering or document summarization.

LongIns presents LLMs with open-ended instructions that span multiple paragraphs, requiring the models to maintain coherence and consistency as they work through the different parts of the task. This is meant to be more challenging than the shorter, more self-contained prompts typically used to test LLMs. The goal is to push the boundaries of what these models are capable of and identify areas where they may struggle with long-context understanding and reasoning.

Technical Explanation

The LongIns benchmark consists of a set of open-ended, multi-part instructions that require large language models (LLMs) to understand and reason over long passages of text. Each LongIns instance provides several paragraphs of context, followed by a series of questions or tasks that the model must complete in a coherent and consistent manner.

The authors argue that existing benchmarks for LLMs, such as LongBench, MileBench, and XL$^2$Bench, may not adequately capture the models' ability to handle long-context understanding and reasoning, which is crucial for many real-world applications. In contrast, LongIns is designed to push the boundaries of LLM capabilities by presenting them with more open-ended, multi-step tasks that span several paragraphs of context.

The authors evaluate several state-of-the-art LLMs on the LongIns benchmark and find that even the best-performing models struggle to maintain coherence and consistency across the long passages of text. This suggests that there is still significant room for improvement in developing LLMs that can effectively handle long-context understanding and reasoning.
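The abstract names three evaluation settings (GIST, LIST, and LIMT) without spelling out how each prompt is laid out. The sketch below is one plausible reading of those names, not the authors' construction: the `Example` class, the `build_prompt` function, and the exact prompt wording are all hypothetical. It assumes that a "global" instruction appears once at the top of the context, a "local" instruction is repeated next to each question, and "multiple tasks" means the questions come with different instructions.

```python
from dataclasses import dataclass


@dataclass
class Example:
    """One question paired with the instruction for its task (hypothetical type)."""
    instruction: str
    question: str


def build_prompt(examples, setting):
    """Assemble a long-context prompt under an assumed reading of the setting names.

    GIST: one shared instruction at the top, then all questions.
    LIST: the same instruction repeated before every question.
    LIMT: questions from several tasks, each preceded by its own instruction.
    """
    if not examples:
        raise ValueError("need at least one example")
    tasks = {ex.instruction for ex in examples}

    if setting == "GIST":
        if len(tasks) != 1:
            raise ValueError("GIST expects a single task")
        parts = [f"Instruction: {examples[0].instruction}"]
        parts += [f"Q{i}: {ex.question}" for i, ex in enumerate(examples, 1)]
        return "\n\n".join(parts)

    if setting in ("LIST", "LIMT"):
        if setting == "LIST" and len(tasks) != 1:
            raise ValueError("LIST expects a single task")
        if setting == "LIMT" and len(tasks) < 2:
            raise ValueError("LIMT expects at least two tasks")
        parts = [
            f"Instruction: {ex.instruction}\nQ{i}: {ex.question}"
            for i, ex in enumerate(examples, 1)
        ]
        return "\n\n".join(parts)

    raise ValueError(f"unknown setting: {setting}")
```

Under these assumptions, the instruction text appears once in a GIST prompt and once per question in a LIST or LIMT prompt, so a model in the GIST layout has to carry the instruction across the whole context, while the local layouts restate it beside every question.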

Critical Analysis

While the LongIns benchmark represents an important step forward in testing the capabilities of large language models (LLMs), the authors acknowledge that it has some limitations. For example, the benchmark is currently limited to English-language tasks, and it may not fully capture the models' ability to handle long-context tasks in other languages or domains.

Additionally, the authors note that the LongIns tasks are primarily focused on open-ended, multi-step instructions, which may not be representative of all the types of long-context challenges that LLMs may face in real-world applications. BABILong, for instance, focuses more on long-context reasoning and inference, which could be a valuable complement to the LongIns benchmark.

Further research is needed to explore the extent to which LLM performance on LongIns is predictive of their real-world performance on long-context tasks, and to identify the specific architectural or training approaches that are most effective for improving long-context understanding and reasoning in these models.

Conclusion

The LongIns benchmark represents an important step forward in the evaluation of large language models (LLMs), focusing specifically on their ability to handle long-context understanding and reasoning. By presenting LLMs with open-ended, multi-part instructions that span several paragraphs of text, the authors aim to push the boundaries of what these models are capable of and identify areas where they may struggle.

The results of the authors' evaluation suggest that even state-of-the-art LLMs have significant room for improvement when it comes to maintaining coherence and consistency across long passages of text. This underscores the importance of continued research and development in this area, as long-context understanding and reasoning are crucial for many real-world applications of language models.

Overall, the LongIns benchmark provides a valuable new tool for the research community to better understand the limitations of current LLMs and to drive progress towards more capable and robust models that can effectively handle the challenges of long-context tasks.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Long-context LLMs Struggle with Long In-context Learning

Tianle Li, Ge Zhang, Quy Duc Do, Xiang Yue, Wenhu Chen

Large Language Models (LLMs) have made significant strides in handling long sequences. Some models like Gemini could even be capable of dealing with millions of tokens. However, their performance evaluation has largely been confined to metrics like perplexity and synthetic tasks, which may not fully capture their true abilities in more challenging, real-world scenarios. We introduce a benchmark (LongICLBench) for long in-context learning in extreme-label classification using six datasets with 28 to 174 classes and input lengths from 2K to 50K tokens. Our benchmark requires LLMs to comprehend the entire input to recognize the massive label spaces to make correct predictions. We evaluate 15 long-context LLMs and find that they perform well on less challenging classification tasks with smaller label spaces and shorter demonstrations. However, they struggle with more challenging tasks like Discovery with 174 labels, suggesting a gap in their ability to process long, context-rich sequences. Further analysis reveals a bias towards labels presented later in the sequence and a need for improved reasoning over multiple pieces of information. Our study reveals that long context understanding and reasoning is still a challenging task for existing LLMs. We believe LongICLBench could serve as a more realistic evaluation for future long-context LLMs.

6/13/2024

cs.CL cs.AI

LongBench: A Bilingual, Multitask Benchmark for Long Context Understanding

Yushi Bai, Xin Lv, Jiajie Zhang, Hongchang Lyu, Jiankai Tang, Zhidian Huang, Zhengxiao Du, Xiao Liu, Aohan Zeng, Lei Hou, Yuxiao Dong, Jie Tang, Juanzi Li

Although large language models (LLMs) demonstrate impressive performance for many language tasks, most of them can only handle texts a few thousand tokens long, limiting their applications on longer sequence inputs, such as books, reports, and codebases. Recent works have proposed methods to improve LLMs' long context capabilities by extending context windows and using more sophisticated memory mechanisms. However, comprehensive benchmarks tailored for evaluating long context understanding are lacking. In this paper, we introduce LongBench, the first bilingual, multi-task benchmark for long context understanding, enabling a more rigorous evaluation of long context understanding. LongBench comprises 21 datasets across 6 task categories in both English and Chinese, with an average length of 6,711 words (English) and 13,386 characters (Chinese). These tasks cover key long-text application areas including single-doc QA, multi-doc QA, summarization, few-shot learning, synthetic tasks, and code completion. All datasets in LongBench are standardized into a unified format, allowing for effortless automatic evaluation of LLMs. Upon comprehensive evaluation of 8 LLMs on LongBench, we find that: (1) The commercial model (GPT-3.5-Turbo-16k) outperforms the open-source models, but still struggles on longer contexts. (2) Scaled position embedding and fine-tuning on longer sequences lead to substantial improvement on long context understanding. (3) Context compression techniques such as retrieval bring improvement for models with weak ability on long contexts, but the performance still lags behind models that have strong long context understanding capability. The code and datasets are available at https://github.com/THUDM/LongBench.

6/21/2024

cs.CL

MileBench: Benchmarking MLLMs in Long Context

Dingjie Song, Shunian Chen, Guiming Hardy Chen, Fei Yu, Xiang Wan, Benyou Wang

Despite the advancements and impressive performance of Multimodal Large Language Models (MLLMs) on benchmarks, their effectiveness in real-world, long-context, and multi-image tasks is unclear due to the benchmarks' limited scope. Existing benchmarks often focus on single-image and short-text samples, and when assessing multi-image tasks, they either limit the image count or focus on a specific task (e.g., time-series captioning), potentially obscuring the performance challenges of MLLMs. To address these limitations, we introduce MileBench, a pioneering benchmark designed to test the MultImodal Long-contExt capabilities of MLLMs. This benchmark comprises not only multimodal long contexts, but also multiple tasks requiring both comprehension and generation. We establish two distinct evaluation sets, diagnostic and realistic, to systematically assess MLLMs' long-context adaptation capacity and their ability to complete tasks in long-context scenarios. Our experimental results, obtained from testing 22 models, revealed that while the closed-source GPT-4o outperforms others, most open-source MLLMs struggle in long-context situations. Interestingly, the performance gap tends to widen with an increase in the number of images. We strongly encourage an intensification of research efforts towards enhancing MLLMs' long-context capabilities, especially in scenarios involving multiple images.

5/16/2024

cs.CL cs.AI cs.CV cs.LG

XL$^2$Bench: A Benchmark for Extremely Long Context Understanding with Long-range Dependencies

Xuanfan Ni, Hengyi Cai, Xiaochi Wei, Shuaiqiang Wang, Dawei Yin, Piji Li

Large Language Models (LLMs) have demonstrated remarkable performance across diverse tasks but are constrained by their small context window sizes. Various efforts have been proposed to expand the context window to accommodate even up to 200K input tokens. Meanwhile, building high-quality benchmarks with much longer text lengths and more demanding tasks to provide comprehensive evaluations is of immense practical interest to facilitate long context understanding research of LLMs. However, prior benchmarks create datasets that ostensibly cater to long-text comprehension by expanding the input of traditional tasks, which falls short of exhibiting the unique characteristics of long-text understanding, including long dependency tasks and longer text length compatible with modern LLMs' context window size. In this paper, we introduce a benchmark for extremely long context understanding with long-range dependencies, XL$^2$Bench, which includes three scenarios: Fiction Reading, Paper Reading, and Law Reading, and four tasks of increasing complexity: Memory Retrieval, Detailed Understanding, Overall Understanding, and Open-ended Generation, covering 27 subtasks in English and Chinese. It has an average length of 100K+ words (English) and 200K+ characters (Chinese). Evaluating six leading LLMs on XL$^2$Bench, we find that their performance significantly lags behind human levels. Moreover, the observed decline in performance across both the original and enhanced datasets underscores the efficacy of our approach to mitigating data contamination.

4/9/2024

cs.CL