NovelQA: Benchmarking Question Answering on Documents Exceeding 200K Tokens

2403.12766

Published 6/18/2024 by Cunxiang Wang, Ruoxi Ning, Boqi Pan, Tonghui Wu, Qipeng Guo, Cheng Deng, Guangsheng Bao, Xiangkun Hu, Zheng Zhang, Qian Wang and 1 other

cs.CL

NovelQA: Benchmarking Question Answering on Documents Exceeding 200K Tokens

Abstract

The rapid advancement of Large Language Models (LLMs) has introduced a new frontier in natural language processing, particularly in understanding and processing long-context information. However, the evaluation of these models' long-context abilities remains a challenge due to the limitations of current benchmarks. To address this gap, we introduce NovelQA, a benchmark specifically designed to test the capabilities of LLMs with extended texts. Constructed from English novels, NovelQA offers a unique blend of complexity, length, and narrative coherence, making it an ideal tool for assessing deep textual understanding in LLMs. This paper presents the design and construction of NovelQA, highlighting its manual annotation, and diverse question types. Our evaluation of Long-context LLMs on NovelQA reveals significant insights into the models' performance, particularly emphasizing the challenges they face with multi-hop reasoning, detail-oriented questions, and extremely long input with an average length more than 200,000 tokens. The results underscore the necessity for further advancements in LLMs to improve their long-context comprehension.

Create account to get full access

Overview

This paper introduces a new benchmark called NovelQA for evaluating long-range question answering capabilities of language models on full-length novels.
NovelQA consists of over 6,000 questions that require reading and comprehending entire novels to answer, going beyond the typical short-context question answering tasks.
The authors create this benchmark to better understand the limitations of current language models in handling long-range reasoning and language understanding required for novel-length texts.

Plain English Explanation

The NovelQA: A Benchmark for Long-Range Novel Question Answering paper introduces a new test for language models called NovelQA. This benchmark is designed to evaluate how well these models can answer questions that require understanding the full context of a novel, rather than just short passages.

Typical question answering benchmarks only use short text snippets, but the researchers behind NovelQA wanted to create a more challenging test that mirrors real-world reading and reasoning. The NovelQA dataset contains over 6,000 questions about full-length novels, covering a variety of topics and requiring the model to integrate information across the entire book.

By creating this novel-based benchmark, the researchers aim to better understand the limitations of current language models when it comes to long-range comprehension and reasoning. As explored in related work, modern language models can struggle with tasks that require understanding very long contexts, like entire books. NovelQA provides a new way to measure and improve these important capabilities.

Technical Explanation

The NovelQA: A Benchmark for Long-Range Novel Question Answering paper introduces a new benchmark for evaluating question answering abilities of language models on full-length novels.

The dataset consists of over 6,000 questions about 14 different novels, created using an automatic question generation pipeline. The questions cover a wide range of topics and require the model to integrate information across the entire novel to answer correctly.

To create the dataset, the authors first collected a corpus of 14 full-length novels from the Project Gutenberg library. They then used an existing question generation model to automatically produce questions about these novels, filtering the output to ensure high-quality and diverse questions.

The resulting NovelQA benchmark is designed to go beyond typical short-context question answering tasks and evaluate a language model's ability to perform long-range reasoning and comprehension on novel-length texts. This is an important frontier, as recent work has shown that current large language models can struggle with tasks that require understanding very long contexts.

The authors evaluate several strong baseline models on the NovelQA dataset, including T5 and GPT-3, and find that performance lags far behind human-level understanding. This suggests significant room for improvement in developing language models that can truly grasp long-form narratives.

Critical Analysis

The NovelQA benchmark represents an important step forward in evaluating language model capabilities on long-range comprehension and reasoning tasks. By focusing on full-length novels rather than short passages, it provides a more realistic and demanding test of these models' abilities.

That said, the authors acknowledge several limitations of the current benchmark. For one, the automatically generated questions may not capture the full breadth and complexity of human-authored questions about novels. There is also the question of how well the benchmark generalizes beyond the specific 14 novels included.

Additionally, as noted in related work, evaluating language models on long-form tasks like this raises challenges around measuring and interpreting performance. Metrics like exact match may not fully capture a model's underlying language understanding.

Further research is needed to better understand the strengths, weaknesses, and failure modes of language models on long-range comprehension tasks. Incorporating human evaluation, probing for specific reasoning capabilities, and expanding the benchmark to cover more diverse novels could all yield valuable insights.

Overall, the NovelQA benchmark represents an important contribution, but there remains significant room for improving the long-range language understanding abilities of AI systems, as highlighted by the limitations of current large language models.

Conclusion

The NovelQA: A Benchmark for Long-Range Novel Question Answering paper introduces a new benchmark for evaluating language models' ability to answer questions that require understanding the full context of a novel.

By creating a dataset of over 6,000 questions about 14 full-length novels, the authors aim to push the boundaries of current question answering capabilities and better understand the limitations of modern language models when it comes to long-range reasoning and comprehension.

The results show that even strong baseline models like T5 and GPT-3 struggle to match human performance on this challenging task, suggesting significant room for improvement in developing AI systems that can truly grasp and reason about long-form narratives. Further research building on this benchmark could yield valuable insights for advancing the field of natural language understanding.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

🛸

Long-Span Question-Answering: Automatic Question Generation and QA-System Ranking via Side-by-Side Evaluation

Bernd Bohnet, Kevin Swersky, Rosanne Liu, Pranjal Awasthi, Azade Nova, Javier Snaider, Hanie Sedghi, Aaron T Parisi, Michael Collins, Angeliki Lazaridou, Orhan Firat, Noah Fiedel

We explore the use of long-context capabilities in large language models to create synthetic reading comprehension data from entire books. Previous efforts to construct such datasets relied on crowd-sourcing, but the emergence of transformers with a context size of 1 million or more tokens now enables entirely automatic approaches. Our objective is to test the capabilities of LLMs to analyze, understand, and reason over problems that require a detailed comprehension of long spans of text, such as questions involving character arcs, broader themes, or the consequences of early actions later in the story. We propose a holistic pipeline for automatic data generation including question generation, answering, and model scoring using an ``Evaluator''. We find that a relative approach, comparing answers between models in a pairwise fashion and ranking with a Bradley-Terry model, provides a more consistent and differentiating scoring mechanism than an absolute scorer that rates answers individually. We also show that LLMs from different model families produce moderate agreement in their ratings. We ground our approach using the manually curated NarrativeQA dataset, where our evaluator shows excellent agreement with human judgement and even finds errors in the dataset. Using our automatic evaluation approach, we show that using an entire book as context produces superior reading comprehension performance compared to baseline no-context (parametric knowledge only) and retrieval-based approaches.

6/4/2024

cs.CL cs.AI

🤔

LongBench: A Bilingual, Multitask Benchmark for Long Context Understanding

Yushi Bai, Xin Lv, Jiajie Zhang, Hongchang Lyu, Jiankai Tang, Zhidian Huang, Zhengxiao Du, Xiao Liu, Aohan Zeng, Lei Hou, Yuxiao Dong, Jie Tang, Juanzi Li

Although large language models (LLMs) demonstrate impressive performance for many language tasks, most of them can only handle texts a few thousand tokens long, limiting their applications on longer sequence inputs, such as books, reports, and codebases. Recent works have proposed methods to improve LLMs' long context capabilities by extending context windows and more sophisticated memory mechanisms. However, comprehensive benchmarks tailored for evaluating long context understanding are lacking. In this paper, we introduce LongBench, the first bilingual, multi-task benchmark for long context understanding, enabling a more rigorous evaluation of long context understanding. LongBench comprises 21 datasets across 6 task categories in both English and Chinese, with an average length of 6,711 words (English) and 13,386 characters (Chinese). These tasks cover key long-text application areas including single-doc QA, multi-doc QA, summarization, few-shot learning, synthetic tasks, and code completion. All datasets in LongBench are standardized into a unified format, allowing for effortless automatic evaluation of LLMs. Upon comprehensive evaluation of 8 LLMs on LongBench, we find that: (1) Commercial model (GPT-3.5-Turbo-16k) outperforms other open-sourced models, but still struggles on longer contexts. (2) Scaled position embedding and fine-tuning on longer sequences lead to substantial improvement on long context understanding. (3) Context compression technique such as retrieval brings improvement for model with weak ability on long contexts, but the performance still lags behind models that have strong long context understanding capability. The code and datasets are available at https://github.com/THUDM/LongBench.

6/21/2024

cs.CL

XL$^2$Bench: A Benchmark for Extremely Long Context Understanding with Long-range Dependencies

Xuanfan Ni, Hengyi Cai, Xiaochi Wei, Shuaiqiang Wang, Dawei Yin, Piji Li

Large Language Models (LLMs) have demonstrated remarkable performance across diverse tasks but are constrained by their small context window sizes. Various efforts have been proposed to expand the context window to accommodate even up to 200K input tokens. Meanwhile, building high-quality benchmarks with much longer text lengths and more demanding tasks to provide comprehensive evaluations is of immense practical interest to facilitate long context understanding research of LLMs. However, prior benchmarks create datasets that ostensibly cater to long-text comprehension by expanding the input of traditional tasks, which falls short to exhibit the unique characteristics of long-text understanding, including long dependency tasks and longer text length compatible with modern LLMs' context window size. In this paper, we introduce a benchmark for extremely long context understanding with long-range dependencies, XL$^2$Bench, which includes three scenarios: Fiction Reading, Paper Reading, and Law Reading, and four tasks of increasing complexity: Memory Retrieval, Detailed Understanding, Overall Understanding, and Open-ended Generation, covering 27 subtasks in English and Chinese. It has an average length of 100K+ words (English) and 200K+ characters (Chinese). Evaluating six leading LLMs on XL$^2$Bench, we find that their performance significantly lags behind human levels. Moreover, the observed decline in performance across both the original and enhanced datasets underscores the efficacy of our approach to mitigating data contamination.

4/9/2024

cs.CL

Long-context LLMs Struggle with Long In-context Learning

Tianle Li, Ge Zhang, Quy Duc Do, Xiang Yue, Wenhu Chen

Large Language Models (LLMs) have made significant strides in handling long sequences. Some models like Gemini could even to be capable of dealing with millions of tokens. However, their performance evaluation has largely been confined to metrics like perplexity and synthetic tasks, which may not fully capture their true abilities in more challenging, real-world scenarios. We introduce a benchmark (LongICLBench) for long in-context learning in extreme-label classification using six datasets with 28 to 174 classes and input lengths from 2K to 50K tokens. Our benchmark requires LLMs to comprehend the entire input to recognize the massive label spaces to make correct predictions. We evaluate on 15 long-context LLMs and find that they perform well on less challenging classification tasks with smaller label space and shorter demonstrations. However, they struggle with more challenging task like Discovery with 174 labels, suggesting a gap in their ability to process long, context-rich sequences. Further analysis reveals a bias towards labels presented later in the sequence and a need for improved reasoning over multiple pieces of information. Our study reveals that long context understanding and reasoning is still a challenging task for the existing LLMs. We believe LongICLBench could serve as a more realistic evaluation for the future long-context LLMs.

6/13/2024

cs.CL cs.AI