Long-Span Question-Answering: Automatic Question Generation and QA-System Ranking via Side-by-Side Evaluation

arXiv:2406.00179

Published 6/4/2024 by Bernd Bohnet, Kevin Swersky, Rosanne Liu, Pranjal Awasthi, Azade Nova, Javier Snaider, Hanie Sedghi, Aaron T Parisi, Michael Collins, Angeliki Lazaridou and 2 others

cs.CL cs.AI

Abstract

We explore the use of long-context capabilities in large language models to create synthetic reading comprehension data from entire books. Previous efforts to construct such datasets relied on crowd-sourcing, but the emergence of transformers with a context size of 1 million or more tokens now enables entirely automatic approaches. Our objective is to test the capabilities of LLMs to analyze, understand, and reason over problems that require a detailed comprehension of long spans of text, such as questions involving character arcs, broader themes, or the consequences of early actions later in the story. We propose a holistic pipeline for automatic data generation including question generation, answering, and model scoring using an "Evaluator". We find that a relative approach, comparing answers between models in a pairwise fashion and ranking with a Bradley-Terry model, provides a more consistent and differentiating scoring mechanism than an absolute scorer that rates answers individually. We also show that LLMs from different model families produce moderate agreement in their ratings. We ground our approach using the manually curated NarrativeQA dataset, where our evaluator shows excellent agreement with human judgement and even finds errors in the dataset. Using our automatic evaluation approach, we show that using an entire book as context produces superior reading comprehension performance compared to baseline no-context (parametric knowledge only) and retrieval-based approaches.

Overview

This paper explores the use of large language models (LLMs) with long context capabilities to automatically generate synthetic reading comprehension data from entire books.
Previous efforts relied on crowdsourcing, but the emergence of transformers with 1 million+ token context size now enables fully automatic approaches.
The goal is to test LLM capabilities to analyze, understand, and reason over problems requiring detailed comprehension of long text, such as questions about character arcs, themes, or consequences of early actions.
The authors propose a pipeline for automatic data generation including question generation, answering, and model scoring using an "Evaluator".
They find that a relative approach comparing model answers in a pairwise fashion and ranking with a Bradley-Terry model provides more consistent and differentiating scoring than an absolute scorer.
LLMs from different model families also show moderate agreement in their ratings.

Plain English Explanation

The researchers wanted to see if powerful language models could be used to automatically create reading comprehension datasets from entire books, rather than relying on crowd-sourcing like previous efforts. The idea is to test these models' ability to deeply understand long stretches of text, such as by asking questions about the characters' journeys, the broader themes of a story, or how earlier events impact later parts of the plot.

To do this, the researchers developed a complete system that can generate questions, get answers from language models, and then evaluate the quality of those answers. Interestingly, they found that comparing the answers from different models in a relative way (i.e. ranking them against each other) works better for scoring than just judging each answer individually. They also noticed that language models from different families tend to agree moderately on their evaluations.

The researchers first checked their evaluator against an existing, manually curated reading comprehension dataset (NarrativeQA), where it agreed closely with human judgment. Using that evaluator, they then found that giving a language model an entire book as context produces better answers than relying on the model's built-in knowledge alone or on retrieved passages. This suggests these models are developing impressive capabilities for deeply understanding lengthy narratives.

Technical Explanation

The paper proposes a pipeline for automatically generating synthetic reading comprehension datasets from entire books, leveraging the long context capabilities of modern transformers.

The key components are:

Question Generation: The system automatically generates questions about the content of the books.
Answer Generation: Language models are used to provide answers to the generated questions.
Evaluator: A scoring mechanism is developed to assess the quality of the model-generated answers.
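The three components above can be wired together roughly as follows. This is a minimal sketch, not the paper's implementation: `run_pipeline`, `Comparison`, and the three callable types are hypothetical names, and in practice each callable would wrap a call to a long-context LLM. The sketch assumes the evaluator returns 0 when the first answer is better and 1 when the second is better, with no ties.

```python
from dataclasses import dataclass
from itertools import combinations
from typing import Callable, Dict, List

# Hypothetical interfaces (assumptions for illustration, not the paper's API).
QuestionGenerator = Callable[[str], List[str]]  # book text -> questions
Answerer = Callable[[str, str], str]            # (book text, question) -> answer
Judge = Callable[[str, str, str], int]          # (question, answer_a, answer_b) -> 0 or 1


@dataclass
class Comparison:
    """One pairwise verdict from the evaluator."""
    question: str
    winner: str
    loser: str


def run_pipeline(book: str,
                 generate_questions: QuestionGenerator,
                 answerers: Dict[str, Answerer],
                 judge: Judge) -> List[Comparison]:
    """Generate questions, collect answers, and judge every pair of systems."""
    comparisons = []
    for question in generate_questions(book):
        # Each QA system answers the same question.
        answers = {name: answer(book, question) for name, answer in answerers.items()}
        # The evaluator compares every unordered pair of systems once.
        for a, b in combinations(sorted(answers), 2):
            verdict = judge(question, answers[a], answers[b])
            winner, loser = (a, b) if verdict == 0 else (b, a)
            comparisons.append(Comparison(question, winner, loser))
    return comparisons
```

The returned list of winner/loser verdicts is the kind of input a pairwise ranking model consumes. A real evaluator would also need to handle ties and answer-order effects, which this sketch leaves out.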

The authors find that a relative scoring approach, where answers are compared pairwise between models and ranked using a Bradley-Terry model, provides more consistent and differentiating results than an absolute scoring mechanism. They also observe moderate agreement between language models from different families in their evaluations.
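For readers unfamiliar with the Bradley-Terry model: it assigns each system i a positive strength p_i and models the probability that i beats j as p_i / (p_i + p_j). The sketch below fits those strengths from a list of (winner, loser) pairs using the standard minorization-maximization update. It is an illustration only; the source does not say which fitting procedure the authors used, and the function name `fit_bradley_terry` and the system labels in the usage example are made up.

```python
from typing import Dict, Iterable, Tuple


def fit_bradley_terry(outcomes: Iterable[Tuple[str, str]],
                      iterations: int = 200) -> Dict[str, float]:
    """Estimate Bradley-Terry strengths from (winner, loser) pairs.

    Strengths are normalized to sum to 1, so only their ratios are meaningful.
    """
    wins: Dict[str, float] = {}        # total wins per system
    games: Dict[frozenset, float] = {} # number of comparisons per unordered pair
    systems = set()
    for winner, loser in outcomes:
        wins[winner] = wins.get(winner, 0.0) + 1.0
        pair = frozenset((winner, loser))
        games[pair] = games.get(pair, 0.0) + 1.0
        systems.update((winner, loser))

    strengths = {s: 1.0 / len(systems) for s in systems}
    for _ in range(iterations):
        updated = {}
        for i in systems:
            # Sum over opponents that system i was actually compared with.
            denom = sum(
                games[frozenset((i, j))] / (strengths[i] + strengths[j])
                for j in systems
                if j != i and games.get(frozenset((i, j)), 0.0) > 0
            )
            updated[i] = wins.get(i, 0.0) / denom if denom > 0 else strengths[i]
        total = sum(updated.values())
        strengths = {s: v / total for s, v in updated.items()}
    return strengths


# Usage: a toy tournament between three hypothetical QA settings.
toy = ([("full_book", "retrieval")] * 8 + [("retrieval", "full_book")] * 2
       + [("retrieval", "no_context")] * 7 + [("no_context", "retrieval")] * 3
       + [("full_book", "no_context")] * 9 + [("no_context", "full_book")] * 1)
ranking = sorted(fit_bradley_terry(toy).items(), key=lambda kv: -kv[1])
```

Because the strengths are fitted jointly from all comparisons, a single ranking emerges even when individual pairwise verdicts are noisy. A system that never wins gets strength 0 under this plain maximum-likelihood fit; smoothing or a prior would be needed to avoid that.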

The system is tested on the NarrativeQA dataset, where it shows strong agreement with human judgments and even identifies errors in the original dataset. Importantly, the authors demonstrate that using an entire book as context leads to superior reading comprehension performance compared to baseline approaches that only leverage parametric knowledge or retrieve relevant passages.
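Agreement claims like the ones above, whether between evaluators from different model families or between an evaluator and human judgment, are commonly quantified with a chance-corrected statistic. The sketch below computes Cohen's kappa for two raters who labeled the same items. The source does not state which agreement measure the authors used, so treat the choice of kappa, the function name `cohens_kappa`, and the toy labels as assumptions.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(ratings_a: Sequence[Hashable],
                 ratings_b: Sequence[Hashable]) -> float:
    """Chance-corrected agreement between two raters over the same items."""
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("need two non-empty rating lists of equal length")
    n = len(ratings_a)
    # Fraction of items on which the two raters gave the same label.
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    # Agreement expected by chance, from each rater's label frequencies.
    freq_a, freq_b = Counter(ratings_a), Counter(ratings_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1.0 - expected)


# Usage: which answer (A or B) each of two hypothetical evaluators preferred.
evaluator_1 = ["A", "A", "B", "B", "A", "B", "A", "A"]
evaluator_2 = ["A", "B", "B", "B", "A", "A", "A", "A"]
kappa = cohens_kappa(evaluator_1, evaluator_2)  # 7/15, about 0.47
```

In this toy example the raters agree on 6 of 8 items, but because both prefer "A" most of the time, much of that agreement is expected by chance, and kappa comes out well below the raw 75% agreement rate.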

Critical Analysis

The paper presents a compelling approach to automatically generate high-quality reading comprehension datasets from books, addressing the limitations of previous crowd-sourcing efforts. By leveraging the long context capabilities of modern language models, the authors are able to test these models' ability to deeply understand and reason over lengthy narratives.

One potential limitation is the reliance on a single dataset, NarrativeQA, for grounding and evaluation. It would be valuable to see how the system performs on a broader range of reading comprehension benchmarks, especially those that cover different genres or styles of text.

Additionally, the relative scoring approach has a built-in dependency: the rankings depend on the specific set of models being compared, so scores are not directly comparable across evaluations that use different model pools. It could be worthwhile to explore alternative scoring mechanisms that provide absolute measures of answer quality.

Finally, while the paper demonstrates the advantages of using full-book context, it would be interesting to further investigate the specific types of questions or reasoning tasks that benefit most from this long-range understanding, versus those that can be adequately addressed using more local contextual information.

Overall, this research represents an important step forward in empowering language models to engage with and comprehend lengthy, complex narratives, with promising applications in areas like educational assessment, literary analysis, and interactive storytelling.

Conclusion

This paper presents a novel approach for automatically generating high-quality reading comprehension datasets from entire books, leveraging the long-context capabilities of modern language models. By developing a holistic pipeline for question generation, answer production, and relative model scoring, the authors demonstrate the ability of these models to deeply analyze and reason over lengthy narratives.

The findings suggest that using full-book context can significantly improve reading comprehension performance compared to more limited approaches, and that language models from different families exhibit moderate agreement in their evaluations. This work lays the groundwork for further advancements in applying large language models to complex, long-form understanding tasks with broad applications in education, literary studies, and beyond.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

NovelQA: Benchmarking Question Answering on Documents Exceeding 200K Tokens

Cunxiang Wang, Ruoxi Ning, Boqi Pan, Tonghui Wu, Qipeng Guo, Cheng Deng, Guangsheng Bao, Xiangkun Hu, Zheng Zhang, Qian Wang, Yue Zhang

The rapid advancement of Large Language Models (LLMs) has introduced a new frontier in natural language processing, particularly in understanding and processing long-context information. However, the evaluation of these models' long-context abilities remains a challenge due to the limitations of current benchmarks. To address this gap, we introduce NovelQA, a benchmark specifically designed to test the capabilities of LLMs with extended texts. Constructed from English novels, NovelQA offers a unique blend of complexity, length, and narrative coherence, making it an ideal tool for assessing deep textual understanding in LLMs. This paper presents the design and construction of NovelQA, highlighting its manual annotation, and diverse question types. Our evaluation of Long-context LLMs on NovelQA reveals significant insights into the models' performance, particularly emphasizing the challenges they face with multi-hop reasoning, detail-oriented questions, and extremely long input with an average length more than 200,000 tokens. The results underscore the necessity for further advancements in LLMs to improve their long-context comprehension.

6/18/2024

cs.CL

PROXYQA: An Alternative Framework for Evaluating Long-Form Text Generation with Large Language Models

Haochen Tan, Zhijiang Guo, Zhan Shi, Lu Xu, Zhili Liu, Yunlong Feng, Xiaoguang Li, Yasheng Wang, Lifeng Shang, Qun Liu, Linqi Song

Large Language Models (LLMs) have succeeded remarkably in understanding long-form contents. However, exploring their capability for generating long-form contents, such as reports and articles, has been relatively unexplored and inadequately assessed by existing benchmarks. The prevalent evaluation methods, which predominantly rely on crowdsourcing, are recognized for their labor-intensive nature and lack of efficiency, whereas automated metrics, such as the ROUGE score, demonstrate discordance with human judgment criteria. In this paper, we propose ProxyQA, an innovative framework dedicated to assessing long-text generation. ProxyQA comprises in-depth human-curated meta-questions spanning various domains, each accompanied by specific proxy-questions with pre-annotated answers. LLMs are tasked to generate extensive content in response to these meta-questions; by engaging an evaluator and incorporating the generated texts as contextual background, ProxyQA assesses the generated content's quality through the evaluator's accuracy in addressing the proxy-questions. We examine multiple LLMs, emphasizing ProxyQA's demanding nature as a high-quality assessment tool. Human evaluation demonstrates that the proxy-question method is notably self-consistent and aligns closely with human evaluative standards. The dataset and leaderboard are available at https://proxy-qa.com.

6/5/2024

cs.CL cs.AI

Synthetic Context Generation for Question Generation

Naiming Liu, Zichao Wang, Richard Baraniuk

Despite rapid advancements in large language models (LLMs), question generation (QG) remains a challenging problem due to its complicated process, open-ended nature, and the diverse settings in which question generation occurs. A common approach to address these challenges involves fine-tuning smaller, custom models using datasets containing background context, question, and answer. However, obtaining suitable domain-specific datasets with appropriate context is often more difficult than acquiring question-answer pairs. In this paper, we investigate training QG models using synthetic contexts generated by LLMs from readily available question-answer pairs. We conduct a comprehensive study to answer critical research questions related to the performance of models trained on synthetic contexts and their potential impact on QG research and applications. Our empirical results reveal: 1) contexts are essential for QG tasks, even if they are synthetic; 2) fine-tuning smaller language models has the capability of achieving better performances as compared to prompting larger language models; and 3) synthetic context and real context could achieve comparable performances. These findings highlight the effectiveness of synthetic contexts in QG and pave the way for future advancements in the field.

6/21/2024

cs.CL cs.LG

Retrieval Meets Reasoning: Dynamic In-Context Editing for Long-Text Understanding

Weizhi Fei, Xueyan Niu, Guoqing Xie, Yanhua Zhang, Bo Bai, Lei Deng, Wei Han

Current Large Language Models (LLMs) face inherent limitations due to their pre-defined context lengths, which impede their capacity for multi-hop reasoning within extensive textual contexts. While existing techniques like Retrieval-Augmented Generation (RAG) have attempted to bridge this gap by sourcing external information, they fall short when direct answers are not readily available. We introduce a novel approach that re-imagines information retrieval through dynamic in-context editing, inspired by recent breakthroughs in knowledge editing. By treating lengthy contexts as malleable external knowledge, our method interactively gathers and integrates relevant information, thereby enabling LLMs to perform sophisticated reasoning steps. Experimental results demonstrate that our method effectively empowers context-limited LLMs, such as Llama2, to engage in multi-hop reasoning with improved performance, which outperforms state-of-the-art context window extrapolation methods and even compares favorably to more advanced commercial long-context models. Our interactive method not only enhances reasoning capabilities but also mitigates the associated training and computational costs, making it a pragmatic solution for enhancing LLMs' reasoning within expansive contexts.

6/19/2024

cs.CL cs.AI