PROXYQA: An Alternative Framework for Evaluating Long-Form Text Generation with Large Language Models

2401.15042

Published 6/5/2024 by Haochen Tan, Zhijiang Guo, Zhan Shi, Lu Xu, Zhili Liu, Yunlong Feng, Xiaoguang Li, Yasheng Wang, Lifeng Shang, Qun Liu and 1 other

cs.CL cs.AI

Abstract

Large Language Models (LLMs) have succeeded remarkably in understanding long-form contents. However, their capability for generating long-form contents, such as reports and articles, has been relatively unexplored and inadequately assessed by existing benchmarks. The prevalent evaluation methods, which predominantly rely on crowdsourcing, are recognized for their labor-intensive nature and lack of efficiency, whereas automated metrics, such as the ROUGE score, demonstrate discordance with human judgment criteria. In this paper, we propose ProxyQA, an innovative framework dedicated to assessing long-text generation. ProxyQA comprises in-depth human-curated meta-questions spanning various domains, each accompanied by specific proxy-questions with pre-annotated answers. LLMs are tasked to generate extensive content in response to these meta-questions. By engaging an evaluator and incorporating the generated texts as contextual background, ProxyQA assesses the generated content's quality through the evaluator's accuracy in addressing the proxy-questions. We examine multiple LLMs, emphasizing ProxyQA's demanding nature as a high-quality assessment tool. Human evaluation demonstrates that the proxy-question method is notably self-consistent and aligns closely with human evaluative standards. The dataset and leaderboard are available at https://proxy-qa.com.

Overview

The paper explores the ability of Large Language Models (LLMs) to generate high-quality long-form content, such as reports and articles, which has been relatively unexplored.
Existing evaluation methods, like crowdsourcing and automated metrics, are recognized as labor-intensive, inefficient, or discordant with human judgment.
The paper introduces ProxyQA, a novel framework for assessing long-text generation by LLMs.

Plain English Explanation

Large language models (LLMs) are AI systems that can understand and process long texts very well. However, their ability to generate high-quality long-form content, like reports or articles, has not been studied in depth. The current ways of evaluating this ability have issues: crowdsourced feedback is labor-intensive and inefficient, while automated metrics don't match up well with how humans would judge the content.

To address this, the researchers propose a new framework called ProxyQA. The idea is to give LLMs a set of broad "meta-questions" on various topics and have them generate detailed responses. The quality of each response is then evaluated indirectly: an evaluator reads the generated text as background and tries to answer more specific "proxy-questions" whose answers were annotated in advance. The more proxy-questions the evaluator gets right, the more informative the generated text is judged to be. This allows for a more rigorous and consistent assessment of the LLMs' ability to generate high-quality long-form text.

Technical Explanation

The paper introduces ProxyQA, a novel framework for evaluating the long-form text generation capabilities of large language models (LLMs). It consists of a set of human-curated "meta-questions" spanning different domains, each accompanied by specific "proxy-questions" with pre-annotated answers.

LLMs are tasked with generating detailed responses to the meta-questions. An evaluator then uses the generated text as background information to answer the corresponding proxy-questions. The accuracy of the evaluator in addressing the proxy-questions is used as a proxy for assessing the quality of the LLM's generated content.
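The scoring step described above can be sketched in a few lines of Python. This is a minimal illustration and not the authors' implementation: the class names, the `proxy_accuracy` function, and the assumption that each proxy-question has a yes/no answer are all hypothetical, and `keyword_evaluator` is a toy stand-in for whatever evaluator the framework actually uses.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class ProxyQuestion:
    question: str
    answer: bool  # pre-annotated answer (assumed to be yes/no in this sketch)


@dataclass
class MetaQuestion:
    question: str
    proxy_questions: List[ProxyQuestion]


# Hypothetical evaluator signature: (context, question) -> predicted answer.
Evaluator = Callable[[str, str], bool]


def proxy_accuracy(generated_text: str, meta: MetaQuestion, evaluator: Evaluator) -> float:
    """Fraction of proxy-questions the evaluator answers correctly
    when given the generated text as its only context."""
    if not meta.proxy_questions:
        raise ValueError("meta-question has no proxy-questions")
    correct = sum(
        evaluator(generated_text, pq.question) == pq.answer
        for pq in meta.proxy_questions
    )
    return correct / len(meta.proxy_questions)


def keyword_evaluator(context: str, question: str) -> bool:
    """Toy stand-in for a real evaluator: answers True when the term
    following 'mention ' in the question appears in the context."""
    key = question.rstrip("?").split("mention ", 1)[-1].lower()
    return key in context.lower()


meta = MetaQuestion(
    question="Explain how transformers process text.",
    proxy_questions=[
        ProxyQuestion("Does the text mention attention?", True),
        ProxyQuestion("Does the text mention positional encoding?", True),
    ],
)
report = "Transformers rely on attention to relate tokens to one another."
score = proxy_accuracy(report, meta, keyword_evaluator)
print(score)  # 0.5: the report covers attention but not positional encoding
```

In this sketch a generated text scores higher only when it contains the information needed to answer more proxy-questions, which is the sense in which evaluator accuracy stands in for content quality.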

The researchers examine the performance of multiple LLMs using the ProxyQA framework, and the results emphasize how demanding it is as an assessment tool. The human evaluation shows that the proxy-question approach is notably self-consistent and closely aligns with human evaluative standards.

Critical Analysis

The ProxyQA framework addresses important limitations of current evaluation methods for long-form text generation by LLMs. By using a set of pre-defined proxy-questions, it provides a more consistent and reliable way to assess the quality of the generated content, compared to relying on crowdsourced feedback or automated metrics like ROUGE, which may not fully capture human judgment criteria.

However, the paper does not explore the potential biases or limitations of the proxy-question approach itself. It is possible that the selection of meta-questions and proxy-questions could introduce systematic biases, and the pre-defined answers to the proxy-questions may not fully capture the nuances of human evaluation.

Additionally, the paper does not address how the ProxyQA framework could be scaled to a larger number of topics and domains, or how it could be adapted to assess other forms of long-form content, such as financial question-answering or advisory question-answering. Further research is needed to explore the broader applicability and potential limitations of this approach.

Conclusion

The paper presents ProxyQA, a novel framework for evaluating the long-form text generation capabilities of large language models. By using a set of human-curated meta-questions and corresponding proxy-questions, ProxyQA provides a more consistent and reliable way to assess the quality of the generated content compared to existing evaluation methods.

The findings suggest that ProxyQA is a promising approach for advancing the state of the art in long-form text generation by LLMs, which has significant implications for the development of AI systems that can produce high-quality, informative, and coherent long-form content across a wide range of domains.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

NovelQA: Benchmarking Question Answering on Documents Exceeding 200K Tokens

Cunxiang Wang, Ruoxi Ning, Boqi Pan, Tonghui Wu, Qipeng Guo, Cheng Deng, Guangsheng Bao, Xiangkun Hu, Zheng Zhang, Qian Wang, Yue Zhang

The rapid advancement of Large Language Models (LLMs) has introduced a new frontier in natural language processing, particularly in understanding and processing long-context information. However, the evaluation of these models' long-context abilities remains a challenge due to the limitations of current benchmarks. To address this gap, we introduce NovelQA, a benchmark specifically designed to test the capabilities of LLMs with extended texts. Constructed from English novels, NovelQA offers a unique blend of complexity, length, and narrative coherence, making it an ideal tool for assessing deep textual understanding in LLMs. This paper presents the design and construction of NovelQA, highlighting its manual annotation, and diverse question types. Our evaluation of Long-context LLMs on NovelQA reveals significant insights into the models' performance, particularly emphasizing the challenges they face with multi-hop reasoning, detail-oriented questions, and extremely long input with an average length more than 200,000 tokens. The results underscore the necessity for further advancements in LLMs to improve their long-context comprehension.

6/18/2024

cs.CL

Long-Span Question-Answering: Automatic Question Generation and QA-System Ranking via Side-by-Side Evaluation

Bernd Bohnet, Kevin Swersky, Rosanne Liu, Pranjal Awasthi, Azade Nova, Javier Snaider, Hanie Sedghi, Aaron T Parisi, Michael Collins, Angeliki Lazaridou, Orhan Firat, Noah Fiedel

We explore the use of long-context capabilities in large language models to create synthetic reading comprehension data from entire books. Previous efforts to construct such datasets relied on crowd-sourcing, but the emergence of transformers with a context size of 1 million or more tokens now enables entirely automatic approaches. Our objective is to test the capabilities of LLMs to analyze, understand, and reason over problems that require a detailed comprehension of long spans of text, such as questions involving character arcs, broader themes, or the consequences of early actions later in the story. We propose a holistic pipeline for automatic data generation including question generation, answering, and model scoring using an "Evaluator". We find that a relative approach, comparing answers between models in a pairwise fashion and ranking with a Bradley-Terry model, provides a more consistent and differentiating scoring mechanism than an absolute scorer that rates answers individually. We also show that LLMs from different model families produce moderate agreement in their ratings. We ground our approach using the manually curated NarrativeQA dataset, where our evaluator shows excellent agreement with human judgement and even finds errors in the dataset. Using our automatic evaluation approach, we show that using an entire book as context produces superior reading comprehension performance compared to baseline no-context (parametric knowledge only) and retrieval-based approaches.

6/4/2024

cs.CL cs.AI

Beyond the Answers: Reviewing the Rationality of Multiple Choice Question Answering for the Evaluation of Large Language Models

Haochun Wang, Sendong Zhao, Zewen Qiang, Nuwa Xi, Bing Qin, Ting Liu

In the field of natural language processing (NLP), Large Language Models (LLMs) have precipitated a paradigm shift, markedly enhancing performance in natural language generation tasks. Despite these advancements, the comprehensive evaluation of LLMs remains an inevitable challenge for the community. Recently, the utilization of Multiple Choice Question Answering (MCQA) as a benchmark for LLMs has gained considerable traction. This study first investigates the limitations of MCQA as an evaluation method for LLMs and then analyzes the fundamental reason for the limitations of MCQA, that while LLMs may select the correct answers, it is possible that they also recognize other wrong options as correct. Finally, we propose a dataset augmenting method for Multiple-Choice Questions (MCQs), MCQA+, that can more accurately reflect the performance of the model, which underscores the need for more robust evaluation mechanisms in assessing the performance of LLMs.

5/31/2024

cs.CL cs.AI

ExpertQA: Expert-Curated Questions and Attributed Answers

Chaitanya Malaviya, Subin Lee, Sihao Chen, Elizabeth Sieber, Mark Yatskar, Dan Roth

As language models are adopted by a more sophisticated and diverse set of users, the importance of guaranteeing that they provide factually correct information supported by verifiable sources is critical across fields of study. This is especially the case for high-stakes fields, such as medicine and law, where the risk of propagating false information is high and can lead to undesirable societal consequences. Previous work studying attribution and factuality has not focused on analyzing these characteristics of language model outputs in domain-specific scenarios. In this work, we conduct human evaluation of responses from a few representative systems along various axes of attribution and factuality, by bringing domain experts in the loop. Specifically, we collect expert-curated questions from 484 participants across 32 fields of study, and then ask the same experts to evaluate generated responses to their own questions. In addition, we ask experts to improve upon responses from language models. The output of our analysis is ExpertQA, a high-quality long-form QA dataset with 2177 questions spanning 32 fields, along with verified answers and attributions for claims in the answers.

4/3/2024

cs.CL cs.AI