CLAPNQ: Cohesive Long-form Answers from Passages in Natural Questions for RAG systems

Read original: arXiv:2404.02103 - Published 4/3/2024 by Sara Rosenthal, Avirup Sil, Radu Florian, Salim Roukos

CLAPNQ: Cohesive Long-form Answers from Passages in Natural Questions for RAG systems

Overview

This paper introduces a new dataset called "CLAPnq" for training and evaluating question-answering systems.
The dataset consists of natural questions paired with long-form, cohesive answers extracted from web passages.
The goal is to support the development of more advanced question-answering systems, particularly those using Retrieval-Augmented Generation (RAG) architectures.

Plain English Explanation

The paper focuses on creating a dataset to help improve the ability of AI systems to answer complex, open-ended questions. Current question-answering systems often struggle to provide detailed, well-structured responses, especially for more nuanced questions.

The researchers built a new dataset called "CLAPnq" that contains natural questions alongside long, coherent answers drawn from web content. The idea is that training AI models on this data will allow them to generate more comprehensive and contextual responses, rather than just short, factual answers.

This is particularly relevant for Retrieval-Augmented Generation (RAG) models, which combine retrieval of relevant information with language generation. By having access to the CLAPnq dataset, these RAG systems can learn to synthesize multi-sentence answers that flow logically and provide deeper insights, rather than just pulling isolated facts.

Overall, the goal is to push the boundaries of what is possible with question-answering AI, moving beyond simple lookups to more human-like language understanding and response generation.

Technical Explanation

The researchers created the CLAPnq dataset by first gathering a large corpus of natural questions from the Google Natural Questions dataset. They then used a retrieval model to find relevant passages from a web corpus that could potentially answer each question.

Human annotators were employed to carefully review the retrieved passages and construct a coherent, multi-sentence answer that synthesized the key information. The final dataset contains over 12,000 question-answer pairs, with the answers averaging around 100 words in length.

Experiments were conducted to evaluate the quality and characteristics of the CLAPnq dataset. The researchers found that the long-form answers display stronger coherence, relevance, and informativeness compared to typical short-answer datasets. Additionally, the dataset supports more challenging question types that require reasoning across multiple pieces of information.

The paper also discusses the potential benefits of using CLAPnq to train advanced RAG-based question-answering models. By having access to high-quality, context-rich training data, these systems may be able to generate more comprehensive and natural-sounding responses.

Critical Analysis

The CLAPnq dataset represents an important step forward in building more capable question-answering systems. The focus on long-form, coherent answers is a valuable contribution, as many current models struggle to go beyond simple factual lookups.

One potential limitation mentioned in the paper is that the dataset may not cover the full breadth of question types and topics that real-world users might ask about. The researchers acknowledge this and suggest further expansion and diversification of the dataset as an area for future work.

Additionally, while the human-written answers in CLAPnq demonstrate strong coherence, there may be opportunities to further improve the quality and fluency of the responses through techniques like multi-document summarization. Exploring these kinds of advancements could lead to even more natural and informative question-answering capabilities.

Overall, the CLAPnq dataset and the ideas presented in this paper represent an important step forward in advancing the state-of-the-art in question-answering AI. Continued research and development in this direction could yield significant benefits for a wide range of real-world applications.

Conclusion

This paper introduces the CLAPnq dataset, which aims to support the development of more capable question-answering systems, particularly those using Retrieval-Augmented Generation (RAG) architectures. By providing long-form, coherent answers drawn from web passages, the dataset enables AI models to learn how to synthesize comprehensive and contextual responses, rather than just returning short factual snippets.

The experiments conducted by the researchers demonstrate the quality and usefulness of the CLAPnq data, highlighting its potential to drive progress in the field of question-answering. While there are some limitations to address, this work represents an important advancement that could lead to significant improvements in the abilities of AI systems to engage in more natural and informative dialogue with users.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Follow @aimodelsfyi on 𝕏 →

Related Papers

CLAPNQ: Cohesive Long-form Answers from Passages in Natural Questions for RAG systems

Sara Rosenthal, Avirup Sil, Radu Florian, Salim Roukos

Retrieval Augmented Generation (RAG) has become a popular application for large language models. It is preferable that successful RAG systems provide accurate answers that are supported by being grounded in a passage without any hallucinations. While considerable work is required for building a full RAG pipeline, being able to benchmark performance is also necessary. We present ClapNQ, a benchmark Long-form Question Answering dataset for the full RAG pipeline. ClapNQ includes long answers with grounded gold passages from Natural Questions (NQ) and a corpus to perform either retrieval, generation, or the full RAG pipeline. The ClapNQ answers are concise, 3x smaller than the full passage, and cohesive, with multiple pieces of the passage that are not contiguous. RAG models must adapt to these properties to be successful at ClapNQ. We present baseline experiments and analysis for ClapNQ that highlight areas where there is still significant room for improvement in grounded RAG. CLAPNQ is publicly available at https://github.com/primeqa/clapnq

4/3/2024

RAG-QA Arena: Evaluating Domain Robustness for Long-form Retrieval Augmented Question Answering

Rujun Han, Yuhao Zhang, Peng Qi, Yumo Xu, Jenyuan Wang, Lan Liu, William Yang Wang, Bonan Min, Vittorio Castelli

Question answering based on retrieval augmented generation (RAG-QA) is an important research topic in NLP and has a wide range of real-world applications. However, most existing datasets for this task are either constructed using a single source corpus or consist of short extractive answers, which fall short of evaluating large language model (LLM) based RAG-QA systems on cross-domain generalization. To address these limitations, we create Long-form RobustQA (LFRQA), a new dataset comprising human-written long-form answers that integrate short extractive answers from multiple documents into a single, coherent narrative, covering 26K queries and large corpora across seven different domains. We further propose RAG-QA Arena by directly comparing model-generated answers against LFRQA's answers using LLMs as evaluators. We show via extensive experiments that RAG-QA Arena and human judgments on answer quality are highly correlated. Moreover, only 41.3% of the most competitive LLM's answers are preferred to LFRQA's answers, demonstrating RAG-QA Arena as a challenging evaluation platform for future research.

7/22/2024

CRAG -- Comprehensive RAG Benchmark

Xiao Yang, Kai Sun, Hao Xin, Yushi Sun, Nikita Bhalla, Xiangsen Chen, Sajal Choudhary, Rongze Daniel Gui, Ziran Will Jiang, Ziyu Jiang, Lingkun Kong, Brian Moran, Jiaqi Wang, Yifan Ethan Xu, An Yan, Chenyu Yang, Eting Yuan, Hanwen Zha, Nan Tang, Lei Chen, Nicolas Scheffer, Yue Liu, Nirav Shah, Rakesh Wanga, Anuj Kumar, Wen-tau Yih, Xin Luna Dong

Retrieval-Augmented Generation (RAG) has recently emerged as a promising solution to alleviate Large Language Model (LLM)'s deficiency in lack of knowledge. Existing RAG datasets, however, do not adequately represent the diverse and dynamic nature of real-world Question Answering (QA) tasks. To bridge this gap, we introduce the Comprehensive RAG Benchmark (CRAG), a factual question answering benchmark of 4,409 question-answer pairs and mock APIs to simulate web and Knowledge Graph (KG) search. CRAG is designed to encapsulate a diverse array of questions across five domains and eight question categories, reflecting varied entity popularity from popular to long-tail, and temporal dynamisms ranging from years to seconds. Our evaluation on this benchmark highlights the gap to fully trustworthy QA. Whereas most advanced LLMs achieve <=34% accuracy on CRAG, adding RAG in a straightforward manner improves the accuracy only to 44%. State-of-the-art industry RAG solutions only answer 63% questions without any hallucination. CRAG also reveals much lower accuracy in answering questions regarding facts with higher dynamism, lower popularity, or higher complexity, suggesting future research directions. The CRAG benchmark laid the groundwork for a KDD Cup 2024 challenge, attracting thousands of participants and submissions within the first 50 days of the competition. We commit to maintaining CRAG to serve research communities in advancing RAG solutions and general QA solutions.

6/10/2024

RAG based Question-Answering for Contextual Response Prediction System

Sriram Veturi, Saurabh Vaichal, Reshma Lal Jagadheesh, Nafis Irtiza Tripto, Nian Yan

Large Language Models (LLMs) have shown versatility in various Natural Language Processing (NLP) tasks, including their potential as effective question-answering systems. However, to provide precise and relevant information in response to specific customer queries in industry settings, LLMs require access to a comprehensive knowledge base to avoid hallucinations. Retrieval Augmented Generation (RAG) emerges as a promising technique to address this challenge. Yet, developing an accurate question-answering framework for real-world applications using RAG entails several challenges: 1) data availability issues, 2) evaluating the quality of generated content, and 3) the costly nature of human evaluation. In this paper, we introduce an end-to-end framework that employs LLMs with RAG capabilities for industry use cases. Given a customer query, the proposed system retrieves relevant knowledge documents and leverages them, along with previous chat history, to generate response suggestions for customer service agents in the contact centers of a major retail company. Through comprehensive automated and human evaluations, we show that this solution outperforms the current BERT-based algorithms in accuracy and relevance. Our findings suggest that RAG-based LLMs can be an excellent support to human customer service representatives by lightening their workload.

9/9/2024