SEMQA: Semi-Extractive Multi-Source Question Answering

arXiv:2311.04886

Published 7/2/2024 by Tal Schuster, Adam D. Lelkes, Haitian Sun, Jai Gupta, Jonathan Berant, William W. Cohen, Donald Metzler

cs.CL cs.AI cs.LG

Abstract

Recently proposed long-form question answering (QA) systems, supported by large language models (LLMs), have shown promising capabilities. Yet, attributing and verifying their generated abstractive answers can be difficult, and automatically evaluating their accuracy remains an ongoing challenge. In this work, we introduce a new QA task for answering multi-answer questions by summarizing multiple diverse sources in a semi-extractive fashion. Specifically, Semi-extractive Multi-source QA (SEMQA) requires models to output a comprehensive answer, while mixing factual quoted spans -- copied verbatim from given input sources -- and non-factual free-text connectors that glue these spans together into a single cohesive passage. This setting bridges the gap between the outputs of well-grounded but constrained extractive QA systems and more fluent but harder to attribute fully abstractive answers. Particularly, it enables a new mode for language models that leverages their advanced language generation capabilities, while also producing fine in-line attributions by-design that are easy to verify, interpret, and evaluate. To study this task, we create the first dataset of this kind, QuoteSum, with human-written semi-extractive answers to natural and generated questions, and define text-based evaluation metrics. Experimenting with several LLMs in various settings, we find this task to be surprisingly challenging, demonstrating the importance of QuoteSum for developing and studying such consolidation capabilities.

Overview

This paper introduces a new task, Semi-extractive Multi-source QA (SEMQA), for long-form question answering.
The task requires models to generate comprehensive answers that mix verbatim quotes from the input sources with free-text connectors that join those quotes.
The setting bridges the gap between well-grounded but constrained extractive QA and more fluent but harder-to-attribute abstractive QA.
The authors create a new dataset, QuoteSum, to study this task and find it surprisingly challenging for large language models.

Plain English Explanation

The paper discusses a new approach to question answering (QA) that aims to combine the strengths of different QA techniques. Traditional extractive QA systems can provide answers backed by direct quotes from source materials, but the answers may lack fluency. Fully abstractive QA systems can generate more natural-sounding answers, but it can be difficult to attribute and verify the information in those answers.

The new SEMQA task introduced in this paper tries to bridge that gap. It requires models to generate answers that mix direct quotes from source texts with original connecting text. This allows the models to leverage their language generation capabilities to produce coherent answers, while still providing clear attributions to the source material.

To study this task, the authors created a new dataset called QuoteSum, which contains human-written semi-extractive answers to both natural and generated questions. Evaluating several large language models on this task, the researchers found it to be surprisingly challenging, highlighting the need for further work in this area.

Technical Explanation

The paper proposes a new question answering task called Semi-extractive Multi-source QA (SEMQA), where models must generate comprehensive answers that mix verbatim quotes from multiple input sources and original text to connect those quotes into a cohesive passage.

This setting aims to combine the strengths of extractive QA, which provides verifiable answers backed by direct evidence, and abstractive QA, which can generate more fluent, natural-sounding responses. By requiring models to include both quoted spans and original text, SEMQA enables a mode of generation that is more transparent and easier to evaluate than fully abstractive approaches, because every quoted span can be checked directly against the source it was copied from.
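The verbatim-quote requirement is what makes attribution easy to check. The sketch below illustrates that idea under stated assumptions: the bracket markup `[k quoted text]` (with `k` a 1-based source index) and the helper names `extract_quotes` and `verify_quotes` are hypothetical choices made for this example, not the format or code used by the paper or by QuoteSum.

```python
import re
from typing import List, Tuple

# Hypothetical markup (an assumption for this sketch): a quoted span is
# written as "[k quoted text]", where k is the 1-based index of the source.
QUOTE_PATTERN = re.compile(r"\[\s*(\d+)\s+([^\[\]]+?)\s*\]")


def extract_quotes(answer: str) -> List[Tuple[int, str]]:
    """Return (source_index, span_text) for every bracketed quote."""
    return [(int(m.group(1)), m.group(2)) for m in QUOTE_PATTERN.finditer(answer)]


def verify_quotes(answer: str, sources: List[str]) -> List[Tuple[int, str, bool]]:
    """Check that each quoted span appears verbatim in the source it cites."""
    results = []
    for idx, span in extract_quotes(answer):
        in_range = 1 <= idx <= len(sources)
        # Exact, case-sensitive substring match: the span must be copied verbatim.
        is_verbatim = in_range and span in sources[idx - 1]
        results.append((idx, span, is_verbatim))
    return results


sources = [
    "The Nile flows through Egypt and Sudan.",
    "The Amazon is the largest river by discharge.",
]
answer = (
    "One major river [1 flows through Egypt and Sudan], while "
    "[2 the Amazon is the largest river] by volume."
)
for idx, span, ok in verify_quotes(answer, sources):
    print(idx, repr(span), "verbatim" if ok else "NOT verbatim")
```

In this example the second span fails the check because its capitalization differs from the source, so it is not an exact copy. The free-text connectors outside the brackets are not checked at all, which matches the task's framing of connectors as non-factual glue.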

To study this task, the authors introduce the QuoteSum dataset, which contains human-written semi-extractive answers to both natural and machine-generated questions. QuoteSum provides a benchmark for evaluating models' ability to consolidate information from multiple sources into a single, coherent answer.

The paper experiments with several large language models (LLMs) on the SEMQA task and finds it to be surprisingly challenging, even for state-of-the-art systems. This highlights the importance of QuoteSum as a testbed for developing and studying the consolidation capabilities required for this type of long-form, multi-source QA.

Critical Analysis

The authors acknowledge several limitations of their work and areas for further research. For example, they note that automatically evaluating the accuracy of SEMQA answers remains an ongoing challenge, as it requires assessing both the factual correctness of the quoted spans and the coherence of the generated connecting text.
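To make the evaluation question concrete, here is a minimal sketch of one possible text-based check: a token-level F1 between the quoted content of a predicted answer and that of a reference answer. This is an illustrative stand-in, not the metric defined in the paper; the `[k quoted text]` markup and the function names are assumptions made for this example. It says nothing about the coherence of the connecting text, which is the harder part to score automatically.

```python
import re
from collections import Counter
from typing import List

# Hypothetical markup (an assumption for this sketch): "[k quoted text]".
QUOTE = re.compile(r"\[\s*\d+\s+([^\[\]]+?)\s*\]")


def quoted_tokens(answer: str) -> List[str]:
    """Collect lowercased whitespace tokens from all quoted spans."""
    tokens: List[str] = []
    for match in QUOTE.finditer(answer):
        tokens.extend(match.group(1).lower().split())
    return tokens


def quoted_token_f1(prediction: str, reference: str) -> float:
    """Token-level F1 over quoted content only (connectors are ignored)."""
    pred = Counter(quoted_tokens(prediction))
    ref = Counter(quoted_tokens(reference))
    overlap = sum((pred & ref).values())  # multiset intersection
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


prediction = "It [1 flows north] to the sea."
reference = "The river [1 flows north] and [2 ends in a delta]."
print(quoted_token_f1(prediction, reference))
```

Here the prediction quotes only part of what the reference quotes, so precision is perfect but recall is low. A score like this rewards picking the right evidence but would need to be paired with a fluency or coherence measure to cover the whole answer.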

Additionally, the paper does not appear to explore the biases or factual errors that LLMs may introduce when generating the non-quoted portions of the answers. Investigating these issues, as well as models' ability to select the most relevant source material, would be important areas for future work.

It would also be valuable to study human performance on the SEMQA task, both to better understand the cognitive processes involved and to provide a more robust benchmark for evaluating machine performance. The authors' finding that the task is "surprisingly challenging" for LLMs suggests that there may be significant room for improvement.

Overall, this paper presents an interesting new task and dataset that could help drive progress in long-form, multi-source question answering. However, the challenges identified in the research highlight the need for continued innovation and careful evaluation to develop systems that can reliably and transparently consolidate information from diverse sources.

Conclusion

This paper introduces a novel question answering task called Semi-extractive Multi-source QA (SEMQA), which requires models to generate comprehensive answers that mix verbatim quotes from multiple input sources and original text to connect those quotes. The authors create a new dataset, QuoteSum, to study this task and find it to be surprisingly challenging for state-of-the-art large language models.

The SEMQA task represents an important step towards developing QA systems that can leverage the strengths of both extractive and abstractive approaches. By requiring models to produce answers with clear attributions, this setting could lead to more transparent and verifiable long-form QA systems. However, the difficulties identified in the research demonstrate that there is still significant work to be done in this area.

As the field of natural language processing continues to advance, the ability to consolidate information from multiple sources will become increasingly important. The SEMQA task and QuoteSum dataset provide a valuable benchmark for driving progress in this direction and ultimately improving the reliability and interpretability of question answering systems.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

LibriSQA: A Novel Dataset and Framework for Spoken Question Answering with Large Language Models

Zihan Zhao, Yiyang Jiang, Heyang Liu, Yanfeng Wang, Yu Wang

While Large Language Models (LLMs) have demonstrated commendable performance across a myriad of domains and tasks, existing LLMs still exhibit a palpable deficit in handling multimodal functionalities, especially for the Spoken Question Answering (SQA) task which necessitates precise alignment and deep interaction between speech and text features. To address the SQA challenge on LLMs, we initially curated the free-form and open-ended LibriSQA dataset from Librispeech, comprising Part I with natural conversational formats and Part II encompassing multiple-choice questions followed by answers and analytical segments. Both parts collectively include 107k SQA pairs that cover various topics. Given the evident paucity of existing speech-text LLMs, we propose a lightweight, end-to-end framework to execute the SQA task on the LibriSQA, witnessing significant results. By reforming ASR into the SQA format, we further substantiate our framework's capability in handling ASR tasks. Our empirical findings bolster the LLMs' aptitude for aligning and comprehending multimodal information, paving the way for the development of universal multimodal LLMs. The dataset and demo can be found at https://github.com/ZihanZhaoSJTU/LibriSQA.

4/19/2024

cs.CL

NewsQs: Multi-Source Question Generation for the Inquiring Mind

Alyssa Hwang, Kalpit Dixit, Miguel Ballesteros, Yassine Benajiba, Vittorio Castelli, Markus Dreyer, Mohit Bansal, Kathleen McKeown

We present NewsQs (news-cues), a dataset that provides question-answer pairs for multiple news documents. To create NewsQs, we augment a traditional multi-document summarization dataset with questions automatically generated by a T5-Large model fine-tuned on FAQ-style news articles from the News On the Web corpus. We show that fine-tuning a model with control codes produces questions that are judged acceptable more often than the same model without them as measured through human evaluation. We use a QNLI model with high correlation with human annotations to filter our data. We release our final dataset of high-quality questions, answers, and document clusters as a resource for future work in query-based multi-document summarization.

6/18/2024

cs.CL

MemeMQA: Multimodal Question Answering for Memes via Rationale-Based Inferencing

Siddhant Agarwal, Shivam Sharma, Preslav Nakov, Tanmoy Chakraborty

Memes have evolved as a prevalent medium for diverse communication, ranging from humour to propaganda. With the rising popularity of image-focused content, there is a growing need to explore its potential harm from different aspects. Previous studies have analyzed memes in closed settings - detecting harm, applying semantic labels, and offering natural language explanations. To extend this research, we introduce MemeMQA, a multimodal question-answering framework aiming to solicit accurate responses to structured questions while providing coherent explanations. We curate MemeMQACorpus, a new dataset featuring 1,880 questions related to 1,122 memes with corresponding answer-explanation pairs. We further propose ARSENAL, a novel two-stage multimodal framework that leverages the reasoning capabilities of LLMs to address MemeMQA. We benchmark MemeMQA using competitive baselines and demonstrate its superiority - ~18% enhanced answer prediction accuracy and distinct text generation lead across various metrics measuring lexical and semantic alignment over the best baseline. We analyze ARSENAL's robustness through diversification of question-set, confounder-based evaluation regarding MemeMQA's generalizability, and modality-specific assessment, enhancing our understanding of meme interpretation in the multimodal communication landscape.

5/21/2024

cs.CL cs.CY

SEC-QA: A Systematic Evaluation Corpus for Financial QA

Viet Dac Lai, Michael Krumdick, Charles Lovering, Varshini Reddy, Craig Schmidt, Chris Tanner

The financial domain frequently deals with large numbers of long documents that are essential for daily operations. Significant effort is put towards automating financial data analysis. However, a persistent challenge, not limited to the finance domain, is the scarcity of datasets that accurately reflect real-world tasks for model evaluation. Existing datasets are often constrained by size, context, or relevance to practical applications. Moreover, LLMs are currently trained on trillions of tokens of text, limiting access to novel data or documents that models have not encountered during training for unbiased evaluation. We propose SEC-QA, a continuous dataset generation framework with two key features: 1) the semi-automatic generation of Question-Answer (QA) pairs spanning multiple long context financial documents, which better represent real-world financial scenarios; 2) the ability to continually refresh the dataset using the most recent public document collections, not yet ingested by LLMs. Our experiments show that current retrieval augmented generation methods systematically fail to answer these challenging multi-document questions. In response, we introduce a QA system based on program-of-thought that improves the ability to perform complex information retrieval and quantitative reasoning pipelines, thereby increasing QA accuracy.

6/21/2024

cs.CL