Narrowing the Knowledge Evaluation Gap: Open-Domain Question Answering with Multi-Granularity Answers

Read original: arXiv:2401.04695 - Published 8/2/2024 by Gal Yona, Roee Aharoni, Mor Geva

Overview

Factual questions can be answered at different levels of granularity, but standard QA evaluation protocols do not account for this.
The authors propose GRANOLA QA, a novel evaluation setting that assesses the accuracy and informativeness of answers against a set of multi-granularity answers.
They create GRANOLA-EQ, a multi-granularity version of the EntityQuestions dataset, and evaluate decoding methods, including a new algorithm called Decoding with Response Aggregation (DRAG) that aligns the response granularity with model uncertainty.

Plain English Explanation

When answering factual questions, people can provide responses at different levels of detail. For example, for the question "When was Barack Obama born?", both "August 4, 1961" and "1961" are correct answers that differ only in specificity. However, the standard ways of evaluating question-answering (QA) systems do not take this into account; they compare the predicted answer against reference answers at a single level of granularity.

The researchers in this paper propose a new approach called GRANOLA QA that evaluates the accuracy and informativeness of answers against a set of multi-granularity answers. They create a new dataset called GRANOLA-EQ, which is a version of the existing EntityQuestions dataset that includes multiple levels of detail for each answer.

The researchers then evaluate different decoding methods, including a new algorithm they developed called Decoding with Response Aggregation (DRAG). DRAG is designed to generate answers that match the level of detail that the model is most confident about.

The results show that large language models with standard decoding tend to generate very specific answers, which are often incorrect. In contrast, when evaluated against the multi-granularity answers in GRANOLA-EQ, DRAG yields a nearly 20-point increase in accuracy on average, with an even larger improvement for rare entities.

This suggests that the standard ways of evaluating QA systems may significantly underestimate the knowledge that is actually captured in large language models. By considering multiple levels of detail in the answers, the researchers were able to better assess the models' true capabilities.

Technical Explanation

The key technical contributions of this paper are:

GRANOLA QA Evaluation Setting: The authors propose a novel evaluation framework called GRANOLA QA that assesses the accuracy and informativeness of answers against a set of multi-granularity answers, rather than a single specific answer.
GRANOLA-EQ Dataset: The researchers created a new multi-granularity version of the EntityQuestions dataset, called GRANOLA-EQ, by annotating existing questions with additional answers at different levels of granularity.
Decoding with Response Aggregation (DRAG): The authors developed a new decoding algorithm, DRAG, that aims to match the granularity of the answer to the model's uncertainty. Rather than returning a single greedy output, DRAG samples several candidate responses and aggregates them into one answer at the level of detail on which the samples agree.
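
To make the evaluation setting described above more concrete, here is a minimal Python sketch of scoring one prediction against an ordered list of multi-granularity answers. It is an illustration under stated assumptions, not the authors' implementation: the function names (normalize, matched_level, granola_scores), the exact-match comparison after normalization, and the exponentially decaying informativeness weight with its decay parameter are all hypothetical choices made for this example. The paper's own matching rule and informativeness definition may differ.

```python
import math
import re
from typing import Optional, Sequence, Tuple


def normalize(text: str) -> str:
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    text = re.sub(r"[^\w\s]", " ", text.lower())
    return " ".join(text.split())


def matched_level(prediction: str, answers: Sequence[str]) -> Optional[int]:
    """Return the index of the first matching answer, or None if none match.

    `answers` is assumed to be ordered from most specific (index 0)
    to least specific.
    """
    pred = normalize(prediction)
    for level, answer in enumerate(answers):
        if pred == normalize(answer):
            return level
    return None


def granola_scores(
    prediction: str, answers: Sequence[str], decay: float = 1.0
) -> Tuple[float, float]:
    """Return (accuracy, informativeness) for a single prediction.

    Accuracy is 1.0 if the prediction matches an answer at any granularity.
    Informativeness is an assumed weight exp(-decay * level), so coarser
    matches earn less credit; this weighting is illustrative only.
    """
    level = matched_level(prediction, answers)
    if level is None:
        return 0.0, 0.0
    return 1.0, math.exp(-decay * level)


if __name__ == "__main__":
    # Ordered from finest to coarsest, following the example in the text.
    answers = ["August 4, 1961", "August 1961", "1961"]
    for prediction in ["August 4, 1961", "1961", "1962"]:
        print(prediction, granola_scores(prediction, answers))
```

Under this sketch, "1961" counts as accurate but less informative than "August 4, 1961", while "1962" scores zero on both. A standard single-reference evaluation would instead mark "1961" as wrong, which is the gap the evaluation setting is meant to expose.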

The experiments on the GRANOLA-EQ dataset show that large language models with standard decoding methods tend to generate very specific answers, which are often incorrect. In contrast, when evaluated against multi-granularity answers, DRAG yields a nearly 20-point increase in accuracy on average, with even larger improvements for rare entities.
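
The idea of aligning granularity with uncertainty can be sketched in a few lines of Python. This is a toy illustration, not the authors' algorithm: it assumes candidate responses are obtained by repeated sampling from a hypothetical sample_fn, that each response is represented as a coarse-to-fine tuple such as ("1961", "August", "4"), and that aggregation is a simple longest common prefix. The aggregation step in the paper may be implemented quite differently.

```python
from typing import Callable, List, Sequence, Tuple

# Assumed representation: one answer as coarse-to-fine parts,
# e.g. ("1961", "August", "4") for August 4, 1961.
Answer = Tuple[str, ...]


def common_prefix(responses: Sequence[Answer]) -> Answer:
    """Return the longest coarse-to-fine prefix shared by all responses."""
    prefix: List[str] = []
    for parts in zip(*responses):
        # Stop at the first level of detail where the samples disagree.
        if len(set(parts)) != 1:
            break
        prefix.append(parts[0])
    return tuple(prefix)


def drag_decode(sample_fn: Callable[[], Answer], num_samples: int = 5) -> Answer:
    """Sample several responses and keep only the detail they agree on.

    `sample_fn` is a hypothetical stand-in for drawing one response
    from a language model.
    """
    responses = [sample_fn() for _ in range(num_samples)]
    return common_prefix(responses)


if __name__ == "__main__":
    # Made-up samples: the model is sure of the year and month, not the day.
    samples = iter([
        ("1961", "August", "4"),
        ("1961", "August", "14"),
        ("1961", "August", "4"),
    ])
    print(drag_decode(lambda: next(samples), num_samples=3))
```

When the samples disagree only on the day, the sketch backs off to the month and year; when they all agree, it returns the full answer. An empty tuple means the samples disagree even at the coarsest level, which a real system might treat as a reason to abstain.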

This suggests that the standard QA evaluation protocols may significantly underestimate the knowledge captured in large language models, as they fail to account for the models' ability to provide answers at different levels of granularity.

Critical Analysis

The key strengths of this research are:

Innovative Evaluation Approach: The GRANOLA QA evaluation framework is a novel and important contribution, as it recognizes the nuanced nature of factual question answering and provides a more comprehensive way to assess model performance.
Practical Dataset Creation: The authors' methodology for enriching existing datasets with multi-granularity answers is a practical and scalable approach that can be applied to other QA datasets.
Effective Decoding Algorithm: The DRAG algorithm demonstrates the potential benefits of aligning the response granularity with the model's uncertainty, which is an important consideration for practical QA applications.

Some potential limitations and areas for further research include:

Generalization to Other Domains: While the GRANOLA-EQ dataset focuses on entity-related questions, it would be valuable to explore the multi-granularity approach in other QA domains, such as open-ended questions or procedural knowledge.
Explainability of Granularity Selection: The paper does not provide much insight into the factors that influence the model's selection of the appropriate granularity level. Investigating this could lead to further improvements in the DRAG algorithm.
Human Evaluation: The paper relies solely on automatic evaluation metrics, and it would be helpful to also assess the GRANOLA QA approach through human evaluation to better understand its practical implications.

Overall, this research makes an important contribution to the field of question answering by introducing a new evaluation framework that better captures the nuances of factual knowledge. The insights from this work could lead to more accurate and informative QA systems in the future.

Conclusion

This paper proposes a novel evaluation framework called GRANOLA QA that assesses the accuracy and informativeness of question-answering models against a set of multi-granularity answers, rather than a single specific answer. The authors create a new dataset called GRANOLA-EQ and introduce a new decoding algorithm called DRAG that aligns the response granularity with the model's uncertainty.

The results show that standard QA evaluation and decoding methods tend to underestimate the knowledge captured in large language models, as they fail to account for the models' ability to provide answers at different levels of detail. When evaluated against multi-granularity answers, DRAG improved accuracy by nearly 20 points on average, with larger gains for rare entities, which underlines the importance of considering multi-granularity answers in QA evaluation.

This research represents an important step forward in the development of more nuanced and informative question-answering systems, with potential applications in a variety of domains that require factual knowledge.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!
