Measuring text summarization factuality using atomic facts entailment metrics in the context of retrieval augmented generation

Read original: arXiv:2408.15171 - Published 8/28/2024 by N. E. Kriman

Overview

This paper proposes a new method for measuring the factuality of text summarization using atomic facts entailment metrics.
The authors test their approach in the context of retrieval-augmented generation, where a model produces a summary by combining passages returned by a retrieval system with text it generates itself.
The key idea is to evaluate the factual accuracy of the generated summaries by checking if the atomic facts they contain are entailed by the original text.

Plain English Explanation

The paper discusses a way to evaluate how factual the summaries produced by text summarization systems are. Factuality refers to whether the information in the summary is accurate and true, rather than made up or incorrect.

The authors' approach is to break down the summary and the original text into atomic facts - the basic individual pieces of information. They then check whether each atomic fact in the summary is entailed by, or logically follows from, the original text.

This allows them to measure how factually accurate the summary is, by seeing what proportion of the summary's atomic facts are actually supported by the underlying text. They test this in the specific context of retrieval-augmented generation, where the summary is produced by combining information from a retrieval system and the model's own generation.

The key advantage of this approach is that it provides a more granular and nuanced way to evaluate factuality, compared to simply checking if the summary as a whole is factually correct or not.

Technical Explanation

The paper proposes a new method for evaluating the factuality of text summarization systems, called Atomic Facts Entailment (AFE). The core idea is to break down the summary and the original text into atomic facts, and then check whether each atomic fact in the summary is entailed by the original text.

To do this, the authors first define an atomic fact as a single, discrete piece of information, such as a subject-predicate-object triple. They then use a fact extraction model to identify the atomic facts present in both the summary and the original text.
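To make the notion of an atomic fact concrete, here is a minimal sketch of a subject-predicate-object triple as a small data structure. The class name `AtomicFact`, its field names, and the example facts are illustrative assumptions, not taken from the paper, and the sketch does not reproduce the authors' fact extraction model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AtomicFact:
    """One discrete piece of information as a subject-predicate-object triple.

    Hypothetical representation for illustration; the paper's own format
    may differ.
    """
    subject: str
    predicate: str
    obj: str

    def as_sentence(self) -> str:
        # Render the triple as a short sentence that an entailment
        # model could take as its hypothesis.
        return f"{self.subject} {self.predicate} {self.obj}."


# Hand-written example facts, standing in for the output of a fact extractor.
summary_facts = [
    AtomicFact("Marie Curie", "won", "the Nobel Prize in Physics"),
    AtomicFact("Marie Curie", "was born in", "Warsaw"),
]

sentences = [fact.as_sentence() for fact in summary_facts]
```

Making the triple immutable (`frozen=True`) lets facts be compared and de-duplicated in a set, which is convenient when the same fact is extracted from several sentences.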

Next, they employ a textual entailment model to determine whether each atomic fact in the summary is entailed by the original text. This allows them to compute an "atomic facts entailment score" that reflects the proportion of summary facts that are supported by the source.
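The score described above is a simple proportion, which can be sketched as follows. The function names `afe_score` and `toy_is_entailed` are assumptions made for this example. The entailment check is passed in as a parameter because the summary does not say which entailment model the authors use; the toy checker below is a naive substring match that stands in for a real model only so the example runs.

```python
from typing import Callable, Sequence


def afe_score(
    summary_facts: Sequence[str],
    source_text: str,
    is_entailed: Callable[[str, str], bool],
) -> float:
    """Fraction of summary facts that the source text entails.

    `is_entailed(premise, hypothesis)` is a placeholder for a textual
    entailment model (hypothetical interface).
    """
    if not summary_facts:
        # Assumption: a summary with no extractable facts scores 0.0.
        return 0.0
    supported = sum(1 for fact in summary_facts if is_entailed(source_text, fact))
    return supported / len(summary_facts)


def toy_is_entailed(premise: str, hypothesis: str) -> bool:
    # Naive stand-in: a fact counts as "entailed" only if it appears
    # verbatim (ignoring case and the final period) in the source.
    return hypothesis.lower().rstrip(".") in premise.lower()


source = "The Eiffel Tower is in Paris. It was completed in 1889."
facts = [
    "The Eiffel Tower is in Paris.",
    "It was completed in 1889.",
    "It is 500 metres tall.",  # not stated in the source
]

score = afe_score(facts, source, toy_is_entailed)  # 2 of 3 facts supported
```

In practice the substring check would be replaced by a trained entailment classifier, since a real summary paraphrases the source rather than copying it.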

The authors test this approach in the context of retrieval-augmented generation, where a model produces a summary by combining passages returned by a retrieval system with text it generates itself. They report that the AFE metric captures the factual accuracy of these summaries better than existing holistic evaluation metrics.

The key advantage of the AFE approach is that it provides a more granular evaluation of factuality. Rather than judging the summary as a whole, it pinpoints the specific factual errors or hallucinations within the summary.
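Because each fact is checked individually, the same pass that produces a score can also list the facts that failed. The sketch below illustrates that idea. The function name `find_unsupported_facts` and the substring-based checker are assumptions for illustration, not the authors' implementation.

```python
from typing import Callable, List, Sequence


def find_unsupported_facts(
    summary_facts: Sequence[str],
    source_text: str,
    is_entailed: Callable[[str, str], bool],
) -> List[str]:
    """Return the summary facts that the source does not entail.

    These are the candidate hallucinations a reviewer would inspect.
    """
    return [fact for fact in summary_facts if not is_entailed(source_text, fact)]


def toy_is_entailed(premise: str, hypothesis: str) -> bool:
    # Naive stand-in for an entailment model: verbatim match,
    # ignoring case and the final period.
    return hypothesis.lower().rstrip(".") in premise.lower()


source = "The bridge opened in 1937. It spans the Golden Gate strait."
facts = [
    "The bridge opened in 1950.",  # wrong year, not supported by the source
    "It spans the Golden Gate strait.",
]

unsupported = find_unsupported_facts(facts, source, toy_is_entailed)
```

A holistic metric would assign this summary a single number, whereas the per-fact list shows that the error lies in the opening date.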

Critical Analysis

The authors acknowledge several limitations of their proposed AFE metric. First, the metric is only as accurate as the underlying fact extraction and textual entailment models, so errors in either one carry over into the score.

Additionally, the AFE metric may not capture all aspects of factuality, such as the importance or salience of different facts. It's possible for a summary to be mostly factual but miss key details that are crucial for understanding the original text.

The authors also note that their experiments are focused on a specific retrieval-augmented generation setting. Further research would be needed to understand how well the AFE metric generalizes to other summarization approaches and domains.

Despite these caveats, the AFE metric represents an important step forward in the evaluation of text summarization factuality. By providing a more granular and interpretable way to assess factual accuracy, it could help drive progress in developing more reliable and trustworthy summarization systems.

Conclusion

This paper introduces a new method for evaluating the factual accuracy of text summarization systems, called Atomic Facts Entailment (AFE). The key idea is to break down the summary and original text into atomic facts, and then check whether each summary fact is entailed by the source material.

The authors demonstrate the effectiveness of this approach in the context of retrieval-augmented generation, where it provides a more nuanced assessment of factuality compared to existing holistic metrics. While the method has some limitations, it represents an important advance in the evaluation of summarization systems and could help drive the development of more reliable and trustworthy summarization technologies.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Measuring text summarization factuality using atomic facts entailment metrics in the context of retrieval augmented generation

N. E. Kriman

The use of large language models (LLMs) has significantly increased since the introduction of ChatGPT in 2022, demonstrating their value across various applications. However, a major challenge for enterprise and commercial adoption of LLMs is their tendency to generate inaccurate information, a phenomenon known as hallucination. This project proposes a method for estimating the factuality of a summary generated by LLMs when compared to a source text. Our approach utilizes Naive Bayes classification to assess the accuracy of the content produced.

8/28/2024

FENICE: Factuality Evaluation of summarization based on Natural language Inference and Claim Extraction

Alessandro Scirè, Karim Ghonim, Roberto Navigli

Recent advancements in text summarization, particularly with the advent of Large Language Models (LLMs), have shown remarkable performance. However, a notable challenge persists as a substantial number of automatically-generated summaries exhibit factual inconsistencies, such as hallucinations. In response to this issue, various approaches for the evaluation of consistency for summarization have emerged. Yet, these newly-introduced metrics face several limitations, including lack of interpretability, focus on short document summaries (e.g., news articles), and computational impracticality, especially for LLM-based metrics. To address these shortcomings, we propose Factuality Evaluation of summarization based on Natural language Inference and Claim Extraction (FENICE), a more interpretable and efficient factuality-oriented metric. FENICE leverages an NLI-based alignment between information in the source document and a set of atomic facts, referred to as claims, extracted from the summary. Our metric sets a new state of the art on AGGREFACT, the de-facto benchmark for factuality evaluation. Moreover, we extend our evaluation to a more challenging setting by conducting a human annotation process of long-form summarization. In the hope of fostering research in summarization factuality evaluation, we release the code of our metric and our factuality annotations of long-form summarization at https://github.com/Babelscape/FENICE.

9/4/2024

LongDocFACTScore: Evaluating the Factuality of Long Document Abstractive Summarisation

Jennifer A Bishop, Qianqian Xie, Sophia Ananiadou

Maintaining factual consistency is a critical issue in abstractive text summarisation, however, it cannot be assessed by traditional automatic metrics used for evaluating text summarisation, such as ROUGE scoring. Recent efforts have been devoted to developing improved metrics for measuring factual consistency using pre-trained language models, but these metrics have restrictive token limits, and are therefore not suitable for evaluating long document text summarisation. Moreover, there is limited research and resources available for evaluating whether existing automatic evaluation metrics are fit for purpose when applied in long document settings. In this work, we evaluate the efficacy of automatic metrics for assessing the factual consistency of long document text summarisation. We create a human-annotated data set for evaluating automatic factuality metrics, LongSciVerify, which contains fine-grained factual consistency annotations for long document summaries from the scientific domain. We also propose a new evaluation framework, LongDocFACTScore, which is suitable for evaluating long document summarisation. This framework allows metrics to be efficiently extended to any length document and outperforms existing state-of-the-art metrics in its ability to correlate with human measures of factuality when used to evaluate long document summarisation data sets. We make our code and LongSciVerify data set publicly available: https://github.com/jbshp/LongDocFACTScore.

5/29/2024

Towards a Holistic Evaluation of LLMs on Factual Knowledge Recall

Jiaqing Yuan, Lin Pan, Chung-Wei Hang, Jiang Guo, Jiarong Jiang, Bonan Min, Patrick Ng, Zhiguo Wang

Large language models (LLMs) have shown remarkable performance on a variety of NLP tasks, and are being rapidly adopted in a wide range of use cases. It is therefore of vital importance to holistically evaluate the factuality of their generated outputs, as hallucinations remain a challenging issue. In this work, we focus on assessing LLMs' ability to recall factual knowledge learned from pretraining, and the factors that affect this ability. To that end, we construct FACT-BENCH, a representative benchmark covering 20 domains, 134 property types, 3 answer types, and different knowledge popularity levels. We benchmark 31 models from 10 model families and provide a holistic assessment of their strengths and weaknesses. We observe that instruction-tuning hurts knowledge recall, as pretraining-only models consistently outperform their instruction-tuned counterparts, and positive effects of model scaling, as larger models outperform smaller ones for all model families. However, the best performance from GPT-4 still represents a large gap with the upper-bound. We additionally study the role of in-context exemplars using counterfactual demonstrations, which lead to significant degradation of factual knowledge recall for large models. By further decoupling model known and unknown knowledge, we find the degradation is attributed to exemplars that contradict a model's known knowledge, as well as the number of such exemplars. Lastly, we fine-tune LLaMA-7B in different settings of known and unknown knowledge. In particular, fine-tuning on a model's known knowledge is beneficial, and consistently outperforms fine-tuning on unknown and mixed knowledge. We will make our benchmark publicly available.

4/26/2024