FENICE: Factuality Evaluation of summarization based on Natural language Inference and Claim Extraction

Read original: arXiv:2403.02270 - Published 9/4/2024 by Alessandro Scirè, Karim Ghonim, Roberto Navigli

Overview

The paper presents FENICE, a metric for evaluating the factuality of text summaries.
FENICE uses natural language inference and claim extraction to assess whether a summary accurately reflects the facts in its source text.
According to the paper's abstract, FENICE sets a new state of the art on AGGREFACT, a benchmark for factuality evaluation, and is also evaluated on human-annotated long-form summaries.

Plain English Explanation

The paper introduces FENICE, a system designed to evaluate how accurately a summary reflects the facts in its source document. FENICE combines two components: natural language inference, which judges whether one piece of text logically follows from another, and claim extraction, which breaks the summary into short factual statements that can be checked individually.

By checking each extracted claim against the source text, FENICE can detect when a summary makes claims that the source does not support or that contradict it. This helps identify summaries that may be biased, misleading, or simply inaccurate in their portrayal of the source material.

The researchers evaluated FENICE on AGGREFACT, which the paper's abstract describes as the de-facto benchmark for factuality evaluation, and on a human-annotated long-form summarization setting. This is an important capability, because faithful summarization, where the key facts are preserved, is crucial for many real-world applications of text summarization technology.

Technical Explanation

The core of FENICE is its use of natural language inference (NLI) to assess the factual consistency between a text summary and its source document. NLI models can determine the logical relationship between two text snippets, such as whether one statement entails, contradicts, or is neutral with respect to another.
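To make the three-way NLI decision concrete, here is a minimal sketch of how raw model scores could be turned into a label. The label order, the `nli_decision` helper, and the example logits are all assumptions for illustration; real NLI checkpoints define their own label mapping, and nothing here is taken from the FENICE code.

```python
import math

# Assumed label order; a real NLI checkpoint defines its own mapping.
LABELS = ("entailment", "neutral", "contradiction")


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def nli_decision(logits):
    """Return the most probable label and the label -> probability map."""
    scores = dict(zip(LABELS, softmax(logits)))
    best = max(scores, key=scores.get)
    return best, scores


# Hypothetical logits for the premise "The cat sat on the mat."
# and the hypothesis "An animal is on the mat."
label, scores = nli_decision([3.2, 0.4, -1.5])
print(label, round(scores["entailment"], 3))
```

For factuality checking, the source text plays the role of the premise and a statement from the summary plays the role of the hypothesis.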

FENICE first uses a claim extraction model to break the summary into atomic facts, referred to as claims. It then uses an NLI model to assess the relationship between each claim and the relevant information in the source text. Summaries containing claims that the source does not entail, or that the source contradicts, are flagged as factually inconsistent.
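The claim-then-verify flow described above can be sketched as follows. This is a simplified illustration, not the paper's implementation: the claims are assumed to be already extracted, `toy_entailment` is a token-overlap stand-in for a trained NLI model, and the max-over-sentences alignment, mean aggregation, and 0.75 threshold are all assumptions chosen for the example.

```python
from typing import Callable, List, Tuple

# An NLI scorer maps (premise, hypothesis) to an entailment score in [0, 1].
NliFn = Callable[[str, str], float]


def toy_entailment(premise: str, hypothesis: str) -> float:
    """Stand-in scorer: fraction of hypothesis tokens found in the premise.

    This is NOT a real NLI model; swap in a trained one in practice.
    """
    prem = set(premise.lower().replace(".", "").split())
    hyp = hypothesis.lower().replace(".", "").split()
    if not hyp:
        return 0.0
    return sum(tok in prem for tok in hyp) / len(hyp)


def score_summary(
    claims: List[str],
    source_sentences: List[str],
    nli: NliFn,
    threshold: float = 0.75,  # assumed cut-off for "supported"
) -> Tuple[float, List[str]]:
    """Score each claim by its best-matching source sentence.

    Returns the mean claim score and the claims scoring below the threshold.
    """
    per_claim = []
    for claim in claims:
        best = max((nli(sent, claim) for sent in source_sentences), default=0.0)
        per_claim.append((claim, best))
    overall = sum(s for _, s in per_claim) / len(per_claim) if per_claim else 0.0
    unsupported = [c for c, s in per_claim if s < threshold]
    return overall, unsupported


source = [
    "Marie Curie won the Nobel Prize in Physics in 1903.",
    "She was born in Warsaw.",
]
claims = [
    "Marie Curie won the Nobel Prize in Physics.",
    "Marie Curie was born in Paris.",
]
overall, unsupported = score_summary(claims, source, toy_entailment)
print(overall, unsupported)
```

Returning the unsupported claims alongside the overall score is what makes a claim-level metric easy to inspect: a reader can see which statement in the summary caused the low score.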

According to the paper's abstract, FENICE sets a new state of the art on AGGREFACT, the de-facto benchmark for factuality evaluation. The authors also extend the evaluation to a more challenging setting by collecting human annotations for long-form summarization. The system's modular design also allows it to be easily adapted to new summarization tasks and datasets.
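One common way to compare a factuality metric against human judgments is to threshold its scores into consistent/inconsistent predictions and compute balanced accuracy. Whether this matches the paper's exact evaluation protocol is an assumption here, and the labels, scores, and threshold below are made-up illustrative values, not results from the paper.

```python
def balanced_accuracy(gold, predicted):
    """Mean of the recall on consistent (True) and inconsistent (False) items."""
    pos = [p for g, p in zip(gold, predicted) if g]
    neg = [p for g, p in zip(gold, predicted) if not g]
    if not pos or not neg:
        raise ValueError("need at least one example of each class")
    recall_pos = sum(1 for p in pos if p) / len(pos)
    recall_neg = sum(1 for p in neg if not p) / len(neg)
    return (recall_pos + recall_neg) / 2


def predict(scores, threshold):
    """Label a summary as consistent when its score reaches the threshold."""
    return [s >= threshold for s in scores]


# Illustrative values only: human labels and metric scores for five summaries.
gold = [True, True, True, False, False]
metric_scores = [0.9, 0.8, 0.4, 0.3, 0.6]

preds = predict(metric_scores, threshold=0.5)
ba = balanced_accuracy(gold, preds)
print(preds, round(ba, 3))
```

Balanced accuracy weights both classes equally, which matters when inconsistent summaries are much rarer than consistent ones.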

Critical Analysis

The paper provides a thorough evaluation of FENICE's performance, including comparisons to prior work. However, the authors acknowledge that the system has some limitations. For example, FENICE may struggle with nuanced logical relationships that are difficult for current NLI models to capture.

Additionally, the claim extraction component could be further improved, as missing or inaccurately extracted claims would impact FENICE's ability to comprehensively evaluate a summary. The authors suggest exploring ways to combine FENICE with other factuality assessment techniques, such as LongDocFACTScore and FactCheck, to provide a more holistic assessment.

Overall, FENICE represents a valuable contribution to the field of summarization evaluation, highlighting the importance of factual consistency as a key quality metric. As the authors note, faithfully preserving the facts from source material is critical for many real-world applications of summarization technology, such as medical summarization and dialogue summarization. Further research in this area could lead to significant improvements in the reliability and trustworthiness of automated summarization systems.

Conclusion

The FENICE system introduced in this paper represents an important advance in the field of summarization evaluation. By leveraging natural language inference and claim extraction, FENICE can effectively assess whether a summary accurately reflects the factual content of the source material. This capability is crucial for many real-world applications of summarization technology, where preserving the integrity of information is paramount.

While FENICE has some limitations that could be addressed through future research, the paper demonstrates the value of focusing on factual consistency as a key quality metric for text summarization. As the use of summarization systems continues to grow, tools like FENICE will become increasingly important for ensuring the reliability and trustworthiness of the information they provide.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

FENICE: Factuality Evaluation of summarization based on Natural language Inference and Claim Extraction

Alessandro Scirè, Karim Ghonim, Roberto Navigli

Recent advancements in text summarization, particularly with the advent of Large Language Models (LLMs), have shown remarkable performance. However, a notable challenge persists as a substantial number of automatically-generated summaries exhibit factual inconsistencies, such as hallucinations. In response to this issue, various approaches for the evaluation of consistency for summarization have emerged. Yet, these newly-introduced metrics face several limitations, including lack of interpretability, focus on short document summaries (e.g., news articles), and computational impracticality, especially for LLM-based metrics. To address these shortcomings, we propose Factuality Evaluation of summarization based on Natural language Inference and Claim Extraction (FENICE), a more interpretable and efficient factuality-oriented metric. FENICE leverages an NLI-based alignment between information in the source document and a set of atomic facts, referred to as claims, extracted from the summary. Our metric sets a new state of the art on AGGREFACT, the de-facto benchmark for factuality evaluation. Moreover, we extend our evaluation to a more challenging setting by conducting a human annotation process of long-form summarization. In the hope of fostering research in summarization factuality evaluation, we release the code of our metric and our factuality annotations of long-form summarization at https://github.com/Babelscape/FENICE.

9/4/2024

Measuring text summarization factuality using atomic facts entailment metrics in the context of retrieval augmented generation

N. E. Kriman

The use of large language models (LLMs) has significantly increased since the introduction of ChatGPT in 2022, demonstrating their value across various applications. However, a major challenge for enterprise and commercial adoption of LLMs is their tendency to generate inaccurate information, a phenomenon known as hallucination. This project proposes a method for estimating the factuality of a summary generated by LLMs when compared to a source text. Our approach utilizes Naive Bayes classification to assess the accuracy of the content produced.

8/28/2024

LongDocFACTScore: Evaluating the Factuality of Long Document Abstractive Summarisation

Jennifer A Bishop, Qianqian Xie, Sophia Ananiadou

Maintaining factual consistency is a critical issue in abstractive text summarisation, however, it cannot be assessed by traditional automatic metrics used for evaluating text summarisation, such as ROUGE scoring. Recent efforts have been devoted to developing improved metrics for measuring factual consistency using pre-trained language models, but these metrics have restrictive token limits, and are therefore not suitable for evaluating long document text summarisation. Moreover, there is limited research and resources available for evaluating whether existing automatic evaluation metrics are fit for purpose when applied in long document settings. In this work, we evaluate the efficacy of automatic metrics for assessing the factual consistency of long document text summarisation. We create a human-annotated data set for evaluating automatic factuality metrics, LongSciVerify, which contains fine-grained factual consistency annotations for long document summaries from the scientific domain. We also propose a new evaluation framework, LongDocFACTScore, which is suitable for evaluating long document summarisation. This framework allows metrics to be efficiently extended to any length document and outperforms existing state-of-the-art metrics in its ability to correlate with human measures of factuality when used to evaluate long document summarisation data sets. We make our code and LongSciVerify data set publicly available: https://github.com/jbshp/LongDocFACTScore.

5/29/2024

FactPICO: Factuality Evaluation for Plain Language Summarization of Medical Evidence

Sebastian Antony Joseph, Lily Chen, Jan Trienes, Hannah Louisa Goke, Monika Coers, Wei Xu, Byron C Wallace, Junyi Jessy Li

Plain language summarization with LLMs can be useful for improving textual accessibility of technical content. But how factual are these summaries in a high-stakes domain like medicine? This paper presents FactPICO, a factuality benchmark for plain language summarization of medical texts describing randomized controlled trials (RCTs), which are the basis of evidence-based medicine and can directly inform patient treatment. FactPICO consists of 345 plain language summaries of RCT abstracts generated from three LLMs (i.e., GPT-4, Llama-2, and Alpaca), with fine-grained evaluation and natural language rationales from experts. We assess the factuality of critical elements of RCTs in those summaries: Populations, Interventions, Comparators, Outcomes (PICO), as well as the reported findings concerning these. We also evaluate the correctness of the extra information (e.g., explanations) added by LLMs. Using FactPICO, we benchmark a range of existing factuality metrics, including the newly devised ones based on LLMs. We find that plain language summarization of medical evidence is still challenging, especially when balancing between simplicity and factuality, and that existing metrics correlate poorly with expert judgments on the instance level.

6/6/2024