Factcheck-Bench: Fine-Grained Evaluation Benchmark for Automatic Fact-checkers

2311.09000

Published 4/17/2024 by Yuxia Wang, Revanth Gangi Reddy, Zain Muhammad Mujahid, Arnav Arora, Aleksandr Rubashevskii, Jiahui Geng, Osama Mohammed Afzal, Liangming Pan, Nadav Borenstein, Aditya Pillai and 3 others

cs.CL

Abstract

The increased use of large language models (LLMs) across a variety of real-world applications calls for mechanisms to verify the factual accuracy of their outputs. In this work, we present a holistic end-to-end solution for annotating the factuality of LLM-generated responses, which encompasses a multi-stage annotation scheme designed to yield detailed labels concerning the verifiability and factual inconsistencies found in LLM outputs. We further construct an open-domain document-level factuality benchmark in three-level granularity: claim, sentence and document, aiming to facilitate the evaluation of automatic fact-checking systems. Preliminary experiments show that FacTool, FactScore and Perplexity.ai are struggling to identify false claims, with the best F1=0.63 by this annotation solution based on GPT-4. Annotation tool, benchmark and code are available at https://github.com/yuxiaw/Factcheck-GPT.

Overview

This paper presents a holistic end-to-end solution for annotating the factual accuracy of responses generated by large language models (LLMs).
The authors developed a multi-stage annotation scheme to identify verifiability and factual inconsistencies in LLM outputs.
They also constructed an open-domain, document-level factuality benchmark to facilitate the evaluation of automatic fact-checking systems.
Preliminary experiments show that existing tools such as FacTool, FactScore, and Perplexity.ai struggle to identify false claims. The best F1 score, 0.63, comes from the authors' own GPT-4-based annotation solution.

Plain English Explanation

As large language models (LLMs) are increasingly used in real-world applications, it's important to have ways to verify the accuracy of the information they generate. This paper presents a comprehensive solution to address this need.

The researchers developed a multi-step process to carefully examine the factual accuracy of LLM outputs. They created a detailed set of labels to identify whether the generated content is verifiable and if there are any factual inconsistencies.

To further support the development of automatic fact-checking systems, the researchers also built an open-domain benchmark dataset with factuality labels at three levels: claim, sentence, and document. This benchmark can be used to evaluate how well these systems identify false or inaccurate statements.

The researchers found that existing tools like FacTool, FactScore, and Perplexity.ai struggled to detect false claims. The best result, an F1 score of 0.63, came from the researchers' own GPT-4-based annotation solution.

Technical Explanation

The paper presents a comprehensive solution for annotating the factuality of LLM-generated responses. The authors developed a multi-stage annotation scheme that assigns detailed labels to identify the verifiability and factual inconsistencies in the output.

The annotation process involves several steps:

Claim identification: Extracting individual claims from the LLM output.
Claim verifiability assessment: Determining whether each claim is verifiable or not.
Factual consistency checking: Identifying any factual inconsistencies within the claims.
Document-level factuality scoring: Assigning an overall factuality score to the entire output.
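The four steps above can be sketched as a small pipeline. This is only an illustrative sketch, not the authors' implementation: the names `ClaimRecord`, `extract_claims`, and `annotate` are hypothetical, sentence splitting stands in for real claim extraction, the two judgment functions are supplied by the caller, and the document score (the share of verifiable claims judged factual) is an assumed definition.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class ClaimRecord:
    """One claim and the labels attached to it (hypothetical structure)."""
    text: str
    verifiable: Optional[bool] = None  # filled in by step 2
    factual: Optional[bool] = None     # filled in by step 3; None if not checked


def extract_claims(response: str) -> List[ClaimRecord]:
    # Step 1 (placeholder): treat each period-delimited sentence as one claim.
    parts = [p.strip() for p in response.split(".")]
    return [ClaimRecord(text=p) for p in parts if p]


def annotate(
    response: str,
    is_verifiable: Callable[[str], bool],
    is_factual: Callable[[str], bool],
) -> Tuple[List[ClaimRecord], Optional[float]]:
    claims = extract_claims(response)
    for claim in claims:
        # Step 2: decide whether the claim can be checked at all.
        claim.verifiable = is_verifiable(claim.text)
        # Step 3: only verifiable claims receive a factuality judgment.
        if claim.verifiable:
            claim.factual = is_factual(claim.text)
    # Step 4 (assumed definition): fraction of checked claims judged factual.
    checked = [c for c in claims if c.factual is not None]
    score = sum(c.factual for c in checked) / len(checked) if checked else None
    return claims, score
```

In practice the two callables would wrap a human annotator or a model call; keeping them as parameters leaves the pipeline logic independent of how each judgment is made.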

To facilitate the evaluation of automatic fact-checking systems, the researchers also constructed an open-domain, document-level factuality benchmark. This benchmark provides factuality labels at three levels of granularity: claim, sentence, and document.
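One way to picture the three levels of granularity is as labels that roll up from claims to sentences to the whole document. The sketch below assumes a simple rule that the summary does not state: a sentence counts as factual only if all of its claims are, and a document only if all of its sentences are. The function names are hypothetical.

```python
from typing import Dict, List


def sentence_label(claim_labels: List[bool]) -> bool:
    # Assumed rule: one false claim makes the whole sentence false.
    # A sentence with no claims is treated as factual, since all([]) is True.
    return all(claim_labels)


def document_labels(sentences: List[List[bool]]) -> Dict[str, object]:
    """Roll claim-level labels (True = factual) up to sentence and document level."""
    per_sentence = [sentence_label(claims) for claims in sentences]
    return {
        "claims": sentences,
        "sentences": per_sentence,
        "document": all(per_sentence),
    }
```

A benchmark entry could also store independently annotated labels at each level; the roll-up shown here is just the simplest consistent choice.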

The researchers conducted preliminary experiments to test how well existing tools, such as FacTool, FactScore, and Perplexity.ai, identify false claims. These tools struggled, and the best F1 score, 0.63, was achieved by the authors' own GPT-4-based annotation solution.
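For readers unfamiliar with the metric, F1 is the harmonic mean of precision and recall. The sketch below computes it for false-claim detection, assuming that "false" is the positive class and that labels are booleans where True means factual. The function name and the toy labels are illustrative and are not taken from the benchmark.

```python
from typing import Sequence, Tuple


def f1_for_false_claims(
    gold: Sequence[bool], pred: Sequence[bool]
) -> Tuple[float, float, float]:
    """Precision, recall, and F1 with 'false claim' as the positive class."""
    pairs = list(zip(gold, pred))
    tp = sum(1 for g, p in pairs if not g and not p)  # false claim, flagged
    fp = sum(1 for g, p in pairs if g and not p)      # true claim, wrongly flagged
    fn = sum(1 for g, p in pairs if not g and p)      # false claim, missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Toy example: five claims, three of which are actually false.
gold = [True, False, False, True, False]
pred = [True, False, True, False, False]
precision, recall, f1 = f1_for_false_claims(gold, pred)
```

Here the checker flags two of the three false claims and wrongly flags one true claim, so precision, recall, and F1 all come out to 2/3.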

Critical Analysis

The researchers' approach to annotating the factuality of LLM outputs is comprehensive and promising. By developing a multi-stage annotation scheme and an open-domain benchmark, they have provided valuable resources for the research community to further explore and improve automatic fact-checking systems.

However, the paper does not address the scalability of its approach, particularly for large volumes of LLM outputs. Manual annotation is time-consuming and labor-intensive, which could limit its practical use in real-world scenarios.

Additionally, the researchers mention that their benchmark dataset is limited to English-language content. Expanding the dataset to include multilingual content would greatly enhance its utility and ensure the factuality verification solutions are applicable to a wider range of LLM applications.

Furthermore, the paper does not delve into the potential biases and limitations of the GPT-4 model used in their annotation process. Exploring the impact of model biases on the factuality assessment could provide valuable insights for developing more robust and trustworthy fact-checking systems.

Conclusion

This paper presents a significant step forward in addressing the need for verifying the factual accuracy of LLM-generated outputs. By developing a comprehensive annotation scheme and an open-domain benchmark, the researchers have laid the foundation for the advancement of automatic fact-checking systems.

The findings from the preliminary experiments highlight the challenges faced by existing tools in accurately identifying false claims, underscoring the importance of continued research and development in this area. As LLMs become more prevalent in various applications, the ability to reliably assess the factuality of their outputs will be crucial for ensuring their safe and trustworthy deployment.

Related Papers

OpenFactCheck: A Unified Framework for Factuality Evaluation of LLMs

Yuxia Wang, Minghan Wang, Hasan Iqbal, Georgi Georgiev, Jiahui Geng, Preslav Nakov

The increased use of large language models (LLMs) across a variety of real-world applications calls for mechanisms to verify the factual accuracy of their outputs. Difficulties lie in assessing the factuality of free-form responses in open domains. Also, different papers use disparate evaluation benchmarks and measurements, which renders them hard to compare and hampers future progress. To mitigate these issues, we propose OpenFactCheck, a unified factuality evaluation framework for LLMs. OpenFactCheck consists of three modules: (i) CUSTCHECKER allows users to easily customize an automatic fact-checker and verify the factual correctness of documents and claims, (ii) LLMEVAL, a unified evaluation framework assesses LLM's factuality ability from various perspectives fairly, and (iii) CHECKEREVAL is an extensible solution for gauging the reliability of automatic fact-checkers' verification results using human-annotated datasets. OpenFactCheck is publicly released at https://github.com/yuxiaw/OpenFactCheck.

5/10/2024

cs.CL

MiniCheck: Efficient Fact-Checking of LLMs on Grounding Documents

Liyan Tang, Philippe Laban, Greg Durrett

Recognizing if LLM output can be grounded in evidence is central to many tasks in NLP: retrieval-augmented generation, summarization, document-grounded dialogue, and more. Current approaches to this kind of fact-checking are based on verifying each piece of a model generation against potential evidence using an LLM. However, this process can be very computationally expensive, requiring many calls to LLMs to check a single response. In this work, we show how to build small models that have GPT-4-level performance but for 400x lower cost. We do this by constructing synthetic training data with GPT-4, which involves creating realistic yet challenging instances of factual errors via a structured generation procedure. Training on this data teaches models to check each fact in the claim and recognize synthesis of information across sentences. For evaluation, we unify pre-existing datasets into a benchmark LLM-AggreFact, collected from recent work on fact-checking and grounding LLM generations. Our best system MiniCheck-FT5 (770M parameters) outperforms all systems of comparable size and reaches GPT-4 accuracy. We release LLM-AggreFact, code for data synthesis, and models.

4/17/2024

cs.CL cs.AI

Towards a Holistic Evaluation of LLMs on Factual Knowledge Recall

Jiaqing Yuan, Lin Pan, Chung-Wei Hang, Jiang Guo, Jiarong Jiang, Bonan Min, Patrick Ng, Zhiguo Wang

Large language models (LLMs) have shown remarkable performance on a variety of NLP tasks, and are being rapidly adopted in a wide range of use cases. It is therefore of vital importance to holistically evaluate the factuality of their generated outputs, as hallucinations remain a challenging issue. In this work, we focus on assessing LLMs' ability to recall factual knowledge learned from pretraining, and the factors that affect this ability. To that end, we construct FACT-BENCH, a representative benchmark covering 20 domains, 134 property types, 3 answer types, and different knowledge popularity levels. We benchmark 31 models from 10 model families and provide a holistic assessment of their strengths and weaknesses. We observe that instruction-tuning hurts knowledge recall, as pretraining-only models consistently outperform their instruction-tuned counterparts, and positive effects of model scaling, as larger models outperform smaller ones for all model families. However, the best performance from GPT-4 still represents a large gap with the upper-bound. We additionally study the role of in-context exemplars using counterfactual demonstrations, which lead to significant degradation of factual knowledge recall for large models. By further decoupling model known and unknown knowledge, we find the degradation is attributed to exemplars that contradict a model's known knowledge, as well as the number of such exemplars. Lastly, we fine-tune LLaMA-7B in different settings of known and unknown knowledge. In particular, fine-tuning on a model's known knowledge is beneficial, and consistently outperforms fine-tuning on unknown and mixed knowledge. We will make our benchmark publicly available.

4/26/2024

cs.CL cs.AI cs.LG

Surprising Efficacy of Fine-Tuned Transformers for Fact-Checking over Larger Language Models

Vinay Setty

In this paper, we explore the challenges associated with establishing an end-to-end fact-checking pipeline in a real-world context, covering over 90 languages. Our real-world experimental benchmarks demonstrate that fine-tuning Transformer models specifically for fact-checking tasks, such as claim detection and veracity prediction, provide superior performance over large language models (LLMs) like GPT-4, GPT-3.5-Turbo, and Mistral-7b. However, we illustrate that LLMs excel in generative tasks such as question decomposition for evidence retrieval. Through extensive evaluation, we show the efficacy of fine-tuned models for fact-checking in a multilingual setting and complex claims that include numerical quantities.

5/1/2024

cs.CL cs.AI