MiniCheck: Efficient Fact-Checking of LLMs on Grounding Documents

2404.10774

Published 4/17/2024 by Liyan Tang, Philippe Laban, Greg Durrett

Abstract

Recognizing if LLM output can be grounded in evidence is central to many tasks in NLP: retrieval-augmented generation, summarization, document-grounded dialogue, and more. Current approaches to this kind of fact-checking are based on verifying each piece of a model generation against potential evidence using an LLM. However, this process can be very computationally expensive, requiring many calls to LLMs to check a single response. In this work, we show how to build small models that have GPT-4-level performance but for 400x lower cost. We do this by constructing synthetic training data with GPT-4, which involves creating realistic yet challenging instances of factual errors via a structured generation procedure. Training on this data teaches models to check each fact in the claim and recognize synthesis of information across sentences. For evaluation, we unify pre-existing datasets into a benchmark LLM-AggreFact, collected from recent work on fact-checking and grounding LLM generations. Our best system MiniCheck-FT5 (770M parameters) outperforms all systems of comparable size and reaches GPT-4 accuracy. We release LLM-AggreFact, code for data synthesis, and models.

Overview

The paper proposes MiniCheck, a method for efficiently verifying whether claims made by large language models (LLMs) are supported by grounding documents.
MiniCheck replaces repeated calls to a large LLM with small trained models, aiming for GPT-4-level accuracy at 400x lower cost.
The paper evaluates MiniCheck on LLM-AggreFact, a benchmark that unifies pre-existing datasets from recent work on fact-checking and grounding LLM generations, including FactCheck Bench.

Plain English Explanation

The paper focuses on the challenge of ensuring that the information produced by large language models (LLMs) is accurate and truthful. LLMs are powerful AI systems that can generate human-like text on a wide range of topics, but there is a risk that they may produce false or misleading information.

To address this, the researchers developed a new method called MiniCheck that can efficiently verify the accuracy of claims made by LLMs by comparing them to relevant source documents. The key idea behind MiniCheck is to train a small, inexpensive model to do the checking, rather than calling a large LLM many times to verify a single response.

The paper demonstrates that MiniCheck can match GPT-4 accuracy in verifying claims at roughly 400x lower cost than LLM-based checking. This makes it a promising tool for ensuring the trustworthiness of information generated by LLMs, which is an important concern as these models become more widely used.

Technical Explanation

The paper proposes an approach called MiniCheck for efficiently fact-checking the claims made by large language models (LLMs) against relevant grounding documents. Existing approaches verify each piece of a model generation against potential evidence using an LLM, which can require many LLM calls to check a single response. The key contribution of MiniCheck is to show that small models can perform this verification at GPT-4-level accuracy for 400x lower cost.
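The cost argument can be made concrete with some back-of-the-envelope arithmetic. In the sketch below, the per-call price and the numbers of sentences and document chunks are hypothetical values chosen for illustration, and applying the ratio uniformly per call is an assumption; only the 400x figure comes from the abstract.

```python
def verification_calls(num_sentences: int, num_chunks: int) -> int:
    """Number of checker calls if every sentence is checked against every chunk."""
    return num_sentences * num_chunks


def total_cost(calls: int, cost_per_call: float) -> float:
    """Total cost of a verification run at a fixed per-call price."""
    return calls * cost_per_call


# Hypothetical price per LLM call, in dollars (not a figure from the paper).
LLM_COST_PER_CALL = 0.01
# Cost ratio reported in the abstract, assumed here to apply per call.
COST_RATIO = 400
small_cost_per_call = LLM_COST_PER_CALL / COST_RATIO

# Hypothetical response: 8 sentences checked against a 5-chunk document.
calls = verification_calls(num_sentences=8, num_chunks=5)
llm_total = total_cost(calls, LLM_COST_PER_CALL)
small_total = total_cost(calls, small_cost_per_call)

print(f"calls per response: {calls}")
print(f"LLM checker cost:   ${llm_total:.4f}")
print(f"small checker cost: ${small_total:.6f}")
```

Because the number of calls grows with both response length and document length, the per-call saving compounds quickly when many responses are checked.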

MiniCheck works by training small models on synthetic data constructed with GPT-4. A structured generation procedure creates realistic yet challenging instances of factual errors, and training on this data teaches the models to check each fact in a claim and to recognize when a claim synthesizes information across several sentences of the grounding document. At inference time, the small model checks claims against the document directly, avoiding the many LLM calls that existing approaches need to verify a single response.
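As a rough illustration of document-grounded checking, the sketch below splits a response into sentences, scores each sentence against chunks of the grounding document, and flags sentences that no chunk supports. The `overlap_score` function is a hypothetical lexical stand-in for a trained checker such as MiniCheck-FT5, and the chunk size, threshold, and max-over-chunks aggregation are assumptions made for the example, not details taken from the paper.

```python
import re
from typing import Callable, List, Tuple


def split_sentences(text: str) -> List[str]:
    """Naive sentence splitter; a real system would use a proper tokenizer."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def chunk_document(text: str, size: int = 50) -> List[str]:
    """Split a document into chunks of at most `size` whitespace-separated words."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


def overlap_score(chunk: str, claim: str) -> float:
    """Hypothetical stand-in scorer: fraction of claim words found in the chunk."""
    claim_tokens = set(re.findall(r"\w+", claim.lower()))
    chunk_tokens = set(re.findall(r"\w+", chunk.lower()))
    if not claim_tokens:
        return 0.0
    return len(claim_tokens & chunk_tokens) / len(claim_tokens)


def check_response(
    document: str,
    response: str,
    scorer: Callable[[str, str], float] = overlap_score,
    threshold: float = 0.8,
) -> List[Tuple[str, bool]]:
    """Label each response sentence as supported (True) or unsupported (False)."""
    chunks = chunk_document(document)
    results = []
    for sentence in split_sentences(response):
        # A sentence counts as supported if its best-scoring chunk clears the threshold.
        best = max((scorer(chunk, sentence) for chunk in chunks), default=0.0)
        results.append((sentence, best >= threshold))
    return results


document = (
    "MiniCheck-FT5 has 770M parameters. "
    "It was trained on synthetic data built with GPT-4."
)
response = (
    "MiniCheck-FT5 has 770M parameters. "
    "It was trained on human-written essays."
)
results = check_response(document, response)
for sentence, supported in results:
    print(("SUPPORTED  " if supported else "UNSUPPORTED") + " " + sentence)
```

Swapping `overlap_score` for a learned model only requires a function with the same `(chunk, claim) -> float` signature; word overlap cannot detect contradictions or cross-sentence synthesis, which is where a trained checker would matter.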

The researchers evaluated MiniCheck on LLM-AggreFact, a benchmark that unifies pre-existing datasets from recent work on fact-checking and grounding LLM generations, including the FactCheck Bench dataset. Their best system, MiniCheck-FT5 (770M parameters), outperforms all systems of comparable size and reaches GPT-4 accuracy. This efficiency makes MiniCheck a promising approach for improving the grounding of LLMs and verifying the truthfulness of the information they generate.
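To make the evaluation setup concrete, here is a minimal sketch of scoring a checker on a benchmark assembled from several datasets. The dataset names, labels, and predictions are made-up placeholders, and averaging per-dataset accuracy is an assumption for illustration; the exact metric used in the paper is not stated in this summary.

```python
from typing import Dict, List, Tuple

# Each example is a (gold_label, predicted_label) pair; True means "supported".
Example = Tuple[bool, bool]


def accuracy(pairs: List[Example]) -> float:
    """Fraction of examples where the prediction matches the gold label."""
    if not pairs:
        raise ValueError("cannot compute accuracy on an empty dataset")
    return sum(1 for gold, pred in pairs if gold == pred) / len(pairs)


def benchmark_report(results: Dict[str, List[Example]]) -> Dict[str, float]:
    """Per-dataset accuracy plus an unweighted average across datasets."""
    report = {name: accuracy(pairs) for name, pairs in results.items()}
    report["average"] = sum(report.values()) / len(results)
    return report


# Placeholder results for two hypothetical datasets in a unified benchmark.
placeholder_results = {
    "dataset_a": [(True, True), (False, False), (True, False), (False, False)],
    "dataset_b": [(True, True), (False, True)],
}
report = benchmark_report(placeholder_results)
for name, score in report.items():
    print(f"{name}: {score:.3f}")
```

Averaging per dataset rather than pooling all examples keeps a large dataset from dominating the headline number, which is one reason a unified benchmark might report results this way.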

Critical Analysis

The paper makes a convincing case for the potential of MiniCheck to improve the efficiency and accuracy of fact-checking for LLMs. However, there are a few areas that could benefit from further exploration or discussion:

Generalization: The paper focuses on evaluating MiniCheck on specific benchmarks, but it would be helpful to understand how well the approach generalizes to a wider range of domains and claim types. The researchers acknowledge this as a potential limitation and an area for future work.
Limitations of Small Models and Synthetic Data: While small models trained on synthetic data are the key to MiniCheck's efficiency, they may also introduce tradeoffs in the level of detail and nuance that can be captured. The paper could delve deeper into the implications of this aspect of the approach.
Complementary Approaches: The paper does not discuss how MiniCheck might be combined with other LLM enhancement or logical reasoning techniques to further improve the reliability and trustworthiness of LLM outputs. Exploring such synergies could be a fruitful area for future research.

Overall, the MiniCheck approach presented in this paper represents a promising step forward in addressing the critical challenge of ensuring the factual accuracy of information generated by large language models.

Conclusion

The paper introduces MiniCheck, a method for efficiently verifying the factual accuracy of claims made by large language models (LLMs) against relevant grounding documents. MiniCheck improves on existing fact-checking approaches by training small models on synthetic data constructed with GPT-4, which allows for faster and more resource-efficient verification than repeated LLM calls.

The researchers demonstrate the effectiveness of MiniCheck through experiments on the LLM-AggreFact benchmark, showing that the best system, MiniCheck-FT5 (770M parameters), reaches GPT-4 accuracy at roughly 400x lower cost. This efficiency makes MiniCheck a promising tool for enhancing the reliability and trustworthiness of the information generated by LLMs, which is a crucial concern as these powerful AI systems become more widely deployed.

Overall, the MiniCheck approach represents an important contribution to the ongoing efforts to ensure the factual accuracy and transparency of large language models, which will be essential for building public trust and realizing the full potential of these transformative technologies.

Related Papers

Factcheck-Bench: Fine-Grained Evaluation Benchmark for Automatic Fact-checkers

Yuxia Wang, Revanth Gangi Reddy, Zain Muhammad Mujahid, Arnav Arora, Aleksandr Rubashevskii, Jiahui Geng, Osama Mohammed Afzal, Liangming Pan, Nadav Borenstein, Aditya Pillai, Isabelle Augenstein, Iryna Gurevych, Preslav Nakov

The increased use of large language models (LLMs) across a variety of real-world applications calls for mechanisms to verify the factual accuracy of their outputs. In this work, we present a holistic end-to-end solution for annotating the factuality of LLM-generated responses, which encompasses a multi-stage annotation scheme designed to yield detailed labels concerning the verifiability and factual inconsistencies found in LLM outputs. We further construct an open-domain document-level factuality benchmark in three-level granularity: claim, sentence and document, aiming to facilitate the evaluation of automatic fact-checking systems. Preliminary experiments show that FacTool, FactScore and Perplexity.ai are struggling to identify false claims, with the best F1=0.63 by this annotation solution based on GPT-4. Annotation tool, benchmark and code are available at https://github.com/yuxiaw/Factcheck-GPT.

4/17/2024

cs.CL

Multimodal Large Language Models to Support Real-World Fact-Checking

Jiahui Geng, Yova Kementchedjhieva, Preslav Nakov, Iryna Gurevych

Multimodal large language models (MLLMs) carry the potential to support humans in processing vast amounts of information. While MLLMs are already being used as a fact-checking tool, their abilities and limitations in this regard are understudied. Here we aim to bridge this gap. In particular, we propose a framework for systematically assessing the capacity of current multimodal models to facilitate real-world fact-checking. Our methodology is evidence-free, leveraging only these models' intrinsic knowledge and reasoning capabilities. By designing prompts that extract models' predictions, explanations, and confidence levels, we delve into research questions concerning model accuracy, robustness, and reasons for failure. We empirically find that (1) GPT-4V exhibits superior performance in identifying malicious and misleading multimodal claims, with the ability to explain the unreasonable aspects and underlying motives, and (2) existing open-source models exhibit strong biases and are highly sensitive to the prompt. Our study offers insights into combating false multimodal information and building secure, trustworthy multimodal models. To the best of our knowledge, we are the first to evaluate MLLMs for real-world fact-checking.

4/29/2024

cs.CL cs.AI

FactFinders at CheckThat! 2024: Refining Check-worthy Statement Detection with LLMs through Data Pruning

Yufeng Li, Rrubaa Panchendrarajan, Arkaitz Zubiaga

The rapid dissemination of information through social media and the Internet has posed a significant challenge for fact-checking, among others in identifying check-worthy claims that fact-checkers should pay attention to, i.e. filtering claims needing fact-checking from a large pool of sentences. This challenge has stressed the need to focus on determining the priority of claims, specifically which claims are worth to be fact-checked. Despite advancements in this area in recent years, the application of large language models (LLMs), such as GPT, has only recently drawn attention in studies. However, many open-source LLMs remain underexplored. Therefore, this study investigates the application of eight prominent open-source LLMs with fine-tuning and prompt engineering to identify check-worthy statements from political transcriptions. Further, we propose a two-step data pruning approach to automatically identify high-quality training data instances for effective learning. The efficiency of our approach is demonstrated through evaluations on the English language dataset as part of the check-worthiness estimation task of CheckThat! 2024. Further, the experiments conducted with data pruning demonstrate that competitive performance can be achieved with only about 44% of the training data. Our team ranked first in the check-worthiness estimation task in the English language.

6/27/2024

cs.CL

OpenFactCheck: A Unified Framework for Factuality Evaluation of LLMs

Yuxia Wang, Minghan Wang, Hasan Iqbal, Georgi Georgiev, Jiahui Geng, Preslav Nakov

The increased use of large language models (LLMs) across a variety of real-world applications calls for mechanisms to verify the factual accuracy of their outputs. Difficulties lie in assessing the factuality of free-form responses in open domains. Also, different papers use disparate evaluation benchmarks and measurements, which renders them hard to compare and hampers future progress. To mitigate these issues, we propose OpenFactCheck, a unified factuality evaluation framework for LLMs. OpenFactCheck consists of three modules: (i) CUSTCHECKER allows users to easily customize an automatic fact-checker and verify the factual correctness of documents and claims, (ii) LLMEVAL, a unified evaluation framework assesses LLM's factuality ability from various perspectives fairly, and (iii) CHECKEREVAL is an extensible solution for gauging the reliability of automatic fact-checkers' verification results using human-annotated datasets. OpenFactCheck is publicly released at https://github.com/yuxiaw/OpenFactCheck.

5/10/2024

cs.CL