OpenFactCheck: A Unified Framework for Factuality Evaluation of LLMs

2405.05583

Published 5/10/2024 by Yuxia Wang, Minghan Wang, Hasan Iqbal, Georgi Georgiev, Jiahui Geng, Preslav Nakov


Abstract

The increased use of large language models (LLMs) across a variety of real-world applications calls for mechanisms to verify the factual accuracy of their outputs. Difficulties lie in assessing the factuality of free-form responses in open domains. Also, different papers use disparate evaluation benchmarks and measurements, which renders them hard to compare and hampers future progress. To mitigate these issues, we propose OpenFactCheck, a unified factuality evaluation framework for LLMs. OpenFactCheck consists of three modules: (i) CUSTCHECKER allows users to easily customize an automatic fact-checker and verify the factual correctness of documents and claims, (ii) LLMEVAL, a unified evaluation framework, fairly assesses an LLM's factuality from various perspectives, and (iii) CHECKEREVAL is an extensible solution for gauging the reliability of automatic fact-checkers' verification results using human-annotated datasets. OpenFactCheck is publicly released at https://github.com/yuxiaw/OpenFactCheck.


Overview

Researchers propose a framework called OpenFactCheck to evaluate the factual accuracy of large language models (LLMs) in various applications.
OpenFactCheck consists of three main components: CUSTCHECKER, LLMEVAL, and CHECKEREVAL.
This framework aims to address the challenges in assessing the factual accuracy of LLM outputs in open-domain applications and the lack of a unified evaluation approach.

Plain English Explanation

As large language models (LLMs) are increasingly used in real-world applications, it is crucial to verify the accuracy of the information they provide. This is challenging because LLMs generate free-form responses on a wide range of topics, which makes the factual correctness of their outputs hard to assess.

Additionally, different research papers use different evaluation methods and metrics, which makes it hard to compare the factual accuracy of LLMs across studies. To address these issues, the researchers have developed a framework called OpenFactCheck.

OpenFactCheck has three main components:

CUSTCHECKER: This allows users to easily create and customize their own automated fact-checking tools to verify the factual correctness of documents and claims.
LLMEVAL: This is a unified framework that assesses the factuality of LLMs from various perspectives, making it easier to compare their performance across different studies.
CHECKEREVAL: This is an extensible solution for evaluating the reliability of automated fact-checkers, using human-annotated datasets as a reference.

By providing these tools, the authors aim to help researchers and developers better understand the factual accuracy of LLMs and improve their reliability in real-world applications.

Technical Explanation

The researchers propose OpenFactCheck, a unified framework for evaluating the factual accuracy of large language models (LLMs) across a variety of applications. The framework consists of three main components:

CUSTCHECKER: This module allows users to easily customize an automatic fact-checking system to verify the factual correctness of documents and claims. It provides a flexible and extensible platform for developing and evaluating fact-checking tools.
LLMEVAL: This unified evaluation framework assesses the factuality of LLMs from multiple perspectives. Evaluating every model in the same way is what enables fair comparisons of LLM factuality across different studies and applications.
CHECKEREVAL: This component provides an extensible solution for gauging the reliability of automated fact-checkers. It uses human-annotated datasets as a ground truth to evaluate the performance of fact-checking systems, helping to ensure the trustworthiness of their verification results.
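
The idea of a customizable fact-checker, as described for CUSTCHECKER, can be illustrated with a minimal sketch. This is not the OpenFactCheck API: the three-stage split (claim extraction, evidence retrieval, verification) and every name below (`FactChecker`, `split_sentences`, `keyword_retriever`, `overlap_verifier`, `KNOWLEDGE_BASE`) are assumptions made for illustration, and the toy components stand in for the LLM-based or search-based ones a real checker would likely use.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical stage signatures; not taken from the OpenFactCheck codebase.
ClaimExtractor = Callable[[str], List[str]]
Retriever = Callable[[str], List[str]]
Verifier = Callable[[str, List[str]], bool]


@dataclass
class FactChecker:
    """A fact-checker assembled from three swappable stages."""
    extract: ClaimExtractor
    retrieve: Retriever
    verify: Verifier

    def check(self, document: str) -> List[dict]:
        results = []
        for claim in self.extract(document):
            evidence = self.retrieve(claim)
            results.append({
                "claim": claim,
                "evidence": evidence,
                "supported": self.verify(claim, evidence),
            })
        return results


# Toy knowledge base standing in for a search engine or document index.
KNOWLEDGE_BASE = [
    "Paris is the capital of France",
    "Water boils at 100 degrees Celsius at sea level",
]


def split_sentences(document: str) -> List[str]:
    """Treat each sentence as one claim (a real extractor would decompose further)."""
    return [s.strip() for s in document.split(".") if s.strip()]


def _words(text: str) -> set:
    return set(text.lower().split())


def keyword_retriever(claim: str) -> List[str]:
    """Return knowledge-base entries sharing at least three words with the claim."""
    return [fact for fact in KNOWLEDGE_BASE if len(_words(claim) & _words(fact)) >= 3]


def overlap_verifier(claim: str, evidence: List[str]) -> bool:
    """Mark a claim supported only if some evidence contains all of its words."""
    return any(_words(claim) <= _words(fact) for fact in evidence)


checker = FactChecker(split_sentences, keyword_retriever, overlap_verifier)
report = checker.check("Paris is the capital of France. Paris is the capital of Spain.")
for row in report:
    print(row["claim"], "->", row["supported"])
```

Because each stage is just a function with a fixed signature, swapping in a different retriever or verifier means passing a different callable to `FactChecker`; the rest of the pipeline stays unchanged.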

By integrating these three modules, OpenFactCheck aims to address the challenges in assessing the factual accuracy of LLM outputs in open-domain applications and the lack of a unified evaluation approach. The framework is publicly available at https://github.com/yuxiaw/OpenFactCheck, allowing researchers and developers to leverage its capabilities for their own work.

Critical Analysis

The researchers have identified an important challenge in the widespread use of large language models (LLMs) – the need to verify the factual accuracy of their outputs. OpenFactCheck provides a comprehensive framework to address this issue, offering tools for customizing automated fact-checkers, evaluating LLM factuality, and assessing the reliability of fact-checking systems.

One potential limitation of the framework is the reliance on human-annotated datasets for evaluating fact-checkers. While this approach helps ensure the trustworthiness of the verification results, the availability and quality of such datasets may vary, which could impact the effectiveness of the CHECKEREVAL component.
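
To make this dependence on annotations concrete, the sketch below scores a fact-checker's verdicts against human labels using precision, recall, and F1 for the "false claim" class. It is an illustration only: the function name `score_checker`, the boolean label encoding, and the example labels are all invented here, and the metrics CHECKEREVAL actually reports may differ.

```python
from typing import Dict, List


def score_checker(human: List[bool], predicted: List[bool], target: bool = False) -> Dict[str, float]:
    """Precision, recall, and F1 for one label, treating human annotations as ground truth.

    `target=False` scores how well the checker detects false claims.
    """
    if len(human) != len(predicted):
        raise ValueError("human and predicted labels must align one-to-one")

    pairs = list(zip(human, predicted))
    tp = sum(h == target and p == target for h, p in pairs)  # correctly flagged
    fp = sum(h != target and p == target for h, p in pairs)  # wrongly flagged
    fn = sum(h == target and p != target for h, p in pairs)  # missed

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Invented labels for six claims: True = factually correct, False = factually wrong.
human_labels = [True, False, False, True, False, True]
checker_labels = [True, False, True, True, False, False]

scores = score_checker(human_labels, checker_labels)
print(scores)
```

Changing a single entry in `human_labels` changes the resulting scores, which is why the quality of the annotated dataset directly bounds how much the reliability estimate can be trusted.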

Additionally, the researchers do not address the potential biases or inconsistencies that may exist in the fact-checking tools themselves. Further research could explore methods for detecting and mitigating these issues to enhance the overall reliability of the factuality assessment process.

Despite these concerns, OpenFactCheck represents an important step forward in verifying the factual accuracy of LLM outputs. By providing a unified framework for evaluation and fact-checking, the researchers are laying the groundwork for more robust and trustworthy deployment of LLMs in real-world applications.

Conclusion

The increased use of large language models (LLMs) across various domains has highlighted the need for reliable mechanisms to assess the factual accuracy of their outputs. The researchers have developed OpenFactCheck, a comprehensive framework that addresses this challenge by providing tools for customizing automated fact-checkers, evaluating LLM factuality, and gauging the reliability of fact-checking systems.

By integrating these three key components – CUSTCHECKER, LLMEVAL, and CHECKEREVAL – OpenFactCheck offers a unified and extensible solution for verifying the factual correctness of LLM outputs in open-domain applications. This framework has the potential to enhance the reliability and trustworthiness of LLMs, ultimately supporting their effective deployment in real-world scenarios, such as helping humans verify the truthfulness of information.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers


Factcheck-Bench: Fine-Grained Evaluation Benchmark for Automatic Fact-checkers

Yuxia Wang, Revanth Gangi Reddy, Zain Muhammad Mujahid, Arnav Arora, Aleksandr Rubashevskii, Jiahui Geng, Osama Mohammed Afzal, Liangming Pan, Nadav Borenstein, Aditya Pillai, Isabelle Augenstein, Iryna Gurevych, Preslav Nakov

The increased use of large language models (LLMs) across a variety of real-world applications calls for mechanisms to verify the factual accuracy of their outputs. In this work, we present a holistic end-to-end solution for annotating the factuality of LLM-generated responses, which encompasses a multi-stage annotation scheme designed to yield detailed labels concerning the verifiability and factual inconsistencies found in LLM outputs. We further construct an open-domain document-level factuality benchmark in three-level granularity: claim, sentence and document, aiming to facilitate the evaluation of automatic fact-checking systems. Preliminary experiments show that FacTool, FactScore and Perplexity.ai are struggling to identify false claims, with the best F1=0.63 by this annotation solution based on GPT-4. Annotation tool, benchmark and code are available at https://github.com/yuxiaw/Factcheck-GPT.

4/17/2024

cs.CL

MiniCheck: Efficient Fact-Checking of LLMs on Grounding Documents

Liyan Tang, Philippe Laban, Greg Durrett

Recognizing if LLM output can be grounded in evidence is central to many tasks in NLP: retrieval-augmented generation, summarization, document-grounded dialogue, and more. Current approaches to this kind of fact-checking are based on verifying each piece of a model generation against potential evidence using an LLM. However, this process can be very computationally expensive, requiring many calls to LLMs to check a single response. In this work, we show how to build small models that have GPT-4-level performance but for 400x lower cost. We do this by constructing synthetic training data with GPT-4, which involves creating realistic yet challenging instances of factual errors via a structured generation procedure. Training on this data teaches models to check each fact in the claim and recognize synthesis of information across sentences. For evaluation, we unify pre-existing datasets into a benchmark LLM-AggreFact, collected from recent work on fact-checking and grounding LLM generations. Our best system MiniCheck-FT5 (770M parameters) outperforms all systems of comparable size and reaches GPT-4 accuracy. We release LLM-AggreFact, code for data synthesis, and models.

4/17/2024

cs.CL cs.AI

Multimodal Large Language Models to Support Real-World Fact-Checking

Jiahui Geng, Yova Kementchedjhieva, Preslav Nakov, Iryna Gurevych

Multimodal large language models (MLLMs) carry the potential to support humans in processing vast amounts of information. While MLLMs are already being used as a fact-checking tool, their abilities and limitations in this regard are understudied. Here, we aim to bridge this gap. In particular, we propose a framework for systematically assessing the capacity of current multimodal models to facilitate real-world fact-checking. Our methodology is evidence-free, leveraging only these models' intrinsic knowledge and reasoning capabilities. By designing prompts that extract models' predictions, explanations, and confidence levels, we delve into research questions concerning model accuracy, robustness, and reasons for failure. We empirically find that (1) GPT-4V exhibits superior performance in identifying malicious and misleading multimodal claims, with the ability to explain the unreasonable aspects and underlying motives, and (2) existing open-source models exhibit strong biases and are highly sensitive to the prompt. Our study offers insights into combating false multimodal information and building secure, trustworthy multimodal models. To the best of our knowledge, we are the first to evaluate MLLMs for real-world fact-checking.

4/29/2024

cs.CL cs.AI

FactFinders at CheckThat! 2024: Refining Check-worthy Statement Detection with LLMs through Data Pruning

Yufeng Li, Rrubaa Panchendrarajan, Arkaitz Zubiaga

The rapid dissemination of information through social media and the Internet has posed a significant challenge for fact-checking, among others in identifying check-worthy claims that fact-checkers should pay attention to, i.e. filtering claims needing fact-checking from a large pool of sentences. This challenge has stressed the need to focus on determining the priority of claims, specifically which claims are worth fact-checking. Despite advancements in this area in recent years, the application of large language models (LLMs), such as GPT, has only recently drawn attention in studies. However, many open-source LLMs remain underexplored. Therefore, this study investigates the application of eight prominent open-source LLMs with fine-tuning and prompt engineering to identify check-worthy statements from political transcriptions. Further, we propose a two-step data pruning approach to automatically identify high-quality training data instances for effective learning. The efficiency of our approach is demonstrated through evaluations on the English language dataset as part of the check-worthiness estimation task of CheckThat! 2024. Further, the experiments conducted with data pruning demonstrate that competitive performance can be achieved with only about 44% of the training data. Our team ranked first in the check-worthiness estimation task in the English language.

6/27/2024

cs.CL