FactFinders at CheckThat! 2024: Refining Check-worthy Statement Detection with LLMs through Data Pruning

2406.18297

Published 6/27/2024 by Yufeng Li, Rrubaa Panchendrarajan, Arkaitz Zubiaga

FactFinders at CheckThat! 2024: Refining Check-worthy Statement Detection with LLMs through Data Pruning

Abstract

The rapid dissemination of information through social media and the Internet has posed a significant challenge for fact-checking, among others in identifying check-worthy claims that fact-checkers should pay attention to, i.e. filtering claims needing fact-checking from a large pool of sentences. This challenge has stressed the need to focus on determining the priority of claims, specifically which claims are worth to be fact-checked. Despite advancements in this area in recent years, the application of large language models (LLMs), such as GPT, has only recently drawn attention in studies. However, many open-source LLMs remain underexplored. Therefore, this study investigates the application of eight prominent open-source LLMs with fine-tuning and prompt engineering to identify check-worthy statements from political transcriptions. Further, we propose a two-step data pruning approach to automatically identify high-quality training data instances for effective learning. The efficiency of our approach is demonstrated through evaluations on the English language dataset as part of the check-worthiness estimation task of CheckThat! 2024. Further, the experiments conducted with data pruning demonstrate that competitive performance can be achieved with only about 44% of the training data. Our team ranked first in the check-worthiness estimation task in the English language.

Create account to get full access

Overview

This paper explores how to improve the detection of check-worthy statements using large language models (LLMs) and data pruning techniques.
The researchers developed a system called FactFinders that participated in the CheckThat! 2024 competition, which evaluates the ability to identify check-worthy claims in text.
The key contributions include using LLMs for check-worthiness detection and exploring data pruning methods to refine the training data.

Plain English Explanation

The researchers wanted to make it easier to find claims or statements in text that are worth fact-checking. They developed a system called FactFinders that uses large language models to analyze text and identify potentially check-worthy statements.

To train FactFinders, the team used a process called data pruning to refine the training data. This involved removing certain data points that weren't as useful for the task, in order to improve the model's performance. The researchers tested FactFinders in the CheckThat! 2024 competition, which evaluates how well systems can detect claims that should be fact-checked.

The key idea is to use powerful language models in combination with selective data cleaning to get better at identifying statements that deserve a closer look from fact-checkers. This could help make the fact-checking process more efficient and effective. The researchers explored different techniques to refine the training data, with the goal of improving the model's ability to spot important claims that should be verified.

Technical Explanation

The paper introduces FactFinders, a system developed by the researchers to participate in the CheckThat! 2024 competition. FactFinders uses large language models (LLMs) for the task of detecting check-worthy statements in text.

To train FactFinders, the researchers experimented with different data pruning techniques to refine the training data. This involved selectively removing certain data points that were less useful for the check-worthiness detection task. The goal was to improve the model's performance by focusing the training on more informative examples.

The researchers evaluated FactFinders on the FactCheck-Bench dataset, which provides a benchmark for assessing the ability to identify check-worthy claims. They compared the performance of FactFinders with and without the data pruning techniques to understand the impact of this approach.

Additionally, the paper explores the use of MiniCheck, a framework for efficiently fact-checking claims using LLMs, as a potential component within the FactFinders system.

Critical Analysis

The researchers acknowledge that their data pruning approach is a heuristic method, and they suggest that more sophisticated techniques could be explored in future work. The paper also notes that the performance of FactFinders is still limited by the underlying capabilities of the LLMs used, and further advancements in language model technology could lead to even better check-worthiness detection.

One potential limitation of the study is that it focuses solely on the CheckThat! 2024 dataset and competition, which may not fully capture the diversity of real-world scenarios for claim verification. Expanding the evaluation to other datasets or use cases could provide a more comprehensive understanding of the system's performance.

Additionally, the paper does not delve into the ethical considerations of deploying such a system in the real world. There may be concerns around the potential for misuse or unintended consequences, and the researchers could have discussed these aspects in more depth.

Overall, the work presented in this paper represents a valuable contribution to the field of automatic claim verification and fact-checking. The exploration of data pruning techniques to refine LLM-based systems is an interesting approach that could inspire further research in this area.

Conclusion

This paper introduces FactFinders, a system that leverages large language models and data pruning techniques to improve the detection of check-worthy statements in text. The key contributions include:

Using LLMs as the backbone for check-worthiness detection, building on the advancements in language modeling research.
Exploring data pruning methods to refine the training data and enhance the model's performance on the task.
Evaluating the system on the FactCheck-Bench dataset and exploring the integration of MiniCheck for efficient fact-checking.

The findings suggest that data pruning can be a valuable technique for improving check-worthiness detection with LLMs, and the work highlights the potential for advancing automated claim verification systems. As the need for reliable fact-checking grows, continued research in this direction could lead to more efficient and effective tools for identifying important claims that warrant further investigation.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Claim Check-Worthiness Detection: How Well do LLMs Grasp Annotation Guidelines?

Laura Majer, Jan v{S}najder

The increasing threat of disinformation calls for automating parts of the fact-checking pipeline. Identifying text segments requiring fact-checking is known as claim detection (CD) and claim check-worthiness detection (CW), the latter incorporating complex domain-specific criteria of worthiness and often framed as a ranking task. Zero- and few-shot LLM prompting is an attractive option for both tasks, as it bypasses the need for labeled datasets and allows verbalized claim and worthiness criteria to be directly used for prompting. We evaluate the LLMs' predictive and calibration accuracy on five CD/CW datasets from diverse domains, each utilizing a different worthiness criterion. We investigate two key aspects: (1) how best to distill factuality and worthiness criteria into a prompt and (2) what amount of context to provide for each claim. To this end, we experiment with varying the level of prompt verbosity and the amount of contextual information provided to the model. Our results show that optimal prompt verbosity is domain-dependent, adding context does not improve performance, and confidence scores can be directly used to produce reliable check-worthiness rankings.

4/19/2024

cs.CL

Multimodal Large Language Models to Support Real-World Fact-Checking

Jiahui Geng, Yova Kementchedjhieva, Preslav Nakov, Iryna Gurevych

Multimodal large language models (MLLMs) carry the potential to support humans in processing vast amounts of information. While MLLMs are already being used as a fact-checking tool, their abilities and limitations in this regard are understudied. Here is aim to bridge this gap. In particular, we propose a framework for systematically assessing the capacity of current multimodal models to facilitate real-world fact-checking. Our methodology is evidence-free, leveraging only these models' intrinsic knowledge and reasoning capabilities. By designing prompts that extract models' predictions, explanations, and confidence levels, we delve into research questions concerning model accuracy, robustness, and reasons for failure. We empirically find that (1) GPT-4V exhibits superior performance in identifying malicious and misleading multimodal claims, with the ability to explain the unreasonable aspects and underlying motives, and (2) existing open-source models exhibit strong biases and are highly sensitive to the prompt. Our study offers insights into combating false multimodal information and building secure, trustworthy multimodal models. To the best of our knowledge, we are the first to evaluate MLLMs for real-world fact-checking.

4/29/2024

cs.CL cs.AI

MiniCheck: Efficient Fact-Checking of LLMs on Grounding Documents

Liyan Tang, Philippe Laban, Greg Durrett

Recognizing if LLM output can be grounded in evidence is central to many tasks in NLP: retrieval-augmented generation, summarization, document-grounded dialogue, and more. Current approaches to this kind of fact-checking are based on verifying each piece of a model generation against potential evidence using an LLM. However, this process can be very computationally expensive, requiring many calls to LLMs to check a single response. In this work, we show how to build small models that have GPT-4-level performance but for 400x lower cost. We do this by constructing synthetic training data with GPT-4, which involves creating realistic yet challenging instances of factual errors via a structured generation procedure. Training on this data teaches models to check each fact in the claim and recognize synthesis of information across sentences. For evaluation, we unify pre-existing datasets into a benchmark LLM-AggreFact, collected from recent work on fact-checking and grounding LLM generations. Our best system MiniCheck-FT5 (770M parameters) outperforms all systems of comparable size and reaches GPT-4 accuracy. We release LLM-AggreFact, code for data synthesis, and models.

4/17/2024

cs.CL cs.AI

↗️

Factcheck-Bench: Fine-Grained Evaluation Benchmark for Automatic Fact-checkers

Yuxia Wang, Revanth Gangi Reddy, Zain Muhammad Mujahid, Arnav Arora, Aleksandr Rubashevskii, Jiahui Geng, Osama Mohammed Afzal, Liangming Pan, Nadav Borenstein, Aditya Pillai, Isabelle Augenstein, Iryna Gurevych, Preslav Nakov

The increased use of large language models (LLMs) across a variety of real-world applications calls for mechanisms to verify the factual accuracy of their outputs. In this work, we present a holistic end-to-end solution for annotating the factuality of LLM-generated responses, which encompasses a multi-stage annotation scheme designed to yield detailed labels concerning the verifiability and factual inconsistencies found in LLM outputs. We further construct an open-domain document-level factuality benchmark in three-level granularity: claim, sentence and document, aiming to facilitate the evaluation of automatic fact-checking systems. Preliminary experiments show that FacTool, FactScore and Perplexity.ai are struggling to identify false claims, with the best F1=0.63 by this annotation solution based on GPT-4. Annotation tool, benchmark and code are available at https://github.com/yuxiaw/Factcheck-GPT.

4/17/2024

cs.CL