Surprising Efficacy of Fine-Tuned Transformers for Fact-Checking over Larger Language Models

2402.12147

Published 5/1/2024 by Vinay Setty

💬

Abstract

In this paper, we explore the challenges associated with establishing an end-to-end fact-checking pipeline in a real-world context, covering over 90 languages. Our real-world experimental benchmarks demonstrate that fine-tuning Transformer models specifically for fact-checking tasks, such as claim detection and veracity prediction, provide superior performance over large language models (LLMs) like GPT-4, GPT-3.5-Turbo, and Mistral-7b. However, we illustrate that LLMs excel in generative tasks such as question decomposition for evidence retrieval. Through extensive evaluation, we show the efficacy of fine-tuned models for fact-checking in a multilingual setting and complex claims that include numerical quantities.

Create account to get full access

Overview

The paper explores the challenges of building an end-to-end fact-checking pipeline that works across over 90 languages.
The researchers found that fine-tuning transformer models specifically for fact-checking tasks, like claim detection and veracity prediction, outperformed large language models (LLMs) like GPT-4, GPT-3.5-Turbo, and Mistral-7b.
However, the LLMs excelled at generative tasks like question decomposition for evidence retrieval.
The paper demonstrates the effectiveness of fine-tuned models for multilingual fact-checking, even for complex claims involving numerical quantities.

Plain English Explanation

The researchers in this paper looked at the challenges of creating a fact-checking system that could work across a wide range of languages, covering over 90 different languages. They found that by taking large language models (like GPT-4 or GPT-3.5-Turbo) and fine-tuning them specifically for fact-checking tasks, such as identifying claims and predicting their truthfulness, the models performed better than the original large language models.

However, the large language models were still better at tasks like breaking down questions to help find relevant evidence to check facts. So the researchers showed that a combination of fine-tuned models and large language models could work well together in a complete fact-checking system.

The paper also demonstrated that these fine-tuned models were effective at fact-checking in multiple languages, and could even handle more complex claims that involved numerical information.

Technical Explanation

The researchers explored the challenges of building an end-to-end fact-checking pipeline that could operate across a diverse set of over 90 languages. Their experiments showed that fine-tuning transformer models specifically for fact-checking tasks, such as claim detection and veracity prediction, outperformed using large language models (LLMs) like GPT-4, GPT-3.5-Turbo, and Mistral-7b directly.

At the same time, the researchers found that the LLMs were superior at generative tasks like question decomposition, which is important for retrieving relevant evidence to fact-check claims. This suggests that a combined approach leveraging both fine-tuned models and LLMs could be an effective way to build a comprehensive fact-checking system.

The paper's extensive evaluation demonstrated the efficacy of the fine-tuned models for multilingual fact-checking and even for handling more complex claims that included numerical quantities.

Critical Analysis

The paper provides a thorough analysis of the challenges involved in building a real-world, end-to-end fact-checking pipeline that can operate across a wide range of languages. While the findings around the superior performance of fine-tuned models for specific fact-checking tasks are compelling, the researchers acknowledge that there is still room for improvement.

One potential limitation is that the paper does not delve into the details of the fine-tuning process or the specific architectures used. This makes it difficult to fully assess the generalizability of the approach and understand how it might be replicated or adapted to other domains.

Additionally, the paper focuses primarily on textual fact-checking and does not explore the potential benefits of incorporating multimodal information, such as images or videos, which could be crucial for verifying certain types of claims.

Further research could also investigate the performance of the fine-tuned models on more complex, nuanced claims, as well as their robustness to adversarial attacks or attempts to game the system.

Conclusion

This paper presents a comprehensive exploration of the challenges involved in building a multilingual, end-to-end fact-checking pipeline. The key finding is that fine-tuning transformer models specifically for fact-checking tasks can outperform using large language models directly, while the LLMs still have advantages in certain generative subtasks.

The demonstrated effectiveness of the fine-tuned models for multilingual fact-checking and handling complex claims involving numerical information suggests that this approach could be a valuable contribution to the ongoing effort to combat the spread of misinformation and disinformation, particularly in a global context.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Multimodal Large Language Models to Support Real-World Fact-Checking

Jiahui Geng, Yova Kementchedjhieva, Preslav Nakov, Iryna Gurevych

Multimodal large language models (MLLMs) carry the potential to support humans in processing vast amounts of information. While MLLMs are already being used as a fact-checking tool, their abilities and limitations in this regard are understudied. Here is aim to bridge this gap. In particular, we propose a framework for systematically assessing the capacity of current multimodal models to facilitate real-world fact-checking. Our methodology is evidence-free, leveraging only these models' intrinsic knowledge and reasoning capabilities. By designing prompts that extract models' predictions, explanations, and confidence levels, we delve into research questions concerning model accuracy, robustness, and reasons for failure. We empirically find that (1) GPT-4V exhibits superior performance in identifying malicious and misleading multimodal claims, with the ability to explain the unreasonable aspects and underlying motives, and (2) existing open-source models exhibit strong biases and are highly sensitive to the prompt. Our study offers insights into combating false multimodal information and building secure, trustworthy multimodal models. To the best of our knowledge, we are the first to evaluate MLLMs for real-world fact-checking.

4/29/2024

cs.CL cs.AI

↗️

Factcheck-Bench: Fine-Grained Evaluation Benchmark for Automatic Fact-checkers

Yuxia Wang, Revanth Gangi Reddy, Zain Muhammad Mujahid, Arnav Arora, Aleksandr Rubashevskii, Jiahui Geng, Osama Mohammed Afzal, Liangming Pan, Nadav Borenstein, Aditya Pillai, Isabelle Augenstein, Iryna Gurevych, Preslav Nakov

The increased use of large language models (LLMs) across a variety of real-world applications calls for mechanisms to verify the factual accuracy of their outputs. In this work, we present a holistic end-to-end solution for annotating the factuality of LLM-generated responses, which encompasses a multi-stage annotation scheme designed to yield detailed labels concerning the verifiability and factual inconsistencies found in LLM outputs. We further construct an open-domain document-level factuality benchmark in three-level granularity: claim, sentence and document, aiming to facilitate the evaluation of automatic fact-checking systems. Preliminary experiments show that FacTool, FactScore and Perplexity.ai are struggling to identify false claims, with the best F1=0.63 by this annotation solution based on GPT-4. Annotation tool, benchmark and code are available at https://github.com/yuxiaw/Factcheck-GPT.

4/17/2024

cs.CL

Fine-Tuning or Fine-Failing? Debunking Performance Myths in Large Language Models

Scott Barnett, Zac Brannelly, Stefanus Kurniawan, Sheng Wong

Large Language Models (LLMs) have the unique capability to understand and generate human-like text from input queries. When fine-tuned, these models show enhanced performance on domain-specific queries. OpenAI highlights the process of fine-tuning, stating: To fine-tune a model, you are required to provide at least 10 examples. We typically see clear improvements from fine-tuning on 50 to 100 training examples, but the right number varies greatly based on the exact use case. This study extends this concept to the integration of LLMs within Retrieval-Augmented Generation (RAG) pipelines, which aim to improve accuracy and relevance by leveraging external corpus data for information retrieval. However, RAG's promise of delivering optimal responses often falls short in complex query scenarios. This study aims to specifically examine the effects of fine-tuning LLMs on their ability to extract and integrate contextual data to enhance the performance of RAG systems across multiple domains. We evaluate the impact of fine-tuning on the LLMs' capacity for data extraction and contextual understanding by comparing the accuracy and completeness of fine-tuned models against baseline performances across datasets from multiple domains. Our findings indicate that fine-tuning resulted in a decline in performance compared to the baseline models, contrary to the improvements observed in standalone LLM applications as suggested by OpenAI. This study highlights the need for vigorous investigation and validation of fine-tuned models for domain-specific tasks.

7/2/2024

cs.CL cs.AI

🤯

Mining the Explainability and Generalization: Fact Verification Based on Self-Instruction

Guangyao Lu, Yulin Liu

Fact-checking based on commercial LLMs has become mainstream. Although these methods offer high explainability, it falls short in accuracy compared to traditional fine-tuning approaches, and data security is also a significant concern. In this paper, we propose a self-instruction based fine-tuning approach for fact-checking that balances accuracy and explainability. Our method consists of Data Augmentation and Improved DPO fine-tuning. The former starts by instructing the model to generate both positive and negative explanations based on claim-evidence pairs and labels, then sampling the dataset according to our customized difficulty standards. The latter employs our proposed improved DPO to fine-tune the model using the generated samples. We fine-tune the smallest-scale LLaMA-7B model and evaluate it on the challenging fact-checking datasets FEVEROUS and HOVER, utilizing four fine-tuning methods and three few-shot learning methods for comparison. The experiments demonstrate that our approach not only retains accuracy comparable to, or even surpassing, traditional fine-tuning methods, but also generates fluent explanation text. Moreover, it also exhibit high generalization performance. Our method is the first to leverage self-supervised learning for fact-checking and innovatively combines contrastive learning and improved DPO in fine-tuning LLMs, as shown in the experiments.

5/24/2024

cs.CL