Mining the Explainability and Generalization: Fact Verification Based on Self-Instruction

2405.12579

Published 5/24/2024 by Guangyao Lu, Yulin Liu

Abstract

Fact-checking based on commercial LLMs has become mainstream. Although these methods offer high explainability, they fall short in accuracy compared to traditional fine-tuning approaches, and data security is also a significant concern. In this paper, we propose a self-instruction based fine-tuning approach for fact-checking that balances accuracy and explainability. Our method consists of Data Augmentation and Improved DPO fine-tuning. The former starts by instructing the model to generate both positive and negative explanations based on claim-evidence pairs and labels, then sampling the dataset according to our customized difficulty standards. The latter employs our proposed improved DPO to fine-tune the model using the generated samples. We fine-tune the smallest-scale LLaMA-7B model and evaluate it on the challenging fact-checking datasets FEVEROUS and HOVER, utilizing four fine-tuning methods and three few-shot learning methods for comparison. The experiments demonstrate that our approach not only retains accuracy comparable to, or even surpassing, traditional fine-tuning methods, but also generates fluent explanation text. Moreover, it also exhibits high generalization performance. Our method is the first to leverage self-supervised learning for fact-checking and innovatively combines contrastive learning and improved DPO in fine-tuning LLMs, as shown in the experiments.

Overview

The paper proposes a novel self-instruction-based fine-tuning approach for fact-checking using large language models (LLMs).
This approach aims to balance accuracy and explainability, addressing the limitations of existing commercial LLM-based fact-checking methods.
The method involves data augmentation and an improved DPO (Direct Preference Optimization) fine-tuning technique.
The authors evaluate their approach on challenging fact-checking datasets and compare it to traditional fine-tuning and few-shot learning methods.

Plain English Explanation

The paper focuses on fact-checking, the process of verifying whether a claim or statement is true. The researchers observe that fact-checking methods built on commercial large language models (LLMs) offer high explainability but tend to be less accurate than traditional fine-tuning approaches. Data security is also a significant concern with these methods.

To address these issues, the researchers propose a new approach that combines self-instruction and fine-tuning. In the self-instruction step, the model is instructed to generate both positive and negative explanations from claim-evidence pairs and their labels. The researchers then use these generated samples to fine-tune the model with an improved version of a technique called DPO (Direct Preference Optimization).
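To make the data flow concrete, here is a minimal sketch of how one claim-evidence pair could be turned into a preference record with a positive (chosen) and a negative (rejected) explanation. The prompt wording, the `PreferencePair` type, and the `build_prompt` and `make_pair` helpers are illustrative assumptions, not the authors' templates. In the paper the explanations are generated by the model itself; here they are written by hand only to show the shape of a record.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PreferencePair:
    """One training record: a prompt plus a preferred and a dispreferred answer."""
    prompt: str
    chosen: str    # explanation consistent with the gold label
    rejected: str  # explanation arguing for the wrong label


def build_prompt(claim: str, evidence: str) -> str:
    # Hypothetical template; the paper's actual instruction text is not given here.
    return (
        "Claim: " + claim + "\n"
        "Evidence: " + evidence + "\n"
        "Is the claim supported or refuted? Explain your answer."
    )


def make_pair(claim: str, evidence: str, positive: str, negative: str) -> PreferencePair:
    # `positive` and `negative` stand in for model-generated explanations.
    return PreferencePair(build_prompt(claim, evidence), positive, negative)


pair = make_pair(
    claim="The Eiffel Tower is in Berlin.",
    evidence="The Eiffel Tower is a wrought-iron tower in Paris, France.",
    positive="Refuted: the evidence places the tower in Paris, not Berlin.",
    negative="Supported: the evidence confirms the tower is in Berlin.",
)
print(pair.prompt)
```

Records of this prompt/chosen/rejected shape are what preference-based fine-tuning methods typically consume. The paper's difficulty-based sampling step is not sketched here because its criteria are not described in this summary.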

The researchers evaluate their approach on two challenging fact-checking datasets and compare it to traditional fine-tuning and few-shot learning methods. They find that their approach not only maintains accuracy comparable to or even better than traditional methods, but also generates fluent explanation text. Additionally, the approach exhibits strong generalization performance.

According to the authors, this is the first work to leverage self-supervised learning for fact-checking and to combine contrastive learning with an improved DPO when fine-tuning LLMs.

Technical Explanation

The paper proposes a self-instruction-based fine-tuning approach for fact-checking using large language models (LLMs). The method consists of two key components:

Data Augmentation: The researchers start by instructing the model to generate both positive and negative explanations based on claim-evidence pairs and labels. They then sample the dataset according to their customized difficulty standards, creating a more diverse and challenging training set.
Improved DPO Fine-tuning: The researchers employ an improved version of Direct Preference Optimization (DPO) to fine-tune the model on the generated samples.
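For orientation, the sketch below computes the standard DPO (Direct Preference Optimization) loss for a single preference pair. It is the baseline objective as commonly formulated, not the authors' improved variant, whose details are not given in this summary. The function names, the scalar log-probability inputs, and the default `beta=0.1` are assumptions chosen for illustration.

```python
import math


def log_sigmoid(x: float) -> float:
    # Numerically stable log(sigmoid(x)).
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,
) -> float:
    """Standard DPO loss for one (chosen, rejected) pair.

    Each argument is the summed log-probability of a full response under
    either the model being trained (policy) or the frozen reference model.
    """
    # Implicit rewards: how much more likely the policy makes each response
    # compared with the reference model.
    chosen_reward = policy_chosen_logp - ref_chosen_logp
    rejected_reward = policy_rejected_logp - ref_rejected_logp
    # The loss shrinks as the chosen response gains on the rejected one.
    return -log_sigmoid(beta * (chosen_reward - rejected_reward))


# Policy identical to the reference: the margin is zero.
baseline = dpo_loss(-5.0, -7.0, -5.0, -7.0)
# Policy that has moved toward the chosen explanation: lower loss.
improved = dpo_loss(-4.0, -8.0, -5.0, -7.0)
print(baseline, improved)
```

When the policy matches the reference model the margin is zero and the loss equals ln 2. Raising the probability of the positive explanation relative to the negative one drives the loss below that value, which is the training signal preference pairs provide.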

The authors fine-tune the smallest-scale LLaMA-7B model and evaluate it on the FEVEROUS and HOVER fact-checking datasets. They compare their approach to four fine-tuning methods and three few-shot learning methods.

The experiments demonstrate that the proposed approach not only retains accuracy comparable to or even surpassing traditional fine-tuning methods, but also generates fluent explanation text. Moreover, the method exhibits high generalization performance.

Critical Analysis

The paper presents a novel and promising approach to fact-checking using LLMs. The researchers' focus on balancing accuracy and explainability is commendable, as explainability is a crucial aspect of fact-checking systems.

However, the paper does not provide much detail on the specific implementation of the data augmentation and improved DPO fine-tuning techniques. It would be helpful to have a more comprehensive description of these methods to understand their full scope and potential limitations.

Additionally, the paper could have explored the scalability and computational efficiency of the proposed approach, as efficiency is a key concern for fact-checking systems that need to operate in real-time.

Finally, the researchers could have discussed the potential ethical implications of their approach, such as its use in multimodal fact-checking systems or its impact on the broader landscape of automated fact-checking.

Conclusion

The paper presents a novel self-instruction-based fine-tuning approach for fact-checking using large language models. This method aims to balance accuracy and explainability, addressing the limitations of existing commercial LLM-based fact-checking techniques.

The key innovations of the proposed approach are the data augmentation and improved DPO fine-tuning techniques, which demonstrate strong performance on challenging fact-checking datasets. This work is a significant step forward in leveraging the power of LLMs for reliable and explainable fact-checking, with potential applications in various domains, such as public health and multimodal fact-checking.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Surprising Efficacy of Fine-Tuned Transformers for Fact-Checking over Larger Language Models

Vinay Setty

In this paper, we explore the challenges associated with establishing an end-to-end fact-checking pipeline in a real-world context, covering over 90 languages. Our real-world experimental benchmarks demonstrate that fine-tuning Transformer models specifically for fact-checking tasks, such as claim detection and veracity prediction, provide superior performance over large language models (LLMs) like GPT-4, GPT-3.5-Turbo, and Mistral-7b. However, we illustrate that LLMs excel in generative tasks such as question decomposition for evidence retrieval. Through extensive evaluation, we show the efficacy of fine-tuned models for fact-checking in a multilingual setting and complex claims that include numerical quantities.

5/1/2024

cs.CL cs.AI

Tell Me Why: Explainable Public Health Fact-Checking with Large Language Models

Majid Zarharan, Pascal Wullschleger, Babak Behkam Kia, Mohammad Taher Pilehvar, Jennifer Foster

This paper presents a comprehensive analysis of explainable fact-checking through a series of experiments, focusing on the ability of large language models to verify public health claims and provide explanations or justifications for their veracity assessments. We examine the effectiveness of zero/few-shot prompting and parameter-efficient fine-tuning across various open and closed-source models, examining their performance in both isolated and joint tasks of veracity prediction and explanation generation. Importantly, we employ a dual evaluation approach comprising previously established automatic metrics and a novel set of criteria through human evaluation. Our automatic evaluation indicates that, within the zero-shot scenario, GPT-4 emerges as the standout performer, but in few-shot and parameter-efficient fine-tuning contexts, open-source models demonstrate their capacity to not only bridge the performance gap but, in some instances, surpass GPT-4. Human evaluation reveals yet more nuance as well as indicating potential problems with the gold explanations.

5/16/2024

cs.CL

Factcheck-Bench: Fine-Grained Evaluation Benchmark for Automatic Fact-checkers

Yuxia Wang, Revanth Gangi Reddy, Zain Muhammad Mujahid, Arnav Arora, Aleksandr Rubashevskii, Jiahui Geng, Osama Mohammed Afzal, Liangming Pan, Nadav Borenstein, Aditya Pillai, Isabelle Augenstein, Iryna Gurevych, Preslav Nakov

The increased use of large language models (LLMs) across a variety of real-world applications calls for mechanisms to verify the factual accuracy of their outputs. In this work, we present a holistic end-to-end solution for annotating the factuality of LLM-generated responses, which encompasses a multi-stage annotation scheme designed to yield detailed labels concerning the verifiability and factual inconsistencies found in LLM outputs. We further construct an open-domain document-level factuality benchmark in three-level granularity: claim, sentence and document, aiming to facilitate the evaluation of automatic fact-checking systems. Preliminary experiments show that FacTool, FactScore and Perplexity.ai are struggling to identify false claims, with the best F1=0.63 by this annotation solution based on GPT-4. Annotation tool, benchmark and code are available at https://github.com/yuxiaw/Factcheck-GPT.

4/17/2024

cs.CL

Learning to Generate Answers with Citations via Factual Consistency Models

Rami Aly, Zhiqiang Tang, Samson Tan, George Karypis

Large Language Models (LLMs) frequently hallucinate, impeding their reliability in mission-critical situations. One approach to address this issue is to provide citations to relevant sources alongside generated content, enhancing the verifiability of generations. However, citing passages accurately in answers remains a substantial challenge. This paper proposes a weakly-supervised fine-tuning method leveraging factual consistency models (FCMs). Our approach alternates between generating texts with citations and supervised fine-tuning with FCM-filtered citation data. Focused learning is integrated into the objective, directing the fine-tuning process to emphasise the factual unit tokens, as measured by an FCM. Results on the ALCE few-shot citation benchmark with various instruction-tuned LLMs demonstrate superior performance compared to in-context learning, vanilla supervised fine-tuning, and state-of-the-art methods, with an average improvement of $34.1$, $15.5$, and $10.5$ citation F$_1$ points, respectively. Moreover, in a domain transfer setting we show that the obtained citation generation ability robustly transfers to unseen datasets. Notably, our citation improvements contribute to the lowest factual error rate across baselines.

6/21/2024

cs.CL