Self-Alignment for Factuality: Mitigating Hallucinations in LLMs via Self-Evaluation

2402.09267

Published 6/12/2024 by Xiaoying Zhang, Baolin Peng, Ye Tian, Jingyan Zhou, Lifeng Jin, Linfeng Song, Haitao Mi, Helen Meng

cs.CL cs.AI

⚙️

Abstract

Despite showing increasingly human-like abilities, large language models (LLMs) often struggle with factual inaccuracies, i.e. hallucinations, even when they hold relevant knowledge. To address these hallucinations, current approaches typically necessitate high-quality human factuality annotations. In this work, we explore Self-Alignment for Factuality, where we leverage the self-evaluation capability of an LLM to provide training signals that steer the model towards factuality. Specifically, we incorporate Self-Eval, a self-evaluation component, to prompt an LLM to validate the factuality of its own generated responses solely based on its internal knowledge. Additionally, we design Self-Knowledge Tuning (SK-Tuning) to augment the LLM's self-evaluation ability by improving the model's confidence estimation and calibration. We then utilize these self-annotated responses to fine-tune the model via Direct Preference Optimization algorithm. We show that the proposed self-alignment approach substantially enhances factual accuracy over Llama family models across three key knowledge-intensive tasks on TruthfulQA and BioGEN.

Create account to get full access

Overview

Large language models (LLMs) often struggle with factual inaccuracies, known as "hallucinations," even when they possess relevant knowledge.
Current approaches to address hallucinations typically require high-quality human factuality annotations, which can be time-consuming and expensive.
This paper explores a novel approach called "Self-Alignment for Factuality," which leverages the self-evaluation capabilities of an LLM to provide training signals that steer the model towards factuality.

Plain English Explanation

Large language models, which are advanced AI systems that can generate human-like text, often make factual mistakes or produce information that is not entirely accurate, even when they have the relevant knowledge. To fix this problem, researchers usually rely on high-quality annotations from human experts, which can be a lot of work.

In this paper, the researchers explore a different approach called "Self-Alignment for Factuality." Instead of using human annotations, they find a way for the language model to evaluate its own responses and provide feedback to itself. This allows the model to learn to be more accurate and factual without needing as much human intervention.

Specifically, the researchers incorporate a "Self-Eval" component that prompts the language model to validate the factuality of its own generated responses based on its internal knowledge. They also design a "Self-Knowledge Tuning" (SK-Tuning) technique to improve the model's ability to estimate and calibrate its own confidence in its responses.

The researchers then use these self-annotated responses to fine-tune the language model, using a technique called "Direct Preference Optimization." This helps the model become more focused on generating factual and accurate information.

The researchers show that this self-alignment approach can significantly improve the factual accuracy of language models on several knowledge-intensive tasks, compared to previous methods.

Technical Explanation

To address the problem of factual inaccuracies, or "hallucinations," in large language models (LLMs), the researchers propose a novel approach called "Self-Alignment for Factuality." This approach leverages the self-evaluation capabilities of the LLM to provide training signals that steer the model towards factuality, without relying on high-quality human annotations.

Specifically, the researchers incorporate a "Self-Eval" component that prompts the LLM to validate the factuality of its own generated responses solely based on its internal knowledge. This allows the model to self-assess the accuracy of its outputs. Additionally, the researchers design a "Self-Knowledge Tuning" (SK-Tuning) technique to improve the model's confidence estimation and calibration, further enhancing its self-evaluation abilities.

The researchers then utilize these self-annotated responses to fine-tune the LLM using a "Direct Preference Optimization" algorithm. This approach aligns the model's preferences towards generating more factual and accurate outputs.

The researchers evaluate their approach on three knowledge-intensive tasks using the TruthfulQA and BioGEN datasets. They demonstrate that the proposed self-alignment approach substantially enhances the factual accuracy of LLaMA family models, outperforming previous methods like FLAME and TruthX.

Critical Analysis

The researchers provide a compelling approach to address the problem of hallucinations in large language models. By leveraging the model's self-evaluation capabilities, they offer a novel solution that reduces the reliance on human annotations, which can be time-consuming and expensive.

However, the researchers acknowledge that their approach may not be a complete solution to the hallucination problem. The self-evaluation and self-knowledge tuning techniques could be further improved, and there may be additional factors that contribute to hallucinations that are not addressed in this work.

Additionally, the researchers note that their approach is evaluated on specific knowledge-intensive tasks, and its effectiveness may vary across different types of language models and application domains. Further research is needed to explore the generalizability of the self-alignment approach.

It would also be valuable to investigate the long-term stability of the factual accuracy improvements achieved through the self-alignment method, as language models can still exhibit drift over time or in different contexts.

Overall, the researchers present a promising direction for addressing the hallucination problem in large language models, but additional research and refinement may be necessary to fully realize the potential of this approach.

Conclusion

The paper introduces a novel approach called "Self-Alignment for Factuality" to address the problem of factual inaccuracies, or hallucinations, in large language models. By leveraging the model's self-evaluation capabilities, the researchers are able to provide training signals that steer the model towards more factual and accurate outputs, without relying on high-quality human annotations.

The key innovations of this approach include the incorporation of a "Self-Eval" component and the design of "Self-Knowledge Tuning" (SK-Tuning) techniques, which together enhance the model's ability to assess the factuality of its own responses. The researchers then utilize these self-annotated responses to fine-tune the model using a "Direct Preference Optimization" algorithm.

The results demonstrate that the proposed self-alignment approach can substantially improve the factual accuracy of LLaMA family models across several knowledge-intensive tasks, outperforming previous methods like FLAME and TruthX.

While the researchers acknowledge that their approach may not be a complete solution to the hallucination problem, it represents a promising step forward in developing more reliable and factual large language models. Further research and refinement of the self-alignment techniques could lead to even more significant advancements in this critical area of AI development.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

💬

FLAME: Factuality-Aware Alignment for Large Language Models

Sheng-Chieh Lin, Luyu Gao, Barlas Oguz, Wenhan Xiong, Jimmy Lin, Wen-tau Yih, Xilun Chen

Alignment is a standard procedure to fine-tune pre-trained large language models (LLMs) to follow natural language instructions and serve as helpful AI assistants. We have observed, however, that the conventional alignment process fails to enhance the factual accuracy of LLMs, and often leads to the generation of more false facts (i.e. hallucination). In this paper, we study how to make the LLM alignment process more factual, by first identifying factors that lead to hallucination in both alignment steps: supervised fine-tuning (SFT) and reinforcement learning (RL). In particular, we find that training the LLM on new knowledge or unfamiliar texts can encourage hallucination. This makes SFT less factual as it trains on human labeled data that may be novel to the LLM. Furthermore, reward functions used in standard RL can also encourage hallucination, because it guides the LLM to provide more helpful responses on a diverse set of instructions, often preferring longer and more detailed responses. Based on these observations, we propose factuality-aware alignment, comprised of factuality-aware SFT and factuality-aware RL through direct preference optimization. Experiments show that our proposed factuality-aware alignment guides LLMs to output more factual responses while maintaining instruction-following capability.

5/3/2024

cs.CL cs.AI

🧠

Towards a Holistic Evaluation of LLMs on Factual Knowledge Recall

Jiaqing Yuan, Lin Pan, Chung-Wei Hang, Jiang Guo, Jiarong Jiang, Bonan Min, Patrick Ng, Zhiguo Wang

Large language models (LLMs) have shown remarkable performance on a variety of NLP tasks, and are being rapidly adopted in a wide range of use cases. It is therefore of vital importance to holistically evaluate the factuality of their generated outputs, as hallucinations remain a challenging issue. In this work, we focus on assessing LLMs' ability to recall factual knowledge learned from pretraining, and the factors that affect this ability. To that end, we construct FACT-BENCH, a representative benchmark covering 20 domains, 134 property types, 3 answer types, and different knowledge popularity levels. We benchmark 31 models from 10 model families and provide a holistic assessment of their strengths and weaknesses. We observe that instruction-tuning hurts knowledge recall, as pretraining-only models consistently outperform their instruction-tuned counterparts, and positive effects of model scaling, as larger models outperform smaller ones for all model families. However, the best performance from GPT-4 still represents a large gap with the upper-bound. We additionally study the role of in-context exemplars using counterfactual demonstrations, which lead to significant degradation of factual knowledge recall for large models. By further decoupling model known and unknown knowledge, we find the degradation is attributed to exemplars that contradict a model's known knowledge, as well as the number of such exemplars. Lastly, we fine-tune LLaMA-7B in different settings of known and unknown knowledge. In particular, fine-tuning on a model's known knowledge is beneficial, and consistently outperforms fine-tuning on unknown and mixed knowledge. We will make our benchmark publicly available.

4/26/2024

cs.CL cs.AI cs.LG

💬

Mitigating Hallucinations in Large Language Models via Self-Refinement-Enhanced Knowledge Retrieval

Mengjia Niu, Hao Li, Jie Shi, Hamed Haddadi, Fan Mo

Large language models (LLMs) have demonstrated remarkable capabilities across various domains, although their susceptibility to hallucination poses significant challenges for their deployment in critical areas such as healthcare. To address this issue, retrieving relevant facts from knowledge graphs (KGs) is considered a promising method. Existing KG-augmented approaches tend to be resource-intensive, requiring multiple rounds of retrieval and verification for each factoid, which impedes their application in real-world scenarios. In this study, we propose Self-Refinement-Enhanced Knowledge Graph Retrieval (Re-KGR) to augment the factuality of LLMs' responses with less retrieval efforts in the medical field. Our approach leverages the attribution of next-token predictive probability distributions across different tokens, and various model layers to primarily identify tokens with a high potential for hallucination, reducing verification rounds by refining knowledge triples associated with these tokens. Moreover, we rectify inaccurate content using retrieved knowledge in the post-processing stage, which improves the truthfulness of generated responses. Experimental results on a medical dataset demonstrate that our approach can enhance the factual capability of LLMs across various foundational models as evidenced by the highest scores on truthfulness.

5/13/2024

cs.CL cs.LG

TruthX: Alleviating Hallucinations by Editing Large Language Models in Truthful Space

Shaolei Zhang, Tian Yu, Yang Feng

Large Language Models (LLMs) sometimes suffer from producing hallucinations, especially LLMs may generate untruthful responses despite knowing the correct knowledge. Activating the truthfulness within LLM is the key to fully unlocking LLM's knowledge potential. In this paper, we propose TruthX, an inference-time intervention method to activate the truthfulness of LLM by identifying and editing the features within LLM's internal representations that govern the truthfulness. TruthX employs an auto-encoder to map LLM's representations into semantic and truthful latent spaces respectively, and applies contrastive learning to identify a truthful editing direction within the truthful space. During inference, by editing LLM's internal representations in truthful space, TruthX effectively enhances the truthfulness of LLM. Experiments show that TruthX improves the truthfulness of 13 advanced LLMs by an average of 20% on TruthfulQA benchmark. Further analyses suggest that TruthX can control LLM to produce truthful or hallucinatory responses via editing only one vector in LLM's internal representations.

6/6/2024

cs.CL cs.AI cs.LG