FLAME: Factuality-Aware Alignment for Large Language Models

2405.01525

Published 5/3/2024 by Sheng-Chieh Lin, Luyu Gao, Barlas Oguz, Wenhan Xiong, Jimmy Lin, Wen-tau Yih, Xilun Chen


Abstract

Alignment is a standard procedure to fine-tune pre-trained large language models (LLMs) to follow natural language instructions and serve as helpful AI assistants. We have observed, however, that the conventional alignment process fails to enhance the factual accuracy of LLMs, and often leads to the generation of more false facts (i.e., hallucination). In this paper, we study how to make the LLM alignment process more factual, by first identifying factors that lead to hallucination in both alignment steps: supervised fine-tuning (SFT) and reinforcement learning (RL). In particular, we find that training the LLM on new knowledge or unfamiliar texts can encourage hallucination. This makes SFT less factual as it trains on human-labeled data that may be novel to the LLM. Furthermore, reward functions used in standard RL can also encourage hallucination, because they guide the LLM to provide more helpful responses on a diverse set of instructions, often preferring longer and more detailed responses. Based on these observations, we propose factuality-aware alignment, comprised of factuality-aware SFT and factuality-aware RL through direct preference optimization. Experiments show that our proposed factuality-aware alignment guides LLMs to output more factual responses while maintaining instruction-following capability.


Overview

Large language models (LLMs) are often fine-tuned through a process called "alignment" to make them better at following natural language instructions and serving as helpful AI assistants.
However, the conventional alignment process often fails to improve the factual accuracy of LLMs and can even lead them to generate more false facts (i.e., hallucination).
This paper examines how to make the LLM alignment process more factual by identifying the factors that lead to hallucination in both alignment steps: supervised fine-tuning (SFT) and reinforcement learning (RL).

Plain English Explanation

The paper explores ways to improve the process of training large language models (LLMs) to be better at following instructions and helping people, while also being more accurate in the information they provide. The current approach, called "alignment," can sometimes make the models generate more false information, which is a problem.

The researchers looked at what causes this issue, finding that training the models on new or unfamiliar knowledge or texts can encourage them to make up information (hallucination). This makes the supervised fine-tuning step less accurate, as the models are learning from data that may be novel to them.

Additionally, the standard reward functions used in reinforcement learning can also lead to more hallucination, as they incentivize the models to provide longer and more detailed responses, even if the information is not entirely factual.

Based on these insights, the researchers propose a "factuality-aware" approach to alignment, which involves modifications to both the supervised fine-tuning and reinforcement learning stages. Experiments show that this new method helps the LLMs provide more factual responses while still maintaining their ability to follow instructions well.

Technical Explanation

The paper begins by observing that the conventional LLM alignment process, which involves supervised fine-tuning (SFT) and reinforcement learning (RL), fails to enhance the factual accuracy of the models and can even lead to increased hallucination (generation of false facts).

To address this issue, the authors first identify the factors that contribute to hallucination in both the SFT and RL alignment steps. They find that training the LLM on new or unfamiliar knowledge or texts can encourage hallucination, making the SFT process less factual. Additionally, the standard reward functions used in RL can also incentivize the model to provide longer and more detailed responses, even if the information is not entirely accurate.
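The reward-function point can be made concrete with a toy comparison. The sketch below is purely illustrative and is not taken from the paper: the `Response` class, the claim counts, and both scoring functions are hypothetical stand-ins. It only shows how a reward that grows with the amount of detail can prefer a longer answer even when a larger share of its claims is unsupported.

```python
from dataclasses import dataclass


@dataclass
class Response:
    """Hypothetical container for a model response and its claim counts."""
    text: str
    num_claims: int      # total atomic claims made in the response
    num_supported: int   # claims a fact-checker would judge as supported


def length_biased_reward(r: Response) -> float:
    """Assumed helpfulness-style reward that simply grows with detail."""
    return float(r.num_claims)


def factual_precision(r: Response) -> float:
    """Fraction of claims that are supported (1.0 when there are no claims)."""
    return r.num_supported / r.num_claims if r.num_claims else 1.0


# Made-up numbers chosen only to illustrate the tension.
short = Response("brief, accurate answer", num_claims=3, num_supported=3)
long = Response("long, detailed answer", num_claims=10, num_supported=6)

preferred_by_reward = max([short, long], key=length_biased_reward)
preferred_by_facts = max([short, long], key=factual_precision)

print("reward prefers:", preferred_by_reward.text)
print("factual precision prefers:", preferred_by_facts.text)
```

With these made-up numbers the two criteria disagree, which is the tension the authors describe between helpfulness-oriented rewards and factual accuracy.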

Based on these insights, the researchers propose a "factuality-aware" approach to alignment, which includes factuality-aware SFT and factuality-aware RL through direct preference optimization. Experiments show that this new method helps LLMs generate more factual responses while maintaining their instruction-following capabilities.
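For readers unfamiliar with direct preference optimization (DPO), the following minimal sketch shows the standard per-pair DPO loss together with one assumed way of turning factuality scores into a preference pair. The summary does not describe how the authors build their pairs, so `make_preference_pair`, the factuality scores, and the default `beta` value are hypothetical illustrations rather than the paper's procedure.

```python
import math


def dpo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """Standard DPO loss for one (chosen, rejected) pair.

    Inputs are sequence log-probabilities under the policy being trained
    and under a frozen reference model. beta=0.1 is an assumed default.
    """
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    z = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(z)) computed in a numerically stable way.
    if z >= 0:
        return math.log1p(math.exp(-z))
    return -z + math.log1p(math.exp(z))


def make_preference_pair(candidates):
    """Hypothetical helper: pick the most and least factual candidates.

    candidates: list of (response_text, factuality_score) tuples, where the
    score is assumed to come from some external factuality scorer.
    Returns (chosen_text, rejected_text).
    """
    ranked = sorted(candidates, key=lambda c: c[1], reverse=True)
    return ranked[0][0], ranked[-1][0]


chosen, rejected = make_preference_pair([
    ("answer A", 0.9),
    ("answer B", 0.4),
    ("answer C", 0.7),
])
loss = dpo_loss(policy_chosen_logp=-1.0, policy_rejected_logp=-3.0,
                ref_chosen_logp=-2.0, ref_rejected_logp=-2.0)
print(chosen, rejected, loss)
```

The loss falls below log 2 once the policy assigns relatively more probability to the chosen response than the reference model does, so pairs ranked by factuality push the model toward the more factual response.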

Critical Analysis

The paper provides a thoughtful analysis of the factors that contribute to hallucination in the LLM alignment process and proposes a novel approach to address this issue. The researchers' insights about the role of unfamiliar knowledge and standard reward functions in encouraging hallucination are particularly valuable.

However, the paper could have delved deeper into the potential limitations or caveats of their proposed factuality-aware alignment approach. For example, it would be interesting to understand how this method performs on a wider range of tasks and datasets, and whether there are any trade-offs in terms of instruction-following or other capabilities.

Additionally, the paper could have discussed the broader implications of this research, such as its potential impact on the development of more trustworthy and reliable AI assistants, and the challenges involved in balancing factual accuracy with other desirable qualities like helpfulness and engagement.

Conclusion

This paper presents a critical examination of the conventional LLM alignment process and proposes a new factuality-aware approach to address the issue of hallucination. By identifying the factors that contribute to the generation of false facts, the researchers have developed a method that can help LLMs provide more accurate information while maintaining their ability to follow natural language instructions.

The findings of this study have important implications for the development of AI assistants that are not only helpful, but also trustworthy and reliable. As the use of LLMs continues to grow in various applications, ensuring their factual accuracy will be crucial for building public confidence and ensuring the responsible deployment of these powerful technologies.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers


Self-Alignment for Factuality: Mitigating Hallucinations in LLMs via Self-Evaluation

Xiaoying Zhang, Baolin Peng, Ye Tian, Jingyan Zhou, Lifeng Jin, Linfeng Song, Haitao Mi, Helen Meng

Despite showing increasingly human-like abilities, large language models (LLMs) often struggle with factual inaccuracies, i.e. hallucinations, even when they hold relevant knowledge. To address these hallucinations, current approaches typically necessitate high-quality human factuality annotations. In this work, we explore Self-Alignment for Factuality, where we leverage the self-evaluation capability of an LLM to provide training signals that steer the model towards factuality. Specifically, we incorporate Self-Eval, a self-evaluation component, to prompt an LLM to validate the factuality of its own generated responses solely based on its internal knowledge. Additionally, we design Self-Knowledge Tuning (SK-Tuning) to augment the LLM's self-evaluation ability by improving the model's confidence estimation and calibration. We then utilize these self-annotated responses to fine-tune the model via Direct Preference Optimization algorithm. We show that the proposed self-alignment approach substantially enhances factual accuracy over Llama family models across three key knowledge-intensive tasks on TruthfulQA and BioGEN.

6/12/2024

cs.CL cs.AI

Beyond Under-Alignment: Atomic Preference Enhanced Factuality Tuning for Large Language Models

Hongbang Yuan, Yubo Chen, Pengfei Cao, Zhuoran Jin, Kang Liu, Jun Zhao

Large language models (LLMs) have achieved remarkable success but still tend to generate factually erroneous responses, a phenomenon known as hallucination. A recent trend is to use preference learning to fine-tune models to align with factuality. However, existing work primarily evaluates fine-tuned models on in-domain (ID) datasets and the factuality on out-of-domain (OOD) datasets remains underexplored. In this paper, we conduct a comprehensive evaluation of the factuality of different models tuned by various preference learning algorithms and demonstrate that their performance on OOD datasets either increases minimally or decreases. Subsequently, we reveal that the main cause of model's failure to uphold factuality under a distribution shift is under-alignment, rather than over-alignment, by analyzing the token distribution shift of the models before and after tuning. Finally, we propose APEFT (Atomic Preference Enhanced Factuality Tuning), a framework that enhances model's awareness of factuality at the granularity of individual facts. Extensive experiments demonstrate that APEFT improves model performance by an average of 3.45% on both ID and OOD datasets, which is highly effective.

6/28/2024

cs.CL cs.AI

Mitigating Large Language Model Hallucination with Faithful Finetuning

Minda Hu, Bowei He, Yufei Wang, Liangyou Li, Chen Ma, Irwin King

Large language models (LLMs) have demonstrated remarkable performance on various natural language processing tasks. However, they are prone to generating fluent yet untruthful responses, known as hallucinations. Hallucinations can lead to the spread of misinformation and cause harm in critical applications. Mitigating hallucinations is challenging as they arise from factors such as noisy data, model overconfidence, lack of knowledge, and the generation process itself. Recent efforts have attempted to address this issue through representation editing and decoding algorithms, reducing hallucinations without major structural changes or retraining. However, these approaches either implicitly edit LLMs' behavior in latent space or suppress the tendency to output unfaithful results during decoding instead of explicitly modeling on hallucination. In this work, we introduce Faithful Finetuning (F2), a novel method that explicitly models the process of faithful question answering through carefully designed loss functions during fine-tuning. We conduct extensive experiments on popular datasets and demonstrate that F2 achieves significant improvements over vanilla models and baselines.

6/18/2024

cs.CL


Beyond Imitation: Leveraging Fine-grained Quality Signals for Alignment

Geyang Guo, Ranchi Zhao, Tianyi Tang, Wayne Xin Zhao, Ji-Rong Wen

Alignment with human preference is a desired property of large language models (LLMs). Currently, the main alignment approach is based on reinforcement learning from human feedback (RLHF). Despite the effectiveness of RLHF, it is intricate to implement and train, thus recent studies explore how to develop alternative alignment approaches based on supervised fine-tuning (SFT). A major limitation of SFT is that it essentially does imitation learning, which cannot fully understand what the expected behaviors are. To address this issue, we propose an improved alignment approach named FIGA. Different from prior methods, we incorporate fine-grained (i.e., token or phrase level) quality signals that are derived by contrasting good and bad responses. Our approach has made two major contributions. Firstly, we curate a refined alignment dataset that pairs initial responses and the corresponding revised ones. Secondly, we devise a new loss function that can leverage fine-grained quality signals to instruct the learning of LLMs for alignment. Extensive experiments have demonstrated the effectiveness of our approaches by comparing a number of competitive baselines.

4/16/2024

cs.CL