RAGTruth: A Hallucination Corpus for Developing Trustworthy Retrieval-Augmented Language Models

Read original: arXiv:2401.00396 - Published 5/20/2024 by Cheng Niu, Yuanhao Wu, Juno Zhu, Siliang Xu, Kashun Shum, Randy Zhong, Juntong Song, Tong Zhang

Overview

• This paper introduces a new dataset called RAGTruth, which is designed to help develop more trustworthy retrieval-augmented language models (RALMs) that can avoid hallucinating information.

• RALMs are a type of AI system that combines large language models (LLMs) with information retrieval, aiming to enhance the reliability and trustworthiness of model outputs. However, these models can still produce hallucinated or fabricated information.

• The RAGTruth dataset provides a benchmark for evaluating and improving the trustworthiness of RALMs, with the goal of developing more reliable models for applications like question answering and task-oriented dialog.

Plain English Explanation

The paper introduces a new dataset called RAGTruth that is designed to help improve the trustworthiness of a type of AI system called a retrieval-augmented language model (RALM). RALMs combine large language models (which are trained on a lot of text data) with information retrieval, with the goal of making the outputs of these models more reliable and trustworthy.

However, even with this approach, RALMs can still sometimes produce information that is made up or fabricated, which is called "hallucination." The RAGTruth dataset is meant to provide a way to test and improve RALMs so that they are less likely to hallucinate information and instead provide outputs that are grounded in real, factual information.

The hope is that by using the RAGTruth dataset, researchers and developers can create more trustworthy RALMs that can be used for things like answering questions or having conversations, where it's important that the information provided is accurate and reliable.

Technical Explanation

The paper presents a new dataset called RAGTruth that is designed to help develop more trustworthy retrieval-augmented language models (RALMs). RALMs combine large language models (LLMs) with information retrieval, with the goal of enhancing the reliability and trustworthiness of model outputs.

However, even with retrieval augmentation, RALMs can still produce claims that are unsupported by, or contradict, the retrieved evidence. The RAGTruth dataset aims to provide a benchmark for measuring and reducing such hallucinations across various domains and tasks within standard RAG setups.

The corpus comprises nearly 18,000 naturally generated responses from diverse LLMs using RAG. Each response has been manually annotated at both the response level and the word level, including a rating of hallucination intensity. This makes it possible to measure how often different LLMs hallucinate and to evaluate how well detection methods separate supported from unsupported content. The authors benchmark several existing hallucination detection methods and show that a relatively small LLM fine-tuned on RAGTruth can reach performance competitive with prompt-based approaches built on state-of-the-art models such as GPT-4.
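The paper's abstract describes annotations at both the response level and the word level, with an intensity rating for each hallucination. A minimal Python sketch of how such a record might be represented, and how response-level detection could be scored, is shown here. The class names, field names, intensity labels, example texts, and detector predictions are all hypothetical and are not taken from the released corpus or the authors' code.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class HallucinationSpan:
    start: int       # character offset where the unsupported text begins
    end: int         # character offset one past the end (exclusive)
    intensity: str   # hypothetical label, e.g. "low" or "high"


@dataclass
class AnnotatedResponse:
    context: str     # retrieved passage given to the model
    response: str    # text the model generated
    spans: List[HallucinationSpan] = field(default_factory=list)

    @property
    def is_hallucinated(self) -> bool:
        # Response-level label derived from the span-level annotations.
        return len(self.spans) > 0


def span_for(response: str, unsupported: str, intensity: str) -> HallucinationSpan:
    # Locate the unsupported substring instead of hard-coding offsets.
    start = response.find(unsupported)
    if start < 0:
        raise ValueError("unsupported text not found in response")
    return HallucinationSpan(start, start + len(unsupported), intensity)


def response_level_scores(gold: List[bool], predicted: List[bool]) -> Dict[str, float]:
    # Precision, recall, and F1 for the "hallucinated" class.
    tp = sum(g and p for g, p in zip(gold, predicted))
    fp = sum((not g) and p for g, p in zip(gold, predicted))
    fn = sum(g and (not p) for g, p in zip(gold, predicted))
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Toy records invented for illustration only.
r1 = "The cafe opens at 8 am and serves brunch all day."
r3 = "The report was published in 2019."
examples = [
    AnnotatedResponse(
        context="The cafe opens at 8 am and closes at 5 pm.",
        response=r1,
        spans=[span_for(r1, "serves brunch all day", "high")],
    ),
    AnnotatedResponse(
        context="Paris is the capital of France.",
        response="Paris is the capital of France.",
    ),
    AnnotatedResponse(
        context="The report was published in 2021.",
        response=r3,
        spans=[span_for(r3, "2019", "high")],
    ),
]

gold = [ex.is_hallucinated for ex in examples]
predicted = [True, True, False]  # output of a hypothetical detector
scores = response_level_scores(gold, predicted)
print(scores)
```

Storing spans as character offsets keeps the word-level annotation tied to the exact response text, and the response-level label falls out of it for free. A span-level metric would compare predicted and gold offsets rather than the booleans used here.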

Critical Analysis

The RAGTruth dataset and the associated research presented in this paper represent an important step forward in the development of more trustworthy and reliable retrieval-augmented language models. By providing a benchmark for evaluating the ability of RALMs to distinguish between factual and hallucinated information, the authors have laid the groundwork for improving the trustworthiness of these models.

However, the authors acknowledge several limitations and avenues for further research. For example, the corpus covers a fixed set of domains and tasks within standard RAG frameworks, and it remains to be seen how well the insights from this work will translate to other RALM applications.

Additionally, the authors note that current state-of-the-art models struggle to achieve high performance on the RAGTruth benchmark, suggesting that more advanced techniques may be needed to address the hallucination problem. Further research into areas like conformal prediction and other uncertainty-aware modeling approaches could help address these challenges.

Overall, the RAGTruth dataset and the insights from this paper represent an important contribution to the field of trustworthy AI, and will hopefully spur further advancements in the development of reliable and trustworthy retrieval-augmented language models.

Conclusion

The paper introduces the RAGTruth dataset, which is designed to help develop more trustworthy retrieval-augmented language models (RALMs). RALMs combine large language models with information retrieval to enhance the reliability and trustworthiness of their outputs, but they can still produce hallucinated or fabricated information.

The RAGTruth dataset provides a benchmark for evaluating and improving the ability of RALMs to distinguish between factual and hallucinated information across various domains and tasks. By using this dataset, researchers and developers can work towards creating more trustworthy and reliable AI systems that can be deployed in real-world settings where accuracy and trustworthiness are critical.

Related Papers

RAGTruth: A Hallucination Corpus for Developing Trustworthy Retrieval-Augmented Language Models

Cheng Niu, Yuanhao Wu, Juno Zhu, Siliang Xu, Kashun Shum, Randy Zhong, Juntong Song, Tong Zhang

Retrieval-augmented generation (RAG) has become a main technique for alleviating hallucinations in large language models (LLMs). Despite the integration of RAG, LLMs may still present unsupported or contradictory claims to the retrieved contents. In order to develop effective hallucination prevention strategies under RAG, it is important to create benchmark datasets that can measure the extent of hallucination. This paper presents RAGTruth, a corpus tailored for analyzing word-level hallucinations in various domains and tasks within the standard RAG frameworks for LLM applications. RAGTruth comprises nearly 18,000 naturally generated responses from diverse LLMs using RAG. These responses have undergone meticulous manual annotations at both the individual cases and word levels, incorporating evaluations of hallucination intensity. We not only benchmark hallucination frequencies across different LLMs, but also critically assess the effectiveness of several existing hallucination detection methodologies. Furthermore, we show that using a high-quality dataset such as RAGTruth, it is possible to finetune a relatively small LLM and achieve a competitive level of performance in hallucination detection when compared to the existing prompt-based approaches using state-of-the-art large language models such as GPT-4.

5/20/2024

RAGged Edges: The Double-Edged Sword of Retrieval-Augmented Chatbots

Philip Feldman, James R. Foulds, Shimei Pan

Large language models (LLMs) like ChatGPT demonstrate the remarkable progress of artificial intelligence. However, their tendency to hallucinate -- generate plausible but false information -- poses a significant challenge. This issue is critical, as seen in recent court cases where ChatGPT's use led to citations of non-existent legal rulings. This paper explores how Retrieval-Augmented Generation (RAG) can counter hallucinations by integrating external knowledge with prompts. We empirically evaluate RAG against standard LLMs using prompts designed to induce hallucinations. Our results show that RAG increases accuracy in some cases, but can still be misled when prompts directly contradict the model's pre-trained understanding. These findings highlight the complex nature of hallucinations and the need for more robust solutions to ensure LLM reliability in real-world applications. We offer practical recommendations for RAG deployment and discuss implications for the development of more trustworthy LLMs.

6/13/2024

Reducing hallucination in structured outputs via Retrieval-Augmented Generation

Patrice Béchard, Orlando Marquez Ayala

A common and fundamental limitation of Generative AI (GenAI) is its propensity to hallucinate. While large language models (LLM) have taken the world by storm, without eliminating or at least reducing hallucinations, real-world GenAI systems may face challenges in user adoption. In the process of deploying an enterprise application that produces workflows based on natural language requirements, we devised a system leveraging Retrieval Augmented Generation (RAG) to greatly improve the quality of the structured output that represents such workflows. Thanks to our implementation of RAG, our proposed system significantly reduces hallucinations in the output and improves the generalization of our LLM in out-of-domain settings. In addition, we show that using a small, well-trained retriever encoder can reduce the size of the accompanying LLM, thereby making deployments of LLM-based systems less resource-intensive.

4/15/2024

LRP4RAG: Detecting Hallucinations in Retrieval-Augmented Generation via Layer-wise Relevance Propagation

Haichuan Hu, Yuhan Sun, Quanjun Zhang

Retrieval-Augmented Generation (RAG) has become a primary technique for mitigating hallucinations in large language models (LLMs). However, incomplete knowledge extraction and insufficient understanding can still mislead LLMs to produce irrelevant or even contradictory responses, which means hallucinations persist in RAG. In this paper, we propose LRP4RAG, a method based on the Layer-wise Relevance Propagation (LRP) algorithm for detecting hallucinations in RAG. Specifically, we first utilize LRP to compute the relevance between the input and output of the RAG generator. We then apply further extraction and resampling to the relevance matrix. The processed relevance data are input into multiple classifiers to determine whether the output contains hallucinations. To the best of our knowledge, this is the first time that LRP has been used for detecting RAG hallucinations, and extensive experiments demonstrate that LRP4RAG outperforms existing baselines.

8/30/2024