GraphEval: A Knowledge-Graph Based LLM Hallucination Evaluation Framework

Read original: arXiv:2407.10793 - Published 7/16/2024 by Hannah Sansford, Nicholas Richardson, Hermina Petric Maretic, Juba Nait Saada

Overview

• The paper introduces GraphEval, a framework for evaluating hallucinations produced by large language models (LLMs), i.e., statements in a response that are inconsistent with the knowledge the model was given.

• GraphEval represents the information in an LLM response as a knowledge graph (KG) and checks it for consistency with the provided knowledge, identifying the specific triples that are likely to be hallucinated.

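• The framework aims to address shortcomings of existing metrics, which lack explainable decisions, do not systematically check every piece of information in a response, and are often too computationally expensive to use in practice.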

Plain English Explanation

• Large language models (LLMs) are powerful AI systems that can generate human-like text. However, they can sometimes produce factually incorrect information, known as hallucinations.

• Evaluating the truthfulness of LLM outputs is crucial, as these models are being used for important applications like question-answering and content generation.

• The authors of this paper have developed a new framework called GraphEval that uses knowledge graphs to automatically detect hallucinations in LLM outputs.

• A knowledge graph stores information as entities and the relationships between them. GraphEval represents the information in an LLM output in this structure and checks each piece against the knowledge the model was given, so it can identify which specific statements are unsupported.

• Because every piece of information in the response is checked individually, this approach is more systematic and explainable than existing metrics, which are also often too computationally expensive to use in practice.

• By providing a robust and automated way to assess the factual correctness of LLM outputs, GraphEval can help ensure the reliability and trustworthiness of these powerful AI systems.

Technical Explanation

• The GraphEval framework leverages knowledge graphs to detect hallucinations in the outputs of large language models (LLMs).

• Knowledge graphs are structured databases that represent factual information about the world in the form of entities (e.g., people, places, concepts) and the relationships between them.

• GraphEval represents the information in the LLM output as a KG, i.e., a set of (subject, relation, object) triples, and checks each triple against the provided knowledge. A triple that is not supported is flagged, which shows where in the response the hallucination occurred.

• The framework is used in conjunction with natural language inference (NLI) models, which judge whether the provided knowledge supports each triple.

• Used in conjunction with state-of-the-art NLI models, GraphEval improves balanced accuracy on various hallucination benchmarks compared to using the raw NLI models alone.

• The authors also explore hallucination correction with a method they name GraphCorrect, which leverages the structure of the KG, and demonstrate that the majority of hallucinations can be rectified.
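The per-triple check can be illustrated with a minimal sketch. It rests on several assumptions that do not come from the paper's code: the triples are written by hand (the step that builds the KG from the response is omitted), and `toy_nli_entails` is a hypothetical word-overlap stand-in for a real NLI model. The names `TripleVerdict`, `check_response`, and `has_hallucination` are likewise illustrative.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# A KG triple: (subject, relation, object). Assumed representation.
Triple = Tuple[str, str, str]


@dataclass
class TripleVerdict:
    triple: Triple
    supported: bool


def toy_nli_entails(context: str, hypothesis: str) -> bool:
    """Hypothetical stand-in for an NLI model.

    Treats the hypothesis as entailed only if every one of its words
    appears in the context. A real system would call an NLI model here.
    """
    context_words = set(context.lower().split())
    return all(word in context_words for word in hypothesis.lower().split())


def check_response(
    context: str,
    triples: List[Triple],
    entails: Callable[[str, str], bool] = toy_nli_entails,
) -> List[TripleVerdict]:
    """Check each triple against the context and record the verdict."""
    verdicts = []
    for subj, rel, obj in triples:
        # Turn the triple into a short sentence to use as the hypothesis.
        hypothesis = f"{subj} {rel} {obj}"
        verdicts.append(TripleVerdict((subj, rel, obj), entails(context, hypothesis)))
    return verdicts


def has_hallucination(verdicts: List[TripleVerdict]) -> bool:
    """The response is flagged if at least one triple is unsupported."""
    return any(not v.supported for v in verdicts)


# Hand-written example data (not taken from any benchmark).
context = "marie curie was born in warsaw and won the nobel prize in physics"
triples = [
    ("marie curie", "was born in", "warsaw"),
    ("marie curie", "won", "the fields medal"),
]

verdicts = check_response(context, triples)
flagged = [v.triple for v in verdicts if not v.supported]
print("hallucination detected:", has_hallucination(verdicts))
print("unsupported triples:", flagged)
```

Because the verdict is recorded per triple, the output says not only that the response is inconsistent but also which statement is responsible, which is the explainability benefit the summary describes. Passing a different `entails` callable would swap the toy check for an actual NLI model.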

Critical Analysis

• The GraphEval framework represents a valuable contribution to the field of LLM reliability and trustworthiness, as it provides a more comprehensive and automated approach to hallucination detection compared to existing methods.

• However, the authors acknowledge that the framework's performance is still limited by the completeness and accuracy of the knowledge graph it works with. Missing or incorrect triples could lead to false positives or false negatives in hallucination detection.

• Additionally, the authors note that GraphEval may struggle with hallucinations that are not directly contradicted by the knowledge graph, such as those involving plausible but factually incorrect information.

• Further research is needed to explore ways to enhance knowledge verification and improve the detection of more nuanced forms of hallucination.

Conclusion

• The GraphEval framework represents an important step forward in the quest to ensure the reliability and trustworthiness of large language models.

• By leveraging knowledge graphs to automatically detect hallucinations, GraphEval provides a scalable and comprehensive approach to evaluating the factual correctness of LLM outputs.

• While the framework has some limitations, it demonstrates the potential of knowledge-based approaches to enhance the safety and reliability of these powerful AI systems, which are increasingly being deployed in real-world applications.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

GraphEval: A Knowledge-Graph Based LLM Hallucination Evaluation Framework

Hannah Sansford, Nicholas Richardson, Hermina Petric Maretic, Juba Nait Saada

Methods to evaluate Large Language Model (LLM) responses and detect inconsistencies, also known as hallucinations, with respect to the provided knowledge, are becoming increasingly important for LLM applications. Current metrics fall short in their ability to provide explainable decisions, systematically check all pieces of information in the response, and are often too computationally expensive to be used in practice. We present GraphEval: a hallucination evaluation framework based on representing information in Knowledge Graph (KG) structures. Our method identifies the specific triples in the KG that are prone to hallucinations and hence provides more insight into where in the response a hallucination has occurred, if at all, than previous methods. Furthermore, using our approach in conjunction with state-of-the-art natural language inference (NLI) models leads to an improvement in balanced accuracy on various hallucination benchmarks, compared to using the raw NLI models. Lastly, we explore the use of GraphEval for hallucination correction by leveraging the structure of the KG, a method we name GraphCorrect, and demonstrate that the majority of hallucinations can indeed be rectified.

7/16/2024

Combining LLMs and Knowledge Graphs to Reduce Hallucinations in Question Answering

Larissa Pusch, Tim O. F. Conrad

Advancements in natural language processing have revolutionized the way we can interact with digital information systems, such as databases, making them more accessible. However, challenges persist, especially when accuracy is critical, as in the biomedical domain. A key issue is the hallucination problem, where models generate information unsupported by the underlying data, potentially leading to dangerous misinformation. This paper presents a novel approach designed to bridge this gap by combining Large Language Models (LLM) and Knowledge Graphs (KG) to improve the accuracy and reliability of question-answering systems, on the example of a biomedical KG. Built on the LangChain framework, our method incorporates a query checker that ensures the syntactical and semantic validity of LLM-generated queries, which are then used to extract information from a Knowledge Graph, substantially reducing errors like hallucinations. We evaluated the overall performance using a new benchmark dataset of 50 biomedical questions, testing several LLMs, including GPT-4 Turbo and llama3:70b. Our results indicate that while GPT-4 Turbo outperforms other models in generating accurate queries, open-source models like llama3:70b show promise with appropriate prompt engineering. To make this approach accessible, a user-friendly web-based interface has been developed, allowing users to input natural language queries, view generated and corrected Cypher queries, and verify the resulting paths for accuracy. Overall, this hybrid approach effectively addresses common issues such as data gaps and hallucinations, offering a reliable and intuitive solution for question answering systems. The source code for generating the results of this paper and for the user-interface can be found in our Git repository: https://git.zib.de/lpusch/cyphergenkg-gui

9/9/2024

Mitigating Hallucinations in Large Language Models via Self-Refinement-Enhanced Knowledge Retrieval

Mengjia Niu, Hao Li, Jie Shi, Hamed Haddadi, Fan Mo

Large language models (LLMs) have demonstrated remarkable capabilities across various domains, although their susceptibility to hallucination poses significant challenges for their deployment in critical areas such as healthcare. To address this issue, retrieving relevant facts from knowledge graphs (KGs) is considered a promising method. Existing KG-augmented approaches tend to be resource-intensive, requiring multiple rounds of retrieval and verification for each factoid, which impedes their application in real-world scenarios. In this study, we propose Self-Refinement-Enhanced Knowledge Graph Retrieval (Re-KGR) to augment the factuality of LLMs' responses with less retrieval efforts in the medical field. Our approach leverages the attribution of next-token predictive probability distributions across different tokens, and various model layers to primarily identify tokens with a high potential for hallucination, reducing verification rounds by refining knowledge triples associated with these tokens. Moreover, we rectify inaccurate content using retrieved knowledge in the post-processing stage, which improves the truthfulness of generated responses. Experimental results on a medical dataset demonstrate that our approach can enhance the factual capability of LLMs across various foundational models as evidenced by the highest scores on truthfulness.

5/13/2024

Leveraging Graph Structures to Detect Hallucinations in Large Language Models

Noa Nonkes, Sergei Agaronian, Evangelos Kanoulas, Roxana Petcu

Large language models are extensively applied across a wide range of tasks, such as customer support, content creation, educational tutoring, and providing financial guidance. However, a well-known drawback is their predisposition to generate hallucinations. This damages the trustworthiness of the information these models provide, impacting decision-making and user confidence. We propose a method to detect hallucinations by looking at the structure of the latent space and finding associations within hallucinated and non-hallucinated generations. We create a graph structure that connects generations that lie closely in the embedding space. Moreover, we employ a Graph Attention Network which utilizes message passing to aggregate information from neighboring nodes and assigns varying degrees of importance to each neighbor based on their relevance. Our findings show that 1) there exists a structure in the latent space that differentiates between hallucinated and non-hallucinated generations, 2) Graph Attention Networks can learn this structure and generalize it to unseen generations, and 3) the robustness of our method is enhanced when incorporating contrastive learning. When evaluated against evidence-based benchmarks, our model performs similarly without access to search-based methods.

7/8/2024