WikiContradict: A Benchmark for Evaluating LLMs on Real-World Knowledge Conflicts from Wikipedia

Read original: arXiv:2406.13805 - Published 6/21/2024 by Yufang Hou, Alessandra Pascale, Javier Carnerero-Cano, Tigran Tchrakian, Radu Marinescu, Elizabeth Daly, Inkit Padhi, Prasanna Sattigeri

Overview

• This paper introduces WikiContradict, a benchmark for evaluating large language models (LLMs) on their ability to handle real-world knowledge conflicts from Wikipedia.

• The benchmark consists of 253 human-annotated instances built from contradictory Wikipedia passages, which are used to test how well LLMs answer questions when the retrieved context contains a real-world knowledge conflict.

Plain English Explanation

• Wikipedia, the online encyclopedia, is a valuable resource of information. However, it is written by many different people, and sometimes the information can be contradictory or inconsistent.

• The researchers created a dataset called WikiContradict that contains examples of these contradictory claims from Wikipedia. This dataset can be used to test how well large language models, both closed and open-source, can identify and resolve these contradictions.

• Identifying and resolving contradictions is an important skill for AI systems, as they need to be able to reason about and make sense of the world, even when information is inconsistent or conflicting. See also: "ContradoC: Understanding Self-Contradictions in Documents for Robust Language Models"

• By testing LLMs on the WikiContradict benchmark, researchers can better understand the strengths and limitations of these models when it comes to dealing with real-world knowledge conflicts. See also: "ClashEval: Quantifying the Tug-of-War Between LLMs' Internal Knowledge and Reasoning Skills"

Technical Explanation

• The researchers extracted contradictory claims from Wikipedia by identifying sentences that express opposite or mutually exclusive information about the same topic.

• They used a two-step process to collect these contradictory claims: first, they identified candidate sentences that express contradictory information, and then they manually verified these candidates to ensure they truly represent a real-world knowledge conflict.

• The resulting WikiContradict dataset contains 253 high-quality, human-annotated instances. It can be used to evaluate how well LLMs answer when they are augmented with retrieved passages that contain real-world knowledge conflicts. See also: "Untangle the Knot: Interweaving Conflicting Knowledge and Reasoning Skills for Robust Language Models"

• The researchers benchmark a range of closed and open-source LLMs under different QA scenarios, including RAG with a single passage and RAG with two contradictory passages. When given two contradictory passages, all models struggle to generate answers that accurately reflect the conflict, especially for implicit conflicts that require reasoning. See also: "Why So Gullible? Enhancing the Robustness of Retrieval-Augmented Language Models" and "ConflARE: Conformal Large Language Model Retrieval"
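
The paper's abstract mentions QA scenarios that include RAG with a single passage and RAG with two contradictory passages. The following is a minimal sketch of how prompts for those two scenarios might be assembled. The `Instance` fields, the `build_prompt` helper, the prompt wording, and the toy example are all assumptions made for illustration; they are not the released dataset schema or the authors' prompt templates.

```python
from dataclasses import dataclass


@dataclass
class Instance:
    """One benchmark item. Field names are hypothetical, not the released schema."""
    question: str
    passage_a: str
    passage_b: str  # contradicts passage_a


def build_prompt(inst: Instance, scenario: str) -> str:
    """Build a QA prompt for the 'single' or 'both' passage scenario."""
    if scenario == "single":
        passages = [inst.passage_a]
    elif scenario == "both":
        passages = [inst.passage_a, inst.passage_b]
    else:
        raise ValueError(f"unknown scenario: {scenario!r}")
    context = "\n".join(
        f"Passage {i}: {text}" for i, text in enumerate(passages, start=1)
    )
    # Illustrative instruction wording only.
    return (
        "Answer the question using only the context below.\n"
        f"{context}\n"
        f"Question: {inst.question}\n"
        "Answer:"
    )


# Made-up toy example, not taken from the dataset.
toy = Instance(
    question="When did the bridge open?",
    passage_a="The bridge opened in 1932.",
    passage_b="The bridge opened in 1934.",
)

print(build_prompt(toy, "single"))
print(build_prompt(toy, "both"))
```

In the "both" scenario, a good answer would mention both dates and note that the context disagrees, rather than silently picking one.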

Critical Analysis

• The researchers acknowledge that the WikiContradict dataset is not exhaustive and may not capture all types of knowledge conflicts that exist in the real world. There may be other forms of contradictions or inconsistencies that are not represented in the dataset.

• Additionally, the manual verification process used to curate the dataset could introduce some bias, as the researchers' own interpretations and judgments may influence the final selection of contradictory claims.

• It would be valuable to see further research on how LLMs perform on this benchmark over time, as the models continue to improve and new techniques are developed to address the challenges of resolving knowledge conflicts.
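
Tracking models over time depends on scoring that is cheaper than human evaluation. The paper's abstract reports an automated judge that reaches an F-score of 0.8. As a rough sketch of how agreement between an automated judge and human labels can be summarized, the code below computes a binary F1 score. Treating the labels as binary ("answer judged correct" or not) is an assumption, the `f_score` helper is hypothetical, and the labels are made-up toy data, not results from the paper.

```python
from typing import Sequence


def f_score(human: Sequence[bool], judge: Sequence[bool]) -> float:
    """Binary F1 of judge labels against human labels (True = answer judged correct)."""
    if len(human) != len(judge):
        raise ValueError("label lists must have the same length")
    tp = sum(h and j for h, j in zip(human, judge))        # both say correct
    fp = sum((not h) and j for h, j in zip(human, judge))  # judge too lenient
    fn = sum(h and (not j) for h, j in zip(human, judge))  # judge too strict
    if tp == 0:
        return 0.0  # avoids division by zero when there are no true positives
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Made-up toy labels for five answers.
human = [True, True, True, False, False]
judge = [True, True, False, True, False]
print(f"{f_score(human, judge):.2f}")  # prints 0.67
```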

Conclusion

• The WikiContradict benchmark provides a valuable tool for evaluating the ability of LLMs to handle real-world knowledge conflicts, which is an important capability for AI systems that aim to understand and reason about the world.

• By testing LLMs on this benchmark, researchers can gain insights into the strengths and limitations of these models, and work towards developing more robust and reliable AI systems that can effectively deal with contradictory information.


Related Papers

WikiContradict: A Benchmark for Evaluating LLMs on Real-World Knowledge Conflicts from Wikipedia

Yufang Hou, Alessandra Pascale, Javier Carnerero-Cano, Tigran Tchrakian, Radu Marinescu, Elizabeth Daly, Inkit Padhi, Prasanna Sattigeri

Retrieval-augmented generation (RAG) has emerged as a promising solution to mitigate the limitations of large language models (LLMs), such as hallucinations and outdated information. However, it remains unclear how LLMs handle knowledge conflicts arising from different augmented retrieved passages, especially when these passages originate from the same source and have equal trustworthiness. In this work, we conduct a comprehensive evaluation of LLM-generated answers to questions that have varying answers based on contradictory passages from Wikipedia, a dataset widely regarded as a high-quality pre-training resource for most LLMs. Specifically, we introduce WikiContradict, a benchmark consisting of 253 high-quality, human-annotated instances designed to assess LLM performance when augmented with retrieved passages containing real-world knowledge conflicts. We benchmark a diverse range of both closed and open-source LLMs under different QA scenarios, including RAG with a single passage, and RAG with 2 contradictory passages. Through rigorous human evaluations on a subset of WikiContradict instances involving 5 LLMs and over 3,500 judgements, we shed light on the behaviour and limitations of these models. For instance, when provided with two passages containing contradictory facts, all models struggle to generate answers that accurately reflect the conflicting nature of the context, especially for implicit conflicts requiring reasoning. Since human evaluation is costly, we also introduce an automated model that estimates LLM performance using a strong open-source language model, achieving an F-score of 0.8. Using this automated metric, we evaluate more than 1,500 answers from seven LLMs across all WikiContradict instances. To facilitate future work, we release WikiContradict on: https://ibm.biz/wikicontradict.

6/21/2024

ConflictBank: A Benchmark for Evaluating the Influence of Knowledge Conflicts in LLM

Zhaochen Su, Jun Zhang, Xiaoye Qu, Tong Zhu, Yanshu Li, Jiashuo Sun, Juntao Li, Min Zhang, Yu Cheng

Large language models (LLMs) have achieved impressive advancements across numerous disciplines, yet the critical issue of knowledge conflicts, a major source of hallucinations, has rarely been studied. Only a few research explored the conflicts between the inherent knowledge of LLMs and the retrieved contextual knowledge. However, a thorough assessment of knowledge conflict in LLMs is still missing. Motivated by this research gap, we present ConflictBank, the first comprehensive benchmark developed to systematically evaluate knowledge conflicts from three aspects: (i) conflicts encountered in retrieved knowledge, (ii) conflicts within the models' encoded knowledge, and (iii) the interplay between these conflict forms. Our investigation delves into four model families and twelve LLM instances, meticulously analyzing conflicts stemming from misinformation, temporal discrepancies, and semantic divergences. Based on our proposed novel construction framework, we create 7,453,853 claim-evidence pairs and 553,117 QA pairs. We present numerous findings on model scale, conflict causes, and conflict types. We hope our ConflictBank benchmark will help the community better understand model behavior in conflicts and develop more reliable LLMs.

8/23/2024

Resolving Knowledge Conflicts in Large Language Models

Yike Wang, Shangbin Feng, Heng Wang, Weijia Shi, Vidhisha Balachandran, Tianxing He, Yulia Tsvetkov

Large language models (LLMs) often encounter knowledge conflicts, scenarios where discrepancy arises between the internal parametric knowledge of LLMs and non-parametric information provided in the prompt context. In this work we ask what are the desiderata for LLMs when a knowledge conflict arises and whether existing LLMs fulfill them. We posit that LLMs should 1) identify knowledge conflicts, 2) pinpoint conflicting information segments, and 3) provide distinct answers or viewpoints in conflicting scenarios. To this end, we introduce KNOWLEDGE CONFLICT, an evaluation framework for simulating contextual knowledge conflicts and quantitatively evaluating to what extent LLMs achieve these goals. KNOWLEDGE CONFLICT includes diverse and complex situations of knowledge conflict, knowledge from diverse entities and domains, two synthetic conflict creation methods, and settings with progressively increasing difficulty to reflect realistic knowledge conflicts. Extensive experiments with the KNOWLEDGE CONFLICT framework reveal that while LLMs perform well in identifying the existence of knowledge conflicts, they struggle to determine the specific conflicting knowledge and produce a response with distinct answers amidst conflicting information. To address these challenges, we propose new instruction-based approaches that augment LLMs to better achieve the three goals. Further analysis shows that abilities to tackle knowledge conflicts are greatly impacted by factors such as knowledge domain and prompt text, while generating robust responses to knowledge conflict scenarios remains an open research question.

9/6/2024

Knowledge Conflicts for LLMs: A Survey

Rongwu Xu, Zehan Qi, Zhijiang Guo, Cunxiang Wang, Hongru Wang, Yue Zhang, Wei Xu

This survey provides an in-depth analysis of knowledge conflicts for large language models (LLMs), highlighting the complex challenges they encounter when blending contextual and parametric knowledge. Our focus is on three categories of knowledge conflicts: context-memory, inter-context, and intra-memory conflict. These conflicts can significantly impact the trustworthiness and performance of LLMs, especially in real-world applications where noise and misinformation are common. By categorizing these conflicts, exploring the causes, examining the behaviors of LLMs under such conflicts, and reviewing available solutions, this survey aims to shed light on strategies for improving the robustness of LLMs, thereby serving as a valuable resource for advancing research in this evolving area.

6/26/2024