What Evidence Do Language Models Find Convincing?

Read original: arXiv:2402.11782 - Published 8/12/2024 by Alexander Wan, Eric Wallace, Dan Klein

Overview

This paper investigates what types of evidence language models (LMs) find convincing when they answer controversial questions for which the available documents give conflicting answers.
The authors create a new dataset called ConflictingQA to study this.
They find that current models rely heavily on how relevant a website is to the query, while largely ignoring stylistic features that humans find important, such as scientific references or a neutral tone.

Plain English Explanation

Language models are AI systems that can generate human-like text. This paper explores what kinds of information these models find persuasive when answering questions that have conflicting answers.

The researchers built a new dataset called ConflictingQA that pairs controversial questions with real-world documents arguing for different answers (Yes or No). The documents differ in the facts they present and in how they argue. The researchers then had language models answer the questions and analyzed which features of the evidence most affected the answers.

The key finding is that language models rely heavily on how relevant a document is to the question, while largely ignoring stylistic features that humans find important. For example, whether a text contains scientific references or is written in a neutral tone has little effect on which answer the model chooses.

This suggests that while language models are powerful, they do not weigh evidence the way a careful human reader would. It also means that the quality of the documents a retrieval system supplies matters a great deal, for instance by filtering out misinformation before it reaches the model.

Technical Explanation

The paper introduces ConflictingQA, a dataset that pairs controversial queries (for example, "is aspartame linked to cancer") with a series of real-world evidence documents. The documents differ in the facts they contain (e.g., quantitative results), their argument styles (e.g., appeals to authority), and the answer they support (Yes or No). The dataset is designed to test how retrieval-augmented LMs handle conflicting evidence.

The authors use the dataset to run sensitivity and counterfactual analyses that explore which text features most affect LLM predictions. They find that current models rely heavily on the relevance of a website to the query, while largely ignoring stylistic features that humans find important, such as whether a text contains scientific references or is written with a neutral tone.

Taken together, the authors argue, these results highlight the importance of RAG corpus quality (e.g., the need to filter misinformation) and may even call for a shift in how LLMs are trained so that their judgements align better with those of humans.
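The paper's abstract describes counterfactual analyses over text features. The sketch below is one way such an analysis could be set up, not the authors' code. Every name in it (`toy_model`, `win_rate`, `counterfactual_effect`) and every example document is a made-up assumption for illustration. In particular, `toy_model` is a word-overlap heuristic standing in for a real LLM call, so its numbers say nothing about actual model behavior.

```python
import re
from typing import Callable, List

# A "model" maps (query, doc arguing Yes, doc arguing No) to "Yes" or "No".
Model = Callable[[str, str, str], str]


def tokens(text: str) -> set:
    """Lowercased alphabetic word set of a text."""
    return set(re.findall(r"[a-z]+", text.lower()))


def toy_model(query: str, doc_yes: str, doc_no: str) -> str:
    """Hypothetical stand-in for an LLM: sides with the document that
    shares more words with the query (ties go to "Yes")."""
    q = tokens(query)
    return "Yes" if len(q & tokens(doc_yes)) >= len(q & tokens(doc_no)) else "No"


def win_rate(model: Model, query: str, doc: str, answer: str,
             opposing: List[str]) -> float:
    """Fraction of pairings in which the model sides with `doc`."""
    wins = 0
    for other in opposing:
        if answer == "Yes":
            pred = model(query, doc, other)
        else:
            pred = model(query, other, doc)
        wins += int(pred == answer)
    return wins / len(opposing)


def counterfactual_effect(model: Model, query: str, doc: str, answer: str,
                          opposing: List[str],
                          perturb: Callable[[str], str]) -> float:
    """Change in win rate after applying one controlled edit to `doc`."""
    before = win_rate(model, query, doc, answer, opposing)
    after = win_rate(model, query, perturb(doc), answer, opposing)
    return after - before


def add_references(doc: str) -> str:
    """Stylistic edit: append a pointer to references."""
    return doc + " See references [1] and [2]."


def add_relevance(doc: str) -> str:
    """Relevance edit: restate the query at the top of the document."""
    return "Is aspartame linked to cancer? " + doc


# Invented toy data, not taken from ConflictingQA.
QUERY = "is aspartame linked to cancer"
DOC_YES = "A study reports a possible risk."
OPPOSING = [
    "Regulators say aspartame is safe.",
    "Large reviews find no cancer link for aspartame.",
]

style_effect = counterfactual_effect(
    toy_model, QUERY, DOC_YES, "Yes", OPPOSING, add_references)
relevance_effect = counterfactual_effect(
    toy_model, QUERY, DOC_YES, "Yes", OPPOSING, add_relevance)
```

Replacing `toy_model` with a function that prompts a real LLM would turn this into an actual measurement. The design choice worth noting is that each perturbation changes one feature at a time, so any shift in win rate can be attributed to that feature.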

Critical Analysis

The paper provides valuable insights into the limitations of current language models when it comes to reasoning about conflicting information. The creation of the ConflictingQA dataset is a particularly useful contribution, as it offers a more nuanced testbed for evaluating LM capabilities beyond simple question answering.

That said, the paper could have explored some additional angles. For instance, it would have been interesting to see how the models' behavior varied across different types of questions or subject domains. Additionally, the authors could have delved deeper into why the models overlook the stylistic cues that humans treat as signs of credibility.

Nonetheless, the core finding that LMs weigh evidence mainly by its relevance to the query, rather than by the credibility cues humans rely on, is an important one. As these models become more widely deployed, understanding their limitations and biases will be crucial. The insights from this paper can help guide future research to address these shortcomings and develop more robust and reliable language AI systems.

Conclusion

This paper sheds light on a key limitation of current language models: when faced with conflicting evidence, they rely heavily on how relevant a document is to the query and largely ignore stylistic features that humans find important. By creating the ConflictingQA dataset and running sensitivity and counterfactual analyses on it, the authors show which text features actually affect LLM predictions.

As language models become increasingly ubiquitous, understanding their biases and limitations will be crucial. The findings from this paper suggest that developing more robust and discerning language AI systems will be an important area of future research, with significant implications for how these models are deployed and relied upon in real-world applications.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

What Evidence Do Language Models Find Convincing?

Alexander Wan, Eric Wallace, Dan Klein

Retrieval-augmented language models are being increasingly tasked with subjective, contentious, and conflicting queries such as "is aspartame linked to cancer". To resolve these ambiguous queries, one must search through a large range of websites and consider "which, if any, of this evidence do I find convincing?". In this work, we study how LLMs answer this question. In particular, we construct ConflictingQA, a dataset that pairs controversial queries with a series of real-world evidence documents that contain different facts (e.g., quantitative results), argument styles (e.g., appeals to authority), and answers (Yes or No). We use this dataset to perform sensitivity and counterfactual analyses to explore which text features most affect LLM predictions. Overall, we find that current models rely heavily on the relevance of a website to the query, while largely ignoring stylistic features that humans find important such as whether a text contains scientific references or is written with a neutral tone. Taken together, these results highlight the importance of RAG corpus quality (e.g., the need to filter misinformation), and possibly even a shift in how LLMs are trained to better align with human judgements.

8/12/2024

Large Language Models Help Humans Verify Truthfulness -- Except When They Are Convincingly Wrong

Chenglei Si, Navita Goyal, Sherry Tongshuang Wu, Chen Zhao, Shi Feng, Hal Daumé III, Jordan Boyd-Graber

Large Language Models (LLMs) are increasingly used for accessing information on the web. Their truthfulness and factuality are thus of great interest. To help users make the right decisions about the information they get, LLMs should not only provide information but also help users fact-check it. Our experiments with 80 crowdworkers compare language models with search engines (information retrieval systems) at facilitating fact-checking. We prompt LLMs to validate a given claim and provide corresponding explanations. Users reading LLM explanations are significantly more efficient than those using search engines while achieving similar accuracy. However, they over-rely on the LLMs when the explanation is wrong. To reduce over-reliance on LLMs, we ask LLMs to provide contrastive information - explain both why the claim is true and false, and then we present both sides of the explanation to users. This contrastive explanation mitigates users' over-reliance on LLMs, but cannot significantly outperform search engines. Further, showing both search engine results and LLM explanations offers no complementary benefits compared to search engines alone. Taken together, our study highlights that natural language explanations by LLMs may not be a reliable replacement for reading the retrieved passages, especially in high-stakes settings where over-relying on wrong AI explanations could lead to critical consequences.

4/3/2024

Ranking Generated Answers: On the Agreement of Retrieval Models with Humans on Consumer Health Questions

Sebastian Heineking, Jonas Probst, Daniel Steinbach, Martin Potthast, Harrisen Scells

Evaluating the output of generative large language models (LLMs) is challenging and difficult to scale. Most evaluations of LLMs focus on tasks such as single-choice question-answering or text classification. These tasks are not suitable for assessing open-ended question-answering capabilities, which are critical in domains where expertise is required, such as health, and where misleading or incorrect answers can have a significant impact on a user's health. Using human experts to evaluate the quality of LLM answers is generally considered the gold standard, but expert annotation is costly and slow. We present a method for evaluating LLM answers that uses ranking signals as a substitute for explicit relevance judgements. Our scoring method correlates with the preferences of human experts. We validate it by investigating the well-known fact that the quality of generated answers improves with the size of the model as well as with more sophisticated prompting strategies.

8/20/2024

WikiContradict: A Benchmark for Evaluating LLMs on Real-World Knowledge Conflicts from Wikipedia

Yufang Hou, Alessandra Pascale, Javier Carnerero-Cano, Tigran Tchrakian, Radu Marinescu, Elizabeth Daly, Inkit Padhi, Prasanna Sattigeri

Retrieval-augmented generation (RAG) has emerged as a promising solution to mitigate the limitations of large language models (LLMs), such as hallucinations and outdated information. However, it remains unclear how LLMs handle knowledge conflicts arising from different augmented retrieved passages, especially when these passages originate from the same source and have equal trustworthiness. In this work, we conduct a comprehensive evaluation of LLM-generated answers to questions that have varying answers based on contradictory passages from Wikipedia, a dataset widely regarded as a high-quality pre-training resource for most LLMs. Specifically, we introduce WikiContradict, a benchmark consisting of 253 high-quality, human-annotated instances designed to assess LLM performance when augmented with retrieved passages containing real-world knowledge conflicts. We benchmark a diverse range of both closed and open-source LLMs under different QA scenarios, including RAG with a single passage, and RAG with 2 contradictory passages. Through rigorous human evaluations on a subset of WikiContradict instances involving 5 LLMs and over 3,500 judgements, we shed light on the behaviour and limitations of these models. For instance, when provided with two passages containing contradictory facts, all models struggle to generate answers that accurately reflect the conflicting nature of the context, especially for implicit conflicts requiring reasoning. Since human evaluation is costly, we also introduce an automated model that estimates LLM performance using a strong open-source language model, achieving an F-score of 0.8. Using this automated metric, we evaluate more than 1,500 answers from seven LLMs across all WikiContradict instances. To facilitate future work, we release WikiContradict on: https://ibm.biz/wikicontradict.

6/21/2024