Large Language Models Cannot Explain Themselves

2405.04382

Published 5/8/2024 by Advait Sarkar


Abstract

Large language models can be prompted to produce text. They can also be prompted to produce explanations of their output. But these are not really explanations, because they do not accurately reflect the mechanical process underlying the prediction. The illusion that they reflect the reasoning process can result in significant harms. These explanations can be valuable, but for promoting critical thinking rather than for understanding the model. I propose a recontextualisation of these explanations, using the term exoplanations to draw attention to their exogenous nature. I discuss some implications for design and technology, such as the inclusion of appropriate guardrails and responses when models are prompted to generate explanations.


Overview

This paper argues that large language models (LLMs) cannot truly "explain" their inner workings and decision-making processes, despite claims that they can provide "explanations."
The author suggests that the purported "explanations" from LLMs are an illusion, and that these models lack the introspective capabilities needed to genuinely explain their reasoning.
The paper discusses the societal harms that can arise from the misuse of these "exoplanations" - explanations that appear to explain the model's behavior but do not actually reflect the model's true decision-making.

Plain English Explanation

Large language models (LLMs) like GPT-3 have become increasingly powerful at tasks like language generation, translation, and answering questions. When prompted, these models will also produce what look like explanations of their own outputs, and such text is easily taken as an account of their inner workings and decision-making processes. However, this paper argues that this is an illusion.

The author suggests that LLMs lack the introspective capabilities needed to truly explain their reasoning. These models are essentially very complex statistical machines that have been trained on vast amounts of data to generate human-like text. Under the hood, they do not "understand" language in the way a human does. They match patterns and generate the most likely next words based on the training data.

When an LLM is asked to "explain" its reasoning, it generates plausible-sounding text that appears to be an explanation. But this text is not a genuine account of the model's inner workings. The paper calls it an exoplanation: an explanation that is exogenous to the process that produced the original output, so it can sound convincing without reflecting the model's actual decision-making.
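The distinction can be made concrete with a toy example. The sketch below is not from the paper: it uses a tiny bigram word model as a stand-in for an LLM, and the corpus, the function names train_bigram and generate, and the greedy decoding rule are all assumptions made for illustration. It shows that a "because ..." text comes out of the same next-word procedure as the answer, while the only faithful record of how the answer was produced is the trace of table lookups.

```python
from collections import Counter, defaultdict

def train_bigram(corpus):
    """Count, for each word, which words follow it in the corpus."""
    counts = defaultdict(Counter)
    for sentence in corpus:
        words = sentence.split()
        for prev, nxt in zip(words, words[1:]):
            counts[prev][nxt] += 1
    return counts

def generate(counts, prompt, max_words=8):
    """Greedily append the most frequent next word; also record each lookup."""
    words = prompt.split()
    trace = []  # the actual mechanism: (previous word, chosen word, count)
    for _ in range(max_words):
        followers = counts.get(words[-1])
        if not followers:
            break
        # Highest count wins; ties are broken alphabetically for determinism.
        nxt = min(followers, key=lambda w: (-followers[w], w))
        trace.append((words[-1], nxt, followers[nxt]))
        words.append(nxt)
    return " ".join(words), trace

# Hypothetical two-sentence training corpus (an assumption for this sketch).
corpus = [
    "the answer is four",
    "because i added both numbers",
]
counts = train_bigram(corpus)

answer, answer_trace = generate(counts, "the")
exoplanation, _ = generate(counts, "because")

print(answer)        # the answer is four
print(exoplanation)  # because i added both numbers
print(answer_trace)  # [('the', 'answer', 1), ('answer', 'is', 1), ('is', 'four', 1)]
```

The toy model never added anything. Its "because" sentence is simply more generated text, produced without any access to answer_trace, which is the analogue of what the paper describes for much larger models.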

The author argues that the proliferation of these exoplanations can be harmful to society. People may mistakenly believe that the model's explanations are accurate and truthful, when in reality the model is simply generating plausible-sounding text without any true understanding. This could lead to erroneous conclusions and poor decision-making, with potentially serious consequences.

Technical Explanation

The paper first discusses the illusion of explanation created by large language models (LLMs). The author argues that while these models can generate coherent and seemingly explanatory text, they do not have the introspective capabilities to explain their own decision-making processes.

The paper then explores the concept of exoplanations - explanations that appear to explain a model's behavior but do not reflect the model's actual decision-making. The author suggests that these exoplanations can be harmful, as they may mislead users into thinking the model has a deeper understanding than it actually does.

The paper also discusses the potential for linguistic intentionality - the idea that LLMs may be able to generate text that appears to have intentional meaning, but is ultimately just a product of statistical modeling rather than true comprehension.

Critical Analysis

The paper raises valid concerns about the limitations of large language models and their purported ability to "explain" their own inner workings. The author makes a strong case that the "explanations" provided by these models are often simply fabricated narratives that sound plausible but do not reflect the actual decision-making process.

One potential limitation of the paper is that it does not provide empirical evidence to support its claims about the harms of exoplanations. While the author makes a compelling theoretical argument, more research may be needed to understand the real-world impact of these issues.

Additionally, the paper does not address potential ways to improve the transparency and interpretability of large language models. While the author is correct that these models currently lack genuine explanatory capabilities, future research may find ways to enhance their introspective abilities and provide more truthful and meaningful explanations.

Overall, the paper provides a thought-provoking critique of the claims around the "explainability" of large language models, and highlights the importance of critically evaluating the capabilities and limitations of these powerful AI systems.

Conclusion

This paper challenges the notion that large language models can truly "explain" their inner workings and decision-making processes. The author argues that the purported "explanations" from these models are an illusion, and that the models lack the introspective capabilities needed to genuinely account for their reasoning.

The paper also discusses the societal harms that can arise from the proliferation of these "exoplanations" - explanations that appear to explain the model's behavior but do not actually reflect the model's true decision-making. This could lead to erroneous conclusions and poor decision-making, with potentially serious consequences.

Overall, this paper serves as an important reminder that the capabilities of large language models should not be overstated, and that a critical, skeptical approach is necessary when evaluating their outputs and claims about their inner workings.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers


Towards Uncovering How Large Language Model Works: An Explainability Perspective

Haiyan Zhao, Fan Yang, Bo Shen, Himabindu Lakkaraju, Mengnan Du

Large language models (LLMs) have led to breakthroughs in language tasks, yet the internal mechanisms that enable their remarkable generalization and reasoning abilities remain opaque. This lack of transparency presents challenges such as hallucinations, toxicity, and misalignment with human values, hindering the safe and beneficial deployment of LLMs. This paper aims to uncover the mechanisms underlying LLM functionality through the lens of explainability. First, we review how knowledge is architecturally composed within LLMs and encoded in their internal parameters via mechanistic interpretability techniques. Then, we summarize how knowledge is embedded in LLM representations by leveraging probing techniques and representation engineering. Additionally, we investigate the training dynamics through a mechanistic perspective to explain phenomena such as grokking and memorization. Lastly, we explore how the insights gained from these explanations can enhance LLM performance through model editing, improve efficiency through pruning, and better align with human values.

4/17/2024

cs.CL

Why Would You Suggest That? Human Trust in Language Model Responses

Manasi Sharma, Ho Chit Siu, Rohan Paleja, Jaime D. Peña

The emergence of Large Language Models (LLMs) has revealed a growing need for human-AI collaboration, especially in creative decision-making scenarios where trust and reliance are paramount. Through human studies and model evaluations on the open-ended News Headline Generation task from the LaMP benchmark, we analyze how the framing and presence of explanations affect user trust and model performance. Overall, we provide evidence that adding an explanation in the model response to justify its reasoning significantly increases self-reported user trust in the model when the user has the opportunity to compare various responses. Position and faithfulness of these explanations are also important factors. However, these gains disappear when users are shown responses independently, suggesting that humans trust all model responses, including deceptive ones, equitably when they are shown in isolation. Our findings urge future research to delve deeper into the nuanced evaluation of trust in human-machine teaming systems.

6/5/2024

cs.CL cs.AI cs.HC


Can Language Models Explain Their Own Classification Behavior?

Dane Sherburn, Bilal Chughtai, Owain Evans

Large language models (LLMs) perform well at a myriad of tasks, but explaining the processes behind this performance is a challenge. This paper investigates whether LLMs can give faithful high-level explanations of their own internal processes. To explore this, we introduce a dataset, ArticulateRules, of few-shot text-based classification tasks generated by simple rules. Each rule is associated with a simple natural-language explanation. We test whether models that have learned to classify inputs competently (both in- and out-of-distribution) are able to articulate freeform natural language explanations that match their classification behavior. Our dataset can be used for both in-context and finetuning evaluations. We evaluate a range of LLMs, demonstrating that articulation accuracy varies considerably between models, with a particularly sharp increase from GPT-3 to GPT-4. We then investigate whether we can improve GPT-3's articulation accuracy through a range of methods. GPT-3 completely fails to articulate 7/10 rules in our test, even after additional finetuning on correct explanations. We release our dataset, ArticulateRules, which can be used to test self-explanation for LLMs trained either in-context or by finetuning.

5/14/2024

cs.LG cs.AI


Large Language Models are In-context Teachers for Knowledge Reasoning

Jiachen Zhao, Zonghai Yao, Zhichao Yang, Hong Yu

Chain-of-thought (CoT) prompting teaches large language models (LLMs) in context to reason over queries that require more than mere information retrieval. However, human experts are usually required to craft demonstrations for in-context learning (ICL), which is expensive and has high variance. More importantly, how to craft helpful reasoning exemplars for ICL remains unclear. In this work, we investigate whether LLMs can be better in-context teachers for knowledge reasoning. We follow the "encoding specificity" hypothesis in human's memory retrieval to assume in-context exemplars at inference should match the encoding context in training data. We are thus motivated to propose Self-Explain to use one LLM's self-elicited explanations as in-context demonstrations for prompting it as they are generalized from the model's training examples. Self-Explain is shown to significantly outperform using human-crafted exemplars and other baselines. We further reveal that for in-context teaching, rationales by distinct teacher LLMs or human experts that more resemble the student LLM's self-explanations are better demonstrations, which supports our encoding specificity hypothesis. We then propose Teach-Back that aligns the teacher LLM with the student to enhance the in-context teaching performance. For example, Teach-Back enables a 7B model to teach the much larger GPT-3.5 in context, surpassing human teachers by around 5% in test accuracy on medical question answering.

6/18/2024

cs.CL