Are self-explanations from Large Language Models faithful?

2401.07927

Published 5/20/2024 by Andreas Madsen, Sarath Chandar, Siva Reddy


Abstract

Instruction-tuned Large Language Models (LLMs) excel at many tasks and will even explain their reasoning, so-called self-explanations. However, convincing and wrong self-explanations can lead to unsupported confidence in LLMs, thus increasing risk. Therefore, it's important to measure if self-explanations truly reflect the model's behavior. Such a measure is called interpretability-faithfulness and is challenging to perform since the ground truth is inaccessible, and many LLMs only have an inference API. To address this, we propose employing self-consistency checks to measure faithfulness. For example, if an LLM says a set of words is important for making a prediction, then it should not be able to make its prediction without these words. While self-consistency checks are a common approach to faithfulness, they have not previously been successfully applied to LLM self-explanations for counterfactual, feature attribution, and redaction explanations. Our results demonstrate that faithfulness is explanation, model, and task-dependent, showing self-explanations should not be trusted in general. For example, with sentiment classification, counterfactuals are more faithful for Llama2, feature attribution for Mistral, and redaction for Falcon 40B.


Overview

Large language models (LLMs) excel at many tasks and can provide self-explanations for their reasoning
However, these self-explanations may be convincing but wrong, leading to unjustified confidence in the model
It's important to measure whether self-explanations truly reflect the model's behavior, a concept called "interpretability-faithfulness"
This is challenging because the ground truth is inaccessible and many LLMs only have an inference API
To address this, the researchers propose using self-consistency checks to measure faithfulness

Plain English Explanation

Large language models are incredibly capable at various tasks, and they can even explain their own reasoning, which is known as "self-explanations." However, these self-explanations can sometimes be convincing but actually wrong, causing people to trust the model's outputs more than they should.

To address this issue, the researchers wanted a way to measure whether these self-explanations genuinely reflect how the model behaves. They call this "interpretability-faithfulness." It is hard to measure because the model's true inner reasoning is not accessible, and many of these language models can only be queried through an inference API.

To overcome this, the researchers suggest using "self-consistency checks." The idea is that if a model says certain words are important for making a prediction, then it shouldn't be able to make that prediction without those words. By checking this self-consistency, the researchers can get a sense of how faithful the model's self-explanations really are.

Self-consistency checks are a common way to evaluate faithfulness, but they had not previously been applied successfully to the kinds of self-explanations studied here: counterfactual, feature attribution, and redaction explanations. The researchers' results show that faithfulness depends on the specific explanation, model, and task, so self-explanations shouldn't be blindly trusted in general.

Technical Explanation

The researchers investigate the "interpretability-faithfulness" of the self-explanations provided by large language models (LLMs). Interpretability-faithfulness refers to how well the model's self-explanations align with its actual behavior. This is important because convincing but incorrect self-explanations can lead to unjustified trust in the model's outputs.

Measuring interpretability-faithfulness is challenging, as the ground truth of the model's internal reasoning is inaccessible, and many LLMs only provide an inference API. To address this, the researchers propose using "self-consistency checks" - for example, verifying that an LLM cannot make a prediction without the words it claims are important.
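The check described above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: `redact`, `is_self_consistent`, the `[REDACTED]` mask token, and the keyword-based `toy_classify` stand-in for an LLM are all hypothetical names chosen for this example, and the pass criterion (the prediction no longer equals the original label) is an assumption.

```python
import re
from typing import Callable, Iterable


def redact(text: str, words: Iterable[str], mask: str = "[REDACTED]") -> str:
    """Replace each whole-word occurrence of the given words with a mask token."""
    result = text
    for word in words:
        result = re.sub(rf"\b{re.escape(word)}\b", mask, result, flags=re.IGNORECASE)
    return result


def is_self_consistent(
    classify: Callable[[str], str], text: str, important_words: Iterable[str]
) -> bool:
    """Return True if removing the claimed-important words changes the prediction.

    Assumption: the check passes when the prediction on the redacted text
    differs from the original prediction (for example, becomes "unknown").
    """
    original = classify(text)
    after_redaction = classify(redact(text, important_words))
    return after_redaction != original


def toy_classify(text: str) -> str:
    """Hypothetical stand-in for an LLM call: a keyword-based sentiment classifier."""
    lowered = text.lower()
    if "great" in lowered or "loved" in lowered:
        return "positive"
    if "awful" in lowered or "boring" in lowered:
        return "negative"
    return "unknown"


review = "The acting was great and I loved the ending."

# The explanation names the words the classifier actually relies on: check passes.
print(is_self_consistent(toy_classify, review, ["great", "loved"]))  # True

# The explanation names an irrelevant word: the prediction survives redaction.
print(is_self_consistent(toy_classify, review, ["acting"]))  # False
```

In practice `classify` would wrap a call to the model's inference API, which is all this kind of check requires.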

While self-consistency checks are a common approach to evaluating faithfulness, the researchers note that they had not previously been applied successfully to LLM self-explanations in the form of counterfactual, feature attribution, and redaction explanations.
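For the counterfactual case, one plausible form of the check is: the model edits an input so that it should receive a target label, and the check passes if the model then actually predicts that label on its own edit. The sketch below illustrates this and aggregates the outcome into a pass rate. The names `counterfactual_passes`, `faithfulness_rate`, and `toy_classify` are hypothetical, and the paper's actual prompts and scoring are not given in this summary.

```python
from typing import Callable, Sequence, Tuple


def counterfactual_passes(
    classify: Callable[[str], str], counterfactual_text: str, target_label: str
) -> bool:
    """Return True if the model predicts the target label on its own counterfactual."""
    return classify(counterfactual_text) == target_label


def faithfulness_rate(
    classify: Callable[[str], str], cases: Sequence[Tuple[str, str]]
) -> float:
    """Fraction of (counterfactual_text, target_label) pairs that pass the check."""
    if not cases:
        raise ValueError("at least one case is required")
    passed = sum(counterfactual_passes(classify, text, label) for text, label in cases)
    return passed / len(cases)


def toy_classify(text: str) -> str:
    """Hypothetical stand-in for an LLM call: 'boring' means negative, else positive."""
    return "negative" if "boring" in text.lower() else "positive"


# Each pair is a counterfactual edit and the label it was meant to produce.
cases = [
    ("The plot was boring from start to finish.", "negative"),  # prediction flips: pass
    ("The plot was slow from start to finish.", "negative"),  # still positive: fail
]

print(faithfulness_rate(toy_classify, cases))  # 0.5
```

Reporting a rate per explanation type, model, and task is one way to compare them side by side.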

The researchers' results demonstrate that faithfulness is dependent on the specific explanation, model, and task. For example, they find that for sentiment classification, counterfactuals are more faithful for the Llama2 model, feature attribution is more faithful for Mistral, and redaction is more faithful for Falcon 40B. This suggests that self-explanations should not be trusted in general without careful evaluation.

Critical Analysis

The researchers have made an important contribution by highlighting the need to carefully evaluate the faithfulness of self-explanations provided by large language models. Their proposed use of self-consistency checks is a reasonable approach, though it does have some limitations.

One potential issue is that self-consistency checks may not capture all aspects of faithfulness. For example, a model could pass a self-consistency test but still have subtle biases or flaws in its reasoning that are not detected. Additionally, the researchers note that the ground truth of the model's internal reasoning is inaccessible, which means there may be aspects of faithfulness that cannot be fully evaluated.

It's also worth considering whether self-explanations are the most appropriate way for large language models to communicate their reasoning. Alternative approaches, such as concept-based explanations, may be more faithful and interpretable. The researchers could have delved deeper into these alternative explanation methods and their potential advantages.

Overall, the researchers have raised an important issue and proposed a reasonable approach to address it. However, further research is needed to fully understand the limitations of self-explanations and develop more reliable methods for evaluating the faithfulness of large language models.

Conclusion

This research highlights the critical need to carefully evaluate the faithfulness of self-explanations provided by large language models. While these models excel at many tasks and can generate compelling self-explanations, the researchers demonstrate that these self-explanations may not always align with the model's true behavior.

By proposing the use of self-consistency checks to measure interpretability-faithfulness, the researchers have provided a valuable tool for assessing the reliability of large language model outputs. Their finding that faithfulness is dependent on the specific explanation, model, and task underscores the importance of not blindly trusting self-explanations and instead subjecting them to rigorous evaluation.

As large language models become increasingly influential in a wide range of applications, understanding their limitations and potential flaws is crucial. This research represents an important step towards developing more transparent and trustworthy AI systems that can be reliably deployed in real-world scenarios.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers


On Measuring Faithfulness or Self-consistency of Natural Language Explanations

Letitia Parcalabescu, Anette Frank

Large language models (LLMs) can explain their predictions through post-hoc or Chain-of-Thought (CoT) explanations. But an LLM could make up reasonably sounding explanations that are unfaithful to its underlying reasoning. Recent work has designed tests that aim to judge the faithfulness of post-hoc or CoT explanations. In this work we argue that these faithfulness tests do not measure faithfulness to the models' inner workings -- but rather their self-consistency at output level. Our contributions are three-fold: i) We clarify the status of faithfulness tests in view of model explainability, characterising them as self-consistency tests instead. This assessment we underline by ii) constructing a Comparative Consistency Bank for self-consistency tests that for the first time compares existing tests on a common suite of 11 open LLMs and 5 tasks -- including iii) our new self-consistency measure CC-SHAP. CC-SHAP is a fine-grained measure (not a test) of LLM self-consistency. It compares how a model's input contributes to the predicted answer and to generating the explanation. Our fine-grained CC-SHAP metric allows us iii) to compare LLM behaviour when making predictions and to analyse the effect of other consistency tests at a deeper level, which takes us one step further towards measuring faithfulness by bringing us closer to the internals of the model than strictly surface output-oriented tests. Our code is available at https://github.com/Heidelberg-NLP/CC-SHAP

6/4/2024

cs.CL cs.AI cs.LG


FaithLM: Towards Faithful Explanations for Large Language Models

Yu-Neng Chuang, Guanchu Wang, Chia-Yuan Chang, Ruixiang Tang, Shaochen Zhong, Fan Yang, Mengnan Du, Xuanting Cai, Xia Hu

Large Language Models (LLMs) have become proficient in addressing complex tasks by leveraging their extensive internal knowledge and reasoning capabilities. However, the black-box nature of these models complicates the task of explaining their decision-making processes. While recent advancements demonstrate the potential of leveraging LLMs to self-explain their predictions through natural language (NL) explanations, their explanations may not accurately reflect the LLMs' decision-making process due to a lack of fidelity optimization on the derived explanations. Measuring the fidelity of NL explanations is a challenging issue, as it is difficult to manipulate the input context to mask the semantics of these explanations. To this end, we introduce FaithLM to explain the decision of LLMs with NL explanations. Specifically, FaithLM designs a method for evaluating the fidelity of NL explanations by incorporating the contrary explanations to the query process. Moreover, FaithLM conducts an iterative process to improve the fidelity of derived explanations. Experiment results on three datasets from multiple domains demonstrate that FaithLM can significantly improve the fidelity of derived explanations, which also provides a better alignment with the ground-truth explanations.

6/27/2024

cs.CL cs.AI cs.LG

Evaluating Readability and Faithfulness of Concept-based Explanations

Meng Li, Haoran Jin, Ruixuan Huang, Zhihao Xu, Defu Lian, Zijia Lin, Di Zhang, Xiting Wang

Despite the surprisingly high intelligence exhibited by Large Language Models (LLMs), we are somehow intimidated to fully deploy them into real-life applications considering their black-box nature. Concept-based explanations arise as a promising avenue for explaining what the LLMs have learned, making them more transparent to humans. However, current evaluations for concepts tend to be heuristic and non-deterministic, e.g. case study or human evaluation, hindering the development of the field. To bridge the gap, we approach concept-based explanation evaluation via faithfulness and readability. We first introduce a formal definition of concept generalizable to diverse concept-based explanations. Based on this, we quantify faithfulness via the difference in the output upon perturbation. We then provide an automatic measure for readability, by measuring the coherence of patterns that maximally activate a concept. This measure serves as a cost-effective and reliable substitute for human evaluation. Finally, based on measurement theory, we describe a meta-evaluation method for evaluating the above measures via reliability and validity, which can be generalized to other tasks as well. Extensive experimental analysis has been conducted to validate and inform the selection of concept evaluation measures.

5/1/2024

cs.AI cs.HC


Large Language Models Help Humans Verify Truthfulness -- Except When They Are Convincingly Wrong

Chenglei Si, Navita Goyal, Sherry Tongshuang Wu, Chen Zhao, Shi Feng, Hal Daumé III, Jordan Boyd-Graber

Large Language Models (LLMs) are increasingly used for accessing information on the web. Their truthfulness and factuality are thus of great interest. To help users make the right decisions about the information they get, LLMs should not only provide information but also help users fact-check it. Our experiments with 80 crowdworkers compare language models with search engines (information retrieval systems) at facilitating fact-checking. We prompt LLMs to validate a given claim and provide corresponding explanations. Users reading LLM explanations are significantly more efficient than those using search engines while achieving similar accuracy. However, they over-rely on the LLMs when the explanation is wrong. To reduce over-reliance on LLMs, we ask LLMs to provide contrastive information - explain both why the claim is true and false, and then we present both sides of the explanation to users. This contrastive explanation mitigates users' over-reliance on LLMs, but cannot significantly outperform search engines. Further, showing both search engine results and LLM explanations offers no complementary benefits compared to search engines alone. Taken together, our study highlights that natural language explanations by LLMs may not be a reliable replacement for reading the retrieved passages, especially in high-stakes settings where over-relying on wrong AI explanations could lead to critical consequences.

4/3/2024

cs.CL cs.HC