Local Explanations and Self-Explanations for Assessing Faithfulness in black-box LLMs

Read original: arXiv:2409.13764 - Published 9/24/2024 by Christos Fragkathoulas, Odysseas S. Chlapanis

Overview

This paper explores methods for assessing the faithfulness of black-box large language models (LLMs) using local explanations and self-explanations.
It combines two sources of evidence: local explanations, derived by perturbing the input to find the parts an individual prediction depends on, and self-explanations, in which the model states its own reasoning.
The goal is to measure how well a model's self-explanations agree with the input parts that actually drive its answers, since the internals of these "black box" models are not directly observable.

Plain English Explanation

The paper looks at ways to better understand how large language models (LLMs) make their decisions. LLMs are powerful AI systems that can generate human-like text, but they are often treated like "black boxes" - it's not always clear how they arrive at their outputs.

The researchers tested two methods to shed light on this "black box":

Local Explanations: These provide insights into individual predictions made by the LLM. For example, removing one piece of the context supplied with a question and checking whether the answer changes shows how much the model relied on that piece.
Self-Explanations: Here, the LLM itself tries to explain its own reasoning process. The model essentially gives a step-by-step account of how it reached a certain conclusion.

The idea is that these techniques can help assess how "faithful" the LLM is - in other words, how well the explanations match the model's actual internal decision-making. This is important because it allows us to better trust and understand the behavior of these powerful but opaque AI systems.

Technical Explanation

The paper proposes two complementary approaches to evaluate the faithfulness of black-box LLMs:

Local Explanations: The authors use a perturbation-based technique, inspired by the commonly used leave-one-out approach, to generate local explanations for individual model predictions. Many questions require additional context to be answered correctly, so the technique perturbs that context and observes how the model's answer changes. The parts that turn out to be sufficient and necessary for a correct answer serve as the explanation.
Self-Explanations: The authors also explore having the LLM itself provide step-by-step "self-explanations" for its predictions. This allows the model to directly articulate its reasoning, potentially revealing more about its inner workings.
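The perturbation idea can be illustrated with plain leave-one-out, which the paper cites as the inspiration for its own, more efficient technique. The sketch below is not the authors' method: the function names, the substring-based correctness check, and the `toy_model` stand-in for a black-box LLM call are all assumptions made for illustration.

```python
from typing import Callable, List


def leave_one_out_necessary(
    question: str,
    context_parts: List[str],
    answer_fn: Callable[[str, List[str]], str],
    gold_answer: str,
) -> List[int]:
    """Return indices of context parts whose removal breaks a correct answer.

    `answer_fn` is a hypothetical wrapper around a black-box model: it takes
    the question and a list of context parts and returns the answer text.
    """

    def is_correct(parts: List[str]) -> bool:
        # Assumption: an answer counts as correct if it contains the gold string.
        return gold_answer.lower() in answer_fn(question, parts).lower()

    if not is_correct(context_parts):
        return []  # the model is wrong even with full context; nothing to attribute

    necessary = []
    for i in range(len(context_parts)):
        reduced = context_parts[:i] + context_parts[i + 1:]  # drop part i only
        if not is_correct(reduced):
            necessary.append(i)
    return necessary


def toy_model(question: str, parts: List[str]) -> str:
    """Hypothetical stand-in for an LLM: answers only if the key fact is present."""
    for part in parts:
        if "capital of France" in part:
            return "Paris"
    return "unknown"


context = [
    "France is in Europe.",
    "Paris is the capital of France.",
    "The Seine flows through Paris.",
]
necessary_parts = leave_one_out_necessary(
    "What is the capital of France?", context, toy_model, "Paris"
)
```

Plain leave-one-out needs one model call per context part plus one for the full context, which is the cost an efficient alternative would aim to reduce.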

The paper validates the approach on the Natural Questions dataset, a question-answering benchmark. Because the internals of a black-box model are not accessible, faithfulness is scored with a proposed metric that compares the context parts found to be crucial by perturbation with the parts the model points to in its self-explanation.
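One simple way to turn a comparison between perturbation-derived crucial parts and a model's self-explanation into a number is a set-overlap score. The sketch below uses Jaccard similarity over part indices; this is an assumption chosen for illustration and is not the metric defined in the paper. The function name and the convention for two empty sets are likewise hypothetical.

```python
from typing import Iterable


def overlap_faithfulness(crucial: Iterable[int], self_explained: Iterable[int]) -> float:
    """Jaccard overlap between two sets of context-part indices.

    crucial:        parts found to matter by perturbing the input
    self_explained: parts the model names in its own explanation
    Returns a value in [0, 1]; 1.0 means the two sets agree exactly.
    """
    crucial_set, cited_set = set(crucial), set(self_explained)
    union = crucial_set | cited_set
    if not union:
        # Assumption: two empty sets are treated as full agreement.
        return 1.0
    return len(crucial_set & cited_set) / len(union)


# The model cites parts 1 and 2, but only part 1 was found to be crucial.
score = overlap_faithfulness([1], [1, 2])
```

A score well below 1 here would indicate that the self-explanation cites context the answer did not depend on, or omits context it did depend on.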

Critical Analysis

The paper provides a valuable contribution by exploring techniques to better understand the inner workings of black-box LLMs. The authors acknowledge some key limitations:

The local explanation and self-explanation approaches may not fully capture the complex and distributed decision-making processes within large neural networks.
There are open questions around how to best elicit and interpret the self-explanations generated by LLMs.
The proposed methods may be computationally expensive or difficult to scale to very large models.

Additionally, one could argue that the paper focuses primarily on evaluating faithfulness, but does not address how these techniques could be used to actually improve the transparency and interpretability of LLMs. Further research is needed to understand how explanations can be leveraged to enhance model development and deployment.

Conclusion

This paper makes an important step towards assessing the faithfulness of black-box LLMs using local explanations and self-explanations. By shedding light on the internal decision-making of these powerful AI systems, the proposed techniques could enhance trust, accountability, and responsible development of large language models. However, the authors acknowledge limitations that suggest further research is needed to fully unlock the potential of explanation-based approaches for understanding and improving LLMs.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Local Explanations and Self-Explanations for Assessing Faithfulness in black-box LLMs

Christos Fragkathoulas, Odysseas S. Chlapanis

This paper introduces a novel task to assess the faithfulness of large language models (LLMs) using local perturbations and self-explanations. Many LLMs often require additional context to answer certain questions correctly. For this purpose, we propose a new efficient alternative explainability technique, inspired by the commonly used leave-one-out approach. Using this approach, we identify the sufficient and necessary parts for the LLM to generate correct answers, serving as explanations. We propose a metric for assessing faithfulness that compares these crucial parts with the self-explanations of the model. Using the Natural Questions dataset, we validate our approach, demonstrating its effectiveness in explaining model decisions and assessing faithfulness.

9/24/2024

Are self-explanations from Large Language Models faithful?

Andreas Madsen, Sarath Chandar, Siva Reddy

Instruction-tuned Large Language Models (LLMs) excel at many tasks and will even explain their reasoning, so-called self-explanations. However, convincing and wrong self-explanations can lead to unsupported confidence in LLMs, thus increasing risk. Therefore, it's important to measure if self-explanations truly reflect the model's behavior. Such a measure is called interpretability-faithfulness and is challenging to perform since the ground truth is inaccessible, and many LLMs only have an inference API. To address this, we propose employing self-consistency checks to measure faithfulness. For example, if an LLM says a set of words is important for making a prediction, then it should not be able to make its prediction without these words. While self-consistency checks are a common approach to faithfulness, they have not previously been successfully applied to LLM self-explanations for counterfactual, feature attribution, and redaction explanations. Our results demonstrate that faithfulness is explanation, model, and task-dependent, showing self-explanations should not be trusted in general. For example, with sentiment classification, counterfactuals are more faithful for Llama2, feature attribution for Mistral, and redaction for Falcon 40B.

5/20/2024

FaithLM: Towards Faithful Explanations for Large Language Models

Yu-Neng Chuang, Guanchu Wang, Chia-Yuan Chang, Ruixiang Tang, Shaochen Zhong, Fan Yang, Mengnan Du, Xuanting Cai, Xia Hu

Large Language Models (LLMs) have become proficient in addressing complex tasks by leveraging their extensive internal knowledge and reasoning capabilities. However, the black-box nature of these models complicates the task of explaining their decision-making processes. While recent advancements demonstrate the potential of leveraging LLMs to self-explain their predictions through natural language (NL) explanations, their explanations may not accurately reflect the LLMs' decision-making process due to a lack of fidelity optimization on the derived explanations. Measuring the fidelity of NL explanations is a challenging issue, as it is difficult to manipulate the input context to mask the semantics of these explanations. To this end, we introduce FaithLM to explain the decision of LLMs with NL explanations. Specifically, FaithLM designs a method for evaluating the fidelity of NL explanations by incorporating the contrary explanations to the query process. Moreover, FaithLM conducts an iterative process to improve the fidelity of derived explanations. Experiment results on three datasets from multiple domains demonstrate that FaithLM can significantly improve the fidelity of derived explanations, which also provides a better alignment with the ground-truth explanations.

6/27/2024

On Measuring Faithfulness or Self-consistency of Natural Language Explanations

Letitia Parcalabescu, Anette Frank

Large language models (LLMs) can explain their predictions through post-hoc or Chain-of-Thought (CoT) explanations. But an LLM could make up reasonably sounding explanations that are unfaithful to its underlying reasoning. Recent work has designed tests that aim to judge the faithfulness of post-hoc or CoT explanations. In this work we argue that these faithfulness tests do not measure faithfulness to the models' inner workings -- but rather their self-consistency at output level. Our contributions are three-fold: i) We clarify the status of faithfulness tests in view of model explainability, characterising them as self-consistency tests instead. This assessment we underline by ii) constructing a Comparative Consistency Bank for self-consistency tests that for the first time compares existing tests on a common suite of 11 open LLMs and 5 tasks -- including iii) our new self-consistency measure CC-SHAP. CC-SHAP is a fine-grained measure (not a test) of LLM self-consistency. It compares how a model's input contributes to the predicted answer and to generating the explanation. Our fine-grained CC-SHAP metric allows us iii) to compare LLM behaviour when making predictions and to analyse the effect of other consistency tests at a deeper level, which takes us one step further towards measuring faithfulness by bringing us closer to the internals of the model than strictly surface output-oriented tests. Our code is available at https://github.com/Heidelberg-NLP/CC-SHAP.

9/20/2024