The Probabilities Also Matter: A More Faithful Metric for Faithfulness of Free-Text Explanations in Large Language Models

Read original: arXiv:2404.03189 - Published 6/10/2024 by Noah Y. Siegel, Oana-Maria Camburu, Nicolas Heess, Maria Perez-Ortiz

Overview

This paper proposes a new metric, Correlational Explanatory Faithfulness (CEF), to better evaluate the faithfulness of free-text explanations generated by large language models.
Previous metrics used in intervention-based faithfulness tests take into account only binary changes in the model's predictions, but the authors argue that the predicted probabilities also matter.
CEF accounts for the total shift in the model's predicted label distribution, not just whether the top prediction changes.

Plain English Explanation

The paper is about improving the way we evaluate the explanations that large language models provide for their predictions. When prompted, these models can generate natural language explanations that sound plausible, but current methods for measuring how well those explanations match the model's true reasoning have some limitations.

The key issue is that existing intervention-based faithfulness tests register only binary changes: whether an edit to the input flips the model's top prediction. However, the authors argue that the probabilities behind the prediction are also important. For example, an edit might drop the model's confidence in a label from 90% to 55%. The top prediction is unchanged, so a binary test records no effect, even though the edit clearly influenced the model.

The new Correlational Explanatory Faithfulness (CEF) metric proposed in the paper aims to capture this. Instead of recording only whether the prediction flipped, it accounts for the total shift in the model's predicted label distribution after an intervention. This provides a more complete measure of whether the explanation captures the factors responsible for the model's prediction, not just the final output.
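The gap between a binary check and a probability-aware one can be illustrated with a short sketch. The function names and the example probabilities are hypothetical, and total variation distance is an assumption chosen here as one reasonable way to measure a shift between two label distributions; the summary itself does not specify the distance used.

```python
def label_flipped(before, after):
    """Binary signal: did the most probable label change?"""
    top_before = max(range(len(before)), key=before.__getitem__)
    top_after = max(range(len(after)), key=after.__getitem__)
    return top_before != top_after


def distribution_shift(before, after):
    """Continuous signal: total variation distance between two label
    distributions (assumed distance measure, for illustration only)."""
    return 0.5 * sum(abs(b - a) for b, a in zip(before, after))


# Hypothetical two-label prediction before and after an input intervention.
before = [0.90, 0.10]
after = [0.55, 0.45]

flipped = label_flipped(before, after)     # top label is unchanged
shift = distribution_shift(before, after)  # but the distribution moved
print(flipped, round(shift, 2))
```

In this made-up case the binary check reports no change, while the distribution-level measure registers a sizeable shift.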

Technical Explanation

The paper first reviews existing faithfulness tests based on input interventions, such as the Counterfactual Test (CT) of Atanasova et al. (2023). The authors argue these tests are limited because they take into account only binary changes in the model's predictions, not shifts in the predicted probabilities.

To address this, the authors introduce the Correlational Explanatory Faithfulness (CEF) metric. CEF accounts for the total shift in the model's predicted label distribution under an input intervention, rather than only whether the predicted label changes. They then instantiate CEF on the Counterfactual Test to obtain the Correlational Counterfactual Test (CCT).
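A minimal sketch of a probability-aware, correlation-based faithfulness score follows. It is not the authors' implementation: the function names, the `examples` data, the use of total variation distance for the prediction shift, the binary "explanation mentions the intervention" flag, and the choice of Pearson correlation are all assumptions made for illustration.

```python
import math


def total_variation_distance(p, q):
    """Half the L1 distance between two label distributions."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when a variable is constant")
    return cov / math.sqrt(var_x * var_y)


def correlational_faithfulness(examples):
    """Correlate how far each intervention moved the prediction with
    whether the explanation mentioned that intervention."""
    impacts = [total_variation_distance(e["before"], e["after"]) for e in examples]
    mentions = [1.0 if e["mentioned"] else 0.0 for e in examples]
    return pearson(impacts, mentions)


# Hypothetical interventions: label distributions before and after the edit,
# and whether the generated explanation mentioned the edited content.
examples = [
    {"before": [0.9, 0.1], "after": [0.5, 0.5], "mentioned": True},
    {"before": [0.8, 0.2], "after": [0.8, 0.2], "mentioned": False},
    {"before": [0.6, 0.4], "after": [0.3, 0.7], "mentioned": True},
    {"before": [0.7, 0.3], "after": [0.65, 0.35], "mentioned": False},
]

score = correlational_faithfulness(examples)
print(f"faithfulness score: {score:.3f}")
```

Under these assumptions, a score near 1 means explanations tend to mention exactly the interventions that moved the prediction most, while a score near 0 means mentions are unrelated to impact.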

The paper evaluates the faithfulness of free-text explanations generated by few-shot-prompted LLMs from the Llama2 family on three NLP tasks. The authors find that their metric measures aspects of faithfulness that the original Counterfactual Test misses.

Critical Analysis

The key strength of this work is that it addresses an important limitation in how we evaluate the faithfulness of language model explanations. Checking only whether the top prediction changes ignores valuable information carried by the model's predicted probabilities.

That said, some questions about the CEF metric remain open. For example, it's unclear how CEF would behave when the model's probability estimates are themselves unreliable or miscalibrated. Additionally, the experiments cover a single model family and three tasks, so further testing would be needed to assess how well the approach generalizes.

Overall, this is a solid contribution that points the way towards more comprehensive faithfulness metrics. But there is still room for further research to fully understand the strengths, weaknesses, and appropriate applications of the CEF approach.

Conclusion

This paper introduces a new metric called Correlational Explanatory Faithfulness (CEF) that aims to provide a more faithful evaluation of the explanations generated by large language models. By considering the total shift in the model's predicted label distribution, rather than just binary changes in the prediction, CEF offers a more nuanced assessment of how well the explanation reflects the factors behind the model's prediction.

The results suggest the metric captures aspects of faithfulness that the binary Counterfactual Test misses. This is an important step towards better understanding and improving the transparency of these powerful language models. Further research is needed to fully explore the strengths and limitations of the approach, but this work represents a valuable contribution to the field.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

The Probabilities Also Matter: A More Faithful Metric for Faithfulness of Free-Text Explanations in Large Language Models

Noah Y. Siegel, Oana-Maria Camburu, Nicolas Heess, Maria Perez-Ortiz

In order to oversee advanced AI systems, it is important to understand their underlying decision-making process. When prompted, large language models (LLMs) can provide natural language explanations or reasoning traces that sound plausible and receive high ratings from human annotators. However, it is unclear to what extent these explanations are faithful, i.e., truly capture the factors responsible for the model's predictions. In this work, we introduce Correlational Explanatory Faithfulness (CEF), a metric that can be used in faithfulness tests based on input interventions. Previous metrics used in such tests take into account only binary changes in the predictions. Our metric accounts for the total shift in the model's predicted label distribution, more accurately reflecting the explanations' faithfulness. We then introduce the Correlational Counterfactual Test (CCT) by instantiating CEF on the Counterfactual Test (CT) from Atanasova et al. (2023). We evaluate the faithfulness of free-text explanations generated by few-shot-prompted LLMs from the Llama2 family on three NLP tasks. We find that our metric measures aspects of faithfulness which the CT misses.

6/10/2024

Calibrating the Confidence of Large Language Models by Eliciting Fidelity

Mozhi Zhang, Mianqiu Huang, Rundong Shi, Linsen Guo, Chong Peng, Peng Yan, Yaqian Zhou, Xipeng Qiu

Large language models optimized with techniques like RLHF have achieved good alignment in being helpful and harmless. However, post-alignment, these language models often exhibit overconfidence, where the expressed confidence does not accurately calibrate with their correctness rate. In this paper, we decompose the language model confidence into the Uncertainty about the question and the Fidelity to the answer generated by language models. Then, we propose a plug-and-play method to estimate the confidence of language models. Our method has shown good calibration performance by conducting experiments with 6 RLHF-LMs on four MCQA datasets. Moreover, we propose two novel metrics, IPR and CE, to evaluate the calibration of the model, and we have conducted a detailed discussion on Truly Well-Calibrated Confidence. Our method could serve as a strong baseline, and we hope that this work will provide some insights into the model confidence calibration.

4/4/2024

FaithLM: Towards Faithful Explanations for Large Language Models

Yu-Neng Chuang, Guanchu Wang, Chia-Yuan Chang, Ruixiang Tang, Shaochen Zhong, Fan Yang, Mengnan Du, Xuanting Cai, Xia Hu

Large Language Models (LLMs) have become proficient in addressing complex tasks by leveraging their extensive internal knowledge and reasoning capabilities. However, the black-box nature of these models complicates the task of explaining their decision-making processes. While recent advancements demonstrate the potential of leveraging LLMs to self-explain their predictions through natural language (NL) explanations, their explanations may not accurately reflect the LLMs' decision-making process due to a lack of fidelity optimization on the derived explanations. Measuring the fidelity of NL explanations is a challenging issue, as it is difficult to manipulate the input context to mask the semantics of these explanations. To this end, we introduce FaithLM to explain the decision of LLMs with NL explanations. Specifically, FaithLM designs a method for evaluating the fidelity of NL explanations by incorporating the contrary explanations to the query process. Moreover, FaithLM conducts an iterative process to improve the fidelity of derived explanations. Experiment results on three datasets from multiple domains demonstrate that FaithLM can significantly improve the fidelity of derived explanations, which also provides a better alignment with the ground-truth explanations.

6/27/2024

Local Explanations and Self-Explanations for Assessing Faithfulness in black-box LLMs

Christos Fragkathoulas, Odysseas S. Chlapanis

This paper introduces a novel task to assess the faithfulness of large language models (LLMs) using local perturbations and self-explanations. Many LLMs often require additional context to answer certain questions correctly. For this purpose, we propose a new efficient alternative explainability technique, inspired by the commonly used leave-one-out approach. Using this approach, we identify the sufficient and necessary parts for the LLM to generate correct answers, serving as explanations. We propose a metric for assessing faithfulness that compares these crucial parts with the self-explanations of the model. Using the Natural Questions dataset, we validate our approach, demonstrating its effectiveness in explaining model decisions and assessing faithfulness.

9/24/2024