Robust Infidelity: When Faithfulness Measures on Masked Language Models Are Misleading

Read original: arXiv:2308.06795 - Published 6/4/2024 by Evan Crothers, Herna Viktor, Nathalie Japkowicz

Overview

The paper examines the use of iterative masking to measure the interpretability of neural text classifiers.
It proposes that this approach is better described as "sensitivity to iterative masking" and highlights potential issues with using it to compare interpretability across models.
The paper demonstrates that iterative masking can produce large variations in faithfulness scores for otherwise comparable Transformer-based text classifiers.
It also shows that iteratively masked samples can result in embeddings that fall outside the distribution seen during training, leading to unpredictable model behavior.
The paper explores task-specific considerations that can undermine the principled comparison of interpretability using iterative masking, such as similarities to salience-based adversarial attacks.

Plain English Explanation

Interpreting how neural networks make decisions is an important area of research. A common approach is to calculate "faithfulness" metrics by iteratively masking (hiding) the input tokens that an explanation ranks as most salient and measuring how the model's prediction changes.

The authors argue that what this procedure measures is better described as the model's "sensitivity to iterative masking" than as the "faithfulness" of its explanations. They show that this sensitivity can vary widely between otherwise comparable Transformer-based text classifiers, which makes it a poor basis for comparing their interpretability.

One factor the authors identify is that masked inputs can produce embeddings (numerical representations of the text) that lie outside the distribution the model saw during training. The model then behaves unpredictably, so the resulting scores say less about the quality of an explanation than the metric's name suggests.

The authors also highlight how the task the model is trained on can affect the interpretation of these sensitivity measurements. For example, they note similarities to adversarial attacks that manipulate the input to change the model's prediction.

Overall, the paper provides important insights into the limitations of using iterative masking to compare the interpretability of neural text classifiers, and offers guidance on how these sensitivity measurements should be interpreted.

Technical Explanation

The paper investigates the use of iterative masking as a way to quantify the interpretability of neural text classifiers. Iterative masking involves repeatedly hiding (masking) parts of the input text and measuring how this affects the model's prediction.
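The masking loop described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: `toy_positive_prob` is a hypothetical stand-in classifier, salience is estimated by single-token occlusion, and salience is recomputed after every masking step. The summary does not specify which salience method or schedule the paper uses.

```python
import math
from typing import Callable, List, Sequence

MASK = "[MASK]"  # placeholder token; real tokenizers define their own mask token


def toy_positive_prob(tokens: Sequence[str]) -> float:
    """Hypothetical stand-in classifier: probability of the 'positive' class."""
    weights = {"great": 2.0, "good": 1.0, "boring": -1.5}
    score = sum(weights.get(t, 0.0) for t in tokens)
    return 1.0 / (1.0 + math.exp(-score))


def occlusion_salience(
    tokens: Sequence[str], predict: Callable[[Sequence[str]], float]
) -> List[float]:
    """Salience of each token = drop in prediction when that token alone is masked."""
    base = predict(tokens)
    scores = []
    for i in range(len(tokens)):
        masked = list(tokens)
        masked[i] = MASK
        scores.append(base - predict(masked))
    return scores


def iterative_masking_curve(
    tokens: Sequence[str], predict: Callable[[Sequence[str]], float], steps: int
) -> List[float]:
    """Mask the most salient remaining token at each step and record the prediction."""
    current = list(tokens)
    curve = [predict(current)]
    for _ in range(steps):
        candidates = [i for i, t in enumerate(current) if t != MASK]
        if not candidates:
            break  # nothing left to mask
        salience = occlusion_salience(current, predict)
        target = max(candidates, key=lambda i: salience[i])
        current[target] = MASK
        curve.append(predict(current))
    return curve
```

Calling `iterative_masking_curve(["the", "movie", "was", "great", "and", "good"], toy_positive_prob, steps=2)` returns the prediction before masking followed by the prediction after each masking step. How steeply that curve falls is the quantity usually reported as a faithfulness score.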

The authors propose that the property being measured is better described as "sensitivity to iterative masking" than as explanation "faithfulness". They demonstrate that this sensitivity can vary significantly between otherwise comparable Transformer-based text classifiers, which undermines the use of such faithfulness scores for comparing interpretability across models.
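For concreteness, one widely used way to collapse a masking curve into a single score is an area-over-the-perturbation-curve average. The form below is a generic illustration and not necessarily the exact metric evaluated in the paper; the symbols $f_{\hat{y}}$, $x^{(k)}$, and $K$ are introduced here for illustration only.

```latex
% f_{\hat{y}}(x): model probability for the originally predicted class \hat{y}
% x^{(k)}: input x with its k most salient tokens replaced by the mask token
% K: total number of masking steps
\[
  \mathrm{AOPC}(x) \;=\; \frac{1}{K+1} \sum_{k=0}^{K}
  \left[ f_{\hat{y}}(x) - f_{\hat{y}}\!\left(x^{(k)}\right) \right]
\]
```

A larger value means the prediction drops faster as salient tokens are masked. Under the authors' argument, that number reflects how the model reacts to masked input as much as it reflects the quality of the salience ranking.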

Through experiments, the researchers show that iteratively masked samples can produce embeddings that fall outside the distribution seen during the model's training. This leads to unpredictable model behavior, as the model is making predictions on inputs it was not prepared for.
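A rough way to check whether a masked sample's embedding has drifted away from the training data is to compare its distance from the training embeddings against distances seen within the training set. The sketch below is a generic heuristic under loud assumptions: it uses a per-dimension standardized distance (a diagonal simplification of Mahalanobis distance) and a hypothetical `quantile` threshold. It is not the analysis used in the paper, which this summary does not describe in detail.

```python
import math
from statistics import mean, pstdev
from typing import List, Sequence, Tuple


def fit_diagonal_gaussian(
    train_embeddings: Sequence[Sequence[float]],
) -> Tuple[List[float], List[float]]:
    """Per-dimension mean and standard deviation of the training embeddings."""
    dims = list(zip(*train_embeddings))
    means = [mean(d) for d in dims]
    stds = [max(pstdev(d), 1e-8) for d in dims]  # floor avoids division by zero
    return means, stds


def standardized_distance(
    embedding: Sequence[float], means: Sequence[float], stds: Sequence[float]
) -> float:
    """Euclidean distance after z-scoring each dimension."""
    return math.sqrt(
        sum(((x - m) / s) ** 2 for x, m, s in zip(embedding, means, stds))
    )


def is_out_of_distribution(
    embedding: Sequence[float],
    train_embeddings: Sequence[Sequence[float]],
    quantile: float = 0.95,  # assumed cut-off, chosen for illustration
) -> bool:
    """Flag an embedding that is farther out than the chosen quantile of training points."""
    means, stds = fit_diagonal_gaussian(train_embeddings)
    train_distances = sorted(
        standardized_distance(e, means, stds) for e in train_embeddings
    )
    index = min(len(train_distances) - 1, int(quantile * len(train_distances)))
    threshold = train_distances[index]
    return standardized_distance(embedding, means, stds) > threshold
```

In practice the embeddings would come from the classifier itself, for example a pooled encoder output for the original and for each masked sample. If the flagged fraction rises with the number of masked tokens, that would be consistent with the behaviour the paper reports.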

The paper also explores task-specific considerations that can further complicate the interpretation of these sensitivity measurements. For example, the authors note similarities between iterative masking and salience-based adversarial attacks, which manipulate the input to change the model's prediction.

Overall, the findings suggest that while iterative masking can provide insights into a model's behavior, it may not be a reliable way to compare the interpretability of different text classifiers. The paper encourages careful interpretation of these sensitivity measures and highlights the need for further research into more robust methods for assessing model interpretability.

Critical Analysis

The paper raises important concerns about the use of iterative masking to quantify the interpretability of neural text classifiers. The authors make a compelling case that this approach measures sensitivity to masking rather than the faithfulness of an explanation.

Their experiments demonstrate that this sensitivity can vary significantly between otherwise comparable models, undermining the usefulness of the faithfulness metric for comparing interpretability. The finding that iteratively masked samples can produce embeddings outside the model's training distribution, leading to unpredictable behavior, is a crucial insight.

However, the paper could have explored these issues in greater depth. For example, it would be interesting to see how the observed behaviors vary across different model architectures, training datasets, or task domains. The authors also briefly mention similarities to adversarial attacks, but do not delve into the implications of this connection.

Additionally, while the paper highlights the limitations of iterative masking, it does not propose any alternative methods for assessing model interpretability. Readers are left wondering what approaches might be more suitable for this purpose. Further research is needed to develop robust and reliable techniques for evaluating the interpretability of text classifiers.

Overall, the paper provides valuable insights and raises important questions about the use of iterative masking for quantifying interpretability. Its findings should encourage researchers and practitioners to think critically about the limitations of this approach and explore alternative methods for assessing the transparency and explainability of neural text models.

Conclusion

This paper challenges the common practice of using iterative masking to quantify the interpretability of neural text classifiers. The authors argue that this approach is better described as measuring the model's sensitivity to masking than as measuring the faithfulness of its explanations.

Through experiments, the researchers demonstrate that this sensitivity can vary significantly between otherwise comparable Transformer-based models, undermining the usefulness of the faithfulness metric for comparing interpretability. They also show that iteratively masked samples can produce embeddings outside the model's training distribution, leading to unpredictable behavior.

The paper highlights task-specific considerations, such as similarities to adversarial attacks, that further complicate the interpretation of these sensitivity measurements. Overall, the findings suggest that while iterative masking can provide useful insights, it may not be a reliable way to assess and compare the interpretability of different text classifiers.

The paper encourages careful interpretation of these sensitivity measures and emphasizes the need for further research into more robust and principled methods for evaluating the transparency and explainability of neural text models. As the use of these models continues to grow, developing reliable interpretability assessment tools will be crucial for building trust and ensuring their responsible deployment.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Robust Infidelity: When Faithfulness Measures on Masked Language Models Are Misleading

Evan Crothers, Herna Viktor, Nathalie Japkowicz

A common approach to quantifying neural text classifier interpretability is to calculate faithfulness metrics based on iteratively masking salient input tokens and measuring changes in the model prediction. We propose that this property is better described as sensitivity to iterative masking, and highlight pitfalls in using this measure for comparing text classifier interpretability. We show that iterative masking produces large variation in faithfulness scores between otherwise comparable Transformer encoder text classifiers. We then demonstrate that iteratively masked samples produce embeddings outside the distribution seen during training, resulting in unpredictable behaviour. We further explore task-specific considerations that undermine principled comparison of interpretability using iterative masking, such as an underlying similarity to salience-based adversarial attacks. Our findings give insight into how these behaviours affect neural text classifiers, and provide guidance on how sensitivity to iterative masking should be interpreted.

6/4/2024

Faithfulness Measurable Masked Language Models

Andreas Madsen, Siva Reddy, Sarath Chandar

A common approach to explaining NLP models is to use importance measures that express which tokens are important for a prediction. Unfortunately, such explanations are often wrong despite being persuasive. Therefore, it is essential to measure their faithfulness. One such metric is if tokens are truly important, then masking them should result in worse model performance. However, token masking introduces out-of-distribution issues, and existing solutions that address this are computationally expensive and employ proxy models. Furthermore, other metrics are very limited in scope. This work proposes an inherently faithfulness measurable model that addresses these challenges. This is achieved using a novel fine-tuning method that incorporates masking, such that masking tokens become in-distribution by design. This differs from existing approaches, which are completely model-agnostic but are inapplicable in practice. We demonstrate the generality of our approach by applying it to 16 different datasets and validate it using statistical in-distribution tests. The faithfulness is then measured with 9 different importance measures. Because masking is in-distribution, importance measures that themselves use masking become consistently more faithful. Additionally, because the model makes faithfulness cheap to measure, we can optimize explanations towards maximal faithfulness; thus, our model becomes indirectly inherently explainable.

8/29/2024

Faithfulness and the Notion of Adversarial Sensitivity in NLP Explanations

Supriya Manna, Niladri Sett

Faithfulness is arguably the most critical metric to assess the reliability of explainable AI. In NLP, current methods for faithfulness evaluation are fraught with discrepancies and biases, often failing to capture the true reasoning of models. We introduce Adversarial Sensitivity as a novel approach to faithfulness evaluation, focusing on the explainer's response when the model is under adversarial attack. Our method accounts for the faithfulness of explainers by capturing sensitivity to adversarial input changes. This work addresses significant limitations in existing evaluation techniques, and furthermore, quantifies faithfulness from a crucial yet underexplored paradigm.

9/27/2024

Transformer Circuit Faithfulness Metrics are not Robust

Joseph Miller, Bilal Chughtai, William Saunders

Mechanistic interpretability work attempts to reverse engineer the learned algorithms present inside neural networks. One focus of this work has been to discover 'circuits' -- subgraphs of the full model that explain behaviour on specific tasks. But how do we measure the performance of such circuits? Prior work has attempted to measure circuit 'faithfulness' -- the degree to which the circuit replicates the performance of the full model. In this work, we survey many considerations for designing experiments that measure circuit faithfulness by ablating portions of the model's computation. Concerningly, we find existing methods are highly sensitive to seemingly insignificant changes in the ablation methodology. We conclude that existing circuit faithfulness scores reflect both the methodological choices of researchers as well as the actual components of the circuit - the task a circuit is required to perform depends on the ablation used to test it. The ultimate goal of mechanistic interpretability work is to understand neural networks, so we emphasize the need for more clarity in the precise claims being made about circuits. We open source a library at https://github.com/UFO-101/auto-circuit that includes highly efficient implementations of a wide range of ablation methodologies and circuit discovery algorithms.

7/12/2024