Improving Interpretation Faithfulness for Vision Transformers

arXiv:2311.17983

Published 5/6/2024 by Lijie Hu, Yixin Liu, Ninghao Liu, Mengdi Huai, Lichao Sun, Di Wang

Abstract

Vision Transformers (ViTs) have achieved state-of-the-art performance for various vision tasks. One reason behind the success lies in their ability to provide plausible innate explanations for the behavior of neural architectures. However, ViTs suffer from issues with explanation faithfulness, as their focal points are fragile to adversarial attacks and can be easily changed with even slight perturbations on the input image. In this paper, we propose a rigorous approach to mitigate these issues by introducing Faithful ViTs (FViTs). Briefly speaking, an FViT should have the following two properties: (1) The top-$k$ indices of its self-attention vector should remain mostly unchanged under input perturbation, indicating stable explanations; (2) The prediction distribution should be robust to perturbations. To achieve this, we propose a new method called Denoised Diffusion Smoothing (DDS), which adopts randomized smoothing and diffusion-based denoising. We theoretically prove that processing ViTs directly with DDS can turn them into FViTs. We also show that Gaussian noise is nearly optimal for both $\ell_2$ and $\ell_\infty$-norm cases. Finally, we demonstrate the effectiveness of our approach through comprehensive experiments and evaluations. Results show that FViTs are more robust against adversarial attacks while maintaining the explainability of attention, indicating higher faithfulness.

Overview

Vision Transformers (ViTs) have achieved state-of-the-art performance in various computer vision tasks.
However, ViTs suffer from issues with explanation faithfulness: the image regions they attend to can change under even small perturbations to the input image.
This paper proposes a new approach called Faithful ViTs (FViTs) to address these issues.

Plain English Explanation

Faithful ViTs (FViTs): Robust and Explainable Vision Transformers

Vision Transformers (ViTs) are a type of artificial intelligence model that has become very good at various computer vision tasks, like identifying objects in images. One reason for their success is that they can provide explanations for their decision-making by showing which parts of an image they are focusing on.

However, these explanations can be unreliable: even small changes to the input image can cause the model to suddenly focus on different parts of the image, making the explanations unstable and untrustworthy. This is an important issue, as we want AI systems to be transparent about how they arrive at their decisions.

To address this, the researchers in this paper propose a new type of Vision Transformer called a Faithful ViT (FViT). An FViT has two key properties:

Stable Explanations: The parts of the image that the FViT model focuses on (its "attention") should remain mostly unchanged, even if the input image is slightly perturbed or modified.
Robust Predictions: The FViT's final predictions about the contents of the image should also be stable and unchanged by small perturbations to the input.
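The two properties can be made concrete with a small numerical check. The sketch below is illustrative only: the function names are hypothetical, the attention and probability vectors are made-up toy values, and total variation distance is used here as one simple stand-in for "robust prediction distribution" (the paper may use a different measure).

```python
import numpy as np

def top_k_overlap(attn_clean, attn_perturbed, k):
    """Fraction of the top-k attention indices shared by the clean and perturbed inputs."""
    top_clean = set(np.argsort(attn_clean)[-k:].tolist())
    top_pert = set(np.argsort(attn_perturbed)[-k:].tolist())
    return len(top_clean & top_pert) / k

def total_variation(p, q):
    """Total variation distance between two prediction distributions (assumed metric)."""
    return 0.5 * float(np.abs(np.asarray(p) - np.asarray(q)).sum())

# Toy attention vectors over 5 image patches (made-up numbers).
attn_clean = np.array([0.05, 0.40, 0.10, 0.30, 0.15])
attn_pert = np.array([0.06, 0.38, 0.09, 0.12, 0.35])

# Property 1: values near 1.0 mean the explanation barely moved.
print(top_k_overlap(attn_clean, attn_pert, k=2))

# Property 2: values near 0.0 mean the prediction barely moved.
print(total_variation([0.7, 0.2, 0.1], [0.6, 0.3, 0.1]))
```

In this toy case only one of the two most-attended patches survives the perturbation, which is the kind of instability an FViT is meant to avoid.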

To achieve these properties, the researchers developed a new technique called Denoised Diffusion Smoothing (DDS). DDS is intended to make ViT models more robust and their explanations more faithful, while keeping the attention maps usable as explanations.

Technical Explanation

Denoised Diffusion Smoothing for Faithful Vision Transformers

The key technical contribution of this paper is the Denoised Diffusion Smoothing (DDS) method, which the authors use to transform standard ViT models into Faithful ViTs (FViTs).

DDS combines two components: randomized smoothing and diffusion-based denoising. Randomized smoothing adds random noise to the input and aggregates the model's behavior over the noisy copies, which makes the attention weights less sensitive to small input changes. Diffusion-based denoising removes that noise before the ViT processes the image, which helps keep the model's predictions robust.
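A minimal sketch of this noise, denoise, and aggregate pattern is shown below. It is not the authors' implementation: `toy_denoiser` and `toy_vit` are hypothetical stand-ins for a diffusion denoiser and a ViT, and the names `dds_smooth`, `sigma`, and `n_samples`, as well as the choice to average over samples, are assumptions made for illustration.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def toy_denoiser(x_noisy, sigma):
    # Hypothetical stand-in for a diffusion denoiser: simple shrinkage.
    return x_noisy / (1.0 + sigma ** 2)

def toy_vit(x):
    # Hypothetical stand-in for a ViT: returns (attention vector, class probabilities).
    flat = x.ravel()
    attn = softmax(flat)
    probs = softmax(np.array([flat.mean(), -flat.mean()]))
    return attn, probs

def dds_smooth(model, denoiser, x, sigma=0.25, n_samples=64, seed=0):
    """Average attention and predictions over Gaussian-noised, then denoised, inputs."""
    rng = np.random.default_rng(seed)
    attn_sum, prob_sum = None, None
    for _ in range(n_samples):
        noisy = x + sigma * rng.standard_normal(x.shape)  # randomized smoothing noise
        denoised = denoiser(noisy, sigma)                 # denoise before the model sees it
        attn, probs = model(denoised)
        attn_sum = attn if attn_sum is None else attn_sum + attn
        prob_sum = probs if prob_sum is None else prob_sum + probs
    return attn_sum / n_samples, prob_sum / n_samples

x = np.linspace(0.0, 1.0, 16).reshape(4, 4)  # toy 4x4 "image"
smoothed_attn, smoothed_probs = dds_smooth(toy_vit, toy_denoiser, x)
```

In a real setting the two stand-ins would be replaced by a pretrained diffusion model and a pretrained ViT; the point of the sketch is only the order of operations.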

The authors show that processing ViTs directly with DDS can turn them into FViTs that satisfy the two key properties: stable attention explanations and robust predictions. They also provide theoretical analysis to show that Gaussian noise is nearly optimal for achieving these properties under both the ℓ₂ and ℓ∞ norms.
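For readers less familiar with the two norms, the standard definitions for a perturbation $\delta \in \mathbb{R}^d$, and the general inequality relating them, are given below. These are textbook facts rather than results from the paper.

```latex
\|\delta\|_2 = \Big(\sum_{i=1}^{d} \delta_i^2\Big)^{1/2},
\qquad
\|\delta\|_\infty = \max_{1 \le i \le d} |\delta_i|,
\qquad
\|\delta\|_\infty \;\le\; \|\delta\|_2 \;\le\; \sqrt{d}\,\|\delta\|_\infty .
```

An $\ell_2$ budget limits the total energy of the perturbation across all pixels, while an $\ell_\infty$ budget limits the largest change to any single pixel.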

Through comprehensive experiments, the researchers demonstrate that FViTs are more robust against adversarial attacks while still maintaining the explainability of their attention mechanisms. This indicates that FViTs achieve higher faithfulness compared to standard ViTs.

Critical Analysis

One potential limitation of this work is that the proposed DDS technique may come with a computational cost, as it requires additional processing steps beyond the standard ViT architecture. The authors do not provide a detailed analysis of the computational complexity or performance impact of DDS.

Additionally, while the experiments show FViTs are more robust to adversarial attacks, the paper does not explore the extent to which this robustness translates to real-world scenarios with natural distribution shifts or common visual distortions. Further testing in diverse, real-world settings would help better assess the practical benefits of FViTs.

The paper also does not address potential tradeoffs between the faithfulness of explanations and overall model performance. It's possible that the techniques used to improve faithfulness could negatively impact other metrics like accuracy or efficiency. Exploring these tradeoffs would provide a more holistic understanding of the benefits and limitations of FViTs.

"Parameter-Efficient Fine-Tuning of Self-Supervised Vision Transformers" and "Exploring Self-Supervised Vision Transformers for Deepfake Detection" provide additional context on the broader developments in Vision Transformer research that could inform future extensions of this work.

Conclusion

This paper introduces Faithful ViTs (FViTs), a new approach to improving the faithfulness and robustness of Vision Transformer (ViT) models. By applying Denoised Diffusion Smoothing (DDS), the researchers were able to create ViT models with two key properties:

Stable attention explanations, where the parts of the input image that the model focuses on remain mostly unchanged even with small perturbations.
Robust predictions, where the model's final output is also insensitive to minor changes in the input.

The results show that FViTs are more resistant to adversarial attacks while still maintaining the explainability of their attention mechanisms. This represents an important step towards building more trustworthy and transparent computer vision AI systems. Further research is needed to fully understand the practical implications and potential tradeoffs of this approach, but the work presented in this paper is a promising advance in the field.

Related Papers

Probabilistic Conceptual Explainers: Trustworthy Conceptual Explanations for Vision Foundation Models

Hengyi Wang, Shiwei Tan, Hao Wang

Vision transformers (ViTs) have emerged as a significant area of focus, particularly for their capacity to be jointly trained with large language models and to serve as robust vision foundation models. Yet, the development of trustworthy explanation methods for ViTs has lagged, particularly in the context of post-hoc interpretations of ViT predictions. Existing sub-image selection approaches, such as feature-attribution and conceptual models, fall short in this regard. This paper proposes five desiderata for explaining ViTs -- faithfulness, stability, sparsity, multi-level structure, and parsimony -- and demonstrates the inadequacy of current methods in meeting these criteria comprehensively. We introduce a variational Bayesian explanation framework, dubbed ProbAbilistic Concept Explainers (PACE), which models the distributions of patch embeddings to provide trustworthy post-hoc conceptual explanations. Our qualitative analysis reveals the distributions of patch-level concepts, elucidating the effectiveness of ViTs by modeling the joint distribution of patch embeddings and ViT's predictions. Moreover, these patch-level explanations bridge the gap between image-level and dataset-level explanations, thus completing the multi-level structure of PACE. Through extensive experiments on both synthetic and real-world datasets, we demonstrate that PACE surpasses state-of-the-art methods in terms of the defined desiderata.

6/21/2024

cs.LG cs.AI cs.CV stat.ML

DiffiT: Diffusion Vision Transformers for Image Generation

Ali Hatamizadeh, Jiaming Song, Guilin Liu, Jan Kautz, Arash Vahdat

Diffusion models with their powerful expressivity and high sample quality have achieved State-Of-The-Art (SOTA) performance in the generative domain. The pioneering Vision Transformer (ViT) has also demonstrated strong modeling capabilities and scalability, especially for recognition tasks. In this paper, we study the effectiveness of ViTs in diffusion-based generative learning and propose a new model denoted as Diffusion Vision Transformers (DiffiT). Specifically, we propose a methodology for finegrained control of the denoising process and introduce the Time-dependant Multihead Self Attention (TMSA) mechanism. DiffiT is surprisingly effective in generating high-fidelity images with significantly better parameter efficiency. We also propose latent and image space DiffiT models and show SOTA performance on a variety of class-conditional and unconditional synthesis tasks at different resolutions. The Latent DiffiT model achieves a new SOTA FID score of 1.73 on ImageNet-256 dataset while having 19.85%, 16.88% less parameters than other Transformer-based diffusion models such as MDT and DiT, respectively. Code: https://github.com/NVlabs/DiffiT

4/3/2024

cs.CV cs.AI cs.LG

ViTGAN: Training GANs with Vision Transformers

Kwonjoon Lee, Huiwen Chang, Lu Jiang, Han Zhang, Zhuowen Tu, Ce Liu

Recently, Vision Transformers (ViTs) have shown competitive performance on image recognition while requiring less vision-specific inductive biases. In this paper, we investigate if such performance can be extended to image generation. To this end, we integrate the ViT architecture into generative adversarial networks (GANs). For ViT discriminators, we observe that existing regularization methods for GANs interact poorly with self-attention, causing serious instability during training. To resolve this issue, we introduce several novel regularization techniques for training GANs with ViTs. For ViT generators, we examine architectural choices for latent and pixel mapping layers to facilitate convergence. Empirically, our approach, named ViTGAN, achieves comparable performance to the leading CNN-based GAN models on three datasets: CIFAR-10, CelebA, and LSUN bedroom.

5/30/2024

cs.CV cs.LG eess.IV

On the Faithfulness of Vision Transformer Explanations

Junyi Wu, Weitai Kang, Hao Tang, Yuan Hong, Yan Yan

To interpret Vision Transformers, post-hoc explanations assign salience scores to input pixels, providing human-understandable heatmaps. However, whether these interpretations reflect true rationales behind the model's output is still underexplored. To address this gap, we study the faithfulness criterion of explanations: the assigned salience scores should represent the influence of the corresponding input pixels on the model's predictions. To evaluate faithfulness, we introduce Salience-guided Faithfulness Coefficient (SaCo), a novel evaluation metric leveraging essential information of salience distribution. Specifically, we conduct pair-wise comparisons among distinct pixel groups and then aggregate the differences in their salience scores, resulting in a coefficient that indicates the explanation's degree of faithfulness. Our explorations reveal that current metrics struggle to differentiate between advanced explanation methods and Random Attribution, thereby failing to capture the faithfulness property. In contrast, our proposed SaCo offers a reliable faithfulness measurement, establishing a robust metric for interpretations. Furthermore, our SaCo demonstrates that the use of gradient and multi-layer aggregation can markedly enhance the faithfulness of attention-based explanation, shedding light on potential paths for advancing Vision Transformer explainability.

4/3/2024

cs.CV