Certified $\ell_2$ Attribution Robustness via Uniformly Smoothed Attributions

Read original: arXiv:2405.06361 - Published 5/13/2024 by Fan Wang, Adams Wai-Kin Kong


Overview

Model attribution is a popular tool for explaining the rationale behind model predictions.
Recent work suggests that attributions are vulnerable to small input perturbations, which can change the attribution while leaving the model's prediction unchanged.
Empirical studies show that adversarial training improves attribution robustness, but an effective certified defense is still needed to quantify how robust attributions are.

Plain English Explanation

Model attribution is a technique used to explain the reasons behind the predictions made by machine learning models. However, recent research has found that these explanations can be easily fooled. Researchers have discovered that by making small, barely noticeable changes to the input, they can alter the explanations while keeping the original model prediction the same.

While studies have shown that training models with adversarial examples (inputs designed to trick the model) can improve their robustness, a more rigorous certified defense method is needed to truly understand how resilient these model explanations are.

In this work, the researchers propose using a technique called uniform smoothing to augment the standard model attributions. This involves adding random noise to the input in a controlled way, which helps ensure that the explanations remain stable even when the input is perturbed. The researchers prove that this approach can provide a guaranteed lower bound on the similarity between the explanations for the original and perturbed inputs, no matter how the input is changed within a certain region.

The researchers also develop alternative formulations of this certification process that can be used to determine the maximum size of perturbation or the minimum amount of smoothing required to protect the explanations. They evaluate their method on various datasets and show that it can effectively defend the model attributions against attacks, regardless of the model architecture, training procedure, or dataset size.

Technical Explanation

The researchers propose a uniform smoothing technique that augments standard model attribution methods to make them more robust to input perturbations. Specifically, they draw noise uniformly from a bounded region, add it to the input, and compute the attributions on the resulting noisy inputs.
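The smoothing step described above can be sketched with a Monte Carlo average. This is only an illustration under several assumptions that are not taken from the paper: the noise region is taken to be an $\ell_2$ ball of radius `radius`, the attribution is a plain input gradient, and the model is a hypothetical toy function f(x) = sum(sin(w * x)) whose gradient is known in closed form. All function names are made up for this example.

```python
import numpy as np

def sample_uniform_l2_ball(rng, n_samples, dim, radius):
    """Draw points uniformly (in volume) from the l2 ball of the given radius."""
    directions = rng.standard_normal((n_samples, dim))
    directions /= np.linalg.norm(directions, axis=1, keepdims=True)
    # Scaling by U^(1/dim) spreads the points uniformly over the ball's volume.
    radii = radius * rng.random(n_samples) ** (1.0 / dim)
    return directions * radii[:, None]

def toy_gradient(x, w):
    """Gradient attribution of the hypothetical toy model f(x) = sum(sin(w * x))."""
    return w * np.cos(w * x)

def smoothed_attribution(x, w, radius, n_samples=2000, seed=0):
    """Monte Carlo estimate: average the attribution over uniformly noised inputs."""
    rng = np.random.default_rng(seed)
    noise = sample_uniform_l2_ball(rng, n_samples, x.size, radius)
    return toy_gradient(x + noise, w).mean(axis=0)

x = np.array([0.3, -0.7, 1.1])
w = np.array([3.0, -2.0, 4.0])
print("plain gradient:      ", toy_gradient(x, w))
print("smoothed (radius 0.5):", smoothed_attribution(x, w, radius=0.5))
```

With a real network, `toy_gradient` would be replaced by whichever attribution method is being smoothed; the sampling and averaging steps would stay the same under these assumptions.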

They prove that, for every perturbation within a given attack region, the cosine similarity between the smoothed attribution of the perturbed input and that of the unperturbed input is guaranteed not to fall below a certified lower bound. This bound serves as a certificate of robustness for the attributions.
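To make the certified quantity concrete, the sketch below measures the cosine similarity between smoothed attributions of an input and of randomly perturbed copies of it. It is an empirical check over sampled perturbations, not the paper's certificate: a certificate holds for all perturbations in the region, whereas sampling can only report the worst case it happened to find. The toy model f(x) = sum(sin(w * x)), the $\ell_2$ ball noise region, and all function names are assumptions made for illustration.

```python
import numpy as np

def uniform_l2_ball(rng, n, dim, radius):
    """Points drawn uniformly from the l2 ball of the given radius."""
    d = rng.standard_normal((n, dim))
    d /= np.linalg.norm(d, axis=1, keepdims=True)
    return d * (radius * rng.random(n) ** (1.0 / dim))[:, None]

def grad_toy(x, w):
    """Gradient of the hypothetical toy model f(x) = sum(sin(w * x))."""
    return w * np.cos(w * x)

def smoothed_attr(x, w, radius, n=4000, seed=0):
    """Average attribution over uniformly noised copies of x."""
    rng = np.random.default_rng(seed)
    return grad_toy(x + uniform_l2_ball(rng, n, x.size, radius), w).mean(axis=0)

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def worst_sampled_cosine(x, w, radius, eps, n_trials=50, seed=1):
    """Lowest cosine similarity found among random perturbations with l2 norm eps."""
    rng = np.random.default_rng(seed)
    base = smoothed_attr(x, w, radius)
    worst = 1.0
    for _ in range(n_trials):
        delta = rng.standard_normal(x.size)
        delta *= eps / np.linalg.norm(delta)  # place delta on the eps-sphere
        worst = min(worst, cosine(base, smoothed_attr(x + delta, w, radius)))
    return worst

x = np.array([0.3, -0.7, 1.1, 0.2])
w = np.array([3.0, -2.0, 4.0, 1.5])
print("worst sampled cosine:", worst_sampled_cosine(x, w, radius=1.0, eps=0.1))
```

A certified lower bound, by contrast, would have to sit at or below every value such a search could return for perturbations in the attack region.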

The researchers also derive alternative formulations of this certification process that are equivalent to the original, but provide the maximum size of perturbation or the minimum smoothing radius required to ensure the attributions cannot be perturbed.
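One generic way to read these alternative formulations is as inversions of a lower-bound function: fix a similarity threshold, then solve for the largest perturbation size or the smallest smoothing radius that still meets it. The sketch below does this by bisection. The paper's actual bound is not reproduced here; `toy_bound` is a made-up stand-in, and the monotonicity it relies on (non-increasing in the perturbation size, non-decreasing in the smoothing radius) is an assumption of this example, as are all function names.

```python
def toy_bound(eps, radius):
    """Made-up stand-in for a certified cosine-similarity lower bound (NOT the paper's)."""
    return max(-1.0, 1.0 - 2.0 * eps / radius)

def largest_certified_eps(bound, radius, threshold, eps_hi=10.0, tol=1e-6):
    """Largest eps in [0, eps_hi] with bound(eps, radius) >= threshold.

    Assumes bound is non-increasing in eps. Returns None if even eps = 0 fails.
    """
    lo, hi = 0.0, eps_hi
    if bound(lo, radius) < threshold:
        return None
    if bound(hi, radius) >= threshold:
        return hi
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if bound(mid, radius) >= threshold:
            lo = mid   # mid still meets the threshold, try larger eps
        else:
            hi = mid
    return lo          # lo always satisfies the threshold

def smallest_certified_radius(bound, eps, threshold,
                              radius_lo=1e-6, radius_hi=100.0, tol=1e-6):
    """Smallest radius in [radius_lo, radius_hi] with bound(eps, radius) >= threshold.

    Assumes bound is non-decreasing in radius. Returns None if radius_hi fails.
    """
    lo, hi = radius_lo, radius_hi
    if bound(eps, hi) < threshold:
        return None
    if bound(eps, lo) >= threshold:
        return lo
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if bound(eps, mid) >= threshold:
            hi = mid   # mid meets the threshold, try smaller radius
        else:
            lo = mid
    return hi          # hi always satisfies the threshold

print(largest_certified_eps(toy_bound, radius=1.0, threshold=0.8))
print(smallest_certified_radius(toy_bound, eps=0.1, threshold=0.8))
```

Both searches return the endpoint that is known to satisfy the threshold, so the result errs on the conservative side by at most `tol`.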

The proposed method is evaluated on three different datasets, and the results show that it can effectively protect the model attributions from attacks, regardless of the model architecture, training scheme, or dataset size.

Critical Analysis

The researchers have proposed an innovative technique to certify the robustness of model attributions, which is an important problem given the growing reliance on explainable AI and the potential risks of using unreliable explanations.

One potential limitation of the work is that the certification guarantee only holds for a specific attack region, and it's not clear how to determine the appropriate size of this region in practice. Additionally, the researchers do not explore the trade-offs between the degree of smoothing and the tightness of the robustness guarantee.

It would also be interesting to see how this approach compares to other certified defense methods in terms of computational efficiency and practical applicability.

Overall, this research represents an important step towards building more trustworthy and reliable explainable AI systems, but further work is needed to fully understand the strengths and limitations of the proposed approach.

Conclusion

This paper introduces a novel technique for certifying the robustness of model attributions, a crucial component of explainable AI. By using uniform smoothing to augment standard attribution methods, the researchers are able to provide a guaranteed lower bound on the similarity between the explanations for original and perturbed inputs.

The proposed approach is shown to be effective across a variety of datasets and model architectures, suggesting it could be a valuable tool for ensuring the reliability of model explanations in high-stakes applications. As AI systems become increasingly prominent in decision-making processes, developing robust and trustworthy explanations will be essential for building public confidence and acceptance of these technologies.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!


Related Papers


Certified $\ell_2$ Attribution Robustness via Uniformly Smoothed Attributions

Fan Wang, Adams Wai-Kin Kong

Model attribution is a popular tool to explain the rationales behind model predictions. However, recent work suggests that the attributions are vulnerable to minute perturbations, which can be added to input samples to fool the attributions while maintaining the prediction outputs. Although empirical studies have shown positive performance via adversarial training, an effective certified defense method is eminently needed to understand the robustness of attributions. In this work, we propose to use uniform smoothing technique that augments the vanilla attributions by noises uniformly sampled from a certain space. It is proved that, for all perturbations within the attack region, the cosine similarity between uniformly smoothed attribution of perturbed sample and the unperturbed sample is guaranteed to be lower bounded. We also derive alternative formulations of the certification that is equivalent to the original one and provides the maximum size of perturbation or the minimum smoothing radius such that the attribution can not be perturbed. We evaluate the proposed method on three datasets and show that the proposed method can effectively protect the attributions from attacks, regardless of the architecture of networks, training schemes and the size of the datasets.

5/13/2024

Certifying Adapters: Enabling and Enhancing the Certification of Classifier Adversarial Robustness

Jieren Deng, Hanbin Hong, Aaron Palmer, Xin Zhou, Jinbo Bi, Kaleel Mahmood, Yuan Hong, Derek Aguiar

Randomized smoothing has become a leading method for achieving certified robustness in deep classifiers against $\ell_p$-norm adversarial perturbations. Current approaches for achieving certified robustness, such as data augmentation with Gaussian noise and adversarial training, require expensive training procedures that tune large models for different Gaussian noise levels and thus cannot leverage high-performance pre-trained neural networks. In this work, we introduce a novel certifying adapters framework (CAF) that enables and enhances the certification of classifier adversarial robustness. Our approach makes few assumptions about the underlying training algorithm or feature extractor and is thus broadly applicable to different feature extractor architectures (e.g., convolutional neural networks or vision transformers) and smoothing algorithms. We show that CAF (a) enables certification in uncertified models pre-trained on clean datasets and (b) substantially improves the performance of certified classifiers via randomized smoothing and SmoothAdv at multiple radii in CIFAR-10 and ImageNet. We demonstrate that CAF achieves improved certified accuracies when compared to methods based on random or denoised smoothing, and that CAF is insensitive to certifying adapter hyperparameters. Finally, we show that an ensemble of adapters enables a single pre-trained feature extractor to defend against a range of noise perturbation scales.

5/28/2024

Adaptive Randomized Smoothing: Certifying Multi-Step Defences against Adversarial Examples

Saiyue Lyu, Shadab Shaikh, Frederick Shpilevskiy, Evan Shelhamer, Mathias Lécuyer

We propose Adaptive Randomized Smoothing (ARS) to certify the predictions of our test-time adaptive models against adversarial examples. ARS extends the analysis of randomized smoothing using f-Differential Privacy to certify the adaptive composition of multiple steps. For the first time, our theory covers the sound adaptive composition of general and high-dimensional functions of noisy input. We instantiate ARS on deep image classification to certify predictions against adversarial examples of bounded $L_{\infty}$ norm. In the $L_{\infty}$ threat model, our flexibility enables adaptation through high-dimensional input-dependent masking. We design adaptivity benchmarks, based on CIFAR-10 and CelebA, and show that ARS improves accuracy by $2$ to $5\%$ points. On ImageNet, ARS improves accuracy by $1$ to $3\%$ points over standard RS without adaptivity.

6/18/2024

SPLITZ: Certifiable Robustness via Split Lipschitz Randomized Smoothing

Meiyu Zhong, Ravi Tandon

Certifiable robustness gives the guarantee that small perturbations around an input to a classifier will not change the prediction. There are two approaches to provide certifiable robustness to adversarial examples: a) explicitly training classifiers with small Lipschitz constants, and b) Randomized smoothing, which adds random noise to the input to create a smooth classifier. We propose SPLITZ, a practical and novel approach which leverages the synergistic benefits of both the above ideas into a single framework. Our main idea is to split a classifier into two halves, constrain the Lipschitz constant of the first half, and smooth the second half via randomization. Motivation for SPLITZ comes from the observation that many standard deep networks exhibit heterogeneity in Lipschitz constants across layers. SPLITZ can exploit this heterogeneity while inheriting the scalability of randomized smoothing. We present a principled approach to train SPLITZ and provide theoretical analysis to derive certified robustness guarantees during inference. We present a comprehensive comparison of robustness-accuracy tradeoffs and show that SPLITZ consistently improves upon existing state-of-the-art approaches on MNIST and CIFAR-10 datasets. For instance, with $\ell_2$ norm perturbation budget of $\epsilon=1$, SPLITZ achieves 43.2% top-1 test accuracy on CIFAR-10 dataset compared to state-of-art top-1 test accuracy 39.8%.

7/4/2024