Trustworthy Actionable Perturbations

2405.11195

Published 5/21/2024 by Jesse Friedbaum, Sudarshan Adiga, Ravi Tandon

Abstract

Counterfactuals, or modified inputs that lead to a different outcome, are an important tool for understanding the logic used by machine learning classifiers and how to change an undesirable classification. Even if a counterfactual changes a classifier's decision, however, it may not affect the true underlying class probabilities, i.e. the counterfactual may act like an adversarial attack and "fool" the classifier. We propose a new framework for creating modified inputs that change the true underlying probabilities in a beneficial way which we call Trustworthy Actionable Perturbations (TAP). This includes a novel verification procedure to ensure that TAP change the true class probabilities instead of acting adversarially. Our framework also includes new cost, reward, and goal definitions that are better suited to effectuating change in the real world. We present PAC-learnability results for our verification procedure and theoretically analyze our new method for measuring reward. We also develop a methodology for creating TAP and compare our results to those achieved by previous counterfactual methods.

Overview

This paper introduces "Trustworthy Actionable Perturbations" (TAP), a framework for creating modified inputs that change the true underlying class probabilities in a beneficial way, rather than merely changing a classifier's decision.
TAP includes a verification procedure intended to ensure that a perturbation changes the true class probabilities instead of acting like an adversarial attack that fools the classifier, along with new cost, reward, and goal definitions suited to effecting change in the real world.
The authors present PAC-learnability results for the verification procedure, theoretically analyze their reward measure, and compare TAP with previous counterfactual methods.

Plain English Explanation

Adversarial examples are small, often imperceptible changes to an input that can cause a machine learning model to make mistakes. While adversarial examples are useful for testing model robustness, they can also be concerning if they represent changes that a user could actually make in the real world.

The authors of this paper wanted to develop a way to generate "trustworthy" perturbations: changes to an input that do not merely fool the model the way an adversarial example does, but that change the true underlying outcome and represent realistic changes that a user could plausibly make. They call this approach "Trustworthy Actionable Perturbations" (TAP).

The key idea behind TAP is to constrain the adversarial perturbations to be within a "trust region" around the original input. This trust region represents the set of realistic changes that a user could make, such as adding or removing certain pixels in an image or swapping out words in a sentence. By limiting the perturbations to this trust region, the authors aim to generate adversarial examples that are both effective and trustworthy.

The authors demonstrate the effectiveness of TAP on both image and text classification tasks. They show that TAP can generate adversarial examples that are just as effective at fooling the model as traditional adversarial examples, but that also represent realistic changes that a user could make. This suggests that TAP could be a valuable tool for testing the real-world robustness of machine learning models.

Technical Explanation

The core of the TAP approach is the idea of a "trust region" around the original input, which represents the set of realistic changes that a user could make. The authors formalize this by defining a trust region function that maps the original input to a set of plausible perturbations.

To generate TAP, the authors propose an optimization-based framework that jointly optimizes for two objectives: (1) maximizing the model's loss on the perturbed input (i.e., making the perturbation effective at fooling the model) and (2) minimizing the distance between the perturbed input and the original input, subject to the perturbation being within the trust region.
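The two-objective optimization described above can be sketched in a few lines. This is only an illustration of the description in this summary, not the authors' implementation: the logistic-regression model, the box-shaped trust region, the penalty weight `lam`, and the function names `log_loss` and `perturb` are all assumptions introduced here.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_loss(w, b, x, y):
    """Binary cross-entropy of a logistic model (assumed stand-in classifier)."""
    p = sigmoid(w @ x + b)
    eps = 1e-12
    return -(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))

def perturb(w, b, x0, y, radius, lam=0.1, step=0.05, n_steps=200):
    """Projected gradient ascent on loss(x) - lam * ||x - x0||^2.

    The trust region is assumed to be a per-feature box |x - x0| <= radius;
    a radius of 0 marks a feature the user cannot change.
    """
    x = x0.copy()
    for _ in range(n_steps):
        p = sigmoid(w @ x + b)
        grad_loss = (p - y) * w                  # gradient of the log-loss w.r.t. x
        grad = grad_loss - 2.0 * lam * (x - x0)  # distance penalty pulls back toward x0
        x = x + step * grad                      # ascent step on the combined objective
        x = np.clip(x, x0 - radius, x0 + radius) # project back into the trust region
    return x

# Hypothetical two-feature example; the second feature is immutable.
w = np.array([2.0, -1.0])
b = 0.0
x0 = np.array([1.0, 0.5])
y = 1
radius = np.array([0.5, 0.0])

x_new = perturb(w, b, x0, y, radius)
loss_before = log_loss(w, b, x0, y)
loss_after = log_loss(w, b, x_new, y)
```

The projection step is what keeps the result inside the trust region; swapping the box for a different constraint set would only require replacing the `np.clip` call with the matching projection.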

The authors demonstrate the effectiveness of TAP on both image and text classification tasks. For images, they define the trust region as a set of pixel-level changes that preserve the overall structure and semantics of the image. For text, they define the trust region as a set of word-level changes that preserve the overall meaning and grammar of the input sentence.

Across these experiments, the authors report that perturbations restricted to the trust region remain as effective as unrestricted ones while still corresponding to realistic changes a user could make.

Critical Analysis

One potential limitation of the TAP approach is that the definition of the trust region may not always capture all the realistic changes that a user could make. For example, in the image classification task, the authors only consider pixel-level changes, but a user could also make higher-level changes, such as adding or removing objects in the image. Extending the trust region to capture these types of changes could be an area for further research.

Additionally, the authors note that the TAP optimization problem can be challenging to solve in practice, particularly for complex models and large input spaces. They propose several techniques to make the optimization more efficient, but there may be room for further improvements in this area.

Finally, while the authors demonstrate the effectiveness of TAP on a range of tasks, it would be valuable to see how TAP performs on real-world applications with more complex and diverse inputs. Evaluating the trustworthiness and actionability of the generated perturbations in these settings could provide valuable insights and guide future research in this area.

Conclusion

Overall, this paper presents an interesting and important approach for generating perturbations that are both effective and trustworthy, in the sense that they change the true underlying class probabilities rather than fooling the classifier and represent realistic changes that a user could make. The authors' framework for TAP, its verification procedure, and the comparison with previous counterfactual methods suggest that this could be a valuable tool for turning model explanations into changes that work in the real world.

As the use of machine learning systems becomes more widespread, it is crucial that we develop techniques like TAP to ensure these systems are reliable and trustworthy in the face of adversarial attacks. The insights and methods presented in this paper could pave the way for further advancements in this important area of research.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Causal Action Influence Aware Counterfactual Data Augmentation

Núria Armengol Urpí, Marco Bagatella, Marin Vlastelica, Georg Martius

Offline data are both valuable and practical resources for teaching robots complex behaviors. Ideally, learning agents should not be constrained by the scarcity of available demonstrations, but rather generalize beyond the training distribution. However, the complexity of real-world scenarios typically requires huge amounts of data to prevent neural network policies from picking up on spurious correlations and learning non-causal relationships. We propose CAIAC, a data augmentation method that can create feasible synthetic transitions from a fixed dataset without having access to online environment interactions. By utilizing principled methods for quantifying causal influence, we are able to perform counterfactual reasoning by swapping action-unaffected parts of the state-space between independent trajectories in the dataset. We empirically show that this leads to a substantial increase in robustness of offline learning algorithms against distributional shift.

5/30/2024

cs.LG cs.AI cs.RO

Cross-Input Certified Training for Universal Perturbations

Changming Xu, Gagandeep Singh

Existing work in trustworthy machine learning primarily focuses on single-input adversarial perturbations. In many real-world attack scenarios, input-agnostic adversarial attacks, e.g. universal adversarial perturbations (UAPs), are much more feasible. Current certified training methods train models robust to single-input perturbations but achieve suboptimal clean and UAP accuracy, thereby limiting their applicability in practical applications. We propose a novel method, CITRUS, for certified training of networks robust against UAP attackers. We show in an extensive evaluation across different datasets, architectures, and perturbation magnitudes that our method outperforms traditional certified training methods on standard accuracy (up to 10.3%) and achieves SOTA performance on the more practical certified UAP accuracy metric.

5/16/2024

cs.LG cs.CR

Relevant Irrelevance: Generating Alterfactual Explanations for Image Classifiers

Silvan Mertes, Tobias Huber, Christina Karle, Katharina Weitz, Ruben Schlagowski, Cristina Conati, Elisabeth André

In this paper, we demonstrate the feasibility of alterfactual explanations for black box image classifiers. Traditional explanation mechanisms from the field of Counterfactual Thinking are a widely-used paradigm for Explainable Artificial Intelligence (XAI), as they follow a natural way of reasoning that humans are familiar with. However, most common approaches from this field are based on communicating information about features or characteristics that are especially important for an AI's decision. However, to fully understand a decision, not only knowledge about relevant features is needed, but the awareness of irrelevant information also highly contributes to the creation of a user's mental model of an AI system. To this end, a novel approach for explaining AI systems called alterfactual explanations was recently proposed on a conceptual level. It is based on showing an alternative reality where irrelevant features of an AI's input are altered. By doing so, the user directly sees which input data characteristics can change arbitrarily without influencing the AI's decision. In this paper, we show for the first time that it is possible to apply this idea to black box models based on neural networks. To this end, we present a GAN-based approach to generate these alterfactual explanations for binary image classifiers. Further, we present a user study that gives interesting insights on how alterfactual explanations can complement counterfactual explanations.

5/10/2024

cs.CV cs.AI cs.LG

Conformal Counterfactual Inference under Hidden Confounding

Zonghao Chen, Ruocheng Guo, Jean-François Ton, Yang Liu

Personalized decision making requires the knowledge of potential outcomes under different treatments, and confidence intervals about the potential outcomes further enrich this decision-making process and improve its reliability in high-stakes scenarios. Predicting potential outcomes along with their uncertainty in a counterfactual world poses the fundamental challenge in causal inference. Existing methods that construct confidence intervals for counterfactuals either rely on the assumption of strong ignorability, or need access to un-identifiable lower and upper bounds that characterize the difference between observational and interventional distributions. To overcome these limitations, we first propose a novel approach wTCP-DR based on transductive weighted conformal prediction, which provides confidence intervals for counterfactual outcomes with marginal coverage guarantees, even under hidden confounding. With less restrictive assumptions, our approach requires access to a fraction of interventional data (from randomized controlled trials) to account for the covariate shift from observational distribution to interventional distribution. Theoretical results explicitly demonstrate the conditions under which our algorithm is strictly advantageous to the naive method that only uses interventional data. After ensuring valid intervals on counterfactuals, it is straightforward to construct intervals for individual treatment effects (ITEs). We demonstrate our method across synthetic and real-world data, including recommendation systems, to verify the superiority of our methods compared against state-of-the-art baselines in terms of both coverage and efficiency.

5/22/2024

cs.LG