Illusory Attacks: Detectability Matters in Adversarial Attacks on Sequential Decision-Makers

Read original: arXiv:2207.10170 - Published 5/7/2024 by Tim Franzmeyer, Stephen McAleer, João F. Henriques, Jakob N. Foerster, Philip H. S. Torr, Adel Bibi, Christian Schroeder de Witt

Overview

Autonomous agents deployed in the real world need to be robust against adversarial attacks on their sensory inputs.
Existing observation-space attacks on reinforcement learning agents are effective but lack information-theoretic detectability constraints, making them detectable using automated means or human inspection.
The paper introduces ε-illusory attacks, a novel form of adversarial attack that is both effective and of ε-bounded statistical detectability.
The authors propose a novel dual ascent algorithm to learn such attacks end-to-end and empirically find ε-illusory attacks to be significantly harder to detect, both by automated methods and by human participants.

Plain English Explanation

Autonomous agents, like self-driving cars or robots, need to be able to handle unexpected or malicious inputs that could try to trick them. The paper looks at a type of attack called an "observation-space attack," where an adversary tries to manipulate the sensor inputs of the agent to make it behave in a way that is harmful or undesirable.

The key insight is that while these existing attacks are effective, they have a weakness: they can be easily detected, either by automated systems or by humans inspecting the agent's behavior. This is problematic for the adversary because if their attack is detected, it could trigger security measures or other responses that they want to avoid.

To address this, the researchers introduce a new type of attack called ε-illusory attacks. These attacks are designed to be both effective at manipulating the agent's behavior and very difficult to detect, even with advanced methods. The key is that ε-illusory attacks place a strict limit, ε, on how statistically different the attacked inputs are from normal, non-adversarial inputs.

The researchers develop a novel algorithm to learn these ε-illusory attacks and show through experiments that they are indeed much harder to detect than existing attacks, both by automated systems and by human participants. This suggests that we need better ways to detect and defend against these types of stealthy adversarial attacks, as they could pose a significant threat to the safety and reliability of autonomous systems deployed in the real world.

Technical Explanation

The paper proposes a novel form of adversarial attack, the ε-illusory attack, that is both effective at manipulating the behavior of reinforcement learning agents and has ε-bounded statistical detectability. This is in contrast to existing observation-space attacks, which the authors find lack such detectability constraints, making them more easily identifiable.
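One way to write down the trade-off described above is as a constrained optimization problem. This is a sketch and not the paper's exact definition. The symbols are assumptions introduced here: ν is the adversary's perturbation policy, J_adv(ν) is the adversary's objective (for example, the negative of the victim's expected return), P_0 is the distribution over observation sequences without an attack, P_ν is the distribution under attack, and D is a statistical distance. A KL divergence would be one information-theoretic choice for D.

```latex
\max_{\nu} \; J_{\mathrm{adv}}(\nu)
\quad \text{subject to} \quad
D\!\left(P_{0} \,\middle\|\, P_{\nu}\right) \le \varepsilon
```

Setting ε = 0 would force the attacked observations to be statistically indistinguishable from clean ones, while larger ε gives the adversary more freedom at the cost of being easier to detect.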

The authors develop a dual ascent algorithm to learn these ε-illusory attacks end-to-end. The core idea is to optimize the attack to maximize its effectiveness at fooling the agent, while keeping a statistical distance measure that captures the attack's detectability within the bound ε. This ensures the learned attacks are both potent and stealthy.
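To make the dual ascent idea concrete, here is a minimal sketch on a toy problem, not the paper's algorithm or environments. It assumes a one-dimensional adversary that shifts unit-variance Gaussian observations by `delta`, an adversary gain equal to `delta`, and a detectability measure equal to the KL divergence between the shifted and clean distributions (`delta**2 / 2`). All names (`kl_gaussian_shift`, `dual_ascent`) and the step size are hypothetical choices made for illustration.

```python
def kl_gaussian_shift(delta: float) -> float:
    """KL( N(delta, 1) || N(0, 1) ), which equals delta^2 / 2."""
    return 0.5 * delta ** 2


def dual_ascent(epsilon: float, step_size: float = 4.0,
                num_iters: int = 200, lam: float = 1.0):
    """Toy dual ascent: maximize delta subject to KL <= epsilon.

    Lagrangian (assumed toy form): L(delta, lam) = delta - lam * (KL(delta) - epsilon).
    The step size is tuned for epsilon around 0.125 and is an assumption.
    """
    for _ in range(num_iters):
        # Primal step: for fixed lam, the Lagrangian is maximized at delta = 1 / lam.
        delta = 1.0 / lam
        # Dual step: raise lam when the constraint is violated, lower it otherwise.
        violation = kl_gaussian_shift(delta) - epsilon
        lam = max(1e-6, lam + step_size * violation)  # keep the multiplier positive
    return 1.0 / lam, lam


if __name__ == "__main__":
    delta, lam = dual_ascent(epsilon=0.125)
    print(f"delta={delta:.4f}, lambda={lam:.4f}, KL={kl_gaussian_shift(delta):.4f}")
```

In this toy setting the multiplier settles where the detectability constraint is exactly tight: the adversary uses its full "detectability budget" ε and no more. In the setting the paper describes, the primal step would instead be a learned attack policy updated by reinforcement learning rather than a closed-form solution.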

Through extensive experiments, the researchers demonstrate that ε-illusory attacks are significantly harder to detect using automated anomaly detection methods compared to previous attack approaches. Additionally, a small user study with human participants suggests that ε-illusory attacks are also more challenging for people to identify as adversarial.

The authors argue that these findings highlight the need for better anomaly detectors and more robust hardware- and system-level defenses against such stealthy adversarial attacks, as they pose a serious threat to the reliability and safety of autonomous agents deployed in the real world.

Critical Analysis

The paper makes a valuable contribution by introducing the concept of ε-bounded statistical detectability for adversarial attacks, which addresses a key weakness in existing observation-space attacks. By optimizing for both effectiveness and stealthiness, the ε-illusory attack approach represents an important advancement in the field of adversarial machine learning.

However, the user study with human participants was quite small in scale, with only 20 participants. While the results suggest ε-illusory attacks are harder for humans to detect, a larger-scale study would be needed to draw more definitive conclusions. Additionally, the paper does not discuss the computational complexity or scalability of the dual ascent algorithm used to generate the ε-illusory attacks, which could be an important practical consideration.

Further research could explore how well ε-illusory attacks generalize to a wider range of reinforcement learning agents and environments, and investigate potential defenses, including more resilient adversarial detectors and measures that go beyond anomaly detection alone. There may also be opportunities to further study the human evaluation of such attacks and develop more robust approaches to assessing their impact.

Overall, this paper represents an important step forward in understanding and addressing the vulnerabilities of autonomous agents to sophisticated adversarial attacks, and highlights the need for continued research and development of effective defenses.

Conclusion

The paper introduces a novel form of adversarial attack, the ε-illusory attack, that is both effective at manipulating the behavior of reinforcement learning agents and has ε-bounded statistical detectability. This addresses a key weakness in existing observation-space attacks, which can be more easily identified through automated or human-based detection methods.

The researchers develop a dual ascent algorithm to learn these ε-illusory attacks end-to-end, and their experiments demonstrate that the resulting attacks are significantly harder to detect than previous approaches, both through automated anomaly detection and with human participants. This suggests the need for better anomaly detectors and more robust hardware- and system-level defenses to protect autonomous agents deployed in the real world against such stealthy adversarial attacks.

The findings of this work highlight the ongoing challenges in building truly reliable and secure autonomous systems, and the importance of continued research and development in this critical area of study.

Related Papers

Illusory Attacks: Detectability Matters in Adversarial Attacks on Sequential Decision-Makers

Tim Franzmeyer, Stephen McAleer, João F. Henriques, Jakob N. Foerster, Philip H. S. Torr, Adel Bibi, Christian Schroeder de Witt

Autonomous agents deployed in the real world need to be robust against adversarial attacks on sensory inputs. Robustifying agent policies requires anticipating the strongest attacks possible. We demonstrate that existing observation-space attacks on reinforcement learning agents have a common weakness: while effective, their lack of information-theoretic detectability constraints makes them detectable using automated means or human inspection. Detectability is undesirable to adversaries as it may trigger security escalations. We introduce ε-illusory, a novel form of adversarial attack on sequential decision-makers that is both effective and of ε-bounded statistical detectability. We propose a novel dual ascent algorithm to learn such attacks end-to-end. Compared to existing attacks, we empirically find ε-illusory to be significantly harder to detect with automated methods, and a small study with human participants (IRB approval under reference R84123/RE001) suggests they are similarly harder to detect for humans. Our findings suggest the need for better anomaly detectors, as well as effective hardware- and system-level defenses. The project website can be found at https://tinyurl.com/illusory-attacks.

5/7/2024

Optimal Attack and Defense for Reinforcement Learning

Jeremy McMahan, Young Wu, Xiaojin Zhu, Qiaomin Xie

To ensure the usefulness of Reinforcement Learning (RL) in real systems, it is crucial to ensure they are robust to noise and adversarial attacks. In adversarial RL, an external attacker has the power to manipulate the victim agent's interaction with the environment. We study the full class of online manipulation attacks, which include (i) state attacks, (ii) observation attacks (which are a generalization of perceived-state attacks), (iii) action attacks, and (iv) reward attacks. We show the attacker's problem of designing a stealthy attack that maximizes its own expected reward, which often corresponds to minimizing the victim's value, is captured by a Markov Decision Process (MDP) that we call a meta-MDP since it is not the true environment but a higher level environment induced by the attacked interaction. We show that the attacker can derive optimal attacks by planning in polynomial time or learning with polynomial sample complexity using standard RL techniques. We argue that the optimal defense policy for the victim can be computed as the solution to a stochastic Stackelberg game, which can be further simplified into a partially-observable turn-based stochastic game (POTBSG). Neither the attacker nor the victim would benefit from deviating from their respective optimal policies, thus such solutions are truly robust. Although the defense problem is NP-hard, we show that optimal Markovian defenses can be computed (learned) in polynomial time (sample complexity) in many scenarios.

6/18/2024

Towards Imperceptible Backdoor Attack in Self-supervised Learning

Hanrong Zhang, Zhenting Wang, Tingxu Han, Mingyu Jin, Chenlu Zhan, Mengnan Du, Hongwei Wang, Shiqing Ma

Self-supervised learning models are vulnerable to backdoor attacks. Existing backdoor attacks that are effective in self-supervised learning often involve noticeable triggers, like colored patches, which are vulnerable to human inspection. In this paper, we propose an imperceptible and effective backdoor attack against self-supervised models. We first find that existing imperceptible triggers designed for supervised learning are not as effective in compromising self-supervised models. We then identify this ineffectiveness is attributed to the overlap in distributions between the backdoor and augmented samples used in self-supervised learning. Building on this insight, we design an attack using optimized triggers that are disentangled to the augmented transformation in the self-supervised learning, while also remaining imperceptible to human vision. Experiments on five datasets and seven SSL algorithms demonstrate our attack is highly effective and stealthy. It also has strong resistance to existing backdoor defenses. Our code can be found at https://github.com/Zhang-Henry/IMPERATIVE.

5/24/2024

Adversarial Imitation Learning from Visual Observations using Latent Information

Vittorio Giammarino, James Queeney, Ioannis Ch. Paschalidis

We focus on the problem of imitation learning from visual observations, where the learning agent has access to videos of experts as its sole learning source. The challenges of this framework include the absence of expert actions and the partial observability of the environment, as the ground-truth states can only be inferred from pixels. To tackle this problem, we first conduct a theoretical analysis of imitation learning in partially observable environments. We establish upper bounds on the suboptimality of the learning agent with respect to the divergence between the expert and the agent latent state-transition distributions. Motivated by this analysis, we introduce an algorithm called Latent Adversarial Imitation from Observations, which combines off-policy adversarial imitation techniques with a learned latent representation of the agent's state from sequences of observations. In experiments on high-dimensional continuous robotic tasks, we show that our model-free approach in latent space matches state-of-the-art performance. Additionally, we show how our method can be used to improve the efficiency of reinforcement learning from pixels by leveraging expert videos. To ensure reproducibility, we provide free access to our code.

5/27/2024