Behavior-Targeted Attack on Reinforcement Learning with Limited Access to Victim's Policy

arXiv:2406.03862

Published 6/7/2024 by Shojiro Yamabe, Kazuto Fukuchi, Ryoma Senda, Jun Sakuma

Abstract

This study considers the attack on reinforcement learning agents where the adversary aims to control the victim's behavior as specified by the adversary by adding adversarial modifications to the victim's state observation. While some attack methods reported success in manipulating the victim agent's behavior, these methods often rely on environment-specific heuristics. In addition, all existing attack methods require white-box access to the victim's policy. In this study, we propose a novel method for manipulating the victim agent in the black-box (i.e., the adversary is allowed to observe the victim's state and action only) and no-box (i.e., the adversary is allowed to observe the victim's state only) setting without requiring environment-specific heuristics. Our attack method is formulated as a bi-level optimization problem that is reduced to a distribution matching problem and can be solved by an existing imitation learning algorithm in the black-box and no-box settings. Empirical evaluations on several reinforcement learning benchmarks show that our proposed method has superior attack performance to baselines.

Overview

This paper presents a "behavior-targeted attack" on reinforcement learning (RL) agents: the adversary modifies the victim's state observations so that the victim behaves as the adversary specifies, with only limited access to the victim's policy.
The attack does not rely on white-box access or environment-specific heuristics. It works in the black-box setting (the adversary observes the victim's states and actions) and the no-box setting (the adversary observes the victim's states only).
The authors evaluate the attack on several RL benchmarks, report that it outperforms baselines, and discuss its implications for the security and robustness of RL systems.

Plain English Explanation

The paper describes a new way to attack reinforcement learning (RL) systems, which are a type of AI that learns to make decisions by interacting with an environment and receiving rewards. The attack is designed to steer the RL agent toward a specific behavior chosen by the attacker, rather than simply making the agent perform badly. The attacker does this by tampering with what the agent observes about its environment.

The key idea is that the attacker doesn't need access to the inside of the RL agent's decision-making process (its "policy"). In the black-box setting the attacker sees only the states the agent encounters and the actions it takes; in the no-box setting the attacker sees only the states. By modifying what the agent observes, the attacker steers the agent toward a desired behavior. This is like a hacker who can't directly control a robot, but can watch its movements and tamper with its sensor readings until it does what the hacker wants.
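The two access levels can be made concrete with a toy sketch. Everything in it is hypothetical: the 5-cell line world, `victim_policy`, `shift_left_by_two`, and the log format are illustrative assumptions, not taken from the paper. It only shows that the victim acts on a modified observation, and that the adversary's log contains actions in the black-box setting but not in the no-box setting.

```python
from typing import Callable, Dict, List


def victim_policy(obs: int) -> int:
    """Hypothetical victim: walk toward cell 2 on a 5-cell line (cells 0..4)."""
    if obs < 2:
        return 1   # step right
    if obs > 2:
        return -1  # step left
    return 0       # stay


def rollout(perturb: Callable[[int], int], start: int, horizon: int,
            setting: str) -> List[Dict[str, int]]:
    """Run one attacked episode and return only what the adversary may log."""
    if setting not in ("black-box", "no-box"):
        raise ValueError("setting must be 'black-box' or 'no-box'")
    state = start
    log = []
    for _ in range(horizon):
        obs = perturb(state)          # adversary modifies the observation
        action = victim_policy(obs)   # victim acts on the modified observation
        record = {"state": state}
        if setting == "black-box":
            record["action"] = action  # actions are visible only in black-box
        log.append(record)
        state = min(4, max(0, state + action))
    return log


def shift_left_by_two(state: int) -> int:
    """Hypothetical perturbation: report a cell two positions to the left."""
    return max(0, state - 2)


black = rollout(shift_left_by_two, start=2, horizon=3, setting="black-box")
nobox = rollout(shift_left_by_two, start=2, horizon=3, setting="no-box")
print(black)
print(nobox)
```

In this toy run the victim starts at its preferred cell 2 but is walked to cell 4, because each modified observation makes it believe it is further left than it is.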

The paper shows that this behavior-targeted attack works well in various RL environments, and discusses the implications for the security and robustness of RL systems. As RL becomes more widely used, for example in self-driving cars or robotic assistants, this type of attack could pose a serious threat if not properly addressed.

Technical Explanation

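The paper introduces a "behavior-targeted attack" on reinforcement learning (RL) systems, in which the adversary adds adversarial modifications to the victim's state observations so that the victim follows a behavior the adversary specifies. The abstract states that all existing attack methods require white-box access to the victim's policy and that methods which do manipulate behavior often rely on environment-specific heuristics; the proposed approach needs neither.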

The attack is formulated as a bi-level optimization problem, which the authors reduce to a distribution matching problem. Intuitively, the adversary chooses its observation modifications so that the behavior the victim produces under attack matches the behavior the adversary wants, without ever reading the victim's policy parameters.
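The abstract describes the attack as reducing to a distribution matching problem but this page does not give the formula. One plausible way to write such an objective is sketched below. All notation is an assumption made for illustration and may differ from the paper's: \nu is the adversary's map from true states to modified observations, \pi is the victim's policy, \pi \circ \nu is the victim's effective policy on true states, \rho denotes the distribution over state-action pairs a policy induces, D is some divergence between distributions, and \epsilon is a hypothetical budget on the size of the modification.

```latex
\min_{\nu}\; D\!\left(\rho^{\pi \circ \nu} \,\middle\|\, \rho^{\pi_{\mathrm{tgt}}}\right)
\quad \text{subject to} \quad
\lVert \nu(s) - s \rVert \le \epsilon \;\; \text{for all states } s
```

Read this way, the adversary never optimizes the victim's reward directly; it only tries to make the attacked victim's visitation pattern indistinguishable from that of the target behavior \pi_{\mathrm{tgt}}.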

According to the abstract, the distribution matching problem can then be solved with an existing imitation learning algorithm, using only what the adversary can observe: the victim's states and actions in the black-box setting, or its states alone in the no-box setting. No white-box access to the victim's policy and no environment-specific heuristic is required.
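To illustrate what matching behavior distributions means, here is a minimal toy sketch. It is not the authors' method: it swaps the imitation learning algorithm for a brute-force search over perturbation tables, and the 5-cell line world, `victim_policy`, `target_policy`, the budget `eps`, and the total variation distance are all assumptions chosen for illustration. The search only runs the victim and counts the (state, action) pairs it produces, which is what a black-box adversary could log.

```python
from collections import Counter
from itertools import product

N_STATES = 5   # cells 0..4 on a line (hypothetical environment)
HORIZON = 6    # steps per rollout


def victim_policy(obs):
    """Hypothetical victim: walk toward cell 2."""
    if obs < 2:
        return 1
    if obs > 2:
        return -1
    return 0


def target_policy(state):
    """Behavior the adversary wants: walk to the right end and stay there."""
    return 1 if state < N_STATES - 1 else 0


def occupancy(act, start=0):
    """Count (state, action) visits along one deterministic rollout."""
    counts = Counter()
    state = start
    for _ in range(HORIZON):
        action = act(state)
        counts[(state, action)] += 1
        state = min(N_STATES - 1, max(0, state + action))
    return counts


def tv_distance(p, q):
    """Total variation distance between two visit-count tables."""
    keys = set(p) | set(q)
    return sum(abs(p[k] - q[k]) for k in keys) / (2 * HORIZON)


def best_attack(eps):
    """Brute-force the observation table (within budget eps) whose induced
    victim behavior is closest to the target behavior."""
    target = occupancy(target_policy)
    choices = [
        [o for o in range(N_STATES) if abs(o - s) <= eps]
        for s in range(N_STATES)
    ]
    best_table, best_dist = None, float("inf")
    for table in product(*choices):
        attacked = occupancy(lambda s: victim_policy(table[s]))
        dist = tv_distance(attacked, target)
        if dist < best_dist:
            best_table, best_dist = table, dist
    return best_table, best_dist


for eps in (0, 1, 2):
    table, dist = best_attack(eps)
    print(f"eps={eps}: observation table={table}, distance={dist:.3f}")
```

In this toy, a larger budget lets the adversary match the target behavior more closely. Brute force is only feasible because the world is tiny; a learned adversary, as the abstract describes, would be needed at realistic scale.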

The paper evaluates the proposed attack on several RL benchmarks. The reported results show that it outperforms the baseline attacks at manipulating the victim's behavior, even with limited access to the victim's policy.

Critical Analysis

The proposed behavior-targeted attack is a meaningful step for RL security research, because it shows that an agent's behavior can be steered by an adversary who never sees the agent's policy. Reducing the attack to a distribution matching problem that existing imitation learning algorithms can solve is an effective way to work with limited access to the victim's policy.

However, the paper also acknowledges some limitations. The attack assumes the attacker has some initial knowledge of the victim's behavior, which may not always be the case in real-world scenarios. Additionally, the paper does not explore the long-term effects of the behavior-targeted attack on the RL agent's performance and learning, which could be an important area for further research.

It would also be valuable to see the authors address potential defenses against this type of attack, such as the robustness techniques in "Towards Robust Policy: Enhancing Offline Reinforcement Learning" or the countermeasure proposed in "Stealthy Imitation: Reward-guided Environment-free Policy Stealing", which could make RL agents more robust to behavior manipulation.

Conclusion

This paper introduces a "behavior-targeted attack" on reinforcement learning systems, which can manipulate an agent's behavior with limited access to its policy. The attack steers the agent toward a behavior the adversary specifies by modifying the agent's state observations, and it works in both black-box and no-box settings.

The paper's technical contributions and experimental results demonstrate the importance of considering behavioral robustness in RL security. As RL systems become more widely deployed in critical applications, the development of such attacks and potential defenses will be crucial to ensuring the reliability and trustworthiness of these AI systems.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Optimal Attack and Defense for Reinforcement Learning

Jeremy McMahan, Young Wu, Xiaojin Zhu, Qiaomin Xie

To ensure the usefulness of Reinforcement Learning (RL) in real systems, it is crucial to ensure they are robust to noise and adversarial attacks. In adversarial RL, an external attacker has the power to manipulate the victim agent's interaction with the environment. We study the full class of online manipulation attacks, which include (i) state attacks, (ii) observation attacks (which are a generalization of perceived-state attacks), (iii) action attacks, and (iv) reward attacks. We show the attacker's problem of designing a stealthy attack that maximizes its own expected reward, which often corresponds to minimizing the victim's value, is captured by a Markov Decision Process (MDP) that we call a meta-MDP since it is not the true environment but a higher level environment induced by the attacked interaction. We show that the attacker can derive optimal attacks by planning in polynomial time or learning with polynomial sample complexity using standard RL techniques. We argue that the optimal defense policy for the victim can be computed as the solution to a stochastic Stackelberg game, which can be further simplified into a partially-observable turn-based stochastic game (POTBSG). Neither the attacker nor the victim would benefit from deviating from their respective optimal policies, thus such solutions are truly robust. Although the defense problem is NP-hard, we show that optimal Markovian defenses can be computed (learned) in polynomial time (sample complexity) in many scenarios.

6/18/2024

cs.LG cs.CR cs.GT

Stealthy Imitation: Reward-guided Environment-free Policy Stealing

Zhixiong Zhuang, Maria-Irina Nicolae, Mario Fritz

Deep reinforcement learning policies, which are integral to modern control systems, represent valuable intellectual property. The development of these policies demands considerable resources, such as domain expertise, simulation fidelity, and real-world validation. These policies are potentially vulnerable to model stealing attacks, which aim to replicate their functionality using only black-box access. In this paper, we propose Stealthy Imitation, the first attack designed to steal policies without access to the environment or knowledge of the input range. This setup has not been considered by previous model stealing methods. Lacking access to the victim's input states distribution, Stealthy Imitation fits a reward model that allows to approximate it. We show that the victim policy is harder to imitate when the distribution of the attack queries matches that of the victim. We evaluate our approach across diverse, high-dimensional control tasks and consistently outperform prior data-free approaches adapted for policy stealing. Lastly, we propose a countermeasure that significantly diminishes the effectiveness of the attack.

5/14/2024

cs.CR cs.LG

SUB-PLAY: Adversarial Policies against Partially Observed Multi-Agent Reinforcement Learning Systems

Oubo Ma, Yuwen Pu, Linkang Du, Yang Dai, Ruo Wang, Xiaolei Liu, Yingcai Wu, Shouling Ji

Recent advancements in multi-agent reinforcement learning (MARL) have opened up vast application prospects, such as swarm control of drones, collaborative manipulation by robotic arms, and multi-target encirclement. However, potential security threats during the MARL deployment need more attention and thorough investigation. Recent research reveals that attackers can rapidly exploit the victim's vulnerabilities, generating adversarial policies that result in the failure of specific tasks. For instance, reducing the winning rate of a superhuman-level Go AI to around 20%. Existing studies predominantly focus on two-player competitive environments, assuming attackers possess complete global state observation. In this study, we unveil, for the first time, the capability of attackers to generate adversarial policies even when restricted to partial observations of the victims in multi-agent competitive environments. Specifically, we propose a novel black-box attack (SUB-PLAY) that incorporates the concept of constructing multiple subgames to mitigate the impact of partial observability and suggests sharing transitions among subpolicies to improve attackers' exploitative ability. Extensive evaluations demonstrate the effectiveness of SUB-PLAY under three typical partial observability limitations. Visualization results indicate that adversarial policies induce significantly different activations of the victims' policy networks. Furthermore, we evaluate three potential defenses aimed at exploring ways to mitigate security threats posed by adversarial policies, providing constructive recommendations for deploying MARL in competitive environments.

6/27/2024

cs.LG cs.AI cs.CR

Toward Evaluating Robustness of Reinforcement Learning with Adversarial Policy

Xiang Zheng, Xingjun Ma, Shengjie Wang, Xinyu Wang, Chao Shen, Cong Wang

Reinforcement learning agents are susceptible to evasion attacks during deployment. In single-agent environments, these attacks can occur through imperceptible perturbations injected into the inputs of the victim policy network. In multi-agent environments, an attacker can manipulate an adversarial opponent to influence the victim policy's observations indirectly. While adversarial policies offer a promising technique to craft such attacks, current methods are either sample-inefficient due to poor exploration strategies or require extra surrogate model training under the black-box assumption. To address these challenges, in this paper, we propose Intrinsically Motivated Adversarial Policy (IMAP) for efficient black-box adversarial policy learning in both single- and multi-agent environments. We formulate four types of adversarial intrinsic regularizers -- maximizing the adversarial state coverage, policy coverage, risk, or divergence -- to discover potential vulnerabilities of the victim policy in a principled way. We also present a novel bias-reduction method to balance the extrinsic objective and the adversarial intrinsic regularizers adaptively. Our experiments validate the effectiveness of the four types of adversarial intrinsic regularizers and the bias-reduction method in enhancing black-box adversarial policy learning across a variety of environments. Our IMAP successfully evades two types of defense methods, adversarial training and robust regularizer, decreasing the performance of the state-of-the-art robust WocaR-PPO agents by 34%-54% across four single-agent tasks. IMAP also achieves a state-of-the-art attacking success rate of 83.91% in the multi-agent game YouShallNotPass. Our code is available at https://github.com/x-zheng16/IMAP.

4/29/2024

cs.LG