Data Poisoning Attacks on Off-Policy Policy Evaluation Methods

Read original: arXiv:2404.04714 - Published 4/9/2024 by Elita Lobo, Harvineet Singh, Marek Petrik, Cynthia Rudin, Himabindu Lakkaraju

Overview

This paper examines the vulnerability of off-policy policy evaluation (OPE) methods to data poisoning attacks, where an adversary manipulates the training data to cause the OPE algorithm to produce biased estimates.
The authors propose a framework called DOPE (Data Poisoning Attacks on Off-Policy Evaluation) to generate such attacks and evaluate their effectiveness on several OPE algorithms.
The research highlights the importance of developing robust OPE methods that can withstand data poisoning attempts, which is crucial for reliable policy evaluation in real-world applications.

Plain English Explanation

Off-policy policy evaluation (OPE) is a technique for estimating how well a new policy would perform without actually deploying it. This is useful when you want to compare candidate policies but cannot risk deploying one that may perform poorly in the real world. OPE algorithms estimate a policy's performance from historical data that was collected under a different policy.
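
One common way to make such an estimate is importance sampling, which reweights each logged reward by how much more (or less) likely the new policy is to take the logged action than the policy that collected the data. The sketch below is a minimal illustration for a one-step (bandit) setting and is not taken from the paper; the names importance_sampling_estimate, logs, and target_prob, and all of the numbers, are hypothetical.

```python
from typing import Callable, List, Tuple

# One logged interaction: (action taken, probability the data-collecting
# policy assigned to that action, observed reward).
LoggedStep = Tuple[int, float, float]


def importance_sampling_estimate(
    logs: List[LoggedStep],
    target_prob: Callable[[int], float],
) -> float:
    """Estimate the new policy's average reward from logged data."""
    total = 0.0
    for action, behavior_prob, reward in logs:
        # Importance weight: how much more likely the new policy is to
        # take this action than the policy that generated the log.
        weight = target_prob(action) / behavior_prob
        total += weight * reward
    return total / len(logs)


# Hypothetical logs: the data-collecting policy picks each of two actions
# with probability 0.5; action 1 pays reward 1.0 and action 0 pays 0.0.
logs = [(0, 0.5, 0.0), (1, 0.5, 1.0), (1, 0.5, 1.0), (0, 0.5, 0.0)]


# Hypothetical new policy: choose action 1 with probability 0.9.
def target_prob(action: int) -> float:
    return 0.9 if action == 1 else 0.1


estimate = importance_sampling_estimate(logs, target_prob)
print(round(estimate, 3))  # 0.9
```

The logged data has an average reward of 0.5, while the estimate for the new policy is 0.9, because the new policy favors the action that pays off. No deployment of the new policy was needed to obtain that number.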

However, this paper shows that these OPE algorithms can be vulnerable to a type of attack called "data poisoning." In a data poisoning attack, an adversary manipulates the training data in a way that causes the OPE algorithm to produce biased, inaccurate estimates of the policy's performance. This could lead to the deployment of a suboptimal policy, with potentially serious consequences.

The authors propose a framework called DOPE (Data Poisoning Attacks on Off-Policy Evaluation) to generate these types of attacks and test their effectiveness on several OPE algorithms. This research highlights the importance of developing OPE methods that are robust to data poisoning attempts, so that policy evaluation can be conducted reliably in real-world applications.

Technical Explanation

The paper first provides the necessary preliminaries for understanding the OPE problem and the threat of data poisoning attacks. It then introduces the DOPE framework, which consists of three key components:

Attack Objective: The adversary aims to maximize the error in the OPE estimate of the policy's value, subject to a constraint on how much the training data may be perturbed.
Attack Generation: DOPE casts attack generation as an optimization problem and uses influence functions from robust statistics to construct the perturbation that maximizes this error.
Attack Evaluation: The framework evaluates the effectiveness of the generated attacks by measuring the bias and variance of the OPE estimates on the perturbed data, and comparing them to the unperturbed case.
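
The objective and generation steps above can be sketched on a toy case. The code below is a simplified illustration under stated assumptions, not the authors' method: it assumes the adversary may shift the rewards of at most max_points logged samples by at most epsilon each, and it attacks a plain importance sampling (IS) estimator. For that estimator, shifting reward r_i by delta changes the estimate by exactly w_i * delta / n, so the best choice is to perturb the samples with the largest importance weights. The names poison_rewards, epsilon, max_points, and direction are hypothetical.

```python
from typing import List, Tuple

Sample = Tuple[float, float]  # (importance weight, reward)


def is_estimate(samples: List[Sample]) -> float:
    """Plain importance sampling estimate: mean of weight * reward."""
    return sum(w * r for w, r in samples) / len(samples)


def poison_rewards(
    samples: List[Sample],
    epsilon: float,
    max_points: int,
    direction: float = 1.0,
) -> List[Sample]:
    """Shift the rewards of the highest-weight samples by epsilon.

    direction=+1.0 inflates the estimate, direction=-1.0 deflates it.
    Assumption: only rewards are perturbed; weights stay fixed.
    """
    # Rank samples by their effect on the estimate (w_i / n per unit of reward).
    order = sorted(range(len(samples)), key=lambda i: samples[i][0], reverse=True)
    chosen = set(order[:max_points])
    return [
        (w, r + direction * epsilon) if i in chosen else (w, r)
        for i, (w, r) in enumerate(samples)
    ]


# Hypothetical logged samples.
samples = [(0.2, 1.0), (1.8, 0.0), (4.0, 1.0), (0.5, 0.0)]

clean = is_estimate(samples)
poisoned = is_estimate(poison_rewards(samples, epsilon=0.1, max_points=1))
shift = poisoned - clean
print(round(clean, 4), round(poisoned, 4), round(shift, 4))
```

Here a single perturbed reward moves the estimate by 4.0 * 0.1 / 4 = 0.1, whereas perturbing the lowest-weight sample instead would move it by only 0.2 * 0.1 / 4 = 0.005. For estimators that are not linear in the data, the effect of each point is no longer available in closed form, which is where an influence-function approximation becomes useful.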

The paper presents experimental results on several OPE algorithms, including doubly robust estimators and kernel-based methods, using multiple healthcare and control datasets. The results show that DOPE attacks can produce large errors in the value estimates of many of these methods even when the perturbations are small, highlighting the need for more robust methods.
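
One simple way to quantify this kind of degradation is the relative change in an estimator's output between clean and perturbed data. The sketch below is an illustrative harness, not the paper's evaluation protocol: it compares plain importance sampling (IS) with weighted importance sampling (WIS) on a hypothetical dataset in which one reward has been shifted by 0.1. The function relative_change and the data are assumptions made for the example, and the numbers say nothing about the paper's results.

```python
from typing import Callable, Dict, List, Tuple

Sample = Tuple[float, float]  # (importance weight, reward)
Estimator = Callable[[List[Sample]], float]


def is_estimate(samples: List[Sample]) -> float:
    """Plain importance sampling: sum(w * r) / n."""
    return sum(w * r for w, r in samples) / len(samples)


def wis_estimate(samples: List[Sample]) -> float:
    """Weighted importance sampling: sum(w * r) / sum(w)."""
    return sum(w * r for w, r in samples) / sum(w for w, _ in samples)


def relative_change(
    estimator: Estimator, clean: List[Sample], poisoned: List[Sample]
) -> float:
    """Absolute change in the estimate, as a fraction of the clean estimate."""
    base = estimator(clean)
    return abs(estimator(poisoned) - base) / abs(base)


# Hypothetical data: the poisoned copy differs in one reward (1.0 -> 1.1).
clean = [(0.2, 1.0), (1.8, 0.0), (4.0, 1.0), (0.5, 0.0)]
poisoned = [(0.2, 1.0), (1.8, 0.0), (4.0, 1.1), (0.5, 0.0)]

estimators: Dict[str, Estimator] = {"IS": is_estimate, "WIS": wis_estimate}
report = {
    name: relative_change(fn, clean, poisoned) for name, fn in estimators.items()
}
for name, change in report.items():
    print(f"{name}: {100 * change:.1f}% change")
```

In this toy case both estimators change by the same relative amount (0.4 / 4.2, about 9.5%), because they share the numerator sum(w * r) and the perturbation leaves the weights untouched. Perturbations that also alter the importance weights would affect the two estimators differently.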

Critical Analysis

The paper provides a comprehensive analysis of the data poisoning threat against OPE methods, but there are a few potential limitations and areas for further research:

The attack generation process assumes the adversary has complete knowledge of the OPE algorithm and access to the training data. In practice, the adversary may have more limited information, which could affect the feasibility and effectiveness of the attacks.
The paper focuses on untargeted attacks that aim to maximize the overall bias in the OPE estimate. Targeted attacks, where the adversary seeks to bias the estimate in a specific direction, could be an interesting avenue for future research.
The evaluation uses benchmark healthcare and control datasets, and it would be valuable to examine how these attacks play out in deployed, real-world policy evaluation pipelines.
While the paper proposes the DOPE framework, it does not provide solutions for making OPE methods more robust to data poisoning. Developing such defenses is an important next step in this line of research.

Overall, this paper makes a valuable contribution by highlighting the vulnerability of OPE methods to data poisoning attacks and providing a systematic framework for studying this threat. Further research in this area could lead to the development of more secure and reliable policy evaluation techniques.

Conclusion

This paper demonstrates that off-policy policy evaluation (OPE) methods can be vulnerable to data poisoning attacks, where an adversary manipulates the training data to cause the OPE algorithm to produce biased estimates of a policy's performance. The authors propose the DOPE framework to generate and evaluate such attacks, and their experiments show that several state-of-the-art OPE algorithms can be significantly degraded by these attacks.

The findings of this research underscore the importance of developing robust OPE methods that can withstand data poisoning attempts. Reliable policy evaluation is crucial for real-world applications, such as healthcare, education, and finance, where the deployment of suboptimal policies can have serious consequences. Advancing the security and reliability of OPE techniques is an important direction for future research in this field.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Data Poisoning Attacks on Off-Policy Policy Evaluation Methods

Elita Lobo, Harvineet Singh, Marek Petrik, Cynthia Rudin, Himabindu Lakkaraju

Off-policy Evaluation (OPE) methods are a crucial tool for evaluating policies in high-stakes domains such as healthcare, where exploration is often infeasible, unethical, or expensive. However, the extent to which such methods can be trusted under adversarial threats to data quality is largely unexplored. In this work, we make the first attempt at investigating the sensitivity of OPE methods to marginal adversarial perturbations to the data. We design a generic data poisoning attack framework leveraging influence functions from robust statistics to carefully construct perturbations that maximize error in the policy value estimates. We carry out extensive experimentation with multiple healthcare and control datasets. Our results demonstrate that many existing OPE methods are highly prone to generating value estimates with large errors when subject to data poisoning attacks, even for small adversarial perturbations. These findings question the reliability of policy values derived using OPE methods and motivate the need for developing OPE methods that are statistically robust to train-time data poisoning attacks.

4/9/2024

IntOPE: Off-Policy Evaluation in the Presence of Interference

Yuqi Bai, Ziyu Zhao, Minqin Zhu, Kun Kuang

Off-Policy Evaluation (OPE) is employed to assess the potential impact of a hypothetical policy using logged contextual bandit feedback, which is crucial in areas such as personalized medicine and recommender systems, where online interactions are associated with significant risks and costs. Traditionally, OPE methods rely on the Stable Unit Treatment Value Assumption (SUTVA), which assumes that the reward for any given individual is unaffected by the actions of others. However, this assumption often fails in real-world scenarios due to the presence of interference, where an individual's reward is affected not just by their own actions but also by the actions of their peers. This realization reveals significant limitations of existing OPE methods in real-world applications. To address this limitation, we propose IntIPW, an IPW-style estimator that extends the Inverse Probability Weighting (IPW) framework by integrating marginalized importance weights to account for both individual actions and the influence of adjacent entities. Extensive experiments are conducted on both synthetic and real-world data to demonstrate the effectiveness of the proposed IntIPW method.

8/27/2024

AutoOPE: Automated Off-Policy Estimator Selection

Nicolò Felicioni, Michael Benigni, Maurizio Ferrari Dacrema

The Off-Policy Evaluation (OPE) problem consists of evaluating the performance of counterfactual policies with data collected by another one. This problem is of utmost importance for various application domains, e.g., recommendation systems, medical treatments, and many others. To solve the OPE problem, we resort to estimators, which aim to estimate in the most accurate way possible the performance that the counterfactual policies would have had if they were deployed in place of the logging policy. In the literature, several estimators have been developed, all with different characteristics and theoretical guarantees. Therefore, there is no dominant estimator, and each estimator may be the best one for different OPE problems, depending on the characteristics of the dataset at hand. While the selection of the estimator is a crucial choice for an accurate OPE, this problem has been widely overlooked in the literature. We propose an automated data-driven OPE estimator selection method based on machine learning. In particular, the core idea we propose in this paper is to create several synthetic OPE tasks and use a machine learning model trained to predict the best estimator for those synthetic tasks. We empirically show how our method is able to generalize to unseen tasks and make a better estimator selection compared to a baseline method on several real-world datasets, with a computational cost significantly lower than the one of the baseline.

6/27/2024

OPERA: Automatic Offline Policy Evaluation with Re-weighted Aggregates of Multiple Estimators

Allen Nie, Yash Chandak, Christina J. Yuan, Anirudhan Badrinath, Yannis Flet-Berliac, Emma Brunskill

Offline policy evaluation (OPE) allows us to evaluate and estimate a new sequential decision-making policy's performance by leveraging historical interaction data collected from other policies. Evaluating a new policy online without a confident estimate of its performance can lead to costly, unsafe, or hazardous outcomes, especially in education and healthcare. Several OPE estimators have been proposed in the last decade, many of which have hyperparameters and require training. Unfortunately, choosing the best OPE algorithm for each task and domain is still unclear. In this paper, we propose a new algorithm that adaptively blends a set of OPE estimators given a dataset without relying on an explicit selection using a statistical procedure. We prove that our estimator is consistent and satisfies several desirable properties for policy evaluation. Additionally, we demonstrate that when compared to alternative approaches, our estimator can be used to select higher-performing policies in healthcare and robotics. Our work contributes to improving ease of use for a general-purpose, estimator-agnostic, off-policy evaluation framework for offline RL.

5/29/2024