Doubly-Robust Off-Policy Evaluation with Estimated Logging Policy

Read original: arXiv:2404.01830 - Published 4/3/2024 by Kyungbok Lee, Myunghee Cho Paik

Overview

The paper proposes a doubly-robust (DR) estimator, DRUnknown, for off-policy evaluation: estimating how well a target policy would perform using only data that was logged under a different policy.
The estimator combines two existing techniques, inverse propensity scoring and model-based estimation, so that weaknesses in one component can be offset by the other.
A key innovation is handling the case where the logging policy (the policy that collected the data) is unknown and must be estimated, with the value function model then fitted to minimize the estimator's asymptotic variance.

Plain English Explanation

The paper tackles an important problem in machine learning: how to estimate how well a new decision-making system would perform when the only available data was collected by a different system. Imagine you have an AI assistant that recommends products to customers. Before deploying it, you want to know how well it would do, but the only data you have records how customers responded to recommendations made by a different, older system. This is known as an "off-policy" setting, since the data was not collected using the same policy (decision-making process) as the system being evaluated.

Evaluating the new AI assistant is challenging in this off-policy setting. The authors propose a "doubly-robust" approach that combines two established evaluation techniques so that weaknesses in one can be offset by the other. The first technique, inverse propensity scoring, reweights each logged customer response by how much more or less likely the new policy is to make that recommendation than the old one. The second, model-based estimation, builds a predictive model of customer responses and uses it to predict how customers would respond to the new policy's recommendations.

Importantly, the authors extend this approach to handle the realistic case where the original logging policy (the policy used to collect the data) is not known and must be estimated from the data. This makes the technique more widely applicable.

Overall, this work provides a principled way to evaluate AI systems in real-world settings where the available data was not generated by the system being evaluated. This is a common challenge, and the authors' doubly-robust approach offers a valuable tool for quantifying a system's performance accurately before deployment.

Technical Explanation

The paper focuses on the problem of off-policy evaluation, where the goal is to evaluate the performance of a target policy (the AI assistant) using data collected under a different logging policy. The authors propose a doubly-robust estimator that combines inverse propensity score (IPS) weighting and model-based estimation.
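For orientation, the textbook doubly-robust estimator for the contextual bandit case can be written as follows. The notation is an assumption introduced here, not taken from the paper: $\pi_e$ is the target policy, $\hat{\pi}_b$ the (estimated) logging policy, $\hat{q}$ the fitted reward model, and $(x_i, a_i, r_i)$ the logged context, action, and reward. This is the standard form only; DRUnknown itself targets Markov decision processes and chooses the value function model differently.

```latex
\hat{V}_{\mathrm{DR}}
  = \frac{1}{n} \sum_{i=1}^{n}
    \left[
      \sum_{a} \pi_e(a \mid x_i)\, \hat{q}(x_i, a)
      + \frac{\pi_e(a_i \mid x_i)}{\hat{\pi}_b(a_i \mid x_i)}
        \bigl( r_i - \hat{q}(x_i, a_i) \bigr)
    \right]
```

The first term is the model-based prediction; the second is an IPS-weighted correction for the model's residual error on the logged action.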

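IPS weighting corrects for the mismatch between the target and logging policies by reweighting each observed outcome by the ratio of the target policy's probability of the logged action to the logging policy's probability of it. Model-based estimation (also called the direct method), on the other hand, learns a predictive model of the outcomes from the data and averages its predictions under the target policy.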
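The two building blocks can be sketched on a toy contextual bandit. Everything in this example is an assumption made for illustration (two contexts, two actions, a uniform logging policy, the reward probabilities, and all function names); it is not the paper's experimental setup.

```python
import random

ACTIONS = [0, 1]

def logging_prob(context, action):
    # Assumed logging policy for this toy example: uniform over two actions.
    return 0.5

def target_prob(context, action):
    # Assumed target policy: pick the action matching the context 90% of the time.
    return 0.9 if action == context else 0.1

def simulate_logs(n, rng):
    """Draw (context, action, reward, propensity) tuples under the logging policy."""
    logs = []
    for _ in range(n):
        context = rng.randint(0, 1)
        action = rng.choice(ACTIONS)  # uniform draw, consistent with logging_prob
        p_reward = 0.8 if action == context else 0.2
        reward = 1.0 if rng.random() < p_reward else 0.0
        logs.append((context, action, reward, logging_prob(context, action)))
    return logs

def ips_estimate(logs, target):
    """IPS: weight each logged reward by target probability / logging propensity."""
    return sum(target(x, a) / p * r for x, a, r, p in logs) / len(logs)

def fit_reward_model(logs):
    """Tabular reward model: mean logged reward in each (context, action) cell."""
    sums, counts = {}, {}
    for x, a, r, _ in logs:
        sums[(x, a)] = sums.get((x, a), 0.0) + r
        counts[(x, a)] = counts.get((x, a), 0) + 1

    def model(x, a):
        # Fall back to 0.0 for cells that never appear in the logs.
        return sums[(x, a)] / counts[(x, a)] if (x, a) in counts else 0.0

    return model

def direct_method_estimate(logs, target, model):
    """Direct method: average the model's predicted reward under the target policy."""
    return sum(
        sum(target(x, b) * model(x, b) for b in ACTIONS) for x, _, _, _ in logs
    ) / len(logs)

rng = random.Random(0)
logs = simulate_logs(20000, rng)
ips_value = ips_estimate(logs, target_prob)
dm_value = direct_method_estimate(logs, target_prob, fit_reward_model(logs))
# True value in this toy problem: 0.9 * 0.8 + 0.1 * 0.2 = 0.74
print(f"IPS: {ips_value:.3f}  direct method: {dm_value:.3f}")
```

In this sketch IPS needs the logging propensities but no reward model, while the direct method needs a reward model but no propensities; a doubly-robust estimator uses both.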

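The key innovation in this work is extending the doubly-robust approach to the case where the logging policy is unknown and must be estimated from data. The proposed estimator, DRUnknown, first estimates the logging policy and then fits the value function model by minimizing the estimator's asymptotic variance, taking into account the effect of having estimated the logging policy. The authors show that their doubly-robust estimator remains consistent even with an estimated logging policy.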
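A minimal sketch of the general idea, a doubly-robust estimate computed with propensities estimated from the logs, might look like the following. It assumes a tabular contextual bandit and estimates the logging policy by empirical action frequencies; the hidden logging policy, the reward probabilities, and all names are illustrative assumptions. It is not DRUnknown: in particular, the reward model here is a fixed constant rather than one chosen to minimize asymptotic variance.

```python
import random
from collections import Counter

ACTIONS = [0, 1]

def target_prob(context, action):
    # Assumed target policy: pick the action matching the context 90% of the time.
    return 0.9 if action == context else 0.1

def simulate_logs(n, rng):
    """Draw (context, action, reward) tuples; the logging policy is NOT recorded."""
    logs = []
    for _ in range(n):
        context = rng.randint(0, 1)
        action = 0 if rng.random() < 0.7 else 1  # hidden policy: action 0 w.p. 0.7
        p_reward = 0.8 if action == context else 0.2
        reward = 1.0 if rng.random() < p_reward else 0.0
        logs.append((context, action, reward))
    return logs

def estimate_logging_policy(logs):
    """Estimate pi_b(a | x) by the empirical action frequency within each context."""
    pair_counts = Counter((x, a) for x, a, _ in logs)
    context_counts = Counter(x for x, _, _ in logs)

    def propensity(x, a):
        # Only meaningful for contexts that appear in the logs.
        return pair_counts[(x, a)] / context_counts[x]

    return propensity

def doubly_robust_estimate(logs, target, propensity, model):
    """Model-based prediction plus an IPS-weighted correction of its residual."""
    total = 0.0
    for x, a, r in logs:
        baseline = sum(target(x, b) * model(x, b) for b in ACTIONS)
        weight = target(x, a) / propensity(x, a)
        total += baseline + weight * (r - model(x, a))
    return total / len(logs)

def crude_reward_model(x, a):
    # Deliberately poor reward model, so the correction term has work to do.
    return 0.5

rng = random.Random(0)
logs = simulate_logs(20000, rng)
pi_b_hat = estimate_logging_policy(logs)
dr_value = doubly_robust_estimate(logs, target_prob, pi_b_hat, crude_reward_model)
# True value in this toy problem: 0.9 * 0.8 + 0.1 * 0.2 = 0.74
print(f"estimated pi_b(0 | 0): {pi_b_hat(0, 0):.3f}  DR estimate: {dr_value:.3f}")
```

Even with a crude reward model, the correction term pulls the estimate toward the true value as long as the estimated propensities are close to the real ones.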

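Theoretically, the authors provide conditions under which their estimator is asymptotically normal and derive its asymptotic variance. When the logging policy model is correctly specified, DRUnknown achieves the smallest asymptotic variance within a class that contains existing OPE estimators; when the value function model is also correctly specified, its asymptotic variance reaches the semiparametric lower bound. They also establish finite-sample bounds on the estimation error.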
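Asymptotic normality is what justifies Wald-type confidence intervals around a value estimate. The sketch below shows the simplest version, assuming the logging propensities are known and the reward model is fixed in advance, so that the per-sample DR terms are independent and their sample variance is a valid plug-in. The paper's variance formula additionally accounts for the logging policy being estimated, which this sketch does not attempt; the toy setup and all names are assumptions.

```python
import math
import random
import statistics

ACTIONS = [0, 1]

def target_prob(context, action):
    # Assumed target policy: pick the action matching the context 90% of the time.
    return 0.9 if action == context else 0.1

def logging_prob(context, action):
    # Assumed to be known in this sketch: uniform over two actions.
    return 0.5

def fixed_reward_model(context, action):
    # Fixed in advance (not fitted on the logs), so the DR terms stay independent.
    return 0.5

def dr_terms(logs, target, propensity, model):
    """Per-sample doubly-robust terms; their mean is the DR estimate."""
    terms = []
    for x, a, r in logs:
        baseline = sum(target(x, b) * model(x, b) for b in ACTIONS)
        terms.append(baseline + target(x, a) / propensity(x, a) * (r - model(x, a)))
    return terms

def wald_interval(terms, z=1.96):
    """Normal-approximation interval: mean +/- z * (sample std / sqrt(n))."""
    center = statistics.mean(terms)
    std_err = statistics.stdev(terms) / math.sqrt(len(terms))
    return center - z * std_err, center + z * std_err

rng = random.Random(1)
logs = []
for _ in range(5000):
    x = rng.randint(0, 1)
    a = rng.choice(ACTIONS)  # uniform, matching logging_prob
    r = 1.0 if rng.random() < (0.8 if a == x else 0.2) else 0.0
    logs.append((x, a, r))

terms = dr_terms(logs, target_prob, logging_prob, fixed_reward_model)
low, high = wald_interval(terms)
# True value in this toy problem: 0.74
print(f"approximate 95% interval: [{low:.3f}, {high:.3f}]")
```

A lower asymptotic variance translates directly into a narrower interval at the same sample size, which is why the variance results matter in practice.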

Empirically, the authors compare DRUnknown with existing methods in contextual bandit and reinforcement learning experiments, reporting improved performance, especially when the logging policy is misspecified or must be estimated.

Critical Analysis

The paper makes a valuable contribution by addressing the realistic setting where the logging policy is unknown and must be estimated. This is an important extension, as in many real-world applications, the full details of the data collection process may not be available.

That said, the authors acknowledge some limitations. Their analysis assumes the estimated logging policy satisfies certain regularity conditions, which may not always hold in practice. Additionally, the finite-sample error bounds rely on strong assumptions about the smoothness of the underlying functions.

An interesting area for further research would be to explore more robust approaches that can handle violations of these assumptions, such as by leveraging tools from nonparametric statistics or by considering alternative model classes for the logging policy.

It would also be worthwhile to investigate the practical implications of estimation error in the logging policy, both in terms of the impact on final evaluation accuracy and the sensitivity of the method to different logging policy estimation techniques.
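One way to start probing this sensitivity is a small simulation that swaps correct and incorrect components into a standard doubly-robust estimator. The sketch below is such a toy check, not an analysis from the paper; the bandit, the deliberately wrong propensity function, and all names are assumptions.

```python
import random

ACTIONS = [0, 1]

def target_prob(x, a):
    # Assumed target policy: pick the action matching the context 90% of the time.
    return 0.9 if a == x else 0.1

def true_propensity(x, a):
    # The logs below are generated with a uniform logging policy.
    return 0.5

def wrong_propensity(x, a):
    # Deliberately misspecified: claims the logger favoured the matching action.
    return 0.8 if a == x else 0.2

def true_reward(x, a):
    # Exact expected reward in this toy problem.
    return 0.8 if a == x else 0.2

def crude_reward(x, a):
    # Uninformative reward model.
    return 0.5

def doubly_robust_estimate(logs, target, propensity, model):
    """Model-based prediction plus an IPS-weighted correction of its residual."""
    total = 0.0
    for x, a, r in logs:
        baseline = sum(target(x, b) * model(x, b) for b in ACTIONS)
        total += baseline + target(x, a) / propensity(x, a) * (r - model(x, a))
    return total / len(logs)

rng = random.Random(2)
logs = []
for _ in range(20000):
    x = rng.randint(0, 1)
    a = rng.choice(ACTIONS)  # uniform logging policy
    r = 1.0 if rng.random() < true_reward(x, a) else 0.0
    logs.append((x, a, r))

propensities = [("true propensity", true_propensity), ("wrong propensity", wrong_propensity)]
reward_models = [("true reward model", true_reward), ("crude reward model", crude_reward)]
results = {
    (p_name, q_name): doubly_robust_estimate(logs, target_prob, p, q)
    for p_name, p in propensities
    for q_name, q in reward_models
}
# True value in this toy problem: 0.9 * 0.8 + 0.1 * 0.2 = 0.74
for (p_name, q_name), value in results.items():
    print(f"{p_name:>16} + {q_name:<18}: {value:.3f}")
```

In this toy setting the estimate stays near the true value whenever at least one of the two components is correct, and drifts only when both are wrong, which is the sense in which the estimator is "doubly" robust.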

Overall, the doubly-robust approach with estimated logging policy is a promising step forward, but there remain opportunities to further refine and expand the methodology to make it more widely applicable and robust to real-world challenges.

Conclusion

This paper presents a new doubly-robust off-policy evaluation technique that can handle the realistic setting where the original logging policy is unknown and must be estimated from data. By combining inverse propensity scoring and model-based estimation, the method provides more accurate and reliable performance assessment for policies that must be evaluated on data collected by a different policy.

The theoretical and empirical analysis demonstrates the benefits of this approach, particularly when the logging policy is misspecified or estimated imperfectly. While the method has some limitations, it represents an important advance in off-policy evaluation that can help ensure AI systems are thoroughly tested and their capabilities are well understood before deployment. This is a crucial step for building trust and accountability in AI-powered technologies.

Related Papers

Doubly-Robust Off-Policy Evaluation with Estimated Logging Policy

Kyungbok Lee, Myunghee Cho Paik

We introduce a novel doubly-robust (DR) off-policy evaluation (OPE) estimator for Markov decision processes, DRUnknown, designed for situations where both the logging policy and the value function are unknown. The proposed estimator initially estimates the logging policy and then estimates the value function model by minimizing the asymptotic variance of the estimator while considering the estimating effect of the logging policy. When the logging policy model is correctly specified, DRUnknown achieves the smallest asymptotic variance within the class containing existing OPE estimators. When the value function model is also correctly specified, DRUnknown is optimal as its asymptotic variance reaches the semiparametric lower bound. We present experimental results conducted in contextual bandits and reinforcement learning to compare the performance of DRUnknown with that of existing methods.

4/3/2024

Off-Policy Evaluation Using Information Borrowing and Context-Based Switching

Sutanoy Dasgupta, Yabo Niu, Kishan Panaganti, Dileep Kalathil, Debdeep Pati, Bani Mallick

We consider the off-policy evaluation (OPE) problem in contextual bandits, where the goal is to estimate the value of a target policy using the data collected by a logging policy. Most popular approaches to the OPE are variants of the doubly robust (DR) estimator obtained by combining a direct method (DM) estimator and a correction term involving the inverse propensity score (IPS). Existing algorithms primarily focus on strategies to reduce the variance of the DR estimator arising from large IPS. We propose a new approach called the Doubly Robust with Information borrowing and Context-based switching (DR-IC) estimator that focuses on reducing both bias and variance. The DR-IC estimator replaces the standard DM estimator with a parametric reward model that borrows information from the 'closer' contexts through a correlation structure that depends on the IPS. The DR-IC estimator also adaptively interpolates between this modified DM estimator and a modified DR estimator based on a context-specific switching rule. We give provable guarantees on the performance of the DR-IC estimator. We also demonstrate the superior performance of the DR-IC estimator compared to the state-of-the-art OPE algorithms on a number of benchmark problems.

8/20/2024

Doubly Robust Interval Estimation for Optimal Policy Evaluation in Online Learning

Ye Shen, Hengrui Cai, Rui Song

Evaluating the performance of an ongoing policy plays a vital role in many areas such as medicine and economics, to provide crucial instructions on the early-stop of the online experiment and timely feedback from the environment. Policy evaluation in online learning thus attracts increasing attention by inferring the mean outcome of the optimal policy (i.e., the value) in real-time. Yet, such a problem is particularly challenging due to the dependent data generated in the online environment, the unknown optimal policy, and the complex exploration and exploitation trade-off in the adaptive experiment. In this paper, we aim to overcome these difficulties in policy evaluation for online learning. We explicitly derive the probability of exploration that quantifies the probability of exploring non-optimal actions under commonly used bandit algorithms. We use this probability to conduct valid inference on the online conditional mean estimator under each action and develop the doubly robust interval estimation (DREAM) method to infer the value under the estimated optimal policy in online learning. The proposed value estimator provides double protection for consistency and is asymptotically normal with a Wald-type confidence interval provided. Extensive simulation studies and real data applications are conducted to demonstrate the empirical validity of the proposed DREAM method.

8/6/2024

Off-policy Evaluation in Doubly Inhomogeneous Environments

Zeyu Bian, Chengchun Shi, Zhengling Qi, Lan Wang

This work aims to study off-policy evaluation (OPE) under scenarios where two key reinforcement learning (RL) assumptions -- temporal stationarity and individual homogeneity -- are both violated. To handle the "double inhomogeneities", we propose a class of latent factor models for the reward and observation transition functions, under which we develop a general OPE framework that consists of both model-based and model-free approaches. To our knowledge, this is the first paper that develops statistically sound OPE methods in offline RL with double inhomogeneities. It contributes to a deeper understanding of OPE in environments, where standard RL assumptions are not met, and provides several practical approaches in these settings. We establish the theoretical properties of the proposed value estimators and empirically show that our approach outperforms competing methods that ignore either temporal nonstationarity or individual heterogeneity. Finally, we illustrate our method on a data set from the Medical Information Mart for Intensive Care.

8/20/2024