Off-Policy Evaluation Using Information Borrowing and Context-Based Switching

Read original: arXiv:2112.09865 - Published 8/20/2024 by Sutanoy Dasgupta, Yabo Niu, Kishan Panaganti, Dileep Kalathil, Debdeep Pati, Bani Mallick

⛏️

Overview

The paper proposes a new approach called the Doubly Robust with Information borrowing and Context-based switching (DR-IC) estimator for off-policy evaluation in contextual bandits.
Most existing approaches use the doubly robust (DR) estimator, which combines a direct method (DM) estimator and a correction term involving the inverse propensity score (IPS).
The DR-IC estimator focuses on reducing both bias and variance by replacing the standard DM estimator with a parametric reward model that borrows information from 'closer' contexts and adaptively interpolating between the modified DM estimator and a modified DR estimator.
The authors provide provable guarantees on the performance of the DR-IC estimator and demonstrate its superior performance compared to state-of-the-art off-policy evaluation algorithms.

Plain English Explanation

The paper tackles the problem of off-policy evaluation (OPE) in contextual bandits. In a contextual bandit setting, you have a set of actions (like which ad to show) and a context (like the user's characteristics) that affects the reward (like whether the user clicks on the ad). The goal of OPE is to estimate the value of a target policy (how well a new way of choosing actions would perform) using data collected by a logging policy (the current way of choosing actions).

The most popular OPE approaches use a doubly robust (DR) estimator, which combines a direct method (DM) estimator (that tries to directly model the rewards) and a correction term involving the inverse propensity score (IPS) (which accounts for the difference between the logging and target policies).

The paper's key idea is the Doubly Robust with Information borrowing and Context-based switching (DR-IC) estimator. This estimator:

Replaces the standard DM estimator with a parametric reward model that "borrows information" from 'closer' contexts (those with similar IPS values).
Adaptively interpolates between this modified DM estimator and a modified DR estimator, depending on the context.

This helps reduce both the bias and variance of the OPE, leading to better performance than existing approaches.

Technical Explanation

The paper proposes the Doubly Robust with Information borrowing and Context-based switching (DR-IC) estimator for off-policy evaluation in contextual bandits.

The key components of the DR-IC estimator are:

Modified Direct Method (DM) Estimator: The standard DM estimator is replaced with a parametric reward model that "borrows information" from 'closer' contexts. This is achieved by modeling the reward as a function of the context and the IPS, with the correlation structure between the two depending on the IPS.
Modified Doubly Robust (DR) Estimator: The authors also modify the standard DR estimator by incorporating the context-specific IPS information into the correction term.
Context-based Switching: The DR-IC estimator adaptively interpolates between the modified DM estimator and the modified DR estimator based on a context-specific switching rule. This helps balance the bias-variance tradeoff.

The authors provide theoretical guarantees on the performance of the DR-IC estimator, showing that it can outperform the state-of-the-art OPE algorithms in terms of both bias and variance.

Critical Analysis

The paper makes a valuable contribution to the field of off-policy evaluation in contextual bandits. The key strength of the proposed DR-IC estimator is its ability to effectively leverage the underlying structure of the problem to reduce both bias and variance, which is a significant improvement over existing approaches.

One potential limitation of the work is that the theoretical analysis and experiments are conducted under specific assumptions, such as the availability of a parametric reward model and the ability to accurately estimate the IPS. In real-world scenarios, these assumptions may not always hold, and the performance of the DR-IC estimator may be affected.

Additionally, the paper does not explore the sensitivity of the DR-IC estimator to misspecification of the parametric reward model or the switching rule. Further research could investigate the robustness of the method to these potential sources of error.

It would also be interesting to see how the DR-IC estimator performs on a wider range of benchmarks, including more complex contextual bandit problems or settings with greater covariate shift between the logging and target policies.

Conclusion

The Doubly Robust with Information borrowing and Context-based switching (DR-IC) estimator proposed in this paper represents a significant advancement in the field of off-policy evaluation for contextual bandits. By combining a modified direct method estimator that borrows information from 'closer' contexts and a context-specific adaptive interpolation between the modified direct method and doubly robust estimators, the DR-IC estimator demonstrates superior performance compared to state-of-the-art approaches.

The theoretical guarantees and experimental results presented in the paper suggest that the DR-IC estimator could have important practical implications for a wide range of applications, from recommender systems to online advertising, where accurate off-policy evaluation is crucial for making informed decisions about new policies or interventions. Further research to address the potential limitations and extend the applicability of the DR-IC estimator could lead to even more impactful contributions to the field.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Follow @aimodelsfyi on 𝕏 →

Related Papers

⛏️

Off-Policy Evaluation Using Information Borrowing and Context-Based Switching

Sutanoy Dasgupta, Yabo Niu, Kishan Panaganti, Dileep Kalathil, Debdeep Pati, Bani Mallick

We consider the off-policy evaluation (OPE) problem in contextual bandits, where the goal is to estimate the value of a target policy using the data collected by a logging policy. Most popular approaches to the OPE are variants of the doubly robust (DR) estimator obtained by combining a direct method (DM) estimator and a correction term involving the inverse propensity score (IPS). Existing algorithms primarily focus on strategies to reduce the variance of the DR estimator arising from large IPS. We propose a new approach called the Doubly Robust with Information borrowing and Context-based switching (DR-IC) estimator that focuses on reducing both bias and variance. The DR-IC estimator replaces the standard DM estimator with a parametric reward model that borrows information from the 'closer' contexts through a correlation structure that depends on the IPS. The DR-IC estimator also adaptively interpolates between this modified DM estimator and a modified DR estimator based on a context-specific switching rule. We give provable guarantees on the performance of the DR-IC estimator. We also demonstrate the superior performance of the DR-IC estimator compared to the state-of-the-art OPE algorithms on a number of benchmark problems.

8/20/2024

🤯

Anytime-valid off-policy inference for contextual bandits

Ian Waudby-Smith, Lili Wu, Aaditya Ramdas, Nikos Karampatziakis, Paul Mineiro

Contextual bandit algorithms are ubiquitous tools for active sequential experimentation in healthcare and the tech industry. They involve online learning algorithms that adaptively learn policies over time to map observed contexts $X_t$ to actions $A_t$ in an attempt to maximize stochastic rewards $R_t$. This adaptivity raises interesting but hard statistical inference questions, especially counterfactual ones: for example, it is often of interest to estimate the properties of a hypothetical policy that is different from the logging policy that was used to collect the data -- a problem known as ``off-policy evaluation'' (OPE). Using modern martingale techniques, we present a comprehensive framework for OPE inference that relax unnecessary conditions made in some past works, significantly improving on them both theoretically and empirically. Importantly, our methods can be employed while the original experiment is still running (that is, not necessarily post-hoc), when the logging policy may be itself changing (due to learning), and even if the context distributions are a highly dependent time-series (such as if they are drifting over time). More concretely, we derive confidence sequences for various functionals of interest in OPE. These include doubly robust ones for time-varying off-policy mean reward values, but also confidence bands for the entire cumulative distribution function of the off-policy reward distribution. All of our methods (a) are valid at arbitrary stopping times (b) only make nonparametric assumptions, (c) do not require importance weights to be uniformly bounded and if they are, we do not need to know these bounds, and (d) adapt to the empirical variance of our estimators. In summary, our methods enable anytime-valid off-policy inference using adaptively collected contextual bandit data.

8/19/2024

IntOPE: Off-Policy Evaluation in the Presence of Interference

Yuqi Bai, Ziyu Zhao, Minqin Zhu, Kun Kuang

Off-Policy Evaluation (OPE) is employed to assess the potential impact of a hypothetical policy using logged contextual bandit feedback, which is crucial in areas such as personalized medicine and recommender systems, where online interactions are associated with significant risks and costs. Traditionally, OPE methods rely on the Stable Unit Treatment Value Assumption (SUTVA), which assumes that the reward for any given individual is unaffected by the actions of others. However, this assumption often fails in real-world scenarios due to the presence of interference, where an individual's reward is affected not just by their own actions but also by the actions of their peers. This realization reveals significant limitations of existing OPE methods in real-world applications. To address this limitation, we propose IntIPW, an IPW-style estimator that extends the Inverse Probability Weighting (IPW) framework by integrating marginalized importance weights to account for both individual actions and the influence of adjacent entities. Extensive experiments are conducted on both synthetic and real-world data to demonstrate the effectiveness of the proposed IntIPW method.

8/27/2024

Effective Off-Policy Evaluation and Learning in Contextual Combinatorial Bandits

Tatsuhiro Shimizu, Koichi Tanaka, Ren Kishimoto, Haruka Kiyohara, Masahiro Nomura, Yuta Saito

We explore off-policy evaluation and learning (OPE/L) in contextual combinatorial bandits (CCB), where a policy selects a subset in the action space. For example, it might choose a set of furniture pieces (a bed and a drawer) from available items (bed, drawer, chair, etc.) for interior design sales. This setting is widespread in fields such as recommender systems and healthcare, yet OPE/L of CCB remains unexplored in the relevant literature. Typical OPE/L methods such as regression and importance sampling can be applied to the CCB problem, however, they face significant challenges due to high bias or variance, exacerbated by the exponential growth in the number of available subsets. To address these challenges, we introduce a concept of factored action space, which allows us to decompose each subset into binary indicators. This formulation allows us to distinguish between the ''main effect'' derived from the main actions, and the ''residual effect'', originating from the supplemental actions, facilitating more effective OPE. Specifically, our estimator, called OPCB, leverages an importance sampling-based approach to unbiasedly estimate the main effect, while employing regression-based approach to deal with the residual effect with low variance. OPCB achieves substantial variance reduction compared to conventional importance sampling methods and bias reduction relative to regression methods under certain conditions, as illustrated in our theoretical analysis. Experiments demonstrate OPCB's superior performance over typical methods in both OPE and OPL.

8/22/2024