Policy learning without overlap: Pessimism and generalized empirical Bernstein's inequality

Read original: arXiv:2212.09900 - Published 9/30/2024 by Ying Jin, Zhimei Ren, Zhuoran Yang, Zhaoran Wang

🤷

Overview

This paper explores a new approach to offline policy learning, which aims to learn an optimal decision-making strategy from previously collected data.
Existing methods rely on a "uniform overlap" assumption, which may not hold in real-world scenarios where behavior policies can evolve over time.
The authors propose a new algorithm called Pessimistic Policy Learning (PPL) that optimizes lower confidence bounds instead of point estimates, without requiring the uniform overlap condition.

Plain English Explanation

The paper focuses on a problem called offline policy learning, which is about using past observations to find the best way to make decisions. Imagine you have data on how people responded to different actions in the past, and you want to use that to figure out the optimal decision-making strategy for a given population.

Existing methods for this problem rely on an assumption called "uniform overlap," which means the probability of exploring all possible actions must be higher than a certain threshold for everyone. However, this assumption can be unrealistic, especially when the behavior policies (i.e., how decisions were made in the past) are allowed to change over time and become less likely to try certain actions.

To address this issue, the authors propose a new algorithm called Pessimistic Policy Learning (PPL). Instead of using the average performance of each decision, PPL looks at the lower confidence bound - the worst-case scenario that is still reasonably likely given the data. This allows PPL to learn an optimal policy without needing the uniform overlap assumption.

The key idea is that as long as the optimal actions have a reasonable chance of being tried over time, PPL can still learn an effective decision-making strategy, even if the probabilities of trying suboptimal actions eventually become very small. The authors provide a theoretical analysis to show that PPL can learn efficiently in this setting.

Technical Explanation

The paper presents the Pessimistic Policy Learning (PPL) algorithm for offline policy learning. Unlike existing methods, PPL does not rely on the uniform overlap assumption, which requires the propensities (probabilities) of exploring all actions for all individual characteristics to be lower bounded.

The key idea behind PPL is to optimize lower confidence bounds (LCBs) of the policy values, rather than point estimates. The LCBs are constructed using knowledge of the behavior policies that collected the offline data. This allows PPL to learn an optimal policy without any uniform overlap condition.

The authors establish a data-dependent upper bound for the suboptimality of their algorithm, which only depends on (i) the overlap for the optimal policy, and (ii) the complexity of the policy class being optimized. This means PPL can learn efficiently as long as the propensities for optimal actions are lower bounded over time, even if the propensities for suboptimal actions are allowed to diminish rapidly.

To enable this, the authors develop a new self-normalized type concentration inequality for inverse-propensity-weighting estimators, generalizing the well-known empirical Bernstein's inequality to unbounded and non-i.i.d. data. They also provide an efficient optimization algorithm via Majorization-Minimization and policy tree search.

The paper includes extensive simulation studies and real-world applications that demonstrate the efficacy of PPL compared to existing approaches.

Critical Analysis

The authors acknowledge that the proposed PPL algorithm requires knowledge of the behavior policies that generated the offline data, which may not always be available in practice. They suggest that future work could explore ways to relax this assumption.

Additionally, the theoretical analysis relies on a lower bound on the propensities for optimal actions, which may also be difficult to verify in real-world scenarios. It would be useful to investigate the sensitivity of PPL's performance to violations of this assumption.

The authors do not compare PPL to other recently proposed offline policy learning methods that also aim to relax the uniform overlap condition, such as Doubly Robust Indirect Policy Learning or Kernel Inverse Propensity Weighting. Comparing PPL's performance to these alternative approaches could provide a more comprehensive evaluation.

Conclusion

This paper introduces a novel Pessimistic Policy Learning (PPL) algorithm for offline policy learning that can learn an optimal decision-making strategy without relying on the restrictive uniform overlap assumption. By optimizing lower confidence bounds instead of point estimates, PPL can learn efficiently as long as the propensities for optimal actions are lower bounded over time, even when the propensities for suboptimal actions diminish rapidly.

The theoretical analysis and empirical results demonstrate the potential of PPL to be a practical and effective solution for offline policy learning in real-world scenarios where the behavior policies may evolve over time. Further research is needed to address the remaining challenges, such as relaxing the requirement for known behavior policies and investigating the sensitivity to assumptions on the propensity bounds.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Follow @aimodelsfyi on 𝕏 →

Related Papers

🤷

Policy learning without overlap: Pessimism and generalized empirical Bernstein's inequality

Ying Jin, Zhimei Ren, Zhuoran Yang, Zhaoran Wang

This paper studies offline policy learning, which aims at utilizing observations collected a priori (from either fixed or adaptively evolving behavior policies) to learn an optimal individualized decision rule that achieves the best overall outcomes for a given population. Existing policy learning methods rely on a uniform overlap assumption, i.e., the propensities of exploring all actions for all individual characteristics must be lower bounded. As one has no control over the data collection process, this assumption can be unrealistic in many situations, especially when the behavior policies are allowed to evolve over time with diminishing propensities for certain actions. In this paper, we propose Pessimistic Policy Learning (PPL), a new algorithm that optimizes lower confidence bounds (LCBs) -- instead of point estimates -- of the policy values. The LCBs are constructed using knowledge of the behavior policies for collecting the offline data. Without assuming any uniform overlap condition, we establish a data-dependent upper bound for the suboptimality of our algorithm, which only depends on (i) the overlap for the optimal policy, and (ii) the complexity of the policy class we optimize over. As an implication, for adaptively collected data, we ensure efficient policy learning as long as the propensities for optimal actions are lower bounded over time, while those for suboptimal ones are allowed to diminish arbitrarily fast. In our theoretical analysis, we develop a new self-normalized type concentration inequality for inverse-propensity-weighting estimators, generalizing the well-known empirical Bernstein's inequality to unbounded and non-i.i.d. data. We complement our theory with an efficient optimization algorithm via Majorization-Minimization and policy tree search, as well as extensive simulation studies and real-world applications that demonstrate the efficacy of PPL.

9/30/2024

📊

Logarithmic Smoothing for Pessimistic Off-Policy Evaluation, Selection and Learning

Otmane Sakhi, Imad Aouali, Pierre Alquier, Nicolas Chopin

This work investigates the offline formulation of the contextual bandit problem, where the goal is to leverage past interactions collected under a behavior policy to evaluate, select, and learn new, potentially better-performing, policies. Motivated by critical applications, we move beyond point estimators. Instead, we adopt the principle of pessimism where we construct upper bounds that assess a policy's worst-case performance, enabling us to confidently select and learn improved policies. Precisely, we introduce novel, fully empirical concentration bounds for a broad class of importance weighting risk estimators. These bounds are general enough to cover most existing estimators and pave the way for the development of new ones. In particular, our pursuit of the tightest bound within this class motivates a novel estimator (LS), that logarithmically smooths large importance weights. The bound for LS is provably tighter than all its competitors, and naturally results in improved policy selection and learning strategies. Extensive policy evaluation, selection, and learning experiments highlight the versatility and favorable performance of LS.

5/24/2024

Unified PAC-Bayesian Study of Pessimism for Offline Policy Learning with Regularized Importance Sampling

Imad Aouali, Victor-Emmanuel Brunel, David Rohde, Anna Korba

Off-policy learning (OPL) often involves minimizing a risk estimator based on importance weighting to correct bias from the logging policy used to collect data. However, this method can produce an estimator with a high variance. A common solution is to regularize the importance weights and learn the policy by minimizing an estimator with penalties derived from generalization bounds specific to the estimator. This approach, known as pessimism, has gained recent attention but lacks a unified framework for analysis. To address this gap, we introduce a comprehensive PAC-Bayesian framework to examine pessimism with regularized importance weighting. We derive a tractable PAC-Bayesian generalization bound that universally applies to common importance weight regularizations, enabling their comparison within a single framework. Our empirical results challenge common understanding, demonstrating the effectiveness of standard IW regularization techniques.

6/6/2024

🛠️

Pessimistic Off-Policy Optimization for Learning to Rank

Matej Cief, Branislav Kveton, Michal Kompan

Off-policy learning is a framework for optimizing policies without deploying them, using data collected by another policy. In recommender systems, this is especially challenging due to the imbalance in logged data: some items are recommended and thus logged more frequently than others. This is further perpetuated when recommending a list of items, as the action space is combinatorial. To address this challenge, we study pessimistic off-policy optimization for learning to rank. The key idea is to compute lower confidence bounds on parameters of click models and then return the list with the highest pessimistic estimate of its value. This approach is computationally efficient, and we analyze it. We study its Bayesian and frequentist variants and overcome the limitation of unknown prior by incorporating empirical Bayes. To show the empirical effectiveness of our approach, we compare it to off-policy optimizers that use inverse propensity scores or neglect uncertainty. Our approach outperforms all baselines and is both robust and general.

8/26/2024