Low Variance Off-policy Evaluation with State-based Importance Sampling

Read original: arXiv:2212.03932 - Published 5/7/2024 by David M. Bossens, Philip S. Thomas

Overview

Reinforcement learning often requires trying out suboptimal policies, which can be costly in many domains.
Off-policy evaluation can be used to evaluate a target policy based on data collected from a known behavior policy.
Importance sampling estimators weight each trajectory's return by the probability ratio of the target and behavior policies, but they have high variance and therefore a large mean squared error.
This paper proposes state-based importance sampling estimators that reduce variance by dropping certain states from the importance weight calculation.

Plain English Explanation

Reinforcement learning is a powerful technique used to train AI systems, but it can be expensive in real-world applications. This is because the AI needs to try out different strategies, even if they aren't the best ones, in order to learn. This can be problematic in domains where trying suboptimal policies is costly.

To address this, researchers use off-policy evaluation, which estimates how well a desired policy (the "target policy") would perform using only data collected by a different, known policy (the "behavior policy"). Importance sampling estimators do this by reweighting each logged trajectory's return by the ratio of its probability under the target policy to its probability under the behavior policy.
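As a rough illustration of the reweighting described above, here is a minimal sketch of ordinary importance sampling in plain Python. The policy tables, state and action names, and function names are all hypothetical and chosen for this example; they are not taken from the paper.

```python
from math import prod

# Hypothetical toy policies: state -> {action: probability}.
behavior = {"s0": {"a": 0.5, "b": 0.5}}
target = {"s0": {"a": 0.8, "b": 0.2}}


def trajectory_return(trajectory, gamma=1.0):
    """Discounted sum of rewards for a list of (state, action, reward) steps."""
    return sum(gamma ** t * r for t, (_, _, r) in enumerate(trajectory))


def importance_weight(trajectory, target, behavior):
    """Product over all steps of pi_target(a|s) / pi_behavior(a|s)."""
    return prod(target[s][a] / behavior[s][a] for s, a, _ in trajectory)


def ordinary_is(trajectories, target, behavior, gamma=1.0):
    """Average of (importance weight * return) over behavior-policy trajectories."""
    total = sum(
        importance_weight(tr, target, behavior) * trajectory_return(tr, gamma)
        for tr in trajectories
    )
    return total / len(trajectories)


# Two one-step trajectories logged under the behavior policy.
data = [[("s0", "a", 1.0)], [("s0", "b", 0.0)]]
estimate = ordinary_is(data, target, behavior)
```

In this toy case the target policy picks the rewarded action with probability 0.8, and the estimate from the two logged trajectories matches that expected return.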

Unfortunately, these importance sampling estimators tend to have high variance, so individual estimates can be far from the true value. The problem is worst for long trajectories, because the weight is a product of one probability ratio per time step.
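To see why the variance can become so large, consider a simplified setting that is an assumption made here for illustration, not a result from the paper: the per-step ratios $\rho_t = \pi_e(a_t \mid s_t) / \pi_b(a_t \mid s_t)$ are independent across $H$ time steps and share the same second moment $1 + c$ under the behavior policy $\pi_b$. Each ratio has mean 1, so

$$
\mathbb{E}_{\pi_b}\Big[\prod_{t=0}^{H-1} \rho_t\Big] = 1,
\qquad
\operatorname{Var}_{\pi_b}\Big[\prod_{t=0}^{H-1} \rho_t\Big]
= \prod_{t=0}^{H-1} \mathbb{E}_{\pi_b}\big[\rho_t^2\big] - 1
= (1 + c)^H - 1 .
$$

For example, with a uniform behavior policy over two actions and a target policy with probabilities $(0.8, 0.2)$, $\mathbb{E}[\rho_t^2] = 0.5 \cdot 1.6^2 + 0.5 \cdot 0.4^2 = 1.36$, so the weight variance over $H = 10$ steps is roughly $1.36^{10} - 1 \approx 20.6$. Under the same assumption, leaving $k$ of the ratios out of the product lowers the exponent from $H$ to $H - k$.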

The key innovation in this paper is the introduction of state-based importance sampling estimators. These new estimators reduce the variance by selectively dropping certain states from the importance weight calculation. The authors demonstrate how this technique can be applied to several different off-policy evaluation methods, consistently improving the accuracy compared to the traditional approaches.
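A minimal sketch of what dropping states from the weight could look like in code is shown below. It assumes the set of states to drop is already given (the name `dropped_states` is hypothetical); how that set is chosen is not covered by this summary, and the toy policies are invented for the example.

```python
from math import prod

# Hypothetical toy policies: state -> {action: probability}.
behavior = {"wait": {"l": 0.5, "r": 0.5}, "act": {"l": 0.5, "r": 0.5}}
target = {"wait": {"l": 0.9, "r": 0.1}, "act": {"l": 0.8, "r": 0.2}}


def state_based_weight(trajectory, target, behavior, dropped_states):
    """Product of policy ratios, skipping steps whose state is in dropped_states."""
    return prod(
        target[s][a] / behavior[s][a]
        for s, a, _ in trajectory
        if s not in dropped_states
    )


def state_based_is(trajectories, target, behavior, dropped_states):
    """Average of (state-based weight * undiscounted return) over trajectories."""
    total = 0.0
    for tr in trajectories:
        ret = sum(r for _, _, r in tr)
        total += state_based_weight(tr, target, behavior, dropped_states) * ret
    return total / len(trajectories)


# One logged trajectory of (state, action, reward) steps.
trajectory = [("wait", "l", 0.0), ("act", "l", 1.0)]

full_weight = state_based_weight(trajectory, target, behavior, set())
reduced_weight = state_based_weight(trajectory, target, behavior, {"wait"})
```

With an empty `dropped_states` set the function reduces to the usual trajectory weight (1.8 × 1.6 here); dropping `"wait"` leaves only the ratio from the `"act"` step.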

Technical Explanation

The paper proposes state-based variants of several important off-policy evaluation methods, including ordinary importance sampling, weighted importance sampling, per-decision importance sampling, incremental importance sampling, doubly robust off-policy evaluation, and stationary density ratio estimation.
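For two of the estimators listed above, the sketch below shows the standard weighted and per-decision importance sampling forms with an optional set of dropped states. Treating a dropped state as contributing a ratio of 1 is an assumption made for this illustration and may differ from the paper's exact definitions; all names and the toy data are hypothetical.

```python
from math import prod

# Hypothetical toy policies: state -> {action: probability}.
behavior = {"wait": {"l": 0.5, "r": 0.5}, "act": {"l": 0.5, "r": 0.5}}
target = {"wait": {"l": 0.9, "r": 0.1}, "act": {"l": 0.8, "r": 0.2}}


def ratio(step, target, behavior, dropped_states):
    """Per-step policy ratio, or 1.0 when the step's state is dropped."""
    s, a, _ = step
    return 1.0 if s in dropped_states else target[s][a] / behavior[s][a]


def weighted_is(trajectories, target, behavior, dropped_states=frozenset()):
    """Weighted IS: weighted returns normalised by the sum of the weights."""
    weights = [
        prod(ratio(st, target, behavior, dropped_states) for st in tr)
        for tr in trajectories
    ]
    returns = [sum(r for _, _, r in tr) for tr in trajectories]
    return sum(w * g for w, g in zip(weights, returns)) / sum(weights)


def per_decision_is(trajectories, target, behavior,
                    dropped_states=frozenset(), gamma=1.0):
    """Per-decision IS: each reward is weighted by the ratios up to its step."""
    total = 0.0
    for tr in trajectories:
        w = 1.0
        for t, step in enumerate(tr):
            w *= ratio(step, target, behavior, dropped_states)
            total += gamma ** t * w * step[2]
    return total / len(trajectories)


# Two logged trajectories of (state, action, reward) steps.
data = [
    [("wait", "l", 0.0), ("act", "l", 1.0)],
    [("wait", "r", 0.0), ("act", "r", 0.0)],
]
pdis_full = per_decision_is(data, target, behavior)
pdis_dropped = per_decision_is(data, target, behavior, {"wait"})
wis_dropped = weighted_is(data, target, behavior, {"wait"})
```

Passing an empty set recovers the traditional estimators, so the same functions can be used to compare both versions on the same data.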

The core idea is to reduce the variance of the importance sampling estimator by dropping certain states from the calculation of the importance weight. This is based on the insight that not all states contribute equally to the variance of the estimator. By leaving the ratios for selected states out of the product, the authors shorten the importance weight and so lower the variance and, with it, the mean squared error of the estimate.
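The effect can be checked exactly on a tiny example. The sketch below uses a hypothetical two-step problem, invented for illustration, in which the agent always visits a state `"wait"` and then a state `"act"`, and only the action taken in `"act"` affects the reward. It enumerates all action sequences under the behavior policy and computes the exact mean and variance of the single-trajectory estimate, with and without the `"wait"` ratio.

```python
from itertools import product

# Hypothetical toy policies: state -> {action: probability}.
behavior = {"wait": {"l": 0.5, "r": 0.5}, "act": {"l": 0.5, "r": 0.5}}
target = {"wait": {"l": 0.9, "r": 0.1}, "act": {"l": 0.8, "r": 0.2}}


def exact_mean_and_variance(dropped_states):
    """Exact mean and variance of weight * return under the behavior policy."""
    mean = 0.0
    second_moment = 0.0
    for a_wait, a_act in product("lr", repeat=2):
        # Probability of this action sequence under the behavior policy.
        prob = behavior["wait"][a_wait] * behavior["act"][a_act]
        weight = 1.0
        for s, a in (("wait", a_wait), ("act", a_act)):
            if s not in dropped_states:
                weight *= target[s][a] / behavior[s][a]
        # Reward depends only on the action taken in state "act".
        ret = 1.0 if a_act == "l" else 0.0
        value = weight * ret
        mean += prob * value
        second_moment += prob * value ** 2
    return mean, second_moment - mean ** 2


full_mean, full_var = exact_mean_and_variance(set())
sis_mean, sis_var = exact_mean_and_variance({"wait"})
```

In this particular construction both versions have mean 0.8, which is the target policy's expected return, while the variance falls from about 1.46 to 0.64 once the `"wait"` ratio is dropped. That the mean is unchanged relies on the dropped state having no influence on the return here; it should not be read as a general guarantee.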

The authors evaluate their state-based methods in four domains. The reported results show that the state-based estimators consistently achieve lower variance and better accuracy than their traditional counterparts.

This approach of adaptively selecting which states to use in the importance weight calculation is related to other variance reduction techniques in reinforcement learning, such as clustered policy decision ranking and policy optimization with learned guidance from state information.

Critical Analysis

The paper provides a theoretical analysis of the state-based importance sampling estimators, including results on their bias and variance properties. However, the authors acknowledge that the practical implementation may require tuning certain hyperparameters, such as the threshold for dropping states, which could introduce additional complexity.

Additionally, the paper only evaluates the methods on a relatively small number of domains. While the consistent performance improvements are promising, it would be valuable to see how the state-based estimators scale to larger, more complex environments. There is also prior work on high-probability sample complexity bounds for policy evaluation in linear function approximation settings that could provide further insight.

Overall, this paper represents an important step forward in reducing the high variance inherent in importance sampling-based off-policy evaluation. The state-based approach is a clever and effective solution, and the authors have done a commendable job of rigorously analyzing its theoretical properties and demonstrating its practical efficacy.

Conclusion

This paper introduces a novel class of state-based importance sampling estimators for off-policy evaluation in reinforcement learning. By selectively dropping certain states from the importance weight calculation, the authors are able to reduce the variance, and with it the mean squared error, of the estimates.

The state-based methods are shown to outperform traditional importance sampling estimators across a variety of domains, suggesting they could be a valuable tool for researchers and practitioners working on reinforcement learning problems with high-cost exploration. The techniques proposed in this paper represent an important advancement in the field of off-policy evaluation, which is crucial for applying reinforcement learning to real-world applications.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Low Variance Off-policy Evaluation with State-based Importance Sampling

David M. Bossens, Philip S. Thomas

In many domains, the exploration process of reinforcement learning will be too costly as it requires trying out suboptimal policies, resulting in a need for off-policy evaluation, in which a target policy is evaluated based on data collected from a known behaviour policy. In this context, importance sampling estimators provide estimates for the expected return by weighting the trajectory based on the probability ratio of the target policy and the behaviour policy. Unfortunately, such estimators have a high variance and therefore a large mean squared error. This paper proposes state-based importance sampling estimators which reduce the variance by dropping certain states from the computation of the importance weight. To illustrate their applicability, we demonstrate state-based variants of ordinary importance sampling, weighted importance sampling, per-decision importance sampling, incremental importance sampling, doubly robust off-policy evaluation, and stationary density ratio estimation. Experiments in four domains show that state-based methods consistently yield reduced variance and improved accuracy compared to their traditional counterparts.

5/7/2024

Vlearn: Off-Policy Learning with Efficient State-Value Function Estimation

Fabian Otto, Philipp Becker, Ngo Anh Vien, Gerhard Neumann

Existing off-policy reinforcement learning algorithms often rely on an explicit state-action-value function representation, which can be problematic in high-dimensional action spaces due to the curse of dimensionality. This reliance results in data inefficiency as maintaining a state-action-value function in such spaces is challenging. We present an efficient approach that utilizes only a state-value function as the critic for off-policy deep reinforcement learning. This approach, which we refer to as Vlearn, effectively circumvents the limitations of existing methods by eliminating the necessity for an explicit state-action-value function. To this end, we introduce a novel importance sampling loss for learning deep value functions from off-policy data. While this is common for linear methods, it has not been combined with deep value function networks. This transfer to deep methods is not straightforward and requires novel design choices such as robust policy updates, twin value function networks to avoid an optimization bias, and importance weight clipping. We also present a novel analysis of the variance of our estimate compared to commonly used importance sampling estimators such as V-trace. Our approach improves sample complexity as well as final performance and ensures consistent and robust performance across various benchmark tasks. Eliminating the state-action-value function in Vlearn facilitates a streamlined learning process, enabling more effective exploration and exploitation in complex environments.

6/21/2024

Policy Gradient with Active Importance Sampling

Matteo Papini, Giorgio Manganini, Alberto Maria Metelli, Marcello Restelli

Importance sampling (IS) represents a fundamental technique for a large surge of off-policy reinforcement learning approaches. Policy gradient (PG) methods, in particular, significantly benefit from IS, enabling the effective reuse of previously collected samples, thus increasing sample efficiency. However, classically, IS is employed in RL as a passive tool for re-weighting historical samples. However, the statistical community employs IS as an active tool combined with the use of behavioral distributions that allow the reduction of the estimate variance even below the sample mean one. In this paper, we focus on this second setting by addressing the behavioral policy optimization (BPO) problem. We look for the best behavioral policy from which to collect samples to reduce the policy gradient variance as much as possible. We provide an iterative algorithm that alternates between the cross-entropy estimation of the minimum-variance behavioral policy and the actual policy optimization, leveraging on defensive IS. We theoretically analyze such an algorithm, showing that it enjoys a convergence rate of order $O(\epsilon^{-4})$ to a stationary point, but depending on a more convenient variance term w.r.t. standard PG methods. We then provide a practical version that is numerically validated, showing the advantages in the policy gradient estimation variance and on the learning speed.

5/10/2024

An Adaptive Importance Sampling for Locally Stable Point Processes

Hee-Geon Kang, Sunggon Kim

The problem of finding the expected value of a statistic of a locally stable point process in a bounded region is addressed. We propose an adaptive importance sampling for solving the problem. In our proposal, we restrict the importance point process to the family of homogeneous Poisson point processes, which enables us to generate quickly independent samples of the importance point process. The optimal intensity of the importance point process is found by applying the cross-entropy minimization method. In the proposed scheme, the expected value of the function and the optimal intensity are iteratively estimated in an adaptive manner. We show that the proposed estimator converges to the target value almost surely, and prove the asymptotic normality of it. We explain how to apply the proposed scheme to the estimation of the intensity of a stationary pairwise interaction point process. The performance of the proposed scheme is compared numerically with the Markov chain Monte Carlo simulation and the perfect sampling.

8/15/2024