Solving Long-run Average Reward Robust MDPs via Stochastic Games

2312.13912

Published 5/1/2024 by Krishnendu Chatterjee, Ehsan Kafshdar Goharshady, Mehrdad Karrabi, Petr Novotn'y, {DJ}or{dj}e v{Z}ikeli'c

cs.AI

Solving Long-run Average Reward Robust MDPs via Stochastic Games

Abstract

Markov decision processes (MDPs) provide a standard framework for sequential decision making under uncertainty. However, MDPs do not take uncertainty in transition probabilities into account. Robust Markov decision processes (RMDPs) address this shortcoming of MDPs by assigning to each transition an uncertainty set rather than a single probability value. In this work, we consider polytopic RMDPs in which all uncertainty sets are polytopes and study the problem of solving long-run average reward polytopic RMDPs. We present a novel perspective on this problem and show that it can be reduced to solving long-run average reward turn-based stochastic games with finite state and action spaces. This reduction allows us to derive several important consequences that were hitherto not known to hold for polytopic RMDPs. First, we derive new computational complexity bounds for solving long-run average reward polytopic RMDPs, showing for the first time that the threshold decision problem for them is in $NP cap coNP$ and that they admit a randomized algorithm with sub-exponential expected runtime. Second, we present Robust Polytopic Policy Iteration (RPPI), a novel policy iteration algorithm for solving long-run average reward polytopic RMDPs. Our experimental evaluation shows that RPPI is much more efficient in solving long-run average reward polytopic RMDPs compared to state-of-the-art methods based on value iteration.

Create account to get full access

Overview

This research paper proposes a method for solving long-run average reward robust Markov decision processes (RMDPs) using stochastic games.
RMDPs are a type of decision-making model that account for uncertainty or adversarial disturbances in the environment.
The authors show how long-run average reward RMDPs can be formulated as zero-sum stochastic games, which allows them to be solved more efficiently.

Plain English Explanation

Imagine you're a robot trying to navigate through a complex environment with lots of unpredictable factors, like [internal link: https://aimodels.fyi/papers/arxiv/curious-price-distributional-robustness-reinforcement-learning-generative]. You might encounter obstacles, changing terrain, or even an adversary trying to interfere with your goals. A robust Markov decision process (RMDP) is a way to model this kind of uncertain environment and find the best strategy for the robot to achieve its long-term objectives.

In this paper, the researchers show how long-term average reward RMDPs can be transformed into a [internal link: https://aimodels.fyi/papers/arxiv/sample-efficient-learning-infinite-horizon-average-reward] zero-sum stochastic game. This allows them to solve the RMDP more efficiently by leveraging techniques from game theory, rather than trying to optimize the robot's actions directly. By formulating the problem as a game, the robot can find a strategy that works well even in the face of unpredictable or adversarial factors in the environment.

Technical Explanation

The authors first introduce the concept of robust Markov decision processes (RMDPs), which extend traditional Markov decision processes (MDPs) to account for uncertainty or adversarial disturbances in the environment. In an RMDP, there is an additional "adversary" that can perturb the system dynamics or rewards to make the robot's task more difficult.

The key contribution of this paper is showing how long-run average reward RMDPs can be formulated as zero-sum stochastic games. By reframing the problem in this way, the authors are able to leverage techniques from game theory to solve the RMDP more efficiently than previous methods, which had [internal link: https://aimodels.fyi/papers/arxiv/efficient-duple-perturbation-robustness-low-rank-mdps] significant computational and memory requirements.

The authors provide a detailed theoretical analysis, proving that their stochastic game formulation is equivalent to the original long-run average reward RMDP. They also derive algorithms for solving the stochastic game and extracting an optimal policy for the robot. These algorithms achieve [internal link: https://aimodels.fyi/papers/arxiv/quantum-speedups-regret-analysis-infinite-horizon-average] strong regret bounds, showing they can find a near-optimal solution efficiently.

Critical Analysis

The authors acknowledge that their approach relies on the assumption that the robot and adversary have perfect information about the system dynamics and rewards. In more realistic scenarios, this information may be partially or entirely unknown, which would require additional techniques like [internal link: https://aimodels.fyi/papers/arxiv/variance-reduced-policy-gradient-approaches-infinite-horizon] reinforcement learning to estimate the underlying model.

Additionally, the authors focus on the long-run average reward objective, which may not capture all the nuances of real-world robotic applications. Other reward criteria, such as discounted or multi-objective functions, could be worth exploring in future work.

Overall, this paper presents a promising approach for solving long-run average reward RMDPs, but there are still opportunities to extend and refine the techniques to handle more practical and complex scenarios.

Conclusion

This research demonstrates how transforming long-run average reward robust Markov decision processes into zero-sum stochastic games can lead to more efficient solution methods. By leveraging game theoretic techniques, the authors are able to find near-optimal policies for robots navigating uncertain environments, even in the face of adversarial disturbances. While the current approach has some limitations, this work lays the groundwork for further advancements in robust decision-making for real-world applications.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

🏅

Provably Efficient Reinforcement Learning for Infinite-Horizon Average-Reward Linear MDPs

Kihyuk Hong, Yufan Zhang, Ambuj Tewari

We resolve the open problem of designing a computationally efficient algorithm for infinite-horizon average-reward linear Markov Decision Processes (MDPs) with $widetilde{O}(sqrt{T})$ regret. Previous approaches with $widetilde{O}(sqrt{T})$ regret either suffer from computational inefficiency or require strong assumptions on dynamics, such as ergodicity. In this paper, we approximate the average-reward setting by the discounted setting and show that running an optimistic value iteration-based algorithm for learning the discounted setting achieves $widetilde{O}(sqrt{T})$ regret when the discounting factor $gamma$ is tuned appropriately. The challenge in the approximation approach is to get a regret bound with a sharp dependency on the effective horizon $1 / (1 - gamma)$. We use a computationally efficient clipping operator that constrains the span of the optimistic state value function estimate to achieve a sharp regret bound in terms of the effective horizon, which leads to $widetilde{O}(sqrt{T})$ regret.

5/27/2024

stat.ML cs.LG

Achieving Tractable Minimax Optimal Regret in Average Reward MDPs

Victor Boone, Zihan Zhang

In recent years, significant attention has been directed towards learning average-reward Markov Decision Processes (MDPs). However, existing algorithms either suffer from sub-optimal regret guarantees or computational inefficiencies. In this paper, we present the first tractable algorithm with minimax optimal regret of $widetilde{mathrm{O}}(sqrt{mathrm{sp}(h^*) S A T})$, where $mathrm{sp}(h^*)$ is the span of the optimal bias function $h^*$, $S times A$ is the size of the state-action space and $T$ the number of learning steps. Remarkably, our algorithm does not require prior information on $mathrm{sp}(h^*)$. Our algorithm relies on a novel subroutine, Projected Mitigated Extended Value Iteration (PMEVI), to compute bias-constrained optimal policies efficiently. This subroutine can be applied to various previous algorithms to improve regret bounds.

6/4/2024

cs.LG cs.SY eess.SY stat.ML

🏅

The Curious Price of Distributional Robustness in Reinforcement Learning with a Generative Model

Laixi Shi, Gen Li, Yuting Wei, Yuxin Chen, Matthieu Geist, Yuejie Chi

This paper investigates model robustness in reinforcement learning (RL) to reduce the sim-to-real gap in practice. We adopt the framework of distributionally robust Markov decision processes (RMDPs), aimed at learning a policy that optimizes the worst-case performance when the deployed environment falls within a prescribed uncertainty set around the nominal MDP. Despite recent efforts, the sample complexity of RMDPs remained mostly unsettled regardless of the uncertainty set in use. It was unclear if distributional robustness bears any statistical consequences when benchmarked against standard RL. Assuming access to a generative model that draws samples based on the nominal MDP, we characterize the sample complexity of RMDPs when the uncertainty set is specified via either the total variation (TV) distance or $chi^2$ divergence. The algorithm studied here is a model-based method called {em distributionally robust value iteration}, which is shown to be near-optimal for the full range of uncertainty levels. Somewhat surprisingly, our results uncover that RMDPs are not necessarily easier or harder to learn than standard MDPs. The statistical consequence incurred by the robustness requirement depends heavily on the size and shape of the uncertainty set: in the case w.r.t.~the TV distance, the minimax sample complexity of RMDPs is always smaller than that of standard MDPs; in the case w.r.t.~the $chi^2$ divergence, the sample complexity of RMDPs can often far exceed the standard MDP counterpart.

4/15/2024

cs.LG cs.IT

🤯

Sample-efficient Learning of Infinite-horizon Average-reward MDPs with General Function Approximation

Jianliang He, Han Zhong, Zhuoran Yang

We study infinite-horizon average-reward Markov decision processes (AMDPs) in the context of general function approximation. Specifically, we propose a novel algorithmic framework named Local-fitted Optimization with OPtimism (LOOP), which incorporates both model-based and value-based incarnations. In particular, LOOP features a novel construction of confidence sets and a low-switching policy updating scheme, which are tailored to the average-reward and function approximation setting. Moreover, for AMDPs, we propose a novel complexity measure -- average-reward generalized eluder coefficient (AGEC) -- which captures the challenge of exploration in AMDPs with general function approximation. Such a complexity measure encompasses almost all previously known tractable AMDP models, such as linear AMDPs and linear mixture AMDPs, and also includes newly identified cases such as kernel AMDPs and AMDPs with Bellman eluder dimensions. Using AGEC, we prove that LOOP achieves a sublinear $tilde{mathcal{O}}(mathrm{poly}(d, mathrm{sp}(V^*)) sqrt{Tbeta} )$ regret, where $d$ and $beta$ correspond to AGEC and log-covering number of the hypothesis class respectively, $mathrm{sp}(V^*)$ is the span of the optimal state bias function, $T$ denotes the number of steps, and $tilde{mathcal{O}} (cdot) $ omits logarithmic factors. When specialized to concrete AMDP models, our regret bounds are comparable to those established by the existing algorithms designed specifically for these special cases. To the best of our knowledge, this paper presents the first comprehensive theoretical framework capable of handling nearly all AMDPs.

4/22/2024

cs.LG stat.ML