Tree Search-Based Policy Optimization under Stochastic Execution Delay

2404.05440

YC

0

Reddit

0

Published 4/9/2024 by David Valensi, Esther Derman, Shie Mannor, Gal Dalal

🛠️

Abstract

The standard formulation of Markov decision processes (MDPs) assumes that the agent's decisions are executed immediately. However, in numerous realistic applications such as robotics or healthcare, actions are performed with a delay whose value can even be stochastic. In this work, we introduce stochastic delayed execution MDPs, a new formalism addressing random delays without resorting to state augmentation. We show that given observed delay values, it is sufficient to perform a policy search in the class of Markov policies in order to reach optimal performance, thus extending the deterministic fixed delay case. Armed with this insight, we devise DEZ, a model-based algorithm that optimizes over the class of Markov policies. DEZ leverages Monte-Carlo tree search similar to its non-delayed variant EfficientZero to accurately infer future states from the action queue. Thus, it handles delayed execution while preserving the sample efficiency of EfficientZero. Through a series of experiments on the Atari suite, we demonstrate that although the previous baseline outperforms the naive method in scenarios with constant delay, it underperforms in the face of stochastic delays. In contrast, our approach significantly outperforms the baselines, for both constant and stochastic delays. The code is available at http://github.com/davidva1/Delayed-EZ .

Create account to get full access

or

If you already have an account, we'll log you in

Overview

  • Introduces a new formalism called "stochastic delayed execution Markov decision processes" (SDEMDPs) to address random delays in action execution
  • Demonstrates that optimal performance can be achieved by searching over the class of Markov policies, even with stochastic delays
  • Proposes an algorithm called "DEZ" that leverages Monte-Carlo tree search to accurately infer future states and handle delayed execution while maintaining sample efficiency

Plain English Explanation

Markov decision processes (MDPs) are a common way to model decision-making problems, where an agent takes actions that have consequences and affect the state of the environment. The standard MDP formulation assumes that the agent's actions are executed immediately. However, in many real-world applications like robotics or healthcare, there can be a delay between when an action is taken and when it is actually executed, and this delay can even be random.

The researchers in this paper introduce a new type of MDP called a "stochastic delayed execution MDP" (SDEMDP) that addresses this issue of random delays. They show that even with stochastic delays, the agent can still find an optimal policy by searching within the class of Markov policies, without needing to expand the state space.

Building on this insight, the researchers develop an algorithm called "DEZ" that can handle delayed execution. DEZ uses a technique called Monte-Carlo tree search, similar to the EfficientZero algorithm, to accurately predict future states despite the delays. This allows DEZ to optimize the agent's policy while maintaining the sample efficiency of EfficientZero.

The researchers test their approach on Atari games and find that it significantly outperforms other methods, particularly in scenarios with stochastic delays, where previous baselines struggle.

Technical Explanation

The researchers introduce a new formalism called "stochastic delayed execution Markov decision processes" (SDEMDPs) to model decision-making problems with random delays in action execution. In a standard MDP, the agent's actions are assumed to be executed immediately, but in an SDEMDP, there can be a delay between when an action is taken and when it is actually executed, and this delay can be stochastic.

The researchers show that even with stochastic delays, it is still sufficient to perform a policy search in the class of Markov policies to reach optimal performance. This extends the previous results for the deterministic fixed delay case.

Building on this insight, the researchers develop an algorithm called "DEZ" (Delayed EfficientZero) that can handle delayed execution. DEZ leverages Monte-Carlo tree search, similar to the EfficientZero algorithm, to accurately infer future states from the action queue. This allows DEZ to optimize the agent's policy while preserving the sample efficiency of EfficientZero.

The researchers evaluate their approach on the Atari game suite and find that DEZ significantly outperforms the baseline methods, particularly in scenarios with stochastic delays, where the previous baselines struggle.

Critical Analysis

The researchers acknowledge that their work assumes the delay values are observed, which may not always be the case in real-world applications. Additionally, the paper does not address the potential computational overhead introduced by the Monte-Carlo tree search used in the DEZ algorithm.

It would be interesting to see how the DEZ algorithm performs in more complex environments, such as those with multi-agent interactions or partial observability. The researchers could also explore the possibility of incorporating uncertainty in the object search or using alternative policy optimization techniques to further improve the algorithm's performance.

Conclusion

This paper introduces a new formalism, stochastic delayed execution Markov decision processes (SDEMDPs), to address the issue of random delays in action execution, which is common in many real-world applications. The researchers show that optimal performance can be achieved by searching over the class of Markov policies, even with stochastic delays. They then propose an algorithm called DEZ that leverages Monte-Carlo tree search to handle delayed execution while maintaining sample efficiency. Experimental results on Atari games demonstrate the effectiveness of the DEZ algorithm, particularly in scenarios with stochastic delays.



This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

🛠️

Variational Delayed Policy Optimization

Qingyuan Wu, Simon Sinong Zhan, Yixuan Wang, Yuhui Wang, Chung-Wei Lin, Chen Lv, Qi Zhu, Chao Huang

YC

0

Reddit

0

In environments with delayed observation, state augmentation by including actions within the delay window is adopted to retrieve Markovian property to enable reinforcement learning (RL). However, state-of-the-art (SOTA) RL techniques with Temporal-Difference (TD) learning frameworks often suffer from learning inefficiency, due to the significant expansion of the augmented state space with the delay. To improve learning efficiency without sacrificing performance, this work introduces a novel framework called Variational Delayed Policy Optimization (VDPO), which reformulates delayed RL as a variational inference problem. This problem is further modelled as a two-step iterative optimization problem, where the first step is TD learning in the delay-free environment with a small state space, and the second step is behaviour cloning which can be addressed much more efficiently than TD learning. We not only provide a theoretical analysis of VDPO in terms of sample complexity and performance, but also empirically demonstrate that VDPO can achieve consistent performance with SOTA methods, with a significant enhancement of sample efficiency (approximately 50% less amount of samples) in the MuJoCo benchmark.

Read more

5/24/2024

🎲

Monte Carlo Planning for Stochastic Control on Constrained Markov Decision Processes

Larkin Liu, Shiqi Liu, Matej Jusup

YC

0

Reddit

0

In the world of stochastic control, especially in economics and engineering, Markov Decision Processes (MDPs) can effectively model various stochastic decision processes, from asset management to transportation optimization. These underlying MDPs, upon closer examination, often reveal a specifically constrained causal structure concerning the transition and reward dynamics. By exploiting this structure, we can obtain a reduction in the causal representation of the problem setting, allowing us to solve of the optimal value function more efficiently. This work defines an MDP framework, the texttt{SD-MDP}, where we disentangle the causal structure of MDPs' transition and reward dynamics, providing distinct partitions on the temporal causal graph. With this stochastic reduction, the texttt{SD-MDP} reflects a general class of resource allocation problems. This disentanglement further enables us to derive theoretical guarantees on the estimation error of the value function under an optimal policy by allowing independent value estimation from Monte Carlo sampling. Subsequently, by integrating this estimator into well-known Monte Carlo planning algorithms, such as Monte Carlo Tree Search (MCTS), we derive bounds on the simple regret of the algorithm. Finally, we quantify the policy improvement of MCTS under the texttt{SD-MDP} framework by demonstrating that the MCTS planning algorithm achieves higher expected reward (lower costs) under a constant simulation budget, on a tangible economic example based on maritime refuelling.

Read more

6/26/2024

🛸

Interpretable Decision Tree Search as a Markov Decision Process

Hector Kohler, Riad Akrour, Philippe Preux

YC

0

Reddit

0

Finding an optimal decision tree for a supervised learning task is a challenging combinatorial problem to solve at scale. It was recently proposed to frame the problem as a Markov Decision Problem (MDP) and use deep reinforcement learning to tackle scaling. Unfortunately, these methods are not competitive with the current branch-and-bound state-of-the-art. We propose instead to scale the resolution of such MDPs using an information-theoretic tests generating function that heuristically, and dynamically for every state, limits the set of admissible test actions to a few good candidates. As a solver, we show empirically that our algorithm is at the very least competitive with branch-and-bound alternatives. As a machine learning tool, a key advantage of our approach is to solve for multiple complexity-performance trade-offs at virtually no additional cost. With such a set of solutions, a user can then select the tree that generalizes best and which has the interpretability level that best suits their needs, which no current branch-and-bound method allows.

Read more

6/14/2024

🤿

Improving Sample Efficiency of Model-Free Algorithms for Zero-Sum Markov Games

Songtao Feng, Ming Yin, Yu-Xiang Wang, Jing Yang, Yingbin Liang

YC

0

Reddit

0

The problem of two-player zero-sum Markov games has recently attracted increasing interests in theoretical studies of multi-agent reinforcement learning (RL). In particular, for finite-horizon episodic Markov decision processes (MDPs), it has been shown that model-based algorithms can find an $epsilon$-optimal Nash Equilibrium (NE) with the sample complexity of $O(H^3SAB/epsilon^2)$, which is optimal in the dependence of the horizon $H$ and the number of states $S$ (where $A$ and $B$ denote the number of actions of the two players, respectively). However, none of the existing model-free algorithms can achieve such an optimality. In this work, we propose a model-free stage-based Q-learning algorithm and show that it achieves the same sample complexity as the best model-based algorithm, and hence for the first time demonstrate that model-free algorithms can enjoy the same optimality in the $H$ dependence as model-based algorithms. The main improvement of the dependency on $H$ arises by leveraging the popular variance reduction technique based on the reference-advantage decomposition previously used only for single-agent RL. However, such a technique relies on a critical monotonicity property of the value function, which does not hold in Markov games due to the update of the policy via the coarse correlated equilibrium (CCE) oracle. Thus, to extend such a technique to Markov games, our algorithm features a key novel design of updating the reference value functions as the pair of optimistic and pessimistic value functions whose value difference is the smallest in the history in order to achieve the desired improvement in the sample efficiency.

Read more

6/7/2024