Two-Timescale Critic-Actor for Average Reward MDPs with Function Approximation

Read original: arXiv:2402.01371 - Published 5/27/2024 by Prashansa Panda, Shalabh Bhatnagar

⚙️

Overview

This paper presents a new two-timescale critic-actor algorithm for reinforcement learning in the long-run average reward setting with function approximation.
It provides the first finite-time non-asymptotic and asymptotic convergence analysis for such a scheme, showing optimal learning rates and improved sample complexity compared to prior two-timescale actor-critic algorithms.
The analysis demonstrates the critic recursion asymptotically converges to the attractor of an associated differential inclusion, with actor parameters corresponding to local maxima of a perturbed average reward objective.
Numerical experiments on benchmark settings show the proposed critic-actor algorithm performs on par or better than other algorithms considered.

Plain English Explanation

In reinforcement learning, agents learn to make good decisions by interacting with an environment and receiving rewards. One important class of algorithms for this is called "actor-critic," where there are two components: the "actor" chooses actions, and the "critic" evaluates how good those actions are.

This paper introduces a new type of actor-critic algorithm that uses a "two-timescale" approach, meaning the actor and critic components are updated at different rates. Specifically, the critic updates more slowly than the actor. The authors show that this algorithm can achieve good performance in the "long-run average reward" setting, where the goal is to maximize the average reward over a long period of time.

Importantly, the authors provide a rigorous mathematical analysis of their algorithm. They prove that it converges to the optimal solution, and they quantify how quickly it converges - this is important for ensuring the algorithm is practical and efficient. Their analysis shows the algorithm outperforms previous two-timescale actor-critic methods in terms of sample complexity, meaning it can learn the optimal strategy using fewer interactions with the environment.

The authors also demonstrate experimentally that their algorithm performs well on standard benchmarks, matching or exceeding the performance of other state-of-the-art algorithms. Overall, this work advances the theory and practice of reinforcement learning by introducing a new, efficient actor-critic algorithm with strong theoretical guarantees.

Technical Explanation

The paper introduces a new two-timescale critic-actor algorithm for reinforcement learning in the long-run average reward setting with function approximation. Unlike previous work that only showed asymptotic convergence for such schemes, this is the first to provide a complete finite-time non-asymptotic as well as asymptotic convergence analysis.

The key technical contributions are:

Optimal learning rates: The authors derive optimal learning rates for the actor and critic updates, showing their algorithm achieves a sample complexity of $\tilde{\mathcal{O}}(\epsilon^{-2.08})$ for the mean squared error of the critic to be upper bounded by $\epsilon$. This improves upon the sample complexity of $\mathcal{O}(\epsilon^{-2.5})$ obtained for two-timescale actor-critic algorithms in a similar setting.
Asymptotic convergence analysis: Unlike recent single-timescale actor-critic algorithms, the authors present a complete asymptotic convergence analysis of their two-timescale scheme. They show the (slower) critic recursion converges asymptotically to the attractor of an associated differential inclusion, with actor parameters corresponding to local maxima of a perturbed average reward objective.
Numerical experiments: The authors evaluate their critic-actor algorithm on three benchmark settings and observe its performance matches or exceeds that of other state-of-the-art algorithms considered.

Critical Analysis

The paper provides a rigorous and comprehensive analysis of the proposed two-timescale critic-actor algorithm, addressing important theoretical and practical aspects of reinforcement learning in the long-run average reward setting.

One potential limitation is that the analysis assumes access to the true gradient of the average reward objective, which may not be feasible in practice. The authors mention extending the results to the gradient-free case as future work.

Additionally, the paper does not discuss the sensitivity of the algorithm's performance to hyperparameter choices, such as the relative learning rates of the actor and critic. Investigating the robustness of the algorithm to these settings could be valuable for real-world applications.

Furthermore, while the authors demonstrate the algorithm's empirical performance on benchmark tasks, applying it to more complex, large-scale reinforcement learning problems would help further validate its practical utility.

Overall, this work represents a significant advancement in the theoretical understanding and sample efficiency of two-timescale critic-actor methods, with promising implications for the development of more effective reinforcement learning algorithms.

Conclusion

This paper presents a new two-timescale critic-actor algorithm for reinforcement learning in the long-run average reward setting, providing the first finite-time non-asymptotic and asymptotic convergence analysis for such a scheme. The algorithm achieves optimal learning rates and improved sample complexity compared to prior two-timescale actor-critic methods.

The rigorous theoretical analysis, combined with the promising empirical results on benchmark tasks, suggest this approach could lead to more sample-efficient and reliable reinforcement learning systems. As the authors highlight, extending the analysis to handle gradient-free settings and exploring the algorithm's robustness to hyperparameter choices are important directions for future research.

Overall, this work represents a significant contribution to the growing body of research on improving the theoretical foundations and practical performance of actor-critic reinforcement learning algorithms.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Follow @aimodelsfyi on 𝕏 →

Related Papers

⚙️

Two-Timescale Critic-Actor for Average Reward MDPs with Function Approximation

Prashansa Panda, Shalabh Bhatnagar

In recent years, there has been a lot of research activity focused on carrying out non-asymptotic convergence analyses for actor-critic algorithms. Recently a two-timescale critic-actor algorithm has been presented for the discounted cost setting in the look-up table case where the timescales of the actor and the critic are reversed and only asymptotic convergence shown. In our work, we present the first two-timescale critic-actor algorithm with function approximation in the long-run average reward setting and present the first finite-time non-asymptotic as well as asymptotic convergence analysis for such a scheme. We obtain optimal learning rates and prove that our algorithm achieves a sample complexity of $mathcal{tilde{O}}(epsilon^{-2.08})$ for the mean squared error of the critic to be upper bounded by $epsilon$ which is better than the one obtained for two-timescale actor-critic in a similar setting. A notable feature of our analysis is that unlike recent single-timescale actor-critic algorithms, we present a complete asymptotic convergence analysis of our scheme in addition to the finite-time bounds that we obtain and show that the (slower) critic recursion converges asymptotically to the attractor of an associated differential inclusion with actor parameters corresponding to local maxima of a perturbed average reward objective. We also show the results of numerical experiments on three benchmark settings and observe that our critic-actor algorithm performs on par and is in fact better than the other algorithms considered.

5/27/2024

🗣️

Actor-Critic or Critic-Actor? A Tale of Two Time Scales

Shalabh Bhatnagar, Vivek S. Borkar, Soumyajit Guin

We revisit the standard formulation of tabular actor-critic algorithm as a two time-scale stochastic approximation with value function computed on a faster time-scale and policy computed on a slower time-scale. This emulates policy iteration. We observe that reversal of the time scales will in fact emulate value iteration and is a legitimate algorithm. We provide a proof of convergence and compare the two empirically with and without function approximation (with both linear and nonlinear function approximators) and observe that our proposed critic-actor algorithm performs on par with actor-critic in terms of both accuracy and computational effort.

6/13/2024

🌿

Finite-Time Analysis of Three-Timescale Constrained Actor-Critic and Constrained Natural Actor-Critic Algorithms

Prashansa Panda, Shalabh Bhatnagar

Actor Critic methods have found immense applications on a wide range of Reinforcement Learning tasks especially when the state-action space is large. In this paper, we consider actor critic and natural actor critic algorithms with function approximation for constrained Markov decision processes (C-MDP) involving inequality constraints and carry out a non-asymptotic analysis for both of these algorithms in a non-i.i.d (Markovian) setting. We consider the long-run average cost criterion where both the objective and the constraint functions are suitable policy-dependent long-run averages of certain prescribed cost functions. We handle the inequality constraints using the Lagrange multiplier method. We prove that these algorithms are guaranteed to find a first-order stationary point (i.e., $Vert nabla L(theta,gamma)Vert_2^2 leq epsilon$) of the performance (Lagrange) function $L(theta,gamma)$, with a sample complexity of $mathcal{tilde{O}}(epsilon^{-2.5})$ in the case of both Constrained Actor Critic (C-AC) and Constrained Natural Actor Critic (C-NAC) algorithms. We also show the results of experiments on three different Safety-Gym environments.

5/30/2024

🗣️

Non-Asymptotic Analysis for Single-Loop (Natural) Actor-Critic with Compatible Function Approximation

Yudan Wang, Yue Wang, Yi Zhou, Shaofeng Zou

Actor-critic (AC) is a powerful method for learning an optimal policy in reinforcement learning, where the critic uses algorithms, e.g., temporal difference (TD) learning with function approximation, to evaluate the current policy and the actor updates the policy along an approximate gradient direction using information from the critic. This paper provides the textit{tightest} non-asymptotic convergence bounds for both the AC and natural AC (NAC) algorithms. Specifically, existing studies show that AC converges to an $epsilon+varepsilon_{text{critic}}$ neighborhood of stationary points with the best known sample complexity of $mathcal{O}(epsilon^{-2})$ (up to a log factor), and NAC converges to an $epsilon+varepsilon_{text{critic}}+sqrt{varepsilon_{text{actor}}}$ neighborhood of the global optimum with the best known sample complexity of $mathcal{O}(epsilon^{-3})$, where $varepsilon_{text{critic}}$ is the approximation error of the critic and $varepsilon_{text{actor}}$ is the approximation error induced by the insufficient expressive power of the parameterized policy class. This paper analyzes the convergence of both AC and NAC algorithms with compatible function approximation. Our analysis eliminates the term $varepsilon_{text{critic}}$ from the error bounds while still achieving the best known sample complexities. Moreover, we focus on the challenging single-loop setting with a single Markovian sample trajectory. Our major technical novelty lies in analyzing the stochastic bias due to policy-dependent and time-varying compatible function approximation in the critic, and handling the non-ergodicity of the MDP due to the single Markovian sample trajectory. Numerical results are also provided in the appendix.

6/5/2024