Actor-Critic or Critic-Actor? A Tale of Two Time Scales

Read original: arXiv:2210.04470 - Published 6/13/2024 by Shalabh Bhatnagar, Vivek S. Borkar, Soumyajit Guin

🗣️

Overview

The paper revisits the standard formulation of tabular actor-critic algorithms, which involve a two time-scale stochastic approximation.
The value function is computed on a faster time-scale, while the policy is computed on a slower time-scale, emulating policy iteration.
The paper observes that reversing the time scales emulates value iteration, which is also a legitimate algorithm.
The paper provides a proof of convergence and compares the two approaches empirically, with and without function approximation (linear and nonlinear).
The proposed critic-actor algorithm performs on par with actor-critic in terms of both accuracy and computational effort.

Plain English Explanation

The paper discusses a class of reinforcement learning algorithms called actor-critic algorithms, which are used to learn both a policy (the "actor") and a value function (the "critic"). In the standard formulation, the value function is updated more quickly than the policy, which emulates a process called "policy iteration."

The paper suggests that reversing the time scales, so that the policy is updated more quickly than the value function, actually emulates a different process called "value iteration." This is also a valid and legitimate approach to reinforcement learning.

The paper provides a mathematical proof showing that this reversed time-scale approach, which the authors call a "critic-actor" algorithm, will also converge to the optimal solution. They then compare the performance of the critic-actor and actor-critic algorithms, both with and without using function approximation (where the value function and policy are represented by more complex, nonlinear models).

The key finding is that the critic-actor algorithm performs just as well as the standard actor-critic, in terms of both the accuracy of the final solution and the computational effort required to reach it. This suggests that the critic-actor formulation is a viable alternative to the more commonly used actor-critic approach.

Technical Explanation

The paper revisits the standard formulation of tabular actor-critic algorithms, which involve a two time-scale stochastic approximation. In this setup, the value function (the "critic") is computed on a faster time-scale, while the policy (the "actor") is computed on a slower time-scale. This emulates the policy iteration procedure from dynamic programming.

The key insight of the paper is that reversing the time scales, so that the policy is updated more quickly than the value function, will in fact emulate the value iteration procedure. The authors show that this reversed time-scale approach, which they call a "critic-actor" algorithm, is also a legitimate and convergent algorithm for reinforcement learning.

The paper provides a proof of convergence for the critic-actor algorithm and then compares its empirical performance to the standard actor-critic algorithm. They evaluate both approaches with and without function approximation, using both linear and nonlinear function approximators for the value function and policy.

The results indicate that the critic-actor algorithm performs on par with actor-critic in terms of both accuracy and computational effort. This suggests that the critic-actor formulation is a viable alternative to the more commonly used actor-critic approach, offering similar performance characteristics but with the potential advantage of emulating the value iteration procedure.

Critical Analysis

The paper provides a thorough theoretical and empirical analysis of the critic-actor algorithm, demonstrating its convergence properties and comparing its performance to the standard actor-critic approach. However, the paper does not discuss any potential limitations or caveats of the proposed method.

One area that could be explored further is the sensitivity of the critic-actor algorithm to the choice of hyperparameters, such as the relative learning rates for the value function and policy updates. The paper's results suggest that the critic-actor algorithm is robust to these choices, but a more detailed analysis of its sensitivity and stability characteristics would be valuable.

Additionally, the paper focuses on tabular and linear function approximation settings. It would be interesting to see how the critic-actor algorithm performs in more challenging domains with complex, nonlinear function approximators, such as deep neural networks. Further research in this direction could help establish the broader applicability and scalability of the proposed approach.

Overall, the paper makes a valuable contribution by introducing the critic-actor algorithm and demonstrating its theoretical and empirical viability as an alternative to the standard actor-critic formulation. However, additional research is needed to fully understand the strengths, weaknesses, and broader implications of this approach.

Conclusion

The paper revisits the standard formulation of tabular actor-critic algorithms, observing that reversing the time scales of the value function and policy updates leads to a "critic-actor" algorithm that emulates the value iteration procedure. The authors provide a proof of convergence for this critic-actor algorithm and show that it performs on par with the standard actor-critic approach, both with and without function approximation.

This work suggests that the critic-actor formulation is a legitimate and viable alternative to the more commonly used actor-critic algorithms, offering similar performance characteristics but with the potential advantage of emulating the value iteration process. Further research is needed to explore the sensitivity and scalability of the critic-actor algorithm, as well as its applicability in more complex, nonlinear function approximation settings. Nevertheless, this paper contributes an important new perspective to the field of reinforcement learning algorithm design.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Follow @aimodelsfyi on 𝕏 →

Related Papers

🗣️

Actor-Critic or Critic-Actor? A Tale of Two Time Scales

Shalabh Bhatnagar, Vivek S. Borkar, Soumyajit Guin

We revisit the standard formulation of tabular actor-critic algorithm as a two time-scale stochastic approximation with value function computed on a faster time-scale and policy computed on a slower time-scale. This emulates policy iteration. We observe that reversal of the time scales will in fact emulate value iteration and is a legitimate algorithm. We provide a proof of convergence and compare the two empirically with and without function approximation (with both linear and nonlinear function approximators) and observe that our proposed critic-actor algorithm performs on par with actor-critic in terms of both accuracy and computational effort.

6/13/2024

⚙️

Two-Timescale Critic-Actor for Average Reward MDPs with Function Approximation

Prashansa Panda, Shalabh Bhatnagar

In recent years, there has been a lot of research activity focused on carrying out non-asymptotic convergence analyses for actor-critic algorithms. Recently a two-timescale critic-actor algorithm has been presented for the discounted cost setting in the look-up table case where the timescales of the actor and the critic are reversed and only asymptotic convergence shown. In our work, we present the first two-timescale critic-actor algorithm with function approximation in the long-run average reward setting and present the first finite-time non-asymptotic as well as asymptotic convergence analysis for such a scheme. We obtain optimal learning rates and prove that our algorithm achieves a sample complexity of $mathcal{tilde{O}}(epsilon^{-2.08})$ for the mean squared error of the critic to be upper bounded by $epsilon$ which is better than the one obtained for two-timescale actor-critic in a similar setting. A notable feature of our analysis is that unlike recent single-timescale actor-critic algorithms, we present a complete asymptotic convergence analysis of our scheme in addition to the finite-time bounds that we obtain and show that the (slower) critic recursion converges asymptotically to the attractor of an associated differential inclusion with actor parameters corresponding to local maxima of a perturbed average reward objective. We also show the results of numerical experiments on three benchmark settings and observe that our critic-actor algorithm performs on par and is in fact better than the other algorithms considered.

5/27/2024

🛠️

A Two-Time-Scale Stochastic Optimization Framework with Applications in Control and Reinforcement Learning

Sihan Zeng, Thinh T. Doan, Justin Romberg

We study a new two-time-scale stochastic gradient method for solving optimization problems, where the gradients are computed with the aid of an auxiliary variable under samples generated by time-varying MDPs controlled by the underlying optimization variable. These time-varying samples make gradient directions in our update biased and dependent, which can potentially lead to the divergence of the iterates. In our two-time-scale approach, one scale is to estimate the true gradient from these samples, which is then used to update the estimate of the optimal solution. While these two iterates are implemented simultaneously, the former is updated faster than the latter. Our first contribution is to characterize the finite-time complexity of the proposed two-time-scale stochastic gradient method. In particular, we provide explicit formulas for the convergence rates of this method under different structural assumptions, namely, strong convexity, PL condition, and general non-convexity. We apply our framework to various policy optimization problems. First, we look at the infinite-horizon average-reward MDP with finite state and action spaces and derive a convergence rate of $O(k^{-2/5})$ for the online actor-critic algorithm under function approximation, which recovers the best known rate derived specifically for this problem. Second, we study the linear-quadratic regulator and show that an online actor-critic method converges with rate $O(k^{-2/3})$. Third, we use the actor-critic algorithm to solve the policy optimization problem in an entropy regularized Markov decision process, where we also establish a convergence of $O(k^{-2/3})$. The results we derive for both the second and third problem are novel and previously unknown in the literature. Finally, we briefly present the application of our framework to gradient-based policy evaluation algorithms in reinforcement learning.

8/27/2024

🌿

Finite-Time Analysis of Three-Timescale Constrained Actor-Critic and Constrained Natural Actor-Critic Algorithms

Prashansa Panda, Shalabh Bhatnagar

Actor Critic methods have found immense applications on a wide range of Reinforcement Learning tasks especially when the state-action space is large. In this paper, we consider actor critic and natural actor critic algorithms with function approximation for constrained Markov decision processes (C-MDP) involving inequality constraints and carry out a non-asymptotic analysis for both of these algorithms in a non-i.i.d (Markovian) setting. We consider the long-run average cost criterion where both the objective and the constraint functions are suitable policy-dependent long-run averages of certain prescribed cost functions. We handle the inequality constraints using the Lagrange multiplier method. We prove that these algorithms are guaranteed to find a first-order stationary point (i.e., $Vert nabla L(theta,gamma)Vert_2^2 leq epsilon$) of the performance (Lagrange) function $L(theta,gamma)$, with a sample complexity of $mathcal{tilde{O}}(epsilon^{-2.5})$ in the case of both Constrained Actor Critic (C-AC) and Constrained Natural Actor Critic (C-NAC) algorithms. We also show the results of experiments on three different Safety-Gym environments.

5/30/2024