Adaptive Horizon Actor-Critic for Policy Learning in Contact-Rich Differentiable Simulation

arXiv:2405.17784

Published 6/5/2024 by Ignat Georgiev, Krishnan Srinivasan, Jie Xu, Eric Heiden, Animesh Garg


Abstract

Model-Free Reinforcement Learning (MFRL), leveraging the policy gradient theorem, has demonstrated considerable success in continuous control tasks. However, these approaches are plagued by high gradient variance due to zeroth-order gradient estimation, resulting in suboptimal policies. Conversely, First-Order Model-Based Reinforcement Learning (FO-MBRL) methods employing differentiable simulation provide gradients with reduced variance but are susceptible to sampling error in scenarios involving stiff dynamics, such as physical contact. This paper investigates the source of this error and introduces Adaptive Horizon Actor-Critic (AHAC), an FO-MBRL algorithm that reduces gradient error by adapting the model-based horizon to avoid stiff dynamics. Empirical findings reveal that AHAC outperforms MFRL baselines, attaining 40% more reward across a set of locomotion tasks and efficiently scaling to high-dimensional control environments with improved wall-clock-time efficiency.


Overview

Model-Free Reinforcement Learning (MFRL) methods, based on the policy gradient theorem, have been successful in continuous control tasks.
However, MFRL approaches suffer from high gradient variance due to zeroth-order gradient estimation, leading to suboptimal policies.
First-Order Model-Based Reinforcement Learning (FO-MBRL) methods, using differentiable simulation, provide gradients with reduced variance but are prone to sampling error in scenarios involving stiff dynamics, such as physical contact.
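The contrast between zeroth-order and first-order gradient estimation can be made concrete with a toy example. The sketch below is not taken from the paper: it assumes a one-dimensional quadratic objective in place of a rollout return, and the names `reward`, `reward_grad`, and `gradient_samples` are hypothetical. It estimates the same gradient two ways, once with a likelihood-ratio (zeroth-order) estimator and once by differentiating through the sample (first-order), so that their variances can be compared.

```python
import random
import statistics


def reward(x):
    """Toy smooth objective standing in for a rollout return (assumption)."""
    return x * x


def reward_grad(x):
    """Analytic derivative of the toy objective."""
    return 2.0 * x


def gradient_samples(mu=1.0, sigma=0.5, n=20000, seed=0):
    """Per-sample estimates of d/dmu E[reward(x)] with x ~ N(mu, sigma^2)."""
    rng = random.Random(seed)
    zeroth, first = [], []
    for _ in range(n):
        eps = rng.gauss(0.0, 1.0)
        x = mu + sigma * eps
        # Zeroth-order (likelihood-ratio): reward times the Gaussian score.
        zeroth.append(reward(x) * (x - mu) / sigma**2)
        # First-order (pathwise): differentiate through the sample, dx/dmu = 1.
        first.append(reward_grad(x))
    return zeroth, first


if __name__ == "__main__":
    zo, fo = gradient_samples()
    print("zeroth-order mean, variance:", statistics.fmean(zo), statistics.variance(zo))
    print("first-order  mean, variance:", statistics.fmean(fo), statistics.variance(fo))
```

Both estimators target the same quantity (here the true gradient is 2 * mu), but on this smooth objective the first-order samples are far less spread out. This matches the variance argument made above for smooth dynamics only; it says nothing yet about contact.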

Plain English Explanation

The paper explores a challenging problem in reinforcement learning (RL): how to effectively train agents to perform continuous control tasks, such as robotic locomotion. Model-Free Reinforcement Learning (MFRL) approaches, which rely on the policy gradient theorem, have shown promising results in these domains. However, these methods often suffer from high gradient variance, which means the estimates of the gradients used to update the agent's policy can be quite noisy and unreliable.

In contrast, First-Order Model-Based Reinforcement Learning (FO-MBRL) techniques use a differentiable simulation model to compute gradients more efficiently, reducing this variance. But these model-based methods can struggle in scenarios with stiff dynamics, such as when the agent makes physical contact with the environment. In these cases, the simulation model may introduce sampling error, leading to inaccurate gradient estimates.

The paper introduces a new algorithm called Adaptive Horizon Actor-Critic (AHAC), which aims to address this issue by adapting the model-based horizon to avoid the problematic stiff dynamics. The key idea is to dynamically adjust the length of the model-based planning horizon to find a sweet spot that balances the benefits of reduced variance from the model-based approach with the accuracy needed to handle challenging physical interactions.

Technical Explanation

The paper presents the Adaptive Horizon Actor-Critic (AHAC) algorithm, an FO-MBRL method that addresses the sampling error issues encountered in scenarios with stiff dynamics. The core insight is that by dynamically adjusting the model-based planning horizon, AHAC can avoid the regions of the state space where the simulation model is prone to high error, while still leveraging the benefits of reduced gradient variance from the model-based approach.
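The summary does not spell out the rule AHAC uses to adjust its horizon, so the following is only an illustration of the general idea, not the paper's update. It assumes a per-step stiffness proxy is available from the simulator (for example, some measure of contact intensity), and the function `adapt_horizon`, the `threshold` parameter, and the bounds `h_min` and `h_max` are all hypothetical.

```python
def adapt_horizon(horizon, stiffness_per_step, threshold, h_min=1, h_max=32):
    """Return the next rollout horizon from a per-step stiffness proxy.

    Hypothetical rule (an assumption, not AHAC's actual update): if any step
    inside the current horizon exceeds the threshold, cut the horizon so it
    ends just before the first such step; otherwise grow it by one step.
    The result is clamped to [h_min, h_max].
    """
    window = stiffness_per_step[:horizon]
    for t, value in enumerate(window):
        if value > threshold:
            return max(h_min, min(t, h_max))
    return min(horizon + 1, h_max)


if __name__ == "__main__":
    smooth = [0.1] * 8
    with_contact = [0.1, 0.2, 5.0, 0.1, 0.1, 0.1, 0.1, 0.1]
    print(adapt_horizon(8, smooth, threshold=1.0))
    print(adapt_horizon(8, with_contact, threshold=1.0))
```

The design choice in this sketch is to shrink aggressively when stiffness appears and to grow slowly otherwise, so that first-order gradients are only propagated through steps the proxy considers smooth.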

Empirically, the authors demonstrate that AHAC outperforms MFRL baselines by a significant margin (40% more reward) across a set of challenging locomotion tasks. Furthermore, AHAC is shown to scale efficiently to high-dimensional control environments, with improved wall-clock-time performance compared to the MFRL alternatives.

The paper also provides a detailed analysis of the source of the sampling error in FO-MBRL methods, relating it to the stiffness of the system dynamics. This understanding forms the foundation for the adaptive horizon mechanism employed by AHAC to mitigate this issue.
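A toy experiment can suggest why stiffness hurts first-order estimates when only a finite number of samples is available. This is not the paper's analysis: it assumes a sigmoid of adjustable steepness as a stand-in for a contact event, and the names `soft_contact_grad` and `batch_gradient_spread` are hypothetical. As the `stiffness` parameter grows, the derivative is near zero almost everywhere and very large in a narrow band, so small batches give widely varying gradient estimates even though the expected gradient stays bounded.

```python
import math
import random
import statistics


def soft_contact_grad(x, stiffness):
    """Derivative of sigmoid(stiffness * x), a smooth stand-in for a contact step."""
    s = 0.5 * (1.0 + math.tanh(0.5 * stiffness * x))  # numerically stable sigmoid
    return stiffness * s * (1.0 - s)


def batch_gradient_spread(stiffness, batch_size=16, n_batches=500, seed=0):
    """Std. dev. across batches of the first-order gradient estimate, x ~ N(0, 1)."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(n_batches):
        grads = [
            soft_contact_grad(rng.gauss(0.0, 1.0), stiffness)
            for _ in range(batch_size)
        ]
        estimates.append(statistics.fmean(grads))
    return statistics.pstdev(estimates)


if __name__ == "__main__":
    for k in (1.0, 10.0, 100.0):
        print(f"stiffness={k:6.1f}  spread of batch estimates={batch_gradient_spread(k):.3f}")
```

In this setup the spread of the batch estimates increases with stiffness, which is one simple way to picture the sampling error the summary attributes to stiff dynamics.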

Critical Analysis

The paper makes a valuable contribution by addressing a key limitation of FO-MBRL methods - their susceptibility to sampling error in the presence of stiff dynamics. The introduction of the Adaptive Horizon Actor-Critic (AHAC) algorithm represents a promising step forward in overcoming this challenge.

However, the paper does not delve into the potential limitations or caveats of the AHAC approach. For example, it would be interesting to understand how the adaptive horizon mechanism performs in highly stochastic or partially observable environments, where the model-based component may face additional challenges.

Additionally, although AHAC is shown to scale to high-dimensional locomotion environments, the paper could explore more complex tasks beyond the locomotion scenarios presented. A comparison of the sample efficiency and training stability of AHAC against state-of-the-art MFRL and FO-MBRL methods would also be valuable.

Overall, the research presented in this paper represents a significant advancement in actor-critic reinforcement learning and in learning locomotion via differentiable simulation. By adapting to stiff dynamics, AHAC may also inform the related lines of work listed under Related Papers, such as deep reinforcement learning for infinite-horizon mean-field problems and actor-critic model predictive control.

Conclusion

This paper addresses a crucial challenge in reinforcement learning - the high gradient variance and sampling error issues encountered by existing methods when dealing with continuous control tasks involving stiff dynamics. The introduction of the Adaptive Horizon Actor-Critic (AHAC) algorithm represents a promising solution that combines the benefits of model-based and model-free approaches, while mitigating their respective weaknesses.

The empirical results demonstrating AHAC's superior performance and efficiency compared to MFRL baselines suggest that this approach could significantly advance the state of the art in continuous control tasks, with potential applications in robotic control and relevance to theoretical work on the finite-time convergence and sample complexity of actor-critic methods. The insights gained from this research can also inform the development of future reinforcement learning algorithms that need to operate in complex, physically-grounded environments.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Actor-Critic Reinforcement Learning with Phased Actor

Ruofan Wu, Junmin Zhong, Jennie Si

Policy gradient methods in actor-critic reinforcement learning (RL) have become perhaps the most promising approaches to solving continuous optimal control problems. However, the trial-and-error nature of RL and the inherent randomness associated with solution approximations cause variations in the learned optimal values and policies. This has significantly hindered their successful deployment in real life applications where control responses need to meet dynamic performance criteria deterministically. Here we propose a novel phased actor in actor-critic (PAAC) method, aiming at improving policy gradient estimation and thus the quality of the control policy. Specifically, PAAC accounts for both $Q$ value and TD error in its actor update. We prove qualitative properties of PAAC for learning convergence of the value and policy, solution optimality, and stability of system dynamics. Additionally, we show variance reduction in policy gradient estimation. PAAC performance is systematically and quantitatively evaluated in this study using DeepMind Control Suite (DMC). Results show that PAAC leads to significant performance improvement measured by total cost, learning variance, robustness, learning speed and success rate. As PAAC can be piggybacked onto general policy gradient learning frameworks, we select well-known methods such as direct heuristic dynamic programming (dHDP), deep deterministic policy gradient (DDPG) and their variants to demonstrate the effectiveness of PAAC. Consequently we provide a unified view on these related policy gradient algorithms.

4/19/2024

cs.LG

Learning Quadrupedal Locomotion via Differentiable Simulation

Clemens Schwarke, Victor Klemm, Jesus Tordesillas, Jean-Pierre Sleiman, Marco Hutter

The emergence of differentiable simulators enabling analytic gradient computation has motivated a new wave of learning algorithms that hold the potential to significantly increase sample efficiency over traditional Reinforcement Learning (RL) methods. While recent research has demonstrated performance gains in scenarios with comparatively smooth dynamics and, thus, smooth optimization landscapes, research on leveraging differentiable simulators for contact-rich scenarios, such as legged locomotion, is scarce. This may be attributed to the discontinuous nature of contact, which introduces several challenges to optimizing with analytic gradients. The purpose of this paper is to determine if analytic gradients can be beneficial even in the face of contact. Our investigation focuses on the effects of different soft and hard contact models on the learning process, examining optimization challenges through the lens of contact simulation. We demonstrate the viability of employing analytic gradients to learn physically plausible locomotion skills with a quadrupedal robot using Short-Horizon Actor-Critic (SHAC), a learning algorithm leveraging analytic gradients, and draw a comparison to a state-of-the-art RL algorithm, Proximal Policy Optimization (PPO), to understand the benefits of analytic gradients.

4/4/2024

cs.RO

AC4MPC: Actor-Critic Reinforcement Learning for Nonlinear Model Predictive Control

Rudolf Reiter, Andrea Ghezzi, Katrin Baumgartner, Jasper Hoffmann, Robert D. McAllister, Moritz Diehl

Model predictive control (MPC) and reinforcement learning (RL) are two powerful control strategies with, arguably, complementary advantages. In this work, we show how actor-critic RL techniques can be leveraged to improve the performance of MPC. The RL critic is used as an approximation of the optimal value function, and an actor roll-out provides an initial guess for primal variables of the MPC. A parallel control architecture is proposed where each MPC instance is solved twice for different initial guesses. Besides the actor roll-out initialization, a shifted initialization from the previous solution is used. Thereafter, the actor and the critic are again used to approximately evaluate the infinite horizon cost of these trajectories. The control actions from the lowest-cost trajectory are applied to the system at each time step. We establish that the proposed algorithm is guaranteed to outperform the original RL policy plus an error term that depends on the accuracy of the critic and decays with the horizon length of the MPC formulation. Moreover, we do not require globally optimal solutions for these guarantees to hold. The approach is demonstrated on an illustrative toy example and an autonomous driving (AD) overtaking scenario.

6/7/2024

eess.SY cs.AI cs.SY

Deep Reinforcement Learning for Infinite Horizon Mean Field Problems in Continuous Spaces

Andrea Angiuli, Jean-Pierre Fouque, Ruimeng Hu, Alan Raydan

We present the development and analysis of a reinforcement learning (RL) algorithm designed to solve continuous-space mean field game (MFG) and mean field control (MFC) problems in a unified manner. The proposed approach pairs the actor-critic (AC) paradigm with a representation of the mean field distribution via a parameterized score function, which can be efficiently updated in an online fashion, and uses Langevin dynamics to obtain samples from the resulting distribution. The AC agent and the score function are updated iteratively to converge, either to the MFG equilibrium or the MFC optimum for a given mean field problem, depending on the choice of learning rates. A straightforward modification of the algorithm allows us to solve mixed mean field control games (MFCGs). The performance of our algorithm is evaluated using linear-quadratic benchmarks in the asymptotic infinite horizon framework.

5/6/2024

cs.LG