SAFE-GIL: SAFEty Guided Imitation Learning

2404.05249

Published 4/9/2024 by Yusuf Umut Ciftci, Zeyuan Feng, Somil Bansal

Abstract

Behavior Cloning is a popular approach to Imitation Learning, in which a robot observes an expert supervisor and learns a control policy. However, behavior cloning suffers from the compounding error problem - the policy errors compound as it deviates from the expert demonstrations and might lead to catastrophic system failures, limiting its use in safety-critical applications. On-policy data aggregation methods are able to address this issue at the cost of rolling out and repeated training of the imitation policy, which can be tedious and computationally prohibitive. We propose SAFE-GIL, an off-policy behavior cloning method that guides the expert via adversarial disturbance during data collection. The algorithm abstracts the imitation error as an adversarial disturbance in the system dynamics, injects it during data collection to expose the expert to safety critical states, and collects corrective actions. Our method biases training to more closely replicate expert behavior in safety-critical states and allows more variance in less critical states. We compare our method with several behavior cloning techniques and DAgger on autonomous navigation and autonomous taxiing tasks and show higher task success and safety, especially in low data regimes where the likelihood of error is higher, at a slight drop in the performance.
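The data-collection idea in the abstract can be illustrated with a toy example. The sketch below is not the authors' implementation: it assumes a hypothetical 1D system with lateral position `y`, walls at `|y| = 1`, a hand-written proportional expert, and a disturbance that simply pushes toward the nearest wall. The abstract does not say how the disturbance is computed, so the distance-to-wall rule, the function names, and all parameter values here are assumptions made for illustration.

```python
import random


def expert_action(y, gain=2.0, u_max=1.0):
    """Hypothetical expert: steer back toward the centerline y = 0."""
    return max(-u_max, min(u_max, -gain * y))


def adversarial_disturbance(y, d_max):
    """Assumed disturbance rule: push toward the nearest wall (|y| = 1)."""
    return d_max if y >= 0 else -d_max


def collect_demonstrations(d_max, n_episodes=20, horizon=50, dt=0.1, seed=0):
    """Roll out the expert while a bounded disturbance perturbs the dynamics.

    Returns (state, expert_action) pairs. The disturbance only changes which
    states are visited; the recorded label is always the expert's own action.
    """
    rng = random.Random(seed)
    data = []
    for _ in range(n_episodes):
        y = rng.uniform(-0.2, 0.2)  # start near the centerline
        for _ in range(horizon):
            u = expert_action(y)
            data.append((y, u))
            d = adversarial_disturbance(y, d_max)
            y = y + dt * (u + d)  # disturbed dynamics
    return data


# Same seed, so both runs share initial states and differ only in disturbance.
nominal = collect_demonstrations(d_max=0.0)
guided = collect_demonstrations(d_max=0.8)
print("max |y| without disturbance:", max(abs(y) for y, _ in nominal))
print("max |y| with disturbance:   ", max(abs(y) for y, _ in guided))
```

In this toy setting the disturbed rollouts visit states closer to the walls than the undisturbed ones, while every stored label is still the expert's corrective action for the state it was actually in. The disturbance bound is chosen smaller than the expert's control authority so the expert can always recover.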

Overview

• The paper introduces SAFE-GIL (SAFEty Guided Imitation Learning), an off-policy behavior cloning method for teaching a robot a control policy from expert demonstrations.
• SAFE-GIL targets the compounding error problem: small policy errors move the robot away from the states the expert demonstrated, where further errors accumulate and can lead to catastrophic failure.
• The key idea is to treat the imitation error as an adversarial disturbance in the system dynamics and inject it during data collection, so the expert is steered into safety-critical states and its corrective actions are recorded.

Plain English Explanation

• Imagine you're trying to teach a robot how to play a video game by showing it videos of an expert player. Once the robot drifts even slightly from what the expert did, it lands in situations the videos never covered, and its mistakes snowball.
• SAFE-GIL deliberately nudges the expert toward risky situations while the demonstrations are being recorded, so the recordings also show how the expert recovers.
• This is like recording an expert while someone keeps bumping the controller toward danger: the resulting videos show what to do when things start to go wrong, not only what a flawless run looks like.

Technical Explanation

• SAFE-GIL is related to work such as Hierarchical Generative Adversarial Imitation Learning and Sensor Imitate: Learning to Imitate Experts' Behaviors via Sensing and Reasoning, which also deal with covariate shift.
• The key components, as described in the abstract, are: 1) data collection in which an adversarial disturbance is injected into the system dynamics, 2) recording the expert's corrective actions in the safety-critical states this produces, and 3) behavior cloning on the collected data, which biases training to replicate the expert more closely in safety-critical states and allows more variance in less critical ones.
• Experiments on autonomous navigation and autonomous taxiing tasks, against several behavior cloning techniques and DAgger, show higher task success and safety, especially in low-data regimes, at a slight drop in performance.
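The imitation step itself is behavior cloning, which amounts to supervised regression from states to expert actions. The sketch below shows that step in its simplest form, assuming a linear policy `u = w * y + b` fitted by closed-form least squares. The function name `fit_linear_policy` and the toy demonstrations are hypothetical; the paper's actual policy class and training procedure are not described in this summary.

```python
def fit_linear_policy(pairs):
    """Least-squares fit of u = w * y + b to (state, expert_action) pairs."""
    n = len(pairs)
    mean_y = sum(y for y, _ in pairs) / n
    mean_u = sum(u for _, u in pairs) / n
    var_y = sum((y - mean_y) ** 2 for y, _ in pairs)
    if var_y == 0.0:
        raise ValueError("need at least two distinct states to fit a slope")
    cov = sum((y - mean_y) * (u - mean_u) for y, u in pairs)
    w = cov / var_y
    b = mean_u - w * mean_y
    return w, b


# Hypothetical demonstrations from an expert that steers toward y = 0
# with action u = -2 * y, sampled at states between -0.5 and 0.5.
demos = [(k / 10.0, -2.0 * (k / 10.0)) for k in range(-5, 6)]
w, b = fit_linear_policy(demos)
print("learned policy: u = %.3f * y + %.3f" % (w, b))
```

Because the fit only sees the states present in `demos`, which states the expert visits during data collection determines where the cloned policy matches the expert most closely.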

Critical Analysis

• It is unclear from the abstract how sensitive the method is to the way the adversarial disturbance is modeled. If the disturbance does not steer the expert toward the states that matter most for safety, the resulting guidance could be suboptimal.
• The summary also does not cover how SAFE-GIL would scale to more complex, high-dimensional tasks, where covariate shift is a significant challenge.
• Further research is needed to understand the limitations of this approach and how it compares to other techniques for addressing covariate shift in imitation learning, such as Programmatic Imitation Learning or Toward a Surgeon-in-the-Loop Ophthalmic Robotic Apprentice.

Conclusion

• SAFE-GIL introduces a novel approach to the compounding error problem in imitation learning: it guides the expert toward safety-critical states with an adversarial disturbance during data collection.
• The results on autonomous navigation and autonomous taxiing tasks suggest that this technique could be a valuable tool for training policies from expert demonstrations in safety-critical settings, particularly when data is scarce.
• However, further research is needed to fully understand the strengths, limitations, and broader applicability of this approach within the field of imitation learning.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

ISAACS: Iterative Soft Adversarial Actor-Critic for Safety

Kai-Chieh Hsu, Duy Phuong Nguyen, Jaime Fernández Fisac

The deployment of robots in uncontrolled environments requires them to operate robustly under previously unseen scenarios, like irregular terrain and wind conditions. Unfortunately, while rigorous safety frameworks from robust optimal control theory scale poorly to high-dimensional nonlinear dynamics, control policies computed by more tractable deep methods lack guarantees and tend to exhibit little robustness to uncertain operating conditions. This work introduces a novel approach enabling scalable synthesis of robust safety-preserving controllers for robotic systems with general nonlinear dynamics subject to bounded modeling error by combining game-theoretic safety analysis with adversarial reinforcement learning in simulation. Following a soft actor-critic scheme, a safety-seeking fallback policy is co-trained with an adversarial disturbance agent that aims to invoke the worst-case realization of model error and training-to-deployment discrepancy allowed by the designer's uncertainty. While the learned control policy does not intrinsically guarantee safety, it is used to construct a real-time safety filter (or shield) with robust safety guarantees based on forward reachability rollouts. This shield can be used in conjunction with a safety-agnostic control policy, precluding any task-driven actions that could result in loss of safety. We evaluate our learning-based safety approach in a 5D race car simulator, compare the learned safety policy to the numerically obtained optimal solution, and empirically validate the robust safety guarantee of our proposed safety shield against worst-case model discrepancy.

6/11/2024

cs.LG cs.RO cs.SY eess.SY

CIMRL: Combining IMitation and Reinforcement Learning for Safe Autonomous Driving

Jonathan Booher, Khashayar Rohanimanesh, Junhong Xu, Vladislav Isenbaev, Ashwin Balakrishna, Ishan Gupta, Wei Liu, Aleksandr Petiushko

Modern approaches to autonomous driving rely heavily on learned components trained with large amounts of human driving data via imitation learning. However, these methods require large amounts of expensive data collection and even then face challenges with safely handling long-tail scenarios and compounding errors over time. At the same time, pure Reinforcement Learning (RL) methods can fail to learn performant policies in sparse, constrained, and challenging-to-define reward settings like driving. Both of these challenges make deploying purely cloned policies in safety critical applications like autonomous vehicles challenging. In this paper we propose Combining IMitation and Reinforcement Learning (CIMRL) approach - a framework that enables training driving policies in simulation through leveraging imitative motion priors and safety constraints. CIMRL does not require extensive reward specification and improves on the closed loop behavior of pure cloning methods. By combining RL and imitation, we demonstrate that our method achieves state-of-the-art results in closed loop simulation driving benchmarks.

6/27/2024

cs.LG

MEGA-DAgger: Imitation Learning with Multiple Imperfect Experts

Xiatao Sun, Shuo Yang, Mingyan Zhou, Kunpeng Liu, Rahul Mangharam

Imitation learning has been widely applied to various autonomous systems thanks to recent development in interactive algorithms that address covariate shift and compounding errors induced by traditional approaches like behavior cloning. However, existing interactive imitation learning methods assume access to one perfect expert. Whereas in reality, it is more likely to have multiple imperfect experts instead. In this paper, we propose MEGA-DAgger, a new DAgger variant that is suitable for interactive learning with multiple imperfect experts. First, unsafe demonstrations are filtered while aggregating the training data, so the imperfect demonstrations have little influence when training the novice policy. Next, experts are evaluated and compared on scenarios-specific metrics to resolve the conflicted labels among experts. Through experiments in autonomous racing scenarios, we demonstrate that policy learned using MEGA-DAgger can outperform both experts and policies learned using the state-of-the-art interactive imitation learning algorithms such as Human-Gated DAgger. The supplementary video can be found at https://youtu.be/wPCht31MHrw.

5/3/2024

cs.LG cs.RO

Diffusion Model-Augmented Behavioral Cloning

Shang-Fu Chen, Hsiang-Chun Wang, Ming-Hao Hsu, Chun-Mao Lai, Shao-Hua Sun

Imitation learning addresses the challenge of learning by observing an expert's demonstrations without access to reward signals from environments. Most existing imitation learning methods that do not require interacting with environments either model the expert distribution as the conditional probability p(a|s) (e.g., behavioral cloning, BC) or the joint probability p(s, a). Despite the simplicity of modeling the conditional probability with BC, it usually struggles with generalization. While modeling the joint probability can improve generalization performance, the inference procedure is often time-consuming, and the model can suffer from manifold overfitting. This work proposes an imitation learning framework that benefits from modeling both the conditional and joint probability of the expert distribution. Our proposed Diffusion Model-Augmented Behavioral Cloning (DBC) employs a diffusion model trained to model expert behaviors and learns a policy to optimize both the BC loss (conditional) and our proposed diffusion model loss (joint). DBC outperforms baselines in various continuous control tasks in navigation, robot arm manipulation, dexterous manipulation, and locomotion. We design additional experiments to verify the limitations of modeling either the conditional probability or the joint probability of the expert distribution, as well as compare different generative models. Ablation studies justify the effectiveness of our design choices.

6/4/2024

cs.LG cs.AI cs.RO