Gen2Act: Human Video Generation in Novel Scenarios enables Generalizable Robot Manipulation

Read original: arXiv:2409.16283 - Published 9/25/2024 by Homanga Bharadhwaj, Debidatta Dwibedi, Abhinav Gupta, Shubham Tulsiani, Carl Doersch, Ted Xiao, Dhruv Shah, Fei Xia, Dorsa Sadigh, Sean Kirmani

Overview

Gen2Act is a new approach that enables robots to perform manipulation tasks by conditioning on generated videos of humans performing them
The approach is aimed at generalization to novel scenarios, including unseen object types and motions that are not present in the robot training data
The key components are a pre-trained video generation model that produces a human video of the requested task, and a single robot policy network conditioned on that generated video

Plain English Explanation

Gen2Act is a new system that teaches robots how to perform manipulation tasks by learning from videos of humans. Rather than just copying the exact motions they see in the training videos, the robots use Gen2Act to generalize and figure out how to adapt those skills to new situations.

The key innovation is the use of a video generation model that can create a brand new video of a human performing the requested task. Because this model is trained on easily available web video, it can depict objects and motions that never appear in the robot's own training data. The robot then uses a policy network, conditioned on the generated video, to carry out the task itself.

This lets the robot handle tasks beyond what it could learn from a fixed set of robot demonstrations alone, without scaling up robot data collection, which is expensive. It can apply what it has learned in novel scenarios and settings that weren't part of the original training data.

Technical Explanation

The Gen2Act system consists of two key components:

Human Video Generation Model: This model takes in a textual description of a task and generates a corresponding video of a human performing that task. According to the abstract, the model is pre-trained on web data and used zero-shot, with no fine-tuning, so it can depict tasks that are not present in the robot data.
Robot Policy Network: A single policy is conditioned on the generated human video and outputs the robot actions needed to carry out the demonstrated task. According to the abstract, the policy is trained on an order of magnitude less robot interaction data than the video model was trained on.

The researchers evaluated Gen2Act on a range of manipulation tasks, including opening drawers, pushing buttons, and grasping objects. They found that robots using Gen2Act significantly outperformed baselines that only had access to the original training videos, demonstrating the benefits of the video generation and policy learning approach.
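
The two components described above can be sketched as a two-stage loop: generate one human video for the task, then run a policy that is conditioned on that video. The following is a minimal illustration under stated assumptions, not the authors' implementation. The function names, the toy video model, the toy policy, the scene-image input to the video model, and the closed-loop rollout are all hypothetical choices made so the example runs on its own.

```python
from typing import Callable, List

Frame = List[List[float]]  # toy grayscale image; a real system would use camera images
Action = List[float]       # toy action vector; the real action space is not specified here


def run_gen2act_style_episode(
    video_model: Callable[[Frame, str], List[Frame]],
    policy: Callable[[List[Frame], Frame], Action],
    get_observation: Callable[[], Frame],
    apply_action: Callable[[Action], None],
    instruction: str,
    max_steps: int,
) -> List[Action]:
    """Generate one human video, then roll out a policy conditioned on it."""
    # Stage 1: query the frozen video model once (zero-shot, no fine-tuning).
    # Passing the current scene image is an assumption of this sketch.
    human_video = video_model(get_observation(), instruction)
    # Stage 2: execute step by step, conditioning every step on the same video.
    # Closed-loop execution is also an assumption of this sketch.
    actions: List[Action] = []
    for _ in range(max_steps):
        action = policy(human_video, get_observation())
        apply_action(action)
        actions.append(action)
    return actions


def mean_pixel(frame: Frame) -> float:
    """Average pixel value of a toy frame."""
    pixels = [px for row in frame for px in row]
    return sum(pixels) / len(pixels)


video_model_calls = 0  # counts how often the (hypothetical) generator is queried


def toy_video_model(first_frame: Frame, instruction: str) -> List[Frame]:
    """Stand-in generator: returns 4 frames that brighten step by step."""
    global video_model_calls
    video_model_calls += 1
    return [[[px + 0.1 * t for px in row] for row in first_frame] for t in range(4)]


def toy_policy(human_video: List[Frame], observation: Frame) -> Action:
    """Stand-in policy: moves the scene toward the last generated frame."""
    return [mean_pixel(human_video[-1]) - mean_pixel(observation)]


class ToyScene:
    """Stand-in environment whose state is a single 2x2 frame."""

    def __init__(self) -> None:
        self.frame: Frame = [[0.0, 0.0], [0.0, 0.0]]

    def observe(self) -> Frame:
        return [row[:] for row in self.frame]

    def step(self, action: Action) -> None:
        # Apply half of the commanded change so several steps are needed.
        self.frame = [[px + 0.5 * action[0] for px in row] for row in self.frame]


scene = ToyScene()
actions = run_gen2act_style_episode(
    toy_video_model, toy_policy, scene.observe, scene.step,
    instruction="open the drawer", max_steps=3,
)
print(actions)
```

In this sketch the video model is queried once per task and the policy sees the same generated video at every step, which mirrors the "generate, then execute" ordering in the abstract. Swapping in real models would only require matching the two callable interfaces, which are themselves assumptions.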

Critical Analysis

The Gen2Act paper makes a compelling case for the value of using generated human video demonstrations to train more flexible and generalizable robot manipulation skills. However, several limitations are worth noting:

The video generation model has room for improvement in terms of realism and diversity of the generated examples.
The policy learning approach assumes the robot has access to perfect state information, which may not always be the case in real-world settings.
The abstract reports results in diverse real-world scenarios, but further testing is needed to validate the approach across a wider range of physical robot systems and settings.

Additionally, one could question whether the generated videos truly capture the nuanced understanding of the task that an expert human demonstrator would have. There may be subtle contextual cues or decision-making processes that are difficult to fully replicate through generation alone.

Overall, Gen2Act represents an exciting step forward in enabling robots to learn from human demonstrations in a more flexible and scalable way. With further refinements and broader real-world validation, this approach could have significant implications for advancing the state of robot manipulation capabilities.

Conclusion

Gen2Act introduces a novel approach for training robots to perform manipulation tasks by learning from generated human video demonstrations. This allows the robots to generalize beyond the specific examples in the training data, enabling them to adapt their skills to new scenarios and tasks.

The key innovations of Gen2Act are the use of a pre-trained video generation model to produce a human video of the requested task, and a policy network conditioned on that generated video. Because the video model draws on web data, the robot can handle object types and motions that are absent from its own training data.

While the current system has some limitations, Gen2Act represents an important step forward in enabling robots to learn from human expertise in a scalable and adaptable way. As the underlying technologies continue to improve, this approach could have significant implications for advancing the field of robot manipulation and autonomy.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Gen2Act: Human Video Generation in Novel Scenarios enables Generalizable Robot Manipulation

Homanga Bharadhwaj, Debidatta Dwibedi, Abhinav Gupta, Shubham Tulsiani, Carl Doersch, Ted Xiao, Dhruv Shah, Fei Xia, Dorsa Sadigh, Sean Kirmani

How can robot manipulation policies generalize to novel tasks involving unseen object types and new motions? In this paper, we provide a solution in terms of predicting motion information from web data through human video generation and conditioning a robot policy on the generated video. Instead of attempting to scale robot data collection which is expensive, we show how we can leverage video generation models trained on easily available web data, for enabling generalization. Our approach Gen2Act casts language-conditioned manipulation as zero-shot human video generation followed by execution with a single policy conditioned on the generated video. To train the policy, we use an order of magnitude less robot interaction data compared to what the video prediction model was trained on. Gen2Act doesn't require fine-tuning the video model at all and we directly use a pre-trained model for generating human videos. Our results on diverse real-world scenarios show how Gen2Act enables manipulating unseen object types and performing novel motions for tasks not present in the robot data. Videos are at https://homangab.github.io/gen2act/

9/25/2024

Track2Act: Predicting Point Tracks from Internet Videos enables Diverse Zero-shot Robot Manipulation

Homanga Bharadhwaj, Roozbeh Mottaghi, Abhinav Gupta, Shubham Tulsiani

We seek to learn a generalizable goal-conditioned policy that enables zero-shot robot manipulation: interacting with unseen objects in novel scenes without test-time adaptation. While typical approaches rely on a large amount of demonstration data for such generalization, we propose an approach that leverages web videos to predict plausible interaction plans and learns a task-agnostic transformation to obtain robot actions in the real world. Our framework, Track2Act, predicts tracks of how points in an image should move in future time-steps based on a goal, and can be trained with diverse videos on the web including those of humans and robots manipulating everyday objects. We use these 2D track predictions to infer a sequence of rigid transforms of the object to be manipulated, and obtain robot end-effector poses that can be executed in an open-loop manner. We then refine this open-loop plan by predicting residual actions through a closed loop policy trained with a few embodiment-specific demonstrations. We show that this approach of combining scalably learned track prediction with a residual policy requiring minimal in-domain robot-specific data enables diverse generalizable robot manipulation, and present a wide array of real-world robot manipulation results across unseen tasks, objects, and scenes. https://homangab.github.io/track2act/

8/12/2024

Dreamitate: Real-World Visuomotor Policy Learning via Video Generation

Junbang Liang, Ruoshi Liu, Ege Ozguroglu, Sruthi Sudhakar, Achal Dave, Pavel Tokmakov, Shuran Song, Carl Vondrick

A key challenge in manipulation is learning a policy that can robustly generalize to diverse visual environments. A promising mechanism for learning robust policies is to leverage video generative models, which are pretrained on large-scale datasets of internet videos. In this paper, we propose a visuomotor policy learning framework that fine-tunes a video diffusion model on human demonstrations of a given task. At test time, we generate an example of an execution of the task conditioned on images of a novel scene, and use this synthesized execution directly to control the robot. Our key insight is that using common tools allows us to effortlessly bridge the embodiment gap between the human hand and the robot manipulator. We evaluate our approach on four tasks of increasing complexity and demonstrate that harnessing internet-scale generative models allows the learned policy to achieve a significantly higher degree of generalization than existing behavior cloning approaches.

6/26/2024

Hand-Object Interaction Pretraining from Videos

Himanshu Gaurav Singh, Antonio Loquercio, Carmelo Sferrazza, Jane Wu, Haozhi Qi, Pieter Abbeel, Jitendra Malik

We present an approach to learn general robot manipulation priors from 3D hand-object interaction trajectories. We build a framework to use in-the-wild videos to generate sensorimotor robot trajectories. We do so by lifting both the human hand and the manipulated object in a shared 3D space and retargeting human motions to robot actions. Generative modeling on this data gives us a task-agnostic base policy. This policy captures a general yet flexible manipulation prior. We empirically demonstrate that finetuning this policy, with both reinforcement learning (RL) and behavior cloning (BC), enables sample-efficient adaptation to downstream tasks and simultaneously improves robustness and generalizability compared to prior approaches. Qualitative experiments are available at: https://hgaurav2k.github.io/hop/

9/14/2024