Track2Act: Predicting Point Tracks from Internet Videos enables Diverse Zero-shot Robot Manipulation

Read original: arXiv:2405.01527 - Published 8/12/2024 by Homanga Bharadhwaj, Roozbeh Mottaghi, Abhinav Gupta, Shubham Tulsiani

Overview

This paper proposes Track2Act, an approach to zero-shot robot manipulation: interacting with unseen objects in novel scenes without any test-time adaptation.
Typical approaches rely on a large amount of demonstration data for such generalization, but this paper leverages web videos to predict plausible interaction plans and learns a task-agnostic transformation to obtain robot actions in the real world.
The framework predicts tracks of how points in an image should move in future time-steps based on a goal, and can be trained with diverse videos on the web including those of humans and robots manipulating everyday objects.
It uses these 2D track predictions to infer a sequence of rigid transforms of the object to be manipulated, and obtain robot end-effector poses that can be executed in an open-loop manner.
The open-loop plan is then refined by predicting residual actions through a closed loop policy trained with a few embodiment-specific demonstrations.

Plain English Explanation

The paper presents a system called Track2Act that allows robots to interact with new objects and scenes without any prior training on those specific tasks. This is an important capability, as it means robots can be deployed in a wide variety of situations without the need for extensive data collection and fine-tuning.

The key innovation is using web videos of humans and robots manipulating objects to learn general patterns of how objects move during interactions. The system can then take a goal (e.g. push this object to the left) and use the learned patterns to predict how the object should move to achieve that goal. From these 2D movement predictions, the system infers the necessary robot motions to carry out the task.

To refine this open-loop plan, the system also learns a small residual policy that makes minor adjustments during execution, trained on a few demonstrations collected on the robot itself. This combination of scalable learning from web data and a small amount of robot-specific data enables the zero-shot manipulation capabilities.

The result is a robot that can adapt to new objects and scenes without requiring retraining, paving the way for more versatile and practical robot assistants that can handle a wide range of tasks.

Technical Explanation

The core of the Track2Act framework is its ability to predict future 2D trajectories of object points based on a provided goal. This is done by training a neural network on a diverse dataset of web videos showing humans and robots manipulating everyday objects.

The network takes an image of the current scene along with a goal (e.g. "move the object to the left") and outputs a set of 2D trajectories that specify how points in the image should move over time to achieve that goal. These 2D track predictions are then used to infer a sequence of rigid transforms of the object being manipulated, from which the robot's end-effector poses are obtained and executed in an open-loop manner.
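The step from point tracks to rigid transforms can be illustrated with a small planar example. The sketch below is not the paper's procedure: it assumes the tracked points move as a single rigid body in the image plane and fits a 2D rotation and translation by least squares, whereas the summary only states that rigid transforms of the object are inferred from the tracks. The names `fit_rigid_2d` and `transforms_from_tracks` are hypothetical and chosen for illustration.

```python
import math


def fit_rigid_2d(src, dst):
    """Least-squares 2D rotation angle and translation mapping src onto dst.

    src and dst are equal-length lists of (x, y) points in correspondence.
    Illustrative planar simplification, not the paper's method.
    """
    n = len(src)
    cx_s = sum(x for x, _ in src) / n
    cy_s = sum(y for _, y in src) / n
    cx_d = sum(x for x, _ in dst) / n
    cy_d = sum(y for _, y in dst) / n

    # Accumulate dot and cross products of the centered point pairs.
    dot = 0.0
    cross = 0.0
    for (xs, ys), (xd, yd) in zip(src, dst):
        ax, ay = xs - cx_s, ys - cy_s
        bx, by = xd - cx_d, yd - cy_d
        dot += ax * bx + ay * by
        cross += ax * by - ay * bx

    # Closed-form optimal rotation angle in 2D.
    theta = math.atan2(cross, dot)
    c, s = math.cos(theta), math.sin(theta)

    # Translation that maps the rotated source centroid onto the target centroid.
    tx = cx_d - (c * cx_s - s * cy_s)
    ty = cy_d - (s * cx_s + c * cy_s)
    return theta, (tx, ty)


def transforms_from_tracks(tracks):
    """tracks[t] lists the (x, y) positions of all tracked points at step t.

    Returns one (theta, (tx, ty)) per future step, each relative to step 0.
    """
    return [fit_rigid_2d(tracks[0], frame) for frame in tracks[1:]]


# Toy track: a unit square rotated by 90 degrees and shifted by (2, 3).
square = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
moved = [(2.0, 3.0), (2.0, 4.0), (1.0, 4.0), (1.0, 3.0)]
plan = transforms_from_tracks([square, moved])
```

In this toy case the recovered transform is a quarter-turn rotation plus the (2, 3) shift. A real system would need to lift such estimates to 3D before they could serve as end-effector targets.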

To refine this open-loop plan, the system also trains a closed-loop residual policy that predicts small corrective actions on top of the plan. This policy is trained on a few embodiment-specific demonstrations, so it requires only a small amount of robot data.
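The interplay between the open-loop plan and the residual policy can be sketched as a simple control loop. This is a minimal illustration under stated assumptions: the correction is taken to be additive, actions are plain lists of numbers, and `refine_open_loop_plan`, `residual_policy`, and `get_observation` are hypothetical names rather than anything defined by the paper.

```python
def refine_open_loop_plan(open_loop_actions, residual_policy, get_observation):
    """Apply a closed-loop correction to each step of an open-loop plan.

    open_loop_actions: list of planned actions, each a list of floats.
    residual_policy:   callable (observation, planned_action) -> correction,
                       assumed here to be added to the planned action.
    get_observation:   callable (step_index) -> current observation.
    """
    executed = []
    for step, planned in enumerate(open_loop_actions):
        # Closed loop: the correction depends on what is observed right now.
        observation = get_observation(step)
        correction = residual_policy(observation, planned)
        action = [p + r for p, r in zip(planned, correction)]
        executed.append(action)
    return executed


# Toy stand-ins: the "policy" nudges the first coordinate by an observed offset.
toy_plan = [[0.0, 0.0, 0.1], [0.0, 0.0, 0.2]]
toy_policy = lambda obs, planned: [obs["dx"], 0.0, 0.0]
toy_observe = lambda step: {"dx": 0.01 * (step + 1)}

refined = refine_open_loop_plan(toy_plan, toy_policy, toy_observe)
```

Keeping the residual separate from the plan means the bulk of the behavior comes from the track-based plan, while the learned component only has to model small corrections.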

The authors evaluate the Track2Act framework on a wide range of real-world robot manipulation tasks, demonstrating its ability to interact with unseen objects in novel scenes without any test-time adaptation. This contrasts with typical approaches, which rely on large amounts of demonstration data to achieve such generalization.

Critical Analysis

The Track2Act framework represents an impressive step forward in enabling robots to adapt to new situations without the need for extensive retraining. By leveraging web data and learning general patterns of object manipulation, the system can transfer its knowledge to novel tasks and scenes.

However, the paper does not address some potential limitations and areas for further research. For instance, the system relies on being able to accurately predict 2D trajectories of object points, which may be challenging in cluttered or dynamic environments. Additionally, the residual policy component requires some embodiment-specific demonstrations, so the system is not entirely free of robot data.

Further research could explore ways to reduce or eliminate the need for any robot-specific data, perhaps by developing more robust and generalizable prediction models or incorporating additional sources of information (e.g. 3D data, semantic understanding of the environment). Addressing these limitations could further enhance the versatility and practicality of zero-shot robot manipulation systems.

Conclusion

The Track2Act framework presented in this paper represents a significant advancement in the field of robot manipulation, enabling zero-shot interaction with unseen objects and scenes. By leveraging web data to learn general patterns of object manipulation and combining this with a residual policy to refine the plan, the system can adapt to a wide range of tasks without the need for extensive retraining.

This capability has the potential to greatly improve the versatility and practicality of robot assistants, allowing them to be deployed in a variety of real-world scenarios without the need for specialized data collection and fine-tuning. As the field of robotics continues to evolve, techniques like those presented in this paper will be crucial for developing robots that can seamlessly integrate into our daily lives and assist us with a diverse range of tasks.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Track2Act: Predicting Point Tracks from Internet Videos enables Diverse Zero-shot Robot Manipulation

Homanga Bharadhwaj, Roozbeh Mottaghi, Abhinav Gupta, Shubham Tulsiani

We seek to learn a generalizable goal-conditioned policy that enables zero-shot robot manipulation: interacting with unseen objects in novel scenes without test-time adaptation. While typical approaches rely on a large amount of demonstration data for such generalization, we propose an approach that leverages web videos to predict plausible interaction plans and learns a task-agnostic transformation to obtain robot actions in the real world. Our framework, Track2Act, predicts tracks of how points in an image should move in future time-steps based on a goal, and can be trained with diverse videos on the web including those of humans and robots manipulating everyday objects. We use these 2D track predictions to infer a sequence of rigid transforms of the object to be manipulated, and obtain robot end-effector poses that can be executed in an open-loop manner. We then refine this open-loop plan by predicting residual actions through a closed loop policy trained with a few embodiment-specific demonstrations. We show that this approach of combining scalably learned track prediction with a residual policy requiring minimal in-domain robot-specific data enables diverse generalizable robot manipulation, and present a wide array of real-world robot manipulation results across unseen tasks, objects, and scenes. https://homangab.github.io/track2act/

8/12/2024

Hand-Object Interaction Pretraining from Videos

Himanshu Gaurav Singh, Antonio Loquercio, Carmelo Sferrazza, Jane Wu, Haozhi Qi, Pieter Abbeel, Jitendra Malik

We present an approach to learn general robot manipulation priors from 3D hand-object interaction trajectories. We build a framework to use in-the-wild videos to generate sensorimotor robot trajectories. We do so by lifting both the human hand and the manipulated object in a shared 3D space and retargeting human motions to robot actions. Generative modeling on this data gives us a task-agnostic base policy. This policy captures a general yet flexible manipulation prior. We empirically demonstrate that finetuning this policy, with both reinforcement learning (RL) and behavior cloning (BC), enables sample-efficient adaptation to downstream tasks and simultaneously improves robustness and generalizability compared to prior approaches. Qualitative experiments are available at: https://hgaurav2k.github.io/hop/

9/14/2024

Vid2Robot: End-to-end Video-conditioned Policy Learning with Cross-Attention Transformers

Vidhi Jain, Maria Attarian, Nikhil J Joshi, Ayzaan Wahid, Danny Driess, Quan Vuong, Pannag R Sanketi, Pierre Sermanet, Stefan Welker, Christine Chan, Igor Gilitschenski, Yonatan Bisk, Debidatta Dwibedi

Large-scale multi-task robotic manipulation systems often rely on text to specify the task. In this work, we explore whether a robot can learn by observing humans. To do so, the robot must understand a person's intent and perform the inferred task despite differences in the embodiments and environments. We introduce Vid2Robot, an end-to-end video-conditioned policy that takes human videos demonstrating manipulation tasks as input and produces robot actions. Our model is trained with a large dataset of prompt video-robot trajectory pairs to learn unified representations of human and robot actions from videos. Vid2Robot uses cross-attention transformer layers between video features and the current robot state to produce the actions and perform the same task as shown in the video. We use auxiliary contrastive losses to align the prompt and robot video representations for better policies. We evaluate Vid2Robot on real-world robots and observe over 20% improvement over BC-Z when using human prompt videos. Further, we also show cross-object motion transfer ability that enables video-conditioned policies to transfer a motion observed on one object in the prompt video to another object in the robot's own environment. Videos available at https://vid2robot.github.io

8/29/2024

Any-point Trajectory Modeling for Policy Learning

Chuan Wen, Xingyu Lin, John So, Kai Chen, Qi Dou, Yang Gao, Pieter Abbeel

Learning from demonstration is a powerful method for teaching robots new skills, and having more demonstration data often improves policy learning. However, the high cost of collecting demonstration data is a significant bottleneck. Videos, as a rich data source, contain knowledge of behaviors, physics, and semantics, but extracting control-specific information from them is challenging due to the lack of action labels. In this work, we introduce a novel framework, Any-point Trajectory Modeling (ATM), that utilizes video demonstrations by pre-training a trajectory model to predict future trajectories of arbitrary points within a video frame. Once trained, these trajectories provide detailed control guidance, enabling the learning of robust visuomotor policies with minimal action-labeled data. Across over 130 language-conditioned tasks we evaluated in both simulation and the real world, ATM outperforms strong video pre-training baselines by 80% on average. Furthermore, we show effective transfer learning of manipulation skills from human videos and videos from a different robot morphology. Visualizations and code are available at: https://xingyu-lin.github.io/atm

7/15/2024