InteRACT: Transformer Models for Human Intent Prediction Conditioned on Robot Actions

Read original: arXiv:2311.12943 - Published 6/4/2024 by Kushal Kedia, Atiksh Bhardwaj, Prithwish Dan, Sanjiban Choudhury

Overview

In collaborative human-robot manipulation, robots must predict human intents and adapt their actions to smoothly execute tasks.
However, the human's intent depends on the robot's actions, creating a chicken-or-egg problem.
Prior methods trained marginal intent prediction models independent of robot actions due to a lack of paired human-robot interaction datasets.
This paper proposes a novel architecture, InteRACT, that leverages large-scale human-human interaction data to improve human intent prediction in human-robot collaboration.

Plain English Explanation

When a human and a robot work together on a task, the robot needs to predict what the human is trying to do so that it can adjust its own actions. This helps the collaboration go smoothly. However, the human's intentions themselves depend on what the robot does. The robot therefore has to predict an intent that its own actions will change.

Previous methods sidestepped this by training the robot to predict human intent without considering the robot's own actions, because few datasets record how humans and robots interact. The research team's idea was that the way a human acts when working with another human resembles the way a human acts when working with a robot. So they developed a system called InteRACT that first learns from human-human interactions and is then fine-tuned on human-robot interactions.

This allows the robot to better understand the human's intentions and adapt its own actions accordingly, creating a smoother collaborative experience. The researchers also developed new techniques to collect human-robot interaction data, which they are sharing publicly so others can build on this work.

Technical Explanation

The key insight of this paper is that there is a correspondence between human and robot actions that enables transfer learning from human-human to human-robot data. The proposed InteRACT architecture first pre-trains a conditional intent prediction model on large-scale human-human interaction datasets, and then fine-tunes this model on a smaller human-robot dataset.
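To make the correspondence idea concrete, here is a minimal Python sketch of how a human-human recording might be turned into a training example for a conditional model. It assumes, purely for illustration, that the partner's wrist path can stand in for a robot end-effector plan; this summary does not spell out the actual mapping. The joint name `right_wrist`, the `ConditionalExample` container, and the `to_conditional_example` helper are hypothetical and not taken from the paper's code.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

Point = Tuple[float, float, float]
Frame = Dict[str, Point]  # joint name -> 3D position


@dataclass
class ConditionalExample:
    human_history: List[Frame]   # observed past of the human being predicted
    partner_action: List[Point]  # partner's future path, used as the condition
    human_future: List[Frame]    # prediction target


def to_conditional_example(
    target_human: List[Frame],
    partner_human: List[Frame],
    history_len: int,
    wrist_joint: str = "right_wrist",  # hypothetical joint name
) -> ConditionalExample:
    """Treat the partner's wrist path as a stand-in for a robot plan (assumption)."""
    if len(target_human) != len(partner_human):
        raise ValueError("both recordings must cover the same frames")
    if not 0 < history_len < len(target_human):
        raise ValueError("history_len must leave at least one future frame")
    return ConditionalExample(
        human_history=target_human[:history_len],
        partner_action=[f[wrist_joint] for f in partner_human[history_len:]],
        human_future=target_human[history_len:],
    )


# Tiny synthetic recording: two people, four frames each.
person_a = [{"right_wrist": (0.1 * t, 0.0, 1.0)} for t in range(4)]
person_b = [{"right_wrist": (1.0 - 0.1 * t, 0.0, 1.0)} for t in range(4)]
example = to_conditional_example(person_a, person_b, history_len=2)
```

At fine-tuning time the same container could hold a real robot end-effector path in `partner_action`, which is what would let a single model consume both kinds of data.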

This is in contrast to prior methods that trained marginal intent prediction models independent of robot actions. The challenge with these previous approaches was the lack of paired human-robot interaction datasets required for training conditional models.
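The difference between a marginal and a conditional model can be illustrated with a toy count-based estimator. The sketch below is not the paper's method: it assumes intents are discrete labels (the action and intent names are made up) and the episodes are synthetic, chosen so that the human tends to reach away from the side the robot reaches toward.

```python
from collections import Counter, defaultdict


def fit_marginal(episodes):
    """Estimate P(intent), ignoring the robot's action."""
    counts = Counter(intent for _, intent in episodes)
    total = sum(counts.values())
    return {intent: n / total for intent, n in counts.items()}


def fit_conditional(episodes):
    """Estimate P(intent | robot_action) from paired observations."""
    counts = defaultdict(Counter)
    for robot_action, intent in episodes:
        counts[robot_action][intent] += 1
    return {
        action: {intent: n / sum(c.values()) for intent, n in c.items()}
        for action, c in counts.items()
    }


# Synthetic (robot_action, human_intent) pairs; all labels are hypothetical.
episodes = (
    [("reach_left", "reach_right")] * 4
    + [("reach_right", "reach_left")] * 4
    + [("reach_left", "reach_left")]
    + [("reach_right", "reach_right")]
)

marginal = fit_marginal(episodes)
conditional = fit_conditional(episodes)
```

On this data the marginal estimate splits evenly between the two intents, while the conditional estimate is much more confident once the robot's action is known. That extra information is what a marginal model discards, and estimating it is what requires paired data.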

InteRACT leverages the similarities between human-human and human-robot interactions to enable this transfer learning. The pre-training on human-human data allows the model to learn general patterns of human intent and how it relates to actions. This knowledge can then be adapted to the human-robot setting through fine-tuning on the smaller dataset.
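The pre-train then fine-tune recipe can be sketched with a deliberately tiny stand-in model. The example below uses a one-dimensional linear model trained by gradient descent rather than a transformer, and both datasets are synthetic: a large "human-human" set and a small "human-robot" set whose targets are shifted slightly. All names, sizes, and hyperparameters are illustrative assumptions.

```python
from typing import List, Tuple

Params = Tuple[float, float]         # (weight, bias) of a 1-D linear model
Dataset = List[Tuple[float, float]]  # (input, target) pairs


def mse(params: Params, data: Dataset) -> float:
    """Mean squared error of the linear model on a dataset."""
    w, b = params
    return sum((w * x + b - y) ** 2 for x, y in data) / len(data)


def train(params: Params, data: Dataset, lr: float, epochs: int) -> Params:
    """Full-batch gradient descent on the mean squared error."""
    w, b = params
    n = len(data)
    for _ in range(epochs):
        grad_w = sum(2 * (w * x + b - y) * x for x, y in data) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in data) / n
        w, b = w - lr * grad_w, b - lr * grad_b
    return w, b


# Large synthetic "human-human" set and a small, slightly shifted "human-robot" set.
human_human: Dataset = [(i / 10, 2.0 * (i / 10)) for i in range(-20, 21)]
human_robot: Dataset = [(x, 2.0 * x + 0.3) for x in (-1.0, 0.0, 1.0)]

pretrained = train((0.0, 0.0), human_human, lr=0.1, epochs=200)
finetuned = train(pretrained, human_robot, lr=0.1, epochs=20)
from_scratch = train((0.0, 0.0), human_robot, lr=0.1, epochs=20)
```

With the same short fine-tuning budget, the model initialized from pre-training ends with lower error on the small set than the one trained from scratch, because pre-training has already recovered the shared structure and fine-tuning only needs to correct the shift.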

The researchers evaluate InteRACT on real-world collaborative human-robot manipulation tasks and show that their conditional model outperforms various marginal baselines. They also introduce new techniques to tele-operate a 7-DoF robot arm and collect a diverse range of human-robot collaborative manipulation data, which they open-source to facilitate further research in this area.

Critical Analysis

The paper presents a promising approach to address the chicken-or-egg problem in collaborative human-robot manipulation. By exploiting the correspondence between human and robot actions, the InteRACT architecture is able to leverage large-scale human-human interaction data to improve human intent prediction in human-robot collaboration.

However, the paper does not discuss the potential limitations of this transfer learning approach. For example, it is unclear how well the model would generalize to settings where the human's and the robot's actions differ significantly from those in the pre-training data. Additionally, the open-sourced human-robot dataset may not capture the full range of real-world collaborative scenarios, which could limit the model's performance.

Further research could explore ways to address these limitations, such as techniques to better align the human and robot action spaces or methods to augment the dataset with diverse human-robot interaction scenarios.

Conclusion

This research presents a novel approach to address the challenge of predicting human intent in collaborative human-robot manipulation. By leveraging large-scale human-human interaction data through transfer learning, the InteRACT architecture is able to improve upon previous methods that trained intent prediction models independent of robot actions.

The open-sourcing of the human-robot dataset and the promising results on real-world tasks suggest that this work could have a significant impact on advancing the field of human-robot collaboration. As robots become more integrated into our daily lives, the ability to smoothly interact with humans will be crucial for the successful deployment of these technologies.

Related Papers

InteRACT: Transformer Models for Human Intent Prediction Conditioned on Robot Actions

Kushal Kedia, Atiksh Bhardwaj, Prithwish Dan, Sanjiban Choudhury

In collaborative human-robot manipulation, a robot must predict human intents and adapt its actions accordingly to smoothly execute tasks. However, the human's intent in turn depends on actions the robot takes, creating a chicken-or-egg problem. Prior methods ignore such inter-dependency and instead train marginal intent prediction models independent of robot actions. This is because training conditional models is hard given a lack of paired human-robot interaction datasets. Can we instead leverage large-scale human-human interaction data that is more easily accessible? Our key insight is to exploit a correspondence between human and robot actions that enables transfer learning from human-human to human-robot data. We propose a novel architecture, InteRACT, that pre-trains a conditional intent prediction model on large human-human datasets and fine-tunes on a small human-robot dataset. We evaluate on a set of real-world collaborative human-robot manipulation tasks and show that our conditional model improves over various marginal baselines. We also introduce new techniques to tele-operate a 7-DoF robot arm and collect a diverse range of human-robot collaborative manipulation data, which we open-source.

6/4/2024

Towards Proactive Safe Human-Robot Collaborations via Data-Efficient Conditional Behavior Prediction

Ravi Pandya, Zhuoyuan Wang, Yorie Nakahira, Changliu Liu

We focus on the problem of how we can enable a robot to collaborate seamlessly with a human partner, specifically in scenarios where preexisting data is sparse. Much prior work in human-robot collaboration uses observational models of humans (i.e. models that treat the robot purely as an observer) to choose the robot's behavior, but such models do not account for the influence the robot has on the human's actions, which may lead to inefficient interactions. We instead formulate the problem of optimally choosing a collaborative robot's behavior based on a conditional model of the human that depends on the robot's future behavior. First, we propose a novel model-based formulation of conditional behavior prediction that allows the robot to infer the human's intentions based on its future plan in data-sparse environments. We then show how to utilize a conditional model for proactive goal selection and safe trajectory generation around human collaborators. Finally, we use our proposed proactive controller in a collaborative task with real users to show that it can improve users' interactions with a robot collaborator quantitatively and qualitatively.

7/2/2024

Vid2Robot: End-to-end Video-conditioned Policy Learning with Cross-Attention Transformers

Vidhi Jain, Maria Attarian, Nikhil J Joshi, Ayzaan Wahid, Danny Driess, Quan Vuong, Pannag R Sanketi, Pierre Sermanet, Stefan Welker, Christine Chan, Igor Gilitschenski, Yonatan Bisk, Debidatta Dwibedi

Large-scale multi-task robotic manipulation systems often rely on text to specify the task. In this work, we explore whether a robot can learn by observing humans. To do so, the robot must understand a person's intent and perform the inferred task despite differences in the embodiments and environments. We introduce Vid2Robot, an end-to-end video-conditioned policy that takes human videos demonstrating manipulation tasks as input and produces robot actions. Our model is trained with a large dataset of prompt video-robot trajectory pairs to learn unified representations of human and robot actions from videos. Vid2Robot uses cross-attention transformer layers between video features and the current robot state to produce the actions and perform the same task as shown in the video. We use auxiliary contrastive losses to align the prompt and robot video representations for better policies. We evaluate Vid2Robot on real-world robots and observe over 20% improvement over BC-Z when using human prompt videos. Further, we also show cross-object motion transfer ability that enables video-conditioned policies to transfer a motion observed on one object in the prompt video to another object in the robot's own environment. Videos available at https://vid2robot.github.io

8/29/2024

HOI4ABOT: Human-Object Interaction Anticipation for Human Intention Reading Collaborative roBOTs

Esteve Valls Mascaro, Daniel Sliwowski, Dongheui Lee

Robots are becoming increasingly integrated into our lives, assisting us in various tasks. To ensure effective collaboration between humans and robots, it is essential that they understand our intentions and anticipate our actions. In this paper, we propose a Human-Object Interaction (HOI) anticipation framework for collaborative robots. We propose an efficient and robust transformer-based model to detect and anticipate HOIs from videos. This enhanced anticipation empowers robots to proactively assist humans, resulting in more efficient and intuitive collaborations. Our model outperforms state-of-the-art results in HOI detection and anticipation in VidHOI dataset with an increase of 1.76% and 1.04% in mAP respectively while being 15.4 times faster. We showcase the effectiveness of our approach through experimental results in a real robot, demonstrating that the robot's ability to anticipate HOIs is key for better Human-Robot Interaction. More information can be found on our project webpage: https://evm7.github.io/HOI4ABOT_page/

4/9/2024