ImitationNet: Unsupervised Human-to-Robot Motion Retargeting via Shared Latent Space

2309.05310

Published 4/9/2024 by Yashuai Yan, Esteve Valls Mascaro, Dongheui Lee

🤷

Abstract

This paper introduces a novel deep-learning approach for human-to-robot motion retargeting, enabling robots to mimic human poses accurately. Contrary to prior deep-learning-based works, our method does not require paired human-to-robot data, which facilitates its translation to new robots. First, we construct a shared latent space between humans and robots via adaptive contrastive learning that takes advantage of a proposed cross-domain similarity metric between the human and robot poses. Additionally, we propose a consistency term to build a common latent space that captures the similarity of the poses with precision while allowing direct robot motion control from the latent space. For instance, we can generate in-between motion through simple linear interpolation between two projected human poses. We conduct a comprehensive evaluation of robot control from diverse modalities (i.e., texts, RGB videos, and key poses), which facilitates robot control for non-expert users. Our model outperforms existing works regarding human-to-robot retargeting in terms of efficiency and precision. Finally, we implemented our method in a real robot with self-collision avoidance through a whole-body controller to showcase the effectiveness of our approach. More information on our website https://evm7.github.io/UnsH2R/

Create account to get full access

Overview

Introduces a novel deep learning approach for human-to-robot motion retargeting
Enables robots to accurately mimic human poses without requiring paired human-robot data
Constructs a shared latent space between humans and robots using adaptive contrastive learning and a cross-domain similarity metric
Proposes a consistency term to build a common latent space that captures pose similarity with precision, allowing direct robot motion control
Enables robot control from diverse modalities (text, RGB video, key poses) for non-expert users
Outperforms existing works in human-to-robot retargeting efficiency and precision
Implemented on a real robot with self-collision avoidance

Plain English Explanation

This paper presents a new deep learning technique that allows robots to closely imitate human movements and poses. Unlike previous deep learning-based approaches, this method does not require having pairs of human and robot data to train on. Instead, it creates a shared "latent space" between human and robot poses, using a technique called adaptive contrastive learning and a new way of measuring the similarity between human and robot poses.

This shared latent space allows the system to generate smooth, in-between motions by simply interpolating between human poses. The paper also introduces a "consistency term" that helps the system build this common latent space in a way that captures the precise details of the poses, enabling direct control of the robot from this latent space.

Another key aspect is that the system can take input from various sources - text descriptions, RGB videos, or just key poses - and use that to control the robot. This makes it accessible for non-experts to control the robot. The authors show that their approach outperforms existing methods in terms of how efficiently and accurately it can retarget human motion to the robot.

Finally, the authors implemented their system on a real robot, with safeguards to prevent the robot from colliding with itself, demonstrating the practicality of their approach.

Technical Explanation

The paper introduces a novel deep learning-based approach for human-to-robot motion retargeting that does not require paired human-robot data, in contrast to previous deep learning methods. The key technical components are:

Shared Latent Space: The system constructs a shared latent space between human and robot poses using adaptive contrastive learning and a cross-domain similarity metric between human and robot poses. This allows the model to capture the correspondence between human and robot poses.
Consistency Term: The authors propose a consistency term that helps build a common latent space that precisely captures the similarity of the poses, enabling direct robot motion control from the latent space. This allows generating smooth, in-between motions by interpolating between human poses in the latent space.
Multi-Modal Control: The system can take input from diverse modalities (text, RGB video, key poses) and use that to control the robot, making it accessible for non-expert users.

The authors evaluate their approach comprehensively, showing that it outperforms existing human-to-robot retargeting methods in terms of efficiency and precision. They also implement the system on a real robot, incorporating self-collision avoidance through a whole-body controller.

Critical Analysis

The paper presents a promising approach for human-to-robot motion retargeting that addresses some key limitations of prior deep learning-based methods. The ability to work without paired human-robot data and the multi-modal control capabilities are particularly noteworthy.

However, the paper does not discuss the computational complexity or inference speed of the proposed approach, which would be important considerations for real-world robot applications. Additionally, the evaluation is mostly conducted in simulation, and more real-world testing would be needed to fully assess the method's practical viability.

The paper also does not address potential issues with the robustness of the system, such as how it might handle noisy or incomplete input data, or its ability to generalize to a wider range of human and robot morphologies. Exploring these aspects could be an interesting direction for future research.

Overall, the work represents a valuable contribution to the field of human-robot interaction, but there are still opportunities to further refine and validate the approach through additional experimentation and analysis.

Conclusion

This paper introduces a novel deep learning-based technique for enabling robots to accurately mimic human poses and motions, without requiring paired human-robot training data. The key innovations include the construction of a shared latent space between human and robot poses, and the use of a consistency term to precisely capture pose similarities for direct robot control.

The multi-modal control capabilities, which allow the system to take input from text, video, or key poses, make the approach accessible for non-expert users. The authors demonstrate that their method outperforms existing human-to-robot retargeting techniques, and they have implemented it on a real robot with self-collision avoidance.

While there are still some open questions around computational efficiency, robustness, and real-world generalization, this work represents an important step forward in enhancing the ability of robots to naturally and seamlessly interact with humans. Further research building on these insights could have significant implications for a wide range of human-robot collaboration scenarios.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Unsupervised Neural Motion Retargeting for Humanoid Teleoperation

Satoshi Yagi, Mitsunori Tada, Eiji Uchibe, Suguru Kanoga, Takamitsu Matsubara, Jun Morimoto

This study proposes an approach to human-to-humanoid teleoperation using GAN-based online motion retargeting, which obviates the need for the construction of pairwise datasets to identify the relationship between the human and the humanoid kinematics. Consequently, it can be anticipated that our proposed teleoperation system will reduce the complexity and setup requirements typically associated with humanoid controllers, thereby facilitating the development of more accessible and intuitive teleoperation systems for users without robotics knowledge. The experiments demonstrated the efficacy of the proposed method in retargeting a range of upper-body human motions to humanoid, including a body jab motion and a basketball shoot motion. Moreover, the human-in-the-loop teleoperation performance was evaluated by measuring the end-effector position errors between the human and the retargeted humanoid motions. The results demonstrated that the error was comparable to those of conventional motion retargeting methods that require pairwise motion datasets. Finally, a box pick-and-place task was conducted to demonstrate the usability of the developed humanoid teleoperation system.

6/4/2024

cs.RO

Cross-Embodiment Robot Manipulation Skill Transfer using Latent Space Alignment

Tianyu Wang, Dwait Bhatt, Xiaolong Wang, Nikolay Atanasov

This paper focuses on transferring control policies between robot manipulators with different morphology. While reinforcement learning (RL) methods have shown successful results in robot manipulation tasks, transferring a trained policy from simulation to a real robot or deploying it on a robot with different states, actions, or kinematics is challenging. To achieve cross-embodiment policy transfer, our key insight is to project the state and action spaces of the source and target robots to a common latent space representation. We first introduce encoders and decoders to associate the states and actions of the source robot with a latent space. The encoders, decoders, and a latent space control policy are trained simultaneously using loss functions measuring task performance, latent dynamics consistency, and encoder-decoder ability to reconstruct the original states and actions. To transfer the learned control policy, we only need to train target encoders and decoders that align a new target domain to the latent space. We use generative adversarial training with cycle consistency and latent dynamics losses without access to the task reward or reward tuning in the target domain. We demonstrate sim-to-sim and sim-to-real manipulation policy transfer with source and target robots of different states, actions, and embodiments. The source code is available at url{https://github.com/ExistentialRobotics/cross_embodiment_transfer}.

6/5/2024

cs.RO cs.AI

HumanPlus: Humanoid Shadowing and Imitation from Humans

Zipeng Fu, Qingqing Zhao, Qi Wu, Gordon Wetzstein, Chelsea Finn

One of the key arguments for building robots that have similar form factors to human beings is that we can leverage the massive human data for training. Yet, doing so has remained challenging in practice due to the complexities in humanoid perception and control, lingering physical gaps between humanoids and humans in morphologies and actuation, and lack of a data pipeline for humanoids to learn autonomous skills from egocentric vision. In this paper, we introduce a full-stack system for humanoids to learn motion and autonomous skills from human data. We first train a low-level policy in simulation via reinforcement learning using existing 40-hour human motion datasets. This policy transfers to the real world and allows humanoid robots to follow human body and hand motion in real time using only a RGB camera, i.e. shadowing. Through shadowing, human operators can teleoperate humanoids to collect whole-body data for learning different tasks in the real world. Using the data collected, we then perform supervised behavior cloning to train skill policies using egocentric vision, allowing humanoids to complete different tasks autonomously by imitating human skills. We demonstrate the system on our customized 33-DoF 180cm humanoid, autonomously completing tasks such as wearing a shoe to stand up and walk, unloading objects from warehouse racks, folding a sweatshirt, rearranging objects, typing, and greeting another robot with 60-100% success rates using up to 40 demonstrations. Project website: https://humanoid-ai.github.io/

6/18/2024

cs.RO cs.AI cs.CV cs.LG cs.SY eess.SY

🏅

I-CTRL: Imitation to Control Humanoid Robots Through Constrained Reinforcement Learning

Yashuai Yan, Esteve Valls Mascaro, Tobias Egle, Dongheui Lee

This paper addresses the critical need for refining robot motions that, despite achieving a high visual similarity through human-to-humanoid retargeting methods, fall short of practical execution in the physical realm. Existing techniques in the graphics community often prioritize visual fidelity over physics-based feasibility, posing a significant challenge for deploying bipedal systems in practical applications. Our research introduces a constrained reinforcement learning algorithm to produce physics-based high-quality motion imitation onto legged humanoid robots that enhance motion resemblance while successfully following the reference human trajectory. We name our framework: I-CTRL. By reformulating the motion imitation problem as a constrained refinement over non-physics-based retargeted motions, our framework excels in motion imitation with simple and unique rewards that generalize across four robots. Moreover, our framework can follow large-scale motion datasets with a unique RL agent. The proposed approach signifies a crucial step forward in advancing the control of bipedal robots, emphasizing the importance of aligning visual and physical realism for successful motion imitation.

5/15/2024

cs.RO cs.AI