HO-Cap: A Capture System and Dataset for 3D Reconstruction and Pose Tracking of Hand-Object Interaction

Read original: arXiv:2406.06843 - Published 6/18/2024 by Jikai Wang, Qifan Zhang, Yu-Wei Chao, Bowen Wen, Xiaohu Guo, Yu Xiang


Overview

Introduces a new dataset and capture system called HO-Cap for 3D reconstruction and pose tracking of hand-object interactions
Provides detailed annotations of hand pose, object pose, and hand-object contact information
Enables research on complex hand-object manipulation tasks and their applications

Plain English Explanation

The HO-Cap system and dataset allow researchers to study how people use their hands to interact with objects in 3D. The recordings capture how the hand moves and makes contact with an object during tasks such as grasping, manipulating, and using everyday objects.

By recording detailed data on hand pose, object pose, and hand-object contacts, HO-Cap enables the development of advanced hand pose estimation and 3D object reconstruction algorithms. This can lead to improvements in areas like robotics, augmented reality, and human-computer interaction, where understanding natural hand-object interactions is crucial.

The dataset builds on previous work on capturing human-object interactions, but provides richer and more comprehensive annotations to support a wide range of research applications.

Technical Explanation

The HO-Cap system uses multiple RGB-D cameras and a HoloLens headset to record hand-object manipulation, avoiding the expensive 3D scanners and motion capture systems that earlier setups relied on. Annotations of the shape and pose of hands and objects are obtained with a semi-automatic method, which significantly reduces annotation time compared to manual labeling. The dataset includes over 35,000 annotated frames across 100 sequences, capturing a diverse range of hand-object interactions.

The annotations provided include 3D hand joint positions, object poses, and detailed hand-object contact labels. This allows researchers to develop and evaluate algorithms for tasks like 3D hand pose estimation, object pose estimation, and hand-object interaction recognition.
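To make the annotation types above concrete, here is a minimal sketch of how a per-frame record with 3D hand joints and a 6D object pose might be represented, and how contact could be derived from them by a simple distance test. The `FrameAnnotation` layout, the 4x4 pose convention, and the 5 mm threshold are all assumptions for illustration; they are not the HO-Cap file format or the authors' contact definition.

```python
from dataclasses import dataclass
from math import dist
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]


@dataclass
class FrameAnnotation:
    """Hypothetical per-frame record; NOT the actual HO-Cap layout."""
    hand_joints: List[Point]        # 3D joint positions, camera frame, metres
    object_pose: List[List[float]]  # 4x4 rigid transform, object -> camera frame


def transform_points(pose: Sequence[Sequence[float]],
                     points: Sequence[Point]) -> List[Point]:
    """Apply a 4x4 rigid transform (rotation + translation) to 3D points."""
    return [
        tuple(pose[r][0] * x + pose[r][1] * y + pose[r][2] * z + pose[r][3]
              for r in range(3))
        for x, y, z in points
    ]


def contact_joints(frame: FrameAnnotation,
                   object_points: Sequence[Point],
                   threshold: float = 0.005) -> List[int]:
    """Indices of hand joints within `threshold` metres of any object point.

    The distance threshold is an assumed, simplified contact criterion.
    """
    posed = transform_points(frame.object_pose, object_points)
    return [
        i for i, joint in enumerate(frame.hand_joints)
        if min(dist(joint, p) for p in posed) <= threshold
    ]


# Toy data: the object is translated by (0.1, 0, 0.5); one joint sits 2 mm away.
pose = [[1, 0, 0, 0.1],
        [0, 1, 0, 0.0],
        [0, 0, 1, 0.5],
        [0, 0, 0, 1.0]]
frame = FrameAnnotation(
    hand_joints=[(0.1, 0.0, 0.502), (0.3, 0.0, 0.5)],
    object_pose=pose,
)
object_points = [(0.0, 0.0, 0.0), (0.01, 0.0, 0.0)]
print(contact_joints(frame, object_points))  # [0]
```

A real pipeline would test against the object's mesh surface rather than a handful of points, but the same pose-then-distance structure applies.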

The researchers demonstrate the utility of the HO-Cap dataset through several benchmark experiments, showing its potential to advance the state of the art in hand-object interaction understanding.
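The summary does not say which metrics the benchmarks report. As an illustration only, two metrics commonly used for this kind of evaluation are mean per-joint position error (MPJPE) for hand pose and the average distance of model points (ADD) for object pose. The sketch below assumes poses are 4x4 rigid transforms and that joints and points are given in metres; the function names are hypothetical.

```python
from math import dist
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]
Pose = Sequence[Sequence[float]]  # assumed 4x4 rigid transform


def mpjpe(pred: Sequence[Point], gt: Sequence[Point]) -> float:
    """Mean Euclidean distance between predicted and ground-truth joints."""
    if len(pred) != len(gt) or not gt:
        raise ValueError("pred and gt must be non-empty and the same length")
    return sum(dist(p, g) for p, g in zip(pred, gt)) / len(gt)


def _apply(pose: Pose, points: Sequence[Point]) -> List[Point]:
    """Transform 3D points by a 4x4 rigid pose."""
    return [
        tuple(pose[r][0] * x + pose[r][1] * y + pose[r][2] * z + pose[r][3]
              for r in range(3))
        for x, y, z in points
    ]


def add_metric(pose_pred: Pose, pose_gt: Pose,
               model_points: Sequence[Point]) -> float:
    """ADD: mean distance between model points under the two poses."""
    if not model_points:
        raise ValueError("model_points must be non-empty")
    pred_pts = _apply(pose_pred, model_points)
    gt_pts = _apply(pose_gt, model_points)
    return sum(dist(p, g) for p, g in zip(pred_pts, gt_pts)) / len(model_points)


# Toy example: joint errors of 1 cm and 3 cm average to about 2 cm.
gt_joints = [(0.0, 0.0, 0.0), (0.0, 0.0, 1.0)]
pred_joints = [(0.0, 0.0, 0.01), (0.0, 0.03, 1.0)]
hand_error = mpjpe(pred_joints, gt_joints)

# Toy example: a predicted pose offset by 1 cm along x gives an ADD of about 1 cm.
identity = [[1, 0, 0, 0.0], [0, 1, 0, 0.0], [0, 0, 1, 0.0], [0, 0, 0, 1.0]]
shifted = [[1, 0, 0, 0.01], [0, 1, 0, 0.0], [0, 0, 1, 0.0], [0, 0, 0, 1.0]]
object_error = add_metric(shifted, identity, [(0.0, 0.0, 0.0), (0.1, 0.2, 0.3)])
print(hand_error, object_error)
```

Both metrics are lower-is-better and are usually reported in millimetres or centimetres. Symmetric objects typically need a symmetry-aware variant of ADD, which this sketch does not cover.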

Critical Analysis

The HO-Cap dataset and capture system represent an important step forward in the study of hand-object interactions. By providing rich, comprehensive annotations, the dataset enables research on complex manipulation tasks that was not previously possible.

However, the dataset is limited to a relatively small number of objects and hand-object interaction scenarios. Expanding the diversity of the dataset, both in terms of objects and task types, could further broaden its applicability. Additionally, although the system avoids marker-based motion capture, it still depends on a multi-camera RGB-D rig and a HoloLens headset, which may limit how easily it can be deployed outside a controlled setting.

Future work could explore lighter-weight sensing setups to capture hand-object interactions in natural environments. Integrating the HO-Cap dataset with other large-scale datasets on human-object interaction could also lead to more comprehensive and robust algorithms.

Conclusion

The HO-Cap dataset and capture system represent a significant contribution to the field of hand-object interaction understanding. By providing rich, annotated data on hand pose, object pose, and hand-object contacts, the system enables the development of advanced algorithms for a wide range of applications, from robotics to augmented reality.

While the current dataset has some limitations, the core ideas and methodologies presented in this work lay the groundwork for future progress in this important area of research. As the field continues to evolve, the HO-Cap system and similar approaches will play a crucial role in advancing our understanding of natural human-object interactions.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!


Related Papers

HO-Cap: A Capture System and Dataset for 3D Reconstruction and Pose Tracking of Hand-Object Interaction

Jikai Wang, Qifan Zhang, Yu-Wei Chao, Bowen Wen, Xiaohu Guo, Yu Xiang

We introduce a data capture system and a new dataset named HO-Cap that can be used to study 3D reconstruction and pose tracking of hands and objects in videos. The capture system uses multiple RGB-D cameras and a HoloLens headset for data collection, avoiding the use of expensive 3D scanners or mocap systems. We propose a semi-automatic method to obtain annotations of shape and pose of hands and objects in the collected videos, which significantly reduces the required annotation time compared to manual labeling. With this system, we captured a video dataset of humans using objects to perform different tasks, as well as simple pick-and-place and handover of an object from one hand to the other, which can be used as human demonstrations for embodied AI and robot manipulation research. Our data capture setup and annotation framework can be used by the community to reconstruct 3D shapes of objects and human hands and track their poses in videos.

6/18/2024

RoCap: A Robotic Data Collection Pipeline for the Pose Estimation of Appearance-Changing Objects

Jiahao Nick Li, Toby Chong, Zhongyi Zhou, Hironori Yoshida, Koji Yatani, Xiang 'Anthony' Chen, Takeo Igarashi

Object pose estimation plays a vital role in mixed-reality interactions when users manipulate tangible objects as controllers. Traditional vision-based object pose estimation methods leverage 3D reconstruction to synthesize training data. However, these methods are designed for static objects with diffuse colors and do not work well for objects that change their appearance during manipulation, such as deformable objects like plush toys, transparent objects like chemical flasks, reflective objects like metal pitchers, and articulated objects like scissors. To address this limitation, we propose Rocap, a robotic pipeline that emulates human manipulation of target objects while generating data labeled with ground truth pose information. The user first gives the target object to a robotic arm, and the system captures many pictures of the object in various 6D configurations. The system trains a model by using captured images and their ground truth pose information automatically calculated from the joint angles of the robotic arm. We showcase pose estimation for appearance-changing objects by training simple deep-learning models using the collected data and comparing the results with a model trained with synthetic data based on 3D reconstruction via quantitative and qualitative evaluation. The findings underscore the promising capabilities of Rocap.

7/12/2024


Dense Hand-Object(HO) GraspNet with Full Grasping Taxonomy and Dynamics

Woojin Cho, Jihyun Lee, Minjae Yi, Minje Kim, Taeyun Woo, Donghwan Kim, Taewook Ha, Hyokeun Lee, Je-Hwan Ryu, Woontack Woo, Tae-Kyun Kim

Existing datasets for 3D hand-object interaction are limited either in the data cardinality, data variations in interaction scenarios, or the quality of annotations. In this work, we present a comprehensive new training dataset for hand-object interaction called HOGraspNet. It is the only real dataset that captures full grasp taxonomies, providing grasp annotation and wide intraclass variations. Using grasp taxonomies as atomic actions, their space and time combinatorial can represent complex hand activities around objects. We select 22 rigid objects from the YCB dataset and 8 other compound objects using shape and size taxonomies, ensuring coverage of all hand grasp configurations. The dataset includes diverse hand shapes from 99 participants aged 10 to 74, continuous video frames, and a 1.5M RGB-Depth of sparse frames with annotations. It offers labels for 3D hand and object meshes, 3D keypoints, contact maps, and grasp labels. Accurate hand and object 3D meshes are obtained by fitting the hand parametric model (MANO) and the hand implicit function (HALO) to multi-view RGBD frames, with the MoCap system only for objects. Note that HALO fitting does not require any parameter tuning, enabling scalability to the dataset's size with comparable accuracy to MANO. We evaluate HOGraspNet on relevant tasks: grasp classification and 3D hand pose estimation. The result shows performance variations based on grasp type and object class, indicating the potential importance of the interaction space captured by our dataset. The provided data aims at learning universal shape priors or foundation models for 3D hand-object interaction. Our dataset and code are available at https://hograspnet2024.github.io/.

9/9/2024

Gaze-guided Hand-Object Interaction Synthesis: Dataset and Method

Jie Tian, Ran Ji, Lingxiao Yang, Yuexin Ma, Lan Xu, Jingyi Yu, Ye Shi, Jingya Wang

Gaze plays a crucial role in revealing human attention and intention, particularly in hand-object interaction scenarios, where it guides and synchronizes complex tasks that require precise coordination between the brain, hand, and object. Motivated by this, we introduce a novel task: Gaze-Guided Hand-Object Interaction Synthesis, with potential applications in augmented reality, virtual reality, and assistive technologies. To support this task, we present GazeHOI, the first dataset to capture simultaneous 3D modeling of gaze, hand, and object interactions. This task poses significant challenges due to the inherent sparsity and noise in gaze data, as well as the need for high consistency and physical plausibility in generating hand and object motions. To tackle these issues, we propose a stacked gaze-guided hand-object interaction diffusion model, named GHO-Diffusion. The stacked design effectively reduces the complexity of motion generation. We also introduce HOI-Manifold Guidance during the sampling stage of GHO-Diffusion, enabling fine-grained control over generated motions while maintaining the data manifold. Additionally, we propose a spatial-temporal gaze feature encoding for the diffusion condition and select diffusion results based on consistency scores between gaze-contact maps and gaze-interaction trajectories. Extensive experiments highlight the effectiveness of our method and the unique contributions of our dataset.

8/23/2024