HIMO: A New Benchmark for Full-Body Human Interacting with Multiple Objects

Read original: arXiv:2407.12371 - Published 9/12/2024 by Xintao Lv, Liang Xu, Yichao Yan, Xin Jin, Congsheng Xu, Shuwen Wu, Yifan Liu, Lincheng Li, Mengxiao Bi, Wenjun Zeng, Xiaokang Yang

Overview

Introduces a new benchmark called HIMO for evaluating full-body human interaction with multiple objects
Provides a large-scale dataset and evaluation metrics for this task
Proposes a novel text-driven human-object interaction synthesis model that generates realistic human-object interaction sequences

Plain English Explanation

The paper presents a new benchmark called HIMO (Human Interacting with Multiple Objects) for evaluating how people interact with multiple objects in a scene. This is an important and challenging task in computer vision and robotics, as it requires understanding the complex relationships between people, objects, and their actions over time.

The researchers created a large dataset of real-world human-object interactions, along with evaluation metrics to measure the performance of models on this task. They also developed a new text-driven human-object interaction synthesis model that can generate realistic sequences of people interacting with multiple objects, based on textual descriptions.

By providing this benchmark and model, the researchers aim to advance the field of computer vision and enable the development of more intelligent systems that can better understand and interact with the physical world around them. This could have applications in areas like robotics, augmented reality, and human-computer interaction.

Technical Explanation

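The paper introduces the HIMO benchmark, a large-scale motion-capture (MoCap) dataset of full-body humans interacting with multiple objects. According to the paper's abstract, the dataset contains 3.3K 4D human-object interaction (HOI) sequences and 4.08M 3D HOI frames, annotated with detailed textual descriptions and temporal segments. The sequences show people performing various activities with multiple objects, such as cooking, cleaning, and assembling furniture.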

To evaluate model performance on this task, the researchers propose several metrics, including:

Temporal segmentation accuracy: how well the model can identify the start and end of each individual interaction
Pose estimation accuracy: how well the model can estimate the 3D pose of the person during the interaction
Object detection and classification accuracy: how well the model can detect and recognize the objects being used
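As an illustration of how the first two quantities are commonly scored, here is a minimal Python sketch. The summary does not give the benchmark's exact formulas, so the definitions below are assumptions: temporal intersection-over-union (IoU) between a predicted and a ground-truth segment, and mean per-joint position error (MPJPE) for 3D pose. The function names and the toy inputs are hypothetical.

```python
import math

def temporal_iou(pred, gt):
    """IoU of two frame intervals given as (start, end), end exclusive."""
    inter = max(0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

def mpjpe(pred_joints, gt_joints):
    """Mean Euclidean distance between matching 3D joints."""
    if len(pred_joints) != len(gt_joints) or not gt_joints:
        raise ValueError("joint lists must be non-empty and the same length")
    dists = [math.dist(p, g) for p, g in zip(pred_joints, gt_joints)]
    return sum(dists) / len(dists)

# Toy usage: a predicted segment shifted by 5 frames, and a 2-joint pose
# where only the first joint is off (by 1 unit along z).
seg_score = temporal_iou((0, 10), (5, 15))
pose_error = mpjpe([(0, 0, 0), (1, 0, 0)], [(0, 0, 1), (1, 0, 0)])
```

A higher temporal IoU means the predicted start and end of an interaction line up better with the annotation, while a lower MPJPE means the estimated pose is closer to the captured one.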

The paper also presents a novel text-driven human-object interaction synthesis model that generates realistic sequences of people interacting with multiple objects based on textual descriptions. According to the abstract, the model is a dual-branch conditional diffusion model with a mutual interaction module, conditioned either on a whole text prompt or on segmented text prompts that give fine-grained control over the timeline. An auto-regressive generation pipeline is used to obtain smooth transitions between HOI segments.
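The HIMO abstract describes an auto-regressive pipeline that produces smooth transitions between HOI segments. The sketch below illustrates only that general idea, under stated assumptions: each segment is generated from its own text prompt plus the last few frames of the motion produced so far. `generate_autoregressive`, `toy_segment`, `context_len`, and the one-number "frame" are all hypothetical stand-ins; the toy generator is simple interpolation, not the paper's diffusion model.

```python
from typing import Callable, List, Sequence

Frame = List[float]  # toy stand-in for one frame of human + object state

def generate_autoregressive(
    prompts: Sequence[str],
    generate_segment: Callable[[str, List[Frame]], List[Frame]],
    context_len: int = 2,
) -> List[Frame]:
    """Generate segments one by one, conditioning each on the previous tail."""
    motion: List[Frame] = []
    for prompt in prompts:
        context = motion[-context_len:]  # empty for the first segment
        motion.extend(generate_segment(prompt, context))
    return motion

def toy_segment(prompt: str, context: List[Frame], length: int = 4) -> List[Frame]:
    """Hypothetical generator: move from the last context frame toward a
    prompt-dependent target (here simply the prompt's character count)."""
    start = context[-1][0] if context else 0.0
    target = float(len(prompt))
    return [[start + (target - start) * (i + 1) / length] for i in range(length)]

# Two segmented prompts; the second segment starts where the first ended.
motion = generate_autoregressive(["ab", "abcdef"], toy_segment)
```

Because each segment sees the tail of the previous one, the stitched sequence has no jump at the segment boundary, which is the property the auto-regressive design is meant to provide.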

The authors evaluate their synthesis model on the HIMO benchmark and show that it outperforms several baseline approaches. Per the abstract, the experiments also demonstrate generalization to unseen object geometries and temporal compositions. Related lines of work include open-world human-object interaction detection, cooperative human-object interaction, and fine-grained semantic-aligned human-object interaction.

Critical Analysis

The HIMO benchmark and synthesis model presented in this paper represent an important advancement in the field of human-object interaction understanding. By focusing on the more complex and realistic scenario of multiple objects, the researchers are pushing the boundaries of what current computer vision systems can do.

However, the paper does acknowledge some limitations of the work. For example, the dataset was captured in a studio setting, which may not fully capture the complexity and variability of real-world human-object interactions. Additionally, the synthesis model is currently limited to generating sequences based on text inputs, and may not be able to handle more open-ended or spontaneous interactions.

Further research is needed to address these limitations and extend the capabilities of the system. This could involve collecting more diverse human-object interaction data, developing more advanced synthesis models, and exploring how these techniques can be applied to real-world applications.

Conclusion

The HIMO benchmark and text-driven human-object interaction synthesis model presented in this paper represent a significant step forward in the field of computer vision and understanding human-object interactions. By providing a large-scale dataset and robust evaluation metrics, the researchers have created a valuable resource for the research community.

The synthesis model's ability to generate realistic human-object interaction sequences based on textual descriptions also has exciting potential applications, such as in robotics, augmented reality, and human-computer interaction. As the field continues to evolve, we can expect to see more advanced systems that can better understand and interact with the complex physical world around us.

Related Papers

HIMO: A New Benchmark for Full-Body Human Interacting with Multiple Objects

Xintao Lv, Liang Xu, Yichao Yan, Xin Jin, Congsheng Xu, Shuwen Wu, Yifan Liu, Lincheng Li, Mengxiao Bi, Wenjun Zeng, Xiaokang Yang

Generating human-object interactions (HOIs) is critical with the tremendous advances of digital avatars. Existing datasets are typically limited to humans interacting with a single object while neglecting the ubiquitous manipulation of multiple objects. Thus, we propose HIMO, a large-scale MoCap dataset of full-body human interacting with multiple objects, containing 3.3K 4D HOI sequences and 4.08M 3D HOI frames. We also annotate HIMO with detailed textual descriptions and temporal segments, benchmarking two novel tasks of HOI synthesis conditioned on either the whole text prompt or the segmented text prompts as fine-grained timeline control. To address these novel tasks, we propose a dual-branch conditional diffusion model with a mutual interaction module for HOI synthesis. Besides, an auto-regressive generation pipeline is also designed to obtain smooth transitions between HOI segments. Experimental results demonstrate the generalization ability to unseen object geometries and temporal compositions.

9/12/2024

HOI-M3: Capture Multiple Humans and Objects Interaction within Contextual Environment

Juze Zhang, Jingyan Zhang, Zining Song, Zhanhe Shi, Chengfeng Zhao, Ye Shi, Jingyi Yu, Lan Xu, Jingya Wang

Humans naturally interact with both others and the surrounding multiple objects, engaging in various social activities. However, recent advances in modeling human-object interactions mostly focus on perceiving isolated individuals and objects, due to fundamental data scarcity. In this paper, we introduce HOI-M3, a novel large-scale dataset for modeling the interactions of Multiple huMans and Multiple objects. Notably, it provides accurate 3D tracking for both humans and objects from dense RGB and object-mounted IMU inputs, covering 199 sequences and 181M frames of diverse humans and objects under rich activities. With the unique HOI-M3 dataset, we introduce two novel data-driven tasks with companion strong baselines: monocular capture and unstructured generation of multiple human-object interactions. Extensive experiments demonstrate that our dataset is challenging and worthy of further research about multiple human-object interactions and behavior analysis. Our HOI-M3 dataset, corresponding codes, and pre-trained models will be disseminated to the community for future research.

4/3/2024

Open-World Human-Object Interaction Detection via Multi-modal Prompts

Jie Yang, Bingliang Li, Ailing Zeng, Lei Zhang, Ruimao Zhang

In this paper, we develop MP-HOI, a powerful Multi-modal Prompt-based HOI detector designed to leverage both textual descriptions for open-set generalization and visual exemplars for handling high ambiguity in descriptions, realizing HOI detection in the open world. Specifically, it integrates visual prompts into existing language-guided-only HOI detectors to handle situations where textual descriptions face difficulties in generalization and to address complex scenarios with high interaction ambiguity. To facilitate MP-HOI training, we build a large-scale HOI dataset named Magic-HOI, which gathers six existing datasets into a unified label space, forming over 186K images with 2.4K objects, 1.2K actions, and 20K HOI interactions. Furthermore, to tackle the long-tail issue within the Magic-HOI dataset, we introduce an automated pipeline for generating realistically annotated HOI images and present SynHOI, a high-quality synthetic HOI dataset containing 100K images. Leveraging these two datasets, MP-HOI optimizes the HOI task as a similarity learning process between multi-modal prompts and objects/interactions via a unified contrastive loss, to learn generalizable and transferable objects/interactions representations from large-scale data. MP-HOI could serve as a generalist HOI detector, surpassing the HOI vocabulary of existing expert models by more than 30 times. Concurrently, our results demonstrate that MP-HOI exhibits remarkable zero-shot capability in real-world scenarios and consistently achieves a new state-of-the-art performance across various benchmarks.

6/12/2024

CooHOI: Learning Cooperative Human-Object Interaction with Manipulated Object Dynamics

Jiawei Gao, Ziqin Wang, Zeqi Xiao, Jingbo Wang, Tai Wang, Jinkun Cao, Xiaolin Hu, Si Liu, Jifeng Dai, Jiangmiao Pang

Recent years have seen significant advancements in humanoid control, largely due to the availability of large-scale motion capture data and the application of reinforcement learning methodologies. However, many real-world tasks, such as moving large and heavy furniture, require multi-character collaboration. Given the scarcity of data on multi-character collaboration and the efficiency challenges associated with multi-agent learning, these tasks cannot be straightforwardly addressed using training paradigms designed for single-agent scenarios. In this paper, we introduce Cooperative Human-Object Interaction (CooHOI), a novel framework that addresses multi-character objects transporting through a two-phase learning paradigm: individual skill acquisition and subsequent transfer. Initially, a single agent learns to perform tasks using the Adversarial Motion Priors (AMP) framework. Following this, the agent learns to collaborate with others by considering the shared dynamics of the manipulated object during parallel training using Multi Agent Proximal Policy Optimization (MAPPO). When one agent interacts with the object, resulting in specific object dynamics changes, the other agents learn to respond appropriately, thereby achieving implicit communication and coordination between teammates. Unlike previous approaches that relied on tracking-based methods for multi-character HOI, CooHOI is inherently efficient, does not depend on motion capture data of multi-character interactions, and can be seamlessly extended to include more participants and a wide range of object types.

6/21/2024