Unified Human-Scene Interaction via Prompted Chain-of-Contacts

2309.07918

Published 4/22/2024 by Zeqi Xiao, Tai Wang, Jingbo Wang, Jinkun Cao, Wenwei Zhang, Bo Dai, Dahua Lin, Jiangmiao Pang


Abstract

Human-Scene Interaction (HSI) is a vital component of fields like embodied AI and virtual reality. Despite advancements in motion quality and physical plausibility, two pivotal factors, versatile interaction control and the development of a user-friendly interface, require further exploration before the practical application of HSI. This paper presents a unified HSI framework, UniHSI, which supports unified control of diverse interactions through language commands. This framework is built upon the definition of interaction as Chain of Contacts (CoC): steps of human joint-object part pairs, which is inspired by the strong correlation between interaction types and human-object contact regions. Based on the definition, UniHSI constitutes a Large Language Model (LLM) Planner to translate language prompts into task plans in the form of CoC, and a Unified Controller that turns CoC into uniform task execution. To facilitate training and evaluation, we collect a new dataset named ScenePlan that encompasses thousands of task plans generated by LLMs based on diverse scenarios. Comprehensive experiments demonstrate the effectiveness of our framework in versatile task execution and generalizability to real scanned scenes. The project page is at https://github.com/OpenRobotLab/UniHSI .


Overview

This paper presents a unified framework called UniHSI for human-scene interaction (HSI) that supports versatile interaction control through language commands.
UniHSI is built upon the concept of a "Chain of Contacts (CoC)," which represents an interaction as a sequence of steps, each made up of human joint-object part pairs.
The framework includes a Large Language Model (LLM) Planner to translate language prompts into task plans in the form of CoC, and a Unified Controller to execute these tasks.
To facilitate training and evaluation, the authors collected a new dataset called ScenePlan with task plans generated by LLMs for diverse scenarios.

Plain English Explanation

The paper discusses a new framework called UniHSI that aims to improve human-scene interaction (HSI), which is an important aspect of fields like embodied AI and virtual reality. Despite advancements in the quality and realism of motion, two key factors – versatile interaction control and user-friendly interfaces – still require further development before practical application of HSI.

UniHSI is based on the idea of a "Chain of Contacts (CoC)," which describes an interaction as a sequence of steps, each pairing human joints with the object parts they should touch. For example, sitting down could be expressed as the pelvis making contact with the seat of a chair. By modeling interactions in this way, UniHSI can translate language commands into specific task plans, and then execute those plans with a single unified controller. According to the authors, this lets many different kinds of interaction be controlled through one language interface rather than through a separate mechanism per interaction type.

To support the development and evaluation of UniHSI, the researchers created a new dataset called ScenePlan. This dataset contains thousands of task plans generated by large language models (LLMs) for a variety of different scenarios, providing a rich resource for training and testing HSI systems.

Technical Explanation

The core of UniHSI is the definition of an interaction as a "Chain of Contacts (CoC)": a sequence of steps, each consisting of human joint-object part pairs. This is inspired by the observation that interaction types correlate strongly with the regions where the human and the object are in contact.
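To make the CoC idea concrete, the following is a minimal Python sketch of how such a chain might be represented. It is an illustration only: the class names (ContactPair, ContactStep), the joint and object-part labels, and the sit_on_chair example are all hypothetical and are not taken from the paper or its code release, whose actual format may carry additional fields.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class ContactPair:
    """One human joint paired with one object part (hypothetical labels)."""
    joint: str        # e.g. "pelvis"
    object_part: str  # e.g. "chair.seat"


@dataclass
class ContactStep:
    """One step of the chain: the pairs that should hold at the same time."""
    pairs: List[ContactPair]


# A Chain of Contacts is modeled here simply as an ordered list of steps.
ChainOfContacts = List[ContactStep]


def sit_on_chair() -> ChainOfContacts:
    """Hypothetical two-step chain: sit down, then lean back."""
    return [
        ContactStep([ContactPair("pelvis", "chair.seat")]),
        ContactStep([
            ContactPair("pelvis", "chair.seat"),
            ContactPair("torso", "chair.backrest"),
        ]),
    ]


def joints_used(chain: ChainOfContacts) -> List[str]:
    """Return the sorted set of joints that appear anywhere in the chain."""
    return sorted({pair.joint for step in chain for pair in step.pairs})
```

In this sketch, different interactions differ only in which pairs appear and in what order, which is the property that would let a single controller consume all of them.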

Based on this CoC concept, the UniHSI framework consists of two main components:

LLM Planner: This module takes natural language prompts and translates them into task plans represented as sequences of CoC steps.
Unified Controller: This component executes the task plans generated by the LLM Planner, turning each CoC into motion of the human body in a uniform manner regardless of the interaction type.
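The division of labor between the two components can be sketched as a small pipeline. This is a toy illustration under stated assumptions, not the authors' implementation: stub_planner replaces the LLM with a hard-coded lookup table, stub_controller only produces a text log instead of driving a character, and every name and label below is hypothetical.

```python
from typing import Callable, Dict, List, Tuple

Step = List[Tuple[str, str]]  # (human joint, object part) pairs
Plan = List[Step]             # an ordered Chain of Contacts

# Hypothetical lookup table standing in for what an LLM would generate.
_PLANS: Dict[str, Plan] = {
    "sit on the chair": [[("pelvis", "chair.seat")]],
    "lie on the bed": [
        [("pelvis", "bed.mattress")],
        [("pelvis", "bed.mattress"), ("head", "bed.pillow")],
    ],
}


def stub_planner(prompt: str) -> Plan:
    """Stand-in for the LLM Planner: maps a language prompt to a plan."""
    key = prompt.strip().lower()
    if key not in _PLANS:
        raise ValueError(f"no plan available for prompt: {prompt!r}")
    return _PLANS[key]


def stub_controller(plan: Plan) -> List[str]:
    """Stand-in for the Unified Controller.

    Every step is handled by the same code path, whatever the interaction.
    """
    log = []
    for index, step in enumerate(plan):
        targets = ", ".join(f"{joint}->{part}" for joint, part in step)
        log.append(f"step {index}: reach {targets}")
    return log


def run(
    prompt: str,
    planner: Callable[[str], Plan] = stub_planner,
    controller: Callable[[Plan], List[str]] = stub_controller,
) -> List[str]:
    """Language prompt -> plan (CoC) -> execution log."""
    return controller(planner(prompt))
```

Because the planner and controller communicate only through the plan format, either stub could in principle be swapped for a real component without changing the other.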

To enable training and evaluation of the UniHSI framework, the researchers collected the ScenePlan dataset. This dataset contains thousands of task plans generated by LLMs for diverse scenarios, providing a rich resource for developing and testing HSI systems.

Comprehensive experiments demonstrated the effectiveness of the UniHSI framework in versatile task execution and its ability to generalize to real-world scanned scenes.

Critical Analysis

The UniHSI framework represents a promising approach to improving human-scene interaction capabilities, particularly in its use of language-based control and the unified handling of interaction tasks. The authors' focus on addressing the key challenges of versatile interaction control and user-friendly interfaces is well-aligned with the needs of practical HSI applications.

One potential limitation of the research is the scope of the experiments, which were primarily conducted in simulated environments. While the authors demonstrate the framework's ability to generalize to real-world scanned scenes, further evaluation in more diverse and complex real-world settings would be valuable to fully assess its capabilities and limitations.

Additionally, the reliance on large language models (LLMs) for task planning raises questions about the interpretability and robustness of the system's decision-making process. The authors could explore ways to improve the transparency and accountability of the LLM Planner component, perhaps by incorporating additional mechanisms for plan verification or by drawing on techniques from related work such as "Exploring Interactive Semantic Alignment for Efficient HOI Detection" or "MotionChain: Conversational Motion Controllers via Multimodal Prompts".
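As one possible reading of "plan verification," a plan produced by an LLM could be checked structurally before it reaches the controller. The sketch below is a suggestion for illustration and is not something the paper describes: the KNOWN_JOINTS set, the plan format, and the verify_plan function are all hypothetical.

```python
from typing import List, Set, Tuple

Plan = List[List[Tuple[str, str]]]  # steps of (human joint, object part) pairs

# Hypothetical joint vocabulary; a real skeleton would define its own names.
KNOWN_JOINTS: Set[str] = {
    "pelvis", "torso", "head",
    "left_hand", "right_hand", "left_foot", "right_foot",
}


def verify_plan(plan: Plan, scene_parts: Set[str]) -> List[str]:
    """Return a list of human-readable problems; an empty list means the plan passed."""
    errors: List[str] = []
    if not plan:
        errors.append("plan has no steps")
    for index, step in enumerate(plan):
        if not step:
            errors.append(f"step {index}: no contact pairs")
        for joint, part in step:
            if joint not in KNOWN_JOINTS:
                errors.append(f"step {index}: unknown joint '{joint}'")
            if part not in scene_parts:
                errors.append(f"step {index}: object part '{part}' not in scene")
    return errors
```

A check like this only catches malformed plans, such as references to joints or object parts that do not exist. It says nothing about whether a well-formed plan is physically sensible, which would need a separate test.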

Furthermore, the authors could consider extending the UniHSI framework to handle more complex interaction scenarios, such as those involving multiple humans and objects or human-human interactions, to further demonstrate its versatility and potential for real-world applications.

Conclusion

The UniHSI framework presented in this paper represents a significant step forward in the development of human-scene interaction capabilities. By defining interactions as Chains of Contacts and leveraging large language models for task planning, the authors have created a versatile and user-friendly system for controlling diverse interaction tasks. The collection of the ScenePlan dataset also provides a valuable resource for further research and development in this field.

While the framework shows promising results, there are opportunities for further exploration, such as improving the interpretability of the LLM-based planning component and expanding the scope of interaction scenarios. Continued advancements in this area could lead to more natural and intuitive human-computer interaction experiences in a wide range of applications, from embodied AI to virtual reality.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers


InterControl: Zero-shot Human Interaction Generation by Controlling Every Joint

Zhenzhi Wang, Jingbo Wang, Yixuan Li, Dahua Lin, Bo Dai

Text-conditioned motion synthesis has made remarkable progress with the emergence of diffusion models. However, the majority of these motion diffusion models are primarily designed for a single character and overlook multi-human interactions. In our approach, we strive to explore this problem by synthesizing human motion with interactions for a group of characters of any size in a zero-shot manner. The key aspect of our approach is the adaptation of human-wise interactions as pairs of human joints that can be either in contact or separated by a desired distance. In contrast to existing methods that necessitate training motion generation models on multi-human motion datasets with a fixed number of characters, our approach inherently possesses the flexibility to model human interactions involving an arbitrary number of individuals, thereby transcending the limitations imposed by the training data. We introduce a novel controllable motion generation method, InterControl, to encourage the synthesized motions maintaining the desired distance between joint pairs. It consists of a motion controller and an inverse kinematics guidance module that realistically and accurately aligns the joints of synthesized characters to the desired location. Furthermore, we demonstrate that the distance between joint pairs for human-wise interactions can be generated using an off-the-shelf Large Language Model (LLM). Experimental results highlight the capability of our framework to generate interactions with multiple human characters and its potential to work with off-the-shelf physics-based character simulators.

6/18/2024

cs.CV

Human-Object Interaction from Human-Level Instructions

Zhen Wu, Jiaman Li, C. Karen Liu

Intelligent agents need to autonomously navigate and interact within contextual environments to perform a wide range of daily tasks based on human-level instructions. These agents require a foundational understanding of the world, incorporating common sense and knowledge, to interpret such instructions. Moreover, they must possess precise low-level skills for movement and interaction to execute the detailed task plans derived from these instructions. In this work, we address the task of synthesizing continuous human-object interactions for manipulating large objects within contextual environments, guided by human-level instructions. Our goal is to generate synchronized object motion, full-body human motion, and detailed finger motion, all essential for realistic interactions. Our framework consists of a large language model (LLM) planning module and a low-level motion generator. We use LLMs to deduce spatial object relationships and devise a method for accurately determining their positions and orientations in target scene layouts. Additionally, the LLM planner outlines a detailed task plan specifying a sequence of sub-tasks. This task plan, along with the target object poses, serves as input for our low-level motion generator, which seamlessly alternates between navigation and interaction modules. We present the first complete system that can synthesize object motion, full-body motion, and finger motion simultaneously from human-level instructions. Our experiments demonstrate the effectiveness of our high-level planner in generating plausible target layouts and our low-level motion generator in synthesizing realistic interactions for diverse objects. Please refer to our project page for more results: https://hoifhli.github.io/.

6/27/2024

cs.AI cs.CV

CooHOI: Learning Cooperative Human-Object Interaction with Manipulated Object Dynamics

Jiawei Gao, Ziqin Wang, Zeqi Xiao, Jingbo Wang, Tai Wang, Jinkun Cao, Xiaolin Hu, Si Liu, Jifeng Dai, Jiangmiao Pang

Recent years have seen significant advancements in humanoid control, largely due to the availability of large-scale motion capture data and the application of reinforcement learning methodologies. However, many real-world tasks, such as moving large and heavy furniture, require multi-character collaboration. Given the scarcity of data on multi-character collaboration and the efficiency challenges associated with multi-agent learning, these tasks cannot be straightforwardly addressed using training paradigms designed for single-agent scenarios. In this paper, we introduce Cooperative Human-Object Interaction (CooHOI), a novel framework that addresses multi-character objects transporting through a two-phase learning paradigm: individual skill acquisition and subsequent transfer. Initially, a single agent learns to perform tasks using the Adversarial Motion Priors (AMP) framework. Following this, the agent learns to collaborate with others by considering the shared dynamics of the manipulated object during parallel training using Multi Agent Proximal Policy Optimization (MAPPO). When one agent interacts with the object, resulting in specific object dynamics changes, the other agents learn to respond appropriately, thereby achieving implicit communication and coordination between teammates. Unlike previous approaches that relied on tracking-based methods for multi-character HOI, CooHOI is inherently efficient, does not depend on motion capture data of multi-character interactions, and can be seamlessly extended to include more participants and a wide range of object types.

6/21/2024

cs.RO cs.AI

Scaling Up Dynamic Human-Scene Interaction Modeling

Nan Jiang, Zhiyuan Zhang, Hongjie Li, Xiaoxuan Ma, Zan Wang, Yixin Chen, Tengyu Liu, Yixin Zhu, Siyuan Huang

Confronting the challenges of data scarcity and advanced motion synthesis in human-scene interaction modeling, we introduce the TRUMANS dataset alongside a novel HSI motion synthesis method. TRUMANS stands as the most comprehensive motion-captured HSI dataset currently available, encompassing over 15 hours of human interactions across 100 indoor scenes. It intricately captures whole-body human motions and part-level object dynamics, focusing on the realism of contact. This dataset is further scaled up by transforming physical environments into exact virtual models and applying extensive augmentations to appearance and motion for both humans and objects while maintaining interaction fidelity. Utilizing TRUMANS, we devise a diffusion-based autoregressive model that efficiently generates HSI sequences of any length, taking into account both scene context and intended actions. In experiments, our approach shows remarkable zero-shot generalizability on a range of 3D scene datasets (e.g., PROX, Replica, ScanNet, ScanNet++), producing motions that closely mimic original motion-captured sequences, as confirmed by quantitative experiments and human studies.

5/27/2024

cs.CV