Generating Human Motion in 3D Scenes from Text Descriptions






Published 5/14/2024 by Zhi Cen, Huaijin Pi, Sida Peng, Zehong Shen, Minghui Yang, Shuai Zhu, Hujun Bao, Xiaowei Zhou



Generating human motions from textual descriptions has gained growing research interest due to its wide range of applications. However, only a few works consider human-scene interactions together with text conditions, which is crucial for visual and physical realism. This paper focuses on the task of generating human motions in 3D indoor scenes given text descriptions of the human-scene interactions. This task presents challenges due to the multi-modal nature of text, scene, and motion, as well as the need for spatial reasoning. To address these challenges, we propose a new approach that decomposes the complex problem into two more manageable sub-problems: (1) language grounding of the target object and (2) object-centric motion generation. For language grounding of the target object, we leverage the power of large language models. For motion generation, we design an object-centric scene representation for the generative model to focus on the target object, thereby reducing the scene complexity and facilitating the modeling of the relationship between human motions and the object. Experiments demonstrate the better motion quality of our approach compared to baselines and validate our design choices.



Overview

  • This paper focuses on generating human motions in 3D indoor scenes based on text descriptions of human-scene interactions.
  • The task presents challenges due to the multi-modal nature of text, scene, and motion, as well as the need for spatial reasoning.
  • The authors propose a two-step approach: (1) language grounding of the target object and (2) object-centric motion generation.

Plain English Explanation

The paper discusses the task of creating realistic 3D animations of human movements and interactions within indoor scenes, based on textual descriptions. This is a challenging problem because it requires understanding the relationship between language, the physical environment, and human actions.

To address this, the researchers developed a two-part system. First, it uses large language models to work out which object in the scene the text refers to (language grounding of the target object). Then, it generates the human motion by focusing specifically on that target object and its relationship to the person, rather than trying to model the entire scene. This reduces the complexity of the problem and lets the system better capture the connection between the text, the environment, and the human movements.

The experiments show that this approach produces higher quality motions compared to previous methods. By breaking down the task into these two more manageable sub-problems, the researchers were able to leverage the strengths of different techniques to create more realistic and text-guided 3D animations of human-scene interactions.

Technical Explanation

The paper addresses the task of generating human motions in 3D indoor scenes based on text descriptions of the interactions. This is a challenging problem due to the need to reason about the relationships between the textual description, the physical scene, and the resulting human motion.

To tackle this, the authors propose a two-step approach. First, they leverage large language models to perform language grounding of the target object mentioned in the text description. This grounds the textual input to the relevant elements in the 3D scene.
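The grounding step can be pictured with a minimal sketch. It is not the authors' implementation: the prompt wording, the JSON reply format, and the helper names `build_grounding_prompt`, `parse_target_id`, and `ground_target` are all assumptions made for illustration, and the language model is passed in as a plain callable so that no particular API is implied.

```python
import json
import re
from typing import Callable, Dict, List, Optional


def build_grounding_prompt(description: str, objects: List[Dict]) -> str:
    """Format the scene's object list and the motion text into one prompt.

    The prompt layout is hypothetical; the summary does not specify one.
    """
    lines = [
        f"- id={obj['id']} label={obj['label']} center={obj['center']}"
        for obj in objects
    ]
    return (
        "Scene objects:\n" + "\n".join(lines) + "\n"
        f"Instruction: {description}\n"
        'Reply with JSON like {"target_id": <id>}.'
    )


def parse_target_id(reply: str, objects: List[Dict]) -> Optional[int]:
    """Extract the chosen object id, rejecting ids that are not in the scene."""
    match = re.search(r"\{.*?\}", reply, flags=re.DOTALL)
    if match is None:
        return None
    try:
        target_id = json.loads(match.group(0)).get("target_id")
    except json.JSONDecodeError:
        return None
    valid_ids = {obj["id"] for obj in objects}
    if isinstance(target_id, int) and target_id in valid_ids:
        return target_id
    return None


def ground_target(
    description: str, objects: List[Dict], llm: Callable[[str], str]
) -> Optional[int]:
    """Ask the language model (any text-in, text-out callable) for the target."""
    return parse_target_id(llm(build_grounding_prompt(description, objects)), objects)
```

Validating the returned id against the scene's object list is a design choice of this sketch: a free-form model reply can name an object that does not exist, and the motion stage needs a concrete object to centre on.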

For the second step, the authors design an object-centric scene representation that allows their generative model to focus specifically on the target object and its relationship to the human motion. This reduces the complexity of the scene and facilitates modeling the connection between the text, the object, and the resulting human movements.
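One way to read "object-centric" is that the scene is cropped around the grounded object and re-expressed in that object's coordinate frame. The sketch below illustrates that idea only; the paper's actual representation may differ. The function name `object_centric_crop`, the z-up convention, the yaw parameter, and the 1.5 m radius are assumptions for the example.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]


def object_centric_crop(
    scene_points: Sequence[Point],
    target_center: Point,
    target_yaw: float = 0.0,
    radius: float = 1.5,
) -> List[Point]:
    """Keep points near the target object and express them in its frame.

    Assumes a z-up world. Points farther than `radius` from the object
    centre in the horizontal plane are dropped; the rest are translated
    so the object centre is the origin and rotated by -target_yaw about z
    so the object's facing direction becomes the local +x axis.
    """
    cx, cy, cz = target_center
    cos_y, sin_y = math.cos(-target_yaw), math.sin(-target_yaw)
    local_points: List[Point] = []
    for x, y, z in scene_points:
        dx, dy, dz = x - cx, y - cy, z - cz
        if math.hypot(dx, dy) > radius:
            continue  # outside the region around the target object
        local_points.append(
            (cos_y * dx - sin_y * dy, sin_y * dx + cos_y * dy, dz)
        )
    return local_points
```

Under these assumptions, a generative model conditioned on the cropped, object-aligned points sees the same local geometry wherever the object sits in the room, which is one plausible reason such a representation would reduce scene complexity.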

Experiments demonstrate that this two-step approach produces higher quality motions compared to baseline methods. The language grounding and object-centric representation help the system better understand and generate the appropriate human-scene interactions based on the provided textual descriptions.

Critical Analysis

The paper presents a novel and promising approach to the challenging task of generating human motions in 3D scenes from text descriptions. The two-step decomposition into language grounding and object-centric motion generation is a clever way to manage the complexity of the problem.

However, the paper does not address some potential limitations of the approach. For example, it's unclear how well the system would perform on more complex scenes with multiple interacting objects, or how it would handle ambiguous or open-ended text descriptions. Additionally, the quality and realism of the generated motions, while improved over baselines, may still have room for further enhancement.

Future research could explore ways to further integrate the textual, visual, and motion modalities to capture more nuanced relationships. Incorporating additional scene and human context could also help the system generate even more natural and realistic human-scene interactions.

Overall, this paper represents an important step forward in the field of text-guided human motion synthesis. By carefully structuring the problem and leveraging the strengths of different techniques, the authors have demonstrated the potential for creating more immersive and interactive 3D animations from simple textual descriptions.


Conclusion

This paper presents a novel approach to the task of generating human motions in 3D indoor scenes based on textual descriptions of human-scene interactions. By decomposing the problem into language grounding and object-centric motion generation, the authors were able to develop a system that outperforms previous baselines in terms of motion quality.

The key innovations of this work are the use of powerful language models for grounding textual input and the design of an object-centric scene representation to facilitate the modeling of human-object interactions. These techniques help bridge the gap between the multi-modal nature of text, scene, and motion, enabling more realistic and text-guided 3D animations.

While the paper highlights the potential of this approach, there are still opportunities for further improvements, such as handling more complex scenes and enhancing the overall realism of the generated motions. Future research in this area could lead to even more advanced and versatile systems for creating engaging and interactive 3D experiences from natural language descriptions.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Generating Human Interaction Motions in Scenes with Text Control

Hongwei Yi, Justus Thies, Michael J. Black, Xue Bin Peng, Davis Rempe





We present TeSMo, a method for text-controlled scene-aware motion generation based on denoising diffusion models. Previous text-to-motion methods focus on characters in isolation without considering scenes due to the limited availability of datasets that include motion, text descriptions, and interactive scenes. Our approach begins with pre-training a scene-agnostic text-to-motion diffusion model, emphasizing goal-reaching constraints on large-scale motion-capture datasets. We then enhance this model with a scene-aware component, fine-tuned using data augmented with detailed scene information, including ground plane and object shapes. To facilitate training, we embed annotated navigation and interaction motions within scenes. The proposed method produces realistic and diverse human-object interactions, such as navigation and sitting, in different scenes with various object shapes, orientations, initial body positions, and poses. Extensive experiments demonstrate that our approach surpasses prior techniques in terms of the plausibility of human-scene interactions, as well as the realism and variety of the generated motions. Code will be released upon publication of this work at



GHOST: Grounded Human Motion Generation with Open Vocabulary Scene-and-Text Contexts

Zoltán Á. Milacski, Koichiro Niinuma, Ryosuke Kawamura, Fernando de la Torre, László A. Jeni





The connection between our 3D surroundings and the descriptive language that characterizes them would be well-suited for localizing and generating human motion in context but for one problem. The complexity introduced by multiple modalities makes capturing this connection challenging with a fixed set of descriptors. Specifically, closed vocabulary scene encoders, which require learning text-scene associations from scratch, have been favored in the literature, often resulting in inaccurate motion grounding. In this paper, we propose a method that integrates an open vocabulary scene encoder into the architecture, establishing a robust connection between text and scene. Our two-step approach starts with pretraining the scene encoder through knowledge distillation from an existing open vocabulary semantic image segmentation model, ensuring a shared text-scene feature space. Subsequently, the scene encoder is fine-tuned for conditional motion generation, incorporating two novel regularization losses that regress the category and size of the goal object. Our methodology achieves up to a 30% reduction in the goal object distance metric compared to the prior state-of-the-art baseline model on the HUMANISE dataset. This improvement is demonstrated through evaluations conducted using three implementations of our framework and a perceptual study. Additionally, our method is designed to seamlessly accommodate future 2D segmentation methods that provide per-pixel text-aligned features for distillation.


Text-guided 3D Human Motion Generation with Keyframe-based Parallel Skip Transformer

Zichen Geng, Caren Han, Zeeshan Hayder, Jian Liu, Mubarak Shah, Ajmal Mian





Text-driven human motion generation is an emerging task in animation and humanoid robot design. Existing algorithms directly generate the full sequence, which is computationally expensive and prone to errors as it does not pay special attention to key poses, a process that has been the cornerstone of animation for decades. We propose KeyMotion, which generates plausible human motion sequences corresponding to input text by first generating keyframes followed by in-filling. We use a Variational Autoencoder (VAE) with Kullback-Leibler regularization to project the keyframes into a latent space to reduce dimensionality and further accelerate the subsequent diffusion process. For the reverse diffusion, we propose a novel Parallel Skip Transformer that performs cross-modal attention between the keyframe latents and text condition. To complete the motion sequence, we propose a text-guided Transformer designed to perform motion in-filling, ensuring the preservation of both fidelity and adherence to the physical constraints of human motion. Experiments show that our method achieves state-of-the-art results on the HumanML3D dataset, outperforming others on all R-precision metrics and MultiModal Distance. KeyMotion also achieves competitive performance on the KIT dataset, achieving the best results on Top3 R-precision, FID, and Diversity metrics.


Towards Open Domain Text-Driven Synthesis of Multi-Person Motions

Mengyi Shan, Lu Dong, Yutao Han, Yuan Yao, Tao Liu, Ifeoma Nwogu, Guo-Jun Qi, Mitch Hill





This work aims to generate natural and diverse group motions of multiple humans from textual descriptions. While single-person text-to-motion generation is extensively studied, it remains challenging to synthesize motions for more than one or two subjects from in-the-wild prompts, mainly due to the lack of available datasets. In this work, we curate human pose and motion datasets by estimating pose information from large-scale image and video datasets. Our models use a transformer-based diffusion framework that accommodates multiple datasets with any number of subjects or frames. Experiments explore both generation of multi-person static poses and generation of multi-person motion sequences. To our knowledge, our method is the first to generate multi-subject motion sequences with high diversity and fidelity from a large variety of textual prompts.

