Expressive Forecasting of 3D Whole-body Human Motions






Published 4/5/2024 by Pengxiang Ding, Qiongjie Cui, Min Zhang, Mengyuan Liu, Haofan Wang, Donglin Wang
Expressive Forecasting of 3D Whole-body Human Motions


Human motion forecasting, with the goal of estimating future human behavior over a period of time, is a fundamental task in many real-world applications. However, existing works typically concentrate on predicting the major joints of the human body without considering the delicate movements of the human hands. In practical applications, hand gesture plays an important role in human communication with the real world, and expresses the primary intention of human beings. In this work, we are the first to formulate a whole-body human pose forecasting task, which jointly predicts the future body and hand activities. Correspondingly, we propose a novel Encoding-Alignment-Interaction (EAI) framework that aims to predict both coarse (body joints) and fine-grained (gestures) activities collaboratively, enabling expressive and cross-facilitated forecasting of 3D whole-body human motions. Specifically, our model involves two key constituents: cross-context alignment (XCA) and cross-context interaction (XCI). Considering the heterogeneous information within the whole-body, XCA aims to align the latent features of various human components, while XCI focuses on effectively capturing the context interaction among the human components. We conduct extensive experiments on a newly-introduced large-scale benchmark and achieve state-of-the-art performance. The code is public for research purposes at

Create account to get full access


If you already have an account, we'll log you in


ā€¢ This paper introduces a novel method for expressive forecasting of 3D whole-body human motions, which aims to generate realistic and diverse human motion sequences. ā€¢ The proposed approach leverages intra-context encoding and inter-context consistency to capture the rich semantics and dynamics of human movements. ā€¢ The method is evaluated on several challenging datasets, demonstrating its superiority over existing state-of-the-art techniques for human motion prediction.

Plain English Explanation

The paper presents a new way to forecast, or predict, the future movements of a person's full body in 3D. This is an important task for applications like virtual reality, animation, and human-robot interaction, where realistic and varied human motion is crucial.

The key ideas are:

  1. Intra-context Encoding: The method captures the inherent meaning and patterns within individual motion sequences, allowing it to generate more expressive and natural-looking movements.

  2. Inter-context Consistency: The approach also ensures that the predicted motions are consistent with the overall context, such as the person's personality or the task they are performing. This helps create more coherent and plausible motion sequences.

By leveraging these two principles, the proposed technique is able to produce human motions that are more realistic and diverse compared to previous state-of-the-art methods. The authors demonstrate the effectiveness of their approach through extensive evaluations on several challenging datasets.

Technical Explanation

The paper introduces a novel expressive forecasting framework for generating 3D whole-body human motion sequences. The core components of the method are:

  1. Intra-context Encoding: This module learns to encode the rich semantics and dynamics within individual motion contexts, capturing the inherent patterns and expressiveness of human movements. It uses a series of transformers and recurrent neural networks to model the complex spatio-temporal relationships in the data.

  2. Inter-context Consistency: To ensure the predicted motions are coherent with the overall context, this component enforces consistency across multiple motion sequences. It does this by introducing latent consistency and adversarial training techniques, which encourage the generated motions to align with the target context.

The authors evaluate their approach on several public datasets, including Towards More Realistic Human Motion Prediction, Co-speech Gesture Video Generation via Motion, and Freeman: Towards Benchmarking 3D Human Pose Estimation. The results demonstrate that their method outperforms existing state-of-the-art approaches in terms of motion quality, diversity, and consistency.

Critical Analysis

The paper presents a compelling approach for expressive forecasting of 3D whole-body human motions. However, there are a few potential limitations and areas for further research:

  1. Dataset Bias: The evaluations are conducted on a limited set of datasets, which may not fully capture the diversity of human movements in the real world. Expanding the evaluation to more diverse datasets could provide a more comprehensive assessment of the method's capabilities.

  2. Real-time Performance: The paper does not discuss the computational efficiency of the proposed approach, which is an important consideration for real-time applications, such as text-guided 3D motion generation for hands or local geometry-aware hand-object interaction. Further optimizations may be needed to deploy the method in such scenarios.

  3. Generalization to Diverse Domains: While the method demonstrates strong performance on the evaluated datasets, it is unclear how well it would generalize to human motions in different contexts, such as sports, dance, or sign language. Exploring the transferability of the approach to these domains could broaden its applicability.

Overall, the paper presents a promising step towards more expressive and realistic human motion forecasting. The authors' thoughtful consideration of both intra-context and inter-context factors is a compelling contribution to the field.


This paper introduces a novel framework for expressive forecasting of 3D whole-body human motions. The key innovations are the intra-context encoding and inter-context consistency modules, which enable the generation of realistic and diverse human movement sequences.

The authors demonstrate the effectiveness of their approach through extensive evaluations on several challenging datasets, outperforming existing state-of-the-art methods. This work represents an important advancement in human motion prediction, with potential applications in virtual reality, animation, and human-robot interaction.

While the paper presents a compelling solution, there are a few areas for further exploration, such as dataset bias, real-time performance, and generalization to diverse domains. Addressing these aspects could help solidify the method's applicability across a wider range of real-world scenarios.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers


Scene-aware Human Motion Forecasting via Mutual Distance Prediction

Chaoyue Xing, Wei Mao, Miaomiao Liu





In this paper, we tackle the problem of scene-aware 3D human motion forecasting. A key challenge of this task is to predict future human motions that are consistent with the scene by modeling the human-scene interactions. While recent works have demonstrated that explicit constraints on human-scene interactions can prevent the occurrence of ghost motion, they only provide constraints on partial human motion e.g., the global motion of the human or a few joints contacting the scene, leaving the rest of the motion unconstrained. To address this limitation, we propose to model the human-scene interaction with the mutual distance between the human body and the scene. Such mutual distances constrain both the local and global human motion, resulting in a whole-body motion constrained prediction. In particular, mutual distance constraints consist of two components, the signed distance of each vertex on the human mesh to the scene surface and the distance of basis scene points to the human mesh. We further introduce a global scene representation learned from a signed distance function (SDF) volume to ensure coherence between the global scene representation and the explicit constraint from the mutual distance. We develop a pipeline with two sequential steps: predicting the future mutual distances first, followed by forecasting future human motion. During training, we explicitly encourage consistency between predicted poses and mutual distances. Extensive evaluations on the existing synthetic and real datasets demonstrate that our approach consistently outperforms the state-of-the-art methods.

Read more



Multimodal Sense-Informed Prediction of 3D Human Motions

Zhenyu Lou, Qiongjie Cui, Haofan Wang, Xu Tang, Hong Zhou





Predicting future human pose is a fundamental application for machine intelligence, which drives robots to plan their behavior and paths ahead of time to seamlessly accomplish human-robot collaboration in real-world 3D scenarios. Despite encouraging results, existing approaches rarely consider the effects of the external scene on the motion sequence, leading to pronounced artifacts and physical implausibilities in the predictions. To address this limitation, this work introduces a novel multi-modal sense-informed motion prediction approach, which conditions high-fidelity generation on two modal information: external 3D scene, and internal human gaze, and is able to recognize their salience for future human activity. Furthermore, the gaze information is regarded as the human intention, and combined with both motion and scene features, we construct a ternary intention-aware attention to supervise the generation to match where the human wants to reach. Meanwhile, we introduce semantic coherence-aware attention to explicitly distinguish the salient point clouds and the underlying ones, to ensure a reasonable interaction of the generated sequence with the 3D scene. On two real-world benchmarks, the proposed method achieves state-of-the-art performance both in 3D human pose and trajectory prediction.

Read more



FutureHuman3D: Forecasting Complex Long-Term 3D Human Behavior from Video Observations

Christian Diller, Thomas Funkhouser, Angela Dai





We present a generative approach to forecast long-term future human behavior in 3D, requiring only weak supervision from readily available 2D human action data. This is a fundamental task enabling many downstream applications. The required ground-truth data is hard to capture in 3D (mocap suits, expensive setups) but easy to acquire in 2D (simple RGB cameras). Thus, we design our method to only require 2D RGB data at inference time while being able to generate 3D human motion sequences. We use a differentiable 2D projection scheme in an autoregressive manner for weak supervision, and an adversarial loss for 3D regularization. Our method predicts long and complex human behavior sequences (e.g., cooking, assembly) consisting of multiple sub-actions. We tackle this in a semantically hierarchical manner, jointly predicting high-level coarse action labels together with their low-level fine-grained realizations as characteristic 3D human poses. We observe that these two action representations are coupled in nature, and joint prediction benefits both action and pose forecasting. Our experiments demonstrate the complementary nature of joint action and 3D pose prediction: our joint approach outperforms each task treated individually, enables robust longer-term sequence prediction, and improves over alternative approaches to forecast actions and characteristic 3D poses.

Read more


Multi-agent Long-term 3D Human Pose Forecasting via Interaction-aware Trajectory Conditioning

Multi-agent Long-term 3D Human Pose Forecasting via Interaction-aware Trajectory Conditioning

Jaewoo Jeong, Daehee Park, Kuk-Jin Yoon





Human pose forecasting garners attention for its diverse applications. However, challenges in modeling the multi-modal nature of human motion and intricate interactions among agents persist, particularly with longer timescales and more agents. In this paper, we propose an interaction-aware trajectory-conditioned long-term multi-agent human pose forecasting model, utilizing a coarse-to-fine prediction approach: multi-modal global trajectories are initially forecasted, followed by respective local pose forecasts conditioned on each mode. In doing so, our Trajectory2Pose model introduces a graph-based agent-wise interaction module for a reciprocal forecast of local motion-conditioned global trajectory and trajectory-conditioned local pose. Our model effectively handles the multi-modality of human motion and the complexity of long-term multi-agent interactions, improving performance in complex environments. Furthermore, we address the lack of long-term (6s+) multi-agent (5+) datasets by constructing a new dataset from real-world images and 2D annotations, enabling a comprehensive evaluation of our proposed model. State-of-the-art prediction performance on both complex and simpler datasets confirms the generalized effectiveness of our method. The code is available at

Read more
