Two-Person Interaction Augmentation with Skeleton Priors

Read original: arXiv:2404.05490 - Published 4/11/2024 by Baiyi Li, Edmond S. L. Ho, Hubert P. H. Shum, He Wang

Two-Person Interaction Augmentation with Skeleton Priors

Overview

This paper presents a method for augmenting two-person interactions using skeleton priors.
It explores how to generate realistic interactions between two people by leveraging existing knowledge about human skeletal structure and motion.
The proposed approach aims to improve the quality and realism of generated two-person interactions compared to previous methods.

Plain English Explanation

The paper focuses on a problem in computer graphics and animation: how to create realistic interactions between two people in a virtual environment. Traditionally, this has been a challenge, as it's difficult to capture the nuanced and dynamic movements that occur during human-to-human interactions.

The researchers in this study have developed a new method that uses "skeleton priors" to help generate more believable two-person interactions. Skeleton priors refer to the existing knowledge we have about the structure and typical movements of the human skeletal system. By incorporating this prior information into their model, the researchers were able to create virtual interactions that look and feel more natural and lifelike.

This work could have important applications in fields like video game development, film and television production, and even virtual reality experiences. By making two-person interactions more realistic, it can help create more immersive and engaging digital environments that better reflect the complexities of human behavior.

Technical Explanation

The paper proposes a novel approach for augmenting two-person interactions using skeleton priors. The key idea is to leverage existing knowledge about human skeletal structure and motion to generate more realistic interactions between virtual characters.

The method first extracts skeletal pose information from input videos of two-person interactions. It then uses this data to train a deep neural network that can learn the underlying patterns and dynamics of these interactions. By incorporating the skeleton priors into the model, the network is able to generate new interaction sequences that adhere to the expected constraints and characteristics of human movement.

The researchers evaluate their approach through a series of experiments, comparing the generated interactions to ground truth data as well as previous state-of-the-art methods. The results demonstrate that their technique is able to produce two-person interactions that are more natural and coherent than those generated by other approaches.

Some of the technical insights from the paper include the use of a Transformer-based architecture to model the temporal dependencies in the interaction sequences, and the incorporation of adversarial training to further improve the realism of the generated outputs.

Critical Analysis

The paper presents a promising approach for enhancing the realism of two-person interactions in virtual environments. By leveraging existing knowledge about human skeletal structure and motion, the researchers are able to generate more natural and coherent interactions compared to previous methods.

One potential limitation of the work is that it relies on input videos of two-person interactions to train the model. This means the approach may be limited by the availability and diversity of the training data, and may struggle to generalize to novel interaction scenarios not present in the original dataset.

Additionally, the paper does not explore the potential biases or limitations that may be inherent in the skeleton priors used by the model. It's possible that these priors could reflect certain cultural or demographic norms, which could lead to generated interactions that lack diversity or inclusivity.

Further research could investigate ways to make the approach more robust and adaptable, such as by incorporating more diverse training data or exploring techniques for dynamically updating the skeleton priors based on the context of the interaction. Exploring the ethical implications of using such technology in real-world applications would also be an important area for further study.

Conclusion

In summary, this paper presents a novel approach for augmenting two-person interactions using skeleton priors. By leveraging existing knowledge about human skeletal structure and motion, the researchers were able to generate more realistic and coherent interactions between virtual characters. The proposed method demonstrates promising results and could have important applications in fields like video game development, animation, and virtual reality.

While the paper represents an important step forward, there are still opportunities to further improve the approach and address potential limitations and ethical considerations. Overall, this work highlights the value of incorporating domain-specific knowledge into AI systems to enhance the realism and quality of synthetic human behavior.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Follow @aimodelsfyi on 𝕏 →

Related Papers

Two-Person Interaction Augmentation with Skeleton Priors

Baiyi Li, Edmond S. L. Ho, Hubert P. H. Shum, He Wang

Close and continuous interaction with rich contacts is a crucial aspect of human activities (e.g. hugging, dancing) and of interest in many domains like activity recognition, motion prediction, character animation, etc. However, acquiring such skeletal motion is challenging. While direct motion capture is expensive and slow, motion editing/generation is also non-trivial, as complex contact patterns with topological and geometric constraints have to be retained. To this end, we propose a new deep learning method for two-body skeletal interaction motion augmentation, which can generate variations of contact-rich interactions with varying body sizes and proportions while retaining the key geometric/topological relations between two bodies. Our system can learn effectively from a relatively small amount of data and generalize to drastically different skeleton sizes. Through exhaustive evaluation and comparison, we show it can generate high-quality motions, has strong generalizability and outperforms traditional optimization-based methods and alternative deep learning solutions.

4/11/2024

AvatarPose: Avatar-guided 3D Pose Estimation of Close Human Interaction from Sparse Multi-view Videos

Feichi Lu, Zijian Dong, Jie Song, Otmar Hilliges

Despite progress in human motion capture, existing multi-view methods often face challenges in estimating the 3D pose and shape of multiple closely interacting people. This difficulty arises from reliance on accurate 2D joint estimations, which are hard to obtain due to occlusions and body contact when people are in close interaction. To address this, we propose a novel method leveraging the personalized implicit neural avatar of each individual as a prior, which significantly improves the robustness and precision of this challenging pose estimation task. Concretely, the avatars are efficiently reconstructed via layered volume rendering from sparse multi-view videos. The reconstructed avatar prior allows for the direct optimization of 3D poses based on color and silhouette rendering loss, bypassing the issues associated with noisy 2D detections. To handle interpenetration, we propose a collision loss on the overlapping shape regions of avatars to add penetration constraints. Moreover, both 3D poses and avatars are optimized in an alternating manner. Our experimental results demonstrate state-of-the-art performance on several public datasets.

8/21/2024

New!Contact-aware Human Motion Generation from Textual Descriptions

Sihan Ma, Qiong Cao, Jing Zhang, Dacheng Tao

This paper addresses the problem of generating 3D interactive human motion from text. Given a textual description depicting the actions of different body parts in contact with static objects, we synthesize sequences of 3D body poses that are visually natural and physically plausible. Yet, this task poses a significant challenge due to the inadequate consideration of interactions by physical contacts in both motion and textual descriptions, leading to unnatural and implausible sequences. To tackle this challenge, we create a novel dataset named RICH-CAT, representing Contact-Aware Texts constructed from the RICH dataset. RICH-CAT comprises high-quality motion, accurate human-object contact labels, and detailed textual descriptions, encompassing over 8,500 motion-text pairs across 26 indoor/outdoor actions. Leveraging RICH-CAT, we propose a novel approach named CATMO for text-driven interactive human motion synthesis that explicitly integrates human body contacts as evidence. We employ two VQ-VAE models to encode motion and body contact sequences into distinct yet complementary latent spaces and an intertwined GPT for generating human motions and contacts in a mutually conditioned manner. Additionally, we introduce a pre-trained text encoder to learn textual embeddings that better discriminate among various contact types, allowing for more precise control over synthesized motions and contacts. Our experiments demonstrate the superior performance of our approach compared to existing text-to-motion methods, producing stable, contact-aware motion sequences. Code and data will be available for research purposes at https://xymsh.github.io/RICH-CAT/

9/17/2024

Kinematics-based 3D Human-Object Interaction Reconstruction from Single View

Yuhang Chen, Chenxing Wang

Reconstructing 3D human-object interaction (HOI) from single-view RGB images is challenging due to the absence of depth information and potential occlusions. Existing methods simply predict the body poses merely rely on network training on some indoor datasets, which cannot guarantee the rationality of the results if some body parts are invisible due to occlusions that appear easily. Inspired by the end-effector localization task in robotics, we propose a kinematics-based method that can drive the joints of human body to the human-object contact regions accurately. After an improved forward kinematics algorithm is proposed, the Multi-Layer Perceptron is introduced into the solution of inverse kinematics process to determine the poses of joints, which achieves precise results than the commonly-used numerical methods in robotics. Besides, a Contact Region Recognition Network (CRRNet) is also proposed to robustly determine the contact regions using a single-view video. Experimental results demonstrate that our method outperforms the state-of-the-art on benchmark BEHAVE. Additionally, our approach shows good portability and can be seamlessly integrated into other methods for optimizations.

7/22/2024