RecMoDiffuse: Recurrent Flow Diffusion for Human Motion Generation

2406.07169

Published 6/12/2024 by Mirgahney Mohamed, Harry Jake Cunningham, Marc P. Deisenroth, Lourdes Agapito

Abstract

Human motion generation is of paramount importance in computer animation. It is a challenging generative temporal modelling task due to the vast range of possible human motions, high human sensitivity to motion coherence, and the difficulty of accurately generating fine-grained motions. Recently, diffusion methods have been proposed for human motion generation due to their high sample quality and expressiveness. However, generated sequences still suffer from motion incoherence, are limited to short durations and simpler motions, and take considerable time at inference. To address these limitations, we propose RecMoDiffuse: Recurrent Flow Diffusion, a new recurrent diffusion formulation for temporal modelling. Previous work applies diffusion to the whole sequence without any temporal dependency, an approach that inherently makes temporal consistency hard to achieve. In contrast, our method explicitly enforces temporal constraints by means of normalizing flow models in the diffusion process and thereby extends diffusion to the temporal dimension. We demonstrate the effectiveness of RecMoDiffuse in the temporal modelling of human motion. Our experiments show that RecMoDiffuse achieves comparable results with state-of-the-art methods while generating coherent motion sequences and reducing the computational overhead in the inference stage.

Overview

This paper presents RecMoDiffuse, a novel method for generating human motion using a recurrent diffusion model.
The model learns to generate realistic and diverse human motion sequences from a collection of motion capture data.
The authors leverage recent advancements in diffusion models, which have shown impressive results in other domains like image synthesis, and apply them to the task of human motion generation.

Plain English Explanation

The paper introduces a new way to create realistic animations of people moving, using a technique called "diffusion." Diffusion models work by starting with random noise and gradually transforming it into a desired output, like an image or, in this case, a sequence of human motions.

The key innovation in this work is the use of a "recurrent" diffusion model, which means the model remembers and builds upon its previous outputs as it generates a new motion sequence. This allows the model to create smooth, coherent motion that flows naturally over time, rather than denoising the whole sequence at once with no explicit link between one moment and the next.

The authors trained their RecMoDiffuse model on a large dataset of motion capture data, which records the precise movements of real people. By learning from this data, the model can generate new motion sequences that mimic the style and dynamics of natural human movement.

This type of motion generation system could be useful for a variety of applications, such as creating animations for movies, games, or virtual reality experiences, where realistic human motion is important for creating an immersive and believable environment.

Technical Explanation

The authors of this paper propose a novel approach called RecMoDiffuse for generating realistic human motion sequences using a recurrent diffusion model.

Diffusion models have recently shown impressive results in image synthesis, and the authors hypothesize that they can also be effective for generating human motion. The key idea is to start with random noise and gradually transform it into a desired motion sequence, similar to how diffusion models work for images.
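The noising side of this idea can be illustrated with a minimal sketch, assuming the standard forward process used by denoising diffusion models, where a clean sample x_0 is blended with Gaussian noise according to a cumulative schedule. The linear schedule, the function names, and the three-value `pose` are illustrative assumptions, not details taken from the RecMoDiffuse paper.

```python
import math
import random

def make_alpha_bars(num_steps, beta_start=1e-4, beta_end=0.02):
    """Cumulative products of (1 - beta_t) for a linear beta schedule (assumed)."""
    alpha_bars, running = [], 1.0
    for i in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * i / (num_steps - 1)
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars

def add_noise(x0, alpha_bar, rng):
    """Sample x_t = sqrt(alpha_bar) * x0 + sqrt(1 - alpha_bar) * noise."""
    noise = [rng.gauss(0.0, 1.0) for _ in x0]
    xt = [math.sqrt(alpha_bar) * x + math.sqrt(1.0 - alpha_bar) * e
          for x, e in zip(x0, noise)]
    return xt, noise

def recover_x0(xt, noise, alpha_bar):
    """Invert add_noise given the noise; a trained network would estimate `noise`."""
    return [(x - math.sqrt(1.0 - alpha_bar) * e) / math.sqrt(alpha_bar)
            for x, e in zip(xt, noise)]

rng = random.Random(0)
alpha_bars = make_alpha_bars(1000)
pose = [0.1, -0.4, 0.9]  # hypothetical joint angles for a single frame
xt, noise = add_noise(pose, alpha_bars[499], rng)
pose_back = recover_x0(xt, noise, alpha_bars[499])
```

Here `recover_x0` is handed the true noise, so it reconstructs the pose exactly. In a trained model the noise is only estimated, which is why generation proceeds through many small denoising steps rather than a single inversion.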

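To capture the temporal dynamics of human motion, the authors introduce a recurrent formulation in which the model conditions on its previous outputs as it generates each new part of the sequence. According to the abstract, these temporal constraints are enforced with normalizing flow models inside the diffusion process, which extends diffusion to the temporal dimension and helps the model produce smooth, coherent motion over time.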
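As a rough illustration of the recurrent idea only, the sketch below generates a sequence frame by frame, starting each frame from noise and denoising it conditioned on the previously generated frame. The function `toy_denoise_step` is a hypothetical stand-in for a learned network, and this loop does not reproduce the paper's normalizing-flow formulation.

```python
import random

def toy_denoise_step(x, prev_frame, strength=0.5):
    """Hypothetical stand-in for a learned denoiser: pull x toward the previous frame."""
    return [xi + strength * (pi - xi) for xi, pi in zip(x, prev_frame)]

def generate_sequence(first_frame, num_frames, num_denoise_steps, rng):
    """Generate frames one at a time, each conditioned on the frame before it."""
    frames = [list(first_frame)]
    for _ in range(num_frames - 1):
        # Every new frame starts as pure Gaussian noise.
        x = [rng.gauss(0.0, 1.0) for _ in first_frame]
        # Iterative refinement, conditioned on the most recent output.
        for _ in range(num_denoise_steps):
            x = toy_denoise_step(x, frames[-1])
        frames.append(x)
    return frames

rng = random.Random(0)
frames = generate_sequence([0.1, -0.4, 0.9], num_frames=5,
                           num_denoise_steps=10, rng=rng)
# Largest per-coordinate change between consecutive frames.
max_jump = max(abs(a - b)
               for f0, f1 in zip(frames, frames[1:])
               for a, b in zip(f0, f1))
```

Because the toy denoiser simply contracts toward the previous frame, consecutive frames end up nearly identical. A real model would instead learn how motion should evolve from one step to the next, but the control flow (noise in, conditioned refinement, append, repeat) is the part this sketch is meant to show.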

The authors train their RecMoDiffuse model on a large dataset of motion capture data, which provides precise recordings of real human movements. By learning from this data, the model can generate new motion sequences that closely mimic the style and dynamics of natural human motion.

In their experiments, the authors report that RecMoDiffuse achieves results comparable to state-of-the-art methods for human motion generation while producing coherent motion sequences and reducing computational overhead at inference time.

Critical Analysis

The authors present a compelling approach to human motion generation, leveraging the power of diffusion models and introducing a novel recurrent architecture to capture the temporal dynamics of human movement.

One potential limitation of the work is that the model is trained on a specific dataset of motion capture data, which may not cover the full range of human motion and movement styles. The authors acknowledge this and suggest that further research could investigate ways to improve the model's generalization capabilities.

Additionally, while the authors demonstrate impressive results in terms of realism and diversity of the generated motions, it would be interesting to see how the model performs in more real-world applications, such as generating motion for virtual characters in interactive environments or for use in animation production pipelines.

Overall, this paper represents an exciting step forward in the field of human motion generation, and the RecMoDiffuse model could serve as a valuable tool for researchers and practitioners working on creating more realistic and engaging animated content.

Conclusion

The RecMoDiffuse model introduced in this paper demonstrates the potential of using recurrent diffusion techniques for generating realistic and diverse human motion sequences. By leveraging the strengths of diffusion models and incorporating a recurrent architecture, the authors have developed a system that can create natural-looking animations of human movement.

This work opens up new possibilities for applications that require realistic human motion, such as virtual reality experiences, video games, and animated films. As the field of motion generation continues to advance, models like RecMoDiffuse could become increasingly important tools for creating immersive and believable digital environments.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Shape Conditioned Human Motion Generation with Diffusion Model

Kebing Xue, Hyewon Seo

Human motion synthesis is an important task in computer graphics and computer vision. While focusing on various conditioning signals such as text, action class, or audio to guide the generation process, most existing methods utilize skeleton-based pose representation, requiring additional skinning to produce renderable meshes. Given that human motion is a complex interplay of bones, joints, and muscles, considering solely the skeleton for generation may neglect their inherent interdependency, which can limit the variability and precision of the generated results. To address this issue, we propose a Shape-conditioned Motion Diffusion model (SMD), which enables the generation of motion sequences directly in mesh format, conditioned on a specified target mesh. In SMD, the input meshes are transformed into spectral coefficients using graph Laplacian, to efficiently represent meshes. Subsequently, we propose a Spectral-Temporal Autoencoder (STAE) to leverage cross-temporal dependencies within the spectral domain. Extensive experimental evaluations show that SMD not only produces vivid and realistic motions but also achieves competitive performance in text-to-motion and action-to-motion tasks when compared to state-of-the-art methods.

5/14/2024

cs.CV cs.GR

RoHM: Robust Human Motion Reconstruction via Diffusion

Siwei Zhang, Bharat Lal Bhatnagar, Yuanlu Xu, Alexander Winkler, Petr Kadlecek, Siyu Tang, Federica Bogo

We propose RoHM, an approach for robust 3D human motion reconstruction from monocular RGB(-D) videos in the presence of noise and occlusions. Most previous approaches either train neural networks to directly regress motion in 3D or learn data-driven motion priors and combine them with optimization at test time. The former do not recover globally coherent motion and fail under occlusions; the latter are time-consuming, prone to local minima, and require manual tuning. To overcome these shortcomings, we exploit the iterative, denoising nature of diffusion models. RoHM is a novel diffusion-based motion model that, conditioned on noisy and occluded input data, reconstructs complete, plausible motions in consistent global coordinates. Given the complexity of the problem -- requiring one to address different tasks (denoising and infilling) in different solution spaces (local and global motion) -- we decompose it into two sub-tasks and learn two models, one for global trajectory and one for local motion. To capture the correlations between the two, we then introduce a novel conditioning module, combining it with an iterative inference scheme. We apply RoHM to a variety of tasks -- from motion reconstruction and denoising to spatial and temporal infilling. Extensive experiments on three popular datasets show that our method outperforms state-of-the-art approaches qualitatively and quantitatively, while being faster at test time. The code is available at https://sanweiliti.github.io/ROHM/ROHM.html.

4/16/2024

cs.CV

StableMoFusion: Towards Robust and Efficient Diffusion-based Motion Generation Framework

Yiheng Huang, Hui Yang, Chuanchen Luo, Yuxi Wang, Shibiao Xu, Zhaoxiang Zhang, Man Zhang, Junran Peng

Thanks to the powerful generative capacity of diffusion models, recent years have witnessed rapid progress in human motion generation. Existing diffusion-based methods employ disparate network architectures and training strategies. The effect of the design of each component is still unclear. In addition, the iterative denoising process consumes considerable computational overhead, which is prohibitive for real-time scenarios such as virtual characters and humanoid robots. For this reason, we first conduct a comprehensive investigation into network architectures, training strategies, and inference processes. Based on the profound analysis, we tailor each component for efficient high-quality human motion generation. Despite the promising performance, the tailored model still suffers from foot skating, which is a ubiquitous issue in diffusion-based solutions. To eliminate footskate, we identify foot-ground contact and correct foot motions along the denoising process. By organically combining these well-designed components together, we present StableMoFusion, a robust and efficient framework for human motion generation. Extensive experimental results show that our StableMoFusion performs favorably against current state-of-the-art methods. Project page: https://h-y1heng.github.io/StableMoFusion-page/

5/10/2024

cs.CV cs.MM

On-the-fly Learning to Transfer Motion Style with Diffusion Models: A Semantic Guidance Approach

Lei Hu, Zihao Zhang, Yongjing Ye, Yiwen Xu, Shihong Xia

In recent years, the emergence of generative models has spurred development of human motion generation, among which the generation of stylized human motion has consistently been a focal point of research. The conventional approach for stylized human motion generation involves transferring the style from given style examples to new motions. Despite decades of research in human motion style transfer, it still faces three main challenges: 1) difficulties in decoupling the motion content and style; 2) generalization to unseen motion style; 3) requirements of dedicated motion style dataset. To address these issues, we propose an on-the-fly human motion style transfer learning method based on the diffusion model, which can learn a style transfer model in a few minutes of fine-tuning to transfer an unseen style to diverse content motions. The key idea of our method is to consider the denoising process of the diffusion model as a motion translation process that learns the difference between the style-neutral motion pair, thereby avoiding the challenge of style and content decoupling. Specifically, given an unseen style example, we first generate the corresponding neutral motion through the proposed Style-Neutral Motion Pair Generation module. We then add noise to the generated neutral motion and denoise it to be close to the style example to fine-tune the style transfer diffusion model. We only need one style example and a text-to-motion dataset with predominantly neutral motion (e.g. HumanML3D). The qualitative and quantitative evaluations demonstrate that our method can achieve state-of-the-art performance and has practical applications.

5/14/2024

cs.GR cs.CV