On-the-fly Learning to Transfer Motion Style with Diffusion Models: A Semantic Guidance Approach

2405.06646

Published 5/14/2024 by Lei Hu, Zihao Zhang, Yongjing Ye, Yiwen Xu, Shihong Xia

Abstract

In recent years, the emergence of generative models has spurred development of human motion generation, among which the generation of stylized human motion has consistently been a focal point of research. The conventional approach for stylized human motion generation involves transferring the style from given style examples to new motions. Despite decades of research in human motion style transfer, it still faces three main challenges: 1) difficulties in decoupling the motion content and style; 2) generalization to unseen motion styles; and 3) the requirement of a dedicated motion style dataset. To address these issues, we propose an on-the-fly human motion style transfer learning method based on the diffusion model, which can learn a style transfer model in a few minutes of fine-tuning to transfer an unseen style to diverse content motions. The key idea of our method is to consider the denoising process of the diffusion model as a motion translation process that learns the difference between the style-neutral motion pair, thereby avoiding the challenge of style and content decoupling. Specifically, given an unseen style example, we first generate the corresponding neutral motion through the proposed Style-Neutral Motion Pair Generation module. We then add noise to the generated neutral motion and denoise it to be close to the style example to fine-tune the style transfer diffusion model. We only need one style example and a text-to-motion dataset with predominantly neutral motion (e.g. HumanML3D). The qualitative and quantitative evaluations demonstrate that our method can achieve state-of-the-art performance and has practical applications.

Overview

Researchers have developed a new method for transferring motion styles to diverse human motions using a diffusion model.
The key innovation is treating the denoising process of the diffusion model as a motion translation process, which avoids the challenge of decoupling motion content and style.
This approach only requires a single style example and a text-to-motion dataset with predominantly neutral motions, making it more practical than previous motion style transfer methods.

Plain English Explanation

Generating human motions with a specific style, like someone walking with swagger or dancing gracefully, has been a major focus of research in recent years. The typical approach has been to take example motions with a certain style and try to apply that style to new motions. However, this has faced several challenges:

It's difficult to separate the actual movement (the "content") from the style.
The methods don't work well when trying to apply a style that hasn't been seen before.
Collecting a comprehensive dataset of different motion styles is very time-consuming.

The researchers propose a new method that sidesteps these issues by using a diffusion model. The key insight is that the process of "denoising" the motion in a diffusion model can be seen as translating the motion from a neutral, style-free version to the desired styled version.

So, given just a single example of the desired motion style, the method can quickly learn to apply that style to any new motion, without needing a large dataset of labeled motion styles. This makes it much more practical and flexible than previous approaches.

Technical Explanation

The researchers' method uses a diffusion model to learn a style transfer model that can apply an unseen motion style to diverse content motions. The key innovation is in how they formulate the style transfer problem.

Rather than trying to explicitly model the separation between motion content and style, they treat the denoising process of the diffusion model as a motion translation task. Specifically, given an unseen style example, they first generate the corresponding neutral motion using a Style-Neutral Motion Pair Generation module. They then add noise to this neutral motion and fine-tune the diffusion model to denoise it toward the style example, so that the model learns the difference between the neutral and the stylized motion.
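The noise-then-denoise idea can be illustrated with a small sketch. This is not the authors' implementation: the linear noise schedule, the motion shape (60 frames by 66 features), the mean-squared-error objective on the denoised motion, and the names `make_alpha_bar`, `add_noise`, and `translation_loss` are all assumptions made for illustration, and the denoiser is a placeholder function rather than a trained network.

```python
import numpy as np

def make_alpha_bar(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative product of (1 - beta_t) for a linear noise schedule (assumed)."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    return np.cumprod(1.0 - betas)

def add_noise(x0, t, alpha_bar, rng):
    """Standard forward diffusion: x_t = sqrt(a_t) * x0 + sqrt(1 - a_t) * eps."""
    eps = rng.standard_normal(x0.shape)
    return np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps

def translation_loss(denoiser, neutral, styled, t, alpha_bar, rng):
    """Noise the neutral motion, denoise it, and compare with the style example.

    The MSE objective is an assumption; the summary does not state the loss.
    """
    noisy_neutral = add_noise(neutral, t, alpha_bar, rng)
    predicted = denoiser(noisy_neutral, t)
    return float(np.mean((predicted - styled) ** 2))

# Toy style-neutral pair with hypothetical shapes: 60 frames, 66 features each.
rng = np.random.default_rng(0)
neutral = rng.standard_normal((60, 66))
styled = neutral + 0.5  # stand-in for the stylized version of the same motion
alpha_bar = make_alpha_bar()

# Placeholder denoiser that returns its input unchanged (no learning yet).
identity_denoiser = lambda x, t: x
loss = translation_loss(identity_denoiser, neutral, styled,
                        t=100, alpha_bar=alpha_bar, rng=rng)
print(f"loss before fine-tuning: {loss:.4f}")
```

In a real setup the placeholder would be a pretrained motion diffusion network, and fine-tuning would minimize a loss of this kind over sampled timesteps so that denoising a noised neutral motion yields the stylized one.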

This approach only requires a single style example and a text-to-motion dataset with predominantly neutral motions (like HumanML3D), making it much more practical than previous motion style transfer methods that needed large, annotated datasets.

The researchers evaluate their method qualitatively and quantitatively, showing that it achieves state-of-the-art performance on motion style transfer tasks.

Critical Analysis

The researchers demonstrate a clever and effective solution to the challenging problem of motion style transfer. By reframing the task as a motion translation problem within a diffusion model framework, they are able to sidestep the difficulties of explicitly modeling content and style.

That said, the method does rely on the availability of a text-to-motion dataset with predominantly neutral motions. While the researchers show it works well with HumanML3D, the generalization to other datasets or real-world motion capture data is not explored. There may also be limitations in the types of styles that can be effectively transferred using this approach.

Additionally, although the abstract states that fine-tuning takes only a few minutes, inference speed and overall computational cost are not covered here. Both would be important considerations for practical applications such as character animation.

Overall, this is a promising new direction for motion style transfer that addresses several key challenges in the field. Further research is needed to fully understand the capabilities and limitations of this diffusion-based approach.

Conclusion

The researchers have developed an innovative method for human motion style transfer that sidesteps three traditional challenges: decoupling motion content from style, generalizing to unseen styles, and the need for a dedicated motion style dataset.

By reframing the problem as a motion translation task within a diffusion model framework, their approach can learn to apply a new motion style using just a single example, making it much more practical than previous methods. The promising results demonstrate the potential of this technique to enable more flexible and accessible motion style transfer for a variety of applications.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Shape Conditioned Human Motion Generation with Diffusion Model

Kebing Xue, Hyewon Seo

Human motion synthesis is an important task in computer graphics and computer vision. While focusing on various conditioning signals such as text, action class, or audio to guide the generation process, most existing methods utilize skeleton-based pose representation, requiring additional skinning to produce renderable meshes. Given that human motion is a complex interplay of bones, joints, and muscles, considering solely the skeleton for generation may neglect their inherent interdependency, which can limit the variability and precision of the generated results. To address this issue, we propose a Shape-conditioned Motion Diffusion model (SMD), which enables the generation of motion sequences directly in mesh format, conditioned on a specified target mesh. In SMD, the input meshes are transformed into spectral coefficients using graph Laplacian, to efficiently represent meshes. Subsequently, we propose a Spectral-Temporal Autoencoder (STAE) to leverage cross-temporal dependencies within the spectral domain. Extensive experimental evaluations show that SMD not only produces vivid and realistic motions but also achieves competitive performance in text-to-motion and action-to-motion tasks when compared to state-of-the-art methods.

5/14/2024

cs.CV cs.GR

SMCD: High Realism Motion Style Transfer via Mamba-based Diffusion

Ziyun Qian, Zeyu Xiao, Zhenyi Wu, Dingkang Yang, Mingcheng Li, Shunli Wang, Shuaibing Wang, Dongliang Kou, Lihua Zhang

Motion style transfer is a significant research direction in multimedia applications. It enables the rapid switching of different styles of the same motion for virtual digital humans, thus vastly increasing the diversity and realism of movements. It is widely applied in multimedia scenarios such as movies, games, and the Metaverse. However, most of the current work in this field adopts the GAN, which may lead to instability and convergence issues, making the final generated motion sequence somewhat chaotic and unable to reflect a highly realistic and natural style. To address these problems, we consider style motion as a condition and propose the Style Motion Conditioned Diffusion (SMCD) framework for the first time, which can more comprehensively learn the style features of motion. Moreover, we apply Mamba model for the first time in the motion style transfer field, introducing the Motion Style Mamba (MSM) module to handle longer motion sequences. Thirdly, aiming at the SMCD framework, we propose Diffusion-based Content Consistency Loss and Content Consistency Loss to assist the overall framework's training. Finally, we conduct extensive experiments. The results reveal that our method surpasses state-of-the-art methods in both qualitative and quantitative comparisons, capable of generating more realistic motion sequences.

5/7/2024

cs.CV

DiffPoseTalk: Speech-Driven Stylistic 3D Facial Animation and Head Pose Generation via Diffusion Models

Zhiyao Sun, Tian Lv, Sheng Ye, Matthieu Lin, Jenny Sheng, Yu-Hui Wen, Minjing Yu, Yong-Jin Liu

The generation of stylistic 3D facial animations driven by speech presents a significant challenge as it requires learning a many-to-many mapping between speech, style, and the corresponding natural facial motion. However, existing methods either employ a deterministic model for speech-to-motion mapping or encode the style using a one-hot encoding scheme. Notably, the one-hot encoding approach fails to capture the complexity of the style and thus limits generalization ability. In this paper, we propose DiffPoseTalk, a generative framework based on the diffusion model combined with a style encoder that extracts style embeddings from short reference videos. During inference, we employ classifier-free guidance to guide the generation process based on the speech and style. In particular, our style includes the generation of head poses, thereby enhancing user perception. Additionally, we address the shortage of scanned 3D talking face data by training our model on reconstructed 3DMM parameters from a high-quality, in-the-wild audio-visual dataset. Extensive experiments and user study demonstrate that our approach outperforms state-of-the-art methods. The code and dataset are at https://diffposetalk.github.io .

5/15/2024

cs.CV cs.GR

Video Diffusion Models are Training-free Motion Interpreter and Controller

Zeqi Xiao, Yifan Zhou, Shuai Yang, Xingang Pan

Video generation primarily aims to model authentic and customized motion across frames, making understanding and controlling the motion a crucial topic. Most diffusion-based studies on video motion focus on motion customization with training-based paradigms, which, however, demands substantial training resources and necessitates retraining for diverse models. Crucially, these approaches do not explore how video diffusion models encode cross-frame motion information in their features, lacking interpretability and transparency in their effectiveness. To answer this question, this paper introduces a novel perspective to understand, localize, and manipulate motion-aware features in video diffusion models. Through analysis using Principal Component Analysis (PCA), our work discloses that robust motion-aware feature already exists in video diffusion models. We present a new MOtion FeaTure (MOFT) by eliminating content correlation information and filtering motion channels. MOFT provides a distinct set of benefits, including the ability to encode comprehensive motion information with clear interpretability, extraction without the need for training, and generalizability across diverse architectures. Leveraging MOFT, we propose a novel training-free video motion control framework. Our method demonstrates competitive performance in generating natural and faithful motion, providing architecture-agnostic insights and applicability in a variety of downstream tasks.

5/24/2024

cs.CV