On-the-fly Learning to Transfer Motion Style with Diffusion Models: A Semantic Guidance Approach

Read original: arXiv:2405.06646 - Published 8/9/2024 by Lei Hu, Zihao Zhang, Yongjing Ye, Yiwen Xu, Shihong Xia

Overview

Researchers have developed a new method for transferring motion styles to diverse human motions using a diffusion model.
The key innovation is treating the denoising process of the diffusion model as a motion translation process, which avoids the challenge of decoupling motion content and style.
This approach only requires a single style example and a text-to-motion dataset with predominantly neutral motions, making it more practical than previous motion style transfer methods.

Plain English Explanation

Generating human motions with a specific style, like someone walking with swagger or dancing gracefully, has been a major focus of research in recent years. The typical approach is to take example motions with a certain style and apply that style to new motions. However, this approach faces several challenges:

It's difficult to separate the actual movement (the "content") from the style.
The methods don't work well when trying to apply a style that hasn't been seen before.
Collecting a comprehensive dataset of different motion styles is very time-consuming.

The researchers propose a new method that sidesteps these issues by using a diffusion model. The key insight is that the process of "denoising" the motion in a diffusion model can be seen as translating the motion from a neutral, style-free version to the desired styled version.

So, given just a single example of the desired motion style, the method can quickly learn to apply that style to any new motion, without needing a large dataset of labeled motion styles. This makes it much more practical and flexible than previous approaches.

Technical Explanation

The method fine-tunes a diffusion model into a style transfer model that can apply a previously unseen motion style to diverse content motions. The key innovation is in how the style transfer problem is formulated.

Rather than explicitly modeling the separation between motion content and style, they treat the denoising process of the diffusion model as a motion translation task. Specifically, given a style example, they first generate a corresponding "neutral" motion using a Style-Neutral Motion Pair Generation module. They then add noise to this neutral motion and train the diffusion model to denoise it toward the provided style example, so that denoising effectively translates neutral motion into styled motion.
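The noise-then-denoise step described above can be illustrated with a minimal sketch. It uses the standard DDPM forward process, x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps, to add noise to a neutral motion, then scores a denoiser by how close its output is to the style example. Everything here is an assumption made for illustration: the linear noise schedule, the mean-squared-error objective, the toy motion shapes, and the names `make_alpha_bar`, `add_noise`, and `translation_loss` are hypothetical and are not taken from the paper, which may use a different parameterization and additional loss terms.

```python
import numpy as np


def make_alpha_bar(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative product of (1 - beta_t) for a linear noise schedule (assumed)."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    return np.cumprod(1.0 - betas)


def add_noise(x0, t, alpha_bar, rng):
    """Sample x_t from the standard DDPM forward process q(x_t | x_0)."""
    eps = rng.standard_normal(x0.shape)
    return np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps


def translation_loss(denoiser, neutral_motion, style_motion, t, alpha_bar, rng):
    """MSE between the denoised (noised) neutral motion and the style example.

    Hypothetical objective: the denoiser is asked to map a noised neutral
    motion onto the styled motion rather than back onto the neutral one.
    """
    noisy_neutral = add_noise(neutral_motion, t, alpha_bar, rng)
    predicted = denoiser(noisy_neutral, t)
    return float(np.mean((predicted - style_motion) ** 2))


rng = np.random.default_rng(0)
alpha_bar = make_alpha_bar()

# Toy stand-ins for a style-neutral motion pair: 60 frames x 8 features.
neutral = rng.standard_normal((60, 8))
style = neutral + 0.5  # placeholder for "the same content, stylized"

# Two placeholder denoisers: one that does nothing, one that is perfect.
identity_denoiser = lambda x_t, t: x_t
oracle_denoiser = lambda x_t, t: style

loss_identity = translation_loss(identity_denoiser, neutral, style, 500, alpha_bar, rng)
loss_oracle = translation_loss(oracle_denoiser, neutral, style, 500, alpha_bar, rng)
print(loss_identity, loss_oracle)
```

In a real setting the placeholder denoisers would be replaced by the pre-trained text-to-motion diffusion network, and fine-tuning would minimize a loss of this kind over randomly sampled timesteps; the sketch only shows how the neutral motion, the noise, and the style example relate to one another.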

This approach only requires a single style example and a text-to-motion dataset with predominantly neutral motions (like HumanML3D), making it much more practical than previous motion style transfer methods that needed large, annotated datasets.

The researchers evaluate their method qualitatively and quantitatively, showing that it achieves state-of-the-art performance on motion style transfer tasks.

Critical Analysis

The researchers demonstrate a clever and effective solution to the challenging problem of motion style transfer. By reframing the task as a motion translation problem within a diffusion model framework, they are able to sidestep the difficulties of explicitly modeling content and style.

That said, the method does rely on the availability of a text-to-motion dataset with predominantly neutral motions. While the researchers show it works well with HumanML3D, the generalization to other datasets or real-world motion capture data is not explored. There may also be limitations in the types of styles that can be effectively transferred using this approach.

Additionally, the researchers do not discuss the computational efficiency or inference speed of their method, which would be an important consideration for practical applications such as character animation.

Overall, this is a promising new direction for motion style transfer that addresses several key challenges in the field. Further research is needed to fully understand the capabilities and limitations of this diffusion-based approach.

Conclusion

The researchers have developed an innovative method for human motion style transfer that addresses three traditional challenges: decoupling motion content from style, generalizing to unseen styles, and the need for large annotated style datasets.

By reframing the problem as a motion translation task within a diffusion model framework, their approach can learn to apply a new motion style using just a single example, making it much more practical than previous methods. The promising results demonstrate the potential of this technique to enable more flexible and accessible motion style transfer for a variety of applications.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

On-the-fly Learning to Transfer Motion Style with Diffusion Models: A Semantic Guidance Approach

Lei Hu, Zihao Zhang, Yongjing Ye, Yiwen Xu, Shihong Xia

3D human motion style transfer is a fundamental problem in computer graphics and animation processing. Existing AdaIN-based methods necessitate datasets with balanced style distribution and content/style labels to train the clustered latent space. However, in practical scenarios we may encounter only a single unseen style example, which is not enough to constitute a style cluster for AdaIN-based methods. Therefore, in this paper, we propose a novel two-stage framework for few-shot style transfer learning based on the diffusion model. Specifically, in the first stage, we pre-train a diffusion-based text-to-motion model as a generative prior so that it can cope with various content motion inputs. In the second stage, based on the single style example, we fine-tune the pre-trained diffusion model in a few-shot manner to make it capable of style transfer. The key idea is to regard the reverse process of diffusion as a motion-style translation process, since motion styles can be viewed as special motion variations. During the fine-tuning for style transfer, a simple yet effective semantic-guided style transfer loss, coordinated with a style example reconstruction loss, is introduced to supervise the style transfer in CLIP semantic space. The qualitative and quantitative evaluations demonstrate that our method can achieve state-of-the-art performance and has practical applications.

8/9/2024

SMooDi: Stylized Motion Diffusion Model

Lei Zhong, Yiming Xie, Varun Jampani, Deqing Sun, Huaizu Jiang

We introduce a novel Stylized Motion Diffusion model, dubbed SMooDi, to generate stylized motion driven by content texts and style motion sequences. Unlike existing methods that either generate motion of various content or transfer style from one sequence to another, SMooDi can rapidly generate motion across a broad range of content and diverse styles. To this end, we tailor a pre-trained text-to-motion model for stylization. Specifically, we propose style guidance to ensure that the generated motion closely matches the reference style, alongside a lightweight style adaptor that directs the motion towards the desired style while ensuring realism. Experiments across various applications demonstrate that our proposed framework outperforms existing methods in stylized motion generation.

7/18/2024

Local Action-Guided Motion Diffusion Model for Text-to-Motion Generation

Peng Jin, Hao Li, Zesen Cheng, Kehan Li, Runyi Yu, Chang Liu, Xiangyang Ji, Li Yuan, Jie Chen

Text-to-motion generation requires not only grounding local actions in language but also seamlessly blending these individual actions to synthesize diverse and realistic global motions. However, existing motion generation methods primarily focus on the direct synthesis of global motions while neglecting the importance of generating and controlling local actions. In this paper, we propose the local action-guided motion diffusion model, which facilitates global motion generation by utilizing local actions as fine-grained control signals. Specifically, we provide an automated method for reference local action sampling and leverage graph attention networks to assess the guiding weight of each local action in the overall motion synthesis. During the diffusion process for synthesizing global motion, we calculate the local-action gradient to provide conditional guidance. This local-to-global paradigm reduces the complexity associated with direct global motion generation and promotes motion diversity via sampling diverse actions as conditions. Extensive experiments on two human motion datasets, i.e., HumanML3D and KIT, demonstrate the effectiveness of our method. Furthermore, our method provides flexibility in seamlessly combining various local actions and continuous guiding weight adjustment, accommodating diverse user preferences, which may hold potential significance for the community. The project page is available at https://jpthu17.github.io/GuidedMotion-project/.

7/16/2024

FreeStyle: Free Lunch for Text-guided Style Transfer using Diffusion Models

Feihong He, Gang Li, Mengyuan Zhang, Leilei Yan, Lingyu Si, Fanzhang Li, Li Shen

The rapid development of generative diffusion models has significantly advanced the field of style transfer. However, most current style transfer methods based on diffusion models typically involve a slow iterative optimization process, e.g., model fine-tuning and textual inversion of style concept. In this paper, we introduce FreeStyle, an innovative style transfer method built upon a pre-trained large diffusion model, requiring no further optimization. Besides, our method enables style transfer only through a text description of the desired style, eliminating the necessity of style images. Specifically, we propose a dual-stream encoder and single-stream decoder architecture, replacing the conventional U-Net in diffusion models. In the dual-stream encoder, two distinct branches take the content image and style text prompt as inputs, achieving content and style decoupling. In the decoder, we further modulate features from the dual streams based on a given content image and the corresponding style text prompt for precise style transfer. Our experimental results demonstrate high-quality synthesis and fidelity of our method across various content images and style text prompts. Compared with state-of-the-art methods that require training, our FreeStyle approach notably reduces the computational burden by thousands of iterations, while achieving comparable or superior performance across multiple evaluation metrics including CLIP Aesthetic Score, CLIP Score, and Preference. We have released the code anonymously at: https://anonymous.4open.science/r/FreeStyleAnonymous-0F9B

7/19/2024