DiffPoseTalk: Speech-Driven Stylistic 3D Facial Animation and Head Pose Generation via Diffusion Models

arXiv: 2310.00434

Published 5/15/2024 by Zhiyao Sun, Tian Lv, Sheng Ye, Matthieu Lin, Jenny Sheng, Yu-Hui Wen, Minjing Yu, Yong-Jin Liu


Abstract

The generation of stylistic 3D facial animations driven by speech presents a significant challenge as it requires learning a many-to-many mapping between speech, style, and the corresponding natural facial motion. However, existing methods either employ a deterministic model for speech-to-motion mapping or encode the style using a one-hot encoding scheme. Notably, the one-hot encoding approach fails to capture the complexity of the style and thus limits generalization ability. In this paper, we propose DiffPoseTalk, a generative framework based on the diffusion model combined with a style encoder that extracts style embeddings from short reference videos. During inference, we employ classifier-free guidance to guide the generation process based on the speech and style. In particular, our style includes the generation of head poses, thereby enhancing user perception. Additionally, we address the shortage of scanned 3D talking face data by training our model on reconstructed 3DMM parameters from a high-quality, in-the-wild audio-visual dataset. Extensive experiments and user study demonstrate that our approach outperforms state-of-the-art methods. The code and dataset are at https://diffposetalk.github.io.


Overview

This paper presents a novel generative framework called DiffPoseTalk for producing stylistic 3D facial animations driven by speech.
Existing methods either use a deterministic model for mapping speech to motion or encode style using a one-hot encoding scheme, which fails to capture the complexity of the style and limits generalization.
DiffPoseTalk combines a diffusion model with a style encoder that extracts style embeddings from short reference videos, allowing for more expressive and generalizable style representation.
The model also generates head poses to enhance user perception, and is trained on reconstructed 3DMM parameters from a high-quality, in-the-wild audio-visual dataset to address the shortage of scanned 3D talking face data.

Plain English Explanation

Creating realistic 3D facial animations that match a person's speech and style is a challenging task. Previous methods have either used a deterministic model, which always maps the same speech to the same facial motion, or encoded the speaker's style as a one-hot label, meaning one fixed category per style seen in training. Both approaches are limited: the first cannot produce the natural variety of motion that can accompany the same words, and the second cannot describe styles outside its fixed set of labels.

The researchers behind this paper have developed a new system called DiffPoseTalk that takes a more sophisticated approach. Their system uses a generative "diffusion" model, which is trained on examples of people speaking, to generate new facial animations that match both the input speech and the style of a reference video. This allows the system to produce a wider range of natural-looking facial expressions and head movements that reflect the speaker's individual style.

To further enhance the realism, the DiffPoseTalk model also generates appropriate head poses, which can significantly impact how the animation is perceived by the viewer. And to overcome the lack of high-quality scanned 3D facial data available for training, the researchers fitted a 3D Morphable Model (3DMM), a parametric face model, to the frames of a large, real-world audio-visual dataset and trained on the reconstructed parameters.

Through extensive testing and a user study, the researchers have shown that their DiffPoseTalk system outperforms previous state-of-the-art methods in generating stylistic 3D facial animations from speech. This is an important step forward in creating more natural and expressive talking avatars, which could have applications in areas like virtual assistants, digital characters, and animated content.

Technical Explanation

The key innovation of the DiffPoseTalk framework is the combination of a diffusion model for generating the facial animations and a style encoder that extracts style embeddings from reference videos. The diffusion model is trained to generate facial motion parameters (i.e., 3DMM coefficients) conditioned on input speech and the extracted style embeddings.
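As a rough illustration of what training a conditional diffusion model involves, the sketch below shows the standard DDPM forward (noising) step applied to a motion sequence, plus random dropping of a condition so that the same network can later be queried with and without it. This is a generic sketch, not the authors' implementation: the noise schedule, the sequence shape (100 frames by 56 coefficients), the 128-dimensional style vector, and the use of a zero vector as the "dropped" condition are all assumptions made for illustration.

```python
import numpy as np

def make_alpha_bar(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative product of (1 - beta_t) for a linear noise schedule (assumed)."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    return np.cumprod(1.0 - betas)

def add_noise(x0, t, alpha_bar, rng):
    """Standard DDPM forward step: x_t = sqrt(a_t) * x0 + sqrt(1 - a_t) * noise."""
    noise = rng.standard_normal(x0.shape)
    x_t = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * noise
    return x_t, noise

def drop_condition(cond, drop_prob, rng):
    """Replace a condition with zeros with probability drop_prob.

    Using zeros as the 'null' condition is an assumption of this sketch.
    """
    if rng.random() < drop_prob:
        return np.zeros_like(cond)
    return cond

rng = np.random.default_rng(0)
alpha_bar = make_alpha_bar()

# Hypothetical shapes: 100 frames of 56 motion coefficients, 128-d style vector.
motion = rng.standard_normal((100, 56))
style = rng.standard_normal(128)

x_t, noise = add_noise(motion, t=500, alpha_bar=alpha_bar, rng=rng)
style_in = drop_condition(style, drop_prob=0.1, rng=rng)
# A denoising network would now be trained to recover the clean motion
# (or the noise) from x_t, the timestep, the speech features, and style_in.
```

The denoising network itself is omitted here; only the data preparation for one training step is shown.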

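During inference, the system uses a technique called "classifier-free guidance" to steer the generation process towards the desired speech and style. In classifier-free guidance, the model's conditional and unconditional predictions are combined at each denoising step, with a guidance scale controlling how strongly the output follows the conditions. This gives adjustable control over how closely the generated animation follows the input speech and the target style.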
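The guidance step can be sketched as follows for two conditions. This uses one common way of extending classifier-free guidance to multiple conditions (adding speech first, then style on top); whether DiffPoseTalk uses exactly this decomposition is an assumption, as are the function names, the guidance scales, the zero "null" conditions, and the toy denoiser, which stands in for a trained network.

```python
import numpy as np

def guided_prediction(denoise, x_t, speech, style, w_speech=1.5, w_style=1.5):
    """Combine unconditional and conditional predictions with guidance scales.

    denoise(x_t, speech, style) is a hypothetical model call. Zero arrays
    stand in for 'no condition' (an assumption of this sketch).
    """
    null_speech = np.zeros_like(speech)
    null_style = np.zeros_like(style)

    uncond = denoise(x_t, null_speech, null_style)   # no conditions
    speech_only = denoise(x_t, speech, null_style)   # speech, no style
    full = denoise(x_t, speech, style)               # speech and style

    return (uncond
            + w_speech * (speech_only - uncond)
            + w_style * (full - speech_only))

def toy_denoise(x_t, speech, style):
    """Stand-in for a trained network, only so the sketch runs end to end."""
    return 0.5 * x_t + speech.mean() + style.mean()

x_t = np.ones((4, 3))
speech = np.full(8, 2.0)
style = np.full(5, 4.0)

guided = guided_prediction(toy_denoise, x_t, speech, style,
                           w_speech=2.0, w_style=1.0)
```

With both scales set to 1 the expression reduces to the fully conditioned prediction, and with both set to 0 it reduces to the unconditional one; scales above 1 push the output further towards the condition.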

To address the lack of high-quality 3D talking face data, the researchers leveraged a large, in-the-wild audio-visual dataset and reconstructed 3D Morphable Model (3DMM) parameters from its 2D video frames. This provided a rich source of training data for the DiffPoseTalk model.
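To make "3DMM parameters" concrete, the sketch below shows the basic linear form of a morphable model: a face mesh is a mean shape plus weighted sums of shape and expression basis vectors, so a whole mesh is described by a short coefficient vector. This is the textbook linear formulation with random placeholder bases; practical models such as FLAME add further components (for example pose-dependent corrections and skinning), and the sizes used here are hypothetical.

```python
import numpy as np

def morphable_face(mean_shape, shape_basis, expr_basis, shape_coeffs, expr_coeffs):
    """Linear 3DMM: vertices = mean + shape_basis @ a + expr_basis @ b.

    mean_shape:  (V, 3) mean vertex positions
    shape_basis: (3V, Ks) identity basis
    expr_basis:  (3V, Ke) expression basis
    """
    offsets = shape_basis @ shape_coeffs + expr_basis @ expr_coeffs
    return (mean_shape.reshape(-1) + offsets).reshape(-1, 3)

rng = np.random.default_rng(0)
num_vertices, num_shape, num_expr = 50, 10, 5   # hypothetical sizes

mean_shape = rng.standard_normal((num_vertices, 3))
shape_basis = rng.standard_normal((num_vertices * 3, num_shape))
expr_basis = rng.standard_normal((num_vertices * 3, num_expr))

# One frame of an animation: fixed identity, per-frame expression coefficients.
shape_coeffs = rng.standard_normal(num_shape)
expr_coeffs = rng.standard_normal(num_expr)

vertices = morphable_face(mean_shape, shape_basis, expr_basis,
                          shape_coeffs, expr_coeffs)
```

Reconstruction from video amounts to estimating such coefficients for each frame, which is why a generator can work with a compact parameter sequence instead of raw meshes.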

The researchers conducted extensive experiments to evaluate the performance of their system, including comparisons to state-of-the-art speech-driven facial animation methods. They also carried out a user study to assess the perceptual quality of the generated animations. The results demonstrate that DiffPoseTalk outperforms existing approaches in terms of both objective metrics and subjective user evaluations.

Critical Analysis

One potential limitation of the DiffPoseTalk system is that it relies on the availability of a reference video to extract the style embeddings. In scenarios where such a reference is not available, the system's ability to generate stylistically appropriate facial animations may be compromised. The researchers acknowledge this and suggest that further work is needed to explore alternative ways of encoding style, such as using textual descriptions or other modalities.

Additionally, while the use of 3DMM reconstruction to augment the training data is a clever solution, the quality of the reconstructed 3D facial parameters may still be inferior to scanned 3D data. This could introduce artifacts or limit the realism of the generated animations, especially for highly expressive or nuanced facial movements.

It would be interesting to see how the DiffPoseTalk system performs on a wider range of speaking styles, accents, and cultural backgrounds. The current evaluation focuses primarily on a Western, English-speaking context, and the generalization to more diverse settings remains to be explored.

Overall, the DiffPoseTalk framework represents a significant advancement in the field of speech-driven facial animation, and the researchers have made a valuable contribution to the ongoing efforts to create more natural and expressive digital characters. As the technology continues to evolve, it will be important to address the remaining challenges and explore new frontiers in this exciting area of research.

Conclusion

The DiffPoseTalk system presented in this paper offers a novel solution to the challenge of generating stylistic 3D facial animations driven by speech. By combining a powerful diffusion model with a style encoder, the researchers have developed a framework that can capture the complexity of a speaker's unique style and generate more natural and expressive facial movements.

The inclusion of head pose generation and the use of 3DMM reconstruction to address the data scarcity further enhance the realism and quality of the output. The extensive experiments and user study demonstrate the superiority of DiffPoseTalk over existing state-of-the-art methods, making it a promising step forward in the pursuit of more natural and engaging digital characters.

As the field of speech-driven facial animation continues to evolve, the techniques presented in this paper, particularly reference-based style encoding and joint head pose generation, may inform future research and applications that rely on talking avatars.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers


Listen, Disentangle, and Control: Controllable Speech-Driven Talking Head Generation

Changpeng Cai, Guinan Guo, Jiao Li, Junhao Su, Chenghao He, Jing Xiao, Yuanxu Chen, Lei Dai, Feiyu Zhu

Most earlier investigations on talking face generation have focused on the synchronization of lip motion and speech content. However, human head pose and facial emotions are equally important characteristics of natural human faces. While audio-driven talking face generation has seen notable advancements, existing methods either overlook facial emotions or are limited to specific individuals and cannot be applied to arbitrary subjects. In this paper, we propose a one-shot Talking Head Generation framework (SPEAK) that distinguishes itself from general Talking Face Generation by enabling emotional and postural control. Specifically, we introduce the Inter-Reconstructed Feature Disentanglement (IRFD) method to decouple human facial features into three latent spaces. We then design a face editing module that modifies speech content and facial latent codes into a single latent space. Subsequently, we present a novel generator that employs modified latent codes derived from the editing module to regulate emotional expression, head poses, and speech content in synthesizing facial animations. Extensive trials demonstrate that our method can generate realistic talking head with coordinated lip motions, authentic facial emotions, and smooth head movements. The demo video is available at the anonymous link: https://anonymous.4open.science/r/SPEAK-F56E

5/14/2024

cs.CV


AdaMesh: Personalized Facial Expressions and Head Poses for Adaptive Speech-Driven 3D Facial Animation

Liyang Chen, Weihong Bao, Shun Lei, Boshi Tang, Zhiyong Wu, Shiyin Kang, Haozhi Huang, Helen Meng

Speech-driven 3D facial animation aims at generating facial movements that are synchronized with the driving speech, which has been widely explored recently. Existing works mostly neglect the person-specific talking style in generation, including facial expression and head pose styles. Several works intend to capture the personalities by fine-tuning modules. However, limited training data leads to the lack of vividness. In this work, we propose AdaMesh, a novel adaptive speech-driven facial animation approach, which learns the personalized talking style from a reference video of about 10 seconds and generates vivid facial expressions and head poses. Specifically, we propose mixture-of-low-rank adaptation (MoLoRA) to fine-tune the expression adapter, which efficiently captures the facial expression style. For the personalized pose style, we propose a pose adapter by building a discrete pose prior and retrieving the appropriate style embedding with a semantic-aware pose style matrix without fine-tuning. Extensive experimental results show that our approach outperforms state-of-the-art methods, preserves the talking style in the reference video, and generates vivid facial animation. The supplementary video and code will be available at https://adamesh.github.io.

6/21/2024

cs.CV cs.MM


On-the-fly Learning to Transfer Motion Style with Diffusion Models: A Semantic Guidance Approach

Lei Hu, Zihao Zhang, Yongjing Ye, Yiwen Xu, Shihong Xia

In recent years, the emergence of generative models has spurred development of human motion generation, among which the generation of stylized human motion has consistently been a focal point of research. The conventional approach for stylized human motion generation involves transferring the style from given style examples to new motions. Despite decades of research in human motion style transfer, it still faces three main challenges: 1) difficulties in decoupling the motion content and style; 2) generalization to unseen motion style; 3) requirements of dedicated motion style dataset. To address these issues, we propose an on-the-fly human motion style transfer learning method based on the diffusion model, which can learn a style transfer model in a few minutes of fine-tuning to transfer an unseen style to diverse content motions. The key idea of our method is to consider the denoising process of the diffusion model as a motion translation process that learns the difference between the style-neutral motion pair, thereby avoiding the challenge of style and content decoupling. Specifically, given an unseen style example, we first generate the corresponding neutral motion through the proposed Style-Neutral Motion Pair Generation module. We then add noise to the generated neutral motion and denoise it to be close to the style example to fine-tune the style transfer diffusion model. We only need one style example and a text-to-motion dataset with predominantly neutral motion (e.g. HumanML3D). The qualitative and quantitative evaluations demonstrate that our method can achieve state-of-the-art performance and has practical applications.

5/14/2024

cs.GR cs.CV

Talk3D: High-Fidelity Talking Portrait Synthesis via Personalized 3D Generative Prior

Jaehoon Ko, Kyusun Cho, Joungbin Lee, Heeji Yoon, Sangmin Lee, Sangjun Ahn, Seungryong Kim

Recent methods for audio-driven talking head synthesis often optimize neural radiance fields (NeRF) on a monocular talking portrait video, leveraging its capability to render high-fidelity and 3D-consistent novel-view frames. However, they often struggle to reconstruct complete face geometry due to the absence of comprehensive 3D information in the input monocular videos. In this paper, we introduce a novel audio-driven talking head synthesis framework, called Talk3D, that can faithfully reconstruct its plausible facial geometries by effectively adopting the pre-trained 3D-aware generative prior. Given the personalized 3D generative model, we present a novel audio-guided attention U-Net architecture that predicts the dynamic face variations in the NeRF space driven by audio. Furthermore, our model is further modulated by audio-unrelated conditioning tokens which effectively disentangle variations unrelated to audio features. Compared to existing methods, our method excels in generating realistic facial geometries even under extreme head poses. We also conduct extensive experiments showing our approach surpasses state-of-the-art benchmarks in terms of both quantitative and qualitative evaluations.

4/1/2024

cs.CV