AdaMesh: Personalized Facial Expressions and Head Poses for Adaptive Speech-Driven 3D Facial Animation

2310.07236

Published 6/21/2024 by Liyang Chen, Weihong Bao, Shun Lei, Boshi Tang, Zhiyong Wu, Shiyin Kang, Haozhi Huang, Helen Meng

cs.CV cs.MM

Abstract

Speech-driven 3D facial animation aims at generating facial movements that are synchronized with the driving speech, which has been widely explored recently. Existing works mostly neglect the person-specific talking style in generation, including facial expression and head pose styles. Several works intend to capture the personalities by fine-tuning modules. However, limited training data leads to the lack of vividness. In this work, we propose AdaMesh, a novel adaptive speech-driven facial animation approach, which learns the personalized talking style from a reference video of about 10 seconds and generates vivid facial expressions and head poses. Specifically, we propose mixture-of-low-rank adaptation (MoLoRA) to fine-tune the expression adapter, which efficiently captures the facial expression style. For the personalized pose style, we propose a pose adapter by building a discrete pose prior and retrieving the appropriate style embedding with a semantic-aware pose style matrix without fine-tuning. Extensive experimental results show that our approach outperforms state-of-the-art methods, preserves the talking style in the reference video, and generates vivid facial animation. The supplementary video and code will be available at https://adamesh.github.io.

Overview

The paper proposes a novel approach called AdaMesh for speech-driven 3D facial animation that can learn and preserve the personalized talking style from a reference video.
It introduces two key components: Mixture-of-Low-Rank Adaptation (MoLoRA) to capture the facial expression style, and a pose adapter that leverages a semantic-aware pose style matrix to preserve the head pose style.
The approach aims to generate vivid and personalized facial animations that are synchronized with the driving speech, overcoming limitations of existing methods that often neglect person-specific talking styles.

Plain English Explanation

Speech-driven 3D facial animation is the process of generating realistic facial movements that match the rhythm and tone of a spoken audio input. This is a key technology for creating lifelike virtual characters and avatars.

Existing approaches often struggle to capture the unique talking style of an individual, including their facial expressions and head movements. While some methods try to fine-tune the model to a person's style, the limited training data available can result in a lack of natural, vivid animations.

The AdaMesh method proposed in this paper aims to solve this problem. It learns a person's talking style from a reference video of about 10 seconds and then uses that style to generate personalized facial animations that stay synchronized with the speech.

The key innovations are:

Mixture-of-Low-Rank Adaptation (MoLoRA): This adapts the facial expression generation to match the person's unique expression style.
Pose Adapter: This captures the person's head pose patterns by retrieving a style embedding from a semantic-aware pose style matrix, with no fine-tuning required.

By combining these techniques, AdaMesh can create vivid, lifelike 3D facial animations that preserve the talking style of the individual, going beyond what previous speech-driven animation approaches have been able to achieve.

Technical Explanation

The AdaMesh approach consists of two main components:

Mixture-of-Low-Rank Adaptation (MoLoRA) for Expression Style: The paper proposes this technique to efficiently fine-tune the facial expression generation to match the style observed in a reference video of the target speaker. MoLoRA leverages a mixture of low-rank matrices to capture the person-specific expression patterns, requiring only a small amount of adaptation data.
Pose Adapter for Head Pose Style: To preserve the individual's head pose style, the authors build a discrete pose prior and retrieve the appropriate style embedding using a semantic-aware pose style matrix. This allows the model to generate personalized head movements without fine-tuning.
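
The low-rank adaptation idea behind MoLoRA can be sketched as follows. This is a minimal illustration in plain Python, not the authors' implementation. It assumes that the "mixture" is a sum of several LoRA branches of different ranks added to one frozen weight matrix; the class name MoLoRALinear, the chosen ranks, and the initialization are hypothetical.

```python
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def add(a, b):
    """Element-wise sum of two equally shaped matrices."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def transpose(m):
    return [list(col) for col in zip(*m)]


class MoLoRALinear:
    """Frozen linear layer plus a sum of low-rank updates (hypothetical sketch)."""

    def __init__(self, weight, ranks, seed=0):
        rng = random.Random(seed)
        self.weight = weight  # frozen base weight, shape (d_out, d_in)
        d_out, d_in = len(weight), len(weight[0])
        self.branches = []
        for r in ranks:
            # A has shape (r, d_in) with small random values.
            a = [[rng.gauss(0.0, 0.01) for _ in range(d_in)] for _ in range(r)]
            # B has shape (d_out, r) and starts at zero, so the adapted
            # layer initially behaves exactly like the frozen one.
            b = [[0.0] * r for _ in range(d_out)]
            self.branches.append((a, b))

    def delta(self):
        """Sum of the per-branch updates B @ A, shape (d_out, d_in)."""
        d_out, d_in = len(self.weight), len(self.weight[0])
        total = [[0.0] * d_in for _ in range(d_out)]
        for a, b in self.branches:
            total = add(total, matmul(b, a))
        return total

    def forward(self, x):
        """Apply the adapted layer to x of shape (n, d_in)."""
        w_eff = add(self.weight, self.delta())
        return matmul(x, transpose(w_eff))

    def num_trainable(self):
        """Count adapter parameters; only these would be fine-tuned."""
        return sum(len(a) * len(a[0]) + len(b) * len(b[0]) for a, b in self.branches)


# Example: a 32x32 identity weight adapted by branches of rank 1, 2, and 4.
weight = [[float(i == j) for j in range(32)] for i in range(32)]
layer = MoLoRALinear(weight, ranks=(1, 2, 4))
x = [[1.0] * 32]
y = layer.forward(x)
```

In this sketch only the A and B matrices would be updated during adaptation, which is why the number of trainable values (`layer.num_trainable()`) stays well below the size of the full weight matrix. That is the general reason low-rank adaptation suits short reference clips.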

The overall AdaMesh architecture takes in speech audio as input and generates 3D facial animations as output. It consists of a speech encoder, an expression adapter, a pose adapter, and a rendering module. The expression adapter uses MoLoRA to adapt the expression generation, while the pose adapter retrieves the appropriate pose style embedding to drive the head movements.
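
The two pose-adapter ideas mentioned above, a discrete pose prior and retrieval of a style embedding, can be illustrated with a small sketch. It rests on loud assumptions: the discrete prior is modeled as a nearest-neighbor codebook lookup, and retrieval is modeled as cosine similarity between a speech-semantic query and one key per row of the style matrix. The summary does not describe how the paper builds its matrix, so every function name and all data here are hypothetical.

```python
import math


def quantize(pose, codebook):
    """Return the index of the codebook entry closest to `pose` (squared L2)."""
    return min(
        range(len(codebook)),
        key=lambda i: sum((p - c) ** 2 for p, c in zip(pose, codebook[i])),
    )


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def retrieve_style(query, style_keys, style_matrix):
    """Pick the style-matrix row whose key best matches the query.

    No parameters are updated here, so this step involves no fine-tuning.
    """
    best = max(range(len(style_keys)), key=lambda i: cosine(query, style_keys[i]))
    return best, style_matrix[best]


# Hypothetical codebook of head poses (pitch, yaw, roll in degrees).
codebook = [[0.0, 0.0, 0.0], [10.0, 0.0, 0.0], [0.0, 15.0, 0.0]]
idx = quantize([8.0, 1.0, 0.0], codebook)

# Hypothetical semantic keys and the style embeddings stored for them.
style_keys = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
style_matrix = [[0.1, 0.2], [0.9, 0.8], [0.5, 0.5]]
best, style = retrieve_style([0.1, 0.9], style_keys, style_matrix)
```

The design point the sketch tries to convey is that personalization of pose happens through lookup rather than gradient updates: the reference video only determines which stored embedding is selected.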

Extensive experiments show that AdaMesh outperforms state-of-the-art speech-driven facial animation methods in preserving the talking style from the reference video and generating vivid, natural-looking facial animations.

Critical Analysis

The AdaMesh approach presents a promising solution to the challenge of capturing person-specific talking styles in speech-driven 3D facial animation. By introducing the MoLoRA and pose adapter components, the method is able to effectively adapt to an individual's unique expression and head pose patterns using only a short reference video.

One potential limitation is that the approach relies on having a reference video of the target speaker, which may not always be available. Further research is also needed to explore how AdaMesh performs on more diverse datasets and speaking styles.

It would also be interesting to see how this approach could be extended to other aspects of speech-driven animation, such as lip synchronization and emotional expressiveness. Integrating AdaMesh with techniques for learning expressive talking face representations could further enhance the realism and personalization of the generated animations.

Overall, the AdaMesh method represents a significant advancement in the field of speech-driven 3D facial animation, and the researchers' focus on preserving individual talking styles is a valuable contribution that could have important implications for applications in virtual communication, entertainment, and beyond.

Conclusion

The AdaMesh paper presents a novel approach for speech-driven 3D facial animation that can learn and preserve the personalized talking style of an individual from a short reference video. By introducing Mixture-of-Low-Rank Adaptation (MoLoRA) for capturing expression style and a pose adapter for head pose style, the method is able to generate vivid and natural-looking facial animations that closely match the unique characteristics of the target speaker.

This work addresses an important limitation of existing speech-driven animation techniques, which often struggle to maintain the personality and expressiveness of the speaker. AdaMesh's ability to adapt to individual talking styles while still synchronizing the animations with the input speech is a significant step forward in this field.

As virtual communication, gaming, and other applications continue to rely on lifelike, personalized virtual characters, the AdaMesh approach could play a crucial role in making these experiences more natural and engaging for users. The researchers' focus on balancing personalization and speech synchronization is a valuable contribution that merits further exploration and refinement.

Related Papers

DiffPoseTalk: Speech-Driven Stylistic 3D Facial Animation and Head Pose Generation via Diffusion Models

Zhiyao Sun, Tian Lv, Sheng Ye, Matthieu Lin, Jenny Sheng, Yu-Hui Wen, Minjing Yu, Yong-Jin Liu

The generation of stylistic 3D facial animations driven by speech presents a significant challenge as it requires learning a many-to-many mapping between speech, style, and the corresponding natural facial motion. However, existing methods either employ a deterministic model for speech-to-motion mapping or encode the style using a one-hot encoding scheme. Notably, the one-hot encoding approach fails to capture the complexity of the style and thus limits generalization ability. In this paper, we propose DiffPoseTalk, a generative framework based on the diffusion model combined with a style encoder that extracts style embeddings from short reference videos. During inference, we employ classifier-free guidance to guide the generation process based on the speech and style. In particular, our style includes the generation of head poses, thereby enhancing user perception. Additionally, we address the shortage of scanned 3D talking face data by training our model on reconstructed 3DMM parameters from a high-quality, in-the-wild audio-visual dataset. Extensive experiments and user study demonstrate that our approach outperforms state-of-the-art methods. The code and dataset are at https://diffposetalk.github.io .

5/15/2024

cs.CV cs.GR

Emotional Conversation: Empowering Talking Faces with Cohesive Expression, Gaze and Pose Generation

Jiadong Liang, Feng Lu

Vivid talking face generation holds immense potential applications across diverse multimedia domains, such as film and game production. While existing methods accurately synchronize lip movements with input audio, they typically ignore crucial alignments between emotion and facial cues, which include expression, gaze, and head pose. These alignments are indispensable for synthesizing realistic videos. To address these issues, we propose a two-stage audio-driven talking face generation framework that employs 3D facial landmarks as intermediate variables. This framework achieves collaborative alignment of expression, gaze, and pose with emotions through self-supervised learning. Specifically, we decompose this task into two key steps, namely speech-to-landmarks synthesis and landmarks-to-face generation. The first step focuses on simultaneously synthesizing emotionally aligned facial cues, including normalized landmarks that represent expressions, gaze, and head pose. These cues are subsequently reassembled into relocated facial landmarks. In the second step, these relocated landmarks are mapped to latent key points using self-supervised learning and then input into a pretrained model to create high-quality face images. Extensive experiments on the MEAD dataset demonstrate that our model significantly advances the state-of-the-art performance in both visual quality and emotional alignment.

6/13/2024

cs.CV

CSTalk: Correlation Supervised Speech-driven 3D Emotional Facial Animation Generation

Xiangyu Liang, Wenlin Zhuang, Tianyong Wang, Guangxing Geng, Guangyue Geng, Haifeng Xia, Siyu Xia

Speech-driven 3D facial animation technology has been developed for years, but its practical application still lacks expectations. The main challenges lie in data limitations, lip alignment, and the naturalness of facial expressions. Although lip alignment has seen many related studies, existing methods struggle to synthesize natural and realistic expressions, resulting in a mechanical and stiff appearance of facial animations. Even with some research extracting emotional features from speech, the randomness of facial movements limits the effective expression of emotions. To address this issue, this paper proposes a method called CSTalk (Correlation Supervised) that models the correlations among different regions of facial movements and supervises the training of the generative model to generate realistic expressions that conform to human facial motion patterns. To generate more intricate animations, we employ a rich set of control parameters based on the metahuman character model and capture a dataset for five different emotions. We train a generative network using an autoencoder structure and input an emotion embedding vector to achieve the generation of user-control expressions. Experimental results demonstrate that our method outperforms existing state-of-the-art methods.

4/30/2024

cs.CV cs.AI

Listen, Disentangle, and Control: Controllable Speech-Driven Talking Head Generation

Changpeng Cai, Guinan Guo, Jiao Li, Junhao Su, Chenghao He, Jing Xiao, Yuanxu Chen, Lei Dai, Feiyu Zhu

Most earlier investigations on talking face generation have focused on the synchronization of lip motion and speech content. However, human head pose and facial emotions are equally important characteristics of natural human faces. While audio-driven talking face generation has seen notable advancements, existing methods either overlook facial emotions or are limited to specific individuals and cannot be applied to arbitrary subjects. In this paper, we propose a one-shot Talking Head Generation framework (SPEAK) that distinguishes itself from general Talking Face Generation by enabling emotional and postural control. Specifically, we introduce the Inter-Reconstructed Feature Disentanglement (IRFD) method to decouple human facial features into three latent spaces. We then design a face editing module that modifies speech content and facial latent codes into a single latent space. Subsequently, we present a novel generator that employs modified latent codes derived from the editing module to regulate emotional expression, head poses, and speech content in synthesizing facial animations. Extensive trials demonstrate that our method can generate realistic talking head with coordinated lip motions, authentic facial emotions, and smooth head movements. The demo video is available at the anonymous link: https://anonymous.4open.science/r/SPEAK-F56E

5/14/2024

cs.CV