EmoFace: Emotion-Content Disentangled Speech-Driven 3D Talking Face with Mesh Attention

Read original: arXiv:2408.11518 - Published 8/22/2024 by Yihong Lin, Liang Peng, Jianqiao Hu, Xiandong Li, Wenxiong Kang, Songju Lei, Xianjia Wu, Huang Xu

Overview

The paper presents a method called "EmoFace" for generating 3D talking faces that can express emotions in sync with speech.
The key innovation is the use of "emotion-content disentanglement" to separate the emotional expression from the speech content.
The system uses a Mesh Attention mechanism that learns feature dependencies between mesh vertices across both time and space, so the animation can focus on the relevant regions of the face.

Plain English Explanation

The EmoFace system aims to create lifelike 3D animated faces that can convey emotions while speaking. Most existing speech-driven facial animation approaches struggle to separate the emotional expression from the actual speech content. EmoFace solves this by disentangling the emotion and speech information, allowing for more natural and nuanced facial animations.

At the core of EmoFace is a "mesh attention" mechanism that focuses the animation on the most relevant regions of the face. This allows the system to emphasize the parts of the face that are critical for expressing different emotions, rather than just rigidly following the speech input.

The result is 3D talking faces that can convey a wide range of emotional states, from joyful to angry, in a way that is well-synchronized with the speech being generated. This could have applications in areas like virtual assistants, conversational AI, and digital avatars.

Technical Explanation

The EmoFace system takes audio input and generates a corresponding 3D talking face animation that expresses the appropriate emotions. The key innovations are:

Emotion-Content Disentanglement: EmoFace uses a disentanglement module to separate the emotional information from the speech content in the audio input. This allows the system to generate facial animations that convey emotion independently from the actual words being spoken.
Mesh Attention: Rather than applying the same animation uniformly across the face, EmoFace uses a mesh attention mechanism to focus the animation on the most relevant regions of the face for expressing different emotions. This selective animation helps create more naturalistic and expressive facial movements.
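End-to-End Training: The entire EmoFace pipeline, from audio input to 3D face animation, is trained end-to-end, so the model learns the relationships between speech, emotion, and facial movement jointly. The abstract adds that EmoFace is autoregressive and is trained with a self-growing scheme that combines teacher forcing and scheduled sampling.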
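
To make the mesh attention idea above more concrete, here is a minimal sketch of scaled dot-product self-attention over mesh vertices, written in plain Python so it has no dependencies. It is not the authors' Mesh Attention layer: the function names, the tiny dimensions, and the random projection matrices are all assumptions for illustration, and it covers only spatial attention within one frame, whereas the abstract describes dependencies in both time and space.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    peak = max(row)
    exps = [math.exp(s - peak) for s in row]
    total = sum(exps)
    return [e / total for e in exps]


def vertex_self_attention(features, w_q, w_k, w_v):
    """Generic self-attention where each mesh vertex attends to every vertex.

    features: one row of input features per vertex (hypothetical layout).
    Returns the attended features and the vertex-to-vertex weight matrix.
    """
    q = matmul(features, w_q)
    k = matmul(features, w_k)
    v = matmul(features, w_v)
    scale = math.sqrt(len(q[0]))
    # scores[i][j]: how strongly vertex i attends to vertex j
    scores = [[sum(a * b for a, b in zip(qi, kj)) / scale for kj in k] for qi in q]
    weights = [softmax(r) for r in scores]
    return matmul(weights, v), weights


def random_matrix(rows, cols, rng):
    """Stand-in for learned parameters; a real model would train these."""
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
num_vertices, d_in, d_model = 6, 3, 4  # toy sizes, not taken from the paper
features = random_matrix(num_vertices, d_in, rng)
w_q = random_matrix(d_in, d_model, rng)
w_k = random_matrix(d_in, d_model, rng)
w_v = random_matrix(d_in, d_model, rng)

out, weights = vertex_self_attention(features, w_q, w_k, w_v)
print(len(out), len(out[0]))  # one attended feature row per vertex
```

Each row of `weights` sums to 1, so a vertex's output is a weighted average of all vertices' value vectors. That is the sense in which attention lets some face regions influence the result more than others.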

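The authors evaluate EmoFace on two datasets: 3D-RAVDESS, a reconstructed 3D emotional facial animation dataset they propose, and the publicly available VOCASET. On both, they report lip vertex error (LVE) and emotional vertex error (EVE), and they state that EmoFace achieves state-of-the-art performance in quantitative and qualitative evaluations.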
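
The abstract reports results as LVE, the lip vertex error used in speech-driven 3D face animation work. As a rough sketch of how such a metric is commonly computed, the example below takes the largest error among lip vertices in each frame and averages it over frames. The function name, the data layout, and the lip indices are assumptions for illustration. The paper's exact definition may differ; some implementations use squared distance, for example.

```python
import math


def lip_vertex_error(pred, target, lip_idx):
    """Mean over frames of the maximum Euclidean error among lip vertices.

    pred, target: list of frames, each frame a list of (x, y, z) vertices.
    lip_idx: indices of the vertices treated as the lip region (assumed).
    """
    per_frame = [
        max(math.dist(p_frame[i], t_frame[i]) for i in lip_idx)
        for p_frame, t_frame in zip(pred, target)
    ]
    return sum(per_frame) / len(per_frame)


# Toy data: 2 frames, 3 vertices; vertices 1 and 2 play the role of the lips.
target = [
    [(0, 0, 0), (1, 0, 0), (0, 1, 0)],
    [(0, 0, 0), (1, 0, 0), (0, 1, 0)],
]
pred = [
    [(5, 5, 5), (1, 0, 0), (0, 1, 3)],  # lip errors 0 and 3 -> frame max 3
    [(0, 0, 0), (4, 4, 0), (0, 1, 0)],  # lip errors 5 and 0 -> frame max 5
]
lip_idx = [1, 2]

lve = lip_vertex_error(pred, target, lip_idx)
print(lve)  # 4.0, the mean of the per-frame maxima 3 and 5
```

The large error on vertex 0 in the first frame does not affect the result because that vertex is outside the assumed lip region. An EVE-style metric would apply the same pattern to a different set of vertex indices.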

Critical Analysis

The EmoFace paper makes a compelling contribution to the field of speech-driven facial animation. The key strengths are the disentanglement of emotion and content, the selective mesh attention mechanism, and the end-to-end training approach.

However, the authors acknowledge that EmoFace has some limitations. For example, the system is currently limited to generating animations from a single, pre-specified 3D face model. Extending the approach to handle multiple face models or even personalized avatars could be an interesting area for future research.

Additionally, while EmoFace demonstrates impressive results on standard benchmarks, its performance on more diverse or challenging real-world scenarios is not fully explored. Evaluating the system's robustness to factors like background noise, accents, or spontaneous speech could provide valuable insights.

Overall, the EmoFace work represents an important step forward in the quest for more expressive and natural-looking speech-driven facial animations. The techniques introduced in this paper could have far-reaching implications for fields like virtual assistants, social robots, and digital entertainment.

Conclusion

The EmoFace system presents a novel approach to generating 3D talking faces that can convey emotions in sync with speech. By disentangling the emotional and content information in the audio input and using a selective mesh attention mechanism, EmoFace is able to create more naturalistic and expressive facial animations.

The technical innovations and strong empirical results demonstrated in this paper suggest that EmoFace could be a valuable tool for a wide range of applications, from virtual assistants to digital avatars. As the field of speech-driven facial animation continues to advance, the concepts introduced in this work will likely serve as an important foundation for future research and development.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

EmoFace: Emotion-Content Disentangled Speech-Driven 3D Talking Face with Mesh Attention

Yihong Lin, Liang Peng, Jianqiao Hu, Xiandong Li, Wenxiong Kang, Songju Lei, Xianjia Wu, Huang Xu

The creation of increasingly vivid 3D virtual digital humans has become a hot topic in recent years. Currently, most speech-driven work focuses on training models to learn the relationship between phonemes and visemes to achieve more realistic lips. However, they fail to capture the correlations between emotions and facial expressions effectively. To solve this problem, we propose a new model, termed EmoFace. EmoFace employs a novel Mesh Attention mechanism, which helps to learn potential feature dependencies between mesh vertices in time and space. We also adopt, for the first time to our knowledge, an effective self-growing training scheme that combines teacher-forcing and scheduled sampling in a 3D face animation task. Additionally, since EmoFace is an autoregressive model, there is no requirement that the first frame of the training data must be a silent frame, which greatly reduces the data limitations and contributes to solve the current dilemma of insufficient datasets. Comprehensive quantitative and qualitative evaluations on our proposed high-quality reconstructed 3D emotional facial animation dataset, 3D-RAVDESS ($5.0343\times 10^{-5}$ mm for LVE and $1.0196\times 10^{-5}$ mm for EVE), and publicly available dataset VOCASET ($2.8669\times 10^{-5}$ mm for LVE and $0.4664\times 10^{-5}$ mm for EVE), demonstrate that our algorithm achieves state-of-the-art performance.

8/22/2024

EmoVOCA: Speech-Driven Emotional 3D Talking Heads

Federico Nocentini, Claudio Ferrari, Stefano Berretti

The domain of 3D talking head generation has witnessed significant progress in recent years. A notable challenge in this field consists in blending speech-related motions with expression dynamics, which is primarily caused by the lack of comprehensive 3D datasets that combine diversity in spoken sentences with a variety of facial expressions. Whereas literature works attempted to exploit 2D video data and parametric 3D models as a workaround, these still show limitations when jointly modeling the two motions. In this work, we address this problem from a different perspective, and propose an innovative data-driven technique that we used for creating a synthetic dataset, called EmoVOCA, obtained by combining a collection of inexpressive 3D talking heads and a set of 3D expressive sequences. To demonstrate the advantages of this approach, and the quality of the dataset, we then designed and trained an emotional 3D talking head generator that accepts a 3D face, an audio file, an emotion label, and an intensity value as inputs, and learns to animate the audio-synchronized lip movements with expressive traits of the face. Comprehensive experiments, both quantitative and qualitative, using our data and generator evidence superior ability in synthesizing convincing animations, when compared with the best performing methods in the literature. Our code and pre-trained model will be made available.

9/12/2024

EmoFace: Audio-driven Emotional 3D Face Animation

Chang Liu, Qunfen Lin, Zijiao Zeng, Ye Pan

Audio-driven emotional 3D face animation aims to generate emotionally expressive talking heads with synchronized lip movements. However, previous research has often overlooked the influence of diverse emotions on facial expressions or proved unsuitable for driving MetaHuman models. In response to this deficiency, we introduce EmoFace, a novel audio-driven methodology for creating facial animations with vivid emotional dynamics. Our approach can generate facial expressions with multiple emotions, and has the ability to generate random yet natural blinks and eye movements, while maintaining accurate lip synchronization. We propose independent speech encoders and emotion encoders to learn the relationship between audio, emotion and corresponding facial controller rigs, and finally map into the sequence of controller values. Additionally, we introduce two post-processing techniques dedicated to enhancing the authenticity of the animation, particularly in blinks and eye movements. Furthermore, recognizing the scarcity of emotional audio-visual data suitable for MetaHuman model manipulation, we contribute an emotional audio-visual dataset and derive control parameters for each frames. Our proposed methodology can be applied in producing dialogues animations of non-playable characters (NPCs) in video games, and driving avatars in virtual reality environments. Our further quantitative and qualitative experiments, as well as an user study comparing with existing researches show that our approach demonstrates superior results in driving 3D facial models. The code and sample data are available at https://github.com/SJTU-Lucy/EmoFace.

7/18/2024

EmoTalk3D: High-Fidelity Free-View Synthesis of Emotional 3D Talking Head

Qianyun He, Xinya Ji, Yicheng Gong, Yuanxun Lu, Zhengyu Diao, Linjia Huang, Yao Yao, Siyu Zhu, Zhan Ma, Songcen Xu, Xiaofei Wu, Zixiao Zhang, Xun Cao, Hao Zhu

We present a novel approach for synthesizing 3D talking heads with controllable emotion, featuring enhanced lip synchronization and rendering quality. Despite significant progress in the field, prior methods still suffer from multi-view consistency and a lack of emotional expressiveness. To address these issues, we collect EmoTalk3D dataset with calibrated multi-view videos, emotional annotations, and per-frame 3D geometry. By training on the EmoTalk3D dataset, we propose a 'Speech-to-Geometry-to-Appearance' mapping framework that first predicts faithful 3D geometry sequence from the audio features, then the appearance of a 3D talking head represented by 4D Gaussians is synthesized from the predicted geometry. The appearance is further disentangled into canonical and dynamic Gaussians, learned from multi-view videos, and fused to render free-view talking head animation. Moreover, our model enables controllable emotion in the generated talking heads and can be rendered in wide-range views. Our method exhibits improved rendering quality and stability in lip motion generation while capturing dynamic facial details such as wrinkles and subtle expressions. Experiments demonstrate the effectiveness of our approach in generating high-fidelity and emotion-controllable 3D talking heads. The code and EmoTalk3D dataset are released at https://nju-3dv.github.io/projects/EmoTalk3D.

8/2/2024