EmoTalk3D: High-Fidelity Free-View Synthesis of Emotional 3D Talking Head

Read original: arXiv:2408.00297 - Published 8/2/2024 by Qianyun He, Xinya Ji, Yicheng Gong, Yuanxun Lu, Zhengyu Diao, Linjia Huang, Yao Yao, Siyu Zhu, Zhan Ma, Songcen Xu and 4 others

Overview

EmoTalk3D is a system for high-fidelity, free-view synthesis of emotional 3D talking heads.
It can generate realistic 3D facial animations driven by audio input, with accurate emotional expressions.
The system uses a 3D Gaussian splatting approach to produce smooth and natural-looking animations.

Plain English Explanation

EmoTalk3D creates 3D animated faces that look and move realistically. Audio recordings drive the animation, and the generated face can convey emotions such as happiness, sadness, or anger.

The key innovation is the use of a 3D Gaussian splatting technique. This means the system renders the facial features, like the mouth, eyes, and eyebrows, as smooth, blended shapes rather than just a collection of discrete points. This results in animations that look much more natural and lifelike compared to older approaches.

At generation time, the system produces these high-quality 3D talking heads from audio input alone. Training is a different matter: according to the abstract, the model learns from the EmoTalk3D dataset, which contains calibrated multi-view videos, emotional annotations, and per-frame 3D geometry. Once the model is trained, no further facial motion capture is needed to animate the head.

Overall, EmoTalk3D represents an important advance in the field of emotional 3D facial animation, with applications in areas like virtual assistants, animated characters, and human-computer interaction.

Technical Explanation

The EmoTalk3D system generates realistic 3D facial animation driven by audio input. It represents the head's appearance with Gaussian primitives (4D Gaussians, in the abstract's terms) and renders them by splatting, producing smooth, natural-looking expressions from freely chosen viewpoints.
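To make the splatting idea concrete, here is a minimal sketch of the compositing step that Gaussian splatting renderers generally use: splats that overlap a pixel are sorted by depth and alpha-blended front to back. It is not taken from the EmoTalk3D code. The `Splat` class and `render_pixel` function are hypothetical names, and the splats are assumed to be already projected to the image plane and isotropic, whereas real renderers project anisotropic 3D Gaussians.

```python
import math
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Splat:
    """A Gaussian already projected to the image plane (illustrative only)."""
    x: float                           # projected centre, in pixels
    y: float
    depth: float                       # distance from the camera, used for sorting
    sigma: float                       # isotropic standard deviation, in pixels
    opacity: float                     # peak opacity in [0, 1]
    color: Tuple[float, float, float]  # RGB in [0, 1]


def render_pixel(splats: List[Splat], px: float, py: float):
    """Alpha-composite the splats at one pixel, nearest splat first."""
    rgb = [0.0, 0.0, 0.0]
    transmittance = 1.0  # fraction of light not yet blocked by nearer splats
    for s in sorted(splats, key=lambda s: s.depth):
        dist_sq = (px - s.x) ** 2 + (py - s.y) ** 2
        # Opacity falls off smoothly with distance from the splat centre.
        alpha = s.opacity * math.exp(-0.5 * dist_sq / s.sigma ** 2)
        for c in range(3):
            rgb[c] += transmittance * alpha * s.color[c]
        transmittance *= 1.0 - alpha
    return tuple(rgb), transmittance


# A half-opaque red splat in front of a fully opaque blue one.
demo = [
    Splat(0.0, 0.0, depth=2.0, sigma=1.0, opacity=1.0, color=(0.0, 0.0, 1.0)),
    Splat(0.0, 0.0, depth=1.0, sigma=1.0, opacity=0.5, color=(1.0, 0.0, 0.0)),
]
pixel_rgb, remaining = render_pixel(demo, 0.0, 0.0)
```

At the shared centre, the pixel receives equal parts red and blue, and no light passes through to the background. Because each splat's contribution fades smoothly with distance, neighbouring splats blend into one another rather than appearing as separate points.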

According to the abstract, the architecture follows a 'Speech-to-Geometry-to-Appearance' mapping. The first stage predicts a 3D geometry sequence from audio features. The model also supports controllable emotion, so the emotional expression of the generated head can be set in addition to its speech content.

The second stage synthesizes the appearance of the talking head, represented by 4D Gaussians, from the predicted geometry. The appearance is disentangled into canonical and dynamic Gaussians, which are learned from multi-view videos and fused to render the head from a free viewpoint, including dynamic details such as wrinkles and subtle expressions.
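The paper's abstract describes the data flow as a 'Speech-to-Geometry-to-Appearance' mapping: audio features are first turned into 3D geometry, and the geometry is then turned into Gaussians for rendering. The sketch below shows only that two-stage flow. Every function here is a hypothetical toy stand-in rather than the authors' networks, and feeding an emotion control (`emotion_gain`) into the first stage is an assumption made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

Point = Tuple[float, float, float]


@dataclass
class Gaussian:
    center: Point
    opacity: float


def toy_speech_to_geometry(audio_energy: float, emotion_gain: float) -> List[Point]:
    """Toy stage 1: two 'lip' points whose vertical gap grows with loudness."""
    gap = audio_energy * emotion_gain
    return [(0.0, gap / 2, 0.0), (0.0, -gap / 2, 0.0)]


def toy_geometry_to_appearance(points: List[Point]) -> List[Gaussian]:
    """Toy stage 2: place one opaque Gaussian on every geometry point."""
    return [Gaussian(center=p, opacity=1.0) for p in points]


def synthesize(
    audio_energies: Sequence[float],
    emotion_gain: float,
    stage1: Callable[[float, float], List[Point]] = toy_speech_to_geometry,
    stage2: Callable[[List[Point]], List[Gaussian]] = toy_geometry_to_appearance,
) -> List[List[Gaussian]]:
    """Run audio -> geometry -> Gaussians once per audio frame."""
    return [stage2(stage1(energy, emotion_gain)) for energy in audio_energies]


# Two audio frames: silence, then a loud frame.
frames = synthesize([0.0, 1.0], emotion_gain=2.0)
```

The point of the split is that the second stage never sees audio. It depends only on the predicted geometry, so each stage can be swapped or inspected on its own.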

Experiments show that EmoTalk3D is able to generate high-fidelity 3D talking head animations with accurate emotional expressions. The system outperforms previous state-of-the-art methods in terms of both objective metrics and subjective user evaluations.

Critical Analysis

The EmoTalk3D paper presents a compelling approach to generating realistic 3D facial animations from audio input. The use of 3D Gaussian splatting is a clever technique that addresses some of the limitations of earlier blendshape-based methods.

However, the paper does acknowledge some potential limitations. For example, the system is currently limited to a fixed set of predefined emotional expressions, and may struggle with more nuanced or intermediate emotional states. There is also the question of how well the system would generalize to more diverse speaker characteristics or speaking styles.

Additionally, while the paper demonstrates impressive results, it would be helpful to see more analysis of the system's robustness and failure modes. Evaluating its performance in noisy or challenging real-world environments would also be valuable.

Overall, EmoTalk3D represents an important step forward in the field of 3D facial animation. The innovative use of 3D Gaussian splatting is a promising direction, and future work could explore ways to further enhance the system's flexibility and generalization capabilities.

Conclusion

EmoTalk3D is a state-of-the-art system for generating high-fidelity, free-view 3D facial animations from audio input. By using a 3D Gaussian splatting technique, the system is able to create smooth, natural-looking expressions that accurately convey emotional states.

This technology has exciting potential applications in areas like virtual assistants, animated characters, and human-computer interaction. As the field of 3D facial animation continues to evolve, systems like EmoTalk3D will play an increasingly important role in bridging the gap between digital avatars and lifelike, expressive virtual characters.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

EmoTalk3D: High-Fidelity Free-View Synthesis of Emotional 3D Talking Head

Qianyun He, Xinya Ji, Yicheng Gong, Yuanxun Lu, Zhengyu Diao, Linjia Huang, Yao Yao, Siyu Zhu, Zhan Ma, Songcen Xu, Xiaofei Wu, Zixiao Zhang, Xun Cao, Hao Zhu

We present a novel approach for synthesizing 3D talking heads with controllable emotion, featuring enhanced lip synchronization and rendering quality. Despite significant progress in the field, prior methods still suffer from multi-view consistency and a lack of emotional expressiveness. To address these issues, we collect EmoTalk3D dataset with calibrated multi-view videos, emotional annotations, and per-frame 3D geometry. By training on the EmoTalk3D dataset, we propose a 'Speech-to-Geometry-to-Appearance' mapping framework that first predicts faithful 3D geometry sequence from the audio features, then the appearance of a 3D talking head represented by 4D Gaussians is synthesized from the predicted geometry. The appearance is further disentangled into canonical and dynamic Gaussians, learned from multi-view videos, and fused to render free-view talking head animation. Moreover, our model enables controllable emotion in the generated talking heads and can be rendered in wide-range views. Our method exhibits improved rendering quality and stability in lip motion generation while capturing dynamic facial details such as wrinkles and subtle expressions. Experiments demonstrate the effectiveness of our approach in generating high-fidelity and emotion-controllable 3D talking heads. The code and EmoTalk3D dataset are released at https://nju-3dv.github.io/projects/EmoTalk3D.

8/2/2024

EmoVOCA: Speech-Driven Emotional 3D Talking Heads

Federico Nocentini, Claudio Ferrari, Stefano Berretti

The domain of 3D talking head generation has witnessed significant progress in recent years. A notable challenge in this field consists in blending speech-related motions with expression dynamics, which is primarily caused by the lack of comprehensive 3D datasets that combine diversity in spoken sentences with a variety of facial expressions. Whereas literature works attempted to exploit 2D video data and parametric 3D models as a workaround, these still show limitations when jointly modeling the two motions. In this work, we address this problem from a different perspective, and propose an innovative data-driven technique that we used for creating a synthetic dataset, called EmoVOCA, obtained by combining a collection of inexpressive 3D talking heads and a set of 3D expressive sequences. To demonstrate the advantages of this approach, and the quality of the dataset, we then designed and trained an emotional 3D talking head generator that accepts a 3D face, an audio file, an emotion label, and an intensity value as inputs, and learns to animate the audio-synchronized lip movements with expressive traits of the face. Comprehensive experiments, both quantitative and qualitative, using our data and generator evidence superior ability in synthesizing convincing animations, when compared with the best performing methods in the literature. Our code and pre-trained model will be made available.

9/12/2024

EmoFace: Emotion-Content Disentangled Speech-Driven 3D Talking Face with Mesh Attention

Yihong Lin, Liang Peng, Jianqiao Hu, Xiandong Li, Wenxiong Kang, Songju Lei, Xianjia Wu, Huang Xu

The creation of increasingly vivid 3D virtual digital humans has become a hot topic in recent years. Currently, most speech-driven work focuses on training models to learn the relationship between phonemes and visemes to achieve more realistic lips. However, they fail to capture the correlations between emotions and facial expressions effectively. To solve this problem, we propose a new model, termed EmoFace. EmoFace employs a novel Mesh Attention mechanism, which helps to learn potential feature dependencies between mesh vertices in time and space. We also adopt, for the first time to our knowledge, an effective self-growing training scheme that combines teacher-forcing and scheduled sampling in a 3D face animation task. Additionally, since EmoFace is an autoregressive model, there is no requirement that the first frame of the training data must be a silent frame, which greatly reduces the data limitations and contributes to solve the current dilemma of insufficient datasets. Comprehensive quantitative and qualitative evaluations on our proposed high-quality reconstructed 3D emotional facial animation dataset, 3D-RAVDESS ($5.0343\times 10^{-5}$ mm for LVE and $1.0196\times 10^{-5}$ mm for EVE), and publicly available dataset VOCASET ($2.8669\times 10^{-5}$ mm for LVE and $0.4664\times 10^{-5}$ mm for EVE), demonstrate that our algorithm achieves state-of-the-art performance.

8/22/2024

EmoFace: Audio-driven Emotional 3D Face Animation

Chang Liu, Qunfen Lin, Zijiao Zeng, Ye Pan

Audio-driven emotional 3D face animation aims to generate emotionally expressive talking heads with synchronized lip movements. However, previous research has often overlooked the influence of diverse emotions on facial expressions or proved unsuitable for driving MetaHuman models. In response to this deficiency, we introduce EmoFace, a novel audio-driven methodology for creating facial animations with vivid emotional dynamics. Our approach can generate facial expressions with multiple emotions, and has the ability to generate random yet natural blinks and eye movements, while maintaining accurate lip synchronization. We propose independent speech encoders and emotion encoders to learn the relationship between audio, emotion and corresponding facial controller rigs, and finally map into the sequence of controller values. Additionally, we introduce two post-processing techniques dedicated to enhancing the authenticity of the animation, particularly in blinks and eye movements. Furthermore, recognizing the scarcity of emotional audio-visual data suitable for MetaHuman model manipulation, we contribute an emotional audio-visual dataset and derive control parameters for each frame. Our proposed methodology can be applied in producing dialogue animations of non-playable characters (NPCs) in video games, and driving avatars in virtual reality environments. Our further quantitative and qualitative experiments, as well as a user study comparing with existing research, show that our approach demonstrates superior results in driving 3D facial models. The code and sample data are available at https://github.com/SJTU-Lucy/EmoFace.

7/18/2024