EmoVOCA: Speech-Driven Emotional 3D Talking Heads

Read original: arXiv:2403.12886 - Published 9/12/2024 by Federico Nocentini, Claudio Ferrari, Stefano Berretti

Overview

Introduces EmoVOCA, a synthetic dataset built by combining a collection of inexpressive 3D talking heads with a set of 3D expressive sequences
Trains an emotional 3D talking head generator on this data; the generator takes a 3D face, an audio file, an emotion label, and an intensity value as inputs
Aims to provide a more natural and engaging user experience for speech-driven 3D avatars

Plain English Explanation

The paper describes EmoVOCA, a synthetic dataset of 3D faces that speak and show emotion at the same time, together with a generator trained on it. The key idea is to combine two kinds of existing data, expressionless 3D talking heads and expressive 3D face sequences, so that a model can learn to produce digital characters that appear more lifelike and engaging.

At a high level, the generator takes speech as input and uses it to drive the animation of a 3D face. This includes not just the lip movements that match the words, but also changes in facial expression. The emotion is supplied as an input rather than guessed: alongside the 3D face and the audio file, the user provides an emotion label and an intensity value, and the generator learns to combine audio-synchronized lip movements with the requested expression.

The goal is to create 3D avatars that can communicate in a more natural and compelling way, with expressive faces that reflect the meaning and sentiment behind the words being spoken. This could have applications in areas like virtual assistants, video games, and online communication, where realistic and emotive digital characters can enhance the user experience.

Technical Explanation

The EmoVOCA system combines several key components to generate emotional 3D talking heads from speech input:

Audio Feature Extraction: The system extracts relevant acoustic features from the input speech, such as pitch, energy, and spectral characteristics.
Emotion Conditioning: According to the paper's abstract, the generator receives an emotion label (e.g. happy, sad, angry) and an intensity value as inputs, alongside a 3D face and the audio file.
3D Face Animation: A separate model generates the 3D facial movements to match the speech, including lip sync and other expressive elements.
Emotion-Aware 3D Animation: The emotion label and intensity are then used to modulate the 3D facial animation, adding appropriate expressions to the talking head.
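
As a rough illustration of the audio feature extraction step, the sketch below computes two simple frame-level features with the Python standard library only: RMS energy, and zero-crossing rate as a crude stand-in for spectral content. The function names, the 25 ms / 10 ms framing at 16 kHz, and the synthetic test tone are all assumptions made for this example; the summary does not say which features or toolkit the authors actually use.

```python
import math

def frame_signal(samples, frame_len, hop):
    """Split a mono signal into overlapping frames (a trailing partial frame is dropped)."""
    return [samples[i:i + frame_len]
            for i in range(0, len(samples) - frame_len + 1, hop)]

def rms_energy(frame):
    """Root-mean-square amplitude of one frame."""
    return math.sqrt(sum(x * x for x in frame) / len(frame))

def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))
    return crossings / (len(frame) - 1)

def extract_features(samples, frame_len=400, hop=160):
    """Return one (energy, zero-crossing rate) pair per frame.

    The defaults assume 16 kHz audio: 400 samples = 25 ms, 160 samples = 10 ms.
    """
    return [(rms_energy(f), zero_crossing_rate(f))
            for f in frame_signal(samples, frame_len, hop)]

# Hypothetical input: one second of a 440 Hz tone standing in for real speech.
SAMPLE_RATE = 16000
tone = [0.5 * math.sin(2 * math.pi * 440 * n / SAMPLE_RATE)
        for n in range(SAMPLE_RATE)]
features = extract_features(tone)
```

Each entry of `features` describes one 25 ms window, so a downstream animation model would see one feature vector per hop rather than raw samples.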

By integrating these components, the EmoVOCA system can produce 3D talking heads that not only move their mouths to match the speech, but also dynamically display the corresponding emotions on their face. This results in a more natural and engaging animation that conveys both the semantic and affective content of the audio.
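
The abstract reproduced under Related Papers lists the generator's inputs as a 3D face, an audio file, an emotion label, and an intensity value. The sketch below shows one simple way such conditioning could be encoded and applied. It is a deliberately naive additive blend, not the authors' method: the abstract describes a learned, data-driven technique, and the label set, function names, and toy one-vertex mesh here are all hypothetical.

```python
EMOTIONS = ["neutral", "happy", "sad", "angry"]  # hypothetical label set

def emotion_condition(label, intensity):
    """Encode an emotion label as a one-hot vector scaled by intensity in [0, 1]."""
    if label not in EMOTIONS:
        raise ValueError(f"unknown emotion label: {label!r}")
    if not 0.0 <= intensity <= 1.0:
        raise ValueError("intensity must lie in [0, 1]")
    return [intensity if name == label else 0.0 for name in EMOTIONS]

def blend_frame(neutral, speech_offsets, expression_offsets, intensity):
    """Add speech offsets and intensity-scaled expression offsets to a neutral mesh.

    Each argument except `intensity` is a list of (x, y, z) tuples, one per vertex.
    """
    return [
        tuple(n + s + intensity * e for n, s, e in zip(nv, sv, ev))
        for nv, sv, ev in zip(neutral, speech_offsets, expression_offsets)
    ]

# Toy example with a single vertex.
condition = emotion_condition("happy", 0.5)
neutral = [(0.0, 0.0, 0.0)]
speech = [(0.0, -0.5, 0.0)]      # jaw opening for a vowel (made-up value)
expression = [(0.25, 0.5, 0.0)]  # smile pulling the vertex out and up (made-up value)
frame = blend_frame(neutral, speech, expression, 0.5)
```

Plain addition treats the two motions as independent. The abstract names blending speech-related motion with expression dynamics as a notable challenge, which is why it relies on a learned model instead.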

The researchers report comprehensive quantitative and qualitative experiments using their dataset and generator, comparing against the best-performing methods in the literature. They state that the approach shows a superior ability to synthesize convincing animations.
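
Quantitative evaluation in this area often reports lip vertex error (LVE), which also appears in one of the EmoFace abstracts under Related Papers. This summary does not say which metrics the EmoVOCA authors use, so the sketch below is only an illustration of a common definition: for each frame, take the largest L2 distance between predicted and ground-truth positions over the lip vertices, then average over frames. The function name and the toy data are assumptions.

```python
import math

def lip_vertex_error(predicted, target, lip_indices):
    """Mean over frames of the maximum L2 error among the lip vertices.

    `predicted` and `target` are lists of frames; each frame is a list of
    (x, y, z) vertex positions. `lip_indices` selects the lip-region vertices.
    """
    per_frame_max = [
        max(math.dist(p_frame[i], t_frame[i]) for i in lip_indices)
        for p_frame, t_frame in zip(predicted, target)
    ]
    return sum(per_frame_max) / len(per_frame_max)

# Toy data: two frames, three vertices, of which vertices 0 and 1 are "lips".
predicted = [
    [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 0.0, 0.0)],
    [(0.0, 0.0, 0.0), (0.0, 0.0, 0.0), (0.0, 0.0, 0.0)],
]
target = [
    [(3.0, 4.0, 0.0), (0.0, 0.0, 0.0), (100.0, 0.0, 0.0)],
    [(0.0, 0.0, 0.0), (0.0, 0.0, 1.0), (0.0, 0.0, 0.0)],
]
lve = lip_vertex_error(predicted, target, lip_indices=[0, 1])
# Frame maxima are 5.0 and 1.0, so lve == 3.0; vertex 2 is ignored.
```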

Critical Analysis

The EmoVOCA paper makes a compelling case for jointly modeling speech-related motion and emotional expression to create more engaging digital characters. By going beyond simple lip sync to also convey a chosen emotion at a chosen intensity, the generator produces talking heads that feel more lifelike and natural.

However, the paper does acknowledge some limitations of the current approach. For example, the emotion prediction model is trained on a relatively small dataset, which may limit its ability to handle a diverse range of emotional expressions. Additionally, the 3D face animation is still somewhat generic and could benefit from more personalization or customization to individual characters.

Further research could explore ways to address these limitations, such as using larger and more diverse emotion datasets, or incorporating additional modalities (e.g. facial landmarks, head pose) to better capture the nuances of emotional expression. Integrating the EmoVOCA system with other animation techniques, such as body gestures or eye gaze, could also help create even more convincing and holistic digital characters.

Conclusion

EmoVOCA represents a step towards more natural and engaging speech-driven 3D animation. By pairing a synthetic dataset of expressive talking heads with a generator conditioned on an emotion label and an intensity value, it can produce talking heads that combine audio-synchronized lip movements with a chosen expression. This technology has the potential to improve user experiences in applications ranging from virtual assistants to video games.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

EmoVOCA: Speech-Driven Emotional 3D Talking Heads

Federico Nocentini, Claudio Ferrari, Stefano Berretti

The domain of 3D talking head generation has witnessed significant progress in recent years. A notable challenge in this field consists in blending speech-related motions with expression dynamics, which is primarily caused by the lack of comprehensive 3D datasets that combine diversity in spoken sentences with a variety of facial expressions. Whereas literature works attempted to exploit 2D video data and parametric 3D models as a workaround, these still show limitations when jointly modeling the two motions. In this work, we address this problem from a different perspective, and propose an innovative data-driven technique that we used for creating a synthetic dataset, called EmoVOCA, obtained by combining a collection of inexpressive 3D talking heads and a set of 3D expressive sequences. To demonstrate the advantages of this approach, and the quality of the dataset, we then designed and trained an emotional 3D talking head generator that accepts a 3D face, an audio file, an emotion label, and an intensity value as inputs, and learns to animate the audio-synchronized lip movements with expressive traits of the face. Comprehensive experiments, both quantitative and qualitative, using our data and generator evidence superior ability in synthesizing convincing animations, when compared with the best performing methods in the literature. Our code and pre-trained model will be made available.

9/12/2024

EmoTalk3D: High-Fidelity Free-View Synthesis of Emotional 3D Talking Head

Qianyun He, Xinya Ji, Yicheng Gong, Yuanxun Lu, Zhengyu Diao, Linjia Huang, Yao Yao, Siyu Zhu, Zhan Ma, Songcen Xu, Xiaofei Wu, Zixiao Zhang, Xun Cao, Hao Zhu

We present a novel approach for synthesizing 3D talking heads with controllable emotion, featuring enhanced lip synchronization and rendering quality. Despite significant progress in the field, prior methods still suffer from multi-view consistency and a lack of emotional expressiveness. To address these issues, we collect EmoTalk3D dataset with calibrated multi-view videos, emotional annotations, and per-frame 3D geometry. By training on the EmoTalk3D dataset, we propose a 'Speech-to-Geometry-to-Appearance' mapping framework that first predicts faithful 3D geometry sequence from the audio features, then the appearance of a 3D talking head represented by 4D Gaussians is synthesized from the predicted geometry. The appearance is further disentangled into canonical and dynamic Gaussians, learned from multi-view videos, and fused to render free-view talking head animation. Moreover, our model enables controllable emotion in the generated talking heads and can be rendered in wide-range views. Our method exhibits improved rendering quality and stability in lip motion generation while capturing dynamic facial details such as wrinkles and subtle expressions. Experiments demonstrate the effectiveness of our approach in generating high-fidelity and emotion-controllable 3D talking heads. The code and EmoTalk3D dataset are released at https://nju-3dv.github.io/projects/EmoTalk3D.

8/2/2024

EmoFace: Emotion-Content Disentangled Speech-Driven 3D Talking Face with Mesh Attention

Yihong Lin, Liang Peng, Jianqiao Hu, Xiandong Li, Wenxiong Kang, Songju Lei, Xianjia Wu, Huang Xu

The creation of increasingly vivid 3D virtual digital humans has become a hot topic in recent years. Currently, most speech-driven work focuses on training models to learn the relationship between phonemes and visemes to achieve more realistic lips. However, they fail to capture the correlations between emotions and facial expressions effectively. To solve this problem, we propose a new model, termed EmoFace. EmoFace employs a novel Mesh Attention mechanism, which helps to learn potential feature dependencies between mesh vertices in time and space. We also adopt, for the first time to our knowledge, an effective self-growing training scheme that combines teacher-forcing and scheduled sampling in a 3D face animation task. Additionally, since EmoFace is an autoregressive model, there is no requirement that the first frame of the training data must be a silent frame, which greatly reduces the data limitations and contributes to solve the current dilemma of insufficient datasets. Comprehensive quantitative and qualitative evaluations on our proposed high-quality reconstructed 3D emotional facial animation dataset, 3D-RAVDESS ($5.0343 \times 10^{-5}$ mm for LVE and $1.0196 \times 10^{-5}$ mm for EVE), and publicly available dataset VOCASET ($2.8669 \times 10^{-5}$ mm for LVE and $0.4664 \times 10^{-5}$ mm for EVE), demonstrate that our algorithm achieves state-of-the-art performance.

8/22/2024

EmoFace: Audio-driven Emotional 3D Face Animation

Chang Liu, Qunfen Lin, Zijiao Zeng, Ye Pan

Audio-driven emotional 3D face animation aims to generate emotionally expressive talking heads with synchronized lip movements. However, previous research has often overlooked the influence of diverse emotions on facial expressions or proved unsuitable for driving MetaHuman models. In response to this deficiency, we introduce EmoFace, a novel audio-driven methodology for creating facial animations with vivid emotional dynamics. Our approach can generate facial expressions with multiple emotions, and has the ability to generate random yet natural blinks and eye movements, while maintaining accurate lip synchronization. We propose independent speech encoders and emotion encoders to learn the relationship between audio, emotion and corresponding facial controller rigs, and finally map into the sequence of controller values. Additionally, we introduce two post-processing techniques dedicated to enhancing the authenticity of the animation, particularly in blinks and eye movements. Furthermore, recognizing the scarcity of emotional audio-visual data suitable for MetaHuman model manipulation, we contribute an emotional audio-visual dataset and derive control parameters for each frame. Our proposed methodology can be applied in producing dialogue animations of non-playable characters (NPCs) in video games, and driving avatars in virtual reality environments. Our further quantitative and qualitative experiments, as well as a user study comparing with existing research, show that our approach demonstrates superior results in driving 3D facial models. The code and sample data are available at https://github.com/SJTU-Lucy/EmoFace.

7/18/2024