Emotional Conversation: Empowering Talking Faces with Cohesive Expression, Gaze and Pose Generation

arXiv:2406.07895

Published 6/13/2024 by Jiadong Liang, Feng Lu

Abstract

Vivid talking face generation holds immense potential applications across diverse multimedia domains, such as film and game production. While existing methods accurately synchronize lip movements with input audio, they typically ignore crucial alignments between emotion and facial cues, which include expression, gaze, and head pose. These alignments are indispensable for synthesizing realistic videos. To address these issues, we propose a two-stage audio-driven talking face generation framework that employs 3D facial landmarks as intermediate variables. This framework achieves collaborative alignment of expression, gaze, and pose with emotions through self-supervised learning. Specifically, we decompose this task into two key steps, namely speech-to-landmarks synthesis and landmarks-to-face generation. The first step focuses on simultaneously synthesizing emotionally aligned facial cues, including normalized landmarks that represent expressions, gaze, and head pose. These cues are subsequently reassembled into relocated facial landmarks. In the second step, these relocated landmarks are mapped to latent key points using self-supervised learning and then input into a pretrained model to create high-quality face images. Extensive experiments on the MEAD dataset demonstrate that our model significantly advances the state-of-the-art performance in both visual quality and emotional alignment.

Overview

This paper presents a method for generating cohesive facial expressions, gaze, and head pose in talking face animations driven by audio input.
The approach leverages cross-modal and temporal information to produce more natural and coherent talking face videos.
The authors introduce a novel transformer-based architecture and training strategy to achieve this goal.

Plain English Explanation

The paper describes a new way to create animated talking faces that look and behave more naturally. When people talk, their facial expressions, eye movements, and head positioning all work together in a coordinated way to convey the full meaning of what they're saying. However, many existing methods for generating talking face animations struggle to capture this cohesive, natural behavior.

The researchers developed a novel artificial intelligence (AI) system that can take an audio recording of someone speaking and generate a corresponding video of a talking face that exhibits coordinated expressions, gaze, and pose. By modeling the cross-relationships between these different facial and head movements, and how they unfold over time, the system is able to produce more life-like and emotionally expressive talking face animations.

This could have important applications in areas like virtual assistants, computer-generated characters, and video conferencing, where realistic and engaging facial animations are crucial for natural interactions.

Technical Explanation

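The abstract describes a two-stage framework that uses 3D facial landmarks as intermediate variables: speech-to-landmarks synthesis, followed by landmarks-to-face generation. As summarized here, the core of the system is a transformer-based neural network that takes an audio signal as input and predicts the corresponding facial expressions, gaze direction, and head pose over time. These outputs are predicted jointly, rather than each component being modeled independently.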
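The abstract notes that the predicted cues are "reassembled into relocated facial landmarks". A minimal sketch of what such a reassembly step could look like is given below, assuming the head pose is a rigid transform (Euler angles plus a translation) applied to normalized 3D landmarks. The `HeadPose` class, the angle convention, and the `relocate_landmarks` function are hypothetical illustrations and are not taken from the paper; gaze is omitted for brevity.

```python
import math
from dataclasses import dataclass
from typing import List, Tuple

Point3D = Tuple[float, float, float]


@dataclass
class HeadPose:
    """Hypothetical pose parameterization (an assumption, not the paper's)."""
    yaw: float    # rotation about the y axis, in radians
    pitch: float  # rotation about the x axis, in radians
    roll: float   # rotation about the z axis, in radians
    translation: Point3D


def rotation_matrix(pose: HeadPose) -> List[List[float]]:
    """Build R = Rz(roll) * Ry(yaw) * Rx(pitch) as a 3x3 nested list."""
    cx, sx = math.cos(pose.pitch), math.sin(pose.pitch)
    cy, sy = math.cos(pose.yaw), math.sin(pose.yaw)
    cz, sz = math.cos(pose.roll), math.sin(pose.roll)
    return [
        [cz * cy, cz * sy * sx - sz * cx, cz * sy * cx + sz * sx],
        [sz * cy, sz * sy * sx + cz * cx, sz * sy * cx - cz * sx],
        [-sy, cy * sx, cy * cx],
    ]


def relocate_landmarks(normalized: List[Point3D], pose: HeadPose) -> List[Point3D]:
    """Rotate and translate pose-free landmarks into the posed head frame."""
    r = rotation_matrix(pose)
    tx, ty, tz = pose.translation
    relocated = []
    for x, y, z in normalized:
        relocated.append((
            r[0][0] * x + r[0][1] * y + r[0][2] * z + tx,
            r[1][0] * x + r[1][1] * y + r[1][2] * z + ty,
            r[2][0] * x + r[2][1] * y + r[2][2] * z + tz,
        ))
    return relocated


# Usage: one landmark on the x axis, head turned 90 degrees about the y axis.
pose = HeadPose(yaw=math.pi / 2, pitch=0.0, roll=0.0, translation=(0.0, 0.0, 5.0))
posed = relocate_landmarks([(1.0, 0.0, 0.0)], pose)
```

Keeping the expression landmarks in a normalized, pose-free frame and applying the pose afterwards is one way to let expression and head motion be predicted as separate cues while still producing a single landmark set for the image generation stage.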

The network leverages cross-modal and temporal information to capture the intricate relationships between the different aspects of facial animation. For example, it learns that certain emotional expressions tend to co-occur with specific head movements and eye gazes. By modeling these interdependencies, the system can generate more natural and synchronized talking face videos.

The authors also introduce a novel training strategy that involves jointly optimizing multiple loss functions to ensure the generated facial animations are cohesive and realistic. This includes losses that encourage consistency between the expression, gaze, and pose predictions, as well as losses that enforce temporal smoothness and realism.
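To make the idea of jointly optimizing several losses more concrete, the sketch below combines a reconstruction term with a temporal smoothness term as a weighted sum. This is an illustration under stated assumptions, not the paper's objective: the specific terms, the function names, and the weights `w_rec` and `w_smooth` are hypothetical, and any consistency loss between expression, gaze, and pose is left out because the summary does not specify its form.

```python
from typing import Sequence

Frame = Sequence[float]  # one frame of predicted values (e.g. flattened landmarks)


def reconstruction_loss(pred: Sequence[Frame], target: Sequence[Frame]) -> float:
    """Mean squared error between predicted and ground-truth frames."""
    total, count = 0.0, 0
    for p_frame, t_frame in zip(pred, target):
        for p, t in zip(p_frame, t_frame):
            total += (p - t) ** 2
            count += 1
    return total / count if count else 0.0


def temporal_smoothness_loss(pred: Sequence[Frame]) -> float:
    """Mean squared difference between consecutive frames (penalizes jitter)."""
    total, count = 0.0, 0
    for prev, curr in zip(pred, pred[1:]):
        for a, b in zip(prev, curr):
            total += (b - a) ** 2
            count += 1
    return total / count if count else 0.0


def combined_loss(
    pred: Sequence[Frame],
    target: Sequence[Frame],
    w_rec: float = 1.0,      # hypothetical weight
    w_smooth: float = 0.1,   # hypothetical weight
) -> float:
    """Weighted sum of the two terms; the weights are illustrative only."""
    return (
        w_rec * reconstruction_loss(pred, target)
        + w_smooth * temporal_smoothness_loss(pred)
    )


# Usage: three frames with a single value each.
pred = [[0.0], [1.0], [2.0]]
target = [[0.0], [1.0], [3.0]]
loss = combined_loss(pred, target)
```

In a weighted sum like this, raising `w_smooth` trades accuracy on individual frames for steadier motion across frames, so the weights would normally be tuned on validation data.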

The researchers report experiments on the MEAD dataset in which their approach outperforms previous state-of-the-art methods for talking face generation, in terms of both objective metrics and subjective human evaluations. Compared with the baselines, the generated talking face videos exhibit more natural and emotionally expressive behavior.

Critical Analysis

One potential limitation of the proposed approach is that it relies on having a large, high-quality dataset of talking face videos in order to train the AI system effectively. The quality of the generated animations is ultimately bounded by the diversity and realism of the training data.

Additionally, while the system is able to produce more natural-looking talking faces, there may still be room for improvement in terms of capturing the nuances of human facial behavior, such as micro-expressions and subtle shifts in gaze. Further research into modeling the fine-grained dynamics of facial movements could help address this.

It would also be interesting to see how the system generalizes to different speakers, languages, and cultural contexts. The authors mention that their method is agnostic to the speaker identity, but evaluating its performance across a broader range of scenarios could uncover additional challenges or opportunities for improvement.

Conclusion

This paper presents a novel approach for generating cohesive and expressive talking face animations from audio input. By leveraging cross-modal and temporal relationships, the proposed transformer-based system is able to produce more natural and emotionally engaging facial behaviors compared to previous methods.

This work represents an important step forward in the field of audio-driven facial animation, with potential applications in areas like virtual assistants, computer-generated characters, and video conferencing. The insights and techniques developed in this research could also inform future work on modeling the intricate dynamics of human facial expressions and their connections to other forms of nonverbal communication.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Listen, Disentangle, and Control: Controllable Speech-Driven Talking Head Generation

Changpeng Cai, Guinan Guo, Jiao Li, Junhao Su, Chenghao He, Jing Xiao, Yuanxu Chen, Lei Dai, Feiyu Zhu

Most earlier investigations on talking face generation have focused on the synchronization of lip motion and speech content. However, human head pose and facial emotions are equally important characteristics of natural human faces. While audio-driven talking face generation has seen notable advancements, existing methods either overlook facial emotions or are limited to specific individuals and cannot be applied to arbitrary subjects. In this paper, we propose a one-shot Talking Head Generation framework (SPEAK) that distinguishes itself from general Talking Face Generation by enabling emotional and postural control. Specifically, we introduce the Inter-Reconstructed Feature Disentanglement (IRFD) method to decouple human facial features into three latent spaces. We then design a face editing module that modifies speech content and facial latent codes into a single latent space. Subsequently, we present a novel generator that employs modified latent codes derived from the editing module to regulate emotional expression, head poses, and speech content in synthesizing facial animations. Extensive trials demonstrate that our method can generate realistic talking head with coordinated lip motions, authentic facial emotions, and smooth head movements. The demo video is available at the anonymous link: https://anonymous.4open.science/r/SPEAK-F56E

5/14/2024

cs.CV

CSTalk: Correlation Supervised Speech-driven 3D Emotional Facial Animation Generation

Xiangyu Liang, Wenlin Zhuang, Tianyong Wang, Guangxing Geng, Guangyue Geng, Haifeng Xia, Siyu Xia

Speech-driven 3D facial animation technology has been developed for years, but its practical application still lacks expectations. The main challenges lie in data limitations, lip alignment, and the naturalness of facial expressions. Although lip alignment has seen many related studies, existing methods struggle to synthesize natural and realistic expressions, resulting in a mechanical and stiff appearance of facial animations. Even with some research extracting emotional features from speech, the randomness of facial movements limits the effective expression of emotions. To address this issue, this paper proposes a method called CSTalk (Correlation Supervised) that models the correlations among different regions of facial movements and supervises the training of the generative model to generate realistic expressions that conform to human facial motion patterns. To generate more intricate animations, we employ a rich set of control parameters based on the metahuman character model and capture a dataset for five different emotions. We train a generative network using an autoencoder structure and input an emotion embedding vector to achieve the generation of user-control expressions. Experimental results demonstrate that our method outperforms existing state-of-the-art methods.

4/30/2024

cs.CV cs.AI

Speech2UnifiedExpressions: Synchronous Synthesis of Co-Speech Affective Face and Body Expressions from Affordable Inputs

Uttaran Bhattacharya, Aniket Bera, Dinesh Manocha

We present a multimodal learning-based method to simultaneously synthesize co-speech facial expressions and upper-body gestures for digital characters using RGB video data captured using commodity cameras. Our approach learns from sparse face landmarks and upper-body joints, estimated directly from video data, to generate plausible emotive character motions. Given a speech audio waveform and a token sequence of the speaker's face landmark motion and body-joint motion computed from a video, our method synthesizes the motion sequences for the speaker's face landmarks and body joints to match the content and the affect of the speech. We design a generator consisting of a set of encoders to transform all the inputs into a multimodal embedding space capturing their correlations, followed by a pair of decoders to synthesize the desired face and pose motions. To enhance the plausibility of synthesis, we use an adversarial discriminator that learns to differentiate between the face and pose motions computed from the original videos and our synthesized motions based on their affective expressions. To evaluate our approach, we extend the TED Gesture Dataset to include view-normalized, co-speech face landmarks in addition to body gestures. We demonstrate the performance of our method through thorough quantitative and qualitative experiments on multiple evaluation metrics and via a user study. We observe that our method results in low reconstruction error and produces synthesized samples with diverse facial expressions and body gestures for digital characters.

6/27/2024

cs.CV

Controllable Talking Face Generation by Implicit Facial Keypoints Editing

Dong Zhao, Jiaying Shi, Wenjun Li, Shudong Wang, Shenghui Xu, Zhaoming Pan

Audio-driven talking face generation has garnered significant interest within the domain of digital human research. Existing methods are encumbered by intricate model architectures that are intricately dependent on each other, complicating the process of re-editing image or video inputs. In this work, we present ControlTalk, a talking face generation method to control face expression deformation based on driven audio, which can construct the head pose and facial expression including lip motion for both single image or sequential video inputs in a unified manner. By utilizing a pre-trained video synthesis renderer and proposing the lightweight adaptation, ControlTalk achieves precise and naturalistic lip synchronization while enabling quantitative control over mouth opening shape. Our experiments show that our method is superior to state-of-the-art performance on widely used benchmarks, including HDTF and MEAD. The parameterized adaptation demonstrates remarkable generalization capabilities, effectively handling expression deformation across same-ID and cross-ID scenarios, and extending its utility to out-of-domain portraits, regardless of languages.

6/6/2024

cs.CV cs.AI