Listen, Disentangle, and Control: Controllable Speech-Driven Talking Head Generation

2405.07257

Published 5/14/2024 by Changpeng Cai, Guinan Guo, Jiao Li, Junhao Su, Chenghao He, Jing Xiao, Yuanxu Chen, Lei Dai, Feiyu Zhu

cs.CV

Abstract

Most earlier investigations on talking face generation have focused on the synchronization of lip motion and speech content. However, human head pose and facial emotions are equally important characteristics of natural human faces. While audio-driven talking face generation has seen notable advancements, existing methods either overlook facial emotions or are limited to specific individuals and cannot be applied to arbitrary subjects. In this paper, we propose a one-shot Talking Head Generation framework (SPEAK) that distinguishes itself from general Talking Face Generation by enabling emotional and postural control. Specifically, we introduce the Inter-Reconstructed Feature Disentanglement (IRFD) method to decouple human facial features into three latent spaces. We then design a face editing module that modifies speech content and facial latent codes into a single latent space. Subsequently, we present a novel generator that employs modified latent codes derived from the editing module to regulate emotional expression, head poses, and speech content in synthesizing facial animations. Extensive trials demonstrate that our method can generate realistic talking head with coordinated lip motions, authentic facial emotions, and smooth head movements. The demo video is available at the anonymous link: https://anonymous.4open.science/r/SPEAK-F56E

Overview

Earlier research on talking face generation has focused on synchronizing lip movement with speech, but overlooked other important facial cues like head pose and emotions.
Existing audio-driven talking face generation methods either don't capture facial emotions or are limited to specific individuals.
This paper proposes a framework called SPEAK that can generate realistic talking head animations with coordinated lip motions, authentic facial emotions, and smooth head movements.

Plain English Explanation

The paper presents a new approach to generating realistic animated faces that speak and convey emotion. Previous work in this area has mainly focused on syncing lip movements with the audio and has largely neglected other important facial characteristics, such as head pose and emotional expression.

The SPEAK framework proposed in this paper aims to address these limitations. It can generate talking head animations that not only have proper lip sync, but also show natural-looking head movements and convincing emotional expressions. This is achieved by decomposing the facial features into separate latent spaces, which allows the system to independently control the speech content, head pose, and emotional state of the generated face.

The core innovation is an "Inter-Reconstructed Feature Disentanglement" method that separates the facial features into these three distinct latent spaces. A face editing module then combines these latent codes to produce a single latent representation that can be used by the generator to create the final talking head animation with coordinated lip motions, head movements, and emotional expressions.

Technical Explanation

The key technical components of the SPEAK framework are:

Inter-Reconstructed Feature Disentanglement (IRFD): This method decomposes the facial features into three latent spaces - one for speech content, one for head pose, and one for emotional expression. This disentanglement allows for independent control over these different aspects of the facial animation.
Face Editing Module: This module merges the speech content with the facial latent codes from the IRFD step, mapping them into a single latent space whose codes can be fed into the generator.
Novel Generator: The generator uses the modified latent codes from the editing module to synthesize the final talking head animation, regulating the lip motions, head movements, and emotional expressions.
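
The data flow through these three components can be illustrated with a minimal sketch. It is not the authors' implementation: the names FaceLatents, edit_latents, and fuse are hypothetical, the vectors are toy values, and plain concatenation stands in for the learned editing module. It only shows why keeping separate codes allows one attribute to change while the others stay fixed.

```python
from dataclasses import dataclass, replace
from typing import List, Optional

Vector = List[float]


@dataclass(frozen=True)
class FaceLatents:
    """Three separate codes, following the summary's description of IRFD (hypothetical type)."""
    content: Vector   # speech content, which drives lip motion
    pose: Vector      # head pose
    emotion: Vector   # emotional expression


def edit_latents(base: FaceLatents,
                 pose: Optional[Vector] = None,
                 emotion: Optional[Vector] = None) -> FaceLatents:
    """Swap in a new pose and/or emotion code; codes that are not passed stay unchanged."""
    updates = {}
    if pose is not None:
        updates["pose"] = pose
    if emotion is not None:
        updates["emotion"] = emotion
    return replace(base, **updates)


def fuse(latents: FaceLatents) -> Vector:
    """Combine the three codes into a single vector for a generator.

    Assumption: plain concatenation; the paper's editing module is a learned mapping.
    """
    return latents.content + latents.pose + latents.emotion


# Toy example: change only the emotion code of a neutral face.
neutral = FaceLatents(content=[0.2, 0.7], pose=[0.0, 0.0, 0.0], emotion=[0.0, 0.0])
happy = edit_latents(neutral, emotion=[0.9, 0.1])

fused_neutral = fuse(neutral)
fused_happy = fuse(happy)
```

In this toy setup, the content and pose portions of fused_neutral and fused_happy are identical and only the emotion portion differs. That is the property the disentanglement is meant to provide; a real generator would decode the fused code into video frames.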

The authors report extensive experiments showing that SPEAK can generate realistic talking head videos with synchronized lip movements, smooth head movements, and authentic emotional expressions. According to the paper, prior audio-driven talking face generation methods either overlooked facial emotions or were limited to specific individuals.

Critical Analysis

The main strength of the SPEAK framework is its ability to independently control the different aspects of a talking face animation - speech content, head pose, and emotional expression. This disentanglement allows for a high degree of flexibility and control compared to previous approaches.

However, the paper does not address some potential limitations or areas for further research:

The framework is still limited to generating talking head animations from audio input. Extending it to work with arbitrary video sources or to generate full-body animations could be valuable.
The paper does not provide a detailed analysis of the computational efficiency or real-time capabilities of the SPEAK framework. This would be an important consideration for practical applications.
The evaluation is primarily focused on qualitative assessments of the generated animations. Incorporating more quantitative metrics to assess the fidelity and realism of the results could strengthen the analysis.

Overall, the SPEAK framework advances audio-driven talking face generation by adding emotional and postural control in a one-shot setting, but further research and development is needed to address the limitations noted above.

Conclusion

This research paper introduces the SPEAK framework, a novel approach to generating realistic talking head animations that can independently control the speech content, head pose, and emotional expressions of the synthesized face. By decomposing the facial features into separate latent spaces and using a dedicated face editing module, SPEAK is able to produce talking head videos with coordinated lip motions, natural head movements, and convincing emotional expressions.

The key innovation of the SPEAK framework is the Inter-Reconstructed Feature Disentanglement (IRFD) method, which allows for this independent control over the different aspects of the facial animation. This represents a significant advancement over previous audio-driven talking face generation techniques, which either overlooked facial emotions or were limited to specific individuals.

While the paper demonstrates the effectiveness of the SPEAK framework through extensive qualitative evaluations, there are still opportunities for further research to address potential limitations, such as extending the approach to work with arbitrary video sources or incorporating more quantitative metrics. Nevertheless, this work represents an important step forward in the field of talking face generation, paving the way for more realistic and expressive animated characters in a variety of applications.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

EDTalk: Efficient Disentanglement for Emotional Talking Head Synthesis

Shuai Tan, Bin Ji, Mengxiao Bi, Ye Pan

Achieving disentangled control over multiple facial motions and accommodating diverse input modalities greatly enhances the application and entertainment of the talking head generation. This necessitates a deep exploration of the decoupling space for facial features, ensuring that they a) operate independently without mutual interference and b) can be preserved to share with different modal input, both aspects often neglected in existing methods. To address this gap, this paper proposes a novel Efficient Disentanglement framework for Talking head generation (EDTalk). Our framework enables individual manipulation of mouth shape, head pose, and emotional expression, conditioned on video or audio inputs. Specifically, we employ three lightweight modules to decompose the facial dynamics into three distinct latent spaces representing mouth, pose, and expression, respectively. Each space is characterized by a set of learnable bases whose linear combinations define specific motions. To ensure independence and accelerate training, we enforce orthogonality among bases and devise an efficient training strategy to allocate motion responsibilities to each space without relying on external knowledge. The learned bases are then stored in corresponding banks, enabling shared visual priors with audio input. Furthermore, considering the properties of each space, we propose an Audio-to-Motion module for audio-driven talking head synthesis. Experiments are conducted to demonstrate the effectiveness of EDTalk. We recommend watching the project website: https://tanshuai0219.github.io/EDTalk/

4/3/2024

cs.CV

Emotional Conversation: Empowering Talking Faces with Cohesive Expression, Gaze and Pose Generation

Jiadong Liang, Feng Lu

Vivid talking face generation holds immense potential applications across diverse multimedia domains, such as film and game production. While existing methods accurately synchronize lip movements with input audio, they typically ignore crucial alignments between emotion and facial cues, which include expression, gaze, and head pose. These alignments are indispensable for synthesizing realistic videos. To address these issues, we propose a two-stage audio-driven talking face generation framework that employs 3D facial landmarks as intermediate variables. This framework achieves collaborative alignment of expression, gaze, and pose with emotions through self-supervised learning. Specifically, we decompose this task into two key steps, namely speech-to-landmarks synthesis and landmarks-to-face generation. The first step focuses on simultaneously synthesizing emotionally aligned facial cues, including normalized landmarks that represent expressions, gaze, and head pose. These cues are subsequently reassembled into relocated facial landmarks. In the second step, these relocated landmarks are mapped to latent key points using self-supervised learning and then input into a pretrained model to create high-quality face images. Extensive experiments on the MEAD dataset demonstrate that our model significantly advances the state-of-the-art performance in both visual quality and emotional alignment.

6/13/2024

cs.CV

Controllable Talking Face Generation by Implicit Facial Keypoints Editing

Dong Zhao, Jiaying Shi, Wenjun Li, Shudong Wang, Shenghui Xu, Zhaoming Pan

Audio-driven talking face generation has garnered significant interest within the domain of digital human research. Existing methods are encumbered by intricate model architectures that are intricately dependent on each other, complicating the process of re-editing image or video inputs. In this work, we present ControlTalk, a talking face generation method to control face expression deformation based on driven audio, which can construct the head pose and facial expression including lip motion for both single image or sequential video inputs in a unified manner. By utilizing a pre-trained video synthesis renderer and proposing the lightweight adaptation, ControlTalk achieves precise and naturalistic lip synchronization while enabling quantitative control over mouth opening shape. Our experiments show that our method is superior to state-of-the-art performance on widely used benchmarks, including HDTF and MEAD. The parameterized adaptation demonstrates remarkable generalization capabilities, effectively handling expression deformation across same-ID and cross-ID scenarios, and extending its utility to out-of-domain portraits, regardless of languages.

6/6/2024

cs.CV cs.AI

CSTalk: Correlation Supervised Speech-driven 3D Emotional Facial Animation Generation

Xiangyu Liang, Wenlin Zhuang, Tianyong Wang, Guangxing Geng, Guangyue Geng, Haifeng Xia, Siyu Xia

Speech-driven 3D facial animation technology has been developed for years, but its practical application still lacks expectations. The main challenges lie in data limitations, lip alignment, and the naturalness of facial expressions. Although lip alignment has seen many related studies, existing methods struggle to synthesize natural and realistic expressions, resulting in a mechanical and stiff appearance of facial animations. Even with some research extracting emotional features from speech, the randomness of facial movements limits the effective expression of emotions. To address this issue, this paper proposes a method called CSTalk (Correlation Supervised) that models the correlations among different regions of facial movements and supervises the training of the generative model to generate realistic expressions that conform to human facial motion patterns. To generate more intricate animations, we employ a rich set of control parameters based on the metahuman character model and capture a dataset for five different emotions. We train a generative network using an autoencoder structure and input an emotion embedding vector to achieve the generation of user-control expressions. Experimental results demonstrate that our method outperforms existing state-of-the-art methods.

4/30/2024

cs.CV cs.AI