Learn2Talk: 3D Talking Face Learns from 2D Talking Face

2404.12888

Published 4/22/2024 by Yixiang Zhuang, Baoping Cheng, Yao Cheng, Yuntao Jin, Renshuai Liu, Chengyang Li, Xuan Cheng, Jing Liao, Juncong Lin

cs.CV cs.GR cs.LG

Learn2Talk: 3D Talking Face Learns from 2D Talking Face

Abstract

Speech-driven facial animation methods usually contain two main classes, 3D and 2D talking face, both of which attract considerable research attention in recent years. However, to the best of our knowledge, the research on 3D talking face does not go deeper as 2D talking face, in the aspect of lip-synchronization (lip-sync) and speech perception. To mind the gap between the two sub-fields, we propose a learning framework named Learn2Talk, which can construct a better 3D talking face network by exploiting two expertise points from the field of 2D talking face. Firstly, inspired by the audio-video sync network, a 3D sync-lip expert model is devised for the pursuit of lip-sync between audio and 3D facial motion. Secondly, a teacher model selected from 2D talking face methods is used to guide the training of the audio-to-3D motions regression network to yield more 3D vertex accuracy. Extensive experiments show the advantages of the proposed framework in terms of lip-sync, vertex accuracy and speech perception, compared with state-of-the-arts. Finally, we show two applications of the proposed framework: audio-visual speech recognition and speech-driven 3D Gaussian Splatting based avatar animation.

Get summaries of the top AI research delivered straight to your inbox:

Overview

• This paper proposes a novel approach called "Learn2Talk" that enables the synthesis of high-fidelity 3D talking faces from 2D talking faces. • The model leverages a Transformer-based architecture and 3D Gaussian splatting to generate realistic 3D facial animations driven by speech. • The authors demonstrate the effectiveness of their method on various datasets and compare it to state-of-the-art approaches.

Plain English Explanation

The paper presents a new way to create realistic 3D animations of a person's face moving and speaking, using only 2D video footage of the person as the starting point. The key idea is to train a machine learning model, called "Learn2Talk," to learn the connection between the 2D facial movements in the video and the corresponding 3D facial movements.

Once the model is trained, it can then take a new 2D video of someone speaking and generate a high-quality 3D animation of their face moving and speaking in sync with the audio. This is a significant advancement, as previous methods required complex setups with specialized 3D scanning equipment to create these types of 3D talking face animations.

By using only 2D video as input, the "Learn2Talk" approach is more accessible and can be applied to a wider range of scenarios, such as creating digital avatars, enhancing video conferencing, or generating synthetic media. The use of a Transformer-based architecture and 3D Gaussian splatting allows the model to capture the nuanced details of facial movements and produce lifelike 3D talking face animations.

Technical Explanation

The core of the "Learn2Talk" approach is a Transformer-based neural network that learns to map 2D facial landmarks extracted from video footage to the corresponding 3D facial geometry. The model takes as input a sequence of 2D facial landmarks and predicts the 3D facial mesh that should be animated.

To achieve high-fidelity 3D facial animations, the authors employ a 3D Gaussian splatting technique, which allows the model to generate smooth and realistic 3D facial movements. The Gaussian splatting process involves representing each vertex in the 3D facial mesh as a 3D Gaussian distribution, which can then be efficiently rendered to produce the final 3D talking face animation.

The authors train and evaluate their "Learn2Talk" model on several datasets, including the Talk3D dataset, the AudioIs-All-One dataset, and the EDTalk dataset. Their results demonstrate the superior performance of "Learn2Talk" compared to state-of-the-art approaches, such as VASA and Neural Sign Actors, in terms of both visual quality and alignment with the input audio.

Critical Analysis

The paper presents a well-designed and carefully evaluated approach to generating high-fidelity 3D talking face animations from 2D video inputs. The authors acknowledge that their method relies on the availability of 2D facial landmark data, which may not always be readily available or accurately extracted, especially in challenging real-world scenarios.

Furthermore, the paper does not explore the potential limitations or biases that may arise from the training datasets used, which could affect the model's performance on diverse populations or under different environmental conditions. Investigating the robustness and generalizability of the "Learn2Talk" approach to a wider range of settings and use cases would be a valuable direction for future research.

Additionally, the authors do not discuss the computational efficiency or real-time capabilities of their method, which are important considerations for practical applications such as video conferencing or interactive digital avatars.

Conclusion

The "Learn2Talk" approach represents a significant advancement in the field of speech-driven 3D facial animation. By leveraging 2D video inputs and a Transformer-based architecture with 3D Gaussian splatting, the model can generate highly realistic 3D talking face animations that align closely with the input audio.

This technology has the potential to enable a wide range of applications, from enhancing video communication to creating compelling digital avatars and synthetic media. As the field of speech-driven animation continues to evolve, the insights and techniques presented in this paper can serve as a valuable foundation for further research and development.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

🛸

SwapTalk: Audio-Driven Talking Face Generation with One-Shot Customization in Latent Space

Zeren Zhang, Haibo Qin, Jiayu Huang, Yixin Li, Hui Lin, Yitao Duan, Jinwen Ma

Combining face swapping with lip synchronization technology offers a cost-effective solution for customized talking face generation. However, directly cascading existing models together tends to introduce significant interference between tasks and reduce video clarity because the interaction space is limited to the low-level semantic RGB space. To address this issue, we propose an innovative unified framework, SwapTalk, which accomplishes both face swapping and lip synchronization tasks in the same latent space. Referring to recent work on face generation, we choose the VQ-embedding space due to its excellent editability and fidelity performance. To enhance the framework's generalization capabilities for unseen identities, we incorporate identity loss during the training of the face swapping module. Additionally, we introduce expert discriminator supervision within the latent space during the training of the lip synchronization module to elevate synchronization quality. In the evaluation phase, previous studies primarily focused on the self-reconstruction of lip movements in synchronous audio-visual videos. To better approximate real-world applications, we expand the evaluation scope to asynchronous audio-video scenarios. Furthermore, we introduce a novel identity consistency metric to more comprehensively assess the identity consistency over time series in generated facial videos. Experimental results on the HDTF demonstrate that our method significantly surpasses existing techniques in video quality, lip synchronization accuracy, face swapping fidelity, and identity consistency. Our demo is available at http://swaptalk.cc.

5/10/2024

cs.CV cs.AI

Talk3D: High-Fidelity Talking Portrait Synthesis via Personalized 3D Generative Prior

Jaehoon Ko, Kyusun Cho, Joungbin Lee, Heeji Yoon, Sangmin Lee, Sangjun Ahn, Seungryong Kim

Recent methods for audio-driven talking head synthesis often optimize neural radiance fields (NeRF) on a monocular talking portrait video, leveraging its capability to render high-fidelity and 3D-consistent novel-view frames. However, they often struggle to reconstruct complete face geometry due to the absence of comprehensive 3D information in the input monocular videos. In this paper, we introduce a novel audio-driven talking head synthesis framework, called Talk3D, that can faithfully reconstruct its plausible facial geometries by effectively adopting the pre-trained 3D-aware generative prior. Given the personalized 3D generative model, we present a novel audio-guided attention U-Net architecture that predicts the dynamic face variations in the NeRF space driven by audio. Furthermore, our model is further modulated by audio-unrelated conditioning tokens which effectively disentangle variations unrelated to audio features. Compared to existing methods, our method excels in generating realistic facial geometries even under extreme head poses. We also conduct extensive experiments showing our approach surpasses state-of-the-art benchmarks in terms of both quantitative and qualitative evaluations.

4/1/2024

cs.CV

👨‍🏫

CSTalk: Correlation Supervised Speech-driven 3D Emotional Facial Animation Generation

Xiangyu Liang, Wenlin Zhuang, Tianyong Wang, Guangxing Geng, Guangyue Geng, Haifeng Xia, Siyu Xia

Speech-driven 3D facial animation technology has been developed for years, but its practical application still lacks expectations. The main challenges lie in data limitations, lip alignment, and the naturalness of facial expressions. Although lip alignment has seen many related studies, existing methods struggle to synthesize natural and realistic expressions, resulting in a mechanical and stiff appearance of facial animations. Even with some research extracting emotional features from speech, the randomness of facial movements limits the effective expression of emotions. To address this issue, this paper proposes a method called CSTalk (Correlation Supervised) that models the correlations among different regions of facial movements and supervises the training of the generative model to generate realistic expressions that conform to human facial motion patterns. To generate more intricate animations, we employ a rich set of control parameters based on the metahuman character model and capture a dataset for five different emotions. We train a generative network using an autoencoder structure and input an emotion embedding vector to achieve the generation of user-control expressions. Experimental results demonstrate that our method outperforms existing state-of-the-art methods.

4/30/2024

cs.CV cs.AI

GaussianTalker: Speaker-specific Talking Head Synthesis via 3D Gaussian Splatting

Hongyun Yu, Zhan Qu, Qihang Yu, Jianchuan Chen, Zhonghua Jiang, Zhiwen Chen, Shengyu Zhang, Jimin Xu, Fei Wu, Chengfei Lv, Gang Yu

Recent works on audio-driven talking head synthesis using Neural Radiance Fields (NeRF) have achieved impressive results. However, due to inadequate pose and expression control caused by NeRF implicit representation, these methods still have some limitations, such as unsynchronized or unnatural lip movements, and visual jitter and artifacts. In this paper, we propose GaussianTalker, a novel method for audio-driven talking head synthesis based on 3D Gaussian Splatting. With the explicit representation property of 3D Gaussians, intuitive control of the facial motion is achieved by binding Gaussians to 3D facial models. GaussianTalker consists of two modules, Speaker-specific Motion Translator and Dynamic Gaussian Renderer. Speaker-specific Motion Translator achieves accurate lip movements specific to the target speaker through universalized audio feature extraction and customized lip motion generation. Dynamic Gaussian Renderer introduces Speaker-specific BlendShapes to enhance facial detail representation via a latent pose, delivering stable and realistic rendered videos. Extensive experimental results suggest that GaussianTalker outperforms existing state-of-the-art methods in talking head synthesis, delivering precise lip synchronization and exceptional visual quality. Our method achieves rendering speeds of 130 FPS on NVIDIA RTX4090 GPU, significantly exceeding the threshold for real-time rendering performance, and can potentially be deployed on other hardware platforms.

4/30/2024

cs.CV cs.MM