GLDiTalker: Speech-Driven 3D Facial Animation with Graph Latent Diffusion Transformer

Read original: arXiv:2408.01826 - Published 8/19/2024 by Yihong Lin, Zhaoxin Fan, Lingyu Xiong, Liang Peng, Xiandong Li, Wenxiong Kang, Xianjia Wu, Songju Lei, Huang Xu

Overview

This paper presents GLDiTalker, a system for generating 3D facial animations from speech input.
The key innovation is the use of a Graph Latent Diffusion Transformer to model the complex dynamics of facial movements.
The system can produce high-quality 3D facial animations that are synchronized with the input speech.

Plain English Explanation

The researchers have developed a system called GLDiTalker that can create 3D animations of a person's face based solely on their speech. This could be useful for applications like video conferencing, animation, or virtual assistants.

The core of the system is a machine learning model that has been trained on examples of facial movements and the corresponding speech. When given new speech input, the model can predict how the face should move and animate accordingly.

The key innovation is the use of a "Graph Latent Diffusion Transformer". This allows the model to capture the dynamic relationships between different parts of the face as they move in response to speech. According to the authors, earlier autoregressive and diffusion-based approaches often suffer from a misalignment between the audio and mesh modalities, which hurts both motion diversity and lip-sync accuracy.

The end result is a system that can generate realistic 3D facial animations that are tightly synchronized with the input audio. This could have applications in creating more natural-looking virtual avatars, dubbing foreign language films, or assisting people with speech disabilities.

Technical Explanation

The core of GLDiTalker is a Graph Latent Diffusion Transformer (GLDT) model that learns to map speech features to 3D facial movements. The GLDT models the face as a graph structure, with nodes representing different facial landmarks and edges capturing the relationships between them.
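The graph view of the face can be illustrated with a minimal sketch. The four-node "mouth" graph, the node positions, and the helper names `build_adjacency` and `mean_aggregate` are all hypothetical and chosen for illustration; the mean aggregation shown here is a generic message-passing step, not the layer used in the paper.

```python
from typing import Dict, List, Tuple

Vec3 = Tuple[float, float, float]


def build_adjacency(num_nodes: int, edges: List[Tuple[int, int]]) -> Dict[int, List[int]]:
    """Build an undirected adjacency list; every node also lists itself (self-loop)."""
    adj = {i: [i] for i in range(num_nodes)}
    for a, b in edges:
        adj[a].append(b)
        adj[b].append(a)
    return adj


def mean_aggregate(features: List[Vec3], adj: Dict[int, List[int]]) -> List[Vec3]:
    """One round of message passing: each node takes the mean over its neighbourhood."""
    out = []
    for i in range(len(features)):
        nbrs = adj[i]
        out.append(tuple(sum(features[j][k] for j in nbrs) / len(nbrs) for k in range(3)))
    return out


# Hypothetical graph: left corner, upper lip, right corner, lower lip (assumed layout).
positions = [(-1.0, 0.0, 0.0), (0.0, 0.5, 0.0), (1.0, 0.0, 0.0), (0.0, -0.5, 0.0)]
edges = [(0, 1), (1, 2), (2, 3), (3, 0)]

adj = build_adjacency(len(positions), edges)
smoothed = mean_aggregate(positions, adj)
```

Each node's updated feature mixes in information from its neighbours, which is the basic mechanism that lets a graph-based model relate nearby regions of the face to one another.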

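The GLDT uses a diffusion process to progressively refine the predicted facial movements, starting from a noisy sample and iteratively denoising it. According to the paper's abstract, this diffusion runs in a latent quantized spatial-temporal space rather than directly on the mesh, and training proceeds in two stages: a Graph Enhanced Quantilized Space Learning Stage, which the authors credit with lip-sync accuracy, and a Space-Time Powered Latent Diffusion Stage, which they credit with motion diversity.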
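The iterative denoising idea can be sketched as a toy reverse-diffusion loop. This is a generic deterministic (DDIM-style) sampler under stated assumptions, not the paper's implementation: the linear noise schedule, the step count, the function names, and the stand-in `denoiser` (which simply returns a fixed hypothetical latent in place of a speech-conditioned network) are all illustrative.

```python
import math
import random
from typing import Callable, List


def make_alpha_bars(num_steps: int, beta_start: float = 1e-4, beta_end: float = 0.02) -> List[float]:
    """Cumulative products of (1 - beta_t) for a linear schedule; index 0 is 1.0 (no noise)."""
    alpha_bars = [1.0]
    for t in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * t / max(num_steps - 1, 1)
        alpha_bars.append(alpha_bars[-1] * (1.0 - beta))
    return alpha_bars


def ddim_sample(denoiser: Callable[[List[float], int], List[float]],
                dim: int, num_steps: int, seed: int = 0) -> List[float]:
    """Start from Gaussian noise and denoise step by step, from t = num_steps down to 1."""
    rng = random.Random(seed)
    alpha_bars = make_alpha_bars(num_steps)
    x = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    for t in range(num_steps, 0, -1):
        ab_t, ab_prev = alpha_bars[t], alpha_bars[t - 1]
        x0_hat = denoiser(x, t)  # the model's current guess of the clean latent
        # Noise implied by the current sample and the clean-latent guess.
        eps_hat = [(xi - math.sqrt(ab_t) * x0i) / math.sqrt(1.0 - ab_t)
                   for xi, x0i in zip(x, x0_hat)]
        # Re-noise the guess to the (lower) noise level of step t - 1.
        x = [math.sqrt(ab_prev) * x0i + math.sqrt(1.0 - ab_prev) * ei
             for x0i, ei in zip(x0_hat, eps_hat)]
    return x


# Hypothetical target latent; a real system would predict it from speech features.
target = [0.2, -0.4, 0.1]
sample = ddim_sample(lambda x, t: target, dim=len(target), num_steps=50)
```

Because the stand-in denoiser always returns the same guess, the loop converges to that latent; with a trained network, the guess would change at every step as the sample becomes less noisy.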

The researchers evaluate GLDiTalker on several widely used benchmarks that pair 3D facial motion with speech audio. They report that GLDiTalker generates high-quality 3D facial animations that are well aligned with the input speech, outperforming existing speech-driven facial animation methods.

Critical Analysis

The paper provides a thorough technical description of the GLDiTalker system and presents compelling results. However, it does not discuss some potential limitations or areas for future work.

For example, the system was trained and evaluated on a limited set of speakers, so it's unclear how well it would generalize to a wider range of voices and speaking styles. Additionally, the 3D facial animations, while realistic, may still have room for improvement in terms of capturing more subtle nuances of facial expressions.

It would also be interesting to see how GLDiTalker compares to other recent advances in speech-driven facial animation, such as methods that leverage generative adversarial networks or self-supervised learning. A more extensive comparative evaluation could further highlight the strengths and weaknesses of the proposed approach.

Conclusion

The GLDiTalker system represents an important advancement in speech-driven 3D facial animation. By leveraging a Graph Latent Diffusion Transformer, the researchers have developed a model that can generate highly realistic and synchronized facial animations from speech input alone.

This technology could have significant implications for a wide range of applications, from virtual avatars and digital dubbing to assistive technologies for individuals with speech-related disabilities. As the field of speech-driven facial animation continues to evolve, GLDiTalker's innovative approach provides a valuable contribution and a promising direction for future research.

This summary was produced with help from an AI and may contain inaccuracies; consult the original paper (arXiv:2408.01826) for details.

Related Papers

GLDiTalker: Speech-Driven 3D Facial Animation with Graph Latent Diffusion Transformer

Yihong Lin, Zhaoxin Fan, Lingyu Xiong, Liang Peng, Xiandong Li, Wenxiong Kang, Xianjia Wu, Songju Lei, Huang Xu

Speech-driven talking head generation is an important but challenging task for many downstream applications such as augmented reality. Existing methods have achieved remarkable performance by utilizing autoregressive models or diffusion models. However, most still suffer from modality inconsistencies, specifically the misalignment between audio and mesh modalities, which causes inconsistencies in motion diversity and lip-sync accuracy. To address this issue, this paper introduces GLDiTalker, a novel speech-driven 3D facial animation model that employs a Graph Latent Diffusion Transformer. The core idea behind GLDiTalker is that the audio-mesh modality misalignment can be resolved by diffusing the signal in a latent quantilized spatial-temporal space. To achieve this, GLDiTalker builds upon a quantilized space-time diffusion training pipeline, which consists of a Graph Enhanced Quantilized Space Learning Stage and a Space-Time Powered Latent Diffusion Stage. The first stage ensures lip-sync accuracy, while the second stage enhances motion diversity. Together, these stages enable GLDiTalker to generate temporally and spatially stable, realistic models. Extensive evaluations on several widely used benchmarks demonstrate that our method achieves superior performance compared to existing methods.

8/19/2024

DiffPoseTalk: Speech-Driven Stylistic 3D Facial Animation and Head Pose Generation via Diffusion Models

Zhiyao Sun, Tian Lv, Sheng Ye, Matthieu Lin, Jenny Sheng, Yu-Hui Wen, Minjing Yu, Yong-Jin Liu

The generation of stylistic 3D facial animations driven by speech presents a significant challenge as it requires learning a many-to-many mapping between speech, style, and the corresponding natural facial motion. However, existing methods either employ a deterministic model for speech-to-motion mapping or encode the style using a one-hot encoding scheme. Notably, the one-hot encoding approach fails to capture the complexity of the style and thus limits generalization ability. In this paper, we propose DiffPoseTalk, a generative framework based on the diffusion model combined with a style encoder that extracts style embeddings from short reference videos. During inference, we employ classifier-free guidance to guide the generation process based on the speech and style. In particular, our style includes the generation of head poses, thereby enhancing user perception. Additionally, we address the shortage of scanned 3D talking face data by training our model on reconstructed 3DMM parameters from a high-quality, in-the-wild audio-visual dataset. Extensive experiments and user study demonstrate that our approach outperforms state-of-the-art methods. The code and dataset are at https://diffposetalk.github.io .

5/15/2024

SwapTalk: Audio-Driven Talking Face Generation with One-Shot Customization in Latent Space

Zeren Zhang, Haibo Qin, Jiayu Huang, Yixin Li, Hui Lin, Yitao Duan, Jinwen Ma

Combining face swapping with lip synchronization technology offers a cost-effective solution for customized talking face generation. However, directly cascading existing models together tends to introduce significant interference between tasks and reduce video clarity because the interaction space is limited to the low-level semantic RGB space. To address this issue, we propose an innovative unified framework, SwapTalk, which accomplishes both face swapping and lip synchronization tasks in the same latent space. Referring to recent work on face generation, we choose the VQ-embedding space due to its excellent editability and fidelity performance. To enhance the framework's generalization capabilities for unseen identities, we incorporate identity loss during the training of the face swapping module. Additionally, we introduce expert discriminator supervision within the latent space during the training of the lip synchronization module to elevate synchronization quality. In the evaluation phase, previous studies primarily focused on the self-reconstruction of lip movements in synchronous audio-visual videos. To better approximate real-world applications, we expand the evaluation scope to asynchronous audio-video scenarios. Furthermore, we introduce a novel identity consistency metric to more comprehensively assess the identity consistency over time series in generated facial videos. Experimental results on the HDTF demonstrate that our method significantly surpasses existing techniques in video quality, lip synchronization accuracy, face swapping fidelity, and identity consistency. Our demo is available at http://swaptalk.cc.

5/10/2024

GaussianTalker: Speaker-specific Talking Head Synthesis via 3D Gaussian Splatting

Hongyun Yu, Zhan Qu, Qihang Yu, Jianchuan Chen, Zhonghua Jiang, Zhiwen Chen, Shengyu Zhang, Jimin Xu, Fei Wu, Chengfei Lv, Gang Yu

Recent works on audio-driven talking head synthesis using Neural Radiance Fields (NeRF) have achieved impressive results. However, due to inadequate pose and expression control caused by NeRF implicit representation, these methods still have some limitations, such as unsynchronized or unnatural lip movements, and visual jitter and artifacts. In this paper, we propose GaussianTalker, a novel method for audio-driven talking head synthesis based on 3D Gaussian Splatting. With the explicit representation property of 3D Gaussians, intuitive control of the facial motion is achieved by binding Gaussians to 3D facial models. GaussianTalker consists of two modules, Speaker-specific Motion Translator and Dynamic Gaussian Renderer. Speaker-specific Motion Translator achieves accurate lip movements specific to the target speaker through universalized audio feature extraction and customized lip motion generation. Dynamic Gaussian Renderer introduces Speaker-specific BlendShapes to enhance facial detail representation via a latent pose, delivering stable and realistic rendered videos. Extensive experimental results suggest that GaussianTalker outperforms existing state-of-the-art methods in talking head synthesis, delivering precise lip synchronization and exceptional visual quality. Our method achieves rendering speeds of 130 FPS on NVIDIA RTX4090 GPU, significantly exceeding the threshold for real-time rendering performance, and can potentially be deployed on other hardware platforms.

8/12/2024