GaussianTalker: Speaker-specific Talking Head Synthesis via 3D Gaussian Splatting

2404.14037

Published 4/30/2024 by Hongyun Yu, Zhan Qu, Qihang Yu, Jianchuan Chen, Zhonghua Jiang, Zhiwen Chen, Shengyu Zhang, Jimin Xu, Fei Wu, Chengfei Lv and 1 other

cs.CV cs.MM

Abstract

Recent works on audio-driven talking head synthesis using Neural Radiance Fields (NeRF) have achieved impressive results. However, due to inadequate pose and expression control caused by NeRF implicit representation, these methods still have some limitations, such as unsynchronized or unnatural lip movements, and visual jitter and artifacts. In this paper, we propose GaussianTalker, a novel method for audio-driven talking head synthesis based on 3D Gaussian Splatting. With the explicit representation property of 3D Gaussians, intuitive control of the facial motion is achieved by binding Gaussians to 3D facial models. GaussianTalker consists of two modules, Speaker-specific Motion Translator and Dynamic Gaussian Renderer. Speaker-specific Motion Translator achieves accurate lip movements specific to the target speaker through universalized audio feature extraction and customized lip motion generation. Dynamic Gaussian Renderer introduces Speaker-specific BlendShapes to enhance facial detail representation via a latent pose, delivering stable and realistic rendered videos. Extensive experimental results suggest that GaussianTalker outperforms existing state-of-the-art methods in talking head synthesis, delivering precise lip synchronization and exceptional visual quality. Our method achieves rendering speeds of 130 FPS on NVIDIA RTX4090 GPU, significantly exceeding the threshold for real-time rendering performance, and can potentially be deployed on other hardware platforms.

Overview

This paper introduces GaussianTalker, a method for synthesizing speaker-specific 3D talking head animations from audio input.
It uses a 3D Gaussian splatting approach to generate high-fidelity talking head videos that accurately capture the speaker's identity and facial movements.
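The method has two modules: a Speaker-specific Motion Translator, which combines universalized audio feature extraction with lip motion generation customized to the target speaker, and a Dynamic Gaussian Renderer, which uses Speaker-specific BlendShapes to enhance facial detail.
The 3D Gaussians are bound to a 3D facial model, which gives direct control over facial motion, and the abstract reports rendering at 130 FPS on an NVIDIA RTX 4090 GPU.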

Plain English Explanation

GaussianTalker is a system that can create realistic 3D animations of a person talking, starting from just an audio recording of their voice. It works by first analyzing the audio to understand the movements of the person's mouth and face. Then, it uses that information to generate a 3D model of the person's head that moves realistically, matching the audio.

The key innovation is the use of "Gaussian splatting" to create the 3D animations. This means that instead of trying to model every individual feature of the face, GaussianTalker represents the face as a series of 3D "blobs" or Gaussian distributions. This allows it to capture the overall shape and movement of the face in a more efficient and natural way.
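To make the "blob" idea concrete, here is a minimal sketch of a single Gaussian primitive. The class name `Gaussian3D`, its fields, and the `weight` method are illustrative assumptions, not names from the paper. Real 3D Gaussian Splatting systems typically store a full rotation and scale per Gaussian; this sketch keeps the axes aligned for brevity.

```python
import math
from dataclasses import dataclass


@dataclass
class Gaussian3D:
    """One explicit primitive: a center, per-axis scales, a color, and an opacity."""
    center: tuple   # (x, y, z) position
    scale: tuple    # (sx, sy, sz) standard deviation along each axis
    color: tuple    # (r, g, b), each in [0, 1]
    opacity: float  # in [0, 1]

    def weight(self, point):
        """Opacity-scaled, unnormalized falloff of this Gaussian at a 3D point."""
        d2 = sum(((p - c) / s) ** 2
                 for p, c, s in zip(point, self.center, self.scale))
        return self.opacity * math.exp(-0.5 * d2)


# Hypothetical blob, stretched along z.
blob = Gaussian3D(center=(0.0, 0.0, 0.0), scale=(1.0, 1.0, 2.0),
                  color=(0.8, 0.6, 0.5), opacity=0.9)
at_center = blob.weight((0.0, 0.0, 0.0))   # strongest contribution
off_center = blob.weight((1.0, 0.0, 0.0))  # one standard deviation away along x
```

Because every parameter is stored explicitly, moving or reshaping the face amounts to editing these numbers directly, which is the property the abstract contrasts with NeRF's implicit representation.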

GaussianTalker is organized into two parts. The first, the Speaker-specific Motion Translator, turns audio into lip movements tailored to the target speaker. The second, the Dynamic Gaussian Renderer, uses Speaker-specific BlendShapes to render the moving face with stable, realistic detail. Together they generate high-quality, speaker-specific talking head animations driven by audio.

Technical Explanation

The core of the GaussianTalker approach is the use of 3D Gaussian splatting to represent the speaker's face and model its deformation over time. This allows the system to capture the overall shape and movement of the face in an efficient and natural way, without needing to explicitly model every individual facial feature.
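The abstract says control is achieved by binding Gaussians to a 3D facial model. One common way to bind a point to a mesh, shown below as a sketch, is to fix its barycentric coordinates on a triangle so that it follows the triangle when the vertices move. Whether the paper uses exactly this scheme is an assumption; `bound_position` and the sample triangle are hypothetical.

```python
def bound_position(triangle, bary):
    """Return the point on a triangle given barycentric weights that sum to 1."""
    return tuple(
        sum(w * v[axis] for w, v in zip(bary, triangle))
        for axis in range(3)
    )


# Hypothetical rest-pose triangle on the face mesh, with a Gaussian at its centroid.
rest_triangle = [(0.0, 0.0, 0.0), (3.0, 0.0, 0.0), (0.0, 3.0, 0.0)]
bary = (1 / 3, 1 / 3, 1 / 3)
rest_center = bound_position(rest_triangle, bary)

# The mesh deforms (say the jaw opens): every vertex shifts down by 0.6 along y.
moved_triangle = [(x, y - 0.6, z) for x, y, z in rest_triangle]
moved_center = bound_position(moved_triangle, bary)
```

The Gaussian's center is never animated directly. It is recomputed from the mesh, so any motion applied to the facial model carries the bound Gaussians with it.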

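According to the abstract, the Speaker-specific Motion Translator first converts the input audio into lip motion for the target speaker, combining universalized audio feature extraction with customized lip motion generation. That motion drives a 3D facial model, and because the Gaussians are bound to the model, they follow its movement.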

The Dynamic Gaussian Renderer then renders the animated Gaussians. It introduces Speaker-specific BlendShapes, which enhance facial detail via a latent pose, and the abstract credits this with stable, realistic video output. The result is a high-fidelity, speaker-specific talking head video.
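The abstract mentions Speaker-specific BlendShapes without spelling out their form. As background, the standard linear blendshape model adds weighted per-vertex offsets to a neutral shape. The sketch below shows only that generic formula; the function name, the toy shapes, and the weights are assumptions, and the paper's latent-pose formulation may differ.

```python
def apply_blendshapes(base_vertices, deltas, weights):
    """Linear blendshape model: base + sum_i weights[i] * deltas[i], per vertex."""
    result = []
    for v_idx, base in enumerate(base_vertices):
        moved = list(base)
        for delta, w in zip(deltas, weights):
            for axis in range(3):
                moved[axis] += w * delta[v_idx][axis]
        result.append(tuple(moved))
    return result


# Hypothetical two-vertex "lip" and two blendshapes (jaw open, lip pucker).
base = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
jaw_open = [(0.0, -1.0, 0.0), (0.0, -1.0, 0.0)]
pucker = [(0.5, 0.0, 0.0), (-0.5, 0.0, 0.0)]

posed = apply_blendshapes(base, [jaw_open, pucker], weights=[0.5, 0.2])
```

With all weights at zero the function returns the neutral shape, so the weights act as a compact set of controls that an audio-driven module could predict frame by frame.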

Critical Analysis

The GaussianTalker approach offers several advantages over previous work in talking head synthesis, such as its ability to capture speaker-specific nuances and generate high-quality 3D animations from audio input alone. However, the paper also acknowledges some limitations and areas for future research.

One potential limitation follows from the speaker-specific design. Because lip motion generation and the BlendShapes are customized to a target speaker, the method presumably has to be fitted to each new speaker before it can animate them, and the abstract does not say how much data or training time that takes.

On runtime, the abstract reports rendering at 130 FPS on an NVIDIA RTX 4090 GPU, well above real-time rates. It does not give figures for other hardware, saying only that the method can potentially be deployed on other platforms, which could be an important practical consideration for real-world applications.
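For a sense of scale, the 130 FPS figure quoted in the abstract can be converted into a per-frame time budget. The 25 FPS real-time target below is an assumed video frame rate for comparison, not a number from the paper.

```python
def frame_time_ms(fps):
    """Average time available per frame, in milliseconds."""
    return 1000.0 / fps


reported_fps = 130.0  # figure quoted in the abstract (NVIDIA RTX 4090)
target_fps = 25.0     # assumed real-time video rate, not taken from the paper

reported_ms = frame_time_ms(reported_fps)  # roughly 7.7 ms per frame
budget_ms = frame_time_ms(target_fps)      # 40 ms per frame
headroom = reported_fps / target_fps       # about 5.2x faster than the target
```

Under that assumption, rendering uses under a fifth of the per-frame budget, which leaves room for slower hardware or for the audio processing stage.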

Future research could explore ways to further improve the realism and fidelity of the generated talking head animations, such as by incorporating more advanced facial modeling techniques or leveraging additional modalities (e.g., video) to constrain the animation process.

Conclusion

The GaussianTalker system presents a novel approach to speaker-specific talking head synthesis that leverages 3D Gaussian splatting to generate high-quality animations driven by audio. By pairing a Speaker-specific Motion Translator with a Dynamic Gaussian Renderer, and by binding Gaussians to a 3D facial model, GaussianTalker is able to capture the unique characteristics of individual speakers in a realistic and efficient manner.

While the paper acknowledges some limitations and areas for further research, the GaussianTalker technique represents an important advancement in talking head synthesis, with potential uses in virtual avatars, dubbing, and other multimedia settings where realistic, speaker-specific animations are desired.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

GaussianTalker: Real-Time High-Fidelity Talking Head Synthesis with Audio-Driven 3D Gaussian Splatting

Kyusun Cho, Joungbin Lee, Heeji Yoon, Yeobin Hong, Jaehoon Ko, Sangjun Ahn, Seungryong Kim

We propose GaussianTalker, a novel framework for real-time generation of pose-controllable talking heads. It leverages the fast rendering capabilities of 3D Gaussian Splatting (3DGS) while addressing the challenges of directly controlling 3DGS with speech audio. GaussianTalker constructs a canonical 3DGS representation of the head and deforms it in sync with the audio. A key insight is to encode the 3D Gaussian attributes into a shared implicit feature representation, where it is merged with audio features to manipulate each Gaussian attribute. This design exploits the spatial-aware features and enforces interactions between neighboring points. The feature embeddings are then fed to a spatial-audio attention module, which predicts frame-wise offsets for the attributes of each Gaussian. It is more stable than previous concatenation or multiplication approaches for manipulating the numerous Gaussians and their intricate parameters. Experimental results showcase GaussianTalker's superiority in facial fidelity, lip synchronization accuracy, and rendering speed compared to previous methods. Specifically, GaussianTalker achieves a remarkable rendering speed up to 120 FPS, surpassing previous benchmarks. Our code is made available at https://github.com/KU-CVLAB/GaussianTalker/ .

4/26/2024

cs.CV cs.MM

TalkingGaussian: Structure-Persistent 3D Talking Head Synthesis via Gaussian Splatting

Jiahe Li, Jiawei Zhang, Xiao Bai, Jin Zheng, Xin Ning, Jun Zhou, Lin Gu

Radiance fields have demonstrated impressive performance in synthesizing lifelike 3D talking heads. However, due to the difficulty in fitting steep appearance changes, the prevailing paradigm that presents facial motions by directly modifying point appearance may lead to distortions in dynamic regions. To tackle this challenge, we introduce TalkingGaussian, a deformation-based radiance fields framework for high-fidelity talking head synthesis. Leveraging the point-based Gaussian Splatting, facial motions can be represented in our method by applying smooth and continuous deformations to persistent Gaussian primitives, without requiring to learn the difficult appearance change like previous methods. Due to this simplification, precise facial motions can be synthesized while keeping a highly intact facial feature. Under such a deformation paradigm, we further identify a face-mouth motion inconsistency that would affect the learning of detailed speaking motions. To address this conflict, we decompose the model into two branches separately for the face and inside mouth areas, therefore simplifying the learning tasks to help reconstruct more accurate motion and structure of the mouth region. Extensive experiments demonstrate that our method renders high-quality lip-synchronized talking head videos, with better facial fidelity and higher efficiency compared with previous methods.

4/24/2024

cs.CV

NeRFFaceSpeech: One-shot Audio-driven 3D Talking Head Synthesis via Generative Prior

Gihoon Kim, Kwanggyoon Seo, Sihun Cha, Junyong Noh

Audio-driven talking head generation is advancing from 2D to 3D content. Notably, Neural Radiance Field (NeRF) is in the spotlight as a means to synthesize high-quality 3D talking head outputs. Unfortunately, this NeRF-based approach typically requires a large number of paired audio-visual data for each identity, thereby limiting the scalability of the method. Although there have been attempts to generate audio-driven 3D talking head animations with a single image, the results are often unsatisfactory due to insufficient information on obscured regions in the image. In this paper, we mainly focus on addressing the overlooked aspect of 3D consistency in the one-shot, audio-driven domain, where facial animations are synthesized primarily in front-facing perspectives. We propose a novel method, NeRFFaceSpeech, which enables to produce high-quality 3D-aware talking head. Using prior knowledge of generative models combined with NeRF, our method can craft a 3D-consistent facial feature space corresponding to a single image. Our spatial synchronization method employs audio-correlated vertex dynamics of a parametric face model to transform static image features into dynamic visuals through ray deformation, ensuring realistic 3D facial motion. Moreover, we introduce LipaintNet that can replenish the lacking information in the inner-mouth area, which can not be obtained from a given single image. The network is trained in a self-supervised manner by utilizing the generative capabilities without additional data. The comprehensive experiments demonstrate the superiority of our method in generating audio-driven talking heads from a single image with enhanced 3D consistency compared to previous approaches. In addition, we introduce a quantitative way of measuring the robustness of a model against pose changes for the first time, which has been possible only qualitatively.

5/13/2024

cs.CV

GSTalker: Real-time Audio-Driven Talking Face Generation via Deformable Gaussian Splatting

Bo Chen, Shoukang Hu, Qi Chen, Chenpeng Du, Ran Yi, Yanmin Qian, Xie Chen

We present GSTalker, a 3D audio-driven talking face generation model with Gaussian Splatting for both fast training (40 minutes) and real-time rendering (125 FPS) with a 3 to 5 minute video for training material, in comparison with previous 2D and 3D NeRF-based modeling frameworks which require hours of training and seconds of rendering per frame. Specifically, GSTalker learns an audio-driven Gaussian deformation field to translate and transform 3D Gaussians to synchronize with audio information, in which multi-resolution hashing grid-based tri-plane and temporal smooth module are incorporated to learn accurate deformation for fine-grained facial details. In addition, a pose-conditioned deformation field is designed to model the stabilized torso. To enable efficient optimization of the condition Gaussian deformation field, we initialize 3D Gaussians by learning a coarse static Gaussian representation. Extensive experiments in person-specific videos with audio tracks validate that GSTalker can generate high-fidelity and audio-lips synchronized results with fast training and real-time rendering speed.

5/1/2024

cs.CV