Talk3D: High-Fidelity Talking Portrait Synthesis via Personalized 3D Generative Prior

2403.20153

Published 4/1/2024 by Jaehoon Ko, Kyusun Cho, Joungbin Lee, Heeji Yoon, Sangmin Lee, Sangjun Ahn, Seungryong Kim

Talk3D: High-Fidelity Talking Portrait Synthesis via Personalized 3D Generative Prior

Abstract

Recent methods for audio-driven talking head synthesis often optimize neural radiance fields (NeRF) on a monocular talking portrait video, leveraging its capability to render high-fidelity and 3D-consistent novel-view frames. However, they often struggle to reconstruct complete face geometry due to the absence of comprehensive 3D information in the input monocular videos. In this paper, we introduce a novel audio-driven talking head synthesis framework, called Talk3D, that can faithfully reconstruct its plausible facial geometries by effectively adopting the pre-trained 3D-aware generative prior. Given the personalized 3D generative model, we present a novel audio-guided attention U-Net architecture that predicts the dynamic face variations in the NeRF space driven by audio. Furthermore, our model is further modulated by audio-unrelated conditioning tokens which effectively disentangle variations unrelated to audio features. Compared to existing methods, our method excels in generating realistic facial geometries even under extreme head poses. We also conduct extensive experiments showing our approach surpasses state-of-the-art benchmarks in terms of both quantitative and qualitative evaluations.

Get summaries of the top AI research delivered straight to your inbox:

Introduction

This paper presents a novel audio-driven talking head synthesis framework called Talk3D. The task of audio-driven talking portrait synthesis aims to generate a video of a human face with lip movements synchronized to the input audio. This poses several challenges, such as accurately capturing phonemes, generating realistic facial dynamics, and achieving high-fidelity facial image synthesis.

Prior approaches have used 2D generative models to reconstruct lip motion, but they often struggle with controlling the head pose. Recent studies have integrated neural radiance fields (NeRF) to leverage its capability in rendering realistic and multi-view consistent images. However, directly constructing dynamic facial NeRF from monocular videos remains challenging due to the lack of diverse head poses and 3D information.

To address these limitations, the paper proposes leveraging 3D-aware generative priors, which can generate high-fidelity 3D-aware images. The key strategies are: 1) using a 3D-aware GAN prior for realistic geometry, and 2) directly editing the NeRF space beyond the GAN latent space. Specifically, the model predicts a deltaplane offset to the personalized triplane, which represents the precise lip movements conditioned on the input audio. Additionally, the architecture employs an attention-based module and accommodates extra features to disentangle local facial variations, improving lip-sync accuracy.

The paper claims that the proposed Talk3D framework demonstrates state-of-the-art talking head generation results through extensive experiments.

Related Work

The text discusses the challenges of audio-driven talking portrait synthesis, which aims to generate lip motions synchronized with audio while maintaining realistic facial structure and head poses. Early deep learning methods used 2D generative adversarial networks (GANs) to synthesize audio-synchronized lip motions but lacked control over head poses. Subsequent works attempted pose control using facial landmarks and 3D facial models, but this led to information loss. AD-NeRF first incorporated neural radiance fields (NeRF) to address the challenges of 3D head structure. Later works like SSP-NeRF, RAD-NeRF, and ER-NeRF made improvements in visual quality and efficiency using techniques like semantic sampling and Instant-NGP. However, these NeRF-based methods struggle to achieve multi-view consistency and realistic geometries, as they learn facial representations from only a single monocular video.

The text also discusses extending NeRF for generating images with multi-view consistency, using techniques like conditioning NeRF on sampled random vectors or semantic codes. Some works enabled high-resolution image synthesis by using super-resolution modules or hybrid 3D representations that integrate NeRF with explicit representations. EG3D, which is used as the base representation in this work, has achieved state-of-the-art image generation quality while maintaining 3D consistency.

Lastly, the text mentions how subsequent works have taken advantage of EG3D's efficient representation and diverse latent space for 3D facial reconstruction, using techniques like GAN inversion and deformation fields to enable facial animation from the triplane representation.

Methodology

This section describes the components of the proposed Talk3D framework for audio-driven high-fidelity talking portrait synthesis. The key points are:

Talk3D uses a personalized generator based on EG3D to produce an identity triplane 𝐏ID. This is done through a fine-tuning strategy to adapt the 3D-aware GAN to a specific person.
An audio-guided attention U-Net is used to predict a deltaplane Δ⁢𝐏n that represents the dynamic scene variation from the audio input. This deltaplane is combined with the identity triplane to form the final feature plane 𝐏n' that is input to the neural renderer.
The attention mechanism in the U-Net allows it to effectively capture localized facial dynamics from the audio input. Additional conditioning tokens related to head pose, eye movements, and background are also incorporated.
The model is trained using a combination of reconstruction, identity, and lip synchronization loss terms to generate high-quality talking portraits that match the input audio.
Quantitative results show Talk3D outperforming prior NeRF-based methods in terms of image fidelity and lip sync accuracy across different head poses.

Experiments

The paper discusses the experimental settings and evaluation of their proposed audio-driven talking head synthesis method, Talk3D. Key points:

Dataset: The authors use datasets from previous NeRF-based works, comprising person-centric videos with audio tracks. They also use a pre-trained Wav2Vec model to extract audio features.

Comparison baselines: The method is compared to 2D talking head methods like Wav2Lip and PC-AVS, as well as NeRF-based models like AD-NeRF, RAD-NeRF, and ER-NeRF. Additional comparisons to GeneFace and HFA-GP are provided in the supplementary material.

Quantitative evaluation: Three settings are used - novel-view synthesis, self-driven, and cross-driven. Metrics include PSNR, SSIM, LPIPS, FID, landmark distance, lip sync score, and action unit error. The authors also use an identity similarity metric.

Qualitative evaluation: The results show Talk3D outperforming previous NeRF-based methods in maintaining performance across diverse viewing angles and providing robust and accurate results in self-driven and cross-driven settings.

User study: A user study with 31 participants was conducted, which showed Talk3D outperforming other methods in terms of lip-sync accuracy, image quality, and video realism.

Ablation study: The ablation study validates the importance of the sync loss function and the augmented feature tokens used by Talk3D.

Conclusion

The paper introduces Talk3D, a novel framework for high-fidelity 3D talking head synthesis. Talk3D incorporates a 3D-aware GAN prior and a region-aware motion model to produce realistic 3D talking head avatars with accurate lip movements and explicit rendering viewpoint control.

The framework uses a personalized generator fine-tuned with the VIVE3D framework, which enables the synthesis of 3D-aware talking heads with realistic geometry. Additionally, the proposed audio-guided attention U-Net architecture enhances the disentanglement of local variations within the image frames, such as background, torso, and eye movements.

Extensive experiments demonstrate that the Talk3D model not only generates accurate lip movements synchronized with the input audio, but also enables rendering from novel viewpoints, addressing limitations of previous state-of-the-art approaches. The authors anticipate that this work will significantly impact digital media experiences, virtual interactions, and find applications in film-making, virtual avatars, and video conferencing.

The supplementary material provides implementation details, additional experiments, and further analyses, including ablation studies, attention map visualizations, and semantic manipulation of the generated portraits. The limitations and ethical considerations of the research are also briefly discussed.

Appendix 0.A Additional Implementation Details

The paper describes the pre-processing and feature extraction steps used in their model:

Pre-processing: The method follows the same image cropping approach as VIVE3D [22], which detects 6 facial landmarks in each video frame using an off-the-shelf detector, applies Gaussian smoothing to stabilize the landmarks over time, and calculates affine transformation matrices to crop the images. The cropping boundaries are slightly wider than those used in EG3D [8].

Augmenting feature extraction: For audio features, the method uses a pre-trained Wav2Vec [2] model followed by 1D convolutional layers. Low-dimensional augmenting features like eye movement, head rotation, and landmarks are upsampled using positional encodings and further encoded with multilayer perceptrons (MLPs). Each feature is encoded into a 64-dimensional token.

Network architecture: The deltaplane predictor encodes the 256-resolution triplane input into a 32-resolution feature map. The cross-attention layer takes the flattened image feature and conditioning tokens, and predicts a low-resolution feature map using learned query, key, and value representations.

The paper also mentions that the original super-resolution module in EG3D [8] was replaced with GFPGAN [57] to enhance rendering quality, but all quantitative evaluations were performed using the original EG3D super-resolution module for a fair comparison.

Appendix 0.B Additional Results and Comparisons

The provided text discusses the performance of the Talk3D model in generating images from extreme viewpoints and comparing it to previous NeRF-based methods. Key points:

Talk3D is able to generate images and depth maps for the entire synthesized portrait, including the background, unlike previous NeRF methods that lack depth information outside the head/torso region.
The authors also demonstrate the generalizability of Talk3D by conducting experiments on additional in-the-wild datasets from HDTF.
Comparisons are made to two related works, HFA-GP and GeneFace, though the authors note limitations in accurately reproducing the results of HFA-GP.
Overall, the results show Talk3D outperforming these prior NeRF-based approaches across various quantitative metrics for novel view synthesis and depth estimation.

Appendix 0.C Further Analysis

The paper provides several analyses and ablation studies to demonstrate the efficacy of the proposed attention-based network and triplane representation.

0.C.1 Analysis of attention: The attention maps show that the conditioning tokens successfully disentangle the local movements of the low-resolution feature map. Angles and landmarks capture different attention patterns, as angles are related to torso movement while landmarks are suitable for capturing background motion.

0.C.2 Analysis of triplane: The visualizations of the generated images and their corresponding triplanes (identity plane and deltaplane) confirm the method's ability to precisely manipulate specific regions like the lips, eyes, torso, and background.

0.C.3 Ablation studies:

Importance of each token selection: The ablation study demonstrates the crucial role of each feature token in controlling the corresponding regions, such as eye movement, torso, and background.
Deltaplane predictor: The evaluation highlights the importance of the design choices in the deltaplane predictor for optimizing image quality and lip-synchronization accuracy.
Effect of the personalized generator: The incorporation of the personalized generator leads to improved image quality, as shown by the higher performance metrics and the reduced noise around the nose and eye regions.

Appendix 0.D Facial Editing

This section introduces a new feature of the Talk3D model that distinguishes it from other NeRF-based methods - facial attribute manipulation. Talk3D is built on a pre-trained generative model, EG3D, which has a rich and diverse latent space that enables semantic editing by adding predefined style vectors to the input latent code.

The researchers exploit InterFaceGAN to find style vectors that represent semantic editing directions within the EG3D latent space. Since Talk3D directly predicts the triplane representation instead of a latent code, the researchers slightly alter the InterFaceGAN methodology by replacing the identity triplane with the edited triplane.

Specifically, the researchers construct the edited triplane by adding the editing style vector to the identity latent code and feeding it into the personalized generator. They then replace the identity triplane with the edited triplane to generate the edited image.

The results, shown in Fig. 15, demonstrate consistent manipulation of attributes like age and hair length without disrupting lip synchronization.

Figure 15: Facial attribute manipulation results.

Appendix 0.E Supplementary Video

The supplementary video provides a comprehensive visualization of the effectiveness of the proposed method for talking facial video synthesis. The video showcases the facial animations generated by the method and includes detailed comparisons with other relevant techniques. It also demonstrates the versatility of the approach across different languages and its robustness in rendering performance at extreme viewpoints. Additionally, the video includes an ablation study that examines the individual contributions of the components within the methodology.

Appendix 0.F Broader Impact

The text discusses the ethical considerations and limitations of the Talk3D system, which generates highly realistic talking portrait videos with accurate lip-audio synchronization.

The key ethical consideration is the potential for misuse of such technology to create "deepfake" content that is challenging for individuals to distinguish from authentic videos. To address this, the authors emphasize the importance of informing users about the authenticity of videos and supporting the enhancement of deepfake detection mechanisms. They also advocate for protective measures, such as digital watermarks, and the need for regulatory frameworks to prevent unintended negative consequences.

In terms of limitations, while Talk3D excels at high-fidelity talking portrait synthesis, especially from extreme viewpoints, it does not generalize well outside of photorealistic human faces like some other talking face synthesis works. The system's reliance on GAN inversion also introduces technical complexities in data preparation, such as the need for precise alignment and cropping of video frames. Future development will focus on overcoming these limitations to increase the method's adaptability and ensure consistent performance across a wider range of training data.

Related Papers

GaussianTalker: Speaker-specific Talking Head Synthesis via 3D Gaussian Splatting

Hongyun Yu, Zhan Qu, Qihang Yu, Jianchuan Chen, Zhonghua Jiang, Zhiwen Chen, Shengyu Zhang, Jimin Xu, Fei Wu, Chengfei Lv, Gang Yu

Recent works on audio-driven talking head synthesis using Neural Radiance Fields (NeRF) have achieved impressive results. However, due to inadequate pose and expression control caused by NeRF implicit representation, these methods still have some limitations, such as unsynchronized or unnatural lip movements, and visual jitter and artifacts. In this paper, we propose GaussianTalker, a novel method for audio-driven talking head synthesis based on 3D Gaussian Splatting. With the explicit representation property of 3D Gaussians, intuitive control of the facial motion is achieved by binding Gaussians to 3D facial models. GaussianTalker consists of two modules, Speaker-specific Motion Translator and Dynamic Gaussian Renderer. Speaker-specific Motion Translator achieves accurate lip movements specific to the target speaker through universalized audio feature extraction and customized lip motion generation. Dynamic Gaussian Renderer introduces Speaker-specific BlendShapes to enhance facial detail representation via a latent pose, delivering stable and realistic rendered videos. Extensive experimental results suggest that GaussianTalker outperforms existing state-of-the-art methods in talking head synthesis, delivering precise lip synchronization and exceptional visual quality. Our method achieves rendering speeds of 130 FPS on NVIDIA RTX4090 GPU, significantly exceeding the threshold for real-time rendering performance, and can potentially be deployed on other hardware platforms.

4/30/2024

cs.CV cs.MM

Learn2Talk: 3D Talking Face Learns from 2D Talking Face

Yixiang Zhuang, Baoping Cheng, Yao Cheng, Yuntao Jin, Renshuai Liu, Chengyang Li, Xuan Cheng, Jing Liao, Juncong Lin

Speech-driven facial animation methods usually contain two main classes, 3D and 2D talking face, both of which attract considerable research attention in recent years. However, to the best of our knowledge, the research on 3D talking face does not go deeper as 2D talking face, in the aspect of lip-synchronization (lip-sync) and speech perception. To mind the gap between the two sub-fields, we propose a learning framework named Learn2Talk, which can construct a better 3D talking face network by exploiting two expertise points from the field of 2D talking face. Firstly, inspired by the audio-video sync network, a 3D sync-lip expert model is devised for the pursuit of lip-sync between audio and 3D facial motion. Secondly, a teacher model selected from 2D talking face methods is used to guide the training of the audio-to-3D motions regression network to yield more 3D vertex accuracy. Extensive experiments show the advantages of the proposed framework in terms of lip-sync, vertex accuracy and speech perception, compared with state-of-the-arts. Finally, we show two applications of the proposed framework: audio-visual speech recognition and speech-driven 3D Gaussian Splatting based avatar animation.

4/22/2024

cs.CV cs.GR cs.LG

GaussianTalker: Real-Time High-Fidelity Talking Head Synthesis with Audio-Driven 3D Gaussian Splatting

Kyusun Cho, Joungbin Lee, Heeji Yoon, Yeobin Hong, Jaehoon Ko, Sangjun Ahn, Seungryong Kim

We propose GaussianTalker, a novel framework for real-time generation of pose-controllable talking heads. It leverages the fast rendering capabilities of 3D Gaussian Splatting (3DGS) while addressing the challenges of directly controlling 3DGS with speech audio. GaussianTalker constructs a canonical 3DGS representation of the head and deforms it in sync with the audio. A key insight is to encode the 3D Gaussian attributes into a shared implicit feature representation, where it is merged with audio features to manipulate each Gaussian attribute. This design exploits the spatial-aware features and enforces interactions between neighboring points. The feature embeddings are then fed to a spatial-audio attention module, which predicts frame-wise offsets for the attributes of each Gaussian. It is more stable than previous concatenation or multiplication approaches for manipulating the numerous Gaussians and their intricate parameters. Experimental results showcase GaussianTalker's superiority in facial fidelity, lip synchronization accuracy, and rendering speed compared to previous methods. Specifically, GaussianTalker achieves a remarkable rendering speed up to 120 FPS, surpassing previous benchmarks. Our code is made available at https://github.com/KU-CVLAB/GaussianTalker/ .

4/26/2024

cs.CV cs.MM

Portrait3D: Text-Guided High-Quality 3D Portrait Generation Using Pyramid Representation and GANs Prior

Yiqian Wu, Hao Xu, Xiangjun Tang, Xien Chen, Siyu Tang, Zhebin Zhang, Chen Li, Xiaogang Jin

Existing neural rendering-based text-to-3D-portrait generation methods typically make use of human geometry prior and diffusion models to obtain guidance. However, relying solely on geometry information introduces issues such as the Janus problem, over-saturation, and over-smoothing. We present Portrait3D, a novel neural rendering-based framework with a novel joint geometry-appearance prior to achieve text-to-3D-portrait generation that overcomes the aforementioned issues. To accomplish this, we train a 3D portrait generator, 3DPortraitGAN-Pyramid, as a robust prior. This generator is capable of producing 360{deg} canonical 3D portraits, serving as a starting point for the subsequent diffusion-based generation process. To mitigate the grid-like artifact caused by the high-frequency information in the feature-map-based 3D representation commonly used by most 3D-aware GANs, we integrate a novel pyramid tri-grid 3D representation into 3DPortraitGAN-Pyramid. To generate 3D portraits from text, we first project a randomly generated image aligned with the given prompt into the pre-trained 3DPortraitGAN-Pyramid's latent space. The resulting latent code is then used to synthesize a pyramid tri-grid. Beginning with the obtained pyramid tri-grid, we use score distillation sampling to distill the diffusion model's knowledge into the pyramid tri-grid. Following that, we utilize the diffusion model to refine the rendered images of the 3D portrait and then use these refined images as training data to further optimize the pyramid tri-grid, effectively eliminating issues with unrealistic color and unnatural artifacts. Our experimental results show that Portrait3D can produce realistic, high-quality, and canonical 3D portraits that align with the prompt.

4/17/2024

cs.CV