Stable Video Portraits

Read original: arXiv:2409.18083 - Published 9/27/2024 by Mirela Ostrek, Justus Thies

Overview

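This paper presents SVP, a hybrid 2D/3D method for generating photorealistic, temporally stable videos of talking faces.
The approach fine-tunes a large pre-trained text-to-image model (Stable Diffusion) on a specific person and controls it with a 3D Morphable Model (3DMM); a temporal denoising procedure lifts the image model to a video model.
The technique can be used to generate personalized avatars and virtual assistants with natural facial expressions and head motions, and the avatar's appearance can be edited or morphed toward text-defined celebrities without fine-tuning at test time.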

Plain English Explanation

The paper describes a new way to create realistic-looking videos of human faces. The key idea is to combine a 3D model of the head with a neural network that can generate the video frames. This allows the system to produce talking head videos that are both visually appealing and temporally stable, meaning the person's face and head movements appear natural and consistent over time.

The researchers' method could be used to generate personalized avatars or virtual assistants with life-like facial expressions. This could be particularly useful for applications like virtual meetings, where having an avatar that behaves realistically can improve the user experience.

Overall, the authors report that their method outperforms state-of-the-art monocular head avatar methods, which makes it a notable step forward for generating high-quality video portraits.

Technical Explanation

The paper introduces a new method for creating stable and realistic video portraits of human faces. The key components are:

3D Head Model: The system uses a 3D Morphable Model (3DMM) of the human head to capture the underlying 3D structure of the face. Temporal sequences of 3DMM output serve as the control signal for the generated video.
Neural Renderer: Rather than training a renderer from scratch, the authors fine-tune a general 2D Stable Diffusion model on a specific person, so that it generates frames of that person under 3DMM-based control.
Temporal Consistency: The image model is lifted to a video model by conditioning on temporal 3DMM sequences and by introducing a temporal denoising procedure, which makes the output temporally smooth.
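
To make the idea of temporal consistency concrete, here is a minimal toy sketch of why coupling neighboring frames during denoising can reduce flicker. It is not the paper's temporal denoising procedure, whose details are not given in this summary. Everything in it is an assumption made for illustration: the function names (denoise_step, temporal_blend, generate, jitter), the simple pull-toward-the-conditioning "denoiser" that stands in for a diffusion model, the window-averaging rule, and the short lists of numbers that stand in for per-frame 3DMM parameters.

```python
import random


def denoise_step(latent, cond, strength=0.5):
    """Toy stand-in for one conditioned denoising step (NOT a diffusion model):
    pull the latent part of the way toward the conditioning vector."""
    return [z + strength * (c - z) for z, c in zip(latent, cond)]


def temporal_blend(latents, radius=1):
    """Replace each frame's latent with the mean over a window of neighboring frames."""
    blended = []
    for t in range(len(latents)):
        window = latents[max(0, t - radius): t + radius + 1]
        blended.append([sum(vals) / len(window) for vals in zip(*window)])
    return blended


def generate(conds, steps=4, radius=1, temporal=True, seed=0):
    """Denoise all frames together; conds holds one hypothetical
    3DMM-style parameter vector per frame."""
    rng = random.Random(seed)
    # Every frame starts from its own independent noise sample.
    latents = [[rng.gauss(0.0, 1.0) for _ in cond] for cond in conds]
    for _ in range(steps):
        latents = [denoise_step(z, c) for z, c in zip(latents, conds)]
        if temporal:
            # Assumed coupling rule: share information across neighboring frames.
            latents = temporal_blend(latents, radius)
    return latents


def jitter(latents):
    """Mean absolute change between consecutive frames (a crude flicker measure)."""
    diffs = [abs(b - a)
             for prev, cur in zip(latents, latents[1:])
             for a, b in zip(prev, cur)]
    return sum(diffs) / len(diffs)


if __name__ == "__main__":
    # A held expression: identical conditioning for 8 frames, so any
    # frame-to-frame change in the output is flicker from the initial noise.
    conds = [[0.2, -0.1, 0.4]] * 8
    print("per-frame denoising jitter:", jitter(generate(conds, temporal=False)))
    print("with temporal blending:    ", jitter(generate(conds, temporal=True)))
```

In this toy setting the blended sequence changes less from frame to frame than the independently denoised one, because averaging over neighbors cancels part of each frame's residual noise. A real video model would have to balance such smoothing against blurring genuinely fast motion.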

The paper analyzes the method both quantitatively and qualitatively. The researchers show that it generates talking head videos that are photorealistic and temporally smooth, and that it outperforms state-of-the-art monocular head avatar methods.

Critical Analysis

The paper presents a compelling solution for generating stable and realistic video portraits, but there are a few potential limitations and areas for further research:

Data Dependency: The quality of the results is likely dependent on the diversity and quality of the training data used to build the 3D head model and train the neural renderer. Expanding the data sources could improve the system's ability to handle a wider range of facial features and expressions.
Personalization: While the method can generate personalized avatars, the degree of customization may be limited. Exploring ways to further personalize the head model and facial animations could enhance the user experience.
Real-time Performance: The paper does not explicitly address the computational requirements or real-time performance of the system. Optimizing the architecture for efficient inference could be important for certain applications, such as virtual meetings or interactive games.

Overall, the paper presents a valuable contribution to the field of neural rendering and facial animation, with the potential for significant impact on applications requiring high-quality, temporally stable video portraits.

Conclusion

The researchers have developed a novel method for generating stable and realistic video portraits of human faces. By controlling a person-specific, fine-tuned Stable Diffusion model with a 3D Morphable Model, their approach can produce talking head videos with natural facial expressions and head motions that are temporally consistent.

This work represents an important advance in the field of neural rendering and could have widespread applications, from personalized avatars to virtual assistants and virtual meetings. The ability to generate high-quality, life-like video portraits has significant potential to enhance user experiences in a variety of contexts.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Stable Video Portraits

Mirela Ostrek, Justus Thies

Rapid advances in the field of generative AI and text-to-image methods in particular have transformed the way we interact with and perceive computer-generated imagery today. In parallel, much progress has been made in 3D face reconstruction, using 3D Morphable Models (3DMM). In this paper, we present SVP, a novel hybrid 2D/3D generation method that outputs photorealistic videos of talking faces leveraging a large pre-trained text-to-image prior (2D), controlled via a 3DMM (3D). Specifically, we introduce a person-specific fine-tuning of a general 2D stable diffusion model which we lift to a video model by providing temporal 3DMM sequences as conditioning and by introducing a temporal denoising procedure. As an output, this model generates temporally smooth imagery of a person with 3DMM-based controls, i.e., a person-specific avatar. The facial appearance of this person-specific avatar can be edited and morphed to text-defined celebrities, without any fine-tuning at test time. The method is analyzed quantitatively and qualitatively, and we show that our method outperforms state-of-the-art monocular head avatar methods.

9/27/2024

Coherent 3D Portrait Video Reconstruction via Triplane Fusion

Shengze Wang, Xueting Li, Chao Liu, Matthew Chan, Michael Stengel, Josef Spjut, Henry Fuchs, Shalini De Mello, Koki Nagano

Recent breakthroughs in single-image 3D portrait reconstruction have enabled telepresence systems to stream 3D portrait videos from a single camera in real-time, potentially democratizing telepresence. However, per-frame 3D reconstruction exhibits temporal inconsistency and forgets the user's appearance. On the other hand, self-reenactment methods can render coherent 3D portraits by driving a personalized 3D prior, but fail to faithfully reconstruct the user's per-frame appearance (e.g., facial expressions and lighting). In this work, we recognize the need to maintain both coherent identity and dynamic per-frame appearance to enable the best possible realism. To this end, we propose a new fusion-based method that fuses a personalized 3D subject prior with per-frame information, producing temporally stable 3D videos with faithful reconstruction of the user's per-frame appearances. Trained only using synthetic data produced by an expression-conditioned 3D GAN, our encoder-based method achieves both state-of-the-art 3D reconstruction accuracy and temporal consistency on in-studio and in-the-wild datasets.

5/3/2024

SVP: Style-Enhanced Vivid Portrait Talking Head Diffusion Model

Weipeng Tan, Chuming Lin, Chengming Xu, Xiaozhong Ji, Junwei Zhu, Chengjie Wang, Yanwei Fu

Talking Head Generation (THG), typically driven by audio, is an important and challenging task with broad application prospects in various fields such as digital humans, film production, and virtual reality. While diffusion model-based THG methods present high quality and stable content generation, they often overlook the intrinsic style which encompasses personalized features such as speaking habits and facial expressions of a video. As a consequence, the generated video content lacks diversity and vividness, thus being limited in real-life scenarios. To address these issues, we propose a novel framework named Style-Enhanced Vivid Portrait (SVP) which fully leverages style-related information in THG. Specifically, we first introduce the novel probabilistic style prior learning to model the intrinsic style as a Gaussian distribution using facial expressions and audio embedding. The distribution is learned through the 'bespoked' contrastive objective, effectively capturing the dynamic style information in each video. Then we finetune a pretrained Stable Diffusion (SD) model to inject the learned intrinsic style as a controlling signal via cross attention. Experiments show that our model generates diverse, vivid, and high-quality videos with flexible control over intrinsic styles, outperforming existing state-of-the-art methods.

9/6/2024

SPARK: Self-supervised Personalized Real-time Monocular Face Capture

Kelian Baert, Shrisha Bharadwaj, Fabien Castan, Benoit Maujean, Marc Christie, Victoria Abrevaya, Adnane Boukhayma

Feedforward monocular face capture methods seek to reconstruct posed faces from a single image of a person. Current state-of-the-art approaches have the ability to regress parametric 3D face models in real-time across a wide range of identities, lighting conditions and poses by leveraging large image datasets of human faces. These methods, however, suffer from clear limitations in that the underlying parametric face model only provides a coarse estimation of the face shape, thereby limiting their practical applicability in tasks that require precise 3D reconstruction (aging, face swapping, digital make-up, ...). In this paper, we propose a method for high-precision 3D face capture taking advantage of a collection of unconstrained videos of a subject as prior information. Our proposal builds on a two-stage approach. We start with the reconstruction of a detailed 3D face avatar of the person, capturing both precise geometry and appearance from a collection of videos. We then use the encoder from a pre-trained monocular face reconstruction method, substituting its decoder with our personalized model, and proceed with transfer learning on the video collection. Using our pre-estimated image formation model, we obtain a more precise self-supervision objective, enabling improved expression and pose alignment. This results in a trained encoder capable of efficiently regressing pose and expression parameters in real-time from previously unseen images, which combined with our personalized geometry model yields more accurate and high fidelity mesh inference. Through extensive qualitative and quantitative evaluation, we showcase the superiority of our final model as compared to state-of-the-art baselines, and demonstrate its generalization ability to unseen pose, expression and lighting.

9/14/2024