DPHMs: Diffusion Parametric Head Models for Depth-based Tracking

arXiv:2312.01068

Published 4/9/2024 by Jiapeng Tang, Angela Dai, Yinyu Nie, Lev Markhasin, Justus Thies, Matthias Niessner

Abstract

We introduce Diffusion Parametric Head Models (DPHMs), a generative model that enables robust volumetric head reconstruction and tracking from monocular depth sequences. While recent volumetric head models, such as NPHMs, can now excel in representing high-fidelity head geometries, tracking and reconstructing heads from real-world single-view depth sequences remains very challenging, as the fitting to partial and noisy observations is underconstrained. To tackle these challenges, we propose a latent diffusion-based prior to regularize volumetric head reconstruction and tracking. This prior-based regularizer effectively constrains the identity and expression codes to lie on the underlying latent manifold which represents plausible head shapes. To evaluate the effectiveness of the diffusion-based prior, we collect a dataset of monocular Kinect sequences consisting of various complex facial expression motions and rapid transitions. We compare our method to state-of-the-art tracking methods and demonstrate improved head identity reconstruction as well as robust expression tracking.

Overview

The researchers present a new generative model called Diffusion Parametric Head Models (DPHMs) that can reconstruct and track volumetric head models from monocular depth sequences.
This addresses challenges with existing volumetric head models, which struggle to fit to partial and noisy real-world depth data.
The key innovation is using a latent diffusion-based prior to regularize the head reconstruction and tracking, constraining the identity and expression codes to lie on a manifold of plausible head shapes.

Plain English Explanation

The paper introduces a new way to create 3D models of people's heads and track how their faces move over time, using only a single camera that measures depth. Existing models can represent detailed head geometries, but struggle when dealing with real-world depth data that is incomplete or noisy.

To overcome this, the researchers developed a "diffusion-based prior": a model that has learned what plausible human heads look like and steers the 3D reconstruction toward them. This prior acts as a regularizer, keeping the reconstructed heads plausible even when the input depth data is imperfect.

The team collected a dataset of Kinect depth camera videos showing various facial expressions and fast movements. They show that their diffusion-based method outperforms other state-of-the-art approaches, both in accurately reconstructing a person's head identity and in robustly tracking their facial expressions over time.

Technical Explanation

The core innovation of this work is the use of a latent diffusion-based prior to regularize volumetric head reconstruction and tracking. Diffusion models learn to reverse a gradual noising process: starting from random noise, they iteratively denoise it into coherent samples.
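As a rough illustration of that noise-to-sample process, here is a minimal one-dimensional sketch in plain Python. It is not the paper's model: the data distribution is assumed to be a single Gaussian so that the ideal noise estimate has a closed form and no network needs to be trained. The schedule values and all function names (linear_beta_schedule, ideal_noise_estimate, sample) are hypothetical choices made for this example.

```python
import math
import random


def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.05):
    """Per-step noise variances that grow linearly (values are assumptions)."""
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def cumulative_alphas(betas):
    """alpha_bar_t = product of (1 - beta_s) for all s up to t."""
    alpha_bars, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars


def add_noise(x0, alpha_bar, rng):
    """Forward process: jump from a clean sample straight to noise level alpha_bar."""
    eps = rng.gauss(0.0, 1.0)
    x_t = math.sqrt(alpha_bar) * x0 + math.sqrt(1.0 - alpha_bar) * eps
    return x_t, eps


def ideal_noise_estimate(x_t, alpha_bar, mu, sigma):
    """Closed-form E[eps | x_t] when the clean data is N(mu, sigma^2).

    In a real diffusion model a trained network plays this role.
    """
    var_t = alpha_bar * sigma ** 2 + (1.0 - alpha_bar)
    return math.sqrt(1.0 - alpha_bar) * (x_t - math.sqrt(alpha_bar) * mu) / var_t


def sample(betas, alpha_bars, mu, sigma, rng):
    """Reverse process: start from pure noise and denoise step by step."""
    x = rng.gauss(0.0, 1.0)
    for t in reversed(range(len(betas))):
        eps_hat = ideal_noise_estimate(x, alpha_bars[t], mu, sigma)
        mean = (x - betas[t] / math.sqrt(1.0 - alpha_bars[t]) * eps_hat) / math.sqrt(1.0 - betas[t])
        noise = rng.gauss(0.0, 1.0) if t > 0 else 0.0  # no noise on the final step
        x = mean + math.sqrt(betas[t]) * noise
    return x
```

Calling sample repeatedly with mu=2.0 and sigma=0.5 should give values that cluster around 2.0 even though every run starts from zero-mean noise, which is the behavior the paragraph above describes.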

The researchers leverage this learned diffusion process to define a prior distribution over the latent identity and expression codes that parameterize their volumetric head model. This prior effectively constrains the codes to lie on a manifold of plausible head shapes, enabling robust fitting to partial and noisy depth observations.
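The effect of such a prior on an underconstrained fit can be sketched with a toy example. Everything here is an assumption made for illustration: the latent code has four dimensions, the "head model" is the identity map, only the first two dimensions are observed (standing in for a partial depth scan), and the learned denoiser is replaced by the closed-form one for a Gaussian prior. The prior gradient uses a score-distillation-style term (predicted noise minus injected noise); the summary does not confirm that this is the exact loss used by the authors.

```python
import math
import random

# Hypothetical stand-ins for a learned latent prior.
PRIOR_MEAN = [1.0, -0.5, 0.3, 0.8]      # center of the "plausible" latent region
PRIOR_STD = 0.5
ALPHA_BARS = [0.9, 0.7, 0.5, 0.3, 0.1]  # a handful of noise levels

TRUE_CODE = [1.2, -0.7, 0.4, 0.6]       # latent code we would like to recover
OBSERVED = {0: 1.25, 1: -0.68}          # noisy observations of dimensions 0 and 1 only


def ideal_noise_estimate(z_t, alpha_bar):
    """Closed-form denoiser for a Gaussian prior (a trained network in practice)."""
    var_t = alpha_bar * PRIOR_STD ** 2 + (1.0 - alpha_bar)
    scale = math.sqrt(1.0 - alpha_bar) / var_t
    return [scale * (z - math.sqrt(alpha_bar) * m) for z, m in zip(z_t, PRIOR_MEAN)]


def prior_gradient(z, rng, num_samples=8):
    """Noise the code, denoise it, and average (predicted noise - injected noise)."""
    grad = [0.0] * len(z)
    for _ in range(num_samples):
        alpha_bar = rng.choice(ALPHA_BARS)
        eps = [rng.gauss(0.0, 1.0) for _ in z]
        z_t = [math.sqrt(alpha_bar) * zi + math.sqrt(1.0 - alpha_bar) * e
               for zi, e in zip(z, eps)]
        eps_hat = ideal_noise_estimate(z_t, alpha_bar)
        for i in range(len(z)):
            grad[i] += (eps_hat[i] - eps[i]) / num_samples
    return grad


def data_gradient(z, observed):
    """Gradient of 0.5 * squared error, nonzero only on observed dimensions."""
    return [z[i] - observed[i] if i in observed else 0.0 for i in range(len(z))]


def fit(observed, prior_weight, steps=800, lr=0.05, seed=0):
    """Gradient descent on data term + prior_weight * diffusion-prior term."""
    rng = random.Random(seed)
    z = [3.0] * len(PRIOR_MEAN)  # deliberately poor initialization
    for _ in range(steps):
        g_data = data_gradient(z, observed)
        g_prior = prior_gradient(z, rng)
        z = [zi - lr * (gd + prior_weight * gp)
             for zi, gd, gp in zip(z, g_data, g_prior)]
    return z


def l2_error(z, target):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(z, target)))
```

With prior_weight=0.0 the two unobserved dimensions never move from their initial value, since the data term gives them no gradient. With a positive weight such as 0.5 they are pulled toward the plausible region, which mirrors the role the summary attributes to the diffusion prior when depth observations are partial.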

To evaluate their approach, the authors collected a dataset of monocular Kinect depth sequences with complex facial expressions and rapid transitions. They compare DPHMs to state-of-the-art tracking methods and report improved head identity reconstruction and more robust expression tracking.

Critical Analysis

The key strength of this work is the innovative use of a diffusion-based prior to tackle the challenges of fitting volumetric head models to real-world depth data. By constraining the latent codes to a manifold of plausible head shapes, the method is able to produce more robust reconstructions.

That said, the paper does not provide a detailed analysis of the limitations of the approach. For example, it would be helpful to understand how the method performs on more extreme facial expressions or head poses that may fall outside the learned manifold. Additionally, the authors do not discuss how the diffusion-based prior compares to alternative regularizers or to related diffusion approaches, such as structured latent diffusion models for 3D human generation.

Overall, this is a promising piece of research that leverages the power of diffusion models to enable more robust head reconstruction and tracking. Further exploration of the method's limitations and comparisons to alternative approaches would help strengthen the contribution.

Conclusion

The Diffusion Parametric Head Models (DPHMs) presented in this paper offer a novel solution to the challenge of reconstructing and tracking 3D head models from monocular depth data. By incorporating a diffusion-based prior to constrain the latent representation, the method is able to produce more accurate and robust head reconstructions, even in the face of partial and noisy depth observations.

This work demonstrates the potential of diffusion models to enhance generative approaches in computer vision, opening up new possibilities for applications like facial animation, virtual reality, and human-computer interaction. As the field continues to explore the capabilities of diffusion-based priors, further advancements in this direction could lead to transformative improvements in 3D modeling and tracking from limited sensory data.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

MonoNPHM: Dynamic Head Reconstruction from Monocular Videos

Simon Giebenhain, Tobias Kirschstein, Markos Georgopoulos, Martin Runz, Lourdes Agapito, Matthias Nießner

We present Monocular Neural Parametric Head Models (MonoNPHM) for dynamic 3D head reconstructions from monocular RGB videos. To this end, we propose a latent appearance space that parameterizes a texture field on top of a neural parametric model. We constrain predicted color values to be correlated with the underlying geometry such that gradients from RGB effectively influence latent geometry codes during inverse rendering. To increase the representational capacity of our expression space, we augment our backward deformation field with hyper-dimensions, thus improving color and geometry representation in topologically challenging expressions. Using MonoNPHM as a learned prior, we approach the task of 3D head reconstruction using signed distance field based volumetric rendering. By numerically inverting our backward deformation field, we incorporated a landmark loss using facial anchor points that are closely tied to our canonical geometry representation. To evaluate the task of dynamic face reconstruction from monocular RGB videos we record 20 challenging Kinect sequences under casual conditions. MonoNPHM outperforms all baselines with a significant margin, and makes an important step towards easily accessible neural parametric face models through RGB tracking.

5/31/2024

cs.CV

RoHM: Robust Human Motion Reconstruction via Diffusion

Siwei Zhang, Bharat Lal Bhatnagar, Yuanlu Xu, Alexander Winkler, Petr Kadlecek, Siyu Tang, Federica Bogo

We propose RoHM, an approach for robust 3D human motion reconstruction from monocular RGB(-D) videos in the presence of noise and occlusions. Most previous approaches either train neural networks to directly regress motion in 3D or learn data-driven motion priors and combine them with optimization at test time. The former do not recover globally coherent motion and fail under occlusions; the latter are time-consuming, prone to local minima, and require manual tuning. To overcome these shortcomings, we exploit the iterative, denoising nature of diffusion models. RoHM is a novel diffusion-based motion model that, conditioned on noisy and occluded input data, reconstructs complete, plausible motions in consistent global coordinates. Given the complexity of the problem -- requiring one to address different tasks (denoising and infilling) in different solution spaces (local and global motion) -- we decompose it into two sub-tasks and learn two models, one for global trajectory and one for local motion. To capture the correlations between the two, we then introduce a novel conditioning module, combining it with an iterative inference scheme. We apply RoHM to a variety of tasks -- from motion reconstruction and denoising to spatial and temporal infilling. Extensive experiments on three popular datasets show that our method outperforms state-of-the-art approaches qualitatively and quantitatively, while being faster at test time. The code is available at https://sanweiliti.github.io/ROHM/ROHM.html.

4/16/2024

cs.CV

FitDiff: Robust monocular 3D facial shape and reflectance estimation using Diffusion Models

Stathis Galanakis, Alexandros Lattas, Stylianos Moschoglou, Stefanos Zafeiriou

The remarkable progress in 3D face reconstruction has resulted in high-detail and photorealistic facial representations. Recently, Diffusion Models have revolutionized the capabilities of generative methods by surpassing the performance of GANs. In this work, we present FitDiff, a diffusion-based 3D facial avatar generative model. Leveraging diffusion principles, our model accurately generates relightable facial avatars, utilizing an identity embedding extracted from an in-the-wild 2D facial image. The introduced multi-modal diffusion model is the first to concurrently output facial reflectance maps (diffuse and specular albedo and normals) and shapes, showcasing great generalization capabilities. It is solely trained on an annotated subset of a public facial dataset, paired with 3D reconstructions. We revisit the typical 3D facial fitting approach by guiding a reverse diffusion process using perceptual and face recognition losses. Being the first 3D LDM conditioned on face recognition embeddings, FitDiff reconstructs relightable human avatars, that can be used as-is in common rendering engines, starting only from an unconstrained facial image, and achieving state-of-the-art performance.

6/5/2024

cs.CV

DNPM: A Neural Parametric Model for the Synthesis of Facial Geometric Details

Haitao Cao, Baoping Cheng, Qiran Pu, Haocheng Zhang, Bin Luo, Yixiang Zhuang, Juncong Lin, Liyan Chen, Xuan Cheng

Parametric 3D models have enabled a wide variety of computer vision and graphics tasks, such as modeling human faces, bodies and hands. In 3D face modeling, 3DMM is the most widely used parametric model, but can't generate fine geometric details solely from identity and expression inputs. To tackle this limitation, we propose a neural parametric model named DNPM for the facial geometric details, which utilizes deep neural network to extract latent codes from facial displacement maps encoding details and wrinkles. Built upon DNPM, a novel 3DMM named Detailed3DMM is proposed, which augments traditional 3DMMs by including the synthesis of facial details only from the identity and expression inputs. Moreover, we show that DNPM and Detailed3DMM can facilitate two downstream applications: speech-driven detailed 3D facial animation and 3D face reconstruction from a degraded image. Extensive experiments have shown the usefulness of DNPM and Detailed3DMM, and the progressiveness of two proposed applications.

6/17/2024

cs.CV