Human-VDM: Learning Single-Image 3D Human Gaussian Splatting from Video Diffusion Models

Read original: arXiv:2409.02851 - Published 9/5/2024 by Zhibin Liu, Haoye Dong, Aviral Chharia, Hefeng Wu


Overview

The paper "Human-VDM: Learning Single-Image 3D Human Gaussian Splatting from Video Diffusion Models" presents a method for generating a 3D human model from a single RGB image.
The key idea is to use a video diffusion model to generate a coherent, view-consistent video of the person from the input image, and then to learn a 3D Gaussian Splatting representation under the guidance of that video.
The authors position Human-VDM against multi-view diffusion methods, which often produce inconsistent views. Possible applications of the resulting 3D humans include virtual reality, animation, and computer graphics.

Plain English Explanation

The paper introduces a new way to create 3D models of people from a single photograph. The key innovation is using a type of machine learning model called a "video diffusion model" to produce the extra views that a single photo lacks.

Video diffusion models are generative AI systems that can produce realistic video. Given one photo of a person, the model in Human-VDM generates a coherent video of that person, and the frames of this video are then used as consistent views from which the 3D human figure is built.

The resulting 3D models capture the person's pose and shape in a realistic way, which could be very useful for applications like virtual reality, animation, and computer graphics. For example, these 3D human models could be used to create customized avatars or insert people into digital scenes.

The key advantage of this approach is that it only requires a single input image, rather than needing multiple images or depth information to reconstruct the 3D shape. This makes it more practical and accessible compared to some other 3D human modeling techniques.

Technical Explanation

The core of the Human-VDM approach is a view-consistent human video diffusion module that takes a single image as input and generates a coherent video of the person. Diffusion models are a type of generative model that learns to turn noise into complex, realistic outputs by reversing a gradual noising process.
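As a rough illustration of the noise-to-data idea behind diffusion models (not the Human-VDM architecture itself), the sketch below implements the standard DDPM-style forward noising step on a toy list of numbers. The linear beta schedule, the function names, and the toy signal are assumptions chosen for illustration; a real model trains a network to reverse this process.

```python
import math
import random

def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Per-step noise variances (a commonly used schedule, assumed here)."""
    if num_steps == 1:
        return [beta_start]
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]

def cumulative_alphas(betas):
    """alpha_bar_t = product over s <= t of (1 - beta_s)."""
    alpha_bars, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars

def add_noise(x0, t, alpha_bars, rng):
    """Sample x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * eps."""
    signal_scale = math.sqrt(alpha_bars[t])
    noise_scale = math.sqrt(1.0 - alpha_bars[t])
    return [signal_scale * v + noise_scale * rng.gauss(0.0, 1.0) for v in x0]

betas = linear_beta_schedule(1000)
alpha_bars = cumulative_alphas(betas)
rng = random.Random(0)
clean = [1.0, -1.0, 0.5, 0.0]                           # toy "data" (hypothetical)
slightly_noisy = add_noise(clean, 0, alpha_bars, rng)   # almost unchanged
mostly_noise = add_noise(clean, 999, alpha_bars, rng)   # signal nearly gone
```

At the first step the sample is almost identical to the clean signal, while at the last step the signal scale is close to zero, so the sample is essentially pure Gaussian noise. Generation runs this process in reverse, starting from noise.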

According to the abstract, the video produced by the diffusion module is then passed to a video augmentation module, which applies super-resolution and video interpolation to enhance the textures and geometric smoothness of the generated frames.

Finally, a 3D Human Gaussian Splatting module learns the 3D human under the guidance of the high-resolution, view-consistent video frames. The result is represented as a set of Gaussian "splats", essentially soft 3D blobs that together approximate the shape and appearance of the person and that can be rendered from new viewpoints.
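To make the idea of a Gaussian splat more concrete, here is a minimal sketch of the representation and of front-to-back alpha blending, the compositing rule used when splats are rendered. It is a simplification under stated assumptions: the `Gaussian3D` class and its fields are hypothetical, the Gaussians are isotropic (real Gaussian Splatting uses an anisotropic covariance per Gaussian), and opacity is sampled at a 3D point rather than on a projected 2D footprint.

```python
import math
from dataclasses import dataclass

@dataclass
class Gaussian3D:
    center: tuple    # (x, y, z) position
    sigma: float     # isotropic standard deviation (simplifying assumption)
    opacity: float   # peak opacity in [0, 1]
    color: tuple     # (r, g, b) in [0, 1]

    def alpha_at(self, point):
        """Opacity at `point`: opacity * exp(-d^2 / (2 * sigma^2))."""
        d2 = sum((p - c) ** 2 for p, c in zip(point, self.center))
        return self.opacity * math.exp(-d2 / (2.0 * self.sigma ** 2))

def composite_front_to_back(samples):
    """Blend (color, alpha) pairs that are already sorted nearest-first.

    Implements C = sum_i c_i * a_i * prod_{j<i} (1 - a_j) and also returns
    the remaining transmittance after all samples.
    """
    out = [0.0, 0.0, 0.0]
    transmittance = 1.0
    for color, alpha in samples:
        weight = alpha * transmittance
        out = [o + weight * c for o, c in zip(out, color)]
        transmittance *= 1.0 - alpha
    return tuple(out), transmittance

# Two toy splats along one viewing ray, red in front of blue.
red = Gaussian3D((0.0, 0.0, 1.0), 0.5, 0.8, (1.0, 0.0, 0.0))
blue = Gaussian3D((0.0, 0.0, 2.0), 0.5, 0.8, (0.0, 0.0, 1.0))
samples = [(g.color, g.alpha_at(g.center)) for g in (red, blue)]
pixel, remaining = composite_front_to_back(samples)
```

Because the red splat is nearer and mostly opaque, it dominates the blended pixel, and the blue splat only contributes through the light the red one lets through.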

According to the abstract, the method consists of three modules:

A view-consistent human video diffusion module that generates a coherent human video from a single image
A video augmentation module that applies super-resolution and video interpolation to the generated video
A 3D Human Gaussian Splatting module that learns the 3D human from the resulting high-resolution, view-consistent images
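The paper's abstract describes the pipeline as three modules chained in sequence: a human video diffusion module, a video augmentation module (super-resolution plus video interpolation), and a Gaussian Splatting module. The sketch below only illustrates that data flow. Every name in it is hypothetical, the two learned stages are passed in as placeholder callables, and the augmentation step uses linear frame blending and nearest-neighbour upscaling as toy stand-ins for the learned interpolation and super-resolution models.

```python
from typing import Callable, List

Frame = List[List[float]]   # toy grayscale image: rows of pixel intensities
Video = List[Frame]

def interpolate_frames(video: Video) -> Video:
    """Insert a linearly blended frame between each adjacent pair
    (toy stand-in for learned video interpolation)."""
    if not video:
        return []
    out = [video[0]]
    for prev, nxt in zip(video, video[1:]):
        mid = [[(a + b) / 2.0 for a, b in zip(row_p, row_n)]
               for row_p, row_n in zip(prev, nxt)]
        out.extend([mid, nxt])
    return out

def upscale_nearest(frame: Frame, factor: int = 2) -> Frame:
    """Nearest-neighbour upscaling (toy stand-in for super-resolution)."""
    out = []
    for row in frame:
        wide = [v for v in row for _ in range(factor)]
        out.extend([list(wide) for _ in range(factor)])
    return out

def augment_video(video: Video, factor: int = 2) -> Video:
    """Stage 2: interpolate in time, then upscale every frame."""
    return [upscale_nearest(f, factor) for f in interpolate_frames(video)]

def run_pipeline(image: Frame,
                 generate_video: Callable[[Frame], Video],
                 fit_gaussians: Callable[[Video], object]) -> object:
    """Chain the stages: image -> video -> augmented video -> 3D model."""
    video = generate_video(image)       # stage 1 (placeholder callable)
    enhanced = augment_video(video)     # stage 2 (toy stand-in)
    return fit_gaussians(enhanced)      # stage 3 (placeholder callable)

# Dummy stages, used only to show the shapes flowing through the pipeline.
fake_generate = lambda img: [img, [[v + 1.0 for v in row] for row in img]]
describe = lambda vid: (len(vid), len(vid[0]), len(vid[0][0]))
summary = run_pipeline([[0.0, 2.0]], fake_generate, describe)
```

With the dummy stages, a 1x2 input image becomes a 2-frame video, which the augmentation step turns into 3 frames of size 2x4 before they reach the final stage. In a real system the placeholders would be the trained diffusion model and the Gaussian Splatting optimization.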

The authors report that Human-VDM generates high-quality 3D humans from just a single input image and outperforms other state-of-the-art single-image 3D human generation methods.

Critical Analysis

The paper presents a compelling approach for 3D human modeling from single images, leveraging the power of video diffusion models. Some potential areas for further research and improvement include:

Robustness to Varied Inputs: The current system may work best on frontal, well-lit images of people in neutral poses. Exploring how well it handles more challenging input images (e.g. side views, occluded subjects, dynamic poses) could be valuable.
Real-Time Performance: While the offline 3D reconstruction is impressive, enabling real-time 3D modeling from videos or interactive applications could significantly expand the usefulness of this technology.
Incorporation of Additional Cues: The paper focuses solely on monocular 2D images as input. Exploring ways to incorporate other sensor data, such as depth maps or multi-view images, could potentially further improve the 3D reconstruction quality.
Generalization to Non-human Subjects: The current model is trained on human data. Investigating how well the approach can be adapted to model other types of 3D objects, such as animals or furniture, could broaden the applicability of the technique.

Overall, the Human-VDM paper presents a novel and promising approach to single-image 3D human modeling. With further research and development, this technology could enable a wide range of interesting applications in areas like virtual/augmented reality, computer animation, and digital content creation.

Conclusion

The "Human-VDM" paper introduces a method for generating high-quality 3D human models from a single input image. By using a video diffusion model to generate view-consistent frames of the person, the researchers obtain the guidance needed to learn a 3D human represented as a set of 3D Gaussian splats.

This approach has the potential to significantly streamline 3D human modeling workflows, making it more accessible and practical for a variety of applications in virtual reality, animation, and beyond. While there are still some areas for further improvement and exploration, the technical innovations and promising results presented in this work represent an exciting step forward in the field of single-image 3D reconstruction.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!


Related Papers

Human-VDM: Learning Single-Image 3D Human Gaussian Splatting from Video Diffusion Models

Zhibin Liu, Haoye Dong, Aviral Chharia, Hefeng Wu

Generating lifelike 3D humans from a single RGB image remains a challenging task in computer vision, as it requires accurate modeling of geometry, high-quality texture, and plausible unseen parts. Existing methods typically use multi-view diffusion models for 3D generation, but they often face inconsistent view issues, which hinder high-quality 3D human generation. To address this, we propose Human-VDM, a novel method for generating 3D human from a single RGB image using Video Diffusion Models. Human-VDM provides temporally consistent views for 3D human generation using Gaussian Splatting. It consists of three modules: a view-consistent human video diffusion module, a video augmentation module, and a Gaussian Splatting module. First, a single image is fed into a human video diffusion module to generate a coherent human video. Next, the video augmentation module applies super-resolution and video interpolation to enhance the textures and geometric smoothness of the generated video. Finally, the 3D Human Gaussian Splatting module learns lifelike humans under the guidance of these high-resolution and view-consistent images. Experiments demonstrate that Human-VDM achieves high-quality 3D human from a single image, outperforming state-of-the-art methods in both generation quality and quantity. Project page: https://human-vdm.github.io/Human-VDM/

9/5/2024

Generalizable Human Gaussians from Single-View Image

Jinnan Chen, Chen Li, Jianfeng Zhang, Hanlin Chen, Buzhen Huang, Gim Hee Lee

In this work, we tackle the task of learning generalizable 3D human Gaussians from a single image. The main challenge for this task is to recover detailed geometry and appearance, especially for the unobserved regions. To this end, we propose single-view generalizable Human Gaussian model (HGM), a diffusion-guided framework for 3D human modeling from a single image. We design a diffusion-based coarse-to-fine pipeline, where the diffusion model is adapted to refine novel-view images rendered from a coarse human Gaussian model. The refined images are then used together with the input image to learn a refined human Gaussian model. Although effective in hallucinating the unobserved views, the approach may generate unrealistic human pose and shapes due to the lack of supervision. We circumvent this problem by further encoding the geometric priors from SMPL model. Specifically, we propagate geometric features from SMPL volume to the predicted Gaussians via sparse convolution and attention mechanism. We validate our approach on publicly available datasets and demonstrate that it significantly surpasses state-of-the-art methods in terms of PSNR and SSIM. Additionally, our method exhibits strong generalization for in-the-wild images.

6/11/2024

Human 3Diffusion: Realistic Avatar Creation via Explicit 3D Consistent Diffusion Models

Yuxuan Xue, Xianghui Xie, Riccardo Marin, Gerard Pons-Moll

Creating realistic avatars from a single RGB image is an attractive yet challenging problem. Due to its ill-posed nature, recent works leverage powerful prior from 2D diffusion models pretrained on large datasets. Although 2D diffusion models demonstrate strong generalization capability, they cannot provide multi-view shape priors with guaranteed 3D consistency. We propose Human 3Diffusion: Realistic Avatar Creation via Explicit 3D Consistent Diffusion. Our key insight is that 2D multi-view diffusion and 3D reconstruction models provide complementary information for each other, and by coupling them in a tight manner, we can fully leverage the potential of both models. We introduce a novel image-conditioned generative 3D Gaussian Splats reconstruction model that leverages the priors from 2D multi-view diffusion models, and provides an explicit 3D representation, which further guides the 2D reverse sampling process to have better 3D consistency. Experiments show that our proposed framework outperforms state-of-the-art methods and enables the creation of realistic avatars from a single RGB image, achieving high-fidelity in both geometry and appearance. Extensive ablations also validate the efficacy of our design, (1) multi-view 2D priors conditioning in generative 3D reconstruction and (2) consistency refinement of sampling trajectory via the explicit 3D representation. Our code and models will be released on https://yuxuan-xue.com/human-3diffusion.

6/13/2024

HumanSplat: Generalizable Single-Image Human Gaussian Splatting with Structure Priors

Panwang Pan, Zhuo Su, Chenguo Lin, Zhen Fan, Yongjie Zhang, Zeming Li, Tingting Shen, Yadong Mu, Yebin Liu

Despite recent advancements in high-fidelity human reconstruction techniques, the requirements for densely captured images or time-consuming per-instance optimization significantly hinder their applications in broader scenarios. To tackle these issues, we present HumanSplat which predicts the 3D Gaussian Splatting properties of any human from a single input image in a generalizable manner. In particular, HumanSplat comprises a 2D multi-view diffusion model and a latent reconstruction transformer with human structure priors that adeptly integrate geometric priors and semantic features within a unified framework. A hierarchical loss that incorporates human semantic information is further designed to achieve high-fidelity texture modeling and better constrain the estimated multiple views. Comprehensive experiments on standard benchmarks and in-the-wild images demonstrate that HumanSplat surpasses existing state-of-the-art methods in achieving photorealistic novel-view synthesis.

6/19/2024