PSHuman: Photorealistic Single-view Human Reconstruction using Cross-Scale Diffusion

Read original: arXiv:2409.10141 - Published 9/17/2024 by Peng Li, Wangguandong Zheng, Yuan Liu, Tao Yu, Yangguang Li, Xingqun Qi, Mengfei Li, Xiaowei Chi, Siyu Xia, Wei Xue and 3 others

Overview

PSHuman reconstructs a photorealistic, textured 3D human mesh from a single input image.
It builds on priors from a multiview diffusion model, which generates normal and color images of the person from novel views.
A cross-scale diffusion mechanism jointly models full-body shape and local facial characteristics, so fine-grained detail is captured while global structure is preserved.

Plain English Explanation

PSHuman is a technique that can create detailed 3D models of people from a single photograph. It works by using a cross-scale diffusion process to capture both the overall shape and fine details of the person in the image.

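The key idea is to let a diffusion model generate views of the person from several new angles while treating the whole body and the face together, at two different scales. Applied naively, a multiview diffusion model tends to distort the geometry, especially on faces. Handling both scales jointly lets the model preserve the global structure of the person while keeping fine-grained details like facial features and clothing. A parametric body model (SMPL-X) keeps the generated views consistent with human anatomy, and the generated views are then turned into a textured 3D mesh.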

By combining this cross-scale approach with a body prior, PSHuman is able to generate photorealistic 3D human models that preserve the identity of the person in the photograph. This could be useful for a variety of applications, such as creating 3D avatars, virtual characters, or content for augmented/virtual reality experiences.

Technical Explanation

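PSHuman reconstructs a human mesh from a single-view image by using a multiview diffusion model as a prior. The authors find that applying multiview diffusion directly to single-view human images leads to severe geometric distortions, especially on generated faces. Their cross-scale diffusion addresses this by modeling the joint probability distribution of global full-body shape and local facial characteristics, which enables detailed, identity-preserving novel-view generation.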
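One rough way to write down what modeling the two scales jointly could mean is shown below. The notation is assumed here for illustration and is not taken from the paper: $x^{b}$ stands for the generated full-body views, $x^{f}$ for the generated face views, and $c$ for the conditioning (the input image plus any body prior).

```latex
% Assumed notation: x^b = full-body views, x^f = face views, c = conditioning.
% Generating the two scales independently would factorize the distribution:
p(x^{b}, x^{f} \mid c) = p(x^{b} \mid c)\, p(x^{f} \mid c)
% A joint (cross-scale) model keeps the dependence between body and face:
p_\theta(x^{b}, x^{f} \mid c) = p_\theta(x^{b} \mid c)\, p_\theta(x^{f} \mid x^{b}, c)
```

The second line is just the chain rule. The point of the sketch is that the face is generated in a way that depends on the body, instead of being sampled separately and pasted in.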

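This cross-scale approach allows the model to capture both global structure and fine details. To keep body shape consistent across views and across varied poses, the generative model is also conditioned on a parametric body model such as SMPL-X, which provides body priors and prevents views that are inconsistent with human anatomy. From the generated multi-view normal and color images, an SMPL-X-initialized explicit human carving step then recovers the textured mesh.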
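To make the two scales concrete, here is a minimal sketch of how a global body view and a local face view could be prepared from the same photograph. It is an illustration under stated assumptions, not the authors' pipeline: the function names, the face bounding box, the 256-pixel resolution, and the nearest-neighbour resize are all hypothetical choices.

```python
import numpy as np


def resize_nearest(img: np.ndarray, size: int) -> np.ndarray:
    """Nearest-neighbour resize of an (H, W, C) image to (size, size, C)."""
    h, w = img.shape[:2]
    rows = np.arange(size) * h // size  # source row for each output row
    cols = np.arange(size) * w // size  # source column for each output column
    return img[rows[:, None], cols[None, :]]


def make_cross_scale_pair(image: np.ndarray, face_box, size: int = 256):
    """Return (body_view, face_view), both at the same resolution.

    face_box is (top, left, bottom, right) in pixels. It is assumed to come
    from some external face detector, which is not part of this sketch.
    """
    top, left, bottom, right = face_box
    body_view = resize_nearest(image, size)                           # global scale
    face_view = resize_nearest(image[top:bottom, left:right], size)   # local scale
    return body_view, face_view


# Toy input: a black 512x384 "photo" with a white 64x64 "face" region.
image = np.zeros((512, 384, 3), dtype=np.float64)
face_box = (40, 160, 104, 224)
image[40:104, 160:224] = 1.0

body_view, face_view = make_cross_scale_pair(image, face_box)
print(body_view.shape, face_view.shape)
print("face share of body view:", body_view.mean())
print("face share of face view:", face_view.mean())
```

In the body view the face covers only a small fraction of the pixels, while in the face view it fills the whole frame. That difference in available resolution is one plausible reason a model that sees only the full-body scale would struggle with facial detail.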

The authors train the model using a combination of synthetic and real human data, including 3D scans and images. They also introduce a differentiable rendering module that enables end-to-end training of the reconstruction pipeline.

Experiments and quantitative evaluations on the CAPE and THuman2.1 datasets indicate that PSHuman outperforms prior methods in geometric detail, texture fidelity, and generalization capability.
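This summary does not say how geometric accuracy is measured. A common choice for comparing a reconstructed surface with a ground-truth scan is the Chamfer distance between points sampled from the two surfaces. Whether PSHuman reports exactly this variant is an assumption here, and conventions differ (squared or unsquared distances, summed or averaged directions). The sketch below uses unsquared distances and sums the two directional means.

```python
import numpy as np


def chamfer_distance(a: np.ndarray, b: np.ndarray) -> float:
    """Symmetric Chamfer distance between point sets a (N, 3) and b (M, 3).

    Uses unsquared Euclidean distances and adds the two directional means.
    Brute force O(N * M) memory, so only suitable for small point sets.
    """
    diff = a[:, None, :] - b[None, :, :]   # (N, M, 3) pairwise differences
    dist = np.linalg.norm(diff, axis=-1)   # (N, M) pairwise distances
    a_to_b = dist.min(axis=1).mean()       # each point in a to its nearest in b
    b_to_a = dist.min(axis=0).mean()       # each point in b to its nearest in a
    return float(a_to_b + b_to_a)


# Toy check: a "prediction" that is the ground truth shifted by 0.1 along z.
gt_points = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
pred_points = gt_points + np.array([0.0, 0.0, 0.1])

print(chamfer_distance(gt_points, gt_points))
print(chamfer_distance(pred_points, gt_points))
```

For real meshes, the point sets would be sampled from the predicted and ground-truth surfaces, and a spatial index such as a k-d tree would replace the brute-force distance matrix.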

Critical Analysis

The PSHuman paper presents a promising approach for photorealistic single-view human reconstruction. The cross-scale diffusion mechanism is a novel and effective way to capture both global and local details.

However, the paper does not address some potential limitations. For example, the model may struggle with extreme poses or occlusions in the input image, as it relies on a single view. Additionally, the computational complexity of the multi-scale diffusion process could be a challenge for real-time applications.

Further research could explore ways to improve efficiency, handle more diverse scenarios, and quantify the model's limitations more thoroughly. Incorporating additional data modalities, such as depth information or semantic segmentation, could also enhance the reconstruction quality.

Conclusion

PSHuman presents a compelling approach for generating high-quality 3D human models from single-view images. The cross-scale diffusion mechanism is a key innovation that enables the model to capture both global structure and fine-grained details.

The ability to create photorealistic 3D human representations from a single photograph has exciting implications for a wide range of applications, such as virtual avatars, gaming, and augmented reality. As the field of 3D human reconstruction continues to advance, techniques like PSHuman could play a crucial role in making these technologies more accessible and realistic.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

PSHuman: Photorealistic Single-view Human Reconstruction using Cross-Scale Diffusion

Peng Li, Wangguandong Zheng, Yuan Liu, Tao Yu, Yangguang Li, Xingqun Qi, Mengfei Li, Xiaowei Chi, Siyu Xia, Wei Xue, Wenhan Luo, Qifeng Liu, Yike Guo

Detailed and photorealistic 3D human modeling is essential for various applications and has seen tremendous progress. However, full-body reconstruction from a monocular RGB image remains challenging due to the ill-posed nature of the problem and sophisticated clothing topology with self-occlusions. In this paper, we propose PSHuman, a novel framework that explicitly reconstructs human meshes utilizing priors from the multiview diffusion model. It is found that directly applying multiview diffusion on single-view human images leads to severe geometric distortions, especially on generated faces. To address it, we propose a cross-scale diffusion that models the joint probability distribution of global full-body shape and local facial characteristics, enabling detailed and identity-preserved novel-view generation without any geometric distortion. Moreover, to enhance cross-view body shape consistency of varied human poses, we condition the generative model on parametric models like SMPL-X, which provide body priors and prevent unnatural views inconsistent with human anatomy. Leveraging the generated multi-view normal and color images, we present SMPLX-initialized explicit human carving to recover realistic textured human meshes efficiently. Extensive experimental results and quantitative evaluations on CAPE and THuman2.1 datasets demonstrate PSHuman's superiority in geometry details, texture fidelity, and generalization capability.

9/17/2024

Human 3Diffusion: Realistic Avatar Creation via Explicit 3D Consistent Diffusion Models

Yuxuan Xue, Xianghui Xie, Riccardo Marin, Gerard Pons-Moll

Creating realistic avatars from a single RGB image is an attractive yet challenging problem. Due to its ill-posed nature, recent works leverage powerful prior from 2D diffusion models pretrained on large datasets. Although 2D diffusion models demonstrate strong generalization capability, they cannot provide multi-view shape priors with guaranteed 3D consistency. We propose Human 3Diffusion: Realistic Avatar Creation via Explicit 3D Consistent Diffusion. Our key insight is that 2D multi-view diffusion and 3D reconstruction models provide complementary information for each other, and by coupling them in a tight manner, we can fully leverage the potential of both models. We introduce a novel image-conditioned generative 3D Gaussian Splats reconstruction model that leverages the priors from 2D multi-view diffusion models, and provides an explicit 3D representation, which further guides the 2D reverse sampling process to have better 3D consistency. Experiments show that our proposed framework outperforms state-of-the-art methods and enables the creation of realistic avatars from a single RGB image, achieving high-fidelity in both geometry and appearance. Extensive ablations also validate the efficacy of our design, (1) multi-view 2D priors conditioning in generative 3D reconstruction and (2) consistency refinement of sampling trajectory via the explicit 3D representation. Our code and models will be released on https://yuxuan-xue.com/human-3diffusion.

6/13/2024

Generalizable Human Gaussians from Single-View Image

Jinnan Chen, Chen Li, Jianfeng Zhang, Hanlin Chen, Buzhen Huang, Gim Hee Lee

In this work, we tackle the task of learning generalizable 3D human Gaussians from a single image. The main challenge for this task is to recover detailed geometry and appearance, especially for the unobserved regions. To this end, we propose single-view generalizable Human Gaussian model (HGM), a diffusion-guided framework for 3D human modeling from a single image. We design a diffusion-based coarse-to-fine pipeline, where the diffusion model is adapted to refine novel-view images rendered from a coarse human Gaussian model. The refined images are then used together with the input image to learn a refined human Gaussian model. Although effective in hallucinating the unobserved views, the approach may generate unrealistic human pose and shapes due to the lack of supervision. We circumvent this problem by further encoding the geometric priors from SMPL model. Specifically, we propagate geometric features from SMPL volume to the predicted Gaussians via sparse convolution and attention mechanism. We validate our approach on publicly available datasets and demonstrate that it significantly surpasses state-of-the-art methods in terms of PSNR and SSIM. Additionally, our method exhibits strong generalization for in-the-wild images.

6/11/2024

HumanSplat: Generalizable Single-Image Human Gaussian Splatting with Structure Priors

Panwang Pan, Zhuo Su, Chenguo Lin, Zhen Fan, Yongjie Zhang, Zeming Li, Tingting Shen, Yadong Mu, Yebin Liu

Despite recent advancements in high-fidelity human reconstruction techniques, the requirements for densely captured images or time-consuming per-instance optimization significantly hinder their applications in broader scenarios. To tackle these issues, we present HumanSplat which predicts the 3D Gaussian Splatting properties of any human from a single input image in a generalizable manner. In particular, HumanSplat comprises a 2D multi-view diffusion model and a latent reconstruction transformer with human structure priors that adeptly integrate geometric priors and semantic features within a unified framework. A hierarchical loss that incorporates human semantic information is further designed to achieve high-fidelity texture modeling and better constrain the estimated multiple views. Comprehensive experiments on standard benchmarks and in-the-wild images demonstrate that HumanSplat surpasses existing state-of-the-art methods in achieving photorealistic novel-view synthesis.

6/19/2024