3D Reconstruction of Interacting Multi-Person in Clothing from a Single Image

2401.06415

Published 4/3/2024 by Junuk Cha, Hansol Lee, Jaewon Kim, Nhat Nguyen Bao Truong, Jae Shin Yoon, Seungryul Baek

3D Reconstruction of Interacting Multi-Person in Clothing from a Single Image

Abstract

This paper introduces a novel pipeline to reconstruct the geometry of interacting multi-person in clothing on a globally coherent scene space from a single image. The main challenge arises from the occlusion: a part of a human body is not visible from a single view due to the occlusion by others or the self, which introduces missing geometry and physical implausibility (e.g., penetration). We overcome this challenge by utilizing two human priors for complete 3D geometry and surface contacts. For the geometry prior, an encoder learns to regress the image of a person with missing body parts to the latent vectors; a decoder decodes these vectors to produce 3D features of the associated geometry; and an implicit network combines these features with a surface normal map to reconstruct a complete and detailed 3D humans. For the contact prior, we develop an image-space contact detector that outputs a probability distribution of surface contacts between people in 3D. We use these priors to globally refine the body poses, enabling the penetration-free and accurate reconstruction of interacting multi-person in clothing on the scene space. The results demonstrate that our method is complete, globally coherent, and physically plausible compared to existing methods.

Create account to get full access

Overview

• The paper proposes a novel method for 3D reconstruction of interacting multi-person in clothing from a single image.

• It introduces a generative model that can capture the complex interactions and occlusions between individuals in a scene.

• The approach leverages advances in deep learning to reconstruct 3D human bodies and clothing from a single 2D input image.

Plain English Explanation

The paper describes a new way to create 3D models of multiple people interacting with each other, all from just a single 2D photograph. This is a challenging task because people can overlap and hide parts of each other in a 2D image, making it hard to figure out the full 3D structure.

The key idea is to use a machine learning model that can learn the patterns of how people's bodies and clothes interact and occlude each other. By training this model on lots of example images, it can then take a new 2D photo as input and output a 3D reconstruction showing the full 3D shapes of all the people, even the parts that are hidden from the camera.

This could be useful for applications like virtual clothing try-on, where you want to see how an outfit would look on multiple people interacting, or in film/gaming to create realistic 3D scenes with multiple characters. It's an important advance in being able to capture the full 3D world from limited 2D observations.

Technical Explanation

The paper introduces a generative model that can estimate the 3D shapes of multiple people in clothing from a single 2D image. The key innovations are:

Architecture: The model has an encoder-decoder structure, where the encoder takes the 2D image as input and outputs a compact latent code, and the decoder then uses this code to generate the 3D shapes of the people.
Interaction Modeling: The model explicitly represents the interactions and occlusions between individuals using a graph neural network. This allows it to handle complex multi-person scenes.
Clothing Modeling: In addition to the 3D body shapes, the model also reconstructs the 3D geometry of the clothing on each person, enabling more realistic virtual try-on applications.

The authors train and evaluate their model on a new dataset of multi-person images with ground truth 3D annotations. Experiments show that it outperforms prior methods on 3D reconstruction accuracy, while also being able to handle challenging cases like heavy occlusions.

Critical Analysis

One limitation mentioned in the paper is that the current model assumes a fixed number of people in the scene, which may not always be the case in real-world applications. Extending it to handle a variable number of individuals could be an interesting direction for future work.

Additionally, the clothing modeling component relies on a database of pre-scanned garment geometries. Developing methods to dynamically generate clothing shapes based on the body shape and pose could further improve the realism and flexibility of the reconstructions.

Overall, this is a technically impressive piece of research that makes meaningful progress on the challenging problem of 3D reconstruction for interacting multi-person scenes. However, as with any model-based approach, there may be inherent biases or failure cases that require careful consideration when deploying such a system in the real world.

Conclusion

This paper presents a novel deep learning-based approach for 3D reconstruction of multiple interacting people in clothing from a single 2D image. By explicitly modeling the interactions and occlusions between individuals, as well as the 3D geometry of their clothing, the proposed generative model is able to produce high-quality 3D reconstructions.

This work has important applications in areas like virtual try-on, film/gaming, and other scenarios where understanding the 3D structure of complex multi-person scenes is crucial. While the current method has some limitations, it represents a significant step forward in 3D reconstruction capabilities and could inspire further advancements in this active area of research.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

MultiPly: Reconstruction of Multiple People from Monocular Video in the Wild

Zeren Jiang, Chen Guo, Manuel Kaufmann, Tianjian Jiang, Julien Valentin, Otmar Hilliges, Jie Song

We present MultiPly, a novel framework to reconstruct multiple people in 3D from monocular in-the-wild videos. Reconstructing multiple individuals moving and interacting naturally from monocular in-the-wild videos poses a challenging task. Addressing it necessitates precise pixel-level disentanglement of individuals without any prior knowledge about the subjects. Moreover, it requires recovering intricate and complete 3D human shapes from short video sequences, intensifying the level of difficulty. To tackle these challenges, we first define a layered neural representation for the entire scene, composited by individual human and background models. We learn the layered neural representation from videos via our layer-wise differentiable volume rendering. This learning process is further enhanced by our hybrid instance segmentation approach which combines the self-supervised 3D segmentation and the promptable 2D segmentation module, yielding reliable instance segmentation supervision even under close human interaction. A confidence-guided optimization formulation is introduced to optimize the human poses and shape/appearance alternately. We incorporate effective objectives to refine human poses via photometric information and impose physically plausible constraints on human dynamics, leading to temporally consistent 3D reconstructions with high fidelity. The evaluation of our method shows the superiority over prior art on publicly available datasets and in-the-wild videos.

6/4/2024

cs.CV

📊

Multi-person 3D pose estimation from unlabelled data

Daniel Rodriguez-Criado, Pilar Bachiller, George Vogiatzis, Luis J. Manso

Its numerous applications make multi-human 3D pose estimation a remarkably impactful area of research. Nevertheless, assuming a multiple-view system composed of several regular RGB cameras, 3D multi-pose estimation presents several challenges. First of all, each person must be uniquely identified in the different views to separate the 2D information provided by the cameras. Secondly, the 3D pose estimation process from the multi-view 2D information of each person must be robust against noise and potential occlusions in the scenario. In this work, we address these two challenges with the help of deep learning. Specifically, we present a model based on Graph Neural Networks capable of predicting the cross-view correspondence of the people in the scenario along with a Multilayer Perceptron that takes the 2D points to yield the 3D poses of each person. These two models are trained in a self-supervised manner, thus avoiding the need for large datasets with 3D annotations.

4/10/2024

cs.CV cs.AI

The More You See in 2D, the More You Perceive in 3D

Xinyang Han, Zelin Gao, Angjoo Kanazawa, Shubham Goel, Yossi Gandelsman

Humans can infer 3D structure from 2D images of an object based on past experience and improve their 3D understanding as they see more images. Inspired by this behavior, we introduce SAP3D, a system for 3D reconstruction and novel view synthesis from an arbitrary number of unposed images. Given a few unposed images of an object, we adapt a pre-trained view-conditioned diffusion model together with the camera poses of the images via test-time fine-tuning. The adapted diffusion model and the obtained camera poses are then utilized as instance-specific priors for 3D reconstruction and novel view synthesis. We show that as the number of input images increases, the performance of our approach improves, bridging the gap between optimization-based prior-less 3D reconstruction methods and single-image-to-3D diffusion-based methods. We demonstrate our system on real images as well as standard synthetic benchmarks. Our ablation studies confirm that this adaption behavior is key for more accurate 3D understanding.

4/5/2024

cs.CV

R2Human: Real-Time 3D Human Appearance Rendering from a Single Image

Yuanwang Yang, Qiao Feng, Yu-Kun Lai, Kun Li

Rendering 3D human appearance in different views is crucial for achieving holographic communication and immersive VR/AR. Existing methods either rely on multi-camera setups or have low-quality rendered images from a single image. In this paper, we propose R2Human, the first approach for real-time inference and rendering of photorealistic 3D human appearance from a single image. The core of our approach is to combine the strengths of implicit texture fields and explicit neural rendering with our novel representation, namely Z-map. Based on this, we present an end-to-end network that performs high-fidelity color reconstruction of visible areas and provides reliable color inference for occluded regions. To further enhance the 3D perception ability of our network, we leverage the Fourier occupancy field as a prior for generating the texture field and providing a sampling surface in the rendering stage. We also propose a consistency loss and a spatio-temporal fusion strategy to ensure the multi-view coherence. Experimental results show that our method outperforms the state-of-the-art methods on both synthetic data and challenging real-world images, in real time.

6/17/2024

cs.CV