SCULPT: Shape-Conditioned Unpaired Learning of Pose-dependent Clothed and Textured Human Meshes

2308.10638

Published 5/7/2024 by Soubhik Sanyal, Partha Ghosh, Jinlong Yang, Michael J. Black, Justus Thies, Timo Bolkart

Abstract

We present SCULPT, a novel 3D generative model for clothed and textured 3D meshes of humans. Specifically, we devise a deep neural network that learns to represent the geometry and appearance distribution of clothed human bodies. Training such a model is challenging, as datasets of textured 3D meshes for humans are limited in size and accessibility. Our key observation is that there exist medium-sized 3D scan datasets like CAPE, as well as large-scale 2D image datasets of clothed humans, and that multiple appearances can be mapped to a single geometry. To effectively learn from the two data modalities, we propose an unpaired learning procedure for pose-dependent clothed and textured human meshes. Specifically, we learn a pose-dependent geometry space from 3D scan data. We represent this as per-vertex displacements w.r.t. the SMPL model. Next, we train a geometry-conditioned texture generator in an unsupervised way using the 2D image data. We use intermediate activations of the learned geometry model to condition our texture generator. To alleviate entanglement between pose and clothing type, and pose and clothing appearance, we condition both the texture and geometry generators with attribute labels such as clothing types for the geometry, and clothing colors for the texture generator. We automatically generated these conditioning labels for the 2D images based on the visual question answering model BLIP and CLIP. We validate our method on the SCULPT dataset, and compare to state-of-the-art 3D generative models for clothed human bodies. Our code and data can be found at https://sculpt.is.tue.mpg.de.

Overview

Presents a novel 3D generative model called SCULPT for creating clothed and textured 3D meshes of human bodies
Addresses the challenge of limited datasets of 3D textured human meshes by leveraging 2D image data and 3D scan data
Proposes an unpaired learning approach to effectively learn from these two data modalities

Plain English Explanation

SCULPT is a new deep learning model that can generate 3D computer models of clothed human bodies, including both the shape of the clothed body and the appearance of the clothing. Training such a model is challenging because datasets of textured 3D meshes of clothed people are limited in size and hard to access.

To get around this data limitation, the researchers behind SCULPT combined two different types of data: 3D scans of clothed people from a medium-sized dataset called CAPE, and 2D photos of people wearing different clothes from large image datasets. The scans and the photos do not show the same people, so the model has to learn from both sources without matched pairs.

The key innovation is that SCULPT learns a "geometry space" from the 3D scan data, which represents how clothing changes the surface of the body in different poses. It then uses this geometry information to guide a texture generator that learns the appearance of the clothing from the 2D image data. Because many different appearances can fit the same geometry, SCULPT can create 3D models that have a plausible clothed shape and a clothing appearance that matches it.

To further improve the results, SCULPT also uses labels describing the type of clothing (e.g. shirt, pants) and its color. The clothing-type labels are given to the geometry generator and the color labels to the texture generator. This helps avoid issues where the model confuses the pose of the body with the type or appearance of the clothing.

Overall, SCULPT represents an important advance in the field of 3D human modeling, by enabling the creation of high-quality, textured 3D meshes of clothed human bodies using limited 3D data. This could have applications in areas like virtual reality, animation, and human-computer interaction.

Technical Explanation

The key technical aspects of the SCULPT model are:

Pose-Dependent Geometry Space: The researchers first learn a pose-dependent geometry space for clothed bodies from 3D scan data such as the CAPE dataset. The geometry is represented as per-vertex displacements relative to the underlying SMPL body model.
Geometry-Conditioned Texture Generation: They then train a texture generator that produces the appearance of the clothing, conditioned on intermediate activations of the learned geometry model. This generator is trained in an unsupervised way using the 2D image data.
Attribute Conditioning: To reduce entanglement between pose and clothing type, and between pose and clothing appearance, the researchers condition the geometry generator on clothing-type labels and the texture generator on clothing-color labels. For the 2D images, these labels are generated automatically using the vision-language models BLIP and CLIP.
Unpaired Learning: Since the 3D scan data and 2D image data are not paired, the researchers use an unpaired learning approach to leverage both data sources. This allows SCULPT to be trained without requiring 3D scans and 2D images of the same subjects.
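To make the displacement representation and the attribute conditioning more concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the function names, the latent size of 64, and the clothing-type vocabulary are hypothetical, and random numbers stand in for both a posed SMPL mesh and the output of a trained geometry generator. The only fixed facts used are that the standard SMPL mesh has 6890 vertices and 72 pose parameters.

```python
import random

NUM_SMPL_VERTICES = 6890   # vertex count of the standard SMPL body mesh
NUM_SMPL_POSE_PARAMS = 72  # 24 joints x 3 axis-angle values

# Hypothetical label vocabulary, for illustration only.
CLOTHING_TYPES = ["short_sleeve_top", "long_sleeve_top", "short_pants", "long_pants"]


def one_hot(label, vocabulary):
    """Encode a label as a one-hot list over the given vocabulary."""
    if label not in vocabulary:
        raise ValueError(f"unknown label: {label}")
    return [1.0 if entry == label else 0.0 for entry in vocabulary]


def build_geometry_input(latent, pose, clothing_type):
    """Concatenate latent code, pose parameters, and the clothing-type label."""
    return list(latent) + list(pose) + one_hot(clothing_type, CLOTHING_TYPES)


def apply_displacements(body_vertices, displacements):
    """Clothed surface = body vertex + 3D offset, one offset per vertex."""
    if len(body_vertices) != len(displacements):
        raise ValueError("need exactly one displacement per body vertex")
    return [
        tuple(b + d for b, d in zip(vertex, offset))
        for vertex, offset in zip(body_vertices, displacements)
    ]


rng = random.Random(0)

# Stand-in for a posed SMPL mesh (random points, not a real body).
body = [tuple(rng.gauss(0.0, 1.0) for _ in range(3)) for _ in range(NUM_SMPL_VERTICES)]
# Stand-in for the offsets a trained geometry generator would predict.
offsets = [tuple(rng.gauss(0.0, 0.01) for _ in range(3)) for _ in range(NUM_SMPL_VERTICES)]
clothed = apply_displacements(body, offsets)

latent = [rng.gauss(0.0, 1.0) for _ in range(64)]  # hypothetical latent size
pose = [0.0] * NUM_SMPL_POSE_PARAMS                # rest pose
cond = build_geometry_input(latent, pose, "long_pants")
```

Because the clothed mesh keeps the SMPL vertex count and ordering, a texture defined on the body's UV layout can in principle be reused across generated geometries, which is consistent with the observation that multiple appearances can be mapped to a single geometry.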

The researchers validate SCULPT on the SCULPT dataset, and compare it to other state-of-the-art 3D generative models for clothed human bodies, demonstrating its ability to generate high-quality, textured 3D meshes.

Critical Analysis

One potential limitation of the SCULPT model is that it relies on the availability of 3D scan data, which may not be easily accessible or scalable to larger datasets. While the researchers showed that SCULPT can be trained with the medium-sized CAPE dataset, it would be interesting to see how the model performs with larger 3D datasets, or whether it can be adapted to work with even sparser 3D data.

Additionally, the automatic labeling of clothing attributes using models like BLIP and CLIP may introduce some noise or errors, which could impact the final quality of the generated results. It would be valuable to explore the sensitivity of SCULPT to these label imperfections, and whether more precise labeling methods could further improve the model's performance.

Finally, while the researchers have made the SCULPT code and data publicly available, it would be helpful to see more extensive user studies or real-world applications to fully assess the practical utility of the model. Exploring how SCULPT could be integrated into various 3D content creation workflows would be a valuable next step.

Conclusion

The SCULPT model presented in this paper represents an important advancement in the field of 3D human modeling, particularly for generating clothed and textured 3D meshes. By leveraging both 3D scan data and 2D image data in a novel unpaired learning framework, the researchers have demonstrated a way to overcome the limitations of existing 3D human datasets.

The ability to generate high-quality, realistic 3D models of clothed human bodies has numerous potential applications, from virtual reality and animation to fashion and e-commerce. As the field of 3D human modeling continues to evolve, SCULPT and similar techniques will likely play a crucial role in enabling more immersive and engaging digital experiences.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Semantic Human Mesh Reconstruction with Textures

Xiaoyu Zhan, Jianxin Yang, Yuanqi Li, Jie Guo, Yanwen Guo, Wenping Wang

The field of 3D detailed human mesh reconstruction has made significant progress in recent years. However, current methods still face challenges when used in industrial applications due to unstable results, low-quality meshes, and a lack of UV unwrapping and skinning weights. In this paper, we present SHERT, a novel pipeline that can reconstruct semantic human meshes with textures and high-precision details. SHERT applies semantic- and normal-based sampling between the detailed surface (e.g. mesh and SDF) and the corresponding SMPL-X model to obtain a partially sampled semantic mesh and then generates the complete semantic mesh by our specifically designed self-supervised completion and refinement networks. Using the complete semantic mesh as a basis, we employ a texture diffusion model to create human textures that are driven by both images and texts. Our reconstructed meshes have stable UV unwrapping, high-quality triangle meshes, and consistent semantic information. The given SMPL-X model provides semantic information and shape priors, allowing SHERT to perform well even with incorrect and incomplete inputs. The semantic information also makes it easy to substitute and animate different body parts such as the face, body, and hands. Quantitative and qualitative experiments demonstrate that SHERT is capable of producing high-fidelity and robust semantic meshes that outperform state-of-the-art methods.

4/4/2024

cs.CV

TELA: Text to Layer-wise 3D Clothed Human Generation

Junting Dong, Qi Fang, Zehuan Huang, Xudong Xu, Jingbo Wang, Sida Peng, Bo Dai

This paper addresses the task of 3D clothed human generation from textual descriptions. Previous works usually encode the human body and clothes as a holistic model and generate the whole model in a single-stage optimization, which makes them struggle for clothing editing and meanwhile lose fine-grained control over the whole generation process. To solve this, we propose a layer-wise clothed human representation combined with a progressive optimization strategy, which produces clothing-disentangled 3D human models while providing control capacity for the generation process. The basic idea is progressively generating a minimal-clothed human body and layer-wise clothes. During clothing generation, a novel stratified compositional rendering method is proposed to fuse multi-layer human models, and a new loss function is utilized to help decouple the clothing model from the human body. The proposed method achieves high-quality disentanglement, which thereby provides an effective way for 3D garment generation. Extensive experiments demonstrate that our approach achieves state-of-the-art 3D clothed human generation while also supporting cloth editing applications such as virtual try-on. Project page: http://jtdong.com/tela_layer/

4/26/2024

cs.CV

SMPLX-Lite: A Realistic and Drivable Avatar Benchmark with Rich Geometry and Texture Annotations

Yujiao Jiang, Qingmin Liao, Zhaolong Wang, Xiangru Lin, Zongqing Lu, Yuxi Zhao, Hanqing Wei, Jingrui Ye, Yu Zhang, Zhijing Shao

Recovering photorealistic and drivable full-body avatars is crucial for numerous applications, including virtual reality, 3D games, and tele-presence. Most methods, whether reconstruction or generation, require large numbers of human motion sequences and corresponding textured meshes. To easily learn a drivable avatar, a reasonable parametric body model with unified topology is paramount. However, existing human body datasets either have images or textured models and lack parametric models which fit clothes well. We propose a new parametric model SMPLX-Lite-D, which can fit detailed geometry of the scanned mesh while maintaining stable geometry in the face, hand and foot regions. We present SMPLX-Lite dataset, the most comprehensive clothing avatar dataset with multi-view RGB sequences, keypoints annotations, textured scanned meshes, and textured SMPLX-Lite-D models. With the SMPLX-Lite dataset, we train a conditional variational autoencoder model that takes human pose and facial keypoints as input, and generates a photorealistic drivable human avatar.

5/31/2024

cs.CV cs.GR

3D Human Reconstruction in the Wild with Synthetic Data Using Generative Models

Yongtao Ge, Wenjia Wang, Yongfan Chen, Hao Chen, Chunhua Shen

In this work, we show that synthetic data created by generative models is complementary to computer graphics (CG) rendered data for achieving remarkable generalization performance on diverse real-world scenes for 3D human pose and shape estimation (HPS). Specifically, we propose an effective approach based on recent diffusion models, termed HumanWild, which can effortlessly generate human images and corresponding 3D mesh annotations. We first collect a large-scale human-centric dataset with comprehensive annotations, e.g., text captions and surface normal images. Then, we train a customized ControlNet model upon this dataset to generate diverse human images and initial ground-truth labels. At the core of this step is that we can easily obtain numerous surface normal images from a 3D human parametric model, e.g., SMPL-X, by rendering the 3D mesh onto the image plane. As there exists inevitable noise in the initial labels, we then apply an off-the-shelf foundation segmentation model, i.e., SAM, to filter negative data samples. Our data generation pipeline is flexible and customizable to facilitate different real-world tasks, e.g., ego-centric scenes and perspective-distortion scenes. The generated dataset comprises 0.79M images with corresponding 3D annotations, covering versatile viewpoints, scenes, and human identities. We train various HPS regressors on top of the generated data and evaluate them on a wide range of benchmarks (3DPW, RICH, EgoBody, AGORA, SSP-3D) to verify the effectiveness of the generated data. By exclusively employing generative models, we generate large-scale in-the-wild human images and high-quality annotations, eliminating the need for real-world data collection.

4/12/2024

cs.CV