DreamVTON: Customizing 3D Virtual Try-on with Personalized Diffusion Models

Read original: arXiv:2407.16511 - Published 7/24/2024 by Zhenyu Xie, Haoye Dong, Yufei Gao, Zehua Ma, Xiaodan Liang

Overview

This paper presents DreamVTON, a system for customizing 3D virtual try-on with personalized diffusion models.
Given images of a person and of clothes, DreamVTON generates a 3D human wearing those clothes, without relying on expensive 3D training data.
A personalized Stable Diffusion model with multi-concept LoRA supplies the generative prior about the specific person and clothes, and the geometry and texture of the 3D human are optimized separately.

Plain English Explanation

DreamVTON: Customizing 3D Virtual Try-on with Personalized Diffusion Models is a system that lets people virtually try on clothes in 3D. Instead of showing a generic avatar, it builds a 3D human from images of a specific person and specific clothes. This is done using "personalized diffusion models": image-generating AI models that have been adapted to that person and those clothes, and that then guide how the shape and surface appearance of the 3D model are formed.

This is useful because it allows people to get a much better sense of how clothes would look on them, without having to physically try everything on. The personalized 3D model provides a more accurate representation of how the clothing would actually fit. This could help reduce returns and improve the online shopping experience.

Technical Explanation

DreamVTON builds on diffusion-based text-to-3D generation, in which a pre-trained 2D diffusion model such as Stable Diffusion (SD) guides the optimization of a 3D representation through the Score Distillation Sampling (SDS) loss. Diffusion models are generative models that produce realistic images by learning to reverse a gradual noising process.
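As background on how a diffusion model sees its training data, the following is a minimal sketch of the forward noising step in the standard DDPM formulation, x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps. The linear schedule, its default values, and all function names are illustrative assumptions; none of them are taken from the DreamVTON paper.

```python
import math
import random

def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Linearly spaced noise variances beta_1..beta_T (assumed DDPM-style defaults)."""
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]

def cumulative_alphas(betas):
    """alpha_bar_t = product over s <= t of (1 - beta_s)."""
    alpha_bars, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars

def add_noise(x0, t, alpha_bars, rng):
    """Sample x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * eps."""
    signal = math.sqrt(alpha_bars[t])
    sigma = math.sqrt(1.0 - alpha_bars[t])
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    xt = [signal * v + sigma * e for v, e in zip(x0, eps)]
    return xt, eps

rng = random.Random(0)
alpha_bars = cumulative_alphas(linear_beta_schedule(1000))
pixels = [0.5, -0.25, 1.0, 0.0]  # toy "image" with four pixel values
noisy, eps = add_noise(pixels, 999, alpha_bars, rng)

# Early steps keep almost all of the signal; the last step keeps almost none.
print(alpha_bars[0], alpha_bars[-1])
```

A denoising network is trained to predict `eps` from `noisy` and `t`; generation then runs the process in reverse, starting from pure noise.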

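Rather than training on 3D clothing and body scans, DreamVTON personalizes SD with a multi-concept LoRA so that it carries a generative prior about the specific person and clothes shown in the input images. Because personalization degrades the pre-trained model's ability to synthesize consistent views, a DensePose-guided ControlNet is used to keep the body pose prior consistent across camera views.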
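The paper's abstract names LoRA as the personalization technique. As a rough illustration of the general idea, the sketch below applies the usual LoRA update, W' = W + (alpha / r) * B A, to a small frozen weight matrix. The matrices, the helper names, and the choice to sum two separate adapters (one per concept) are assumptions made for illustration; the abstract does not say how DreamVTON's multi-concept LoRA is structured or trained.

```python
def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def lora_delta(A, B, alpha):
    """Low-rank update (alpha / r) * B @ A, with A of shape r x d_in and B of shape d_out x r."""
    rank = len(A)
    scale = alpha / rank
    return [[scale * v for v in row] for row in matmul(B, A)]

def merge_adapters(W, adapters):
    """Return W plus every adapter's low-rank update; W itself is left unchanged."""
    merged = [row[:] for row in W]
    for A, B, alpha in adapters:
        delta = lora_delta(A, B, alpha)
        for i, row in enumerate(delta):
            for j, v in enumerate(row):
                merged[i][j] += v
    return merged

# Frozen 2 x 3 base weight standing in for one projection layer of the model.
W = [[1.0, 0.0, 0.0],
     [0.0, 1.0, 0.0]]

# Two hypothetical rank-1 adapters as (A, B, alpha), e.g. one per concept.
person = ([[1.0, 0.0, 0.0]], [[0.5], [0.0]], 1.0)
clothes = ([[0.0, 0.0, 1.0]], [[0.0], [2.0]], 1.0)

W_personalized = merge_adapters(W, [person, clothes])
print(W_personalized)  # [[1.5, 0.0, 0.0], [0.0, 1.0, 2.0]]
```

Only the small matrices A and B are trained, which is why LoRA can adapt a large pre-trained model from a handful of images while the base weights stay frozen.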

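To keep inconsistent multi-view priors from dominating the optimization, the system adds a template-based optimization mechanism: mask templates guide the learning of the overall geometry shape, and normal and RGB templates guide the learning of geometry and texture details. Geometry and texture are optimized separately, and during the geometry phase a normal-style LoRA is integrated into the personalized SD to strengthen its normal-map prior and encourage smooth geometry.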
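The paper's abstract states that geometry and texture optimization is guided by the Score Distillation Sampling (SDS) loss. For reference, the SDS gradient as it is commonly written following DreamFusion is given below; this is the general form, and the exact weighting and conditioning used in DreamVTON are not given in the abstract.

```latex
\nabla_{\theta} \mathcal{L}_{\mathrm{SDS}}
  = \mathbb{E}_{t,\epsilon}\!\left[
      w(t)\,\bigl(\hat{\epsilon}_{\phi}(x_t;\, y,\, t) - \epsilon\bigr)\,
      \frac{\partial x}{\partial \theta}
    \right],
\qquad
x = g(\theta),
\quad
x_t = \sqrt{\bar{\alpha}_t}\,x + \sqrt{1 - \bar{\alpha}_t}\,\epsilon
```

Here \theta denotes the parameters of the 3D representation, g renders an image x from a sampled camera view, \hat{\epsilon}_{\phi} is the noise predicted by the diffusion model given the condition y (for example a text prompt), \epsilon is the noise that was actually added, and w(t) is a timestep-dependent weight. The gradient pushes rendered views toward images the diffusion model considers likely, which is why a prior that is inconsistent across views can harm the 3D result.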

Critical Analysis

The paper acknowledges some limitations of the current DreamVTON system. For example, it only supports a fixed set of clothing items, and the generation of 3D clothing is still limited in its realism and detail.

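Additionally, although the image-based approach avoids expensive 3D data, personalization still depends on images of the specific person and clothes, and the abstract itself notes that adding a personalized module degrades the pre-trained model's multi-view and multi-domain synthesis, a weakness the template-based mechanism is designed to mitigate.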

Further research could explore ways to expand the range of supported clothing, improve the quality of the 3D renderings, and reduce the data requirements for personalization. Integrating DreamVTON with other 3D scanning or reconstruction technologies could also enhance the user experience.

Conclusion

DreamVTON advances image-based 3D virtual try-on by combining personalized diffusion models with template-based optimization to produce a 3D human that matches a specific person and specific clothes. As the technology matures, systems of this kind could have implications for online shopping, fashion design, and consumer engagement with virtual experiences.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

DreamVTON: Customizing 3D Virtual Try-on with Personalized Diffusion Models

Zhenyu Xie, Haoye Dong, Yufei Gao, Zehua Ma, Xiaodan Liang

Image-based 3D Virtual Try-ON (VTON) aims to sculpt the 3D human according to person and clothes images, which is data-efficient (i.e., getting rid of expensive 3D data) but challenging. Recent text-to-3D methods achieve remarkable improvement in high-fidelity 3D human generation, demonstrating its potential for 3D virtual try-on. Inspired by the impressive success of personalized diffusion models (e.g., Dreambooth and LoRA) for 2D VTON, it is straightforward to achieve 3D VTON by integrating the personalization technique into the diffusion-based text-to-3D framework. However, employing the personalized module in a pre-trained diffusion model (e.g., StableDiffusion (SD)) would degrade the model's capability for multi-view or multi-domain synthesis, which is detrimental to the geometry and texture optimization guided by Score Distillation Sampling (SDS) loss. In this work, we propose a novel customizing 3D human try-on model, named DreamVTON, to separately optimize the geometry and texture of the 3D human. Specifically, a personalized SD with multi-concept LoRA is proposed to provide the generative prior about the specific person and clothes, while a Densepose-guided ControlNet is exploited to guarantee consistent prior about body pose across various camera views. Besides, to avoid the inconsistent multi-view priors from the personalized SD dominating the optimization, DreamVTON introduces a template-based optimization mechanism, which employs mask templates for geometry shape learning and normal/RGB templates for geometry/texture details learning. Furthermore, for the geometry optimization phase, DreamVTON integrates a normal-style LoRA into personalized SD to enhance normal map generative prior, facilitating smooth geometry modeling.

7/24/2024

MV-VTON: Multi-View Virtual Try-On with Diffusion Models

Haoyu Wang, Zhilu Zhang, Donglin Di, Shiliang Zhang, Wangmeng Zuo

The goal of image-based virtual try-on is to generate an image of the target person naturally wearing the given clothing. However, existing methods solely focus on the frontal try-on using the frontal clothing. When the views of the clothing and person are significantly inconsistent, particularly when the person's view is non-frontal, the results are unsatisfactory. To address this challenge, we introduce Multi-View Virtual Try-ON (MV-VTON), which aims to reconstruct the dressing results from multiple views using the given clothes. Given that single-view clothes provide insufficient information for MV-VTON, we instead employ two images, i.e., the frontal and back views of the clothing, to encompass the complete view as much as possible. Moreover, we adopt diffusion models that have demonstrated superior abilities to perform our MV-VTON. In particular, we propose a view-adaptive selection method where hard-selection and soft-selection are applied to the global and local clothing feature extraction, respectively. This ensures that the clothing features are roughly fit to the person's view. Subsequently, we suggest joint attention blocks to align and fuse clothing features with person features. Additionally, we collect a MV-VTON dataset MVG, in which each person has multiple photos with diverse views and poses. Experiments show that the proposed method not only achieves state-of-the-art results on MV-VTON task using our MVG dataset, but also has superiority on frontal-view virtual try-on task using VITON-HD and DressCode datasets. Codes and datasets are publicly released at https://github.com/hywang2002/MV-VTON .

9/5/2024

Improving Diffusion Models for Authentic Virtual Try-on in the Wild

Yisol Choi, Sangkyung Kwak, Kyungmin Lee, Hyungwon Choi, Jinwoo Shin

This paper considers image-based virtual try-on, which renders an image of a person wearing a curated garment, given a pair of images depicting the person and the garment, respectively. Previous works adapt existing exemplar-based inpainting diffusion models for virtual try-on to improve the naturalness of the generated visuals compared to other methods (e.g., GAN-based), but they fail to preserve the identity of the garments. To overcome this limitation, we propose a novel diffusion model that improves garment fidelity and generates authentic virtual try-on images. Our method, coined IDM-VTON, uses two different modules to encode the semantics of garment image; given the base UNet of the diffusion model, 1) the high-level semantics extracted from a visual encoder are fused to the cross-attention layer, and then 2) the low-level features extracted from parallel UNet are fused to the self-attention layer. In addition, we provide detailed textual prompts for both garment and person images to enhance the authenticity of the generated visuals. Finally, we present a customization method using a pair of person-garment images, which significantly improves fidelity and authenticity. Our experimental results show that our method outperforms previous approaches (both diffusion-based and GAN-based) in preserving garment details and generating authentic virtual try-on images, both qualitatively and quantitatively. Furthermore, the proposed customization method demonstrates its effectiveness in a real-world scenario. More visualizations are available in our project page: https://idm-vton.github.io

7/30/2024

CatVTON: Concatenation Is All You Need for Virtual Try-On with Diffusion Models

Zheng Chong, Xiao Dong, Haoxiang Li, Shiyue Zhang, Wenqing Zhang, Xujie Zhang, Hanqing Zhao, Xiaodan Liang

Virtual try-on methods based on diffusion models achieve realistic try-on effects but often replicate the backbone network as a ReferenceNet or use additional image encoders to process condition inputs, leading to high training and inference costs. In this work, we rethink the necessity of ReferenceNet and image encoders and innovate the interaction between garment and person by proposing CatVTON, a simple and efficient virtual try-on diffusion model. CatVTON facilitates the seamless transfer of in-shop or worn garments of any category to target persons by simply concatenating them in spatial dimensions as inputs. The efficiency of our model is demonstrated in three aspects: (1) Lightweight network: Only the original diffusion modules are used, without additional network modules. The text encoder and cross-attentions for text injection in the backbone are removed, reducing the parameters by 167.02M. (2) Parameter-efficient training: We identified the try-on relevant modules through experiments and achieved high-quality try-on effects by training only 49.57M parameters, approximately 5.51 percent of the backbone network's parameters. (3) Simplified inference: CatVTON eliminates all unnecessary conditions and preprocessing steps, including pose estimation, human parsing, and text input, requiring only a garment reference, target person image, and mask for the virtual try-on process. Extensive experiments demonstrate that CatVTON achieves superior qualitative and quantitative results with fewer prerequisites and trainable parameters than baseline methods. Furthermore, CatVTON shows good generalization in in-the-wild scenarios despite using open-source datasets with only 73K samples.

7/24/2024