Improving Virtual Try-On with Garment-focused Diffusion Models

Read original: arXiv:2409.08258 - Published 9/14/2024 by Siqi Wan, Yehao Li, Jingwen Chen, Yingwei Pan, Ting Yao, Yang Cao, Tao Mei

Overview

This paper proposes GarDiff, a diffusion model for image-based virtual try-on that is guided by garment-specific appearance priors.
The model aims to generate high-quality, realistic images of a person wearing a given in-shop garment.
The garment-focused approach is meant to preserve the garment's appearance and texture details while keeping the overall image photorealistic.

Plain English Explanation

The paper describes a new type of artificial intelligence (AI) model that can help improve virtual try-on experiences. Virtual try-on is when you can see what you would look like wearing a piece of clothing, without actually trying it on.

The researchers developed a diffusion model that is specifically designed to focus on the appearance of the garment (clothing item) being tried on. This helps the model generate more realistic and consistent images of people wearing different clothes.

Traditionally, virtual try-on systems have struggled to create images that look natural and true-to-life. This new garment-focused diffusion model aims to address those limitations by better understanding the unique properties and characteristics of each clothing item. This allows the model to more accurately apply the clothing to the person's body and create a more convincing final image.

Technical Explanation

The paper introduces a garment-focused diffusion model for improving virtual try-on. Diffusion models are a type of AI that can generate new images by starting with random noise and gradually refining it into a realistic-looking image.
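To make the "start from noise and gradually refine" idea concrete, here is a minimal sketch of a DDPM-style reverse (denoising) loop in NumPy. It is not the paper's model: the linear noise schedule, the step count, and the function names are illustrative assumptions, and `oracle_noise_predictor` is a hypothetical stand-in for the trained network (it cheats by looking at the clean target) so that the loop can run without any training.

```python
import numpy as np

def make_schedule(num_steps=50, beta_start=1e-4, beta_end=0.02):
    """Linear noise schedule (an assumed, commonly used choice)."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)  # cumulative signal retention per step
    return betas, alphas, alpha_bars

def oracle_noise_predictor(x_t, t, x0, alpha_bars):
    """Hypothetical stand-in for a trained denoiser.

    A real model predicts the noise from x_t (plus conditioning) alone;
    this toy version recovers it exactly because it is given the clean x0.
    """
    return (x_t - np.sqrt(alpha_bars[t]) * x0) / np.sqrt(1.0 - alpha_bars[t])

def sample(x0, num_steps=50, seed=0):
    """Run the reverse process: pure noise -> progressively cleaner image."""
    rng = np.random.default_rng(seed)
    betas, alphas, alpha_bars = make_schedule(num_steps)
    x = rng.standard_normal(x0.shape)  # start from random noise
    for t in reversed(range(num_steps)):
        eps_hat = oracle_noise_predictor(x, t, x0, alpha_bars)
        # Mean of the reverse step given the predicted noise
        mean = (x - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps_hat) / np.sqrt(alphas[t])
        if t > 0:
            # Re-inject a small amount of noise on all but the last step
            x = mean + np.sqrt(betas[t]) * rng.standard_normal(x0.shape)
        else:
            x = mean
    return x

if __name__ == "__main__":
    target = np.linspace(-1.0, 1.0, 16).reshape(4, 4)  # toy "image"
    result = sample(target)
    print("max abs error:", np.abs(result - target).max())
```

In a real try-on system the noise predictor is a large neural network conditioned on the person and garment images; the loop structure is the part this sketch is meant to illustrate.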

The key innovation in this work is the incorporation of garment-specific appearance priors into the diffusion model. According to the paper's abstract, the model (named GarDiff) adapts a pre-trained latent diffusion model with additional appearance priors derived from the CLIP and VAE encodings of the reference garment, so that the denoising process is guided by how that specific garment looks.

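The researchers also integrate a garment-focused adapter into the UNet of the diffusion model, aiming for fine-grained local alignment between the reference garment's visual appearance and the person's pose. In addition, they design an appearance loss over the synthesized garment to enhance its high-frequency details, such as fine textures.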
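The paper's abstract mentions an appearance loss computed over the synthesized garment to enhance high-frequency details, but this summary does not give its formula. The sketch below is therefore only one plausible shape such a loss could take, not the authors' formulation: it assumes grayscale images, a binary garment mask, an L1 pixel term, and a Laplacian filter as the high-frequency extractor. The names `high_frequency`, `appearance_loss`, and `hf_weight` are hypothetical.

```python
import numpy as np

def high_frequency(img):
    """4-neighbour Laplacian with edge padding; img is an (H, W) float array.

    Assumed proxy for "high-frequency details": it responds to edges and
    fine texture and is zero on flat regions.
    """
    p = np.pad(img, 1, mode="edge")
    return 4.0 * img - p[:-2, 1:-1] - p[2:, 1:-1] - p[1:-1, :-2] - p[1:-1, 2:]

def appearance_loss(pred, target, garment_mask, hf_weight=1.0):
    """Masked L1 loss on pixels plus a weighted L1 loss on Laplacian responses.

    Only pixels where garment_mask == 1 contribute, so the loss looks
    at the garment region and ignores the rest of the image.
    """
    mask_sum = garment_mask.sum()
    if mask_sum == 0:
        return 0.0
    pixel = np.abs(pred - target)
    detail = np.abs(high_frequency(pred) - high_frequency(target))
    return float(((pixel + hf_weight * detail) * garment_mask).sum() / mask_sum)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    target = rng.random((8, 8))
    mask = np.zeros((8, 8))
    mask[2:6, 2:6] = 1.0           # hypothetical garment region
    pred = target.copy()
    pred[3, 3] += 0.5              # a texture error inside the garment
    print(appearance_loss(pred, target, mask))
```

The design choice illustrated here is that restricting the loss to the garment region and adding a detail term penalizes lost texture more heavily than a plain pixel loss would.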

The authors report that extensive experiments on the VITON-HD and DressCode benchmark datasets demonstrate the superiority of the garment-focused diffusion model over state-of-the-art virtual try-on methods, with generated images that better preserve the appearance and texture details of the given garment.

Critical Analysis

The paper provides a thorough technical explanation of the proposed garment-focused diffusion model and its advantages over prior virtual try-on approaches. The researchers have carefully designed the model architecture and training process to better account for the unique properties of garments.

However, the paper does not address some potential limitations or areas for further research. For example, the model may still struggle with complex fabric textures, intricate garment designs, or significant body shape differences between the user and the reference model. Additional work may be needed to further improve the realism and generalization capabilities of the virtual try-on system.

Additionally, the paper does not discuss the computational efficiency or real-time performance of the proposed model, which are important considerations for practical deployment in consumer-facing applications. Future research could explore ways to optimize the model for faster inference and more seamless user experiences.

Conclusion

The garment-focused diffusion model presented in this paper represents a significant advancement in virtual try-on technology. By incorporating garment-specific appearance priors, the model can generate more realistic and consistent images of people wearing different clothing items.

This research has the potential to improve the online shopping experience, allowing customers to better visualize how clothes will look on them before making a purchase. The improved realism and garment fidelity of the virtual try-on results could also benefit other applications, such as fashion design, virtual fashion shows, and digital avatars.

Overall, this paper makes a valuable contribution to the field of AI-powered virtual try-on and paves the way for further developments in this important and growing area of research.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Improving Virtual Try-On with Garment-focused Diffusion Models

Siqi Wan, Yehao Li, Jingwen Chen, Yingwei Pan, Ting Yao, Yang Cao, Tao Mei

Diffusion models have led to the revolutionizing of generative modeling in numerous image synthesis tasks. Nevertheless, it is not trivial to directly apply diffusion models for synthesizing an image of a target person wearing a given in-shop garment, i.e., image-based virtual try-on (VTON) task. The difficulty originates from the aspect that the diffusion process should not only produce holistically high-fidelity photorealistic image of the target person, but also locally preserve every appearance and texture detail of the given garment. To address this, we shape a new Diffusion model, namely GarDiff, which triggers the garment-focused diffusion process with amplified guidance of both basic visual appearance and detailed textures (i.e., high-frequency details) derived from the given garment. GarDiff first remoulds a pre-trained latent diffusion model with additional appearance priors derived from the CLIP and VAE encodings of the reference garment. Meanwhile, a novel garment-focused adapter is integrated into the UNet of diffusion model, pursuing local fine-grained alignment with the visual appearance of reference garment and human pose. We specifically design an appearance loss over the synthesized garment to enhance the crucial, high-frequency details. Extensive experiments on VITON-HD and DressCode datasets demonstrate the superiority of our GarDiff when compared to state-of-the-art VTON approaches. Code is publicly available at: https://github.com/siqi0905/GarDiff/tree/master.

9/14/2024

Improving Diffusion Models for Authentic Virtual Try-on in the Wild

Yisol Choi, Sangkyung Kwak, Kyungmin Lee, Hyungwon Choi, Jinwoo Shin

This paper considers image-based virtual try-on, which renders an image of a person wearing a curated garment, given a pair of images depicting the person and the garment, respectively. Previous works adapt existing exemplar-based inpainting diffusion models for virtual try-on to improve the naturalness of the generated visuals compared to other methods (e.g., GAN-based), but they fail to preserve the identity of the garments. To overcome this limitation, we propose a novel diffusion model that improves garment fidelity and generates authentic virtual try-on images. Our method, coined IDM-VTON, uses two different modules to encode the semantics of garment image; given the base UNet of the diffusion model, 1) the high-level semantics extracted from a visual encoder are fused to the cross-attention layer, and then 2) the low-level features extracted from parallel UNet are fused to the self-attention layer. In addition, we provide detailed textual prompts for both garment and person images to enhance the authenticity of the generated visuals. Finally, we present a customization method using a pair of person-garment images, which significantly improves fidelity and authenticity. Our experimental results show that our method outperforms previous approaches (both diffusion-based and GAN-based) in preserving garment details and generating authentic virtual try-on images, both qualitatively and quantitatively. Furthermore, the proposed customization method demonstrates its effectiveness in a real-world scenario. More visualizations are available in our project page: https://idm-vton.github.io

7/30/2024

MV-VTON: Multi-View Virtual Try-On with Diffusion Models

Haoyu Wang, Zhilu Zhang, Donglin Di, Shiliang Zhang, Wangmeng Zuo

The goal of image-based virtual try-on is to generate an image of the target person naturally wearing the given clothing. However, existing methods solely focus on the frontal try-on using the frontal clothing. When the views of the clothing and person are significantly inconsistent, particularly when the person's view is non-frontal, the results are unsatisfactory. To address this challenge, we introduce Multi-View Virtual Try-ON (MV-VTON), which aims to reconstruct the dressing results from multiple views using the given clothes. Given that single-view clothes provide insufficient information for MV-VTON, we instead employ two images, i.e., the frontal and back views of the clothing, to encompass the complete view as much as possible. Moreover, we adopt diffusion models that have demonstrated superior abilities to perform our MV-VTON. In particular, we propose a view-adaptive selection method where hard-selection and soft-selection are applied to the global and local clothing feature extraction, respectively. This ensures that the clothing features are roughly fit to the person's view. Subsequently, we suggest joint attention blocks to align and fuse clothing features with person features. Additionally, we collect a MV-VTON dataset MVG, in which each person has multiple photos with diverse views and poses. Experiments show that the proposed method not only achieves state-of-the-art results on MV-VTON task using our MVG dataset, but also has superiority on frontal-view virtual try-on task using VITON-HD and DressCode datasets. Codes and datasets are publicly released at https://github.com/hywang2002/MV-VTON .

9/5/2024

ViViD: Video Virtual Try-on using Diffusion Models

Zixun Fang, Wei Zhai, Aimin Su, Hongliang Song, Kai Zhu, Mao Wang, Yu Chen, Zhiheng Liu, Yang Cao, Zheng-Jun Zha

Video virtual try-on aims to transfer a clothing item onto the video of a target person. Directly applying the technique of image-based try-on to the video domain in a frame-wise manner will cause temporal-inconsistent outcomes while previous video-based try-on solutions can only generate low visual quality and blurring results. In this work, we present ViViD, a novel framework employing powerful diffusion models to tackle the task of video virtual try-on. Specifically, we design the Garment Encoder to extract fine-grained clothing semantic features, guiding the model to capture garment details and inject them into the target video through the proposed attention feature fusion mechanism. To ensure spatial-temporal consistency, we introduce a lightweight Pose Encoder to encode pose signals, enabling the model to learn the interactions between clothing and human posture and insert hierarchical Temporal Modules into the text-to-image stable diffusion model for more coherent and lifelike video synthesis. Furthermore, we collect a new dataset, which is the largest, with the most diverse types of garments and the highest resolution for the task of video virtual try-on to date. Extensive experiments demonstrate that our approach is able to yield satisfactory video try-on results. The dataset, codes, and weights will be publicly available. Project page: https://becauseimbatman0.github.io/ViViD.

5/29/2024