Improving Diffusion Models for Authentic Virtual Try-on in the Wild

Read original: arXiv:2403.05139 - Published 7/30/2024 by Yisol Choi, Sangkyung Kwak, Kyungmin Lee, Hyungwon Choi, Jinwoo Shin

Overview

This paper proposes improvements to diffusion models for more authentic virtual try-on in real-world scenarios.
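Key contributions, per the paper's abstract, are a diffusion model called IDM-VTON that encodes the garment image with two modules, detailed textual prompts for both garment and person images, and a customization method that uses a pair of person-garment images.
The goal is to preserve the identity and details of the garment while generating natural, authentic try-on images.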

Plain English Explanation

This research aims to make virtual try-on technology more realistic and useful for people in their everyday lives. Virtual try-on allows you to see what clothes would look like on you without actually having to put them on. However, current virtual try-on systems often struggle with real-world conditions like different body shapes and clothing styles.

The researchers developed new techniques to improve diffusion models, a type of AI that can generate realistic images. According to the abstract, their model, IDM-VTON, reads the garment image in two ways: one module captures its overall look and another captures its fine details, and both are fed into the image generator. They also describe the garment and the person in detailed text prompts, and they can customize the model using a pair of person-garment images. The goal is virtual try-on that keeps the garment's details intact, so people can get a better sense of how clothes will actually appear on their own bodies.

By making virtual try-on more realistic and usable in real-world scenarios, this research could help make online shopping easier and more satisfying for consumers. It may also have applications in the fashion industry, allowing designers and brands to better understand how their creations will look on diverse customers.

Technical Explanation

The paper introduces several key innovations to improve diffusion models for virtual try-on:

Two Garment Encoding Modules: Given the base UNet of the diffusion model, the method (IDM-VTON) encodes the garment image in two ways. High-level semantics extracted from a visual encoder are fused into the cross-attention layers, and low-level features extracted from a parallel UNet are fused into the self-attention layers.
Detailed Textual Prompts: The authors provide detailed text prompts for both the garment and person images to enhance the authenticity of the generated visuals.
Customization: A customization method uses a pair of person-garment images, which the authors report significantly improves fidelity and authenticity in a real-world scenario.

Through quantitative and qualitative evaluations, the paper reports that the method outperforms prior approaches, both diffusion-based and GAN-based, in preserving garment details and generating authentic virtual try-on images.
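
The paper's abstract describes garment features entering the base UNet through its attention layers: low-level features from a parallel UNet go into self-attention, and high-level semantics from a visual encoder go into cross-attention. The sketch below illustrates that idea with plain scaled dot-product attention on random vectors. It is not the authors' implementation: the function names, the token-axis concatenation in fuse_self_attention, the additive image branch and image_scale in fuse_cross_attention, and the absence of learned projections are all assumptions made for illustration.

```python
import math
import random

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention; each argument is a list of token vectors."""
    scale = math.sqrt(len(queries[0]))
    width = len(values[0])
    outputs = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
        weights = softmax(scores)
        outputs.append(
            [sum(w * v[j] for w, v in zip(weights, values)) for j in range(width)]
        )
    return outputs

def fuse_self_attention(person_tokens, garment_tokens):
    # Assumption: fuse low-level garment features by concatenating them with the
    # person tokens, attending over the joint sequence, and keeping only the
    # rows that belong to the person tokens.
    joint = person_tokens + garment_tokens
    return attention(joint, joint, joint)[: len(person_tokens)]

def fuse_cross_attention(person_tokens, text_tokens, image_tokens, image_scale=1.0):
    # Assumption: fuse high-level garment semantics by adding a second
    # cross-attention branch over image tokens to the usual text branch.
    text_out = attention(person_tokens, text_tokens, text_tokens)
    image_out = attention(person_tokens, image_tokens, image_tokens)
    return [
        [t + image_scale * i for t, i in zip(t_row, i_row)]
        for t_row, i_row in zip(text_out, image_out)
    ]

def random_tokens(rng, count, dim):
    """Hypothetical stand-in for real encoder outputs."""
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(count)]

rng = random.Random(0)
person = random_tokens(rng, 16, 8)        # base UNet hidden states
garment_low = random_tokens(rng, 16, 8)   # parallel UNet features (low-level)
garment_high = random_tokens(rng, 4, 8)   # visual encoder tokens (high-level)
text = random_tokens(rng, 6, 8)           # text prompt tokens

hidden = fuse_self_attention(person, garment_low)
hidden = fuse_cross_attention(hidden, text, garment_high)
print(len(hidden), len(hidden[0]))  # 16 8
```

In both functions the output keeps the shape of the person tokens, so in this sketch the garment only changes what the person tokens attend to, not the size of the hidden state passed on to later layers.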

Critical Analysis

The paper provides a thorough technical contribution, but there are a few potential limitations worth considering:

The data used for training and evaluation may not capture the full diversity of real-world clothing and body types. Testing on broader data could show how well the method generalizes.
Running a parallel UNet alongside the base UNet adds complexity and computation, which could make the system less practical for real-time virtual try-on applications. Exploring more efficient model designs is an area for future research.
The paper does not address potential biases or fairness issues that could arise from the training data or model design. Ensuring equitable virtual try-on experiences for all users is an important consideration.

Overall, this work represents a valuable step forward in making virtual try-on technology more usable and accessible in real-world settings. Further research building on these innovations could lead to even more authentic and personalized virtual experiences.

Conclusion

This paper presents several advancements to diffusion models aimed at more realistic and practical virtual try-on in real-world scenarios. By combining two garment-encoding modules, detailed textual prompts, and a customization method, the researchers report improved garment fidelity and more authentic try-on images than previous approaches.

These innovations have the potential to transform online shopping and fashion design, allowing consumers to get a better sense of how clothing will actually look and fit on their bodies. As virtual try-on becomes more accurate and accessible, it could lead to fewer returns, more satisfied customers, and better-informed product development.

Overall, this research represents an important step forward in bridging the gap between virtual and physical experiences, helping to make online interactions feel more natural and personalized.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Improving Diffusion Models for Authentic Virtual Try-on in the Wild

Yisol Choi, Sangkyung Kwak, Kyungmin Lee, Hyungwon Choi, Jinwoo Shin

This paper considers image-based virtual try-on, which renders an image of a person wearing a curated garment, given a pair of images depicting the person and the garment, respectively. Previous works adapt existing exemplar-based inpainting diffusion models for virtual try-on to improve the naturalness of the generated visuals compared to other methods (e.g., GAN-based), but they fail to preserve the identity of the garments. To overcome this limitation, we propose a novel diffusion model that improves garment fidelity and generates authentic virtual try-on images. Our method, coined IDM-VTON, uses two different modules to encode the semantics of garment image; given the base UNet of the diffusion model, 1) the high-level semantics extracted from a visual encoder are fused to the cross-attention layer, and then 2) the low-level features extracted from parallel UNet are fused to the self-attention layer. In addition, we provide detailed textual prompts for both garment and person images to enhance the authenticity of the generated visuals. Finally, we present a customization method using a pair of person-garment images, which significantly improves fidelity and authenticity. Our experimental results show that our method outperforms previous approaches (both diffusion-based and GAN-based) in preserving garment details and generating authentic virtual try-on images, both qualitatively and quantitatively. Furthermore, the proposed customization method demonstrates its effectiveness in a real-world scenario. More visualizations are available in our project page: https://idm-vton.github.io

7/30/2024

Improving Virtual Try-On with Garment-focused Diffusion Models

Siqi Wan, Yehao Li, Jingwen Chen, Yingwei Pan, Ting Yao, Yang Cao, Tao Mei

Diffusion models have led to the revolutionizing of generative modeling in numerous image synthesis tasks. Nevertheless, it is not trivial to directly apply diffusion models for synthesizing an image of a target person wearing a given in-shop garment, i.e., image-based virtual try-on (VTON) task. The difficulty originates from the aspect that the diffusion process should not only produce holistically high-fidelity photorealistic image of the target person, but also locally preserve every appearance and texture detail of the given garment. To address this, we shape a new Diffusion model, namely GarDiff, which triggers the garment-focused diffusion process with amplified guidance of both basic visual appearance and detailed textures (i.e., high-frequency details) derived from the given garment. GarDiff first remoulds a pre-trained latent diffusion model with additional appearance priors derived from the CLIP and VAE encodings of the reference garment. Meanwhile, a novel garment-focused adapter is integrated into the UNet of diffusion model, pursuing local fine-grained alignment with the visual appearance of reference garment and human pose. We specifically design an appearance loss over the synthesized garment to enhance the crucial, high-frequency details. Extensive experiments on VITON-HD and DressCode datasets demonstrate the superiority of our GarDiff when compared to state-of-the-art VTON approaches. Code is publicly available at: https://github.com/siqi0905/GarDiff/tree/master.

9/14/2024

Time-Efficient and Identity-Consistent Virtual Try-On Using A Variant of Altered Diffusion Models

Phuong Dam, Jihoon Jeong, Anh Tran, Daeyoung Kim

This study discusses the critical issues of Virtual Try-On in contemporary e-commerce and the prospective metaverse, emphasizing the challenges of preserving intricate texture details and distinctive features of the target person and the clothes in various scenarios, such as clothing texture and identity characteristics like tattoos or accessories. In addition to the fidelity of the synthesized images, the efficiency of the synthesis process presents a significant hurdle. Various existing approaches are explored, highlighting the limitations and unresolved aspects, e.g., identity information omission, uncontrollable artifacts, and low synthesis speed. It then proposes a novel diffusion-based solution that addresses garment texture preservation and user identity retention during virtual try-on. The proposed network comprises two primary modules - a warping module aligning clothing with individual features and a try-on module refining the attire and generating missing parts integrated with a mask-aware post-processing technique ensuring the integrity of the individual's identity. It demonstrates impressive results, surpassing the state-of-the-art in speed by nearly 20 times during inference, with superior fidelity in qualitative assessments. Quantitative evaluations confirm comparable performance with the recent SOTA method on the VITON-HD and Dresscode datasets. We named our model Fast and Identity Preservation Virtual TryON (FIP-VITON).

7/18/2024

ViViD: Video Virtual Try-on using Diffusion Models

Zixun Fang, Wei Zhai, Aimin Su, Hongliang Song, Kai Zhu, Mao Wang, Yu Chen, Zhiheng Liu, Yang Cao, Zheng-Jun Zha

Video virtual try-on aims to transfer a clothing item onto the video of a target person. Directly applying the technique of image-based try-on to the video domain in a frame-wise manner will cause temporal-inconsistent outcomes while previous video-based try-on solutions can only generate low visual quality and blurring results. In this work, we present ViViD, a novel framework employing powerful diffusion models to tackle the task of video virtual try-on. Specifically, we design the Garment Encoder to extract fine-grained clothing semantic features, guiding the model to capture garment details and inject them into the target video through the proposed attention feature fusion mechanism. To ensure spatial-temporal consistency, we introduce a lightweight Pose Encoder to encode pose signals, enabling the model to learn the interactions between clothing and human posture and insert hierarchical Temporal Modules into the text-to-image stable diffusion model for more coherent and lifelike video synthesis. Furthermore, we collect a new dataset, which is the largest, with the most diverse types of garments and the highest resolution for the task of video virtual try-on to date. Extensive experiments demonstrate that our approach is able to yield satisfactory video try-on results. The dataset, codes, and weights will be publicly available. Project page: https://becauseimbatman0.github.io/ViViD.

5/29/2024