VITON-DiT: Learning In-the-Wild Video Try-On from Human Dance Videos via Diffusion Transformers

2405.18326

Published 6/10/2024 by Jun Zheng, Fuwei Zhao, Youjiang Xu, Xin Dong, Xiaodan Liang

VITON-DiT: Learning In-the-Wild Video Try-On from Human Dance Videos via Diffusion Transformers

Abstract

Video try-on stands as a promising area for its tremendous real-world potential. Prior works are limited to transferring product clothing images onto person videos with simple poses and backgrounds, while underperforming on casually captured videos. Recently, Sora revealed the scalability of Diffusion Transformer (DiT) in generating lifelike videos featuring real-world scenarios. Inspired by this, we explore and propose the first DiT-based video try-on framework for practical in-the-wild applications, named VITON-DiT. Specifically, VITON-DiT consists of a garment extractor, a Spatial-Temporal denoising DiT, and an identity preservation ControlNet. To faithfully recover the clothing details, the extracted garment features are fused with the self-attention outputs of the denoising DiT and the ControlNet. We also introduce novel random selection strategies during training and an Interpolated Auto-Regressive (IAR) technique at inference to facilitate long video generation. Unlike existing attempts that require the laborious and restrictive construction of a paired training dataset, severely limiting their scalability, VITON-DiT alleviates this by relying solely on unpaired human dance videos and a carefully designed multi-stage training strategy. Furthermore, we curate a challenging benchmark dataset to evaluate the performance of casual video try-on. Extensive experiments demonstrate the superiority of VITON-DiT in generating spatio-temporal consistent try-on results for in-the-wild videos with complicated human poses.

Create account to get full access

Overview

This paper introduces VITON-DiT, a novel framework for learning in-the-wild video try-on from human dance videos using diffusion transformers.
The key innovations include a two-stage diffusion model that generates high-quality video try-on results, and a self-supervised pre-training scheme to learn effective image and motion representations from in-the-wild dance videos.
The proposed method outperforms state-of-the-art video try-on approaches and can generate vivid, realistic video try-on results from unconstrained human dance videos.

Plain English Explanation

The paper presents a new system called VITON-DiT that can take a video of a person dancing and virtually "try on" different clothing items on them. This is a challenging task, as the system needs to understand how the person is moving and then realistically apply the new clothing to their body in a way that looks natural and matches the movement.

The key innovations in VITON-DiT are a two-stage diffusion model that generates high-quality video try-on results, and a self-supervised pre-training scheme that helps the system learn effective representations of both the image data and the motion of the person in the video. This allows VITON-DiT to work well on "in-the-wild" dance videos, rather than just highly-controlled settings.

Compared to previous virtual try-on and video generation approaches, VITON-DiT is able to produce more realistic and vivid video try-on results from real-world human dance videos. This could have applications in areas like online clothing shopping, movie/TV special effects, and interactive virtual avatars.

Technical Explanation

The key technical innovations in VITON-DiT include:

Two-stage Diffusion Model: The system uses a two-stage diffusion model, where the first stage generates a coarse video try-on result, and the second stage refines this result to produce the final high-quality video.
Self-supervised Pre-training: The authors utilize a self-supervised pre-training scheme to learn effective representations of both the image data and the motion of the person in the input dance videos. This allows the model to generalize well to "in-the-wild" dance videos, rather than just highly-controlled settings.
Motion-Aware Design: The system is designed to be aware of the person's motion in the input video, which is crucial for generating realistic video try-on results that match the movement of the person.

The paper presents extensive experiments evaluating VITON-DiT on various benchmarks, including comparisons to state-of-the-art video try-on and video generation methods. The results demonstrate that VITON-DiT outperforms previous approaches and can generate vivid, realistic video try-on results from unconstrained human dance videos.

Critical Analysis

The paper provides a comprehensive and technical explanation of the VITON-DiT framework, and the results showcase its ability to generate high-quality video try-on results from in-the-wild dance videos. However, the authors do acknowledge some limitations:

Clothing Diversity: The current system is limited in the range of clothing styles and types it can handle, and may struggle with more complex or unusual garments.
Generalization to Other Domains: While the self-supervised pre-training helps VITON-DiT generalize to in-the-wild dance videos, it's unclear how well the system would perform on other types of human motion, such as sports or everyday activities.
Real-time Performance: The current implementation of VITON-DiT is not optimized for real-time performance, which may limit its practical applications in interactive settings.

Future research could explore ways to address these limitations, such as expanding the clothing repertoire, exploring transfer learning to other motion domains, and optimizing the system for faster inference times.

Conclusion

The VITON-DiT framework represents a significant advancement in the field of video try-on, demonstrating the ability to generate realistic and vivid results from unconstrained human dance videos. The key innovations, including the two-stage diffusion model and the self-supervised pre-training scheme, allow the system to effectively capture both the visual and motion aspects of the input videos.

The potential applications of this technology are broad, ranging from online clothing shopping and movie/TV special effects to interactive virtual avatars. As the field of video generation continues to evolve, further research building on the insights and techniques presented in this paper could lead to even more realistic and versatile video try-on systems.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

👀

ViViD: Video Virtual Try-on using Diffusion Models

Zixun Fang, Wei Zhai, Aimin Su, Hongliang Song, Kai Zhu, Mao Wang, Yu Chen, Zhiheng Liu, Yang Cao, Zheng-Jun Zha

Video virtual try-on aims to transfer a clothing item onto the video of a target person. Directly applying the technique of image-based try-on to the video domain in a frame-wise manner will cause temporal-inconsistent outcomes while previous video-based try-on solutions can only generate low visual quality and blurring results. In this work, we present ViViD, a novel framework employing powerful diffusion models to tackle the task of video virtual try-on. Specifically, we design the Garment Encoder to extract fine-grained clothing semantic features, guiding the model to capture garment details and inject them into the target video through the proposed attention feature fusion mechanism. To ensure spatial-temporal consistency, we introduce a lightweight Pose Encoder to encode pose signals, enabling the model to learn the interactions between clothing and human posture and insert hierarchical Temporal Modules into the text-to-image stable diffusion model for more coherent and lifelike video synthesis. Furthermore, we collect a new dataset, which is the largest, with the most diverse types of garments and the highest resolution for the task of video virtual try-on to date. Extensive experiments demonstrate that our approach is able to yield satisfactory video try-on results. The dataset, codes, and weights will be publicly available. Project page: https://becauseimbatman0.github.io/ViViD.

5/29/2024

cs.CV

🏅

MV-VTON: Multi-View Virtual Try-On with Diffusion Models

Haoyu Wang, Zhilu Zhang, Donglin Di, Shiliang Zhang, Wangmeng Zuo

The goal of image-based virtual try-on is to generate an image of the target person naturally wearing the given clothing. However, most existing methods solely focus on the frontal try-on using the frontal clothing. When the views of the clothing and person are significantly inconsistent, particularly when the person's view is non-frontal, the results are unsatisfactory. To address this challenge, we introduce Multi-View Virtual Try-ON (MV-VTON), which aims to reconstruct the dressing results of a person from multiple views using the given clothes. On the one hand, given that single-view clothes provide insufficient information for MV-VTON, we instead employ two images, i.e., the frontal and back views of the clothing, to encompass the complete view as much as possible. On the other hand, the diffusion models that have demonstrated superior abilities are adopted to perform our MV-VTON. In particular, we propose a view-adaptive selection method where hard-selection and soft-selection are applied to the global and local clothing feature extraction, respectively. This ensures that the clothing features are roughly fit to the person's view. Subsequently, we suggest a joint attention block to align and fuse clothing features with person features. Additionally, we collect a MV-VTON dataset, i.e., Multi-View Garment (MVG), in which each person has multiple photos with diverse views and poses. Experiments show that the proposed method not only achieves state-of-the-art results on MV-VTON task using our MVG dataset, but also has superiority on frontal-view virtual try-on task using VITON-HD and DressCode datasets. Codes and datasets will be publicly released at https://github.com/hywang2002/MV-VTON .

4/30/2024

cs.CV

Self-Supervised Vision Transformer for Enhanced Virtual Clothes Try-On

Lingxiao Lu, Shengyi Wu, Haoxuan Sun, Junhong Gou, Jianlou Si, Chen Qian, Jianfu Zhang, Liqing Zhang

Virtual clothes try-on has emerged as a vital feature in online shopping, offering consumers a critical tool to visualize how clothing fits. In our research, we introduce an innovative approach for virtual clothes try-on, utilizing a self-supervised Vision Transformer (ViT) coupled with a diffusion model. Our method emphasizes detail enhancement by contrasting local clothing image embeddings, generated by ViT, with their global counterparts. Techniques such as conditional guidance and focus on key regions have been integrated into our approach. These combined strategies empower the diffusion model to reproduce clothing details with increased clarity and realism. The experimental results showcase substantial advancements in the realism and precision of details in virtual try-on experiences, significantly surpassing the capabilities of existing technologies.

6/18/2024

cs.CV

VTON-IT: Virtual Try-On using Image Translation

Santosh Adhikari, Bishnu Bhusal, Prashant Ghimire, Anil Shrestha

Virtual Try-On (trying clothes virtually) is a promising application of the Generative Adversarial Network (GAN). However, it is an arduous task to transfer the desired clothing item onto the corresponding regions of a human body because of varying body size, pose, and occlusions like hair and overlapped clothes. In this paper, we try to produce photo-realistic translated images through semantic segmentation and a generative adversarial architecture-based image translation network. We present a novel image-based Virtual Try-On application VTON-IT that takes an RGB image, segments desired body part, and overlays target cloth over the segmented body region. Most state-of-the-art GAN-based Virtual Try-On applications produce unaligned pixelated synthesis images on real-life test images. However, our approach generates high-resolution natural images with detailed textures on such variant images.

5/8/2024

cs.CV cs.LG