ViViD: Video Virtual Try-on using Diffusion Models

2405.11794

Published 5/29/2024 by Zixun Fang, Wei Zhai, Aimin Su, Hongliang Song, Kai Zhu, Mao Wang, Yu Chen, Zhiheng Liu, Yang Cao, Zheng-Jun Zha

cs.CV

👀

Abstract

Video virtual try-on aims to transfer a clothing item onto the video of a target person. Directly applying the technique of image-based try-on to the video domain in a frame-wise manner will cause temporal-inconsistent outcomes while previous video-based try-on solutions can only generate low visual quality and blurring results. In this work, we present ViViD, a novel framework employing powerful diffusion models to tackle the task of video virtual try-on. Specifically, we design the Garment Encoder to extract fine-grained clothing semantic features, guiding the model to capture garment details and inject them into the target video through the proposed attention feature fusion mechanism. To ensure spatial-temporal consistency, we introduce a lightweight Pose Encoder to encode pose signals, enabling the model to learn the interactions between clothing and human posture and insert hierarchical Temporal Modules into the text-to-image stable diffusion model for more coherent and lifelike video synthesis. Furthermore, we collect a new dataset, which is the largest, with the most diverse types of garments and the highest resolution for the task of video virtual try-on to date. Extensive experiments demonstrate that our approach is able to yield satisfactory video try-on results. The dataset, codes, and weights will be publicly available. Project page: https://becauseimbatman0.github.io/ViViD.

Create account to get full access

Overview

Formatting instructions for submitting papers to the NeurIPS 2023 conference
Covers style, length, and other technical requirements for the paper submission

Plain English Explanation

The paper provides the formatting guidelines for submitting research papers to the NeurIPS 2023 conference. NeurIPS (Neural Information Processing Systems) is a prestigious machine learning conference, and these instructions outline the specific requirements authors must follow when preparing their submissions.

The key points covered include:

Style: The paper must be formatted in the NeurIPS style, including font size, margins, and layout.
Length: There is a strict page limit for submissions, typically around 8-10 pages.
Figures and Tables: Guidelines on including visual elements and formatting them correctly.
References: Instructions on citing sources and formatting the reference list.

Following these guidelines ensures the paper is properly formatted and can be easily reviewed by the conference committee. This helps streamline the submission and review process for NeurIPS.

Technical Explanation

The paper outlines the formatting instructions for submitting research papers to the NeurIPS 2023 conference. It covers the specific requirements for the paper's style, including font, margins, and layout. There are also guidelines for the length of the paper, typically around 8-10 pages.

The instructions also provide details on including figures and tables in the paper, as well as how to properly format the references. Following these guidelines ensures the paper is presented in a consistent and easily reviewable format.

Critical Analysis

The formatting instructions are comprehensive and provide clear guidance to authors on the technical requirements for submitting a paper to NeurIPS 2023. The guidelines ensure a consistent presentation of research across all submissions, which helps the review process run smoothly.

One potential limitation is the strict page limit, which may constrain authors in fully explaining their work. However, this is a common requirement for many top-tier conferences to manage the volume of submissions.

Overall, the formatting instructions serve an important role in maintaining the high standards and professionalism of the NeurIPS conference. Authors must carefully follow these guidelines to give their research the best chance of acceptance.

Conclusion

The formatting instructions for NeurIPS 2023 lay out the technical requirements for submitting research papers to the conference. These guidelines cover the style, length, and other formatting details that authors must adhere to.

Following these instructions ensures a consistent presentation of research across all submissions, which streamlines the review process for the conference organizers. While the guidelines may pose some constraints, they ultimately help maintain the high quality and professional standards that NeurIPS is known for.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

🏅

MV-VTON: Multi-View Virtual Try-On with Diffusion Models

Haoyu Wang, Zhilu Zhang, Donglin Di, Shiliang Zhang, Wangmeng Zuo

The goal of image-based virtual try-on is to generate an image of the target person naturally wearing the given clothing. However, most existing methods solely focus on the frontal try-on using the frontal clothing. When the views of the clothing and person are significantly inconsistent, particularly when the person's view is non-frontal, the results are unsatisfactory. To address this challenge, we introduce Multi-View Virtual Try-ON (MV-VTON), which aims to reconstruct the dressing results of a person from multiple views using the given clothes. On the one hand, given that single-view clothes provide insufficient information for MV-VTON, we instead employ two images, i.e., the frontal and back views of the clothing, to encompass the complete view as much as possible. On the other hand, the diffusion models that have demonstrated superior abilities are adopted to perform our MV-VTON. In particular, we propose a view-adaptive selection method where hard-selection and soft-selection are applied to the global and local clothing feature extraction, respectively. This ensures that the clothing features are roughly fit to the person's view. Subsequently, we suggest a joint attention block to align and fuse clothing features with person features. Additionally, we collect a MV-VTON dataset, i.e., Multi-View Garment (MVG), in which each person has multiple photos with diverse views and poses. Experiments show that the proposed method not only achieves state-of-the-art results on MV-VTON task using our MVG dataset, but also has superiority on frontal-view virtual try-on task using VITON-HD and DressCode datasets. Codes and datasets will be publicly released at https://github.com/hywang2002/MV-VTON .

4/30/2024

cs.CV

Self-Supervised Vision Transformer for Enhanced Virtual Clothes Try-On

Lingxiao Lu, Shengyi Wu, Haoxuan Sun, Junhong Gou, Jianlou Si, Chen Qian, Jianfu Zhang, Liqing Zhang

Virtual clothes try-on has emerged as a vital feature in online shopping, offering consumers a critical tool to visualize how clothing fits. In our research, we introduce an innovative approach for virtual clothes try-on, utilizing a self-supervised Vision Transformer (ViT) coupled with a diffusion model. Our method emphasizes detail enhancement by contrasting local clothing image embeddings, generated by ViT, with their global counterparts. Techniques such as conditional guidance and focus on key regions have been integrated into our approach. These combined strategies empower the diffusion model to reproduce clothing details with increased clarity and realism. The experimental results showcase substantial advancements in the realism and precision of details in virtual try-on experiences, significantly surpassing the capabilities of existing technologies.

6/18/2024

cs.CV

VITON-DiT: Learning In-the-Wild Video Try-On from Human Dance Videos via Diffusion Transformers

Jun Zheng, Fuwei Zhao, Youjiang Xu, Xin Dong, Xiaodan Liang

Video try-on stands as a promising area for its tremendous real-world potential. Prior works are limited to transferring product clothing images onto person videos with simple poses and backgrounds, while underperforming on casually captured videos. Recently, Sora revealed the scalability of Diffusion Transformer (DiT) in generating lifelike videos featuring real-world scenarios. Inspired by this, we explore and propose the first DiT-based video try-on framework for practical in-the-wild applications, named VITON-DiT. Specifically, VITON-DiT consists of a garment extractor, a Spatial-Temporal denoising DiT, and an identity preservation ControlNet. To faithfully recover the clothing details, the extracted garment features are fused with the self-attention outputs of the denoising DiT and the ControlNet. We also introduce novel random selection strategies during training and an Interpolated Auto-Regressive (IAR) technique at inference to facilitate long video generation. Unlike existing attempts that require the laborious and restrictive construction of a paired training dataset, severely limiting their scalability, VITON-DiT alleviates this by relying solely on unpaired human dance videos and a carefully designed multi-stage training strategy. Furthermore, we curate a challenging benchmark dataset to evaluate the performance of casual video try-on. Extensive experiments demonstrate the superiority of VITON-DiT in generating spatio-temporal consistent try-on results for in-the-wild videos with complicated human poses.

6/10/2024

cs.CV

📉

Image-Based Virtual Try-On: A Survey

Dan Song, Xuanpu Zhang, Juan Zhou, Weizhi Nie, Ruofeng Tong, Mohan Kankanhalli, An-An Liu

Image-based virtual try-on aims to synthesize a naturally dressed person image with a clothing image, which revolutionizes online shopping and inspires related topics within image generation, showing both research significance and commercial potential. However, there is a gap between current research progress and commercial applications and an absence of comprehensive overview of this field to accelerate the development. In this survey, we provide a comprehensive analysis of the state-of-the-art techniques and methodologies in aspects of pipeline architecture, person representation and key modules such as try-on indication, clothing warping and try-on stage. We propose a new semantic criteria with CLIP, and evaluate representative methods with uniformly implemented evaluation metrics on the same dataset. In addition to quantitative and qualitative evaluation of current open-source methods, unresolved issues are highlighted and future research directions are prospected to identify key trends and inspire further exploration. The uniformly implemented evaluation metrics, dataset and collected methods will be made public available at https://github.com/little-misfit/Survey-Of-Virtual-Try-On.

5/2/2024

cs.CV