CustomVideo: Customizing Text-to-Video Generation with Multiple Subjects

Read original: arXiv:2401.09962 - Published 5/24/2024 by Zhao Wang, Aoxue Li, Lingting Zhu, Yong Guo, Qi Dou, Zhenguo Li

Overview

The paper presents CustomVideo, a framework for customized text-to-video generation that preserves the identity of multiple reference subjects in a single video.
It builds on a basic text-to-video diffusion model and adds an attention control strategy, guided by object masks, to disentangle the subjects.
The authors also collect a multi-subject benchmark and report qualitative, quantitative, and user-study results that favor CustomVideo over previous approaches.

Plain English Explanation

The researchers have developed a new AI system that can create videos based on text descriptions, with the ability to customize the content in the videos. This is an advancement over previous text-to-video generation models, which were limited in their ability to control the specific elements that appear in the generated videos.

CustomVideo takes a text prompt describing the desired video together with reference images of the subjects that should appear in it. For example, a user could supply photos of a particular cat and a particular dog and ask for a video of the two playing together, and the model aims to keep each animal recognizable.

This is achieved through techniques that keep the subjects separate inside the model, so that it can tell them apart while generating the video. The researchers tested their model on a benchmark they collected and report that it generates high-quality videos that preserve each subject's identity.

This work is part of a broader trend in the field of text-to-video generation and text-to-image generation, where AI systems are becoming increasingly capable of translating language into visual media. The CustomVideo model represents a significant step forward in this direction, allowing for more personalized and customized video content.

Technical Explanation

CustomVideo is built on top of a basic text-to-video diffusion model and extends it to customization with multiple subjects. It sits alongside related work such as DisenStudio and Direct Video.

The key idea is to disentangle the different subjects in the latent space of the diffusion model so that each one keeps its own identity in the generated video. Three components contribute to this. First, the subjects are composed in a single image to encourage their co-occurrence. Second, an attention control strategy separates the subjects from one another. Third, each object is segmented from its reference images, and the resulting object mask guides attention learning so that the model focuses on the region the object actually occupies.
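
The paper's abstract describes attention learning guided by object masks. The following is a minimal sketch of what such an objective could look like, not the authors' implementation: it computes cross-attention between spatial locations and text tokens, then penalizes the gap between one token's attention map and a binary mask. The function names, the mean-squared-error loss, and the toy "cat" and "dog" tokens are all assumptions made for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys):
    """Return attn[location][token], normalized over tokens.

    queries: one feature vector per spatial location (from the video latent).
    keys: one feature vector per text token (from the prompt).
    """
    scale = math.sqrt(len(keys[0]))
    return [
        softmax([sum(q * k for q, k in zip(query, key)) / scale for key in keys])
        for query in queries
    ]


def mask_alignment_loss(attn, token_index, mask):
    """Mean squared error between one token's attention map and a binary mask.

    This loss form is an illustrative assumption, not taken from the paper.
    """
    token_map = [row[token_index] for row in attn]
    return sum((a - m) ** 2 for a, m in zip(token_map, mask)) / len(mask)


# Toy example: 4 spatial locations and 2 subject tokens (hypothetical).
queries = [[4.0, 0.0], [4.0, 0.0], [0.0, 4.0], [0.0, 4.0]]
keys = [[1.0, 0.0], [0.0, 1.0]]  # token 0 = "cat", token 1 = "dog"
cat_mask = [1, 1, 0, 0]          # the cat occupies the first two locations
dog_mask = [0, 0, 1, 1]          # the dog occupies the last two

attn = cross_attention(queries, keys)
loss_matched = mask_alignment_loss(attn, 0, cat_mask)
loss_swapped = mask_alignment_loss(attn, 0, dog_mask)
print(loss_matched, loss_swapped)
```

In this toy setup the "cat" token attends mostly to the first two locations, so its loss against the cat mask is much smaller than against the dog mask. Minimizing a loss of this kind during fine-tuning would push each subject token to attend to its own region.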

Subjects are specified through reference images, from which each object is segmented to obtain its mask. During inference, the user provides a text prompt describing the desired video, and the model generates a clip in which the referenced subjects appear together with their identities preserved.
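
One step named in the paper's abstract is composing multiple subjects in a single image to encourage co-occurrence. The sketch below shows one simple way this could be done, assuming a side-by-side layout; the layout, the function name `compose_side_by_side`, and the list-of-lists image representation are illustrative assumptions, not details from the paper. Each subject's mask is shifted into the coordinates of the composed image so it still marks the right region.

```python
def compose_side_by_side(images, masks):
    """Concatenate subject images horizontally and realign their masks.

    images: list of pixel grids (lists of rows) sharing the same height.
    masks: list of binary grids, each the same shape as its image.
    Returns (composed_image, composed_masks), where every composed mask
    has the full width of the composed image.
    """
    height = len(images[0])
    if any(len(img) != height for img in images):
        raise ValueError("all images must share the same height")

    widths = [len(img[0]) for img in images]

    # Build each composed row by appending the rows of all subject images.
    composed = []
    for r in range(height):
        row = []
        for img in images:
            row.extend(img[r])
        composed.append(row)

    # Pad each mask with zeros where the other subjects sit.
    composed_masks = []
    for i, mask in enumerate(masks):
        left = sum(widths[:i])
        right = sum(widths[i + 1:])
        composed_masks.append(
            [[0] * left + list(mask[r]) + [0] * right for r in range(height)]
        )
    return composed, composed_masks


# Toy 2-row "images": pixel values 1 for the cat image, 2 for the dog image.
cat = [[1, 1], [1, 1]]
dog = [[2, 2, 2], [2, 2, 2]]
cat_mask = [[1, 0], [1, 1]]
dog_mask = [[0, 1, 1], [1, 1, 0]]

composed, (cat_full, dog_full) = compose_side_by_side([cat, dog], [cat_mask, dog_mask])
print(composed)
print(cat_full)
print(dog_full)
```

Because each mask is zero-padded outside its own subject's columns, the composed masks never overlap, which keeps the per-subject regions cleanly separated for any mask-guided objective.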

For evaluation, the authors collected a multi-subject text-to-video generation benchmark with 63 individual subjects from 13 categories and 68 meaningful subject pairs. Qualitative comparisons, quantitative metrics, and a user study all show CustomVideo outperforming previous state-of-the-art approaches at multi-subject customization.
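
This summary does not name the quantitative metrics. A common family of metrics for customized generation compares embeddings of generated frames with embeddings of the reference images, as in the DINO scores mentioned in a related paper's abstract further down. The sketch below illustrates that general idea only. The embeddings are placeholder vectors standing in for the output of an image encoder, and `subject_fidelity` is a hypothetical name, not a metric defined in the paper.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)


def subject_fidelity(frame_embeddings, reference_embeddings):
    """Average similarity over every (frame, reference image) pair.

    Higher values mean the generated frames look more like the references
    in the embedding space of whichever encoder produced the vectors.
    """
    sims = [
        cosine_similarity(frame, ref)
        for frame in frame_embeddings
        for ref in reference_embeddings
    ]
    return sum(sims) / len(sims)


# Placeholder embeddings; in practice these would come from an image encoder.
reference = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0]]
faithful_video = [[1.0, 0.05, 0.0], [0.95, 0.0, 0.05]]
unrelated_video = [[0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

score_faithful = subject_fidelity(faithful_video, reference)
score_unrelated = subject_fidelity(unrelated_video, reference)
print(score_faithful, score_unrelated)
```

With multiple subjects, a score like this would be computed per subject against that subject's own reference images, so a video that renders one subject well but drops the other is not rewarded.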

Critical Analysis

The CustomVideo model presents an impressive step forward in text-to-video generation, but there are a few potential limitations and areas for further research:

Generalization: While the model demonstrates strong performance on the evaluated datasets, it's unclear how well it would generalize to more diverse or complex video content beyond the training distributions. Further testing on a wider range of video types and scenarios would be beneficial.
Computational Efficiency: The model's ability to fine-tune individual visual elements may come at the cost of increased computational complexity and inference time. Exploring ways to improve the efficiency of the customization process would be valuable.
Ethical Considerations: As with any powerful generative model, there are potential concerns around the misuse of CustomVideo for the creation of deceptive or harmful content. The researchers should consider addressing these ethical implications in their future work.
User Experience: While the model provides a high degree of customization, the user interface and interaction design aspects were not the primary focus of this research. Exploring ways to make the customization process more intuitive and accessible for end-users would be an important next step.

Overall, the CustomVideo model represents a significant advancement in the field of text-to-video generation, and the researchers should be commended for their innovative approach. However, continued research and development will be necessary to address the potential limitations and ensure the responsible deployment of this technology.

Conclusion

The CustomVideo model introduced in this paper represents a notable advancement in the field of text-to-video generation. By leveraging techniques to disentangle and independently control the various visual elements in the generated videos, the model enables a new level of customization and personalization that was not possible with previous approaches.

The researchers' thorough evaluation and demonstration of the model's capabilities suggest that CustomVideo could have significant implications for a wide range of applications, from entertainment and media production to educational and training materials. As the field of text-to-video generation and multi-subject personalization continues to evolve, the innovations presented in this paper are likely to inspire further research and development towards even more advanced and versatile text-to-video systems.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

CustomVideo: Customizing Text-to-Video Generation with Multiple Subjects

Zhao Wang, Aoxue Li, Lingting Zhu, Yong Guo, Qi Dou, Zhenguo Li

Customized text-to-video generation aims to generate high-quality videos guided by text prompts and subject references. Current approaches for personalizing text-to-video generation suffer from tackling multiple subjects, which is a more challenging and practical scenario. In this work, our aim is to promote multi-subject guided text-to-video customization. We propose CustomVideo, a novel framework that can generate identity-preserving videos with the guidance of multiple subjects. To be specific, firstly, we encourage the co-occurrence of multiple subjects via composing them in a single image. Further, upon a basic text-to-video diffusion model, we design a simple yet effective attention control strategy to disentangle different subjects in the latent space of diffusion model. Moreover, to help the model focus on the specific area of the object, we segment the object from given reference images and provide a corresponding object mask for attention learning. Also, we collect a multi-subject text-to-video generation dataset as a comprehensive benchmark, with 63 individual subjects from 13 different categories and 68 meaningful pairs. Extensive qualitative, quantitative, and user study results demonstrate the superiority of our method compared to previous state-of-the-art approaches. The project page is https://kyfafyd.wang/projects/customvideo.

5/24/2024

DisenStudio: Customized Multi-subject Text-to-Video Generation with Disentangled Spatial Control

Hong Chen, Xin Wang, Yipeng Zhang, Yuwei Zhou, Zeyang Zhang, Siao Tang, Wenwu Zhu

Generating customized content in videos has received increasing attention recently. However, existing works primarily focus on customized text-to-video generation for single subject, suffering from subject-missing and attribute-binding problems when the video is expected to contain multiple subjects. Furthermore, existing models struggle to assign the desired actions to the corresponding subjects (action-binding problem), failing to achieve satisfactory multi-subject generation performance. To tackle the problems, in this paper, we propose DisenStudio, a novel framework that can generate text-guided videos for customized multiple subjects, given few images for each subject. Specifically, DisenStudio enhances a pretrained diffusion-based text-to-video model with our proposed spatial-disentangled cross-attention mechanism to associate each subject with the desired action. Then the model is customized for the multiple subjects with the proposed motion-preserved disentangled finetuning, which involves three tuning strategies: multi-subject co-occurrence tuning, masked single-subject tuning, and multi-subject motion-preserved tuning. The first two strategies guarantee the subject occurrence and preserve their visual attributes, and the third strategy helps the model maintain the temporal motion-generation ability when finetuning on static images. We conduct extensive experiments to demonstrate our proposed DisenStudio significantly outperforms existing methods in various metrics. Additionally, we show that DisenStudio can be used as a powerful tool for various controllable generation applications.

5/22/2024

Text Prompting for Multi-Concept Video Customization by Autoregressive Generation

Divya Kothandaraman, Kihyuk Sohn, Ruben Villegas, Paul Voigtlaender, Dinesh Manocha, Mohammad Babaeizadeh

We present a method for multi-concept customization of pretrained text-to-video (T2V) models. Intuitively, the multi-concept customized video can be derived from the (non-linear) intersection of the video manifolds of the individual concepts, which is not straightforward to find. We hypothesize that sequential and controlled walking towards the intersection of the video manifolds, directed by text prompting, leads to the solution. To do so, we generate the various concepts and their corresponding interactions, sequentially, in an autoregressive manner. Our method can generate videos of multiple custom concepts (subjects, action and background) such as a teddy bear running towards a brown teapot, a dog playing violin and a teddy bear swimming in the ocean. We quantitatively evaluate our method using videoCLIP and DINO scores, in addition to human evaluation. Videos for results presented in this paper can be found at https://github.com/divyakraman/MultiConceptVideo2024.

5/24/2024

Customize-A-Video: One-Shot Motion Customization of Text-to-Video Diffusion Models

Yixuan Ren, Yang Zhou, Jimei Yang, Jing Shi, Difan Liu, Feng Liu, Mingi Kwon, Abhinav Shrivastava

Image customization has been extensively studied in text-to-image (T2I) diffusion models, leading to impressive outcomes and applications. With the emergence of text-to-video (T2V) diffusion models, its temporal counterpart, motion customization, has not yet been well investigated. To address the challenge of one-shot video motion customization, we propose Customize-A-Video that models the motion from a single reference video and adapts it to new subjects and scenes with both spatial and temporal varieties. It leverages low-rank adaptation (LoRA) on temporal attention layers to tailor the pre-trained T2V diffusion model for specific motion modeling. To disentangle the spatial and temporal information during training, we introduce a novel concept of appearance absorbers that detach the original appearance from the reference video prior to motion learning. The proposed modules are trained in a staged pipeline and inferred in a plug-and-play fashion, enabling easy extensions to various downstream tasks such as custom video generation and editing, video appearance customization and multiple motion combination. Our project page can be found at https://customize-a-video.github.io.

8/29/2024