DisenStudio: Customized Multi-subject Text-to-Video Generation with Disentangled Spatial Control

Read original: arXiv:2405.12796 - Published 5/22/2024 by Hong Chen, Xin Wang, Yipeng Zhang, Yuwei Zhou, Zeyang Zhang, Siao Tang, Wenwu Zhu

Overview

This paper proposes a novel framework called DisenStudio to generate customized text-guided videos with multiple subjects.
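Existing methods struggle when a video must contain several customized subjects: a subject may be left out entirely (subject-missing), one subject's visual attributes may end up on another (attribute-binding), or a requested action may be performed by the wrong subject (action-binding).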
DisenStudio tackles these problems by enhancing a pre-trained diffusion-based text-to-video model with a spatial-disentangled cross-attention mechanism and customizing the model through multi-subject co-occurrence tuning, masked single-subject tuning, and multi-subject motion-preserved tuning.

Plain English Explanation

The paper focuses on a challenge in creating customized videos: generating videos with multiple people or objects (referred to as "subjects") where each subject can be assigned specific actions or attributes. Existing methods struggle with this. They often attach an action or attribute to the wrong subject, or leave some subjects out of the final video altogether.
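
To make the three failure modes concrete, here is a small illustrative sketch. It is not part of DisenStudio or its evaluation: the `SubjectSpec` type, the `diagnose` function, and the example subjects are all hypothetical, and the "observed" values stand in for whatever a human or a detector would report after watching a generated video.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SubjectSpec:
    """What the prompt asks for (or what was observed) for one subject."""
    name: str        # e.g. "dog"
    attribute: str   # e.g. "brown"
    action: str      # e.g. "running"


def diagnose(requested, observed):
    """Label mismatches between requested and observed subjects.

    `observed` maps a subject name to the SubjectSpec seen in the video;
    a subject absent from the mapping was not found at all.
    Returns a list of (subject_name, problem) tuples.
    """
    problems = []
    for spec in requested:
        seen = observed.get(spec.name)
        if seen is None:
            problems.append((spec.name, "subject-missing"))
            continue
        if seen.attribute != spec.attribute:
            problems.append((spec.name, "attribute-binding"))
        if seen.action != spec.action:
            problems.append((spec.name, "action-binding"))
    return problems


requested = [
    SubjectSpec("dog", "brown", "running"),
    SubjectSpec("cat", "white", "sleeping"),
    SubjectSpec("bird", "blue", "flying"),
]
observed = {
    "dog": SubjectSpec("dog", "white", "running"),  # took the cat's colour
    "cat": SubjectSpec("cat", "white", "running"),  # took the dog's action
    # the bird never appears in the video
}
report = diagnose(requested, observed)
print(report)
```

In this toy case the dog shows an attribute-binding error, the cat an action-binding error, and the bird is a subject-missing error.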

To address these issues, the researchers developed a new framework called DisenStudio. DisenStudio builds on a pre-trained diffusion-based text-to-video model and adds a spatial-disentangled cross-attention mechanism, so that each subject in the video is tied to the action requested for it.

The researchers then customize the model further through three tuning strategies:

Multi-subject co-occurrence tuning: This ensures that all the desired subjects appear in the final video.
Masked single-subject tuning: This helps preserve the visual attributes of each individual subject.
Multi-subject motion-preserved tuning: This maintains the model's ability to generate realistic temporal motion, even when fine-tuning on static images.

These customization techniques allow DisenStudio to generate high-quality, multi-subject videos where the actions and attributes of each subject are properly aligned, overcoming the limitations of previous methods.

Technical Explanation

The paper introduces DisenStudio, a framework that generates text-guided videos containing multiple customized subjects, given a few images of each subject. Existing methods primarily target single-subject customized text-to-video generation and run into subject-missing, attribute-binding, and action-binding problems once several subjects are involved.

To tackle these problems, DisenStudio enhances a pre-trained diffusion-based text-to-video model with a spatial-disentangled cross-attention mechanism. This mechanism helps the model associate each subject with the desired actions more effectively.
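
The summary does not spell out how the spatial-disentangled cross-attention is computed, so the following is only a minimal sketch of one plausible reading: each spatial position is assigned to a subject region, and it may attend only to the text tokens that belong to that subject, plus tokens shared by the whole prompt. The function name, the `region` and `token_owner` labels, and the toy numbers are assumptions for illustration, not the paper's formulation.

```python
import math


def masked_cross_attention(queries, keys, values, region, token_owner):
    """Cross-attention where each position sees only its own subject's tokens.

    queries:     one vector per spatial position
    keys/values: one vector per text token
    region:      subject index per position, or None for background (assumed input)
    token_owner: subject index per token, or None for shared tokens (assumed input)
    Returns (outputs, weights), one row per spatial position.
    """
    dim = len(keys[0])
    outputs, weights = [], []
    for query, subject in zip(queries, region):
        allowed = [owner is None or owner == subject for owner in token_owner]
        if not any(allowed):
            raise ValueError("each position needs at least one visible token")
        # Scaled dot-product scores; hidden tokens get -inf so softmax gives 0.
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim) if ok else -math.inf
            for key, ok in zip(keys, allowed)
        ]
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        row = [e / total for e in exps]
        weights.append(row)
        outputs.append(
            [sum(w * v[j] for w, v in zip(row, values)) for j in range(len(values[0]))]
        )
    return outputs, weights


# Prompt tokens: "a", "dog", "running", "cat", "sleeping"
token_owner = [None, 0, 0, 1, 1]   # subject 0 = dog, subject 1 = cat
keys = [[0.1, 0.0], [1.0, 0.0], [0.5, 0.5], [0.0, 1.0], [0.5, -0.5]]
values = keys                      # reuse keys as values to keep the toy small
queries = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0], [0.2, 0.2]]
region = [0, 0, 1, None]           # two dog positions, one cat, one background

outputs, weights = masked_cross_attention(queries, keys, values, region, token_owner)
for row in weights:
    print([round(w, 3) for w in row])
```

Under this reading, the positions in the dog's region give zero weight to "cat" and "sleeping", which is one way an action could be kept from leaking onto the wrong subject.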

The researchers then customize the model for multiple subjects through three tuning strategies:

Multi-subject co-occurrence tuning: This ensures that all the subjects specified in the text prompt appear in the generated video.
Masked single-subject tuning: This helps preserve the visual attributes of each individual subject by training the model on masked single-subject images.
Multi-subject motion-preserved tuning: This maintains the model's ability to generate realistic temporal motion, even when fine-tuning on static images.
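
As a rough illustration of the masked single-subject idea, the sketch below computes a reconstruction loss only over the pixels covered by a subject mask, so errors outside the subject do not contribute to the loss. This is a common way to train on masked images and is an assumption here: the summary does not give DisenStudio's actual loss, and in a diffusion model the compared quantities would typically be predicted and true noise rather than raw pixels. The function name `masked_mse` and the toy arrays are hypothetical.

```python
def masked_mse(prediction, target, mask):
    """Mean squared error restricted to positions where mask is 1."""
    total, count = 0.0, 0
    for pred_row, target_row, mask_row in zip(prediction, target, mask):
        for p, t, m in zip(pred_row, target_row, mask_row):
            if m:
                total += (p - t) ** 2
                count += 1
    if count == 0:
        raise ValueError("mask selects no pixels")
    return total / count


# Toy 2x3 "image": the subject occupies the first two columns.
target = [[1.0, 1.0, 0.0],
          [1.0, 2.0, 0.0]]
prediction = [[1.0, 2.0, 9.0],
              [1.0, 4.0, 9.0]]
mask = [[1, 1, 0],
        [1, 1, 0]]
full_mask = [[1, 1, 1],
             [1, 1, 1]]

subject_loss = masked_mse(prediction, target, mask)
full_loss = masked_mse(prediction, target, full_mask)
print(subject_loss, full_loss)
```

The large background errors in the third column dominate the unmasked loss but are ignored by the masked one, which keeps the tuning signal focused on the subject's appearance.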

The authors report extensive experiments in which DisenStudio significantly outperforms existing methods across various metrics for multi-subject video generation. They also show that DisenStudio can serve as a tool for various controllable generation applications.

Critical Analysis

The paper offers a focused solution to the challenge of generating customized videos with multiple subjects. The proposed DisenStudio framework targets the three problems that limit existing methods: subject-missing, attribute-binding, and action-binding.

However, the paper does not extensively discuss the potential limitations or caveats of the DisenStudio approach. For example, it is unclear how the framework would perform in scenarios with a larger number of subjects or subjects with more complex interactions. Additionally, the paper does not explore the model's robustness to variations in the input text prompts, which could be an important consideration for real-world applications.

Further research could also investigate the computational efficiency and resource requirements of the DisenStudio framework, as well as its scalability to larger and more diverse datasets. Exploring ways to make the model more interpretable and explainable could also be a valuable direction for future work.

Overall, the DisenStudio framework represents a significant advancement in the field of customized video generation, but there are still opportunities for further refinement and exploration to fully unlock its potential.

Conclusion

The paper introduces DisenStudio, a framework that generates text-guided videos with multiple customized subjects. By enhancing a pre-trained diffusion-based text-to-video model with spatial-disentangled cross-attention and applying three specialized tuning strategies, DisenStudio addresses the subject-missing, attribute-binding, and action-binding problems seen in existing methods.

According to the authors' experiments, DisenStudio outperforms other approaches and can be used for various controllable generation applications. The paper does not discuss the framework's limitations in depth, but the proposed techniques are a meaningful step forward in customized video generation and leave clear room for follow-up research.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

DisenStudio: Customized Multi-subject Text-to-Video Generation with Disentangled Spatial Control

Hong Chen, Xin Wang, Yipeng Zhang, Yuwei Zhou, Zeyang Zhang, Siao Tang, Wenwu Zhu

Generating customized content in videos has received increasing attention recently. However, existing works primarily focus on customized text-to-video generation for single subject, suffering from subject-missing and attribute-binding problems when the video is expected to contain multiple subjects. Furthermore, existing models struggle to assign the desired actions to the corresponding subjects (action-binding problem), failing to achieve satisfactory multi-subject generation performance. To tackle the problems, in this paper, we propose DisenStudio, a novel framework that can generate text-guided videos for customized multiple subjects, given few images for each subject. Specifically, DisenStudio enhances a pretrained diffusion-based text-to-video model with our proposed spatial-disentangled cross-attention mechanism to associate each subject with the desired action. Then the model is customized for the multiple subjects with the proposed motion-preserved disentangled finetuning, which involves three tuning strategies: multi-subject co-occurrence tuning, masked single-subject tuning, and multi-subject motion-preserved tuning. The first two strategies guarantee the subject occurrence and preserve their visual attributes, and the third strategy helps the model maintain the temporal motion-generation ability when finetuning on static images. We conduct extensive experiments to demonstrate our proposed DisenStudio significantly outperforms existing methods in various metrics. Additionally, we show that DisenStudio can be used as a powerful tool for various controllable generation applications.

5/22/2024

CustomVideo: Customizing Text-to-Video Generation with Multiple Subjects

Zhao Wang, Aoxue Li, Lingting Zhu, Yong Guo, Qi Dou, Zhenguo Li

Customized text-to-video generation aims to generate high-quality videos guided by text prompts and subject references. Current approaches for personalizing text-to-video generation suffer from tackling multiple subjects, which is a more challenging and practical scenario. In this work, our aim is to promote multi-subject guided text-to-video customization. We propose CustomVideo, a novel framework that can generate identity-preserving videos with the guidance of multiple subjects. To be specific, firstly, we encourage the co-occurrence of multiple subjects via composing them in a single image. Further, upon a basic text-to-video diffusion model, we design a simple yet effective attention control strategy to disentangle different subjects in the latent space of diffusion model. Moreover, to help the model focus on the specific area of the object, we segment the object from given reference images and provide a corresponding object mask for attention learning. Also, we collect a multi-subject text-to-video generation dataset as a comprehensive benchmark, with 63 individual subjects from 13 different categories and 68 meaningful pairs. Extensive qualitative, quantitative, and user study results demonstrate the superiority of our method compared to previous state-of-the-art approaches. The project page is https://kyfafyd.wang/projects/customvideo.

5/24/2024

AutoStudio: Crafting Consistent Subjects in Multi-turn Interactive Image Generation

Junhao Cheng, Xi Lu, Hanhui Li, Khun Loun Zai, Baiqiao Yin, Yuhao Cheng, Yiqiang Yan, Xiaodan Liang

As cutting-edge Text-to-Image (T2I) generation models already excel at producing remarkable single images, an even more challenging task, i.e., multi-turn interactive image generation begins to attract the attention of related research communities. This task requires models to interact with users over multiple turns to generate a coherent sequence of images. However, since users may switch subjects frequently, current efforts struggle to maintain subject consistency while generating diverse images. To address this issue, we introduce a training-free multi-agent framework called AutoStudio. AutoStudio employs three agents based on large language models (LLMs) to handle interactions, along with a stable diffusion (SD) based agent for generating high-quality images. Specifically, AutoStudio consists of (i) a subject manager to interpret interaction dialogues and manage the context of each subject, (ii) a layout generator to generate fine-grained bounding boxes to control subject locations, (iii) a supervisor to provide suggestions for layout refinements, and (iv) a drawer to complete image generation. Furthermore, we introduce a Parallel-UNet to replace the original UNet in the drawer, which employs two parallel cross-attention modules for exploiting subject-aware features. We also introduce a subject-initialized generation method to better preserve small subjects. Our AutoStudio hereby can generate a sequence of multi-subject images interactively and consistently. Extensive experiments on the public CMIGBench benchmark and human evaluations show that AutoStudio maintains multi-subject consistency across multiple turns well, and it also raises the state-of-the-art performance by 13.65% in average Frechet Inception Distance and 2.83% in average character-character similarity.

6/12/2024

AttenCraft: Attention-guided Disentanglement of Multiple Concepts for Text-to-Image Customization

Junjie Shentu, Matthew Watson, Noura Al Moubayed

With the unprecedented performance being achieved by text-to-image (T2I) diffusion models, T2I customization further empowers users to tailor the diffusion model to new concepts absent in the pre-training dataset, termed subject-driven generation. Moreover, extracting several new concepts from a single image enables the model to learn multiple concepts, and simultaneously decreases the difficulties of training data preparation, urging the disentanglement of multiple concepts to be a new challenge. However, existing models for disentanglement commonly require pre-determined masks or retain background elements. To this end, we propose an attention-guided method, AttenCraft, for multiple concept disentanglement. In particular, our method leverages self-attention and cross-attention maps to create accurate masks for each concept within a single initialization step, omitting any required mask preparation by humans or other models. The created masks are then applied to guide the cross-attention activation of each target concept during training and achieve concept disentanglement. Additionally, we introduce Uniform sampling and Reweighted sampling schemes to alleviate the non-synchronicity of feature acquisition from different concepts, and improve generation quality. Our method outperforms baseline models in terms of image-alignment, and behaves comparably on text-alignment. Finally, we showcase the applicability of AttenCraft to more complicated settings, such as an input image containing three concepts. The project is available at https://github.com/junjie-shentu/AttenCraft.

5/29/2024