AutoStudio: Crafting Consistent Subjects in Multi-turn Interactive Image Generation

Read original: arXiv:2406.01388 - Published 6/12/2024 by Junhao Cheng, Xi Lu, Hanhui Li, Khun Loun Zai, Baiqiao Yin, Yuhao Cheng, Yiqiang Yan, Xiaodan Liang

Overview

Cutting-edge text-to-image (T2I) models can already produce remarkable single images, but a more challenging task is multi-turn interactive image generation.
This task requires models to interact with users over multiple turns to generate a coherent sequence of images, while maintaining subject consistency despite frequent subject changes.
To address this issue, the paper introduces a training-free multi-agent framework called AutoStudio.

Plain English Explanation

The paper focuses on a more advanced task in the field of text-to-image (T2I) generation. While current T2I models can create impressive single images, the researchers wanted to tackle the even more challenging problem of generating a sequence of coherent images through a multi-turn interaction with the user.

In this scenario, users may frequently change the subject they want to see in the images. The researchers found that existing approaches struggle to maintain consistency across these subject changes while still generating diverse images. To solve this problem, they developed a new framework called AutoStudio.

AutoStudio uses a team of specialized agents, each with a different role, to handle the multi-turn interaction and image generation process. This includes an agent to manage the context of each subject, another to generate the layout and positioning of elements in the image, a supervisor to refine the layout, and a final agent to complete the image generation.

Additionally, the researchers introduced a modified network called Parallel-UNet, which replaces the original UNet in the image-generating agent and uses two parallel cross-attention modules to exploit subject-aware features. They also developed a subject-initialized generation method to better preserve small subjects in the generated images.

The key idea is to use this collaborative, multi-agent approach to maintain subject consistency and generate a coherent series of images, even as the user changes the desired subject matter over the course of the interaction.

Technical Explanation

The paper introduces a training-free multi-agent framework called AutoStudio to address the challenge of multi-turn interactive image generation. AutoStudio consists of four agents, three based on large language models (LLMs) and one based on Stable Diffusion:

Subject Manager: Responsible for interpreting the interaction dialogue and managing the context of each subject.
Layout Generator: Generates fine-grained bounding boxes to control the locations of subjects in the image.
Supervisor: Provides suggestions for refining the layout generated by the Layout Generator.
Drawer: Completes the final image generation using a Stable Diffusion-based model.
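The hand-off between the four agents can be sketched as a small pipeline. This is only an illustrative sketch under stated assumptions: every class and function name below (`SubjectManager`, `generate_layout`, `supervise`, `draw`, `run_turn`) is hypothetical, the LLM agents are replaced by simple deterministic rules, and the Drawer returns its conditioning inputs instead of calling a diffusion model.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

# (x0, y0, x1, y1), normalized to [0, 1]. Hypothetical box format.
Box = Tuple[float, float, float, float]


@dataclass
class Subject:
    name: str
    description: str


@dataclass
class SubjectManager:
    """Stand-in for the LLM agent that tracks each subject's context."""
    subjects: Dict[str, Subject] = field(default_factory=dict)

    def update(self, mentions: Dict[str, str]) -> List[Subject]:
        # Reuse the stored description when a subject reappears, so its
        # appearance stays fixed across turns.
        return [
            self.subjects.setdefault(name, Subject(name, description))
            for name, description in mentions.items()
        ]


def generate_layout(subjects: List[Subject]) -> Dict[str, Box]:
    # Placeholder policy (assumption): equal-width strips, left to right.
    n = len(subjects)
    return {s.name: (i / n, 0.2, (i + 1) / n, 0.9) for i, s in enumerate(subjects)}


def supervise(layout: Dict[str, Box], margin: float = 0.02) -> Dict[str, Box]:
    # Placeholder refinement (assumption): shrink boxes so neighbours do not touch.
    return {
        name: (x0 + margin, y0, x1 - margin, y1)
        for name, (x0, y0, x1, y1) in layout.items()
    }


def draw(subjects: List[Subject], layout: Dict[str, Box]) -> dict:
    # Stand-in for the diffusion-based Drawer: returns what it would be conditioned on.
    return {"prompts": {s.name: s.description for s in subjects}, "layout": layout}


def run_turn(manager: SubjectManager, mentions: Dict[str, str]) -> dict:
    subjects = manager.update(mentions)
    layout = supervise(generate_layout(subjects))
    return draw(subjects, layout)


manager = SubjectManager()
turn1 = run_turn(manager, {"girl": "a girl with red hair", "dog": "a small white dog"})
turn2 = run_turn(manager, {"dog": "a dog"})  # the dog keeps its turn-1 description
```

Keeping the subject records in one place is what lets a later turn mention "the dog" loosely while the Drawer still receives the original description.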

The researchers replace the original UNet in the Drawer with a Parallel-UNet, which employs two parallel cross-attention modules to exploit subject-aware features.
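To make "two parallel cross-attention modules" concrete, here is a minimal pure-Python sketch of one latent being attended against two separate contexts and the results being merged. It is not the paper's Parallel-UNet: the identity projections (keys and values equal to the context), the split into a text branch and a subject branch, and the weighted sum controlled by the hypothetical `subject_weight` parameter are all assumptions made for illustration.

```python
import math
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a: Matrix) -> Matrix:
    return [list(col) for col in zip(*a)]


def softmax_rows(a: Matrix) -> Matrix:
    out = []
    for row in a:
        peak = max(row)  # subtract the max for numerical stability
        exps = [math.exp(v - peak) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def cross_attention(queries: Matrix, context: Matrix) -> Matrix:
    # Scaled dot-product attention with identity projections (assumption):
    # keys and values are the context rows themselves.
    dim = len(queries[0])
    scores = matmul(queries, transpose(context))
    scaled = [[s / math.sqrt(dim) for s in row] for row in scores]
    return matmul(softmax_rows(scaled), context)


def parallel_cross_attention(
    latent: Matrix, text_ctx: Matrix, subject_ctx: Matrix, subject_weight: float = 0.5
) -> Matrix:
    # Both branches read the same latent; neither sees the other's output.
    text_out = cross_attention(latent, text_ctx)
    subject_out = cross_attention(latent, subject_ctx)
    # Merge rule is an assumption: a convex combination of the two branches.
    return [
        [(1 - subject_weight) * t + subject_weight * s for t, s in zip(t_row, s_row)]
        for t_row, s_row in zip(text_out, subject_out)
    ]


latent = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
text_ctx = [[0.5, 0.5], [1.0, -1.0]]
subject_ctx = [[2.0, 0.0]]
mixed = parallel_cross_attention(latent, text_ctx, subject_ctx)
```

Running the branches side by side, as opposed to one after the other, means the subject features are not filtered through the text-conditioned output before they reach the latent.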

They also propose a subject-initialized generation method to better preserve small subjects in the generated images.
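This summary does not describe how subject-initialized generation works. One plausible reading, sketched below purely as an assumption and not as the paper's procedure, is that the starting latent is seeded with a subject's features inside its bounding box so that a small subject is not lost in pure noise. The function names, the nearest-neighbour resize, and the `strength` blend factor are all hypothetical.

```python
import random
from typing import List, Tuple

Grid = List[List[float]]


def init_latent(height: int, width: int, seed: int = 0) -> Grid:
    # Plain Gaussian noise, as a stand-in for a diffusion model's starting latent.
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(width)] for _ in range(height)]


def resize_nearest(patch: Grid, out_h: int, out_w: int) -> Grid:
    in_h, in_w = len(patch), len(patch[0])
    return [
        [patch[r * in_h // out_h][c * in_w // out_w] for c in range(out_w)]
        for r in range(out_h)
    ]


def seed_subject(
    latent: Grid, subject_patch: Grid, box: Tuple[int, int, int, int], strength: float = 0.7
) -> Grid:
    # box = (row0, col0, row1, col1) in latent cells, end-exclusive (assumed format).
    r0, c0, r1, c1 = box
    if r1 <= r0 or c1 <= c0:
        raise ValueError("box must have positive height and width")
    resized = resize_nearest(subject_patch, r1 - r0, c1 - c0)
    seeded = [row[:] for row in latent]  # copy so the input stays untouched
    for r in range(r0, r1):
        for c in range(c0, c1):
            # Blend subject features into the noise inside the box only.
            seeded[r][c] = strength * resized[r - r0][c - c0] + (1 - strength) * latent[r][c]
    return seeded


noise = init_latent(8, 8, seed=1)
subject_patch = [[1.0, 2.0], [3.0, 4.0]]
seeded = seed_subject(noise, subject_patch, box=(2, 2, 6, 6))
```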

Extensive experiments on the CMIGBench benchmark and human evaluations show that AutoStudio outperforms existing state-of-the-art approaches in maintaining multi-subject consistency across multiple turns. The system achieves a 13.65% improvement in average Fréchet Inception Distance and a 2.83% improvement in average character-character similarity.
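For reference, the Fréchet Inception Distance is conventionally computed from the means and covariances of Inception features of real images, $(\mu_r, \Sigma_r)$, and generated images, $(\mu_g, \Sigma_g)$. The symbols here are the standard ones, not notation taken from the paper, and this summary does not say how the score is averaged across turns.

```latex
\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2
  + \operatorname{Tr}\!\left( \Sigma_r + \Sigma_g - 2\left( \Sigma_r \Sigma_g \right)^{1/2} \right)
```

Lower FID is better, so an improvement in FID means the score went down.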

Critical Analysis

The paper presents a comprehensive and well-designed solution to the challenging problem of multi-turn interactive image generation. The use of a multi-agent framework, with each agent specializing in a particular task, is a clever approach that allows the system to handle the complexities of the problem.

One potential limitation mentioned in the paper is the need for further research to improve the system's ability to handle abrupt subject changes or completely new subjects that its underlying pretrained models have not seen. The researchers suggest exploring more advanced language understanding and context-aware generation techniques to address this issue.

Additionally, while the paper focuses on text-to-image generation, the underlying principles and approaches could potentially be extended to other multimodal tasks, such as text-to-video generation or character-driven interactive storytelling. Exploring these possibilities could lead to further advances in multi-subject personalization and in training-free, subject-enhanced attention guidance.

Conclusion

The paper introduces a novel, training-free multi-agent framework called AutoStudio that addresses the challenge of multi-turn interactive image generation. By employing specialized agents to handle different aspects of the task, the system can maintain subject consistency and generate a coherent sequence of images, even as the user changes the desired subject matter.

The key innovations, such as the Parallel-UNet architecture and the subject-initialized generation method, enable AutoStudio to outperform existing state-of-the-art approaches in terms of both image quality and subject consistency. This research represents an important step forward in the development of more advanced and user-friendly text-to-image generation systems, with potential applications in areas like interactive storytelling, personalized content creation, and smart digital assistants.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

AutoStudio: Crafting Consistent Subjects in Multi-turn Interactive Image Generation

Junhao Cheng, Xi Lu, Hanhui Li, Khun Loun Zai, Baiqiao Yin, Yuhao Cheng, Yiqiang Yan, Xiaodan Liang

As cutting-edge Text-to-Image (T2I) generation models already excel at producing remarkable single images, an even more challenging task, i.e., multi-turn interactive image generation begins to attract the attention of related research communities. This task requires models to interact with users over multiple turns to generate a coherent sequence of images. However, since users may switch subjects frequently, current efforts struggle to maintain subject consistency while generating diverse images. To address this issue, we introduce a training-free multi-agent framework called AutoStudio. AutoStudio employs three agents based on large language models (LLMs) to handle interactions, along with a stable diffusion (SD) based agent for generating high-quality images. Specifically, AutoStudio consists of (i) a subject manager to interpret interaction dialogues and manage the context of each subject, (ii) a layout generator to generate fine-grained bounding boxes to control subject locations, (iii) a supervisor to provide suggestions for layout refinements, and (iv) a drawer to complete image generation. Furthermore, we introduce a Parallel-UNet to replace the original UNet in the drawer, which employs two parallel cross-attention modules for exploiting subject-aware features. We also introduce a subject-initialized generation method to better preserve small subjects. Our AutoStudio hereby can generate a sequence of multi-subject images interactively and consistently. Extensive experiments on the public CMIGBench benchmark and human evaluations show that AutoStudio maintains multi-subject consistency across multiple turns well, and it also raises the state-of-the-art performance by 13.65% in average Frechet Inception Distance and 2.83% in average character-character similarity.

6/12/2024

Training-Free Consistent Text-to-Image Generation

Yoad Tewel, Omri Kaduri, Rinon Gal, Yoni Kasten, Lior Wolf, Gal Chechik, Yuval Atzmon

Text-to-image models offer a new level of creative flexibility by allowing users to guide the image generation process through natural language. However, using these models to consistently portray the same subject across diverse prompts remains challenging. Existing approaches fine-tune the model to teach it new words that describe specific user-provided subjects or add image conditioning to the model. These methods require lengthy per-subject optimization or large-scale pre-training. Moreover, they struggle to align generated images with text prompts and face difficulties in portraying multiple subjects. Here, we present ConsiStory, a training-free approach that enables consistent subject generation by sharing the internal activations of the pretrained model. We introduce a subject-driven shared attention block and correspondence-based feature injection to promote subject consistency between images. Additionally, we develop strategies to encourage layout diversity while maintaining subject consistency. We compare ConsiStory to a range of baselines, and demonstrate state-of-the-art performance on subject consistency and text alignment, without requiring a single optimization step. Finally, ConsiStory can naturally extend to multi-subject scenarios, and even enable training-free personalization for common objects.

5/31/2024

DisenStudio: Customized Multi-subject Text-to-Video Generation with Disentangled Spatial Control

Hong Chen, Xin Wang, Yipeng Zhang, Yuwei Zhou, Zeyang Zhang, Siao Tang, Wenwu Zhu

Generating customized content in videos has received increasing attention recently. However, existing works primarily focus on customized text-to-video generation for single subject, suffering from subject-missing and attribute-binding problems when the video is expected to contain multiple subjects. Furthermore, existing models struggle to assign the desired actions to the corresponding subjects (action-binding problem), failing to achieve satisfactory multi-subject generation performance. To tackle the problems, in this paper, we propose DisenStudio, a novel framework that can generate text-guided videos for customized multiple subjects, given few images for each subject. Specifically, DisenStudio enhances a pretrained diffusion-based text-to-video model with our proposed spatial-disentangled cross-attention mechanism to associate each subject with the desired action. Then the model is customized for the multiple subjects with the proposed motion-preserved disentangled finetuning, which involves three tuning strategies: multi-subject co-occurrence tuning, masked single-subject tuning, and multi-subject motion-preserved tuning. The first two strategies guarantee the subject occurrence and preserve their visual attributes, and the third strategy helps the model maintain the temporal motion-generation ability when finetuning on static images. We conduct extensive experiments to demonstrate our proposed DisenStudio significantly outperforms existing methods in various metrics. Additionally, we show that DisenStudio can be used as a powerful tool for various controllable generation applications.

5/22/2024

TheaterGen: Character Management with LLM for Consistent Multi-turn Image Generation

Junhao Cheng, Baiqiao Yin, Kaixin Cai, Minbin Huang, Hanhui Li, Yuxin He, Xi Lu, Yue Li, Yifei Li, Yuhao Cheng, Yiqiang Yan, Xiaodan Liang

Recent advances in diffusion models can generate high-quality and stunning images from text. However, multi-turn image generation, which is in high demand in real-world scenarios, still faces challenges in maintaining semantic consistency between images and texts, as well as contextual consistency of the same subject across multiple interactive turns. To address this issue, we introduce TheaterGen, a training-free framework that integrates large language models (LLMs) and text-to-image (T2I) models to provide the capability of multi-turn image generation. Within this framework, LLMs, acting as a Screenwriter, engage in multi-turn interaction, generating and managing a standardized prompt book that encompasses prompts and layout designs for each character in the target image. Based on these, TheaterGen generates a list of character images and extracts guidance information, akin to the Rehearsal. Subsequently, through incorporating the prompt book and guidance information into the reverse denoising process of T2I diffusion models, TheaterGen generates the final image, akin to the Final Performance. With the effective management of prompt books and character images, TheaterGen significantly improves semantic and contextual consistency in synthesized images. Furthermore, we introduce a dedicated benchmark, CMIGBench (Consistent Multi-turn Image Generation Benchmark) with 8000 multi-turn instructions. Different from previous multi-turn benchmarks, CMIGBench does not define characters in advance. Both the tasks of story generation and multi-turn editing are included on CMIGBench for comprehensive evaluation. Extensive experimental results show that TheaterGen outperforms state-of-the-art methods significantly. It raises the performance bar of the cutting-edge Mini DALLE 3 model by 21% in average character-character similarity and 19% in average text-image similarity.

4/30/2024