DialogGen: Multi-modal Interactive Dialogue System for Multi-turn Text-to-Image Generation

Read original: arXiv:2403.08857 - Published 7/4/2024 by Minbin Huang, Yanxin Long, Xinchi Deng, Ruihang Chu, Jiangfeng Xiong, Xiaodan Liang, Hong Cheng, Qinglin Lu, Wei Liu

Overview

• This paper introduces DialogGen, a multi-modal interactive dialogue system for multi-turn text-to-image generation.
• The system allows users to engage in back-and-forth conversations to iteratively refine and improve generated images.
• It leverages large language models and vision transformers to enable this interactive text-to-image generation process.

Plain English Explanation

• DialogGen is a new AI system that lets you have a conversation to create images. You can describe what you want, and the system will generate an image. Then you can give feedback, and the system will update the image based on your comments.
• This interactive process allows you to work with the system to get the exact image you want, rather than just describing it and getting a single result.
• The system uses large language models, which are AI systems trained on vast amounts of text data, and vision transformers, which are AI models trained on image data. This allows it to understand your text prompts and generate matching images.

Technical Explanation

• DialogGen consists of a language model, a vision transformer, and a dialogue manager that coordinates the interaction between the two.
• The language model takes the user's text prompts as input and generates relevant image descriptions.
• The vision transformer then uses these descriptions to generate corresponding images.
• The dialogue manager keeps track of the conversation history and uses it to provide more context for subsequent image generations.
• The system is trained on large datasets of image-text pairs to learn the associations between language and visual concepts.
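
The interaction between the three components described above can be illustrated with a minimal Python sketch. This is not the paper's implementation: `DialogueManager`, `toy_describe`, and `toy_render` are hypothetical names, and the two toy functions are simple string-based stand-ins for a real language model and a real image generator. The sketch only shows how keeping the conversation history lets a later turn ("make it nighttime") build on an earlier one.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class DialogueManager:
    """Tracks conversation history and routes each turn (hypothetical design)."""

    describe: Callable[[List[str]], str]  # stand-in for the language model
    render: Callable[[str], str]          # stand-in for the image generator
    history: List[str] = field(default_factory=list)

    def turn(self, user_message: str) -> Tuple[str, str]:
        # Record the new message so that later turns can see it.
        self.history.append(user_message)
        # The language model receives the full history, not just the last message.
        description = self.describe(self.history)
        # The image generator only ever sees the merged description.
        image = self.render(description)
        return description, image


def toy_describe(history: List[str]) -> str:
    # Placeholder: a real model would rewrite the history into one coherent prompt.
    return "; ".join(history)


def toy_render(description: str) -> str:
    # Placeholder: a real generator would return pixels, not a string.
    return f"<image of: {description}>"


manager = DialogueManager(describe=toy_describe, render=toy_render)
first = manager.turn("a red fox in snow")
second = manager.turn("make it nighttime")
print(second[1])  # prints "<image of: a red fox in snow; make it nighttime>"
```

Passing the two models in as plain callables keeps the history-tracking logic independent of any particular model, so the stand-ins could in principle be swapped for real components without changing `DialogueManager`.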

Critical Analysis

• One potential limitation of DialogGen is that the interactive dialogue process may be time-consuming, requiring multiple rounds of feedback and refinement to achieve the desired image.
• Additionally, the system's performance may be constrained by the quality and diversity of the training data, which could limit the types of images it can generate.
• Further research could explore ways to make the dialogue more efficient, such as by incorporating user preferences or previous interactions to inform the image generation process.
• Ethical concerns about such systems, including the risk of bias and the generation of harmful content, should also be carefully considered.

Conclusion

• DialogGen represents an important step forward in the development of interactive, multi-modal AI systems that can assist users in achieving their desired creative and expressive goals.
• By leveraging large language models and vision transformers to enable iterative text-to-image generation, the system opens up new possibilities for collaborative and personalized content creation.
• As generative AI systems continue to advance, it will be crucial to address the ethical and societal implications of these technologies, ensuring they are developed and deployed responsibly.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

DialogGen: Multi-modal Interactive Dialogue System for Multi-turn Text-to-Image Generation

Minbin Huang, Yanxin Long, Xinchi Deng, Ruihang Chu, Jiangfeng Xiong, Xiaodan Liang, Hong Cheng, Qinglin Lu, Wei Liu

Text-to-image (T2I) generation models have significantly advanced in recent years. However, effective interaction with these models is challenging for average users due to the need for specialized prompt engineering knowledge and the inability to perform multi-turn image generation, hindering a dynamic and iterative creation process. Recent attempts have tried to equip Multi-modal Large Language Models (MLLMs) with T2I models to bring the user's natural language instructions into reality. Hence, the output modality of MLLMs is extended, and the multi-turn generation quality of T2I models is enhanced thanks to the strong multi-modal comprehension ability of MLLMs. However, many of these works face challenges in identifying correct output modalities and generating coherent images accordingly as the number of output modalities increases and the conversations go deeper. Therefore, we propose DialogGen, an effective pipeline to align off-the-shelf MLLMs and T2I models to build a Multi-modal Interactive Dialogue System (MIDS) for multi-turn Text-to-Image generation. It is composed of drawing prompt alignment, careful training data curation, and error correction. Moreover, as the field of MIDS flourishes, comprehensive benchmarks are urgently needed to evaluate MIDS fairly in terms of output modality correctness and multi-modal output coherence. To address this issue, we introduce the Multi-modal Dialogue Benchmark (DialogBen), a comprehensive bilingual benchmark designed to assess the ability of MLLMs to generate accurate and coherent multi-modal content that supports image editing. It contains two evaluation metrics to measure the model's ability to switch modalities and the coherence of the output images. Our extensive experiments on DialogBen and user study demonstrate the effectiveness of DialogGen compared with other State-of-the-Art models.

7/4/2024

TheaterGen: Character Management with LLM for Consistent Multi-turn Image Generation

Junhao Cheng, Baiqiao Yin, Kaixin Cai, Minbin Huang, Hanhui Li, Yuxin He, Xi Lu, Yue Li, Yifei Li, Yuhao Cheng, Yiqiang Yan, Xiaodan Liang

Recent advances in diffusion models can generate high-quality and stunning images from text. However, multi-turn image generation, which is in high demand in real-world scenarios, still faces challenges in maintaining semantic consistency between images and texts, as well as contextual consistency of the same subject across multiple interactive turns. To address this issue, we introduce TheaterGen, a training-free framework that integrates large language models (LLMs) and text-to-image (T2I) models to provide the capability of multi-turn image generation. Within this framework, LLMs, acting as a Screenwriter, engage in multi-turn interaction, generating and managing a standardized prompt book that encompasses prompts and layout designs for each character in the target image. Based on these, TheaterGen generates a list of character images and extracts guidance information, akin to the Rehearsal. Subsequently, by incorporating the prompt book and guidance information into the reverse denoising process of T2I diffusion models, TheaterGen generates the final image, akin to conducting the Final Performance. With the effective management of prompt books and character images, TheaterGen significantly improves semantic and contextual consistency in synthesized images. Furthermore, we introduce a dedicated benchmark, CMIGBench (Consistent Multi-turn Image Generation Benchmark), with 8000 multi-turn instructions. Different from previous multi-turn benchmarks, CMIGBench does not define characters in advance. Both the tasks of story generation and multi-turn editing are included in CMIGBench for comprehensive evaluation. Extensive experimental results show that TheaterGen outperforms state-of-the-art methods significantly. It raises the performance bar of the cutting-edge Mini DALLE 3 model by 21% in average character-character similarity and 19% in average text-image similarity.

4/30/2024

What Do You Want? User-centric Prompt Generation for Text-to-image Synthesis via Multi-turn Guidance

Yilun Liu, Minggui He, Feiyu Yao, Yuhe Ji, Shimin Tao, Jingzhou Du, Duan Li, Jian Gao, Li Zhang, Hao Yang, Boxing Chen, Osamu Yoshie

The emergence of text-to-image synthesis (TIS) models has significantly influenced digital image creation by producing high-quality visuals from written descriptions. Yet these models heavily rely on the quality and specificity of textual prompts, posing a challenge for novice users who may not be familiar with TIS-model-preferred prompt writing. Existing solutions relieve this via automatic model-preferred prompt generation from user queries. However, this single-turn manner suffers from limited user-centricity in terms of result interpretability and user interactivity. To address these issues, we propose DialPrompt, a multi-turn dialogue-based TIS prompt generation model that emphasises user-centricity. DialPrompt is designed to follow a multi-turn guidance workflow, where in each round of dialogue the model queries the user about their preferences on possible optimization dimensions before generating the final TIS prompt. To achieve this, we mined 15 essential dimensions for high-quality prompts from advanced users and curated a multi-turn dataset. Through training on this dataset, DialPrompt can improve interpretability by allowing users to understand the correlation between specific phrases and image attributes. Additionally, it enables greater user control and engagement in the prompt generation process, leading to more personalized and visually satisfying outputs. Experiments indicate that DialPrompt achieves a competitive result in the quality of synthesized images, outperforming existing prompt engineering approaches by 5.7%. Furthermore, in our user evaluation, DialPrompt outperforms existing approaches by 46.5% in user-centricity score and is rated 7.9/10 by 19 human reviewers.

8/26/2024

User-Friendly Customized Generation with Multi-Modal Prompts

Linhao Zhong, Yan Hong, Wentao Chen, Binglin Zhou, Yiyi Zhang, Jianfu Zhang, Liqing Zhang

Text-to-image generation models have seen considerable advancement, catering to the increasing interest in personalized image creation. Current customization techniques often require users to provide multiple images (typically 3-5) for each customized object, along with the classification of these objects and descriptive textual prompts for scenes. This paper questions whether the process can be made more user-friendly and the customization more intricate. We propose a method in which users need only provide images along with text for each customization topic, and which requires only a single image per visual concept. We introduce the concept of a "multi-modal prompt", a novel integration of text and images tailored to each customization concept, which simplifies user interaction and facilitates precise customization of both objects and scenes. Our proposed paradigm for customized text-to-image generation surpasses existing finetune-based methods in user-friendliness and the ability to customize complex objects with user-friendly inputs. Our code is available at https://github.com/zhongzero/Multi-Modal-Prompt.

5/28/2024