Ranni: Taming Text-to-Image Diffusion for Accurate Instruction Following

Read original: arXiv:2311.17002 - Published 4/10/2024 by Yutong Feng, Biao Gong, Di Chen, Yujun Shen, Yu Liu, Jingren Zhou

Overview

Existing text-to-image (T2I) diffusion models struggle with interpreting complex prompts, especially those involving quantity, object-attribute binding, and multi-subject descriptions.
The researchers introduce a "semantic panel" as a middleware to help the generator better follow instructions from the input text.
The semantic panel is obtained by arranging visual concepts parsed from the input text using large language models, which is then injected into the denoising network as a detailed control signal.
A carefully designed semantic formatting protocol and an automated data preparation pipeline are used to facilitate text-to-panel learning.
This approach, called Ranni, enhances a pre-trained T2I generator's textual controllability and allows for more convenient interaction and customization.

Plain English Explanation

Existing AI systems that generate images from text prompts often have trouble following complex instructions, especially ones that specify how many objects to draw, which attributes belong to which object, or several subjects at once. The researchers in this paper introduce a new approach called Ranni that aims to address these challenges.

The key idea is to use a "semantic panel" as an intermediary between the text prompt and the image generation process. This semantic panel arranges the different visual concepts that are mentioned in the text, as identified by large language models. This panel is then fed into the image generator as a detailed guide to help it better follow the instructions in the text prompt.

To make this text-to-panel learning process work, the researchers developed a special formatting protocol and an automated data preparation pipeline. This allows their system to enhance the textual controllability of an existing image generation model.

Importantly, the use of this semantic panel also makes the interaction with the system more convenient. Users can directly adjust the elements in the panel or use language instructions to customize the generated images. This opens up the possibility of continuous generation and chatting-based editing, which the researchers demonstrate in a practical system.

Technical Explanation

The core technical innovation of this work is the introduction of a "semantic panel" as a middleware between the text prompt and the image generation process. This semantic panel is obtained by arranging the visual concepts parsed from the input text with the aid of large language models. The panel is then injected into the denoising network of the image generator as a detailed control signal to complement the text condition.
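To make the idea of a panel acting as a spatial control signal more concrete, here is a minimal sketch. It is not the paper's encoding: the `PanelElement` fields, the normalized box convention, and the `rasterize` helper are all assumptions made for illustration. It only shows how a list of parsed concepts with positions could be turned into a grid that a generator might consume alongside the text condition.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Hypothetical structure: the field names and box convention are assumptions
# for illustration, not the format defined in the paper.
@dataclass
class PanelElement:
    description: str                          # one visual concept, e.g. "red apple"
    box: Tuple[float, float, float, float]    # normalized (x0, y0, x1, y1), origin top-left


def rasterize(panel: List[PanelElement], height: int, width: int) -> List[List[int]]:
    """Paint each element's box onto a grid.

    A cell holds the 1-based index of the element covering it, or 0 if empty.
    Later elements overwrite earlier ones where boxes overlap.
    """
    grid = [[0] * width for _ in range(height)]
    for idx, el in enumerate(panel, start=1):
        x0, y0, x1, y1 = el.box
        c0, c1 = int(x0 * width), int(x1 * width)
        r0, r1 = int(y0 * height), int(y1 * height)
        for r in range(max(r0, 0), min(r1, height)):
            for c in range(max(c0, 0), min(c1, width)):
                grid[r][c] = idx
    return grid


panel = [
    PanelElement("red apple", (0.0, 0.0, 0.5, 0.5)),
    PanelElement("green pear", (0.5, 0.5, 1.0, 1.0)),
]
control = rasterize(panel, height=4, width=4)
for row in control:
    print(row)
```

In this toy version the grid only records which element occupies each cell. A real control signal would presumably carry richer per-element information, but the summary does not describe it, so none is assumed here.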

To facilitate the text-to-panel learning, the researchers designed a semantic formatting protocol that specifies how the visual concepts should be organized and represented in the panel. They also developed a fully-automatic data preparation pipeline to generate the training data for this text-to-panel learning task.
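A formatting protocol of this kind matters because a language model emits text, so the panel has to be written and read back as text without ambiguity. The sketch below uses an invented line format, `description | x0,y0,x1,y1`, purely as an assumption; the paper's actual protocol is not reproduced in this summary. It shows the round trip between a structured panel and the text a model would be trained to produce.

```python
from typing import List, Tuple

# (description, (x0, y0, x1, y1)) with integer pixel coordinates.
# Both the tuple layout and the text format below are hypothetical.
Element = Tuple[str, Tuple[int, int, int, int]]


def format_panel(elements: List[Element]) -> str:
    """Serialize a panel as one 'description | x0,y0,x1,y1' line per element."""
    return "\n".join(
        f"{desc} | {x0},{y0},{x1},{y1}" for desc, (x0, y0, x1, y1) in elements
    )


def parse_panel(text: str) -> List[Element]:
    """Parse the text format back into elements, rejecting malformed lines."""
    elements: List[Element] = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        # Split on the last '|' so a description may itself contain '|'.
        desc, _, coords = line.rpartition("|")
        parts = [int(p) for p in coords.split(",")]
        if not desc.strip() or len(parts) != 4:
            raise ValueError(f"malformed panel line: {line!r}")
        elements.append((desc.strip(), (parts[0], parts[1], parts[2], parts[3])))
    return elements


original = [("a black cat", (0, 0, 128, 128)), ("a wooden chair", (128, 64, 256, 256))]
text = format_panel(original)
print(text)
restored = parse_panel(text)
```

A strict parser is useful in a setup like this because model output can be malformed; failing loudly on a bad line is one simple way to keep broken panels away from the generator.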

The researchers call their approach Ranni, and they show that it can enhance the textual controllability of a pre-trained T2I generator. Importantly, the semantic panel introduces a more convenient form of interaction, allowing users to directly adjust the elements in the panel or use language instructions to customize the generation process.
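Directly adjusting elements in the panel can be pictured as a handful of small edit operations applied to the panel before the image is regenerated. The following sketch assumes a panel stored as a list of dictionaries with `description` and `box` keys; the representation and the three operations (`move`, `replace`, `remove`) are hypothetical and chosen only to illustrate the interaction. Each function returns a new panel, so every intermediate state stays available for step-by-step generation.

```python
from typing import Any, Dict, List

# Hypothetical panel representation: a list of {"description", "box"} dicts,
# where "box" is a normalized (x0, y0, x1, y1) tuple.
Panel = List[Dict[str, Any]]


def move(panel: Panel, index: int, dx: float, dy: float) -> Panel:
    """Return a copy of the panel with one element's box shifted by (dx, dy)."""
    edited = [dict(el) for el in panel]
    x0, y0, x1, y1 = edited[index]["box"]
    edited[index]["box"] = (x0 + dx, y0 + dy, x1 + dx, y1 + dy)
    return edited


def replace(panel: Panel, index: int, description: str) -> Panel:
    """Return a copy of the panel with one element's description swapped."""
    edited = [dict(el) for el in panel]
    edited[index]["description"] = description
    return edited


def remove(panel: Panel, index: int) -> Panel:
    """Return a copy of the panel without the element at the given index."""
    return [dict(el) for i, el in enumerate(panel) if i != index]


panel = [
    {"description": "a cat", "box": (0.0, 0.0, 0.25, 0.25)},
    {"description": "a dog", "box": (0.5, 0.5, 1.0, 1.0)},
]
step1 = move(panel, 0, 0.5, 0.5)      # shift the cat down and to the right
step2 = replace(step1, 1, "a fox")    # turn the dog into a fox
step3 = remove(step2, 0)              # drop the cat entirely
print(step3)
```

A language instruction such as "move the cat to the right" could in principle be mapped onto one of these operations, but that mapping step is not shown because the summary does not describe how it is done.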

The researchers demonstrate the potential of this system in continuous generation and chatting-based editing, showcasing the benefits of their approach over traditional T2I models.

Critical Analysis

The researchers have addressed an important challenge in text-to-image generation, which is the ability to handle complex prompts with quantity, object-attribute binding, and multi-subject descriptions. The introduction of the semantic panel as a middleware is a novel and potentially useful approach.

However, the paper does not provide a detailed evaluation of the performance gains achieved by Ranni compared to other state-of-the-art T2I models, such as DiffAgent or Latent Diffusion. The qualitative examples shown are promising, but a more rigorous quantitative comparison would help better assess the benefits of the semantic panel approach.

Additionally, the paper does not address potential biases that may be introduced by the language models used to parse the text or the formatting protocol used to represent the semantic panel. These are important considerations for ensuring the fairness and robustness of the overall system.

Overall, the Ranni approach is an interesting and potentially impactful contribution to the field of text-to-image generation. Further research and evaluation would help solidify its strengths and identify any areas for improvement.

Conclusion

This paper introduces a novel approach called Ranni that uses a "semantic panel" as a middleware to enhance the textual controllability of text-to-image (T2I) diffusion models. By arranging visual concepts parsed from the input text using large language models, the semantic panel provides a detailed control signal to the image generator, helping it better follow complex instructions.

The researchers have developed a carefully designed semantic formatting protocol and an automated data preparation pipeline to facilitate the text-to-panel learning process. This approach allows for more convenient interaction and customization, enabling features like continuous generation and chatting-based editing.

While the qualitative results are promising, further research is needed to thoroughly evaluate the performance gains of Ranni compared to other state-of-the-art T2I models and to address potential biases introduced by the language models and formatting protocol. Nonetheless, the Ranni approach represents an interesting and potentially impactful contribution to the field of text-to-image generation.
