Compositional Text-to-Image Generation with Dense Blob Representations

2405.08246

Published 5/15/2024 by Weili Nie, Sifei Liu, Morteza Mardani, Chao Liu, Benjamin Eckart, Arash Vahdat

Abstract

Existing text-to-image models struggle to follow complex text prompts, raising the need for extra grounding inputs for better controllability. In this work, we propose to decompose a scene into visual primitives - denoted as dense blob representations - that contain fine-grained details of the scene while being modular, human-interpretable, and easy-to-construct. Based on blob representations, we develop a blob-grounded text-to-image diffusion model, termed BlobGEN, for compositional generation. Particularly, we introduce a new masked cross-attention module to disentangle the fusion between blob representations and visual features. To leverage the compositionality of large language models (LLMs), we introduce a new in-context learning approach to generate blob representations from text prompts. Our extensive experiments show that BlobGEN achieves superior zero-shot generation quality and better layout-guided controllability on MS-COCO. When augmented by LLMs, our method exhibits superior numerical and spatial correctness on compositional image generation benchmarks. Project page: https://blobgen-2d.github.io.

Overview

Existing text-to-image models struggle to follow complex text prompts, leading to the need for extra inputs to improve controllability.
This work proposes to decompose a scene into visual primitives called "dense blob representations" that contain fine-grained details while being modular, human-interpretable, and easy-to-construct.
Based on blob representations, the authors developed a blob-grounded text-to-image diffusion model called BlobGEN for compositional generation.
The paper introduces a new masked cross-attention module to disentangle the fusion between blob representations and visual features.
An in-context learning approach is used to generate blob representations from text prompts, leveraging the compositionality of large language models (LLMs).

Plain English Explanation

Creating images from text descriptions is a challenging task, as existing text-to-image models struggle to follow complex prompts. To address this, the researchers in this paper propose a new approach that breaks down the image into simpler, more manageable components called "blobs."

Blobs are visual building blocks that contain detailed information about the scene, but are also modular, easy for humans to understand, and straightforward to create. The researchers developed a new text-to-image model called BlobGEN that uses these blob representations as a foundation for generating images.

A key innovation is a masked cross-attention module that keeps the fusion between blob representations and the visual features being generated disentangled. The researchers also generate the blob representations directly from text prompts through in-context learning, using the compositional capabilities of large language models.

Through extensive experiments on MS-COCO, the researchers show that BlobGEN generates higher-quality images in a zero-shot setting (that is, without fine-tuning on that dataset) and offers better layout-guided control than the models it is compared against. When an LLM supplies the blobs, the generated images also get object counts and spatial relationships right more often on compositional image generation benchmarks.

Technical Explanation

The core idea of this work is to decompose a scene into visual primitives called "dense blob representations" that capture fine-grained details of the scene in a modular, human-interpretable, and easy-to-construct format.
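To make the idea of a blob concrete, here is a minimal sketch of one possible data structure. It assumes that each blob pairs a tilted ellipse (center, two semi-axes, rotation angle) with a free-text description; this summary does not spell out the parameterization, so the field names, the normalized coordinates, and the `rasterize` helper are all illustrative assumptions rather than the authors' implementation.

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass
class Blob:
    """Hypothetical blob: a tilted ellipse plus a free-text description."""
    cx: float          # ellipse center x, normalized to [0, 1] (assumed convention)
    cy: float          # ellipse center y, normalized to [0, 1] (assumed convention)
    a: float           # semi-axis along the ellipse's own first direction
    b: float           # semi-axis along the ellipse's own second direction
    theta: float       # rotation angle in radians
    description: str   # what the region should contain

    def contains(self, x: float, y: float) -> bool:
        # Rotate the point into the ellipse's frame, then apply the ellipse equation.
        dx, dy = x - self.cx, y - self.cy
        u = dx * math.cos(self.theta) + dy * math.sin(self.theta)
        v = -dx * math.sin(self.theta) + dy * math.cos(self.theta)
        return (u / self.a) ** 2 + (v / self.b) ** 2 <= 1.0


def rasterize(blob: Blob, size: int) -> List[List[int]]:
    """Binary mask on a size x size grid, sampling each cell at its center."""
    return [
        [int(blob.contains((col + 0.5) / size, (row + 0.5) / size)) for col in range(size)]
        for row in range(size)
    ]


car = Blob(cx=0.5, cy=0.5, a=0.4, b=0.2, theta=0.0, description="a red car")
mask = rasterize(car, 4)
```

A representation like this is modular in the sense the paper describes: moving, resizing, or re-describing one blob leaves every other blob untouched.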

Based on these blob representations, the authors developed a blob-grounded text-to-image diffusion model, termed BlobGEN, for compositional generation. A key component is a new masked cross-attention module that helps disentangle the fusion between blob representations and visual features.
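The masking idea can be illustrated with a small, dependency-free sketch. It assumes that visual features act as queries, that each blob contributes one key and one value, and that a binary mask limits each blob to the spatial locations it covers. The function name, argument layout, and the choice to return zeros for uncovered locations are assumptions made for illustration; the actual BlobGEN module sits inside a diffusion network and is not reproduced here.

```python
import math
from typing import List

Vector = List[float]


def masked_cross_attention(
    queries: List[Vector],     # one query per spatial location (visual features)
    keys: List[Vector],        # one key per blob (derived from blob embeddings)
    values: List[Vector],      # one value per blob
    masks: List[List[int]],    # masks[j][i] == 1 if blob j covers location i
) -> List[Vector]:
    dim = len(queries[0])
    out_dim = len(values[0])
    outputs = []
    for i, q in enumerate(queries):
        # Only blobs whose region covers location i may contribute to it.
        visible = [j for j in range(len(keys)) if masks[j][i]]
        if not visible:
            # Assumed behavior: locations outside every blob receive no update.
            outputs.append([0.0] * out_dim)
            continue
        scores = [sum(x * y for x, y in zip(q, keys[j])) / math.sqrt(dim) for j in visible]
        peak = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        mixed = [0.0] * out_dim
        for w, j in zip(weights, visible):
            for d in range(out_dim):
                mixed[d] += (w / total) * values[j][d]
        outputs.append(mixed)
    return outputs


queries = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [1.0, 1.0]]
keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[1.0, 0.0], [0.0, 1.0]]
masks = [[1, 0, 0, 1], [0, 1, 0, 1]]
attended = masked_cross_attention(queries, keys, values, masks)
```

Because the softmax runs only over blobs that cover a location, a blob's description cannot leak into regions assigned to other blobs, which is one way to read "disentangling" the fusion.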

To leverage the compositional capabilities of large language models (LLMs), the authors introduce a novel in-context learning approach to generate blob representations directly from text prompts.
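In-context learning here means showing the LLM a few worked examples of caption-to-blob conversion and then asking it to continue the pattern for a new caption. The sketch below builds such a few-shot prompt and validates a returned completion. The prompt wording, the JSON format, and the field names are hypothetical, and no LLM is called; the paper's actual prompt template is not given in this summary.

```python
import json
from typing import Dict, List, Tuple

Example = Tuple[str, List[Dict]]

# Hypothetical blob fields: description plus tilted-ellipse parameters.
REQUIRED_FIELDS = {"description", "cx", "cy", "a", "b", "theta"}


def build_prompt(examples: List[Example], caption: str) -> str:
    """Assemble a few-shot prompt: instruction, solved examples, then the new caption."""
    parts = ["Convert each caption into a JSON list of blobs."]
    for example_caption, blobs in examples:
        parts.append(f"Caption: {example_caption}\nBlobs: {json.dumps(blobs)}")
    parts.append(f"Caption: {caption}\nBlobs:")
    return "\n\n".join(parts)


def parse_blobs(completion: str) -> List[Dict]:
    """Parse the model's completion and reject blobs with missing fields."""
    blobs = json.loads(completion.strip())
    for blob in blobs:
        missing = REQUIRED_FIELDS - blob.keys()
        if missing:
            raise ValueError(f"blob is missing fields: {sorted(missing)}")
    return blobs


examples = [
    ("a dog on grass",
     [{"description": "a brown dog", "cx": 0.5, "cy": 0.6, "a": 0.2, "b": 0.15, "theta": 0.0}]),
]
prompt = build_prompt(examples, "two cats on a sofa")
```

Validating the completion before it reaches the image generator is a cheap guard, since an LLM can return malformed or incomplete layouts.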

Extensive experiments show that BlobGEN achieves superior zero-shot generation quality and better layout-guided controllability on the MS-COCO dataset. When augmented by LLMs, the method exhibits improved numerical and spatial correctness on compositional image generation benchmarks.

Critical Analysis

The paper presents a promising approach to improve the controllability and compositional capabilities of text-to-image models. The use of blob representations as a modular and interpretable intermediate representation is a novel idea that could have broader applications.

However, the paper does not provide a thorough discussion of the limitations of the proposed approach. For example, it is unclear how the method would perform on more complex or abstract scenes, or how sensitive it is to the quality and diversity of the blob representations.

Additionally, the in-context learning approach for generating blob representations from text prompts relies on the availability of large language models, which may not be accessible or feasible for all users. Further research is needed to explore more efficient or accessible ways of generating these blob representations.

Overall, the paper makes a valuable contribution to the field of text-to-image generation, but there are opportunities for further development and refinement of the proposed techniques.

Conclusion

This work introduces a novel approach to text-to-image generation that uses visual primitives called "dense blob representations" to improve controllability and compositional accuracy in generated images.

BlobGEN demonstrates promising zero-shot generation quality and layout-guided controllability on MS-COCO, and its in-context learning approach for generating blob representations from text prompts improves numerical and spatial correctness on compositional image generation benchmarks.

While the paper presents an exciting step forward, further research is needed to address the limitations and explore the broader applications of this approach. Nonetheless, this work contributes valuable insights and techniques to the ongoing efforts in the field of text-to-image generation.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Understanding and Mitigating Compositional Issues in Text-to-Image Generative Models

Arman Zarei, Keivan Rezaei, Samyadeep Basu, Mehrdad Saberi, Mazda Moayeri, Priyatham Kattakinda, Soheil Feizi

Recent text-to-image diffusion-based generative models have the stunning ability to generate highly detailed and photo-realistic images and achieve state-of-the-art low FID scores on challenging image generation benchmarks. However, one of the primary failure modes of these text-to-image generative models is in composing attributes, objects, and their associated relationships accurately into an image. In our paper, we investigate this compositionality-based failure mode and highlight that imperfect text conditioning with CLIP text-encoder is one of the primary reasons behind the inability of these models to generate high-fidelity compositional scenes. In particular, we show that (i) there exists an optimal text-embedding space that can generate highly coherent compositional scenes which shows that the output space of the CLIP text-encoder is sub-optimal, and (ii) we observe that the final token embeddings in CLIP are erroneous as they often include attention contributions from unrelated tokens in compositional prompts. Our main finding shows that the best compositional improvements can be achieved (without harming the model's FID scores) by fine-tuning only a simple linear projection on CLIP's representation space in Stable-Diffusion variants using a small set of compositional image-text pairs. This result demonstrates that the sub-optimality of the CLIP's output space is a major error source. We also show that re-weighting the erroneous attention contributions in CLIP can also lead to improved compositional performances, however these improvements are often less significant than those achieved by solely learning a linear projection head, highlighting erroneous attentions to be only a minor error source.

6/13/2024

cs.CV

Compositional Neural Textures

Peihan Tu, Li-Yi Wei, Matthias Zwicker

Texture plays a vital role in enhancing visual richness in both real photographs and computer-generated imagery. However, the process of editing textures often involves laborious and repetitive manual adjustments of textons, which are the small, recurring local patterns that define textures. In this work, we introduce a fully unsupervised approach for representing textures using a compositional neural model that captures individual textons. We represent each texton as a 2D Gaussian function whose spatial support approximates its shape, and an associated feature that encodes its detailed appearance. By modeling a texture as a discrete composition of Gaussian textons, the representation offers both expressiveness and ease of editing. Textures can be edited by modifying the compositional Gaussians within the latent space, and new textures can be efficiently synthesized by feeding the modified Gaussians through a generator network in a feed-forward manner. This approach enables a wide range of applications, including transferring appearance from an image texture to another image, diversifying textures, texture interpolation, revealing/modifying texture variations, edit propagation, texture animation, and direct texton manipulation. The proposed approach contributes to advancing texture analysis, modeling, and editing techniques, and opens up new possibilities for creating visually appealing images with controllable textures.

4/22/2024

cs.GR cs.AI cs.CV cs.LG

Grounded Compositional and Diverse Text-to-3D with Pretrained Multi-View Diffusion Model

Xiaolong Li, Jiawei Mo, Ying Wang, Chethan Parameshwara, Xiaohan Fei, Ashwin Swaminathan, CJ Taylor, Zhuowen Tu, Paolo Favaro, Stefano Soatto

In this paper, we propose an effective two-stage approach named Grounded-Dreamer to generate 3D assets that can accurately follow complex, compositional text prompts while achieving high fidelity by using a pre-trained multi-view diffusion model. Multi-view diffusion models, such as MVDream, have shown to generate high-fidelity 3D assets using score distillation sampling (SDS). However, applied naively, these methods often fail to comprehend compositional text prompts, and may often entirely omit certain subjects or parts. To address this issue, we first advocate leveraging text-guided 4-view images as the bottleneck in the text-to-3D pipeline. We then introduce an attention refocusing mechanism to encourage text-aligned 4-view image generation, without the necessity to re-train the multi-view diffusion model or craft a high-quality compositional 3D dataset. We further propose a hybrid optimization strategy to encourage synergy between the SDS loss and the sparse RGB reference images. Our method consistently outperforms previous state-of-the-art (SOTA) methods in generating compositional 3D assets, excelling in both quality and accuracy, and enabling diverse 3D from the same text prompt.

4/30/2024

cs.CV cs.AI

RealCompo: Balancing Realism and Compositionality Improves Text-to-Image Diffusion Models

Xinchen Zhang, Ling Yang, Yaqi Cai, Zhaochen Yu, Kai-Ni Wang, Jiake Xie, Ye Tian, Minkai Xu, Yong Tang, Yujiu Yang, Bin Cui

Diffusion models have achieved remarkable advancements in text-to-image generation. However, existing models still have many difficulties when faced with multiple-object compositional generation. In this paper, we propose RealCompo, a new training-free and transferred-friendly text-to-image generation framework, which aims to leverage the respective advantages of text-to-image models and spatial-aware image diffusion models (e.g., layout, keypoints and segmentation maps) to enhance both realism and compositionality of the generated images. An intuitive and novel balancer is proposed to dynamically balance the strengths of the two models in denoising process, allowing plug-and-play use of any model without extra training. Extensive experiments show that our RealCompo consistently outperforms state-of-the-art text-to-image models and spatial-aware image diffusion models in multiple-object compositional generation while keeping satisfactory realism and compositionality of the generated images. Notably, our RealCompo can be seamlessly extended with a wide range of spatial-aware image diffusion models and stylized diffusion models. Our code is available at: https://github.com/YangLing0818/RealCompo

6/5/2024

cs.CV cs.AI cs.LG