AttenCraft: Attention-guided Disentanglement of Multiple Concepts for Text-to-Image Customization

Read original: arXiv:2405.17965 - Published 5/29/2024 by Junjie Shentu, Matthew Watson, Noura Al Moubayed

AttenCraft: Attention-guided Disentanglement of Multiple Concepts for Text-to-Image Customization

Overview

• The paper presents a novel approach called AttenCraft for text-to-image customization, which leverages attention-guided disentanglement of multiple concepts. • The method aims to enable fine-grained control over the generation of images by allowing users to specify and manipulate different semantic concepts independently. • This contrasts with previous text-to-image models that often struggle to capture and separate complex, multi-faceted visual concepts from natural language descriptions.

Plain English Explanation

The research paper introduces a new technique called AttenCraft that makes it easier to customize images generated from text descriptions. Typically, when you describe an image using words, it can be difficult for AI systems to fully capture all the different elements you want to see, like the specific objects, colors, or styles.

AttenCraft addresses this by allowing you to independently control and adjust different "concepts" in the image, such as the type of car, the background setting, the weather, and so on. This is achieved through an attention-based approach that helps the model understand the relationship between the text and the different visual components it should create.

Rather than generating a single, fixed image from a text description, AttenCraft gives you more granular control to fine-tune the image to your preferences. This could be useful for applications like customized image generation or multi-concept fusion where you want to blend together different visual elements in a coherent way.

Technical Explanation

The key innovation in AttenCraft is the use of an attention mechanism to "disentangle" the different semantic concepts present in the text description. This allows the model to separately encode and generate the various visual components, rather than treating the text as a monolithic input.

The architecture consists of an encoder that takes the text prompt and produces an attention map highlighting the relevant parts for each concept. This attention information is then used to guide the generation of the corresponding visual elements in the decoder. By attending to the relevant parts of the text, the model can focus on generating the specific aspects of the image that the user wants to customize.

The paper evaluates AttenCraft on several benchmarks for text-to-image generation and customization, including Attention Calibration, FreeCustom, and MCDollar2Dollar. The results demonstrate that the attention-guided disentanglement approach outperforms previous state-of-the-art methods in terms of both quantitative and qualitative measures of image quality and customization capabilities.

Critical Analysis

The main strength of AttenCraft is its ability to provide users with fine-grained control over the generation of images from text, which addresses a key limitation of existing text-to-image models. By separating the different semantic concepts, the approach allows for more targeted manipulation and customization of the generated images.

However, the paper acknowledges that the current implementation of AttenCraft is limited to a predefined set of concepts, which may not capture the full complexity of real-world visual scenes. Extending the model to handle a more open-ended and diverse set of concepts is an important area for future research, as mentioned in the DisEnStudio paper.

Additionally, while the quantitative results demonstrate the effectiveness of the attention-guided disentanglement approach, the paper could have provided more in-depth analysis of the model's limitations and failure cases. Understanding the types of prompts or concepts that the model struggles with could inform future improvements and help users better gauge the capabilities and limitations of the system.

Conclusion

The AttenCraft paper presents a novel technique for text-to-image customization that leverages attention-guided disentanglement of multiple semantic concepts. By allowing users to independently control and manipulate different visual elements, the approach offers a more fine-grained level of customization compared to previous text-to-image models.

The results show the potential of this attention-based approach for applications that require tailored image generation, such as multi-concept fusion or personalized image generation. As the field of text-to-image generation continues to advance, techniques like AttenCraft that enable greater user control and customization could become increasingly valuable for a wide range of visual content creation tasks.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Follow @aimodelsfyi on 𝕏 →

Related Papers

AttenCraft: Attention-guided Disentanglement of Multiple Concepts for Text-to-Image Customization

Junjie Shentu, Matthew Watson, Noura Al Moubayed

With the unprecedented performance being achieved by text-to-image (T2I) diffusion models, T2I customization further empowers users to tailor the diffusion model to new concepts absent in the pre-training dataset, termed subject-driven generation. Moreover, extracting several new concepts from a single image enables the model to learn multiple concepts, and simultaneously decreases the difficulties of training data preparation, urging the disentanglement of multiple concepts to be a new challenge. However, existing models for disentanglement commonly require pre-determined masks or retain background elements. To this end, we propose an attention-guided method, AttenCraft, for multiple concept disentanglement. In particular, our method leverages self-attention and cross-attention maps to create accurate masks for each concept within a single initialization step, omitting any required mask preparation by humans or other models. The created masks are then applied to guide the cross-attention activation of each target concept during training and achieve concept disentanglement. Additionally, we introduce Uniform sampling and Reweighted sampling schemes to alleviate the non-synchronicity of feature acquisition from different concepts, and improve generation quality. Our method outperforms baseline models in terms of image-alignment, and behaves comparably on text-alignment. Finally, we showcase the applicability of AttenCraft to more complicated settings, such as an input image containing three concepts. The project is available at https://github.com/junjie-shentu/AttenCraft.

5/29/2024

Attention Calibration for Disentangled Text-to-Image Personalization

Yanbing Zhang, Mengping Yang, Qin Zhou, Zhe Wang

Recent thrilling progress in large-scale text-to-image (T2I) models has unlocked unprecedented synthesis quality of AI-generated content (AIGC) including image generation, 3D and video composition. Further, personalized techniques enable appealing customized production of a novel concept given only several images as reference. However, an intriguing problem persists: Is it possible to capture multiple, novel concepts from one single reference image? In this paper, we identify that existing approaches fail to preserve visual consistency with the reference image and eliminate cross-influence from concepts. To alleviate this, we propose an attention calibration mechanism to improve the concept-level understanding of the T2I model. Specifically, we first introduce new learnable modifiers bound with classes to capture attributes of multiple concepts. Then, the classes are separated and strengthened following the activation of the cross-attention operation, ensuring comprehensive and self-contained concepts. Additionally, we suppress the attention activation of different classes to mitigate mutual influence among concepts. Together, our proposed method, dubbed DisenDiff, can learn disentangled multiple concepts from one single image and produce novel customized images with learned concepts. We demonstrate that our method outperforms the current state of the art in both qualitative and quantitative evaluations. More importantly, our proposed techniques are compatible with LoRA and inpainting pipelines, enabling more interactive experiences.

4/12/2024

🖼️

Visual Concept-driven Image Generation with Text-to-Image Diffusion Model

Tanzila Rahman, Shweta Mahajan, Hsin-Ying Lee, Jian Ren, Sergey Tulyakov, Leonid Sigal

Text-to-image (TTI) diffusion models have demonstrated impressive results in generating high-resolution images of complex and imaginative scenes. Recent approaches have further extended these methods with personalization techniques that allow them to integrate user-illustrated concepts (e.g., the user him/herself) using a few sample image illustrations. However, the ability to generate images with multiple interacting concepts, such as human subjects, as well as concepts that may be entangled in one, or across multiple, image illustrations remains illusive. In this work, we propose a concept-driven TTI personalization framework that addresses these core challenges. We build on existing works that learn custom tokens for user-illustrated concepts, allowing those to interact with existing text tokens in the TTI model. However, importantly, to disentangle and better learn the concepts in question, we jointly learn (latent) segmentation masks that disentangle these concepts in user-provided image illustrations. We do so by introducing an Expectation Maximization (EM)-like optimization procedure where we alternate between learning the custom tokens and estimating (latent) masks encompassing corresponding concepts in user-supplied images. We obtain these masks based on cross-attention, from within the U-Net parameterized latent diffusion model and subsequent DenseCRF optimization. We illustrate that such joint alternating refinement leads to the learning of better tokens for concepts and, as a by-product, latent masks. We illustrate the benefits of the proposed approach qualitatively and quantitatively with several examples and use cases that can combine three or more entangled concepts.

7/18/2024

🖼️

FreeCustom: Tuning-Free Customized Image Generation for Multi-Concept Composition

Ganggui Ding, Canyu Zhao, Wen Wang, Zhen Yang, Zide Liu, Hao Chen, Chunhua Shen

Benefiting from large-scale pre-trained text-to-image (T2I) generative models, impressive progress has been achieved in customized image generation, which aims to generate user-specified concepts. Existing approaches have extensively focused on single-concept customization and still encounter challenges when it comes to complex scenarios that involve combining multiple concepts. These approaches often require retraining/fine-tuning using a few images, leading to time-consuming training processes and impeding their swift implementation. Furthermore, the reliance on multiple images to represent a singular concept increases the difficulty of customization. To this end, we propose FreeCustom, a novel tuning-free method to generate customized images of multi-concept composition based on reference concepts, using only one image per concept as input. Specifically, we introduce a new multi-reference self-attention (MRSA) mechanism and a weighted mask strategy that enables the generated image to access and focus more on the reference concepts. In addition, MRSA leverages our key finding that input concepts are better preserved when providing images with context interactions. Experiments show that our method's produced images are consistent with the given concepts and better aligned with the input text. Our method outperforms or performs on par with other training-based methods in terms of multi-concept composition and single-concept customization, but is simpler. Codes can be found at https://github.com/aim-uofa/FreeCustom.

5/24/2024