Tuning-Free Visual Customization via View Iterative Self-Attention Control

Read original: arXiv:2406.06258 - Published 6/12/2024 by Xiaojie Li, Chenghao Gu, Shuzhao Xie, Yunpeng Bai, Weixiang Zhang, Zhi Wang

Overview

This paper introduces View Iterative Self-Attention Control (VisCtrl), a training-free method for personalized visual editing with diffusion models.
Fine-tuning approaches, including those accelerated by Low-Rank Adaptation (LoRA), still require multiple reference images and time-consuming training, which limits their scalability for large-scale and real-time applications.
VisCtrl instead injects the appearance and structure of a user-specified subject from a single reference image into a subject in the target image through the self-attention mechanism, and it extends from images to videos and 3D scenes.

Plain English Explanation

Personalized editing lets a user replace a subject in an image with a specific subject of their own. The usual way to teach a diffusion model about a new subject is to fine-tune it. Even with faster techniques such as LoRA, this requires several reference images and a training run for each new subject, which is too slow for large-scale or real-time use.

The researchers in this paper have developed a technique called VisCtrl that avoids training altogether. Both the reference image and the target image are first converted back into the noise that a diffusion model starts from, a step called DDIM inversion. As the model denoises the target image, features from the reference image are fed into its self-attention layers. Repeating this injection integrates the reference subject into the target gradually, which the authors report yields consistent and harmonious edits from only one reference image in a few denoising steps.

"Unified Editing of Panorama, 3D Scenes, and Videos Through Language" and "Temporally Consistent Object Editing in Videos Using Extended Prompts" are two related papers that also explore techniques for improving the controllability and consistency of text-based image and video editing.

Technical Explanation

The paper introduces View Iterative Self-Attention Control (VisCtrl), a training-free method that injects the appearance and structure of a user-specified subject into another subject in the target image. Previous approaches fine-tune the diffusion model; even when LoRA accelerates that process, it still requires multiple reference images and time-consuming training.

The method has two stages. First, the initial noise for both the reference image and the target image is obtained through DDIM inversion. Then, during the denoising phase, features from the reference image are injected into the target image via the self-attention mechanism. This feature injection is performed iteratively, so that the reference image features are integrated into the target image gradually rather than all at once.
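
To make the self-attention feature injection described in the paper's abstract more concrete, here is a minimal NumPy sketch. It assumes one common formulation used by tuning-free editing methods: the target image's queries attend to keys and values computed from the reference image. The abstract does not confirm that VisCtrl uses exactly this form, and the function names, shapes, and random data below are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention_weights(q, k):
    """Scaled dot-product attention weights for (tokens, dim) arrays."""
    return softmax(q @ k.T / np.sqrt(q.shape[-1]))

def self_attention(q, k, v):
    """Ordinary self-attention: q, k, v all come from the same image."""
    return attention_weights(q, k) @ v

def inject_reference(q_target, k_ref, v_ref):
    """Hypothetical injection step (an assumption, not the paper's code):
    target queries attend to reference keys and values, so every output
    token is a weighted mix of reference features."""
    return attention_weights(q_target, k_ref) @ v_ref

# Toy features standing in for one self-attention layer of a denoising network.
rng = np.random.default_rng(0)
n_tgt, n_ref, dim = 16, 12, 8
q_tgt = rng.normal(size=(n_tgt, dim))
k_tgt = rng.normal(size=(n_tgt, dim))
v_tgt = rng.normal(size=(n_tgt, dim))
k_ref = rng.normal(size=(n_ref, dim))
v_ref = rng.normal(size=(n_ref, dim))

plain = self_attention(q_tgt, k_tgt, v_tgt)      # target attends to itself
injected = inject_reference(q_tgt, k_ref, v_ref)  # target attends to reference
print(plain.shape, injected.shape)
```

Because the queries still come from the target, the output keeps the target's token layout while its content is drawn from the reference. In an iterative scheme, a step like this would be repeated across denoising steps; how VisCtrl schedules that repetition is not specified in the abstract.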

The authors attribute the method's reach beyond single images to its plug-and-play architecture and to a proposed Feature Gradual Sampling strategy for multi-view editing. They report extensive experiments showing the efficacy of VisCtrl across personalized editing of images, videos, and 3D scenes.

"RefDrop: Controllable Consistency in Image or Video Generation via Reference Feature Guidance" and "LASER: Tuning-Free LLM-Driven Attention Control for Efficient Text-conditioned Image-to-Animation" are two related papers that explore techniques for improving the controllability and consistency of text-guided image and video generation models.

Critical Analysis

The paper presents a promising approach to personalized editing without fine-tuning. Needing only one reference image and no training run addresses the scalability limits the authors identify in LoRA-based pipelines, and the reported experiments span images, videos, and 3D scenes.

However, the abstract does not report the computational cost of DDIM inversion and repeated feature injection, which could matter for the real-time applications the authors have in mind. It also does not indicate which kinds of subjects or edits the method handles well or poorly, which would help in understanding its limitations and potential use cases.

"FreeCustom: Tuning-Free Customized Image Generation with Multi-Modal Prompts" is another related paper that explores techniques for improving customizability and control in text-guided image generation, which could provide additional insights into the challenges and tradeoffs in this domain.

Conclusion

VisCtrl offers a training-free alternative to fine-tuning for personalized visual editing. By inverting the reference and target images with DDIM and iteratively injecting reference features through self-attention, the method transfers the appearance and structure of a subject using a single reference image and a few denoising steps.

As personalized editing moves toward large-scale and real-time use, methods that avoid per-subject training become more attractive. The plug-and-play design and the Feature Gradual Sampling strategy suggest that the same attention-control idea can carry over from single images to videos and 3D scenes.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Tuning-Free Visual Customization via View Iterative Self-Attention Control

Xiaojie Li, Chenghao Gu, Shuzhao Xie, Yunpeng Bai, Weixiang Zhang, Zhi Wang

Fine-Tuning Diffusion Models enable a wide range of personalized generation and editing applications on diverse visual modalities. While Low-Rank Adaptation (LoRA) accelerates the fine-tuning process, it still requires multiple reference images and time-consuming training, which constrains its scalability for large-scale and real-time applications. In this paper, we propose View Iterative Self-Attention Control (VisCtrl) to tackle this challenge. Specifically, VisCtrl is a training-free method that injects the appearance and structure of a user-specified subject into another subject in the target image, unlike previous approaches that require fine-tuning the model. Initially, we obtain the initial noise for both the reference and target images through DDIM inversion. Then, during the denoising phase, features from the reference image are injected into the target image via the self-attention mechanism. Notably, by iteratively performing this feature injection process, we ensure that the reference image features are gradually integrated into the target image. This approach results in consistent and harmonious editing with only one reference image in a few denoising steps. Moreover, benefiting from our plug-and-play architecture design and the proposed Feature Gradual Sampling strategy for multi-view editing, our method can be easily extended to edit in complex visual domains. Extensive experiments show the efficacy of VisCtrl across a spectrum of tasks, including personalized editing of images, videos, and 3D scenes.

6/12/2024

RefDrop: Controllable Consistency in Image or Video Generation via Reference Feature Guidance

Jiaojiao Fan, Haotian Xue, Qinsheng Zhang, Yongxin Chen

There is a rapidly growing interest in controlling consistency across multiple generated images using diffusion models. Among various methods, recent works have found that simply manipulating attention modules by concatenating features from multiple reference images provides an efficient approach to enhancing consistency without fine-tuning. Despite its popularity and success, few studies have elucidated the underlying mechanisms that contribute to its effectiveness. In this work, we reveal that the popular approach is a linear interpolation of image self-attention and cross-attention between synthesized content and reference features, with a constant rank-1 coefficient. Motivated by this observation, we find that a rank-1 coefficient is not necessary and simplifies the controllable generation mechanism. The resulting algorithm, which we coin as RefDrop, allows users to control the influence of reference context in a direct and precise manner. Besides further enhancing consistency in single-subject image generation, our method also enables more interesting applications, such as the consistent generation of multiple subjects, suppressing specific features to encourage more diverse content, and high-quality personalized video generation by boosting temporal consistency. Even compared with state-of-the-art image-prompt-based generators, such as IP-Adapter, RefDrop is competitive in terms of controllability and quality while avoiding the need to train a separate image encoder for feature injection from reference images, making it a versatile plug-and-play solution for any image or video diffusion model.

5/29/2024

LASER: Tuning-Free LLM-Driven Attention Control for Efficient Text-conditioned Image-to-Animation

Haoyu Zheng, Wenqiao Zhang, Yaoke Wang, Hao Zhou, Jiang Liu, Juncheng Li, Zheqi Lv, Siliang Tang, Yueting Zhuang

Revolutionary advancements in text-to-image models have unlocked new dimensions for sophisticated content creation, e.g., text-conditioned image editing, allowing us to edit the diverse images that convey highly complex visual concepts according to the textual guidance. Despite being promising, existing methods focus on texture- or non-rigid-based visual manipulation, which struggles to produce the fine-grained animation of smooth text-conditioned image morphing without fine-tuning, i.e., due to their highly unstructured latent space. In this paper, we introduce a tuning-free LLM-driven attention control framework, encapsulated by the progressive process of LLM planning, prompt-Aware editing, StablE animation geneRation, abbreviated as LASER. LASER employs a large language model (LLM) to refine coarse descriptions into detailed prompts, guiding pre-trained text-to-image models for subsequent image generation. We manipulate the model's spatial features and self-attention mechanisms to maintain animation integrity and enable seamless morphing directly from text prompts, eliminating the need for additional fine-tuning or annotations. Our meticulous control over spatial features and self-attention ensures structural consistency in the images. This paper presents a novel framework integrating LLMs with text-to-image models to create high-quality animations from a single text input. We also propose a Text-conditioned Image-to-Animation Benchmark to validate the effectiveness and efficacy of LASER. Extensive experiments demonstrate that LASER produces impressive, consistent, and efficient results in animation generation, positioning it as a powerful tool for advanced digital content creation.

4/24/2024

Guide-and-Rescale: Self-Guidance Mechanism for Effective Tuning-Free Real Image Editing

Vadim Titov, Madina Khalmatova, Alexandra Ivanova, Dmitry Vetrov, Aibek Alanov

Despite recent advances in large-scale text-to-image generative models, manipulating real images with these models remains a challenging problem. The main limitations of existing editing methods are that they either fail to perform with consistent quality on a wide range of image edits or require time-consuming hyperparameter tuning or fine-tuning of the diffusion model to preserve the image-specific appearance of the input image. We propose a novel approach that is built upon a modified diffusion sampling process via the guidance mechanism. In this work, we explore the self-guidance technique to preserve the overall structure of the input image and its local regions appearance that should not be edited. In particular, we explicitly introduce layout-preserving energy functions that are aimed to save local and global structures of the source image. Additionally, we propose a noise rescaling mechanism that allows to preserve noise distribution by balancing the norms of classifier-free guidance and our proposed guiders during generation. Such a guiding approach does not require fine-tuning the diffusion model and exact inversion process. As a result, the proposed method provides a fast and high-quality editing mechanism. In our experiments, we show through human evaluation and quantitative analysis that the proposed method allows to produce desired editing which is more preferable by humans and also achieves a better trade-off between editing quality and preservation of the original image. Our code is available at https://github.com/FusionBrainLab/Guide-and-Rescale.

9/10/2024