Improving Subject-Driven Image Synthesis with Subject-Agnostic Guidance






Published 5/3/2024 by Kelvin C. K. Chan, Yang Zhao, Xuhui Jia, Ming-Hsuan Yang, Huisheng Wang


In subject-driven text-to-image synthesis, the synthesis process tends to be heavily influenced by the reference images provided by users, often overlooking crucial attributes detailed in the text prompt. In this work, we propose Subject-Agnostic Guidance (SAG), a simple yet effective solution to remedy the problem. We show that through constructing a subject-agnostic condition and applying our proposed dual classifier-free guidance, one could obtain outputs consistent with both the given subject and input text prompts. We validate the efficacy of our approach through both optimization-based and encoder-based methods. Additionally, we demonstrate its applicability in second-order customization methods, where an encoder-based model is fine-tuned with DreamBooth. Our approach is conceptually simple and requires only minimal code modifications, but leads to substantial quality improvements, as evidenced by our evaluations and user studies.



  • Presents Subject-Agnostic Guidance (SAG), an approach for improving subject-driven text-to-image synthesis
  • Aims to address the tendency of existing methods to follow the reference images so closely that attributes in the text prompt are overlooked
  • Introduces a subject-agnostic condition and a dual classifier-free guidance scheme that require only minimal code modifications

Plain English Explanation

This research paper proposes a way to generate images of a specific subject shown in user-provided reference images, while still following what the text prompt asks for. Existing methods for this kind of "subject-driven image synthesis" tend to be heavily influenced by the reference images and often overlook important attributes described in the prompt.

The researchers address this with Subject-Agnostic Guidance (SAG). In addition to the usual subject-specific condition, they construct a "subject-agnostic" condition, one that is not tied to the particular subject, and use it to guide generation. According to the authors, this needs only minimal code changes to existing methods.

The key idea is to use both conditions together through what the authors call dual classifier-free guidance, so that the output stays consistent with the given subject and with the text prompt. The goal is images that are grounded in the target subject but not so constrained by the reference images that the prompt's attributes are lost.

Technical Explanation

The paper introduces Subject-Agnostic Guidance (SAG), a guidance technique for subject-driven text-to-image synthesis. Rather than a new model architecture, the abstract describes two ingredients: a subject-agnostic condition and a dual classifier-free guidance scheme, which together require only minimal code modifications to existing methods.

The subject-specific condition is the usual input to a subject-driven method: the text prompt together with the information derived from the user's reference images. The subject-agnostic condition, as its name suggests, is a counterpart that does not depend on the particular subject, so it reflects the attributes described in the text prompt rather than the reference images.

Dual classifier-free guidance combines the two. Standard classifier-free guidance steers sampling by extrapolating from an unconditional prediction toward a conditional one. The dual variant guides with both the subject-agnostic and the subject-specific condition, so that the output stays consistent with the text prompt as well as with the given subject.
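
The abstract names the mechanism as dual classifier-free guidance over a subject-agnostic condition but does not give a formula. The following is a minimal sketch of one plausible form, assuming a nested combination of three noise predictions similar to the multi-condition classifier-free guidance used elsewhere in diffusion models. The function names, the weights `w_text` and `w_subject`, and the random arrays standing in for a denoiser's outputs are all illustrative assumptions, not the authors' implementation.

```python
import numpy as np


def classifier_free_guidance(eps_uncond, eps_cond, scale):
    """Standard classifier-free guidance: extrapolate from the
    unconditional prediction toward the conditional one."""
    return eps_uncond + scale * (eps_cond - eps_uncond)


def dual_classifier_free_guidance(eps_uncond, eps_agnostic, eps_subject,
                                  w_text, w_subject):
    """Assumed nested form of dual guidance (hypothetical, not taken
    from the paper).

    eps_uncond   -- prediction with no condition
    eps_agnostic -- prediction with the subject-agnostic condition
    eps_subject  -- prediction with the subject-specific condition
    w_text       -- weight on the direction toward the text prompt
    w_subject    -- weight on the direction toward the subject
    """
    text_direction = eps_agnostic - eps_uncond
    subject_direction = eps_subject - eps_agnostic
    return eps_uncond + w_text * text_direction + w_subject * subject_direction


# Toy stand-ins for the three noise predictions a denoiser would return
# at one sampling step (random arrays, for illustration only).
rng = np.random.default_rng(0)
shape = (4, 8, 8)
eps_uncond = rng.standard_normal(shape)
eps_agnostic = rng.standard_normal(shape)
eps_subject = rng.standard_normal(shape)

guided = dual_classifier_free_guidance(
    eps_uncond, eps_agnostic, eps_subject, w_text=7.5, w_subject=2.0
)
```

In this assumed form, setting `w_subject = 0` reduces the sketch to ordinary guidance on the subject-agnostic condition, and setting both weights equal reduces it to ordinary guidance on the subject-specific condition, so the two weights trade off prompt fidelity against subject fidelity.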

The researchers validate SAG on both optimization-based and encoder-based subject-driven methods. They also demonstrate it in second-order customization, where an encoder-based model is fine-tuned with DreamBooth. According to the abstract, the authors' evaluations and user studies show substantial quality improvements.

Critical Analysis

The paper presents a compelling approach to improving subject-driven image synthesis by incorporating subject-agnostic guidance. The key strength of this work is the recognition that leveraging both subject-specific and subject-agnostic information can lead to more versatile and compelling generated images.

One potential limitation, however, is cost at sampling time. Classifier-free guidance already requires more than one noise prediction per denoising step, and guiding with an additional subject-agnostic condition plausibly adds another, which may increase inference time unless the predictions are batched.

Additionally, the abstract does not say how the subject-agnostic condition is constructed or how the guidance is balanced between the subject and the prompt. Readers should consult the full paper for these details before applying the method.

Overall, SAG represents a promising step forward in subject-driven image synthesis, and the incorporation of subject-agnostic guidance is a compelling idea that merits further exploration and refinement.


Conclusion

This research paper presents an approach for improving subject-driven image synthesis by combining subject-specific and subject-agnostic guidance. The proposed method, SAG, is reported to produce outputs that are consistent with both the given subject and the text prompt, with substantial quality improvements in the authors' evaluations and user studies.

The key innovation of this work is the recognition that a subject-agnostic condition, used alongside the subject-specific one, can keep the reference images from overriding the text prompt. Because it is applied through dual classifier-free guidance, the method requires only minimal code modifications.

This research opens up new avenues for advancing subject-driven image synthesis, with potential applications in areas such as creative content generation, product visualization, and visual storytelling. As the field continues to evolve, incorporating subject-agnostic guidance may prove to be an increasingly important strategy for developing more versatile and user-friendly image synthesis systems.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!
