Subject-Diffusion: Open Domain Personalized Text-to-Image Generation without Test-time Fine-tuning

2307.11410

Published 5/21/2024 by Jian Ma, Junhao Liang, Chen Chen, Haonan Lu

Abstract

Recent progress in personalized image generation using diffusion models has been significant. However, development in the area of open-domain and non-fine-tuning personalized image generation is proceeding rather slowly. In this paper, we propose Subject-Diffusion, a novel open-domain personalized image generation model that, in addition to not requiring test-time fine-tuning, also only requires a single reference image to support personalized generation of single- or multi-subject in any domain. Firstly, we construct an automatic data labeling tool and use the LAION-Aesthetics dataset to construct a large-scale dataset consisting of 76M images and their corresponding subject detection bounding boxes, segmentation masks and text descriptions. Secondly, we design a new unified framework that combines text and image semantics by incorporating coarse location and fine-grained reference image control to maximize subject fidelity and generalization. Furthermore, we also adopt an attention control mechanism to support multi-subject generation. Extensive qualitative and quantitative results demonstrate that our method outperforms other SOTA frameworks in single, multiple, and human customized image generation. Please refer to our project page: https://oppo-mente-lab.github.io/subject_diffusion/

Overview

Recent progress in personalized image generation using diffusion models has been significant.
However, development in the area of open-domain and non-fine-tuning personalized image generation is proceeding slowly.
This paper proposes a novel model called Subject-Diffusion that can generate personalized images without requiring test-time fine-tuning, using only a single reference image.

Plain English Explanation

The paper describes a new approach for generating personalized images with diffusion models. A diffusion model is a type of generative AI that creates an image by starting from random noise and removing that noise step by step until a picture matching the prompt emerges.

Typically, these models require a lot of training data and may need to be "fine-tuned" on specific images before they can generate personalized content. However, the researchers behind this paper have developed a new model called Subject-Diffusion that can generate personalized images without needing that extra fine-tuning step.

The key innovation is that Subject-Diffusion can use just a single reference image to guide the generation process. This makes it more flexible and practical for real-world applications, where users may have only one example image to work with.

The researchers also developed a way to handle generating images with multiple subjects, which is an important capability for things like family portraits or group photos. Overall, this work represents a significant advance in the field of personalized image generation using diffusion models.

Technical Explanation

The paper first describes an automatic data labeling tool that the researchers applied to the LAION-Aesthetics dataset to construct a large-scale dataset of 76 million images, each annotated with subject detection bounding boxes, segmentation masks, and text descriptions. The resulting dataset provides the training data for the model and a resource for other work on personalized image generation.
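To make the annotation types concrete, here is a minimal sketch of how one labeled sample could be represented. The class names, field names, and helper functions are hypothetical and chosen for illustration; the summary does not describe the paper's actual storage format.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class SubjectAnnotation:
    """One detected subject in an image (hypothetical schema)."""
    label: str                       # entity word from the caption, e.g. "dog"
    bbox: Tuple[int, int, int, int]  # (x0, y0, x1, y1) in pixels, x1/y1 exclusive
    mask: List[List[int]]            # binary segmentation mask, H rows x W columns


@dataclass
class LabeledImage:
    """An image with its text description and subject annotations."""
    image_path: str
    caption: str
    width: int
    height: int
    subjects: List[SubjectAnnotation] = field(default_factory=list)


def bbox_to_mask(bbox, width, height):
    """Rasterize a bounding box into a binary H x W mask (a coarse location map)."""
    x0, y0, x1, y1 = bbox
    return [[1 if (x0 <= x < x1 and y0 <= y < y1) else 0 for x in range(width)]
            for y in range(height)]


def mask_inside_bbox(subject):
    """Sanity check: every foreground mask pixel lies inside the bounding box."""
    x0, y0, x1, y1 = subject.bbox
    return all(x0 <= x < x1 and y0 <= y < y1
               for y, row in enumerate(subject.mask)
               for x, value in enumerate(row) if value)


# Toy 4x4 example: the box gives the coarse location, the mask the fine shape.
dog = SubjectAnnotation(
    label="dog",
    bbox=(1, 1, 3, 3),
    mask=[[0, 0, 0, 0], [0, 1, 1, 0], [0, 0, 1, 0], [0, 0, 0, 0]],
)
sample = LabeledImage("example.jpg", "a dog on the grass", 4, 4, [dog])
```

The bounding box and the segmentation mask carry different levels of detail about the same subject, which is why a consistency check such as `mask_inside_bbox` is a natural filter when annotations are produced automatically.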

The core of the Subject-Diffusion model is a novel unified framework that combines text and image semantics. It incorporates coarse location and fine-grained reference image control to maximize subject fidelity and generalization. This allows the model to generate personalized images that closely match the target subject, without requiring fine-tuning on that specific subject.
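One simple way to picture "combining text and image semantics" is to blend a reference-image embedding into the text embedding at the position of the subject word. The sketch below does this with a fixed linear blend on plain Python lists. It is an illustration under assumptions: the function name, the `weight` parameter, and the blending rule are hypothetical, and the paper's framework uses learned components that this summary does not detail.

```python
from typing import List

Vector = List[float]


def fuse_subject_embedding(text_tokens: List[Vector],
                           subject_index: int,
                           image_embedding: Vector,
                           weight: float = 0.5) -> List[Vector]:
    """Return a copy of text_tokens with the subject token blended with image_embedding.

    weight = 0 keeps the pure text embedding; weight = 1 replaces it with the
    image embedding. A fixed blend stands in for a learned fusion module here.
    """
    token = text_tokens[subject_index]
    if len(token) != len(image_embedding):
        raise ValueError("text and image embeddings must have the same dimension")
    fused = [list(t) for t in text_tokens]  # copy so the input is not mutated
    fused[subject_index] = [(1.0 - weight) * t + weight * i
                            for t, i in zip(token, image_embedding)]
    return fused


# Toy prompt with two 2-dimensional token embeddings; token 1 is the subject word.
prompt = [[1.0, 0.0], [0.0, 2.0]]
reference = [4.0, 0.0]
conditioned = fuse_subject_embedding(prompt, subject_index=1, image_embedding=reference)
```

Only the subject token changes, so the rest of the prompt keeps steering the scene while the subject position carries appearance information from the reference image.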

Furthermore, the researchers adopted an attention control mechanism to support the generation of images with multiple subjects. This enables the model to handle complex scenes with multiple people or objects.
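As a rough illustration of attention control, assume it means confining each subject token's cross-attention to that subject's region of the image. The sketch below measures how much attention leaks outside a subject mask and shows a hard restriction that removes the leak. Both function names are hypothetical, and the paper's actual mechanism may differ from this simplification.

```python
from typing import List

Grid = List[List[float]]


def attention_leakage(attn: Grid, mask: List[List[int]]) -> float:
    """Fraction of a token's attention mass that falls outside its subject mask."""
    total = sum(sum(row) for row in attn)
    if total == 0:
        return 0.0
    outside = sum(a for attn_row, mask_row in zip(attn, mask)
                  for a, m in zip(attn_row, mask_row) if not m)
    return outside / total


def restrict_attention(attn: Grid, mask: List[List[int]]) -> Grid:
    """Zero attention outside the mask, then renormalize so the map sums to 1."""
    masked = [[a * m for a, m in zip(attn_row, mask_row)]
              for attn_row, mask_row in zip(attn, mask)]
    total = sum(sum(row) for row in masked)
    if total == 0:
        return masked  # nothing inside the mask to renormalize
    return [[value / total for value in row] for row in masked]


# Toy 2x2 attention map for one subject token; the subject occupies the left column.
attn = [[0.25, 0.25], [0.25, 0.25]]
mask = [[1, 0], [1, 0]]
leak_before = attention_leakage(attn, mask)
controlled = restrict_attention(attn, mask)
leak_after = attention_leakage(controlled, mask)
```

With two subjects, each token would get its own mask, so the two attention maps cannot overlap and the subjects are less likely to blend into one another. A leakage value like the one above could also serve as a training penalty instead of a hard mask.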

The paper reports extensive qualitative and quantitative evaluations in which Subject-Diffusion outperforms other state-of-the-art frameworks on single-subject, multi-subject, and human-subject customized image generation. On that evidence, the authors position the approach as a significant advance in personalized image generation with diffusion models.

Critical Analysis

The paper presents a compelling solution to the challenge of open-domain and non-fine-tuning personalized image generation. The Subject-Diffusion model's ability to generate personalized images using only a single reference image is a notable innovation that could make this technology more accessible and practical for real-world applications.

One potential limitation of the research is the reliance on the LAION-Aesthetics dataset, which may not be representative of all possible subjects and domains. It would be interesting to see how well the Subject-Diffusion model performs on more diverse or specialized datasets.

Additionally, the paper does not provide much insight into the computational efficiency or real-world deployment considerations of the proposed model. Further research could explore the trade-offs between model complexity, inference time, and generation quality.

Overall, the Subject-Diffusion model represents a significant step forward in the field of personalized image generation, and the researchers' attention control mechanism for multi-subject generation is a promising direction for future work. As the field continues to evolve, it will be interesting to see how this approach compares to other emerging text-to-image synthesis and few-shot generation techniques.

Conclusion

This paper presents a novel personalized image generation model called Subject-Diffusion that can generate high-quality images without requiring test-time fine-tuning, using only a single reference image. The key innovations include a unified framework that combines text and image semantics, as well as an attention control mechanism for multi-subject generation.

The extensive evaluation results demonstrate that Subject-Diffusion outperforms other state-of-the-art personalized image generation models, making it a promising approach for a wide range of applications, from custom artwork creation to personalized avatars and virtual photography.

As the field of personalized image generation continues to advance, this work represents an important step forward in providing users with more powerful and flexible tools for creating personalized visual content.
