Identity Decoupling for Multi-Subject Personalization of Text-to-Image Models

Read original: arXiv:2404.04243 - Published 5/29/2024 by Sangwon Jang, Jaehyeong Jo, Kimin Lee, Sung Ju Hwang

Overview

Text-to-image diffusion models can generate personalized images based on reference images, but struggle with handling multiple subjects simultaneously
The authors present MuDI, a novel framework that enables multi-subject personalization by effectively decoupling identities from multiple subjects
MuDI uses segmented subjects generated by the Segment Anything Model as a form of data augmentation for training and initialization for the generation process

Plain English Explanation

The paper describes a new system called MuDI that can generate personalized images containing multiple subjects without the subjects' identities getting mixed together. Current text-to-image models often struggle with this and produce subjects that blend attributes from one another.

The key idea behind MuDI is to use a separate AI model, the Segment Anything Model, to cut each subject out of its reference images. These segmented subjects are then used both to train the personalized model and to guide the start of the image generation process, which helps keep the subjects' identities distinct.

This approach allows MuDI to generate high-quality personalized images with multiple subjects, even when the subjects are very similar to each other. In human evaluation, MuDI achieved twice the success rate of existing baselines at personalizing multiple subjects without identity mixing, and was preferred over the strongest baseline more than 70% of the time.

Technical Explanation

The paper presents a novel framework called MuDI that enables effective multi-subject personalization in text-to-image diffusion models. The key innovation is the use of segmented subject information generated by the Segment Anything Model as a form of data augmentation for training and initialization for the generation process.
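To make the augmentation idea concrete, here is a minimal sketch of composing segmented subjects into a new training image. It is an illustration under stated assumptions, not the authors' implementation: the function names `cut_out` and `compose_side_by_side`, the side-by-side band layout, and the use of plain nested lists instead of image tensors are all hypothetical choices made so the example runs without dependencies. The masks stand in for the output of a segmentation model such as Segment Anything.

```python
import random

def cut_out(image, mask):
    """Keep pixels where mask is 1; set everything else to 0 (background)."""
    return [[px if m else 0 for px, m in zip(img_row, mask_row)]
            for img_row, mask_row in zip(image, mask)]

def compose_side_by_side(subjects, canvas_h, canvas_w, rng):
    """Paste segmented subjects onto a blank canvas in separate column bands.

    subjects: list of (image, mask) pairs, each a 2D list of equal shape.
    Each subject gets its own band of columns, so subjects never overlap.
    This layout rule is an assumption made for illustration only.
    """
    canvas = [[0] * canvas_w for _ in range(canvas_h)]
    band_w = canvas_w // len(subjects)
    order = list(range(len(subjects)))
    rng.shuffle(order)  # randomize which subject lands in which band
    for band, idx in enumerate(order):
        image, mask = subjects[idx]
        h, w = len(image), len(image[0])
        if h > canvas_h or w > band_w:
            raise ValueError("subject does not fit in its band")
        top = rng.randint(0, canvas_h - h)               # random vertical offset
        left = band * band_w + rng.randint(0, band_w - w)  # random offset inside the band
        segmented = cut_out(image, mask)
        for r in range(h):
            for c in range(w):
                if mask[r][c]:  # copy only the subject's own pixels
                    canvas[top + r][left + c] = segmented[r][c]
    return canvas

# Usage: two toy "subjects" with different pixel values and mask shapes.
subject_a = ([[5, 5], [5, 5]], [[1, 0], [1, 1]])
subject_b = ([[9, 9], [9, 9]], [[1, 1], [0, 1]])
augmented = compose_side_by_side([subject_a, subject_b], 4, 8, random.Random(0))
```

Because each composed image shows the subjects as clearly separated cut-outs, training on such images plausibly gives the model examples in which each identity stays attached to its own region.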

During training, the subjects segmented from the reference images serve as a form of data augmentation, which helps the model learn to generate images in which the different subjects' identities are decoupled. At inference time, the segmented subjects are also used to initialize the generation process, guiding the model to produce personalized images with distinct subject identities.
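One plausible reading of using segmentation to initialize generation is that the starting noise of the diffusion process is shifted toward a layout built from the segmented subjects. The sketch below illustrates only that reading. The function name `init_from_layout`, the scale `gamma`, and the additive form are assumptions for illustration; the summary does not specify how the initialization is computed.

```python
import random

def init_from_layout(layout, gamma, rng):
    """Return Gaussian noise shifted elementwise by gamma * layout.

    layout: 2D list of floats standing in for an encoded composition of
            segmented subjects (hypothetical input).
    gamma:  hypothetical scale; 0.0 gives plain noise, larger values make
            the starting point carry more of the layout.
    """
    return [[rng.gauss(0.0, 1.0) + gamma * value for value in row]
            for row in layout]

# Usage: the same seed with and without the layout shift.
layout = [[0.0, 1.0], [2.0, 3.0]]
plain_noise = init_from_layout(layout, 0.0, random.Random(7))
shifted_noise = init_from_layout(layout, 0.5, random.Random(7))
```

With `gamma` set to zero the function returns ordinary random noise, so the layout acts as a soft hint about where each subject should appear rather than a hard constraint.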

The authors also introduce a new metric for evaluating multi-subject personalization. Their experiments show that MuDI can generate high-quality personalized images without identity mixing, even for highly similar subjects. In human evaluation, MuDI obtains twice the success rate of existing baselines for personalizing multiple subjects without identity mixing, and is preferred over the strongest baseline more than 70% of the time.

Critical Analysis

The paper presents a compelling approach to the important challenge of enabling multi-subject personalization in text-to-image diffusion models. Using segmented subjects as a form of data augmentation is a clever and effective solution, as it helps the model learn to keep subject identities distinct.

One potential limitation is the reliance on the Segment Anything Model for the segmentation information. While this model has shown strong performance, it introduces an additional dependency that could impact the overall system's robustness and generalization. It would be interesting to explore alternative segmentation approaches or ways to make the MuDI model more self-sufficient in this regard.

Additionally, the paper focuses primarily on the technical evaluation and user studies, but does not delve deeply into potential societal implications or ethical considerations around the use of such personalized image generation systems. As these models become more advanced and widely deployed, it will be crucial to carefully examine their potential for bias and other unintended consequences.

Overall, the MuDI framework represents a significant step forward in addressing a key limitation of current text-to-image models. Its use of segmented subjects for both training-time augmentation and generation-time initialization is likely to inspire further research and development in this important area.

Conclusion

The paper presents MuDI, a novel framework that enables effective multi-subject personalization in text-to-image diffusion models. By using segmented subjects as a form of data augmentation during training and as initialization during generation, MuDI is able to generate high-quality personalized images with distinct subject identities, even for highly similar subjects.

The demonstrated capabilities of MuDI represent a significant advancement in the field of text-to-image generation, with potential applications in areas such as digital art, virtual environments, and personalized marketing. As these technologies continue to evolve, it will be crucial to thoughtfully consider the ethical implications and potential societal impacts to ensure they are developed and deployed responsibly.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Identity Decoupling for Multi-Subject Personalization of Text-to-Image Models

Sangwon Jang, Jaehyeong Jo, Kimin Lee, Sung Ju Hwang

Text-to-image diffusion models have shown remarkable success in generating personalized subjects based on a few reference images. However, current methods often fail when generating multiple subjects simultaneously, resulting in mixed identities with combined attributes from different subjects. In this work, we present MuDI, a novel framework that enables multi-subject personalization by effectively decoupling identities from multiple subjects. Our main idea is to utilize segmented subjects generated by a foundation model for segmentation (Segment Anything) for both training and inference, as a form of data augmentation for training and initialization for the generation process. Moreover, we further introduce a new metric to better evaluate the performance of our method on multi-subject personalization. Experimental results show that our MuDI can produce high-quality personalized images without identity mixing, even for highly similar subjects as shown in Figure 1. Specifically, in human evaluation, MuDI obtains twice the success rate for personalizing multiple subjects without identity mixing over existing baselines and is preferred over 70% against the strongest baseline.

5/29/2024

Multi-Subject Personalization

Arushi Jain, Shubham Paliwal, Monika Sharma, Vikram Jamwal, Lovekesh Vig

Creative story illustration requires a consistent interplay of multiple characters or objects. However, conventional text-to-image models face significant challenges while producing images featuring multiple personalized subjects. For example, they distort the subject rendering, or the text descriptions fail to render coherent subject interactions. We present Multi-Subject Personalization (MSP) to alleviate some of these challenges. We implement MSP using Stable Diffusion and assess our approach against other text-to-image models, showcasing its consistent generation of good-quality images representing intended subjects and interactions.

5/22/2024

MS-Diffusion: Multi-subject Zero-shot Image Personalization with Layout Guidance

X. Wang, Siming Fu, Qihan Huang, Wanggui He, Hao Jiang

Recent advancements in text-to-image generation models have dramatically enhanced the generation of photorealistic images from textual prompts, leading to an increased interest in personalized text-to-image applications, particularly in multi-subject scenarios. However, these advances are hindered by two main challenges: firstly, the need to accurately maintain the details of each referenced subject in accordance with the textual descriptions; and secondly, the difficulty in achieving a cohesive representation of multiple subjects in a single image without introducing inconsistencies. To address these concerns, our research introduces the MS-Diffusion framework for layout-guided zero-shot image personalization with multi-subjects. This innovative approach integrates grounding tokens with the feature resampler to maintain detail fidelity among subjects. With the layout guidance, MS-Diffusion further improves the cross-attention to adapt to the multi-subject inputs, ensuring that each subject condition acts on specific areas. The proposed multi-subject cross-attention orchestrates harmonious inter-subject compositions while preserving the control of texts. Comprehensive quantitative and qualitative experiments affirm that this method surpasses existing models in both image and text fidelity, promoting the development of personalized text-to-image generation.

6/12/2024

Subject-Diffusion: Open Domain Personalized Text-to-Image Generation without Test-time Fine-tuning

Jian Ma, Junhao Liang, Chen Chen, Haonan Lu

Recent progress in personalized image generation using diffusion models has been significant. However, development in the area of open-domain and non-fine-tuning personalized image generation is proceeding rather slowly. In this paper, we propose Subject-Diffusion, a novel open-domain personalized image generation model that, in addition to not requiring test-time fine-tuning, also only requires a single reference image to support personalized generation of single- or multi-subject in any domain. Firstly, we construct an automatic data labeling tool and use the LAION-Aesthetics dataset to construct a large-scale dataset consisting of 76M images and their corresponding subject detection bounding boxes, segmentation masks and text descriptions. Secondly, we design a new unified framework that combines text and image semantics by incorporating coarse location and fine-grained reference image control to maximize subject fidelity and generalization. Furthermore, we also adopt an attention control mechanism to support multi-subject generation. Extensive qualitative and quantitative results demonstrate that our method outperforms other SOTA frameworks in single, multiple, and human customized image generation. Please refer to our project page: https://oppo-mente-lab.github.io/subject_diffusion/

5/21/2024