The Chosen One: Consistent Characters in Text-to-Image Diffusion Models

arXiv:2311.10093

Published 6/6/2024 by Omri Avrahami, Amir Hertz, Yael Vinker, Moab Arar, Shlomi Fruchter, Ohad Fried, Daniel Cohen-Or, Dani Lischinski

cs.CV cs.GR cs.LG

Abstract

Recent advances in text-to-image generation models have unlocked vast potential for visual creativity. However, the users that use these models struggle with the generation of consistent characters, a crucial aspect for numerous real-world applications such as story visualization, game development, asset design, advertising, and more. Current methods typically rely on multiple pre-existing images of the target character or involve labor-intensive manual processes. In this work, we propose a fully automated solution for consistent character generation, with the sole input being a text prompt. We introduce an iterative procedure that, at each stage, identifies a coherent set of images sharing a similar identity and extracts a more consistent identity from this set. Our quantitative analysis demonstrates that our method strikes a better balance between prompt alignment and identity consistency compared to the baseline methods, and these findings are reinforced by a user study. To conclude, we showcase several practical applications of our approach.

Overview

Recent advances in text-to-image generation models have greatly expanded visual creativity, but these models struggle to generate consistent characters, a capability that is crucial for many real-world applications.
Current methods typically rely on multiple pre-existing images of the target character or labor-intensive manual processes.
This work proposes a fully automated solution for consistent character generation, with only a text prompt as input.

Plain English Explanation

Text-to-image generation models have made it much easier for people to create visual content. These models can take a written description and generate an image to match it. However, these models often have trouble creating characters that look the same across multiple images. This consistency is important for things like story visualization, game development, and advertising.

The typical solutions to this problem involve either using many pre-existing images of the character or manually tweaking the images. This new research proposes a fully automatic way to generate consistent character images from just a text description. It does this by repeatedly identifying a set of similar images and extracting a more consistent character identity from that set. The result is a better trade-off between two goals: images that match the text prompt, and a character that looks the same from one image to the next.

Technical Explanation

The key innovation of this work is an iterative procedure that, at each stage, identifies a coherent set of images sharing a similar identity and extracts a more consistent identity from this set. This is achieved through a series of clustering and extraction steps.

First, the method generates an initial set of character images from the text prompt. It then groups these images into clusters of similar appearance and selects a coherent cluster, that is, a set of images sharing a similar identity. From this set it extracts a more consistent character representation, which is used to generate a new set of images. The process is repeated until a satisfactory level of consistency is reached.
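The loop described above can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not the authors' implementation: the summary does not name the feature extractor, the clustering algorithm, or the stopping rule, so the sketch assumes plain k-means, picks the single most cohesive cluster (smallest mean distance to its centroid), and stops when the mean pairwise distance falls below a threshold. The function `toy_generate` is a hypothetical stand-in for "generate images and embed them", and the identity-extraction step is replaced by simply re-centering and narrowing the sampling distribution.

```python
import numpy as np

def kmeans(x, k, rng, iters=20):
    """Plain Lloyd's k-means (an assumption of this sketch); returns (labels, centroids)."""
    centroids = x[rng.choice(len(x), size=k, replace=False)]
    for _ in range(iters):
        dists = np.linalg.norm(x[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            members = x[labels == j]
            if len(members) > 0:  # keep the old centroid if a cluster empties
                centroids[j] = members.mean(axis=0)
    # Final assignment so labels agree with the returned centroids.
    dists = np.linalg.norm(x[:, None, :] - centroids[None, :, :], axis=2)
    return dists.argmin(axis=1), centroids

def most_cohesive_cluster(x, labels, centroids, min_size=3):
    """Indices of the cluster whose members lie closest to their centroid on average."""
    best, best_score = None, np.inf
    for j in range(len(centroids)):
        idx = np.flatnonzero(labels == j)
        if len(idx) < min_size:
            continue
        score = np.linalg.norm(x[idx] - centroids[j], axis=1).mean()
        if score < best_score:
            best, best_score = idx, score
    return best

def mean_pairwise_distance(x):
    """Average Euclidean distance over all ordered pairs of distinct rows."""
    d = np.linalg.norm(x[:, None, :] - x[None, :, :], axis=2)
    n = len(x)
    return d.sum() / (n * (n - 1))

def toy_generate(center, spread, n, rng):
    # HYPOTHETICAL stand-in for "generate n images and embed them":
    # samples embeddings around the current identity.
    return center + spread * rng.standard_normal((n, center.shape[0]))

def refine_identity(dim=8, n=64, k=4, threshold=1.0, max_rounds=40, seed=0):
    """Iterate generate -> cluster -> select -> extract until embeddings are tight."""
    rng = np.random.default_rng(seed)
    center, spread = np.zeros(dim), 1.0
    history = []
    for _ in range(max_rounds):
        x = toy_generate(center, spread, n, rng)
        history.append(mean_pairwise_distance(x))
        if history[-1] < threshold:  # assumed convergence criterion
            break
        labels, centroids = kmeans(x, k, rng)
        idx = most_cohesive_cluster(x, labels, centroids)
        # Stand-in for identity extraction: re-center on the chosen cluster
        # and narrow the sampling spread to that cluster's spread.
        center = x[idx].mean(axis=0)
        spread = x[idx].std(axis=0).mean()
    return center, history

center, history = refine_identity()
print(f"rounds: {len(history)}, first spread: {history[0]:.2f}, last: {history[-1]:.2f}")
```

In a real pipeline, `toy_generate` would call a text-to-image model and an image encoder, and the re-centering step would be replaced by whatever procedure extracts the identity from the selected images. The control flow is the part this sketch is meant to convey.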

The authors' quantitative analysis shows that this approach strikes a better balance between prompt alignment and identity consistency compared to baseline methods. These findings are further reinforced by a user study.
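To make the two quantities concrete, here is a minimal sketch of how such scores are commonly computed from embeddings. The summary does not say which embedding model or exact formulas the authors use, so everything here is an assumption: identity consistency is taken as the mean pairwise cosine similarity between image embeddings, and prompt alignment as the mean cosine similarity between each image embedding and the embedding of its prompt. The function names are hypothetical.

```python
import numpy as np

def _unit(v):
    """Normalize each row to unit length."""
    v = np.asarray(v, dtype=float)
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

def identity_consistency(image_embs):
    """Mean cosine similarity over all pairs of distinct images (assumed metric)."""
    e = _unit(image_embs)
    sims = e @ e.T
    n = len(e)
    # Subtract the diagonal (self-similarity) before averaging.
    return float((sims.sum() - np.trace(sims)) / (n * (n - 1)))

def prompt_alignment(image_embs, text_embs):
    """Mean cosine similarity between each image and its own prompt (assumed metric)."""
    return float(np.mean(np.sum(_unit(image_embs) * _unit(text_embs), axis=1)))
```

The two scores pull against each other: a method that returns the same image for every prompt maximizes identity consistency but scores poorly on prompt alignment, while a method that ignores identity can follow each prompt closely yet produce a different character every time. That tension is why the comparison is framed as a balance.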

Critical Analysis

The paper provides a valuable contribution by addressing the important challenge of generating consistent characters from text prompts. However, the authors acknowledge some limitations of their approach.

One key limitation is that the iterative clustering and extraction process can be computationally expensive, especially for large-scale generation. The authors suggest exploring ways to make this process more efficient.

Additionally, the paper does not explore the generalization of the method to more diverse character types or the incorporation of additional modalities, such as voice or personality traits, to further enhance character consistency.

Further research could also investigate the robustness of the method to various types of text prompts and its ability to handle complex character relationships or narratives.

Conclusion

This research presents a promising step towards fully automated, text-driven generation of consistent characters, which has numerous practical applications in fields like story visualization, game development, and advertising. The iterative clustering and extraction approach helps balance prompt alignment and identity consistency, as demonstrated through quantitative analysis and a user study.

While the method has some computational limitations, the authors' work highlights the potential for further advancements in this area. Continued research on efficient algorithms, generalization to diverse character types, and integration with other modalities could lead to even more powerful and versatile character generation capabilities.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

ORACLE: Leveraging Mutual Information for Consistent Character Generation with LoRAs in Diffusion Models

Kiymet Akdemir, Pinar Yanardag

Text-to-image diffusion models have recently taken center stage as pivotal tools in promoting visual creativity across an array of domains such as comic book artistry, children's literature, game development, and web design. These models harness the power of artificial intelligence to convert textual descriptions into vivid images, thereby enabling artists and creators to bring their imaginative concepts to life with unprecedented ease. However, one of the significant hurdles that persist is the challenge of maintaining consistency in character generation across diverse contexts. Variations in textual prompts, even if minor, can yield vastly different visual outputs, posing a considerable problem in projects that require a uniform representation of characters throughout. In this paper, we introduce a novel framework designed to produce consistent character representations from a single text prompt across diverse settings. Through both quantitative and qualitative analyses, we demonstrate that our framework outperforms existing methods in generating characters with consistent visual identities, underscoring its potential to transform creative industries. By addressing the critical challenge of character consistency, we not only enhance the practical utility of these models but also broaden the horizons for artistic and creative expression.

6/6/2024

cs.CV cs.LG

Training-Free Consistent Text-to-Image Generation

Yoad Tewel, Omri Kaduri, Rinon Gal, Yoni Kasten, Lior Wolf, Gal Chechik, Yuval Atzmon

Text-to-image models offer a new level of creative flexibility by allowing users to guide the image generation process through natural language. However, using these models to consistently portray the same subject across diverse prompts remains challenging. Existing approaches fine-tune the model to teach it new words that describe specific user-provided subjects or add image conditioning to the model. These methods require lengthy per-subject optimization or large-scale pre-training. Moreover, they struggle to align generated images with text prompts and face difficulties in portraying multiple subjects. Here, we present ConsiStory, a training-free approach that enables consistent subject generation by sharing the internal activations of the pretrained model. We introduce a subject-driven shared attention block and correspondence-based feature injection to promote subject consistency between images. Additionally, we develop strategies to encourage layout diversity while maintaining subject consistency. We compare ConsiStory to a range of baselines, and demonstrate state-of-the-art performance on subject consistency and text alignment, without requiring a single optimization step. Finally, ConsiStory can naturally extend to multi-subject scenarios, and even enable training-free personalization for common objects.

5/31/2024

cs.CV cs.AI cs.GR cs.LG

OneActor: Consistent Character Generation via Cluster-Conditioned Guidance

Jiahao Wang, Caixia Yan, Haonan Lin, Weizhan Zhang

Text-to-image diffusion models benefit artists with high-quality image generation. Yet their stochastic nature prevents artists from creating consistent images of the same character. Existing methods try to tackle this challenge and generate consistent content in various ways. However, they either depend on external data or require expensive tuning of the diffusion model. For this issue, we argue that a lightweight but intricate guidance is enough to function. Aiming at this, we lead the way to formalize the objective of consistent generation, derive a clustering-based score function and propose a novel paradigm, OneActor. We design a cluster-conditioned model which incorporates posterior samples to guide the denoising trajectories towards the target cluster. To overcome the overfitting challenge shared by one-shot tuning pipelines, we devise auxiliary components to simultaneously augment the tuning and regulate the inference. This technique is later verified to significantly enhance the content diversity of generated images. Comprehensive experiments show that our method outperforms a variety of baselines with satisfactory character consistency, superior prompt conformity as well as high image quality. And our method is at least 4 times faster than tuning-based baselines. Furthermore, to the best of our knowledge, we are the first to prove that the semantic space has the same interpolation property as the latent space does. This property can serve as another promising tool for fine generation control.

4/17/2024

cs.CV cs.AI

LatentMan: Generating Consistent Animated Characters using Image Diffusion Models

Abdelrahman Eldesokey, Peter Wonka

We propose a zero-shot approach for generating consistent videos of animated characters based on Text-to-Image (T2I) diffusion models. Existing Text-to-Video (T2V) methods are expensive to train and require large-scale video datasets to produce diverse characters and motions. At the same time, their zero-shot alternatives fail to produce temporally consistent videos with continuous motion. We strive to bridge this gap, and we introduce LatentMan, which leverages existing text-based motion diffusion models to generate diverse continuous motions to guide the T2I model. To boost the temporal consistency, we introduce the Spatial Latent Alignment module that exploits cross-frame dense correspondences that we compute to align the latents of the video frames. Furthermore, we propose Pixel-Wise Guidance to steer the diffusion process in a direction that minimizes visual discrepancies between frames. Our proposed approach outperforms existing zero-shot T2V approaches in generating videos of animated characters in terms of pixel-wise consistency and user preference. Project page https://abdo-eldesokey.github.io/latentman/.

6/4/2024

cs.CV cs.LG