OneActor: Consistent Character Generation via Cluster-Conditioned Guidance






Published 4/17/2024 by Jiahao Wang, Caixia Yan, Haonan Lin, Weizhan Zhang
OneActor: Consistent Character Generation via Cluster-Conditioned Guidance


Text-to-image diffusion models benefit artists with high-quality image generation. Yet its stochastic nature prevent artists from creating consistent images of the same character. Existing methods try to tackle this challenge and generate consistent content in various ways. However, they either depend on external data or require expensive tuning of the diffusion model. For this issue, we argue that a lightweight but intricate guidance is enough to function. Aiming at this, we lead the way to formalize the objective of consistent generation, derive a clustering-based score function and propose a novel paradigm, OneActor. We design a cluster-conditioned model which incorporates posterior samples to guide the denoising trajectories towards the target cluster. To overcome the overfitting challenge shared by one-shot tuning pipelines, we devise auxiliary components to simultaneously augment the tuning and regulate the inference. This technique is later verified to significantly enhance the content diversity of generated images. Comprehensive experiments show that our method outperforms a variety of baselines with satisfactory character consistency, superior prompt conformity as well as high image quality. And our method is at least 4 times faster than tuning-based baselines. Furthermore, to our best knowledge, we first prove that the semantic space has the same interpolation property as the latent space dose. This property can serve as another promising tool for fine generation control.

Create account to get full access


If you already have an account, we'll log you in


• This paper presents a novel approach called "OneActor" for generating consistent character animations in text-to-image AI models.

• The key idea is to use cluster-conditioned guidance to maintain the visual consistency of generated characters across different images, addressing a common issue in diffusion models.

• The proposed method demonstrates improvements over existing techniques for ensuring consistent character appearance and behavior, as evaluated through quantitative and qualitative experiments.

Plain English Explanation

• Imagine you're using an AI system to generate images based on text descriptions. One common problem is that the characters or objects in these images can sometimes look or behave inconsistently from one image to the next, even when the text prompt is the same.

• The OneActor approach aims to solve this by organizing the AI model's internal representations of characters into distinct "clusters." When generating a new image, the model is guided to stay within the same cluster, ensuring the character maintains a consistent appearance and behavior across different images.

• This helps the AI system create a more coherent and believable set of images, where the characters feel like they belong to the same world or story, rather than randomly changing from one picture to the next.

• The researchers demonstrate that OneActor outperforms other techniques for ensuring visual consistency, as measured by both objective metrics and subjective human evaluations. This could be an important step towards creating more natural and immersive text-to-image AI systems.

Technical Explanation

• The paper first reviews prior work on improving the consistency of diffusion models, including approaches like [semantic-approach-to-quantifying-consistency-diffusion-model], [rethinking-spatial-inconsistency-classifier-free-diffusion-guidance], and [trajectory-consistency-distillation-improved-latent-consistency-distillation].

• The key technical innovation in OneActor is the use of "cluster-conditioned guidance," where the model learns to organize its internal representations of characters into distinct clusters. During generation, the model is then guided to stay within the same cluster, encouraging consistent visual attributes.

• This is implemented by training a separate clustering module alongside the main diffusion model, using techniques like [controlnet-improving-conditional-controls-efficient-consistency-feedback] to efficiently incorporate the cluster information into the generation process.

• Experiments on various text-to-image benchmarks show that OneActor achieves superior performance in terms of maintaining character consistency, as evaluated by both quantitative metrics and human judgments. This includes comparisons to [versatile-scene-consistent-traffic-scenario-generation-as] and other baselines.

Critical Analysis

• While the OneActor approach demonstrates promising results, the paper acknowledges some limitations, such as the potential for the clustering module to introduce biases or fail to capture all the nuances of character representation.

• Additionally, the experiments focus primarily on character-centric tasks, so further research may be needed to understand how the technique generalizes to more diverse scenes and image compositions.

• It would also be interesting to explore the computational efficiency and scalability of the cluster-conditioned guidance, as the added complexity could impact the overall performance and deployment of the system.

• Overall, the OneActor paper presents a thoughtful and well-executed approach to a meaningful problem in text-to-image generation. Further research and real-world deployments will help reveal the broader implications and potential issues that deserve closer examination.


• The OneActor method introduces a novel cluster-conditioned guidance technique to improve the visual consistency of character representations in text-to-image AI models.

• By organizing the model's internal character representations into distinct clusters and guiding the generation process to stay within those clusters, OneActor demonstrates superior performance in maintaining consistent character appearances and behaviors across different images.

• This work represents an important step towards creating more coherent and believable text-to-image systems, with potential applications in areas like interactive storytelling, virtual worlds, and immersive entertainment experiences.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers


The Chosen One: Consistent Characters in Text-to-Image Diffusion Models

Omri Avrahami, Amir Hertz, Yael Vinker, Moab Arar, Shlomi Fruchter, Ohad Fried, Daniel Cohen-Or, Dani Lischinski





Recent advances in text-to-image generation models have unlocked vast potential for visual creativity. However, the users that use these models struggle with the generation of consistent characters, a crucial aspect for numerous real-world applications such as story visualization, game development, asset design, advertising, and more. Current methods typically rely on multiple pre-existing images of the target character or involve labor-intensive manual processes. In this work, we propose a fully automated solution for consistent character generation, with the sole input being a text prompt. We introduce an iterative procedure that, at each stage, identifies a coherent set of images sharing a similar identity and extracts a more consistent identity from this set. Our quantitative analysis demonstrates that our method strikes a better balance between prompt alignment and identity consistency compared to the baseline methods, and these findings are reinforced by a user study. To conclude, we showcase several practical applications of our approach.

Read more



Animate Anyone: Consistent and Controllable Image-to-Video Synthesis for Character Animation

Li Hu, Xin Gao, Peng Zhang, Ke Sun, Bang Zhang, Liefeng Bo





Character Animation aims to generating character videos from still images through driving signals. Currently, diffusion models have become the mainstream in visual generation research, owing to their robust generative capabilities. However, challenges persist in the realm of image-to-video, especially in character animation, where temporally maintaining consistency with detailed information from character remains a formidable problem. In this paper, we leverage the power of diffusion models and propose a novel framework tailored for character animation. To preserve consistency of intricate appearance features from reference image, we design ReferenceNet to merge detail features via spatial attention. To ensure controllability and continuity, we introduce an efficient pose guider to direct character's movements and employ an effective temporal modeling approach to ensure smooth inter-frame transitions between video frames. By expanding the training data, our approach can animate arbitrary characters, yielding superior results in character animation compared to other image-to-video methods. Furthermore, we evaluate our method on benchmarks for fashion video and human dance synthesis, achieving state-of-the-art results.

Read more



Training-Free Consistent Text-to-Image Generation

Yoad Tewel, Omri Kaduri, Rinon Gal, Yoni Kasten, Lior Wolf, Gal Chechik, Yuval Atzmon





Text-to-image models offer a new level of creative flexibility by allowing users to guide the image generation process through natural language. However, using these models to consistently portray the same subject across diverse prompts remains challenging. Existing approaches fine-tune the model to teach it new words that describe specific user-provided subjects or add image conditioning to the model. These methods require lengthy per-subject optimization or large-scale pre-training. Moreover, they struggle to align generated images with text prompts and face difficulties in portraying multiple subjects. Here, we present ConsiStory, a training-free approach that enables consistent subject generation by sharing the internal activations of the pretrained model. We introduce a subject-driven shared attention block and correspondence-based feature injection to promote subject consistency between images. Additionally, we develop strategies to encourage layout diversity while maintaining subject consistency. We compare ConsiStory to a range of baselines, and demonstrate state-of-the-art performance on subject consistency and text alignment, without requiring a single optimization step. Finally, ConsiStory can naturally extend to multi-subject scenarios, and even enable training-free personalization for common objects.

Read more


StoryDiffusion: Consistent Self-Attention for Long-Range Image and Video Generation

StoryDiffusion: Consistent Self-Attention for Long-Range Image and Video Generation

Yupeng Zhou, Daquan Zhou, Ming-Ming Cheng, Jiashi Feng, Qibin Hou





For recent diffusion-based generative models, maintaining consistent content across a series of generated images, especially those containing subjects and complex details, presents a significant challenge. In this paper, we propose a new way of self-attention calculation, termed Consistent Self-Attention, that significantly boosts the consistency between the generated images and augments prevalent pretrained diffusion-based text-to-image models in a zero-shot manner. To extend our method to long-range video generation, we further introduce a novel semantic space temporal motion prediction module, named Semantic Motion Predictor. It is trained to estimate the motion conditions between two provided images in the semantic spaces. This module converts the generated sequence of images into videos with smooth transitions and consistent subjects that are significantly more stable than the modules based on latent spaces only, especially in the context of long video generation. By merging these two novel components, our framework, referred to as StoryDiffusion, can describe a text-based story with consistent images or videos encompassing a rich variety of contents. The proposed StoryDiffusion encompasses pioneering explorations in visual story generation with the presentation of images and videos, which we hope could inspire more research from the aspect of architectural modifications. Our code is made publicly available at

Read more
