Stable-Pose: Leveraging Transformers for Pose-Guided Text-to-Image Generation

2406.02485

Published 6/5/2024 by Jiajun Wang, Morteza Ghahremani, Yitong Li, Bjorn Ommer, Christian Wachinger

Stable-Pose: Leveraging Transformers for Pose-Guided Text-to-Image Generation

Abstract

Controllable text-to-image (T2I) diffusion models have shown impressive performance in generating high-quality visual content through the incorporation of various conditions. Current methods, however, exhibit limited performance when guided by skeleton human poses, especially in complex pose conditions such as side or rear perspectives of human figures. To address this issue, we present Stable-Pose, a novel adapter model that introduces a coarse-to-fine attention masking strategy into a vision Transformer (ViT) to gain accurate pose guidance for T2I models. Stable-Pose is designed to adeptly handle pose conditions within pre-trained Stable Diffusion, providing a refined and efficient way of aligning pose representation during image synthesis. We leverage the query-key self-attention mechanism of ViTs to explore the interconnections among different anatomical parts in human pose skeletons. Masked pose images are used to smoothly refine the attention maps based on target pose-related features in a hierarchical manner, transitioning from coarse to fine levels. Additionally, our loss function is formulated to allocate increased emphasis to the pose region, thereby augmenting the model's precision in capturing intricate pose details. We assessed the performance of Stable-Pose across five public datasets under a wide range of indoor and outdoor human pose scenarios. Stable-Pose achieved an AP score of 57.1 in the LAION-Human dataset, marking around 13% improvement over the established technique ControlNet. The project link and code is available at https://github.com/ai-med/StablePose.

Create account to get full access

Overview

The paper "Stable-Pose: Leveraging Transformers for Pose-Guided Text-to-Image Generation" explores a novel approach to text-to-image generation that incorporates pose information to improve the stability and realism of the generated images.
The key idea is to combine a text encoder, a pose encoder, and an image generator using transformers to enable fine-grained control over the pose of the generated human figures.
This builds upon recent advancements in text-to-image generation, such as VividPose and PoseAnimate, while also addressing the challenge of generating diverse and realistic human poses.

Plain English Explanation

The researchers developed a new way to generate images from text descriptions that allows for more control over the poses of the people in the images. Typically, text-to-image generation models can struggle to create consistent and realistic human poses. The Stable-Pose approach aims to address this by using a combination of text, pose, and image encoders based on transformers.

The key idea is that by including information about the desired pose of the human figures, the model can generate images that more closely match the text description and have more natural-looking poses. This builds on previous work that has explored using pose information to improve text-to-image generation, such as VividPose and PoseAnimate.

The researchers tested their Stable-Pose model on a range of text-to-image generation tasks and found that it was able to produce images with more consistent and realistic human poses compared to previous approaches. This could have important applications in fields like computer graphics, virtual reality, and video game development, where generating realistic and diverse human figures is a key challenge.

Technical Explanation

The Stable-Pose approach combines a text encoder, a pose encoder, and an image generator using transformers to enable fine-grained control over the pose of the generated human figures. The text encoder takes the input text description and encodes it into a semantic representation. The pose encoder takes a desired pose (e.g., a stick figure representation) and encodes it into a latent pose representation. These two latent representations are then combined and used to condition the image generator, which produces the final image.

The key innovation of the Stable-Pose model is the way it leverages pose information to improve the stability and realism of the generated images. By explicitly incorporating the desired pose into the generation process, the model is able to better align the generated images with the text description and produce more natural-looking human figures.

The researchers evaluated the Stable-Pose model on a variety of text-to-image generation tasks, including generating images of people in different poses and scenes. They compared the performance of Stable-Pose to other state-of-the-art text-to-image generation models, such as DiversifyingHumanPose and SATO, and found that Stable-Pose was able to generate images with more consistent and realistic human poses.

Critical Analysis

One potential limitation of the Stable-Pose approach is that it may be more computationally intensive than some other text-to-image generation models, as it requires the additional pose encoding step. The researchers acknowledge this trade-off and suggest that future work could explore ways to optimize the model architecture or training process to improve computational efficiency.

Additionally, while the Stable-Pose model showed improvements in generating consistent and realistic human poses, it may still struggle with generating highly diverse or creative human figures, as the pose information is constrained to the desired input. Exploring ways to introduce more flexibility or randomness into the pose generation process could be an interesting area for future research.

Finally, it's important to note that like any text-to-image generation model, the Stable-Pose approach may reflect and amplify societal biases present in the training data. Careful consideration of these potential biases and their impact on the generated images will be crucial as this technology is further developed and deployed.

Conclusion

The Stable-Pose paper presents a novel approach to text-to-image generation that leverages transformer-based encoders to incorporate pose information and improve the stability and realism of the generated human figures. By explicitly modeling the desired pose of the human subjects, the Stable-Pose model is able to generate images that more closely align with the input text description and have more natural-looking poses.

This work builds on recent advancements in text-to-image generation and could have important applications in fields like computer graphics, virtual reality, and video game development, where generating realistic and diverse human figures is a key challenge. As the field of text-to-image generation continues to evolve, the Stable-Pose approach offers a promising direction for further exploration and research.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

VividPose: Advancing Stable Video Diffusion for Realistic Human Image Animation

Qilin Wang, Zhengkai Jiang, Chengming Xu, Jiangning Zhang, Yabiao Wang, Xinyi Zhang, Yun Cao, Weijian Cao, Chengjie Wang, Yanwei Fu

Human image animation involves generating a video from a static image by following a specified pose sequence. Current approaches typically adopt a multi-stage pipeline that separately learns appearance and motion, which often leads to appearance degradation and temporal inconsistencies. To address these issues, we propose VividPose, an innovative end-to-end pipeline based on Stable Video Diffusion (SVD) that ensures superior temporal stability. To enhance the retention of human identity, we propose an identity-aware appearance controller that integrates additional facial information without compromising other appearance details such as clothing texture and background. This approach ensures that the generated videos maintain high fidelity to the identity of human subject, preserving key facial features across various poses. To accommodate diverse human body shapes and hand movements, we introduce a geometry-aware pose controller that utilizes both dense rendering maps from SMPL-X and sparse skeleton maps. This enables accurate alignment of pose and shape in the generated videos, providing a robust framework capable of handling a wide range of body shapes and dynamic hand movements. Extensive qualitative and quantitative experiments on the UBCFashion and TikTok benchmarks demonstrate that our method achieves state-of-the-art performance. Furthermore, VividPose exhibits superior generalization capabilities on our proposed in-the-wild dataset. Codes and models will be available.

5/29/2024

cs.CV

Follow-Your-Pose v2: Multiple-Condition Guided Character Image Animation for Stable Pose Control

Jingyun Xue, Hongfa Wang, Qi Tian, Yue Ma, Andong Wang, Zhiyuan Zhao, Shaobo Min, Wenzhe Zhao, Kaihao Zhang, Heung-Yeung Shum, Wei Liu, Mengyang Liu, Wenhan Luo

Pose-controllable character video generation is in high demand with extensive applications for fields such as automatic advertising and content creation on social media platforms. While existing character image animation methods using pose sequences and reference images have shown promising performance, they tend to struggle with incoherent animation in complex scenarios, such as multiple character animation and body occlusion. Additionally, current methods request large-scale high-quality videos with stable backgrounds and temporal consistency as training datasets, otherwise, their performance will greatly deteriorate. These two issues hinder the practical utilization of character image animation tools. In this paper, we propose a practical and robust framework Follow-Your-Pose v2, which can be trained on noisy open-sourced videos readily available on the internet. Multi-condition guiders are designed to address the challenges of background stability, body occlusion in multi-character generation, and consistency of character appearance. Moreover, to fill the gap of fair evaluation of multi-character pose animation, we propose a new benchmark comprising approximately 4,000 frames. Extensive experiments demonstrate that our approach outperforms state-of-the-art methods by a margin of over 35% across 2 datasets and on 7 metrics. Meanwhile, qualitative assessments reveal a significant improvement in the quality of generated video, particularly in scenarios involving complex backgrounds and body occlusion of multi-character, suggesting the superiority of our approach.

6/14/2024

cs.CV

Diversifying Human Pose in Synthetic Data for Aerial-view Human Detection

Yi-Ting Shen, Hyungtae Lee, Heesung Kwon, Shuvra S. Bhattacharyya

We present a framework for diversifying human poses in a synthetic dataset for aerial-view human detection. Our method firstly constructs a set of novel poses using a pose generator and then alters images in the existing synthetic dataset to assume the novel poses while maintaining the original style using an image translator. Since images corresponding to the novel poses are not available in training, the image translator is trained to be applicable only when the input and target poses are similar, thus training does not require the novel poses and their corresponding images. Next, we select a sequence of target novel poses from the novel pose set, using Dijkstra's algorithm to ensure that poses closer to each other are located adjacently in the sequence. Finally, we repeatedly apply the image translator to each target pose in sequence to produce a group of novel pose images representing a variety of different limited body movements from the source pose. Experiments demonstrate that, regardless of how the synthetic data is used for training or the data size, leveraging the pose-diversified synthetic dataset in training generally presents remarkably better accuracy than using the original synthetic dataset on three aerial-view human detection benchmarks (VisDrone, Okutama-Action, and ICG) in the few-shot regime.

5/28/2024

cs.CV

✨

SATO: Stable Text-to-Motion Framework

Wenshuo Chen, Hongru Xiao, Erhang Zhang, Lijie Hu, Lei Wang, Mengyuan Liu, Chen Chen

Is the Text to Motion model robust? Recent advancements in Text to Motion models primarily stem from more accurate predictions of specific actions. However, the text modality typically relies solely on pre-trained Contrastive Language-Image Pretraining (CLIP) models. Our research has uncovered a significant issue with the text-to-motion model: its predictions often exhibit inconsistent outputs, resulting in vastly different or even incorrect poses when presented with semantically similar or identical text inputs. In this paper, we undertake an analysis to elucidate the underlying causes of this instability, establishing a clear link between the unpredictability of model outputs and the erratic attention patterns of the text encoder module. Consequently, we introduce a formal framework aimed at addressing this issue, which we term the Stable Text-to-Motion Framework (SATO). SATO consists of three modules, each dedicated to stable attention, stable prediction, and maintaining a balance between accuracy and robustness trade-off. We present a methodology for constructing an SATO that satisfies the stability of attention and prediction. To verify the stability of the model, we introduced a new textual synonym perturbation dataset based on HumanML3D and KIT-ML. Results show that SATO is significantly more stable against synonyms and other slight perturbations while keeping its high accuracy performance.

5/6/2024

cs.CV