Coarse-to-Fine Latent Diffusion for Pose-Guided Person Image Synthesis






Published 4/10/2024 by Yanzuo Lu, Manlin Zhang, Andy J Ma, Xiaohua Xie, Jian-Huang Lai
Coarse-to-Fine Latent Diffusion for Pose-Guided Person Image Synthesis


Diffusion model is a promising approach to image generation and has been employed for Pose-Guided Person Image Synthesis (PGPIS) with competitive performance. While existing methods simply align the person appearance to the target pose, they are prone to overfitting due to the lack of a high-level semantic understanding on the source person image. In this paper, we propose a novel Coarse-to-Fine Latent Diffusion (CFLD) method for PGPIS. In the absence of image-caption pairs and textual prompts, we develop a novel training paradigm purely based on images to control the generation process of a pre-trained text-to-image diffusion model. A perception-refined decoder is designed to progressively refine a set of learnable queries and extract semantic understanding of person images as a coarse-grained prompt. This allows for the decoupling of fine-grained appearance and pose information controls at different stages, and thus circumventing the potential overfitting problem. To generate more realistic texture details, a hybrid-granularity attention module is proposed to encode multi-scale fine-grained appearance features as bias terms to augment the coarse-grained prompt. Both quantitative and qualitative experimental results on the DeepFashion benchmark demonstrate the superiority of our method over the state of the arts for PGPIS. Code is available at

Create account to get full access


If you already have an account, we'll log you in


  • This paper proposes a novel coarse-to-fine latent diffusion model for pose-guided person image synthesis.
  • The model generates high-quality images of people by progressively refining the output from a coarse to a fine scale.
  • Key innovations include a multi-scale architecture, a novel global-to-local sampling strategy, and a pose-guided synthesis approach.

Plain English Explanation

The researchers developed a new machine learning model that can generate realistic images of people based on their poses or body positions. Their approach works by first creating a rough, low-resolution version of the image and then gradually refining it to produce a high-quality, detailed result.

This "coarse-to-fine" process is made possible by the model's multi-scale architecture, which allows it to capture both the overall structure and fine details of the person. Additionally, the researchers introduced a novel way of sampling the model's internal representations to guide the synthesis from broad to specific aspects of the image.

Importantly, the model also takes into account the person's pose, using this information to help generate images that match the desired body position. This pose-guided synthesis is a key innovation that sets this work apart from previous person image generation techniques.

Overall, this research advances the state-of-the-art in computer vision and graphics by enabling the creation of high-fidelity, pose-specific images of people. This could have applications in areas like virtual fashion, computer animation, and video games.

Technical Explanation

The paper introduces a [object Object] model for [object Object]. The model uses a multi-scale diffusion architecture to progressively refine the output from a coarse to a fine scale.

A key innovation is the [object Object] strategy, which guides the synthesis process from broad to specific aspects of the image. This is combined with a [object Object] approach that leverages the person's pose information to generate images that match the desired body position.

Experiments on several person image synthesis benchmarks demonstrate the model's ability to generate high-quality, pose-specific images. The results highlight the benefits of the coarse-to-fine approach and the effectiveness of the proposed global-to-local sampling and pose-guided synthesis techniques.

Critical Analysis

The paper presents a comprehensive and well-designed study, with thorough experiments and detailed analysis. However, there are a few potential limitations and areas for further research:

  1. The model's reliance on pose information could limit its applicability to scenarios where such data is not available. Exploring ,[object Object], person image synthesis could be a valuable direction for future work.

  2. The paper does not delve into the model's computational efficiency or inference speed, which could be important considerations for real-world applications. Investigating ways to improve the [object Object] would be a relevant research direction.

  3. While the qualitative and quantitative results are impressive, further analysis of the model's limitations and failure cases could provide valuable insights to inform future improvements.

Overall, the proposed coarse-to-fine latent diffusion model represents a significant advance in pose-guided person image synthesis, with promising applications in various domains. The innovations introduced in this work could inspire future research to push the boundaries of generative modeling even further.


This paper presents a novel coarse-to-fine latent diffusion model for pose-guided person image synthesis. By combining a multi-scale architecture, global-to-local sampling, and pose-guided synthesis, the researchers have developed a system capable of generating high-quality, pose-specific images of people.

The key innovations introduced in this work, such as the coarse-to-fine refinement process and the incorporation of pose information, demonstrate the potential of this approach to advance the state-of-the-art in computer vision and graphics. The findings could have significant implications for applications ranging from virtual fashion and computer animation to video game development and beyond.

As with any research, there are opportunities for further exploration and improvement. Investigating pose-free synthesis, improving computational efficiency, and analyzing the model's limitations could all be valuable directions for future work. Overall, this paper represents an important contribution to the field of generative modeling and person image synthesis.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers


High-fidelity Person-centric Subject-to-Image Synthesis

Yibin Wang, Weizhong Zhang, Jianwei Zheng, Cheng Jin





Current subject-driven image generation methods encounter significant challenges in person-centric image generation. The reason is that they learn the semantic scene and person generation by fine-tuning a common pre-trained diffusion, which involves an irreconcilable training imbalance. Precisely, to generate realistic persons, they need to sufficiently tune the pre-trained model, which inevitably causes the model to forget the rich semantic scene prior and makes scene generation over-fit to the training data. Moreover, even with sufficient fine-tuning, these methods can still not generate high-fidelity persons since joint learning of the scene and person generation also lead to quality compromise. In this paper, we propose Face-diffuser, an effective collaborative generation pipeline to eliminate the above training imbalance and quality compromise. Specifically, we first develop two specialized pre-trained diffusion models, i.e., Text-driven Diffusion Model (TDM) and Subject-augmented Diffusion Model (SDM), for scene and person generation, respectively. The sampling process is divided into three sequential stages, i.e., semantic scene construction, subject-scene fusion, and subject enhancement. The first and last stages are performed by TDM and SDM respectively. The subject-scene fusion stage, that is the collaboration achieved through a novel and highly effective mechanism, Saliency-adaptive Noise Fusion (SNF). Specifically, it is based on our key observation that there exists a robust link between classifier-free guidance responses and the saliency of generated images. In each time step, SNF leverages the unique strengths of each model and allows for the spatial blending of predicted noises from both models automatically in a saliency-aware manner. Extensive experiments confirm the impressive effectiveness and robustness of the Face-diffuser.

Read more



FinePOSE: Fine-Grained Prompt-Driven 3D Human Pose Estimation via Diffusion Models

Jinglin Xu, Yijie Guo, Yuxin Peng





The 3D Human Pose Estimation (3D HPE) task uses 2D images or videos to predict human joint coordinates in 3D space. Despite recent advancements in deep learning-based methods, they mostly ignore the capability of coupling accessible texts and naturally feasible knowledge of humans, missing out on valuable implicit supervision to guide the 3D HPE task. Moreover, previous efforts often study this task from the perspective of the whole human body, neglecting fine-grained guidance hidden in different body parts. To this end, we present a new Fine-Grained Prompt-Driven Denoiser based on a diffusion model for 3D HPE, named textbf{FinePOSE}. It consists of three core blocks enhancing the reverse process of the diffusion model: (1) Fine-grained Part-aware Prompt learning (FPP) block constructs fine-grained part-aware prompts via coupling accessible texts and naturally feasible knowledge of body parts with learnable prompts to model implicit guidance. (2) Fine-grained Prompt-pose Communication (FPC) block establishes fine-grained communications between learned part-aware prompts and poses to improve the denoising quality. (3) Prompt-driven Timestamp Stylization (PTS) block integrates learned prompt embedding and temporal information related to the noise level to enable adaptive adjustment at each denoising step. Extensive experiments on public single-human pose estimation datasets show that FinePOSE outperforms state-of-the-art methods. We further extend FinePOSE to multi-human pose estimation. Achieving 34.3mm average MPJPE on the EgoHumans dataset demonstrates the potential of FinePOSE to deal with complex multi-human scenarios. Code is available at

Read more



DiLightNet: Fine-grained Lighting Control for Diffusion-based Image Generation

Chong Zeng, Yue Dong, Pieter Peers, Youkang Kong, Hongzhi Wu, Xin Tong





This paper presents a novel method for exerting fine-grained lighting control during text-driven diffusion-based image generation. While existing diffusion models already have the ability to generate images under any lighting condition, without additional guidance these models tend to correlate image content and lighting. Moreover, text prompts lack the necessary expressional power to describe detailed lighting setups. To provide the content creator with fine-grained control over the lighting during image generation, we augment the text-prompt with detailed lighting information in the form of radiance hints, i.e., visualizations of the scene geometry with a homogeneous canonical material under the target lighting. However, the scene geometry needed to produce the radiance hints is unknown. Our key observation is that we only need to guide the diffusion process, hence exact radiance hints are not necessary; we only need to point the diffusion model in the right direction. Based on this observation, we introduce a three stage method for controlling the lighting during image generation. In the first stage, we leverage a standard pretrained diffusion model to generate a provisional image under uncontrolled lighting. Next, in the second stage, we resynthesize and refine the foreground object in the generated image by passing the target lighting to a refined diffusion model, named DiLightNet, using radiance hints computed on a coarse shape of the foreground object inferred from the provisional image. To retain the texture details, we multiply the radiance hints with a neural encoding of the provisional synthesized image before passing it to DiLightNet. Finally, in the third stage, we resynthesize the background to be consistent with the lighting on the foreground object. We demonstrate and validate our lighting controlled diffusion model on a variety of text prompts and lighting conditions.

Read more


Synthesizing Efficient Data with Diffusion Models for Person Re-Identification Pre-Training

Synthesizing Efficient Data with Diffusion Models for Person Re-Identification Pre-Training

Ke Niu, Haiyang Yu, Xuelin Qian, Teng Fu, Bin Li, Xiangyang Xue





Existing person re-identification (Re-ID) methods principally deploy the ImageNet-1K dataset for model initialization, which inevitably results in sub-optimal situations due to the large domain gap. One of the key challenges is that building large-scale person Re-ID datasets is time-consuming. Some previous efforts address this problem by collecting person images from the internet e.g., LUPerson, but it struggles to learn from unlabeled, uncontrollable, and noisy data. In this paper, we present a novel paradigm Diffusion-ReID to efficiently augment and generate diverse images based on known identities without requiring any cost of data collection and annotation. Technically, this paradigm unfolds in two stages: generation and filtering. During the generation stage, we propose Language Prompts Enhancement (LPE) to ensure the ID consistency between the input image sequence and the generated images. In the diffusion process, we propose a Diversity Injection (DI) module to increase attribute diversity. In order to make the generated data have higher quality, we apply a Re-ID confidence threshold filter to further remove the low-quality images. Benefiting from our proposed paradigm, we first create a new large-scale person Re-ID dataset Diff-Person, which consists of over 777K images from 5,183 identities. Next, we build a stronger person Re-ID backbone pre-trained on our Diff-Person. Extensive experiments are conducted on four person Re-ID benchmarks in six widely used settings. Compared with other pre-training and self-supervised competitors, our approach shows significant superiority.

Read more
