One-Shot Learning for Pose-Guided Person Image Synthesis in the Wild

Read original: arXiv:2409.09593 - Published 9/17/2024 by Dongqi Fan, Tao Chen, Mingjie Wang, Rui Ma, Qiang Tang, Zili Yi, Qian Wang, Liang Chang

One-Shot Learning for Pose-Guided Person Image Synthesis in the Wild

Overview

One-Shot Learning for Pose-Guided Person Image Synthesis in the Wild is a research paper that proposes a new approach for generating realistic person images based on a single reference image and a target pose.
The paper introduces a novel neural network architecture and training strategy to address the challenges of pose-guided person image synthesis in unconstrained, "in the wild" settings.
Key contributions include a one-shot learning framework, a dual-stream generator network, and techniques to handle occlusions and diverse poses.

Plain English Explanation

The paper presents a way to create realistic images of people in new poses, starting with just a single reference image. This is useful for applications like virtual try-on, where you want to see how clothing would look on a person in a different pose.

The approach works by taking the reference image and a target pose as inputs. It then uses a neural network to generate a new image of the person in the target pose. The network is designed to handle a wide variety of poses and body occlusions, which can be challenging for previous methods.

Importantly, the system only requires a single reference image to generate new poses, rather than needing a large dataset of labeled training images. This "one-shot learning" capability makes the approach more flexible and practical for real-world applications.

Technical Explanation

The paper introduces a one-shot learning framework for pose-guided person image synthesis. This allows generating novel images of a person in new poses based on a single reference image.

The key technical components include:

A dual-stream generator network that separately encodes the reference image and target pose, then combines them to generate the output image.
Techniques to handle occlusions and diverse poses during generation.
A one-shot learning training strategy that only requires a single reference image per person.

The authors evaluate their approach on a challenging in-the-wild dataset, demonstrating improved performance over previous pose-guided person synthesis methods.

Critical Analysis

The paper tackles an important problem in computer vision and image synthesis. The one-shot learning capability is a key strength, as it reduces the burden of collecting large training datasets.

However, the paper does not fully address the potential for generating inaccurate or unrealistic images. There may be cases where the synthesized person does not accurately reflect the reference individual, which could be a concern for applications like virtual try-on.

Additionally, the computational complexity of the dual-stream generator network is not discussed in detail. This could be an important practical consideration for real-world deployment.

Further research could explore ways to improve the fidelity and efficiency of the pose-guided synthesis, as well as investigate the ethical implications of generating synthetic person images.

Conclusion

This paper presents a novel one-shot learning approach for generating realistic person images in new poses. The technical contributions, including the dual-stream generator network and techniques for handling occlusions and diverse poses, represent important advancements in the field of pose-guided image synthesis.

While the paper demonstrates promising results, there are some potential limitations and areas for future work. Nonetheless, the work showcases the potential of one-shot learning for flexible and practical image synthesis applications.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Follow @aimodelsfyi on 𝕏 →

Related Papers

One-Shot Learning for Pose-Guided Person Image Synthesis in the Wild

Dongqi Fan, Tao Chen, Mingjie Wang, Rui Ma, Qiang Tang, Zili Yi, Qian Wang, Liang Chang

Current Pose-Guided Person Image Synthesis (PGPIS) methods depend heavily on large amounts of labeled triplet data to train the generator in a supervised manner. However, they often falter when applied to in-the-wild samples, primarily due to the distribution gap between the training datasets and real-world test samples. While some researchers aim to enhance model generalizability through sophisticated training procedures, advanced architectures, or by creating more diverse datasets, we adopt the test-time fine-tuning paradigm to customize a pre-trained Text2Image (T2I) model. However, naively applying test-time tuning results in inconsistencies in facial identities and appearance attributes. To address this, we introduce a Visual Consistency Module (VCM), which enhances appearance consistency by combining the face, text, and image embedding. Our approach, named OnePoseTrans, requires only a single source image to generate high-quality pose transfer results, offering greater stability than state-of-the-art data-driven methods. For each test case, OnePoseTrans customizes a model in around 48 seconds with an NVIDIA V100 GPU.

9/17/2024

Stable-Pose: Leveraging Transformers for Pose-Guided Text-to-Image Generation

Jiajun Wang, Morteza Ghahremani, Yitong Li, Bjorn Ommer, Christian Wachinger

Controllable text-to-image (T2I) diffusion models have shown impressive performance in generating high-quality visual content through the incorporation of various conditions. Current methods, however, exhibit limited performance when guided by skeleton human poses, especially in complex pose conditions such as side or rear perspectives of human figures. To address this issue, we present Stable-Pose, a novel adapter model that introduces a coarse-to-fine attention masking strategy into a vision Transformer (ViT) to gain accurate pose guidance for T2I models. Stable-Pose is designed to adeptly handle pose conditions within pre-trained Stable Diffusion, providing a refined and efficient way of aligning pose representation during image synthesis. We leverage the query-key self-attention mechanism of ViTs to explore the interconnections among different anatomical parts in human pose skeletons. Masked pose images are used to smoothly refine the attention maps based on target pose-related features in a hierarchical manner, transitioning from coarse to fine levels. Additionally, our loss function is formulated to allocate increased emphasis to the pose region, thereby augmenting the model's precision in capturing intricate pose details. We assessed the performance of Stable-Pose across five public datasets under a wide range of indoor and outdoor human pose scenarios. Stable-Pose achieved an AP score of 57.1 in the LAION-Human dataset, marking around 13% improvement over the established technique ControlNet. The project link and code is available at https://github.com/ai-med/StablePose.

6/5/2024

🖼️

VehicleGAN: Pair-flexible Pose Guided Image Synthesis for Vehicle Re-identification

Baolu Li, Ping Liu, Lan Fu, Jinlong Li, Jianwu Fang, Zhigang Xu, Hongkai Yu

Vehicle Re-identification (Re-ID) has been broadly studied in the last decade; however, the different camera view angle leading to confused discrimination in the feature subspace for the vehicles of various poses, is still challenging for the Vehicle Re-ID models in the real world. To promote the Vehicle Re-ID models, this paper proposes to synthesize a large number of vehicle images in the target pose, whose idea is to project the vehicles of diverse poses into the unified target pose so as to enhance feature discrimination. Considering that the paired data of the same vehicles in different traffic surveillance cameras might be not available in the real world, we propose the first Pair-flexible Pose Guided Image Synthesis method for Vehicle Re-ID, named as VehicleGAN in this paper, which works for both supervised and unsupervised settings without the knowledge of geometric 3D models. Because of the feature distribution difference between real and synthetic data, simply training a traditional metric learning based Re-ID model with data-level fusion (i.e., data augmentation) is not satisfactory, therefore we propose a new Joint Metric Learning (JML) via effective feature-level fusion from both real and synthetic data. Intensive experimental results on the public VeRi-776 and VehicleID datasets prove the accuracy and effectiveness of our proposed VehicleGAN and JML.

4/17/2024

🧪

PoseCrafter: One-Shot Personalized Video Synthesis Following Flexible Poses

Yong Zhong, Min Zhao, Zebin You, Xiaofeng Yu, Changwang Zhang, Chongxuan Li

In this paper, we introduce PoseCrafter, a one-shot method for personalized video generation following the control of flexible poses. Built upon Stable Diffusion and ControlNet, we carefully design an inference process to produce high-quality videos without the corresponding ground-truth frames. First, we select an appropriate reference frame from the training video and invert it to initialize all latent variables for generation. Then, we insert the corresponding training pose into the target pose sequences to enhance faithfulness through a trained temporal attention module. Furthermore, to alleviate the face and hand degradation resulting from discrepancies between poses of training videos and inference poses, we implement simple latent editing through an affine transformation matrix involving facial and hand landmarks. Extensive experiments on several datasets demonstrate that PoseCrafter achieves superior results to baselines pre-trained on a vast collection of videos under 8 commonly used metrics. Besides, PoseCrafter can follow poses from different individuals or artificial edits and simultaneously retain the human identity in an open-domain training video. Our project page is available at https://ml-gsai.github.io/PoseCrafter-demo/.

7/19/2024