PoseCrafter: One-Shot Personalized Video Synthesis Following Flexible Poses

2405.14582

Published 5/27/2024 by Yong Zhong, Min Zhao, Zebin You, Xiaofeng Yu, Changwang Zhang, Chongxuan Li

Abstract

In this paper, we introduce PoseCrafter, a one-shot method for personalized video generation following the control of flexible poses. Built upon Stable Diffusion and ControlNet, we carefully design an inference process to produce high-quality videos without the corresponding ground-truth frames. First, we select an appropriate reference frame from the training video and invert it to initialize all latent variables for generation. Then, we insert the corresponding training pose into the target pose sequences to enhance faithfulness through a trained temporal attention module. Furthermore, to alleviate the face and hand degradation resulting from discrepancies between poses of training videos and inference poses, we implement simple latent editing through an affine transformation matrix involving facial and hand landmarks. Extensive experiments on several datasets demonstrate that PoseCrafter achieves superior results to baselines pre-trained on a vast collection of videos under 8 commonly used metrics. Besides, PoseCrafter can follow poses from different individuals or artificial edits and simultaneously retain the human identity in an open-domain training video. Our project page is available at https://ml-gsai.github.io/PoseCrafter-demo/.

Overview

PoseCrafter is a one-shot method for generating personalized videos that follow flexible poses.
It builds upon Stable Diffusion and ControlNet to produce high-quality videos without the need for corresponding ground-truth frames.
The method involves carefully selecting a reference frame, inserting training poses into target pose sequences, and applying simple latent editing to address face and hand degradation.
Experiments on several datasets show that PoseCrafter outperforms baselines pre-trained on large video collections under 8 commonly used metrics, and that it can follow poses from different individuals or artificial edits while retaining the human identity.

Plain English Explanation

PoseCrafter is a technique for creating personalized videos in which a person moves and poses the way you specify. It's built on top of two existing AI models, Stable Diffusion and ControlNet, and the researchers add a carefully designed generation procedure so that the videos stay high-quality even though no real footage of the person in the new poses exists.

The key idea is that instead of starting from scratch, PoseCrafter picks a reference frame from the original training video and uses it to kick off the generation of every new frame. It then adds the pose that goes with that reference frame to the sequence of poses you want in the new video. This gives the model a frame it has actually seen to anchor on, which helps the result stay faithful to the person in the original video.
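The reference-frame and pose-insertion steps can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the source only says an "appropriate" frame is selected, so the nearest-pose criterion used here is an assumption, as are the function names `select_reference_index` and `insert_reference_pose`, the representation of a pose as an array of 2D keypoints, and the choice to place the training pose at the front of the sequence.

```python
import numpy as np


def select_reference_index(train_poses, target_poses):
    """Return the index of the training frame whose pose is closest,
    on average, to the target poses (an assumed selection criterion).

    train_poses:  array of shape (N, K, 2), N frames with K keypoints each
    target_poses: array of shape (T, K, 2), T frames with K keypoints each
    """
    # Pairwise differences between every training and target pose: (N, T, K, 2)
    diff = train_poses[:, None] - target_poses[None]
    # Euclidean distance per keypoint, averaged over keypoints and target frames
    dist = np.linalg.norm(diff, axis=-1).mean(axis=(1, 2))
    return int(np.argmin(dist))


def insert_reference_pose(train_poses, target_poses, ref_index):
    """Prepend the reference frame's pose to the target pose sequence.

    Putting it first is an assumption; the source does not state the position.
    """
    return np.concatenate([train_poses[ref_index][None], target_poses], axis=0)


# Toy data: three training frames and two target frames, two keypoints each.
train_poses = np.array(
    [[[0, 0], [0, 1]], [[5, 5], [5, 6]], [[10, 10], [10, 11]]], dtype=float
)
target_poses = np.array([[[5.2, 5.0], [5.0, 6.1]], [[4.9, 5.0], [5.0, 6.0]]])

ref_index = select_reference_index(train_poses, target_poses)
pose_sequence = insert_reference_pose(train_poses, target_poses, ref_index)
```

In a full pipeline, the selected frame would also be inverted to obtain the initial latent variables, and `pose_sequence` would be passed to the pose-conditioned model; both of those steps are outside this sketch.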

One tricky part is that the poses in the training video might not match up perfectly with the poses you want in the new video. To fix this, PoseCrafter uses some smart "latent editing" techniques that adjust the faces and hands to look right. This helps the characters maintain their identity even when the poses change.

The researchers tested PoseCrafter on several video datasets and report that it beats baseline methods on 8 commonly used metrics, even though those baselines were pre-trained on large video collections. It can also follow poses taken from different people or produced by artificial edits while keeping the identity of the person in the training video.

Technical Explanation

PoseCrafter is a one-shot method for generating personalized videos that follow flexible poses. It builds upon the capabilities of Stable Diffusion and ControlNet to produce high-quality videos without the need for corresponding ground-truth frames.

The key steps in the PoseCrafter inference process are:

1. Reference Frame Selection: The researchers select an appropriate reference frame from the training video and invert it to initialize all latent variables for generation.
2. Pose Insertion: The corresponding training pose is inserted into the target pose sequences to enhance faithfulness through a trained temporal attention module.
3. Latent Editing: To mitigate face and hand degradation caused by discrepancies between training and inference poses, the researchers implement simple latent editing through an affine transformation matrix involving facial and hand landmarks.
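To make the latent editing step more concrete, the sketch below fits a 2D affine matrix that maps one set of landmarks onto another by least squares, then uses it to move latent features accordingly. It is an illustration under stated assumptions rather than the paper's implementation: the function names `estimate_affine` and `warp_latent` are hypothetical, landmarks are assumed to be already scaled to latent resolution, the whole latent is warped with nearest-neighbour sampling, and no face or hand mask is applied.

```python
import numpy as np


def estimate_affine(src_pts, dst_pts):
    """Least-squares 2x3 affine matrix M such that dst ~= M @ [x, y, 1].

    src_pts, dst_pts: arrays of shape (K, 2) holding (x, y) landmarks, K >= 3.
    """
    ones = np.ones((src_pts.shape[0], 1))
    A = np.hstack([src_pts, ones])                        # (K, 3)
    solution, *_ = np.linalg.lstsq(A, dst_pts, rcond=None)  # (3, 2)
    return solution.T                                     # (2, 3)


def warp_latent(latent, M):
    """Warp a (C, H, W) latent with affine matrix M using backward mapping.

    Each output location is filled from the source location that M maps onto
    it; locations whose source falls outside the latent are left at zero.
    """
    C, H, W = latent.shape
    M3 = np.vstack([M, [0.0, 0.0, 1.0]])
    M_inv = np.linalg.inv(M3)

    ys, xs = np.mgrid[0:H, 0:W]
    dst = np.stack([xs.ravel(), ys.ravel(), np.ones(H * W)], axis=0)  # (3, H*W)
    src = M_inv @ dst
    sx = np.rint(src[0]).astype(int)  # nearest-neighbour source column
    sy = np.rint(src[1]).astype(int)  # nearest-neighbour source row

    valid = (sx >= 0) & (sx < W) & (sy >= 0) & (sy < H)
    out = np.zeros((C, H * W), dtype=latent.dtype)
    out[:, valid] = latent[:, sy[valid], sx[valid]]
    return out.reshape(C, H, W)


# Toy example: landmarks shifted by (+2, +1) in (x, y).
src_landmarks = np.array([[1, 1], [2, 1], [1, 2], [3, 3]], dtype=float)
dst_landmarks = src_landmarks + np.array([2.0, 1.0])
M = estimate_affine(src_landmarks, dst_landmarks)

latent = np.zeros((1, 6, 6))
latent[0, 1, 1] = 1.0            # a single feature at row 1, column 1
warped = warp_latent(latent, M)  # the feature moves to row 2, column 3
```

A practical version would probably restrict the edit to the face and hand regions, for example by blending `warped` back into the original latent through a mask, and would use smoother interpolation than nearest-neighbour sampling. Those choices are not specified in this summary.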

Extensive experiments on several datasets demonstrate that PoseCrafter outperforms baselines pre-trained on a vast collection of videos across 8 commonly used metrics. Additionally, PoseCrafter can follow poses from different individuals or artificial edits while simultaneously retaining the human identity in the open-domain training video.

Critical Analysis

The paper presents a thorough evaluation of PoseCrafter's performance, but it does not explicitly discuss the limitations or potential downsides of the approach. For example, the method relies on having a high-quality reference frame from the training video, which may not always be available or easy to identify.

Additionally, the latent editing technique, while effective, may not be able to fully address all the nuances of pose discrepancies, especially for more complex or unconventional movements. The paper could have delved deeper into the failure cases or edge cases where PoseCrafter's performance may degrade.

Furthermore, the authors do not provide much insight into the computational costs or runtime efficiency of the PoseCrafter method, which could be an important consideration for real-world applications. Exploring these aspects in more detail could help readers better understand the practical implications and trade-offs of the proposed approach.

Conclusion

PoseCrafter is a promising one-shot method for generating personalized videos that follow flexible poses. By leveraging the capabilities of Stable Diffusion and ControlNet, the researchers have developed a technique that can produce high-quality videos without the need for corresponding ground-truth frames.

The key innovations of PoseCrafter, including reference frame selection, pose insertion, and latent editing, have enabled the model to outperform baselines on several common metrics. Additionally, the ability to follow poses from different individuals or artificial edits while retaining the human identity is a notable feature of the method.

While the paper provides a thorough evaluation, further exploration of the limitations and practical considerations could help readers better understand the strengths and weaknesses of the PoseCrafter approach. Overall, this research represents an exciting advancement in the field of video generation and pose-controlled animation.

Related Papers

PoseAnimate: Zero-shot high fidelity pose controllable character animation

Bingwen Zhu, Fanyi Wang, Tianyi Lu, Peng Liu, Jingwen Su, Jinxiu Liu, Yanhao Zhang, Zuxuan Wu, Guo-Jun Qi, Yu-Gang Jiang

Image-to-video (I2V) generation aims to create a video sequence from a single image, which requires high temporal coherence and visual fidelity. However, existing approaches suffer from inconsistency of character appearances and poor preservation of fine details. Moreover, they require a large amount of video data for training, which can be computationally demanding. To address these limitations, we propose PoseAnimate, a novel zero-shot I2V framework for character animation. PoseAnimate contains three key components: 1) a Pose-Aware Control Module (PACM) that incorporates diverse pose signals into text embeddings, to preserve character-independent content and maintain precise alignment of actions. 2) a Dual Consistency Attention Module (DCAM) that enhances temporal consistency and retains character identity and intricate background details. 3) a Mask-Guided Decoupling Module (MGDM) that refines distinct feature perception abilities, improving animation fidelity by decoupling the character and background. We also propose a Pose Alignment Transition Algorithm (PATA) to ensure smooth action transition. Extensive experiment results demonstrate that our approach outperforms the state-of-the-art training-based methods in terms of character consistency and detail fidelity. Moreover, it maintains a high level of temporal coherence throughout the generated animations.

6/6/2024

cs.CV cs.AI

Animate Anyone: Consistent and Controllable Image-to-Video Synthesis for Character Animation

Li Hu, Xin Gao, Peng Zhang, Ke Sun, Bang Zhang, Liefeng Bo

Character Animation aims to generate character videos from still images through driving signals. Currently, diffusion models have become the mainstream in visual generation research, owing to their robust generative capabilities. However, challenges persist in the realm of image-to-video, especially in character animation, where temporally maintaining consistency with detailed information from the character remains a formidable problem. In this paper, we leverage the power of diffusion models and propose a novel framework tailored for character animation. To preserve consistency of intricate appearance features from the reference image, we design ReferenceNet to merge detail features via spatial attention. To ensure controllability and continuity, we introduce an efficient pose guider to direct the character's movements and employ an effective temporal modeling approach to ensure smooth inter-frame transitions between video frames. By expanding the training data, our approach can animate arbitrary characters, yielding superior results in character animation compared to other image-to-video methods. Furthermore, we evaluate our method on benchmarks for fashion video and human dance synthesis, achieving state-of-the-art results.

6/14/2024

cs.CV

Do As I Do: Pose Guided Human Motion Copy

Sifan Wu, Zhenguang Liu, Beibei Zhang, Roger Zimmermann, Zhongjie Ba, Xiaosong Zhang, Kui Ren

Human motion copy is an intriguing yet challenging task in artificial intelligence and computer vision, which strives to generate a fake video of a target person performing the motion of a source person. The problem is inherently challenging due to the subtle human-body texture details to be generated and the temporal consistency to be considered. Existing approaches typically adopt a conventional GAN with an L1 or L2 loss to produce the target fake video, which intrinsically necessitates a large number of training samples that are challenging to acquire. Meanwhile, current methods still have difficulties in attaining realistic image details and temporal consistency, which unfortunately can be easily perceived by human observers. Motivated by this, we try to tackle the issues from three aspects: (1) We constrain pose-to-appearance generation with a perceptual loss and a theoretically motivated Gromov-Wasserstein loss to bridge the gap between pose and appearance. (2) We present an episodic memory module in the pose-to-appearance generation to propel continuous learning that helps the model learn from its past poor generations. We also utilize geometrical cues of the face to optimize facial details and refine each key body part with a dedicated local GAN. (3) We advocate generating the foreground in a sequence-to-sequence manner rather than a single-frame manner, explicitly enforcing temporal consistency. Empirical results on five datasets, iPER, ComplexMotion, SoloDance, Fish, and Mouse, demonstrate that our method is capable of generating realistic target videos while precisely copying motion from a source video. Our method significantly outperforms state-of-the-art approaches and gains 7.2% and 12.4% improvements in PSNR and FID respectively.

6/26/2024

cs.CV

Diversifying Human Pose in Synthetic Data for Aerial-view Human Detection

Yi-Ting Shen, Hyungtae Lee, Heesung Kwon, Shuvra S. Bhattacharyya

We present a framework for diversifying human poses in a synthetic dataset for aerial-view human detection. Our method firstly constructs a set of novel poses using a pose generator and then alters images in the existing synthetic dataset to assume the novel poses while maintaining the original style using an image translator. Since images corresponding to the novel poses are not available in training, the image translator is trained to be applicable only when the input and target poses are similar, thus training does not require the novel poses and their corresponding images. Next, we select a sequence of target novel poses from the novel pose set, using Dijkstra's algorithm to ensure that poses closer to each other are located adjacently in the sequence. Finally, we repeatedly apply the image translator to each target pose in sequence to produce a group of novel pose images representing a variety of different limited body movements from the source pose. Experiments demonstrate that, regardless of how the synthetic data is used for training or the data size, leveraging the pose-diversified synthetic dataset in training generally presents remarkably better accuracy than using the original synthetic dataset on three aerial-view human detection benchmarks (VisDrone, Okutama-Action, and ICG) in the few-shot regime.

5/28/2024

cs.CV