MagicPose: Realistic Human Poses and Facial Expressions Retargeting with Identity-aware Diffusion

2311.12052

Published 5/7/2024 by Di Chang, Yichun Shi, Quankai Gao, Jessica Fu, Hongyi Xu, Guoxian Song, Qing Yan, Yizhe Zhu, Xiao Yang, Mohammad Soleymani

cs.CV

Abstract

In this work, we propose MagicPose, a diffusion-based model for 2D human pose and facial expression retargeting. Specifically, given a reference image, we aim to generate a person's new images by controlling the poses and facial expressions while keeping the identity unchanged. To this end, we propose a two-stage training strategy to disentangle human motions and appearance (e.g., facial expressions, skin tone and dressing), consisting of (1) the pre-training of an appearance-control block and (2) learning appearance-disentangled pose control. Our novel design enables robust appearance control over generated human images, including body, facial attributes, and even background. By leveraging the prior knowledge of image diffusion models, MagicPose generalizes well to unseen human identities and complex poses without the need for additional fine-tuning. Moreover, the proposed model is easy to use and can be considered as a plug-in module/extension to Stable Diffusion. The code is available at: https://github.com/Boese0601/MagicDance

Overview

Proposes MagicPose, a diffusion-based model for 2D human pose and facial expression retargeting
Aims to generate new images of a person by controlling their poses and facial expressions while keeping their identity unchanged
Introduces a two-stage training strategy to disentangle human motions and appearance
Enables robust appearance control over generated human images, including body, facial attributes, and background
Leverages the prior knowledge of image diffusion models to generalize well to unseen human identities and complex poses
Can be considered as a plug-in module/extension to Stable Diffusion

Plain English Explanation

MagicPose is a new AI model that can take a reference image of a person and generate new images of that person with different poses and facial expressions, while keeping their overall identity the same. The key idea is to separate the person's appearance (like their face, skin tone, and clothes) from their motion (like their poses and facial expressions).

The model is trained in two stages. First, it learns to control the person's appearance. Then, it learns to control their pose and facial expressions separately from that appearance. This allows the model to generate realistic images where the person's identity is preserved, but their poses and expressions can be changed.

One of the cool things about MagicPose is that it can work well even on people the model has never seen before, and can handle complex poses without needing additional training. The model is also designed to be easily integrated with other AI systems, like Stable Diffusion, making it a versatile tool for creating personalized images.

Technical Explanation

The key technical innovations of MagicPose include:

Two-Stage Training: The model is trained in two stages. First, an "appearance-control block" is pre-trained to capture the person's visual appearance (such as facial features, skin tone, and clothing). Then, the model learns pose control that is disentangled from appearance, so pose and facial expression can change while appearance stays constant.
Diffusion-Based Approach: MagicPose builds on pre-trained image diffusion models, which have shown great success in generating high-quality images. Drawing on this prior knowledge, the model can generate realistic 2D poses and facial expressions, and it generalizes to unseen identities and complex poses without additional fine-tuning.
Robust Appearance Control: The model's novel design enables fine-grained control over the generated human images, including the body, facial attributes, and even the background. This allows for a high degree of customization and personalization.
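The two-stage schedule described above can be illustrated with a deliberately tiny toy. This is a sketch under stated assumptions, not the authors' training code: the "model" is two scalar weights standing in for the appearance-control block and the pose-control block, the data is synthetic, and freezing the appearance weight during stage 2 is an assumption made for illustration. The names predict, train_stage, and the params dictionary are hypothetical.

```python
import random


def predict(params, appearance, pose):
    # Hypothetical stand-in for the generator: one scalar weight per block.
    return params["appearance"] * appearance + params["pose"] * pose


def train_stage(params, trainable, data, lr=0.05, epochs=200):
    """Run SGD on squared error, updating only the weights named in `trainable`."""
    for _ in range(epochs):
        for appearance, pose, target in data:
            error = predict(params, appearance, pose) - target
            grads = {
                "appearance": 2 * error * appearance,
                "pose": 2 * error * pose,
            }
            for name in trainable:  # weights not listed here stay frozen
                params[name] -= lr * grads[name]
    return params


rng = random.Random(0)
TRUE_APPEARANCE, TRUE_POSE = 1.5, -0.7  # synthetic ground truth (assumption)

# Stage 1 data: no pose signal, so only the appearance weight is learnable.
stage1 = []
for _ in range(32):
    a = rng.uniform(-1, 1)
    stage1.append((a, 0.0, TRUE_APPEARANCE * a))

# Stage 2 data: appearance and pose signals are both present.
stage2 = []
for _ in range(32):
    a, p = rng.uniform(-1, 1), rng.uniform(-1, 1)
    stage2.append((a, p, TRUE_APPEARANCE * a + TRUE_POSE * p))

params = {"appearance": 0.0, "pose": 0.0}

# Stage 1: pre-train the appearance weight only.
train_stage(params, trainable=["appearance"], data=stage1)
appearance_after_stage1 = params["appearance"]

# Stage 2: keep appearance frozen (assumption) and learn the pose weight.
train_stage(params, trainable=["pose"], data=stage2)

print(params)
```

Because stage 2 only updates the pose weight, whatever the first stage learned about appearance is left untouched, which is the intuition behind changing pose while keeping identity fixed.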

Critical Analysis

The paper does a solid job of explaining the technical details of MagicPose and demonstrating its capabilities through various experiments. However, there are a few potential limitations and areas for further research:

Evaluation Metrics: The paper primarily relies on qualitative assessments of the generated images, which can be subjective. It would be helpful to see more quantitative evaluation metrics, such as user studies or comparisons to other state-of-the-art models.
Diversity and Generalization: While the model is shown to generalize well to unseen identities and poses, it's unclear how diverse the generated outputs can be. Exploring the model's ability to capture a wide range of human appearances and movements would be an interesting avenue for future research.
Real-World Applications: The paper focuses on the technical aspects of the model, but doesn't delve deeply into potential real-world applications. Discussing how MagicPose could be used in areas like animation, virtual try-on, or personalized content creation would help readers better understand the practical implications of this work.
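To make the Evaluation Metrics point above concrete, here is a minimal sketch of one widely used quantitative image metric, peak signal-to-noise ratio (PSNR). It is offered only as an example of what such a metric looks like. Whether the paper reports PSNR is not stated in this summary, and the pixel values below are made up for illustration.

```python
import math


def psnr(reference, generated, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equal-length pixel sequences."""
    if len(reference) != len(generated) or not reference:
        raise ValueError("inputs must be non-empty and the same length")
    # Mean squared error over all pixels.
    mse = sum((r - g) ** 2 for r, g in zip(reference, generated)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


# Made-up 8-bit pixel values for a ground-truth frame and a generated frame.
reference = [0, 64, 128, 255]
generated = [0, 60, 130, 250]
score = psnr(reference, generated)
print(f"PSNR: {score:.2f} dB")
```

Higher values mean the generated image is closer to the reference pixel by pixel. A metric like this complements qualitative inspection but does not by itself measure identity preservation.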

Overall, MagicPose represents a promising step forward in the field of human pose and facial expression control, and the researchers have done a commendable job in developing this innovative model. Further exploration of the model's capabilities and potential use cases could yield even more impactful insights.

Conclusion

In summary, MagicPose is a novel diffusion-based model that enables the generation of personalized human images with controlled poses and facial expressions. By disentangling appearance and motion, the model can produce realistic, customizable outputs that generalize well to unseen identities and complex poses. This work could have significant implications for a variety of applications, from animation and virtual try-on to personalized content creation. As the field of generative AI continues to advance, innovative approaches like MagicPose will likely play an increasingly important role in shaping the future of human-centric visual media.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

VividPose: Advancing Stable Video Diffusion for Realistic Human Image Animation

Qilin Wang, Zhengkai Jiang, Chengming Xu, Jiangning Zhang, Yabiao Wang, Xinyi Zhang, Yun Cao, Weijian Cao, Chengjie Wang, Yanwei Fu

Human image animation involves generating a video from a static image by following a specified pose sequence. Current approaches typically adopt a multi-stage pipeline that separately learns appearance and motion, which often leads to appearance degradation and temporal inconsistencies. To address these issues, we propose VividPose, an innovative end-to-end pipeline based on Stable Video Diffusion (SVD) that ensures superior temporal stability. To enhance the retention of human identity, we propose an identity-aware appearance controller that integrates additional facial information without compromising other appearance details such as clothing texture and background. This approach ensures that the generated videos maintain high fidelity to the identity of human subject, preserving key facial features across various poses. To accommodate diverse human body shapes and hand movements, we introduce a geometry-aware pose controller that utilizes both dense rendering maps from SMPL-X and sparse skeleton maps. This enables accurate alignment of pose and shape in the generated videos, providing a robust framework capable of handling a wide range of body shapes and dynamic hand movements. Extensive qualitative and quantitative experiments on the UBCFashion and TikTok benchmarks demonstrate that our method achieves state-of-the-art performance. Furthermore, VividPose exhibits superior generalization capabilities on our proposed in-the-wild dataset. Codes and models will be available.

5/29/2024

cs.CV

Pose-Diversified Augmentation with Diffusion Model for Person Re-Identification

Inès Hyeonsu Kim, JoungBin Lee, Soowon Son, Woojeong Jin, Kyusun Cho, Junyoung Seo, Min-Seop Kwak, Seokju Cho, JeongYeol Baek, Byeongwon Lee, Seungryong Kim

Person re-identification (Re-ID) often faces challenges due to variations in human poses and camera viewpoints, which significantly affect the appearance of individuals across images. Existing datasets frequently lack diversity and scalability in these aspects, hindering the generalization of Re-ID models to new camera systems. Previous methods have attempted to address these issues through data augmentation; however, they rely on human poses already present in the training dataset, failing to effectively reduce the human pose bias in the dataset. We propose Diff-ID, a novel data augmentation approach that incorporates sparse and underrepresented human pose and camera viewpoint examples into the training data, addressing the limited diversity in the original training data distribution. Our objective is to augment a training dataset that enables existing Re-ID models to learn features unbiased by human pose and camera viewpoint variations. To achieve this, we leverage the knowledge of pre-trained large-scale diffusion models. Using the SMPL model, we simultaneously capture both the desired human poses and camera viewpoints, enabling realistic human rendering. The depth information provided by the SMPL model indirectly conveys the camera viewpoints. By conditioning the diffusion model on both the human pose and camera viewpoint concurrently through the SMPL model, we generate realistic images with diverse human poses and camera viewpoints. Qualitative results demonstrate the effectiveness of our method in addressing human pose bias and enhancing the generalizability of Re-ID models compared to other data augmentation-based Re-ID approaches. The performance gains achieved by training Re-ID models on our offline augmented dataset highlight the potential of our proposed framework in improving the scalability and generalizability of person Re-ID models.

6/26/2024

cs.CV

MagicPose4D: Crafting Articulated Models with Appearance and Motion Control

Hao Zhang, Di Chang, Fang Li, Mohammad Soleymani, Narendra Ahuja

With the success of 2D and 3D visual generative models, there is growing interest in generating 4D content. Existing methods primarily rely on text prompts to produce 4D content, but they often fall short of accurately defining complex or rare motions. To address this limitation, we propose MagicPose4D, a novel framework for refined control over both appearance and motion in 4D generation. Unlike traditional methods, MagicPose4D accepts monocular videos as motion prompts, enabling precise and customizable motion generation. MagicPose4D comprises two key modules: i) Dual-Phase 4D Reconstruction Module, which operates in two phases. The first phase focuses on capturing the model's shape using accurate 2D supervision and less accurate but geometrically informative 3D pseudo-supervision without imposing skeleton constraints. The second phase refines the model using more accurate pseudo-3D supervision, obtained in the first phase, and introduces kinematic chain-based skeleton constraints to ensure physical plausibility. Additionally, we propose a Global-local Chamfer loss that aligns the overall distribution of predicted mesh vertices with the supervision while maintaining part-level alignment without extra annotations. ii) Cross-category Motion Transfer Module, which leverages the predictions from the 4D reconstruction module and uses a kinematic-chain-based skeleton to achieve cross-category motion transfer. It ensures smooth transitions between frames through dynamic rigidity, facilitating robust generalization without additional training. Through extensive experiments, we demonstrate that MagicPose4D significantly improves the accuracy and consistency of 4D content generation, outperforming existing methods in various benchmarks.

5/24/2024

cs.CV

Morphable Diffusion: 3D-Consistent Diffusion for Single-image Avatar Creation

Xiyi Chen, Marko Mihajlovic, Shaofei Wang, Sergey Prokudin, Siyu Tang

Recent advances in generative diffusion models have enabled the previously unfeasible capability of generating 3D assets from a single input image or a text prompt. In this work, we aim to enhance the quality and functionality of these models for the task of creating controllable, photorealistic human avatars. We achieve this by integrating a 3D morphable model into the state-of-the-art multi-view-consistent diffusion approach. We demonstrate that accurate conditioning of a generative pipeline on the articulated 3D model enhances the baseline model performance on the task of novel view synthesis from a single image. More importantly, this integration facilitates a seamless and accurate incorporation of facial expression and body pose control into the generation process. To the best of our knowledge, our proposed framework is the first diffusion model to enable the creation of fully 3D-consistent, animatable, and photorealistic human avatars from a single image of an unseen subject; extensive quantitative and qualitative evaluations demonstrate the advantages of our approach over existing state-of-the-art avatar creation models on both novel view and novel expression synthesis tasks. The code for our project is publicly available.

4/3/2024

cs.CV cs.AI