From Parts to Whole: A Unified Reference Framework for Controllable Human Image Generation






Published 4/24/2024 by Zehuan Huang, Hongxing Fan, Lipeng Wang, Lu Sheng



Recent advancements in controllable human image generation have led to zero-shot generation using structural signals (e.g., pose, depth) or facial appearance. Yet, generating human images conditioned on multiple parts of human appearance remains challenging. Addressing this, we introduce Parts2Whole, a novel framework designed for generating customized portraits from multiple reference images, including pose images and various aspects of human appearance. To achieve this, we first develop a semantic-aware appearance encoder to retain details of different human parts, which processes each image based on its textual label to a series of multi-scale feature maps rather than one image token, preserving the image dimension. Second, our framework supports multi-image conditioned generation through a shared self-attention mechanism that operates across reference and target features during the diffusion process. We enhance the vanilla attention mechanism by incorporating mask information from the reference human images, allowing for the precise selection of any part. Extensive experiments demonstrate the superiority of our approach over existing alternatives, offering advanced capabilities for multi-part controllable human image customization. See our project page at

Get summaries of the top AI research delivered straight to your inbox:


  • This paper presents a new approach for simultaneous granular identity expression control and personalized 3D human reconstruction from multi-view images.
  • The proposed method leverages synthetic data using 3D human reconstruction from wild data and cross-view, cross-pose completion of 3D human models to enable high-quality 3D human generation.
  • The framework also supports joint optimization of 3D human reconstruction and identity-aware animation for personalized, expressive 3D avatars.

Plain English Explanation

The paper describes a new way to create realistic 3D models of people that can be customized to match an individual's identity and expressions. The key idea is to leverage existing techniques for reconstructing 3D human shapes from 2D images, and then enhance these models to capture fine-grained details about the person's identity, such as their facial features and body shape.

The researchers use a combination of real-world data and synthetic, computer-generated data to train their model. This allows the system to learn the complex relationships between a person's visual appearance and their underlying identity. The end result is a framework that can generate personalized 3D human avatars that not only look like the individual, but can also dynamically change their expressions and pose in a natural way.

This technology could have applications in areas like virtual reality, video games, and animation, where realistic and customizable human characters are in high demand. By automating the process of creating these 3D models, it becomes much easier and more accessible for content creators to incorporate personalized human characters into their projects.

Technical Explanation

The paper presents a framework called Joint2Human that tackles the problem of simultaneous granular identity expression control and personalized 3D human reconstruction from multi-view images.

The core components of the framework are:

  1. A 3D human reconstruction module that leverages synthetic data to learn an accurate mapping from 2D images to 3D human shapes.

  2. An identity-aware animation module that performs cross-view, cross-pose completion of the 3D human models to capture fine-grained identity details, such as facial features and body shape.

  3. A joint optimization process that simultaneously optimizes the 3D reconstruction and identity-aware animation for personalized, expressive 3D human avatars.

The researchers evaluate their approach on several benchmark datasets and show that it outperforms state-of-the-art methods in terms of reconstruction accuracy, identity preservation, and animation quality.

Critical Analysis

The paper presents a promising approach for generating high-quality, personalized 3D human models. However, there are a few limitations and areas for further research:

  1. The method relies on synthetic data for training, which may not fully capture the complexity and diversity of real-world human appearances. Incorporating more diverse real-world data could further improve the model's generalization.

  2. The identity-aware animation module focuses on static identity features, such as facial structure and body shape. Extending the framework to also capture dynamic identity cues, like facial expressions and mannerisms, could lead to even more realistic and expressive avatars.

  3. The paper does not explore the potential ethical implications of this technology, such as the use of personalized avatars for deepfakes or other malicious applications. Further research is needed to understand and mitigate these risks.

Overall, the Joint2Human framework represents an important step forward in the field of 3D human modeling and animation. With continued refinement and responsible development, this technology could have a significant impact on various industries and applications.


This paper presents a novel approach for simultaneous granular identity expression control and personalized 3D human reconstruction from multi-view images. The proposed Joint2Human framework leverages synthetic data and cross-view, cross-pose completion to generate high-quality, identity-preserving 3D human avatars.

The key innovation is the joint optimization of 3D reconstruction and identity-aware animation, which allows the system to capture fine-grained details about a person's appearance and expressions. This technology could have a wide range of applications, from virtual reality and video games to animation and digital entertainment.

While the paper demonstrates promising results, there are still some limitations and areas for further research, such as improving the model's generalization to more diverse real-world data and exploring the potential ethical implications of this technology. Overall, the Joint2Human framework represents an important advancement in the field of 3D human modeling and animation.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers


Towards a Simultaneous and Granular Identity-Expression Control in Personalized Face Generation

Renshuai Liu, Bowen Ma, Wei Zhang, Zhipeng Hu, Changjie Fan, Tangjie Lv, Yu Ding, Xuan Cheng





In human-centric content generation, the pre-trained text-to-image models struggle to produce user-wanted portrait images, which retain the identity of individuals while exhibiting diverse expressions. This paper introduces our efforts towards personalized face generation. To this end, we propose a novel multi-modal face generation framework, capable of simultaneous identity-expression control and more fine-grained expression synthesis. Our expression control is so sophisticated that it can be specialized by the fine-grained emotional vocabulary. We devise a novel diffusion model that can undertake the task of simultaneously face swapping and reenactment. Due to the entanglement of identity and expression, it's nontrivial to separately and precisely control them in one framework, thus has not been explored yet. To overcome this, we propose several innovative designs in the conditional diffusion model, including balancing identity and expression encoder, improved midpoint sampling, and explicitly background conditioning. Extensive experiments have demonstrated the controllability and scalability of the proposed framework, in comparison with state-of-the-art text-to-image, face swapping, and face reenactment methods.

Read more


3D Human Reconstruction in the Wild with Synthetic Data Using Generative Models

3D Human Reconstruction in the Wild with Synthetic Data Using Generative Models

Yongtao Ge, Wenjia Wang, Yongfan Chen, Hao Chen, Chunhua Shen





In this work, we show that synthetic data created by generative models is complementary to computer graphics (CG) rendered data for achieving remarkable generalization performance on diverse real-world scenes for 3D human pose and shape estimation (HPS). Specifically, we propose an effective approach based on recent diffusion models, termed HumanWild, which can effortlessly generate human images and corresponding 3D mesh annotations. We first collect a large-scale human-centric dataset with comprehensive annotations, e.g., text captions and surface normal images. Then, we train a customized ControlNet model upon this dataset to generate diverse human images and initial ground-truth labels. At the core of this step is that we can easily obtain numerous surface normal images from a 3D human parametric model, e.g., SMPL-X, by rendering the 3D mesh onto the image plane. As there exists inevitable noise in the initial labels, we then apply an off-the-shelf foundation segmentation model, i.e., SAM, to filter negative data samples. Our data generation pipeline is flexible and customizable to facilitate different real-world tasks, e.g., ego-centric scenes and perspective-distortion scenes. The generated dataset comprises 0.79M images with corresponding 3D annotations, covering versatile viewpoints, scenes, and human identities. We train various HPS regressors on top of the generated data and evaluate them on a wide range of benchmarks (3DPW, RICH, EgoBody, AGORA, SSP-3D) to verify the effectiveness of the generated data. By exclusively employing generative models, we generate large-scale in-the-wild human images and high-quality annotations, eliminating the need for real-world data collection.

Read more



Listen, Disentangle, and Control: Controllable Speech-Driven Talking Head Generation

Changpeng Cai, Guinan Guo, Jiao Li, Junhao Su, Chenghao He, Jing Xiao, Yuanxu Chen, Lei Dai, Feiyu Zhu





Most earlier investigations on talking face generation have focused on the synchronization of lip motion and speech content. However, human head pose and facial emotions are equally important characteristics of natural human faces. While audio-driven talking face generation has seen notable advancements, existing methods either overlook facial emotions or are limited to specific individuals and cannot be applied to arbitrary subjects. In this paper, we propose a one-shot Talking Head Generation framework (SPEAK) that distinguishes itself from general Talking Face Generation by enabling emotional and postural control. Specifically, we introduce the Inter-Reconstructed Feature Disentanglement (IRFD) method to decouple human facial features into three latent spaces. We then design a face editing module that modifies speech content and facial latent codes into a single latent space. Subsequently, we present a novel generator that employs modified latent codes derived from the editing module to regulate emotional expression, head poses, and speech content in synthesizing facial animations. Extensive trials demonstrate that our method can generate realistic talking head with coordinated lip motions, authentic facial emotions, and smooth head movements. The demo video is available at the anonymous link:

Read more


Human Mesh Recovery from Arbitrary Multi-view Images

Human Mesh Recovery from Arbitrary Multi-view Images

Xiaoben Li, Mancheng Meng, Ziyan Wu, Terrence Chen, Fan Yang, Dinggang Shen





Human mesh recovery from arbitrary multi-view images involves two characteristics: the arbitrary camera poses and arbitrary number of camera views. Because of the variability, designing a unified framework to tackle this task is challenging. The challenges can be summarized as the dilemma of being able to simultaneously estimate arbitrary camera poses and recover human mesh from arbitrary multi-view images while maintaining flexibility. To solve this dilemma, we propose a divide and conquer framework for Unified Human Mesh Recovery (U-HMR) from arbitrary multi-view images. In particular, U-HMR consists of a decoupled structure and two main components: camera and body decoupling (CBD), camera pose estimation (CPE), and arbitrary view fusion (AVF). As camera poses and human body mesh are independent of each other, CBD splits the estimation of them into two sub-tasks for two individual sub-networks (ie, CPE and AVF) to handle respectively, thus the two sub-tasks are disentangled. In CPE, since each camera pose is unrelated to the others, we adopt a shared MLP to process all views in a parallel way. In AVF, in order to fuse multi-view information and make the fusion operation independent of the number of views, we introduce a transformer decoder with a SMPL parameters query token to extract cross-view features for mesh recovery. To demonstrate the efficacy and flexibility of the proposed framework and effect of each component, we conduct extensive experiments on three public datasets: Human3.6M, MPI-INF-3DHP, and TotalCapture.

Read more
