CapHuman: Capture Your Moments in Parallel Universes

arXiv:2402.00627

Published 5/20/2024 by Chao Liang, Fan Ma, Linchao Zhu, Yingying Deng, Yi Yang


Abstract

We concentrate on a novel human-centric image synthesis task, that is, given only one reference facial photograph, it is expected to generate specific individual images with diverse head positions, poses, facial expressions, and illuminations in different contexts. To accomplish this goal, we argue that our generative model should be capable of the following favorable characteristics: (1) a strong visual and semantic understanding of our world and human society for basic object and human image generation. (2) generalizable identity preservation ability. (3) flexible and fine-grained head control. Recently, large pre-trained text-to-image diffusion models have shown remarkable results, serving as a powerful generative foundation. As a basis, we aim to unleash the above two capabilities of the pre-trained model. In this work, we present a new framework named CapHuman. We embrace the encode then learn to align paradigm, which enables generalizable identity preservation for new individuals without cumbersome tuning at inference. CapHuman encodes identity features and then learns to align them into the latent space. Moreover, we introduce the 3D facial prior to equip our model with control over the human head in a flexible and 3D-consistent manner. Extensive qualitative and quantitative analyses demonstrate our CapHuman can produce well-identity-preserved, photo-realistic, and high-fidelity portraits with content-rich representations and various head renditions, superior to established baselines. Code and checkpoint will be released at https://github.com/VamosC/CapHuman.


Overview

The paper introduces a new task: generating diverse images of a specific individual, with varied head positions, poses, facial expressions, and lighting, from a single reference photograph.
The authors argue that their generative model should have strong visual and semantic understanding, generalizable identity preservation, and flexible fine-grained head control.
The proposed framework, named CapHuman, utilizes a pre-trained text-to-image diffusion model as a foundation and introduces novel components to achieve the desired capabilities.

Plain English Explanation

The researchers present a system that can generate a variety of images of a specific person, starting from just a single photograph of that person. For example, it could take a headshot and create new images of the same person with different head positions, facial expressions, or lighting conditions.

The key idea is to build on large, pre-trained AI models that generate images from text descriptions. These models already have a broad understanding of people and the world. The researchers add the two capabilities the models lack, namely preserving a specific person's identity and controlling the head precisely, so that they can create realistic and diverse images of specific individuals.

To do this, the CapHuman framework encodes the identity features of the person in the reference photo and then learns to align those features with the model's latent space. This lets the model preserve the individual's identity across the generated images without being re-tuned for each new person. The researchers also incorporate a 3D facial prior, which gives the model flexible, 3D-consistent control over the head.
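To make "3D-consistent" concrete, here is a toy sketch of the underlying geometry. It is not CapHuman's code: the landmark coordinates, the function names, and the orthographic projection are all assumptions chosen for illustration. The point is that every requested pose is derived from the same 3D shape, so the 2D layouts handed to a generator always agree with one another.

```python
import math

# Toy 3D "face": a few landmarks (x, y, z), with z pointing toward the camera.
# These coordinates are invented for illustration only.
LANDMARKS = {
    "left_eye": (-1.0, 1.0, 0.0),
    "right_eye": (1.0, 1.0, 0.0),
    "nose_tip": (0.0, 0.0, 1.0),
    "chin": (0.0, -2.0, 0.0),
}


def rotate(point, yaw_deg, pitch_deg):
    """Rotate a 3D point by yaw (about the y-axis), then pitch (about the x-axis)."""
    x, y, z = point
    yaw, pitch = math.radians(yaw_deg), math.radians(pitch_deg)
    # Yaw: turn the head left or right.
    x, z = (x * math.cos(yaw) + z * math.sin(yaw),
            -x * math.sin(yaw) + z * math.cos(yaw))
    # Pitch: tilt the head up or down.
    y, z = (y * math.cos(pitch) - z * math.sin(pitch),
            y * math.sin(pitch) + z * math.cos(pitch))
    return (x, y, z)


def project(point):
    """Orthographic projection: drop depth to get 2D image-plane coordinates."""
    x, y, _ = point
    return (x, y)


def head_condition(yaw_deg, pitch_deg):
    """Return 2D landmark positions for the requested head pose."""
    return {name: project(rotate(p, yaw_deg, pitch_deg))
            for name, p in LANDMARKS.items()}


if __name__ == "__main__":
    # The same 3D face viewed frontally and turned by 30 degrees.
    print(head_condition(0, 0))
    print(head_condition(30, 0))
```

Because rotation is rigid, the 3D distance between any two landmarks is the same in every pose; only their 2D projections change.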

The result is that CapHuman can produce high-quality portraits of a person with a wide range of variations, all while preserving the individual's identity. This could have applications in areas like 3D human understanding and retrieval or cross-view, cross-pose completion.

Technical Explanation

The authors propose a framework called CapHuman that builds on top of pre-trained text-to-image diffusion models. The key innovations are:

Identity Encoding and Alignment: CapHuman encodes the identity features from the reference photo and then learns to align those features in the model's latent space. This allows the generated images to preserve the identity of the individual, even with diverse head poses, expressions, and lighting.
3D Facial Prior: The researchers incorporate a 3D facial prior into the model, which gives it better control over the head pose and orientation in a consistent, 3D-aware manner. This enables fine-grained control over the head renditions.
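The "encode then learn to align" idea can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the encoder is a deterministic stand-in, the aligner is a plain linear map, and the dimensions and the choice to append identity tokens to text tokens are hypothetical. The summary does not specify how CapHuman injects the aligned features.

```python
import random

ID_DIM = 8          # size of the identity embedding (assumed)
TOKEN_DIM = 4       # width of the generator's conditioning tokens (assumed)
NUM_ID_TOKENS = 2   # how many tokens represent one identity (assumed)


def encode_identity(face_pixels):
    """Stand-in for a face encoder: maps an image to a fixed-size vector."""
    rng = random.Random(sum(face_pixels))  # deterministic toy "features"
    return [rng.uniform(-1, 1) for _ in range(ID_DIM)]


class Aligner:
    """Linear map from identity space to the generator's token space.

    In a real system these weights would be learned; here they are random.
    """

    def __init__(self, seed=0):
        rng = random.Random(seed)
        # One TOKEN_DIM x ID_DIM weight matrix per identity token.
        self.weights = [
            [[rng.gauss(0, 0.1) for _ in range(ID_DIM)] for _ in range(TOKEN_DIM)]
            for _ in range(NUM_ID_TOKENS)
        ]

    def __call__(self, identity):
        # Matrix-vector product for each token's weight matrix.
        return [
            [sum(w * v for w, v in zip(row, identity)) for row in matrix]
            for matrix in self.weights
        ]


def build_condition(text_tokens, identity_tokens):
    """Append identity tokens to the text tokens the generator consumes."""
    return text_tokens + identity_tokens


if __name__ == "__main__":
    text_tokens = [[0.0] * TOKEN_DIM for _ in range(3)]   # placeholder prompt tokens
    identity = encode_identity([12, 200, 34])             # placeholder pixel values
    condition = build_condition(text_tokens, Aligner()(identity))
    print(len(condition), "conditioning tokens of width", len(condition[0]))
```

The design point this illustrates is that a new person only requires one forward pass through the encoder and aligner, which is consistent with the paper's claim that no per-person tuning is needed at inference.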

The authors conduct extensive qualitative and quantitative evaluations, demonstrating that CapHuman can generate photo-realistic portraits with rich content representations and diverse head variations, outperforming established baselines.
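The summary does not say which metrics the quantitative evaluation uses. One common way to score identity preservation, shown here purely as an assumption, is the cosine similarity between a face-recognition embedding of the reference photo and embeddings of the generated images. The embeddings below are placeholders; a real evaluation would obtain them from a face-recognition model.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


def mean_identity_score(reference, generated):
    """Average similarity between the reference embedding and each generated one."""
    if not generated:
        raise ValueError("need at least one generated embedding")
    return sum(cosine_similarity(reference, g) for g in generated) / len(generated)


if __name__ == "__main__":
    reference = [0.9, 0.1, 0.4]                       # placeholder embedding
    generated = [[0.8, 0.2, 0.5], [0.1, 0.9, 0.3]]    # placeholder embeddings
    print(mean_identity_score(reference, generated))
```

A higher mean score would indicate that the generated portraits stay closer to the reference identity under this particular embedding.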

Critical Analysis

The paper presents a compelling approach to the challenging task of generating diverse images of a specific individual. The use of pre-trained text-to-image models as a foundation, combined with the novel identity encoding and 3D facial prior components, is a promising direction.

One potential limitation is that the method depends on a reference photo of the individual. It would be interesting to see whether identity could be specified in other ways when no reference photo is available.

Additionally, the authors do not discuss the potential biases or fairness considerations that may arise when generating diverse images of individuals. This is an important aspect that should be carefully considered, especially for applications involving human likenesses.

Further research could explore the model's ability to handle more extreme variations in head pose, lighting, and expression, as well as its performance on more diverse datasets beyond just facial images.

Conclusion

The CapHuman framework represents an innovative step towards generating diverse, identity-preserving images of individuals. By leveraging pre-trained text-to-image models and introducing novel identity encoding and 3D facial prior components, the researchers have demonstrated the ability to produce high-quality, content-rich portraits with a wide range of head renditions.

This research has the potential to drive advancements in areas like 3D human understanding and retrieval and cross-view, cross-pose completion, while also raising important considerations around bias and fairness. As the field of generative AI continues to evolve, work like CapHuman highlights the importance of balancing technological progress with responsible development and deployment.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Improving face generation quality and prompt following with synthetic captions

Michail Tarasiou, Stylianos Moschoglou, Jiankang Deng, Stefanos Zafeiriou

Recent advancements in text-to-image generation using diffusion models have significantly improved the quality of generated images and expanded the ability to depict a wide range of objects. However, ensuring that these models adhere closely to the text prompts remains a considerable challenge. This issue is particularly pronounced when trying to generate photorealistic images of humans. Without significant prompt engineering efforts models often produce unrealistic images and typically fail to incorporate the full extent of the prompt information. This limitation can be largely attributed to the nature of captions accompanying the images used in training large scale diffusion models, which typically prioritize contextual information over details related to the person's appearance. In this paper we address this issue by introducing a training-free pipeline designed to generate accurate appearance descriptions from images of people. We apply this method to create approximately 250,000 captions for publicly available face datasets. We then use these synthetic captions to fine-tune a text-to-image diffusion model. Our results demonstrate that this approach significantly improves the model's ability to generate high-quality, realistic human faces and enhances adherence to the given prompts, compared to the baseline model. We share our synthetic captions, pretrained checkpoints and training code.

5/20/2024

cs.CV cs.LG


From Parts to Whole: A Unified Reference Framework for Controllable Human Image Generation

Zehuan Huang, Hongxing Fan, Lipeng Wang, Lu Sheng

Recent advancements in controllable human image generation have led to zero-shot generation using structural signals (e.g., pose, depth) or facial appearance. Yet, generating human images conditioned on multiple parts of human appearance remains challenging. Addressing this, we introduce Parts2Whole, a novel framework designed for generating customized portraits from multiple reference images, including pose images and various aspects of human appearance. To achieve this, we first develop a semantic-aware appearance encoder to retain details of different human parts, which processes each image based on its textual label to a series of multi-scale feature maps rather than one image token, preserving the image dimension. Second, our framework supports multi-image conditioned generation through a shared self-attention mechanism that operates across reference and target features during the diffusion process. We enhance the vanilla attention mechanism by incorporating mask information from the reference human images, allowing for the precise selection of any part. Extensive experiments demonstrate the superiority of our approach over existing alternatives, offering advanced capabilities for multi-part controllable human image customization. See our project page at https://huanngzh.github.io/Parts2Whole/.

4/24/2024

cs.CV


Towards a Simultaneous and Granular Identity-Expression Control in Personalized Face Generation

Renshuai Liu, Bowen Ma, Wei Zhang, Zhipeng Hu, Changjie Fan, Tangjie Lv, Yu Ding, Xuan Cheng

In human-centric content generation, the pre-trained text-to-image models struggle to produce user-wanted portrait images, which retain the identity of individuals while exhibiting diverse expressions. This paper introduces our efforts towards personalized face generation. To this end, we propose a novel multi-modal face generation framework, capable of simultaneous identity-expression control and more fine-grained expression synthesis. Our expression control is so sophisticated that it can be specialized by the fine-grained emotional vocabulary. We devise a novel diffusion model that can undertake the task of simultaneously face swapping and reenactment. Due to the entanglement of identity and expression, it's nontrivial to separately and precisely control them in one framework, thus has not been explored yet. To overcome this, we propose several innovative designs in the conditional diffusion model, including balancing identity and expression encoder, improved midpoint sampling, and explicitly background conditioning. Extensive experiments have demonstrated the controllability and scalability of the proposed framework, in comparison with state-of-the-art text-to-image, face swapping, and face reenactment methods.

4/9/2024

cs.CV


ChatHuman: Language-driven 3D Human Understanding with Retrieval-Augmented Tool Reasoning

Jing Lin, Yao Feng, Weiyang Liu, Michael J. Black

Numerous methods have been proposed to detect, estimate, and analyze properties of people in images, including the estimation of 3D pose, shape, contact, human-object interaction, emotion, and more. Each of these methods works in isolation instead of synergistically. Here we address this problem and build a language-driven human understanding system -- ChatHuman, which combines and integrates the skills of many different methods. To do so, we finetune a Large Language Model (LLM) to select and use a wide variety of existing tools in response to user inputs. In doing so, ChatHuman is able to combine information from multiple tools to solve problems more accurately than the individual tools themselves and to leverage tool output to improve its ability to reason about humans. The novel features of ChatHuman include leveraging academic publications to guide the application of 3D human-related tools, employing a retrieval-augmented generation model to generate in-context-learning examples for handling new tools, and discriminating and integrating tool results to enhance 3D human understanding. Our experiments show that ChatHuman outperforms existing models in both tool selection accuracy and performance across multiple 3D human-related tasks. ChatHuman is a step towards consolidating diverse methods for human analysis into a single, powerful, system for 3D human reasoning.

5/8/2024

cs.CV cs.LG