TextGaze: Gaze-Controllable Face Generation with Natural Language

Read original: arXiv:2404.17486 - Published 9/10/2024 by Hengfei Wang, Zhongqun Zhang, Yihua Cheng, Hyung Jin Chang

Overview

This paper presents a novel approach for generating face images with specific gaze information, which is an important and challenging task in computer vision and graphics.
Existing methods typically take numeric gaze values as direct input, which is unnatural and requires annotated gaze datasets for training, limiting their applicability.
The authors propose a text-to-face generation method that takes textual descriptions of gaze and head pose as input and generates corresponding face images.
They introduce a new text-of-gaze dataset and a gaze-controllable text-to-face generation method.

Plain English Explanation

The paper focuses on a task called "gaze-controllable face generation," which aims to generate face images with specific gaze and head pose information. Existing approaches often require directly inputting gaze values, which can be unnatural and limited by the availability of annotated gaze datasets for training.

To address this, the authors propose a new method that takes textual descriptions of gaze and head pose as input and generates the corresponding face images. For example, you could provide a text description like "looking to the left with a slight upward tilt of the head" and the model would generate a face image matching that description.

The key innovations in this work are:

A new text-of-gaze dataset containing over 90,000 text descriptions of gaze and head pose, which the authors use to train their model.
A gaze-controllable text-to-face generation method that uses a sketch-conditioned face diffusion module and a model-based sketch diffusion module to generate face images from the textual descriptions.

The authors demonstrate the effectiveness of their approach on the FFHQ dataset and plan to release the dataset and code for future research.

Technical Explanation

The paper introduces gaze-controllable face generation from text as a new task: the goal is to generate face images from textual descriptions of gaze and head pose. Existing methods instead take numeric gaze values as input, which is an unnatural interface and requires annotated gaze datasets for training.

To address this, the authors first introduce a new text-of-gaze dataset containing over 90,000 text descriptions of gaze and head pose. This dataset serves as the training data for their proposed model.
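
The summary does not say how the text descriptions were produced, so the following is only an illustrative sketch of one way a numeric gaze and head pose annotation could be paired with a sentence. The `GazePose` class, the angle sign conventions, the 5-degree dead zone, and the sentence template are all assumptions made for this example, not details taken from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GazePose:
    """Hypothetical numeric annotation; all angles are in degrees."""
    gaze_yaw: float    # assumed convention: positive = looking left
    gaze_pitch: float  # assumed convention: positive = looking up
    head_yaw: float    # same sign conventions as the gaze angles
    head_pitch: float


def _direction(yaw: float, pitch: float, dead_zone: float = 5.0) -> str:
    """Map a (yaw, pitch) pair to a coarse direction phrase.

    Angles smaller than `dead_zone` (an arbitrary threshold) are
    treated as "no noticeable deviation" on that axis.
    """
    horiz = "" if abs(yaw) < dead_zone else ("left" if yaw > 0 else "right")
    vert = "" if abs(pitch) < dead_zone else ("up" if pitch > 0 else "down")
    if not horiz and not vert:
        return "straight ahead"
    return " and ".join(part for part in (vert, horiz) if part)


def describe(sample: GazePose) -> str:
    """Build one templated sentence for a single annotation."""
    gaze = _direction(sample.gaze_yaw, sample.gaze_pitch)
    head = _direction(sample.head_yaw, sample.head_pitch)
    return f"The person is looking {gaze}, with the head facing {head}."
```

Under these assumptions, `describe(GazePose(20, 0, 0, 10))` yields a sentence describing a leftward gaze with the head facing up. A real dataset would need far more varied wording than a single template provides.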

The authors then present a gaze-controllable text-to-face generation method that consists of two main components:

Sketch-conditioned face diffusion module: This module generates face images from a face sketch, which is defined based on facial landmarks and an eye segmentation map.
Model-based sketch diffusion module: This module employs a 3D face model to generate the face sketch from the input text description of gaze and head pose.

By combining these two modules, the authors' method can generate face images that match the given textual description of gaze and head pose.
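
The data flow between the two modules can be sketched as a simple composition: text goes in, a face sketch comes out of the first stage, and the second stage turns that sketch into an image. The code below shows only this interface. `FaceSketch`, `generate_face`, and the two `dummy_*` functions are hypothetical names invented for illustration; the real modules are diffusion models whose internals are not described in this summary.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Point = Tuple[float, float]
Image = List[List[float]]


@dataclass
class FaceSketch:
    """Intermediate representation: facial landmarks plus an eye segmentation map."""
    landmarks: List[Point]      # 2D facial landmark positions
    eye_mask: List[List[int]]   # binary map, 1 where a pixel belongs to an eye


# Hypothetical stage interfaces.
SketchModule = Callable[[str], FaceSketch]   # text description -> face sketch
FaceModule = Callable[[FaceSketch], Image]   # face sketch -> face image


def generate_face(text: str, sketch_module: SketchModule,
                  face_module: FaceModule) -> Image:
    """Run the two stages in sequence."""
    sketch = sketch_module(text)   # stage 1: model-based sketch diffusion
    return face_module(sketch)     # stage 2: sketch-conditioned face diffusion


def dummy_sketch_module(text: str) -> FaceSketch:
    """Placeholder: ignores the text and returns a fixed 4x4 sketch."""
    size = 4
    mask = [[0] * size for _ in range(size)]
    mask[1][1] = mask[1][2] = 1
    return FaceSketch(landmarks=[(1.0, 1.0), (2.0, 1.0)], eye_mask=mask)


def dummy_face_module(sketch: FaceSketch) -> Image:
    """Placeholder: copies the eye mask into a float 'image'."""
    return [[float(v) for v in row] for row in sketch.eye_mask]


image = generate_face("looking to the left", dummy_sketch_module, dummy_face_module)
```

Keeping the sketch as an explicit intermediate type means either stage could be replaced or inspected independently, which matches the two-module description above.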

The authors evaluate their approach on the FFHQ dataset and demonstrate its effectiveness in generating realistic face images with the desired gaze and head pose characteristics.

Critical Analysis

The paper presents a novel and interesting approach to gaze-controllable face generation, addressing the limitations of existing methods that rely on directly inputting gaze values. The introduction of the text-of-gaze dataset is a valuable contribution, as it provides a rich resource for training models that can understand and generate faces based on textual descriptions of gaze and head pose.

One potential limitation of the work is the reliance on a 3D face model in the sketch diffusion module. While this approach can generate realistic face sketches from text, it may be constrained by the accuracy and coverage of the 3D model. It would be interesting to see if the authors could explore alternative methods for generating face sketches directly from text, potentially using text-driven image editing techniques or other approaches.

Additionally, the paper does not provide a detailed analysis of the limitations or failure cases of the proposed method. It would be helpful to understand the types of gaze and head pose descriptions that the model struggles with, as well as any biases or inconsistencies in the generated face images.

Overall, the paper presents a promising approach to gaze-controllable face generation. Further refinement of the method, together with a more comprehensive evaluation, could lead to more robust and versatile text-driven face generation.

Conclusion

This paper introduces a novel gaze-controllable face generation method that takes textual descriptions of gaze and head pose as input and generates the corresponding face images. The key innovations include a new text-of-gaze dataset and a two-module generation approach that combines sketch-conditioned face diffusion and model-based sketch diffusion.

The authors demonstrate the effectiveness of their method on the FFHQ dataset and plan to release the dataset and code for future research. This work represents an important step forward in bridging the gap between textual descriptions and visual generation, with potential applications in various fields, such as computer graphics, virtual reality, and human-computer interaction.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

TextGaze: Gaze-Controllable Face Generation with Natural Language

Hengfei Wang, Zhongqun Zhang, Yihua Cheng, Hyung Jin Chang

Generating face image with specific gaze information has attracted considerable attention. Existing approaches typically input gaze values directly for face generation, which is unnatural and requires annotated gaze datasets for training, thereby limiting its application. In this paper, we present a novel gaze-controllable face generation task. Our approach inputs textual descriptions that describe human gaze and head behavior and generates corresponding face images. Our work first introduces a text-of-gaze dataset containing over 90k text descriptions spanning a dense distribution of gaze and head poses. We further propose a gaze-controllable text-to-face method. Our method contains a sketch-conditioned face diffusion module and a model-based sketch diffusion module. We define a face sketch based on facial landmarks and eye segmentation map. The face diffusion module generates face images from the face sketch, and the sketch diffusion module employs a 3D face model to generate face sketch from text description. Experiments on the FFHQ dataset show the effectiveness of our method. We will release our dataset and code for future research.

9/10/2024

Emotional Conversation: Empowering Talking Faces with Cohesive Expression, Gaze and Pose Generation

Jiadong Liang, Feng Lu

Vivid talking face generation holds immense potential applications across diverse multimedia domains, such as film and game production. While existing methods accurately synchronize lip movements with input audio, they typically ignore crucial alignments between emotion and facial cues, which include expression, gaze, and head pose. These alignments are indispensable for synthesizing realistic videos. To address these issues, we propose a two-stage audio-driven talking face generation framework that employs 3D facial landmarks as intermediate variables. This framework achieves collaborative alignment of expression, gaze, and pose with emotions through self-supervised learning. Specifically, we decompose this task into two key steps, namely speech-to-landmarks synthesis and landmarks-to-face generation. The first step focuses on simultaneously synthesizing emotionally aligned facial cues, including normalized landmarks that represent expressions, gaze, and head pose. These cues are subsequently reassembled into relocated facial landmarks. In the second step, these relocated landmarks are mapped to latent key points using self-supervised learning and then input into a pretrained model to create high-quality face images. Extensive experiments on the MEAD dataset demonstrate that our model significantly advances the state-of-the-art performance in both visual quality and emotional alignment.

6/13/2024

Improving face generation quality and prompt following with synthetic captions

Michail Tarasiou, Stylianos Moschoglou, Jiankang Deng, Stefanos Zafeiriou

Recent advancements in text-to-image generation using diffusion models have significantly improved the quality of generated images and expanded the ability to depict a wide range of objects. However, ensuring that these models adhere closely to the text prompts remains a considerable challenge. This issue is particularly pronounced when trying to generate photorealistic images of humans. Without significant prompt engineering efforts models often produce unrealistic images and typically fail to incorporate the full extent of the prompt information. This limitation can be largely attributed to the nature of captions accompanying the images used in training large scale diffusion models, which typically prioritize contextual information over details related to the person's appearance. In this paper we address this issue by introducing a training-free pipeline designed to generate accurate appearance descriptions from images of people. We apply this method to create approximately 250,000 captions for publicly available face datasets. We then use these synthetic captions to fine-tune a text-to-image diffusion model. Our results demonstrate that this approach significantly improves the model's ability to generate high-quality, realistic human faces and enhances adherence to the given prompts, compared to the baseline model. We share our synthetic captions, pretrained checkpoints and training code.

5/20/2024

Towards a Simultaneous and Granular Identity-Expression Control in Personalized Face Generation

Renshuai Liu, Bowen Ma, Wei Zhang, Zhipeng Hu, Changjie Fan, Tangjie Lv, Yu Ding, Xuan Cheng

In human-centric content generation, the pre-trained text-to-image models struggle to produce user-wanted portrait images, which retain the identity of individuals while exhibiting diverse expressions. This paper introduces our efforts towards personalized face generation. To this end, we propose a novel multi-modal face generation framework, capable of simultaneous identity-expression control and more fine-grained expression synthesis. Our expression control is so sophisticated that it can be specialized by the fine-grained emotional vocabulary. We devise a novel diffusion model that can undertake the task of simultaneously face swapping and reenactment. Due to the entanglement of identity and expression, it's nontrivial to separately and precisely control them in one framework, thus has not been explored yet. To overcome this, we propose several innovative designs in the conditional diffusion model, including balancing identity and expression encoder, improved midpoint sampling, and explicitly background conditioning. Extensive experiments have demonstrated the controllability and scalability of the proposed framework, in comparison with state-of-the-art text-to-image, face swapping, and face reenactment methods.

4/9/2024