eMotion-GAN: A Motion-based GAN for Photorealistic and Facial Expression Preserving Frontal View Synthesis

2404.09940

Published 4/16/2024 by Omar Ikne, Benjamin Allaert, Ioan Marius Bilasco, Hazem Wannous

eMotion-GAN: A Motion-based GAN for Photorealistic and Facial Expression Preserving Frontal View Synthesis

Abstract

Many existing facial expression recognition (FER) systems encounter substantial performance degradation when faced with variations in head pose. Numerous frontalization methods have been proposed to enhance these systems' performance under such conditions. However, they often introduce undesirable deformations, rendering them less suitable for precise facial expression analysis. In this paper, we present eMotion-GAN, a novel deep learning approach designed for frontal view synthesis while preserving facial expressions within the motion domain. Considering the motion induced by head variation as noise and the motion induced by facial expression as the relevant information, our model is trained to filter out the noisy motion in order to retain only the motion related to facial expression. The filtered motion is then mapped onto a neutral frontal face to generate the corresponding expressive frontal face. We conducted extensive evaluations using several widely recognized dynamic FER datasets, which encompass sequences exhibiting various degrees of head pose variations in both intensity and orientation. Our results demonstrate the effectiveness of our approach in significantly reducing the FER performance gap between frontal and non-frontal faces. Specifically, we achieved a FER improvement of up to +5% for small pose variations and up to +20% improvement for larger pose variations. Code available at url{https://github.com/o-ikne/eMotion-GAN.git}.

Create account to get full access

Overview

• This paper introduces eMotion-GAN, a generative adversarial network (GAN) model designed for photorealistic and facial expression-preserving frontal view synthesis.

• The key innovations of eMotion-GAN include a motion-based generator, a discriminator that preserves facial expressions, and a self-supervised loss function to ensure high-quality facial details.

Plain English Explanation

eMotion-GAN is a machine learning model that can generate realistic frontal-facing images of a person's face, while also preserving the person's facial expressions. This is a challenging task because it requires accurately modeling both the visual appearance and the dynamic movements of the face.

The model works by taking in an input image of a person's face from a non-frontal angle, and then generating a new image that shows that person's face from a frontal perspective. Crucially, the generated image maintains the same facial expressions and emotional state as the original input.

This could be useful for applications like video conferencing, virtual avatars, or digital dubbing, where it's important to have realistic and expressive facial animations. By preserving the person's natural expressions, eMotion-GAN can create more lifelike and engaging visual outputs.

Technical Explanation

• eMotion-GAN uses a motion-based generator, which means the model explicitly learns to generate facial movements and expressions, in addition to the static appearance of the face.

• The discriminator network is designed to assess not just the realism of the generated images, but also whether the facial expressions are accurately preserved from the input.

• The model is trained using a self-supervised loss function, which allows it to learn important facial details without requiring labeled training data.

• Key technical innovations include a motion-encoding module, an expression-preserving discriminator, and a self-supervised perceptual loss function.

Critical Analysis

• While eMotion-GAN demonstrates promising results, the paper acknowledges that further work is needed to improve the model's ability to handle extreme head poses and diverse facial expressions.

• The authors also note that the model's performance may be limited by the quality and diversity of the training data used.

• An interesting direction for future research would be to explore how eMotion-GAN's techniques could be applied to 3D facial expression models or combined with motion synthesis approaches for more holistic facial animation.

Conclusion

eMotion-GAN represents an important advance in the field of facial image synthesis, demonstrating the ability to generate photorealistic frontal views while preserving the original facial expressions. This technology could have significant applications in areas like video-based deepfake detection, personalized avatar creation, and 4D facial expression modeling. As the research in this area continues to progress, we can expect to see increasingly lifelike and expressive digital representations of human faces.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

GANmut: Generating and Modifying Facial Expressions

Maria Surani

In the realm of emotion synthesis, the ability to create authentic and nuanced facial expressions continues to gain importance. The GANmut study discusses a recently introduced advanced GAN framework that, instead of relying on predefined labels, learns a dynamic and interpretable emotion space. This methodology maps each discrete emotion as vectors starting from a neutral state, their magnitude reflecting the emotion's intensity. The current project aims to extend the study of this framework by benchmarking across various datasets, image resolutions, and facial detection methodologies. This will involve conducting a series of experiments using two emotional datasets: Aff-Wild2 and AffNet. Aff-Wild2 contains videos captured in uncontrolled environments, which include diverse camera angles, head positions, and lighting conditions, providing a real-world challenge. AffNet offers images with labelled emotions, improving the diversity of emotional expressions available for training. The first two experiments will focus on training GANmut using the Aff-Wild2 dataset, processed with either RetinaFace or MTCNN, both of which are high-performance deep learning face detectors. This setup will help determine how well GANmut can learn to synthesise emotions under challenging conditions and assess the comparative effectiveness of these face detection technologies. The subsequent two experiments will merge the Aff-Wild2 and AffNet datasets, combining the real world variability of Aff-Wild2 with the diverse emotional labels of AffNet. The same face detectors, RetinaFace and MTCNN, will be employed to evaluate whether the enhanced diversity of the combined datasets improves GANmut's performance and to compare the impact of each face detection method in this hybrid setup.

6/18/2024

cs.CV

📶

New!Language-Guided Face Animation by Recurrent StyleGAN-based Generator

Tiankai Hang, Huan Yang, Bei Liu, Jianlong Fu, Xin Geng, Baining Guo

Recent works on language-guided image manipulation have shown great power of language in providing rich semantics, especially for face images. However, the other natural information, motions, in language is less explored. In this paper, we leverage the motion information and study a novel task, language-guided face animation, that aims to animate a static face image with the help of languages. To better utilize both semantics and motions from languages, we propose a simple yet effective framework. Specifically, we propose a recurrent motion generator to extract a series of semantic and motion information from the language and feed it along with visual information to a pre-trained StyleGAN to generate high-quality frames. To optimize the proposed framework, three carefully designed loss functions are proposed including a regularization loss to keep the face identity, a path length regularization loss to ensure motion smoothness, and a contrastive loss to enable video synthesis with various language guidance in one single model. Extensive experiments with both qualitative and quantitative evaluations on diverse domains (textit{e.g.,} human face, anime face, and dog face) demonstrate the superiority of our model in generating high-quality and realistic videos from one still image with the guidance of language. Code will be available at https://github.com/TiankaiHang/language-guided-animation.git.

7/4/2024

cs.CV

Emotional Conversation: Empowering Talking Faces with Cohesive Expression, Gaze and Pose Generation

Jiadong Liang, Feng Lu

Vivid talking face generation holds immense potential applications across diverse multimedia domains, such as film and game production. While existing methods accurately synchronize lip movements with input audio, they typically ignore crucial alignments between emotion and facial cues, which include expression, gaze, and head pose. These alignments are indispensable for synthesizing realistic videos. To address these issues, we propose a two-stage audio-driven talking face generation framework that employs 3D facial landmarks as intermediate variables. This framework achieves collaborative alignment of expression, gaze, and pose with emotions through self-supervised learning. Specifically, we decompose this task into two key steps, namely speech-to-landmarks synthesis and landmarks-to-face generation. The first step focuses on simultaneously synthesizing emotionally aligned facial cues, including normalized landmarks that represent expressions, gaze, and head pose. These cues are subsequently reassembled into relocated facial landmarks. In the second step, these relocated landmarks are mapped to latent key points using self-supervised learning and then input into a pretrained model to create high-quality face images. Extensive experiments on the MEAD dataset demonstrate that our model significantly advances the state-of-the-art performance in both visual quality and emotional alignment.

6/13/2024

cs.CV

FacEnhance: Facial Expression Enhancing with Recurrent DDPMs

Hamza Bouzid, Lahoucine Ballihi

Facial expressions, vital in non-verbal human communication, have found applications in various computer vision fields like virtual reality, gaming, and emotional AI assistants. Despite advancements, many facial expression generation models encounter challenges such as low resolution (e.g., 32x32 or 64x64 pixels), poor quality, and the absence of background details. In this paper, we introduce FacEnhance, a novel diffusion-based approach addressing constraints in existing low-resolution facial expression generation models. FacEnhance enhances low-resolution facial expression videos (64x64 pixels) to higher resolutions (192x192 pixels), incorporating background details and improving overall quality. Leveraging conditional denoising within a diffusion framework, guided by a background-free low-resolution video and a single neutral expression high-resolution image, FacEnhance generates a video incorporating the facial expression from the low-resolution video performed by the individual with background from the neutral image. By complementing lightweight low-resolution models, FacEnhance strikes a balance between computational efficiency and desirable image resolution and quality. Extensive experiments on the MUG facial expression database demonstrate the efficacy of FacEnhance in enhancing low-resolution model outputs to state-of-the-art quality while preserving content and identity consistency. FacEnhance represents significant progress towards resource-efficient, high-fidelity facial expression generation, Renewing outdated low-resolution methods to up-to-date standards.

6/14/2024

cs.CV