EFHQ: Multi-purpose ExtremePose-Face-HQ dataset

Original paper: arXiv:2312.17205. Published 4/15/2024 by Trung Tuan Dao, Duc Hong Vu, Cuong Pham, Anh Tran

Overview

Existing facial datasets often lack images with extreme head poses, leading to decreased performance of deep learning models when dealing with profile or pitched faces.
This work introduces a new dataset, the Extreme Pose Face High-Quality Dataset (EFHQ), which contains up to 450,000 high-quality images of faces at extreme poses.
The dataset is created by curating and processing two publicly available datasets, VFHQ and CelebV-HQ, which contain high-resolution face videos captured in various settings.
The dataset can complement existing datasets for a range of face-related tasks, such as facial synthesis with 2D/3D-aware GANs, diffusion-based text-to-image face generation, and face reenactment.
Training models on EFHQ helps them generalize better across diverse poses, significantly improving performance in scenarios involving extreme views.
The paper also defines a challenging cross-view face verification benchmark using EFHQ, where the performance of state-of-the-art face recognition models drops 5-37% compared to frontal-to-frontal scenarios.

Plain English Explanation

Facial recognition models are often trained on datasets that mainly contain images of faces in frontal or near-frontal views. This can make the models less effective at dealing with faces in extreme poses, such as profiles or faces turned at a sharp angle. The researchers behind this work wanted to address this issue by creating a new dataset that includes a large number of high-quality images of faces in extreme poses.

To create this dataset, the researchers utilized two existing datasets, VFHQ and CelebV-HQ, which contain many high-resolution videos of faces captured in various settings. They developed a novel and thorough data processing pipeline to extract and curate the extreme pose face images from these videos, ultimately producing a dataset of up to 450,000 high-quality extreme pose face images.

The researchers believe this new dataset, called the Extreme Pose Face High-Quality Dataset (EFHQ), can help face generation, face reenactment, and face recognition models work better with faces in extreme poses. They report that training on EFHQ helped models generalize well across diverse head poses, significantly improving performance in scenarios involving extreme views.

The researchers also used EFHQ to create a new, challenging benchmark for cross-view face verification, in which the performance of state-of-the-art face recognition models dropped by 5-37% compared to the more common frontal-to-frontal scenarios. This benchmark aims to spur further research on facial recognition in real-world conditions with severe head poses.

Technical Explanation

The researchers note that existing facial datasets contain plenty of near-frontal images but few images with extreme head poses, which degrades the performance of deep learning models on profile or pitched faces. To address this gap, they introduce the Extreme Pose Face High-Quality Dataset (EFHQ), which includes up to 450,000 high-quality images of faces at extreme poses.

To produce this large-scale dataset, the researchers used a novel and meticulous processing pipeline to curate two publicly available video datasets, VFHQ and CelebV-HQ, which contain many high-resolution face videos captured in various settings. The pipeline extracts and selects the extreme-pose face images from these videos that make up EFHQ.
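
The summary does not spell out how the pipeline decides which frames count as extreme poses. As a rough illustration only, the sketch below assumes per-frame yaw and pitch estimates from some head pose estimator and keeps frames that fall outside a near-frontal range. The FaceFrame type, the pose_bin and select_extreme helpers, and the 60° yaw / 30° pitch cutoffs are all hypothetical, not taken from the EFHQ paper, and a real pipeline would also need other filtering steps (for example, image quality checks).

```python
# Hypothetical sketch: keep only extreme-pose frames, assuming per-frame
# yaw/pitch estimates (in degrees) are already available from some head pose
# estimator. Thresholds and names are illustrative, NOT EFHQ's actual values.
from dataclasses import dataclass

PROFILE_YAW_DEG = 60.0    # assumed cutoff for "profile" faces
EXTREME_PITCH_DEG = 30.0  # assumed cutoff for strongly pitched faces


@dataclass
class FaceFrame:
    video_id: str
    frame_idx: int
    yaw: float    # left/right head rotation, degrees
    pitch: float  # up/down head rotation, degrees


def pose_bin(frame: FaceFrame) -> str:
    """Assign a coarse pose label to one frame (yaw is checked before pitch)."""
    if abs(frame.yaw) >= PROFILE_YAW_DEG:
        return "profile"
    if abs(frame.pitch) >= EXTREME_PITCH_DEG:
        return "pitched"
    return "frontal"


def select_extreme(frames: list[FaceFrame]) -> list[FaceFrame]:
    """Drop near-frontal frames, keeping profile and pitched ones in order."""
    return [f for f in frames if pose_bin(f) != "frontal"]


if __name__ == "__main__":
    frames = [
        FaceFrame("v1", 0, yaw=5.0, pitch=2.0),      # near-frontal
        FaceFrame("v1", 10, yaw=75.0, pitch=0.0),    # profile
        FaceFrame("v2", 3, yaw=-10.0, pitch=-40.0),  # pitched
    ]
    for f in select_extreme(frames):
        print(f.video_id, f.frame_idx, pose_bin(f))
```

Using absolute angles treats left and right profiles (or up and down pitch) symmetrically; a real curation step might instead keep finer-grained pose bins so the final dataset can be balanced across them.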

The researchers found that training models on the EFHQ dataset helped them generalize well across diverse poses, significantly improving performance in scenarios involving extreme views. They also used EFHQ to define a challenging cross-view face verification benchmark, where the performance of state-of-the-art face recognition models dropped 5-37% compared to frontal-to-frontal scenarios. This benchmark aims to stimulate further studies on face recognition under severe pose conditions in the wild.
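
To make the benchmark idea concrete, here is a minimal sketch of how verification accuracy might be reported separately for frontal-to-frontal and cross-view pairs. It assumes face embeddings have already been extracted by some recognition model and uses cosine similarity with a fixed threshold. The function names, the "frontal-frontal"/"frontal-profile" tags, and the 0.5 threshold are hypothetical and are not the benchmark's actual protocol.

```python
# Hypothetical sketch: score face verification separately per pose protocol.
# Assumes embeddings were already produced by some face recognition model;
# the protocol tags and the fixed 0.5 threshold are illustrative assumptions.
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two non-zero embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def verification_accuracy(pairs: list, threshold: float = 0.5) -> float:
    """pairs: list of (emb_a, emb_b, is_same_identity) tuples."""
    # A pair is correct when "similar enough" matches the ground-truth label.
    correct = sum((cosine(a, b) >= threshold) == same for a, b, same in pairs)
    return correct / len(pairs)


def accuracy_by_protocol(tagged_pairs, threshold: float = 0.5) -> dict[str, float]:
    """Group (tag, emb_a, emb_b, same) tuples by tag and score each group."""
    groups: dict[str, list] = {}
    for tag, a, b, same in tagged_pairs:
        groups.setdefault(tag, []).append((a, b, same))
    return {tag: verification_accuracy(p, threshold) for tag, p in groups.items()}


if __name__ == "__main__":
    tagged = [
        ("frontal-frontal", [1.0, 0.0], [0.9, 0.1], True),
        ("frontal-frontal", [1.0, 0.0], [0.0, 1.0], False),
        ("frontal-profile", [1.0, 0.0], [0.2, 1.0], True),  # same person, low similarity
        ("frontal-profile", [1.0, 0.0], [0.0, 1.0], False),
    ]
    print(accuracy_by_protocol(tagged))
```

Comparing the frontal-frontal score with the cross-view scores gives the kind of gap the benchmark is designed to expose. Real verification protocols usually tune the threshold on held-out folds rather than fixing it in advance.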

Critical Analysis

The researchers acknowledge that while the EFHQ dataset provides a valuable resource for training models to handle extreme head poses, it may still not capture the full diversity of real-world facial poses and expressions. The dataset could potentially be expanded with additional sources of high-quality face data, and related lines of work, such as Freeman's work on 3D human pose estimation or the 360X Panoptic dataset for multi-modal scene understanding, may offer complementary ideas.

Additionally, the researchers note that the cross-view face verification benchmark defined using EFHQ is a challenging task, and there is room for further research to improve the performance of state-of-the-art face recognition models in such extreme pose conditions. Approaches such as semi-supervised unconstrained head pose estimation or location-guided head pose estimation for fisheye images may provide valuable insights for this problem.

Finally, the researchers do not address potential biases or fairness issues that may arise from training models on a dataset focused on extreme poses. It would be important to consider how the EFHQ dataset and the resulting models could affect fairness in downstream applications where facial recognition plays a role.

Conclusion

The Extreme Pose Face High-Quality Dataset (EFHQ) introduced in this work represents a significant advancement in addressing the limitations of existing facial datasets, which often lack images with extreme head poses. By utilizing a novel and thorough data curation process, the researchers have created a large-scale dataset of up to 450,000 high-quality extreme pose face images that can complement existing datasets for various facial-related tasks.

The researchers have demonstrated the value of EFHQ by showing that training models on this dataset helps them generalize better across diverse poses, leading to substantial improvements in performance for scenarios involving extreme views. Additionally, the cross-view face verification benchmark defined using EFHQ aims to stimulate further research on improving facial recognition in real-world conditions with severe head poses.

As the field of computer vision and facial analysis continues to evolve, datasets like EFHQ will play a crucial role in pushing the boundaries of what is possible and ensuring that facial recognition models can perform effectively in a wide range of real-world scenarios.

This summary was produced with help from an AI and may contain inaccuracies; read the original paper (arXiv:2312.17205) to verify the details.

Related Papers

EFHQ: Multi-purpose ExtremePose-Face-HQ dataset

Trung Tuan Dao, Duc Hong Vu, Cuong Pham, Anh Tran

The existing facial datasets, while having plentiful images at near frontal views, lack images with extreme head poses, leading to the downgraded performance of deep learning models when dealing with profile or pitched faces. This work aims to address this gap by introducing a novel dataset named Extreme Pose Face High-Quality Dataset (EFHQ), which includes a maximum of 450k high-quality images of faces at extreme poses. To produce such a massive dataset, we utilize a novel and meticulous dataset processing pipeline to curate two publicly available datasets, VFHQ and CelebV-HQ, which contain many high-resolution face videos captured in various settings. Our dataset can complement existing datasets on various facial-related tasks, such as facial synthesis with 2D/3D-aware GAN, diffusion-based text-to-image face generation, and face reenactment. Specifically, training with EFHQ helps models generalize well across diverse poses, significantly improving performance in scenarios involving extreme views, confirmed by extensive experiments. Additionally, we utilize EFHQ to define a challenging cross-view face verification benchmark, in which the performance of SOTA face recognition models drops 5-37% compared to frontal-to-frontal scenarios, aiming to stimulate studies on face recognition under severe pose conditions in the wild.

4/15/2024

VGGHeads: A Large-Scale Synthetic Dataset for 3D Human Heads

Orest Kupyn, Eugene Khvedchenia, Christian Rupprecht

Human head detection, keypoint estimation, and 3D head model fitting are important tasks with many applications. However, traditional real-world datasets often suffer from bias, privacy, and ethical concerns, and they have been recorded in laboratory environments, which makes it difficult for trained models to generalize. Here, we introduce VGGHeads -- a large scale synthetic dataset generated with diffusion models for human head detection and 3D mesh estimation. Our dataset comprises over 1 million high-resolution images, each annotated with detailed 3D head meshes, facial landmarks, and bounding boxes. Using this dataset we introduce a new model architecture capable of simultaneous head detection and head mesh reconstruction from a single image in a single step. Through extensive experimental evaluations, we demonstrate that models trained on our synthetic data achieve strong performance on real images. Furthermore, the versatility of our dataset makes it applicable across a broad spectrum of tasks, offering a general and comprehensive representation of human heads. Additionally, we provide detailed information about the synthetic data generation pipeline, enabling it to be re-used for other tasks and domains.

7/26/2024

On the power of data augmentation for head pose estimation

Michael Welter

Deep learning has been impressively successful in the last decade in predicting human head poses from monocular images. For in-the-wild inputs, the research community has predominantly relied on a single training set of semi-synthetic nature. This paper suggests the combination of different flavors of synthetic data in order to achieve better generalization to natural images. Moreover, additional expansion of the data volume using traditional out-of-plane rotation synthesis is considered. Together with a novel combination of losses and a network architecture with a standard feature-extractor, a competitive model is obtained, both in accuracy and efficiency, which allows full 6 DoF pose estimation in practical real-time applications.

7/12/2024

3D Human Pose Perception from Egocentric Stereo Videos

Hiroyasu Akada, Jian Wang, Vladislav Golyanik, Christian Theobalt

While head-mounted devices are becoming more compact, they provide egocentric views with significant self-occlusions of the device user. Hence, existing methods often fail to accurately estimate complex 3D poses from egocentric views. In this work, we propose a new transformer-based framework to improve egocentric stereo 3D human pose estimation, which leverages the scene information and temporal context of egocentric stereo videos. Specifically, we utilize 1) depth features from our 3D scene reconstruction module with uniformly sampled windows of egocentric stereo frames, and 2) human joint queries enhanced by temporal features of the video inputs. Our method is able to accurately estimate human poses even in challenging scenarios, such as crouching and sitting. Furthermore, we introduce two new benchmark datasets, i.e., UnrealEgo2 and UnrealEgo-RW (RealWorld). The proposed datasets offer a much larger number of egocentric stereo views with a wider variety of human motions than the existing datasets, allowing comprehensive evaluation of existing and upcoming methods. Our extensive experiments show that the proposed approach significantly outperforms previous methods. We will release UnrealEgo2, UnrealEgo-RW, and trained models on our project page.

5/16/2024