Domain-Adaptive Full-Face Gaze Estimation via Novel-View-Synthesis and Feature Disentanglement

Read original: arXiv:2305.16140 - Published 7/9/2024 by Jiawei Qin, Takuru Shimoyama, Xucong Zhang, Yusuke Sugano

Overview

The paper addresses cross-domain gaze estimation: models that perform well when trained and tested within the same domain degrade severely when applied to a different one.
Among the contributing factors, the ranges of head pose and gaze direction covered by the training data are believed to play a significant role.
The proposed approach combines a data synthesis pipeline that expands the range of head poses with an unsupervised domain adaptation method that bridges the gap between synthetic and real data.

Plain English Explanation

Gaze estimation, the process of determining where someone is looking, has seen significant progress thanks to deep learning. However, these models tend to work well only when the training and testing data come from the same domain, such as the same camera setup or environment.

When applied to a different domain, performance drops severely. One major reason is that training data often covers only a limited range of head poses and gaze directions, so the model meets poses and directions at test time that it rarely or never saw during training.

Collecting a large, diverse dataset with extensive head pose and gaze variations is expensive and time-consuming. To address this, the researchers developed a two-part solution:

Data synthesis: They use single-image 3D reconstruction to generate synthetic images with a wider range of head poses, without needing a 3D facial shape dataset.
Unsupervised domain adaptation: They propose a method to bridge the gap between the synthetic and real images, allowing the model trained on the synthetic data to perform well on real-world data.

The goal is to create a gaze estimation system that can be deployed in various real-world applications, even when the training and test data come from different sources.

Technical Explanation

The paper presents a training pipeline with two components: training-data synthesis and a gaze estimation model designed for unsupervised domain adaptation.

The data synthesis component uses single-image 3D reconstruction to generate synthetic full-face images with a wider range of head poses, in the same spirit as the synthetic-data approach of Domain-Transferred Synthetic Data Generation for Improving Monocular Depth Estimation. Because each face is reconstructed from a single image, no separate 3D facial shape dataset, which can be expensive to acquire, is needed.
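The geometric core of this kind of head-pose expansion can be sketched in a few lines. The following is a minimal illustration, not the authors' pipeline: it assumes the 3D reconstruction has already produced a handful of face points in camera coordinates, applies only a yaw rotation, and uses a simple pinhole projection. All function names, the toy points, and the focal length are hypothetical. The point it makes is that the gaze label must be rotated together with the face so the synthetic image stays correctly labeled.

```python
import math

def yaw_rotation(deg):
    """3x3 rotation matrix for `deg` degrees about the vertical (y) axis."""
    t = math.radians(deg)
    c, s = math.cos(t), math.sin(t)
    return [[c, 0.0, s],
            [0.0, 1.0, 0.0],
            [-s, 0.0, c]]

def rotate(R, v):
    """Multiply a 3x3 matrix by a 3-vector."""
    return [sum(R[i][j] * v[j] for j in range(3)) for i in range(3)]

def synthesize_view(face_points, gaze, yaw_deg, center):
    """Rotate the face about `center` and rotate the gaze label identically."""
    R = yaw_rotation(yaw_deg)
    new_points = []
    for p in face_points:
        centered = [p[i] - center[i] for i in range(3)]
        r = rotate(R, centered)
        new_points.append([r[i] + center[i] for i in range(3)])
    return new_points, rotate(R, gaze)

def project(point, focal=1.0):
    """Pinhole projection; the camera looks along +z (assumed convention)."""
    x, y, z = point
    return [focal * x / z, focal * y / z]

# Toy stand-in for a reconstructed face: three points about 0.6 m from the camera.
face_points = [[-0.03, 0.0, 0.60], [0.03, 0.0, 0.60], [0.0, 0.04, 0.57]]
center = [0.0, 0.0, 0.60]
gaze = [0.0, 0.0, -1.0]  # looking straight back at the camera

new_points, new_gaze = synthesize_view(face_points, gaze, 30.0, center)
pixels = [project(p) for p in new_points]  # 2D positions in the new view
```

A real pipeline would render a textured mesh rather than project three points, and would sample pitch as well as yaw, but the label bookkeeping is the same.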

To bridge the gap between the synthetic and real data, the researchers propose an unsupervised domain adaptation method inspired by Domain Adaptive Pose Estimation via Multi-Level Consistency Regularization. Their approach uses a disentangling autoencoder network to separate gaze-related features from background information, and introduces a background augmentation consistency loss to better utilize the characteristics of the synthetic source domain.
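The idea behind a background augmentation consistency loss can be illustrated with a toy example. This is a sketch under stated assumptions, not the paper's formulation: images are small nested lists of grayscale values, the face mask is assumed to come for free from the synthetic rendering, the loss is a plain mean absolute difference on (pitch, yaw) outputs, and both "estimators" are hypothetical stand-ins for a network. The loss is zero only when the prediction ignores the background.

```python
def replace_background(image, face_mask, background):
    """Keep face pixels (mask == 1); take all other pixels from `background`."""
    return [[px if m else bg
             for px, m, bg in zip(img_row, mask_row, bg_row)]
            for img_row, mask_row, bg_row in zip(image, face_mask, background)]

def consistency_loss(pred_a, pred_b):
    """Mean absolute difference between two (pitch, yaw) predictions."""
    return sum(abs(a - b) for a, b in zip(pred_a, pred_b)) / len(pred_a)

def leaky_estimator(image):
    """Toy model that wrongly depends on every pixel, background included."""
    flat = [px for row in image for px in row]
    mean = sum(flat) / len(flat)
    return (mean, -mean)

def face_only_estimator(image, face_mask):
    """Toy model that depends only on pixels inside the face mask."""
    face = [px for row, mrow in zip(image, face_mask)
            for px, m in zip(row, mrow) if m]
    mean = sum(face) / len(face)
    return (mean, -mean)

image = [[0.2, 0.9], [0.4, 0.6]]
face_mask = [[1, 0], [0, 1]]
dark = [[0.0, 0.0], [0.0, 0.0]]
bright = [[1.0, 1.0], [1.0, 1.0]]

# Two views of the same face and gaze, differing only in background.
view_a = replace_background(image, face_mask, dark)
view_b = replace_background(image, face_mask, bright)

leaky = consistency_loss(leaky_estimator(view_a), leaky_estimator(view_b))
robust = consistency_loss(face_only_estimator(view_a, face_mask),
                          face_only_estimator(view_b, face_mask))
```

Here `leaky` is positive and `robust` is zero, which is the signal such a loss gives during training: predictions that change with the background are penalized.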

Comprehensive experiments show that the model trained solely on the synthetic data can perform comparably to one trained on real data with a large label range. The proposed domain adaptation method further improves the performance on multiple target domains.
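The summary does not say how performance is measured. Gaze estimation results are commonly reported as the angular error, in degrees, between predicted and ground-truth 3D gaze directions, so the sketch below assumes that metric. The pitch/yaw-to-vector convention shown is one common choice and is an assumption here, as are the function names.

```python
import math

def pitch_yaw_to_vector(pitch, yaw):
    """Unit 3D gaze vector from pitch and yaw in radians (assumed convention)."""
    return [-math.cos(pitch) * math.sin(yaw),
            -math.sin(pitch),
            -math.cos(pitch) * math.cos(yaw)]

def angular_error_deg(pred, target):
    """Angle in degrees between two 3D gaze vectors."""
    dot = sum(p * t for p, t in zip(pred, target))
    norm = (math.sqrt(sum(p * p for p in pred))
            * math.sqrt(sum(t * t for t in target)))
    # Clamp to guard against rounding slightly outside [-1, 1].
    cos_angle = max(-1.0, min(1.0, dot / norm))
    return math.degrees(math.acos(cos_angle))

def mean_angular_error_deg(preds, targets):
    """Average angular error over a set of (pitch, yaw) pairs."""
    errors = [angular_error_deg(pitch_yaw_to_vector(*p), pitch_yaw_to_vector(*t))
              for p, t in zip(preds, targets)]
    return sum(errors) / len(errors)
```

Comparing a model trained on synthetic data with one trained on real data then amounts to computing this mean error for each model on the same target-domain test set.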

Critical Analysis

The researchers acknowledge that their synthetic data generation approach may not capture all the nuances of real-world data, and that the domain adaptation method still has room for improvement. Additionally, the paper does not address the potential for bias in the synthetic data or the implications of deploying such a system in real-world applications.

While the results are promising, further research is needed to ensure the robustness and fairness of the gaze estimation system, especially when applied to diverse populations and environments. Related work on unsupervised gaze representation learning via eye masks and on data augmentation for head pose estimation could be explored to address these concerns.

Conclusion

This paper presents an innovative approach to address the challenge of cross-domain gaze estimation, a crucial step towards deploying gaze-aware systems in real-world applications. By combining synthetic data generation and unsupervised domain adaptation, the researchers have developed a method that can perform well even when the training and test data come from different domains.

The robustness and fairness questions raised above remain open. Nonetheless, this work represents an important advancement in the field and opens up new possibilities for gaze-based applications. It also connects to related work on generating annotated gaze data, such as Learning Gaze-aware Compositional GAN.

Related Papers

Domain-Adaptive Full-Face Gaze Estimation via Novel-View-Synthesis and Feature Disentanglement

Jiawei Qin, Takuru Shimoyama, Xucong Zhang, Yusuke Sugano

Along with the recent development of deep neural networks, appearance-based gaze estimation has succeeded considerably when training and testing within the same domain. Compared to the within-domain task, the variance of different domains makes the cross-domain performance drop severely, preventing gaze estimation deployment in real-world applications. Among all the factors, ranges of head pose and gaze are believed to play significant roles in the final performance of gaze estimation, while collecting large ranges of data is expensive. This work proposes an effective model training pipeline consisting of a training data synthesis and a gaze estimation model for unsupervised domain adaptation. The proposed data synthesis leverages the single-image 3D reconstruction to expand the range of the head poses from the source domain without requiring a 3D facial shape dataset. To bridge the inevitable gap between synthetic and real images, we further propose an unsupervised domain adaptation method suitable for synthetic full-face data. We propose a disentangling autoencoder network to separate gaze-related features and introduce background augmentation consistency loss to utilize the characteristics of the synthetic source domain. Through comprehensive experiments, it shows that the model using only our synthetic training data can perform comparably to real data extended with a large label range. Our proposed domain adaptation approach further improves the performance on multiple target domains. The code and data will be available at https://github.com/ut-vision/AdaptiveGaze.

7/9/2024

Causal Representation-Based Domain Generalization on Gaze Estimation

Younghan Kim, Kangryun Moon, Yongjun Park, Yonggyu Kim

The availability of extensive datasets containing gaze information for each subject has significantly enhanced gaze estimation accuracy. However, the discrepancy between domains severely affects a model's performance explicitly trained for a particular domain. In this paper, we propose the Causal Representation-Based Domain Generalization on Gaze Estimation (CauGE) framework designed based on the general principle of causal mechanisms, which is consistent with the domain difference. We employ an adversarial training manner and an additional penalizing term to extract domain-invariant features. After extracting features, we position the attention layer to make features sufficient for inferring the actual gaze. By leveraging these modules, CauGE ensures that the neural networks learn from representations that meet the causal mechanisms' general principles. By this, CauGE generalizes across domains by extracting domain-invariant features, and spurious correlations cannot influence the model. Our method achieves state-of-the-art performance in the domain generalization on gaze estimation benchmark.

9/2/2024

Learning Gaze-aware Compositional GAN

Nerea Aranjuelo, Siyu Huang, Ignacio Arganda-Carreras, Luis Unzueta, Oihana Otaegui, Hanspeter Pfister, Donglai Wei

Gaze-annotated facial data is crucial for training deep neural networks (DNNs) for gaze estimation. However, obtaining these data is labor-intensive and requires specialized equipment due to the challenge of accurately annotating the gaze direction of a subject. In this work, we present a generative framework to create annotated gaze data by leveraging the benefits of labeled and unlabeled data sources. We propose a Gaze-aware Compositional GAN that learns to generate annotated facial images from a limited labeled dataset. Then we transfer this model to an unlabeled data domain to take advantage of the diversity it provides. Experiments demonstrate our approach's effectiveness in generating within-domain image augmentations in the ETH-XGaze dataset and cross-domain augmentations in the CelebAMask-HQ dataset domain for gaze estimation DNN training. We also show additional applications of our work, which include facial image editing and gaze redirection.

6/3/2024

Domain-Transferred Synthetic Data Generation for Improving Monocular Depth Estimation

Seungyeop Lee, Knut Peterson, Solmaz Arezoomandan, Bill Cai, Peihan Li, Lifeng Zhou, David Han

A major obstacle to the development of effective monocular depth estimation algorithms is the difficulty in obtaining high-quality depth data that corresponds to collected RGB images. Collecting this data is time-consuming and costly, and even data collected by modern sensors has limited range or resolution, and is subject to inconsistencies and noise. To combat this, we propose a method of data generation in simulation using 3D synthetic environments and CycleGAN domain transfer. We compare this method of data generation to the popular NYUDepth V2 dataset by training a depth estimation model based on the DenseDepth structure using different training sets of real and simulated data. We evaluate the performance of the models on newly collected images and LiDAR depth data from a Husky robot to verify the generalizability of the approach and show that GAN-transformed data can serve as an effective alternative to real-world data, particularly in depth estimation.

5/3/2024