Latent Embedding Clustering for Occlusion Robust Head Pose Estimation

2403.20251

Published 4/1/2024 by José Celestino, Manuel Marques, Jacinto C. Nascimento

Abstract

Head pose estimation has become a crucial area of research in computer vision given its usefulness in a wide range of applications, including robotics, surveillance, and driver attention monitoring. One of the most difficult challenges in this field is managing head occlusions that frequently take place in real-world scenarios. In this paper, we propose a novel and efficient framework that is robust in real-world head occlusion scenarios. In particular, we propose an unsupervised latent embedding clustering with regression and classification components for each pose angle. The model optimizes latent feature representations for occluded and non-occluded images through a clustering term while improving fine-grained angle predictions. Experimental evaluation on in-the-wild head pose benchmark datasets reveals competitive performance in comparison to state-of-the-art methodologies, with the advantage of a significant data reduction. We observe a substantial improvement in occluded head pose estimation. Also, an ablation study is conducted to ascertain the impact of the clustering term within our proposed framework.

Introduction

Head pose estimation is the process of predicting the orientation and position of a person's head relative to a camera. This is an important topic in computer vision, as head pose estimation is crucial for many applications, such as human-computer/robot interaction, surveillance systems, driver attention monitoring, virtual/augmented reality, healthcare, and marketing.

One major challenge that existing head pose estimation methods often struggle with is the presence of occlusions. Occlusions can be caused by external objects, facial accessories, or even body parts, and they can make it difficult to capture reliable facial features, leading to inaccurate head pose estimates. This is a significant issue in real-world, unconstrained environments, and it has not been investigated in detail.

This paper proposes a new deep learning methodology to improve the robustness of head pose estimation (HPE) to occlusions. The key innovations are:

Combining fine-grained Euler angle regression with unsupervised latent embedding clustering to refine the feature representation for pose estimation. This approach requires significantly fewer ground truth latent embedding points compared to prior latent space regression methods.
The ability to augment the occluded training dataset with non-occluded images, overcoming the limitation of prior methods that require occluded/non-occluded image pairs.
Lower computational cost compared to fully supervised latent space regression approaches.

The proposed method achieves state-of-the-art results on occluded images in benchmark datasets, while also performing competitively on non-occluded images. An ablation study demonstrates the impact of the unsupervised embedding clustering component.

Related Work

This section reviews the existing literature on head pose estimation (HPE) and unsupervised deep clustering. Regarding HPE, there are two main classes of approaches: those based on facial landmarks and model fitting, and those using deep learning on image features without the need for landmarks. The landmark-based approaches fit a head mesh to detected facial landmarks or keypoints, with some recent works using 3D morphable models to improve generalization. The deep learning approaches avoid the need for landmarks and can directly predict head pose, with recent works proposing solutions to address common HPE problems like ambiguity in rotation labels and perspective distortion.

The literature on occlusions in HPE is limited, but a few works have tackled this problem by using optical flow tracking, estimating landmark visibility probabilities, or combining landmark, pose, and deformation estimation. Some works have also explored the use of synthetically occluded datasets and multi-loss functions to improve pose prediction in the presence of occlusions.

The section also discusses unsupervised deep clustering, which aims to group data without ground truth labels based on feature similarity. Simple methods like k-means can struggle with high-dimensional data, so deep learning approaches that map data to a lower-dimensional feature space have become popular. Recent deep clustering works incorporate both clustering and reconstruction losses to preserve local data structure, and use augmented data and contrastive losses to maximize the similarity of positive pairs and penalize negative ones.

Methodology

The proposed LEC-HPE methodology aims to address the difficulties described in the introduction. It uses a small number of ground truth latent embeddings (K) to obtain the representation of a larger number of images (N), where K << N. To accomplish this, the method uses unsupervised latent embedding clustering motivated by previous work.

The overall LEC-HPE architecture includes a backbone encoder and four separate branches. Three branches are used for predicting each Euler angle (yaw, pitch, roll) using a multi-loss framework for classification and fine-grained estimation. The remaining branch is responsible for clustering and fine-tuning the latent space.

The training strategy has two stages. In the first stage, the model is optimized for bin prediction using classification loss and fine-grained Euler angle estimation using regression loss. In the second stage, a clustering term is added while maintaining the multi-loss functions from the first stage.

The fine-grained losses for feature learning and preservation include a cross-entropy loss for classification and a mean squared error loss for regression. These losses are combined to yield the final loss for each Euler angle.
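As a minimal sketch of how a per-angle loss of this kind could be put together, the example below assumes the common bin-then-expectation formulation: a softmax over angle bins gives the classification term, and the fine-grained angle is taken as the expected bin center. The 3-degree bin width, the [-99, 99) degree range, the weight `alpha`, and all function names are illustrative assumptions, not values taken from the paper.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def angle_loss(logits, true_angle, bin_width=3.0, angle_min=-99.0, alpha=1.0):
    """Loss for one Euler angle and one sample.

    Returns (cross_entropy + alpha * squared_error, predicted_angle).
    bin_width, angle_min and alpha are assumed values for illustration.
    """
    probs = softmax(logits)
    n_bins = len(logits)

    # Classification term: cross-entropy against the bin holding the true angle.
    true_bin = int((true_angle - angle_min) // bin_width)
    true_bin = min(max(true_bin, 0), n_bins - 1)  # clamp to a valid bin index
    cross_entropy = -math.log(probs[true_bin])

    # Regression term: squared error of the expected angle over bin centers.
    centers = [angle_min + (i + 0.5) * bin_width for i in range(n_bins)]
    pred_angle = sum(p * c for p, c in zip(probs, centers))
    squared_error = (pred_angle - true_angle) ** 2

    return cross_entropy + alpha * squared_error, pred_angle


if __name__ == "__main__":
    # 66 bins of 3 degrees cover [-99, 99); the logits here are made up.
    example_logits = [0.0] * 66
    example_logits[40] = 4.0
    print(angle_loss(example_logits, true_angle=22.0))
```

In a batched setting the squared error would be averaged over samples to give the mean squared error, and one such loss would be computed for each of yaw, pitch, and roll.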

The unsupervised latent clustering is performed by measuring the pairwise similarity between latent embedded points and cluster centers, and optimizing the cluster centers and encoder parameters by minimizing a Kullback-Leibler divergence objective.
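The summary does not state which similarity kernel or target distribution is used. The sketch below assumes the formulation popularized by Deep Embedded Clustering (DEC): a Student's t kernel with one degree of freedom for the soft assignments, and a sharpened target distribution built from their squares. Function names and the toy data are hypothetical.

```python
import math


def soft_assignments(embeddings, centers):
    """q[i][j]: Student's t similarity between embedding i and center j,
    normalized so each row sums to 1 (assumed DEC-style kernel)."""
    q = []
    for z in embeddings:
        sims = [
            1.0 / (1.0 + sum((a - b) ** 2 for a, b in zip(z, mu)))
            for mu in centers
        ]
        total = sum(sims)
        q.append([s / total for s in sims])
    return q


def target_distribution(q):
    """p[i][j] proportional to q[i][j]**2 / f_j, where f_j is the soft
    cluster frequency; this sharpens confident assignments."""
    n_clusters = len(q[0])
    freq = [sum(row[j] for row in q) for j in range(n_clusters)]
    p = []
    for row in q:
        weights = [row[j] ** 2 / freq[j] for j in range(n_clusters)]
        total = sum(weights)
        p.append([w / total for w in weights])
    return p


def kl_clustering_loss(p, q):
    """KL(P || Q) summed over all embeddings and clusters."""
    return sum(
        p_ij * math.log(p_ij / q_ij)
        for p_row, q_row in zip(p, q)
        for p_ij, q_ij in zip(p_row, q_row)
        if p_ij > 0.0
    )


if __name__ == "__main__":
    # Toy 2-D latent points and two cluster centers (made-up values).
    latent = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0]]
    centers = [[0.0, 0.0], [5.0, 5.0]]
    q = soft_assignments(latent, centers)
    p = target_distribution(q)
    print(kl_clustering_loss(p, q))
```

In training, the gradient of this loss would flow to both the cluster centers and the encoder parameters, which matches the joint optimization described above.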

The overall loss is the sum of the Euler angle losses and the clustering loss, with a regularization coefficient to adjust the impact of the clustering term.
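A small sketch of how the terms could be combined across the two training stages follows. The coefficient is written as `beta`, with a default of 100 taken from the ablation result reported in this summary; the function name, argument names, and the `stage` switch are assumptions made for illustration.

```python
def total_loss(yaw_loss, pitch_loss, roll_loss, cluster_loss=0.0,
               beta=100.0, stage=2):
    """Overall training loss.

    Stage 1 uses only the three Euler angle losses; stage 2 adds the
    clustering loss weighted by the regularization coefficient beta.
    """
    angle_term = yaw_loss + pitch_loss + roll_loss
    if stage == 1:
        return angle_term
    return angle_term + beta * cluster_loss


if __name__ == "__main__":
    # Made-up loss values, only to show the effect of the stage switch.
    print(total_loss(1.0, 2.0, 3.0, cluster_loss=0.01, stage=1))
    print(total_loss(1.0, 2.0, 3.0, cluster_loss=0.01, stage=2))
```

Setting `beta` to 0 recovers the stage 1 objective, which is the comparison an ablation of the clustering term would make.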

Experimental Evaluation

This section describes the extensive experimental evaluation of the proposed head pose estimation (HPE) method using several benchmark datasets. The datasets used include:

300W-LP: A synthetic dataset with 61,225 face samples used for training. It covers a wide range of individuals, illumination conditions, and poses.
BIWI: A dataset with over 15,000 images of 20 individuals used for testing. It provides depth, RGB images, and ground truth pose annotations.
AFLW2000: A dataset of 2,000 in-the-wild images used for testing, with diverse head poses and variations in lighting and background.
Pandora: A dataset with over 250,000 RGB and depth images, simulating driving poses, used for testing with real-life occlusions.

The network structure uses a ResNet-50 backbone encoder. The training is done in two stages: (1) using the original 300W-LP dataset, and (2) using the synthetically occluded version of the same dataset.

The paper compares the proposed method's performance to state-of-the-art approaches on the BIWI, AFLW2000, and Pandora datasets, both in original and occluded scenarios. The results show the proposed method delivers competitive performance, surpassing other methods in occluded scenarios by a significant margin.

An ablation study is also conducted to analyze the impact of the clustering term in the training loss. The results indicate that including the clustering term (with an optimal β value of 100) reduces the estimation error for both occluded and non-occluded images compared to not using the clustering term.

Conclusion

The paper presents an efficient method for handling occlusion in head pose estimation, a significant issue in computer vision. The proposed framework, called LEC-HPE, combines unsupervised latent embedding clustering with a fine-grained Euler angle multi-loss scheme to improve occlusion robustness. The key idea is to enhance the feature representation for pose estimation while refining the latent embedding space through clustering, which eliminates the need for labeled embedding data for each training image. This approach is a more efficient alternative to recent occlusion-focused state-of-the-art methods and does not require a constrained expansion of the training dataset. Experiments show the method can achieve similar results without ground truth for each latent embedding and can surpass the state of the art for standard head pose estimation in occluded images. An ablation study quantifies the impact of the clustering term and the need to include the fine-grained Euler angle scheme. Future work will include automatic selection of the optimal number of cluster centroids and evaluation with smaller, more efficient backbone encoders for low-power applications. The use of clustering losses for the classification component in the multi-loss scheme will also be explored.

Acknowledgements

The work described in this paper was funded by several sources. These include LARSyS, which provided funding through the specified DOI numbers. Funding also came from the Fundação para a Ciência e a Tecnologia. Additionally, the SmartRetail project, managed by IAPMEI - Agência para a Competitividade e Inovação, provided funding through the specified project number.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!
