3D Human Pose Estimation with Occlusions: Introducing BlendMimic3D Dataset and GCN Refinement

2404.16136

Published 4/26/2024 by Filipa Lino, Carlos Santiago, Manuel Marques

Abstract

In the field of 3D Human Pose Estimation (HPE), accurately estimating human pose, especially in scenarios with occlusions, is a significant challenge. This work identifies and addresses a gap in the current state of the art in 3D HPE concerning the scarcity of data and strategies for handling occlusions. We introduce our novel BlendMimic3D dataset, designed to mimic real-world situations where occlusions occur for seamless integration in 3D HPE algorithms. Additionally, we propose a 3D pose refinement block, employing a Graph Convolutional Network (GCN) to enhance pose representation through a graph model. This GCN block acts as a plug-and-play solution, adaptable to various 3D HPE frameworks without requiring retraining them. By training the GCN with occluded data from BlendMimic3D, we demonstrate significant improvements in resolving occluded poses, with comparable results for non-occluded ones. Project web page is available at https://blendmimic3d.github.io/BlendMimic3D/.

Overview

Addresses the challenge of accurately estimating 3D human poses, especially in scenarios with occlusions
Introduces the novel BlendMimic3D dataset to mimic real-world occlusions for 3D human pose estimation (HPE)
Proposes a 3D pose refinement block using a Graph Convolutional Network (GCN) to enhance pose representation and improve handling of occluded poses

Plain English Explanation

Accurately estimating the 3D poses, or positions, of humans in images and videos is an important task in computer vision. This is especially challenging when parts of the human body are hidden or obstructed, a common occurrence in real-world scenarios.

To address this, the researchers created a new dataset called BlendMimic3D that simulates realistic occlusions. They then trained a 3D pose refinement module, built on a Graph Convolutional Network (GCN), on occluded data from this dataset. The GCN module acts as a plug-and-play solution that can be added to existing 3D HPE frameworks to improve their handling of occluded poses, without retraining those frameworks.

The key innovation is using the BlendMimic3D dataset to teach the GCN module how to better understand and reconstruct human poses even when parts of the body are hidden from view. This allows 3D HPE systems to perform more robustly in real-world conditions with occlusions, which is an important step forward for applications like 3D human reconstruction in the wild and multi-person 3D pose estimation.

Technical Explanation

The researchers identified a gap in the current state of the art for 3D HPE regarding the scarcity of data and strategies for handling occlusions. To address this, they introduced the BlendMimic3D dataset, which is designed to mimic real-world situations where occlusions occur. The dataset is created by blending human poses from the AMASS motion capture dataset with synthetic occlusions.

In addition, the researchers proposed a 3D pose refinement block that employs a Graph Convolutional Network (GCN) to enhance the pose representation through a graph model. This GCN block can be integrated as a plug-and-play solution into various 3D HPE frameworks without requiring retraining of the entire model.
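
As a rough illustration of what a graph-based refinement step can look like, the sketch below applies one graph convolution with a residual connection to a 3D pose. It is not the authors' architecture: the 17-joint parent list, the single layer, the 3 x 3 weight matrix, and the names PARENTS, normalized_adjacency, matmul, and gcn_refine are all assumptions made for illustration.

```python
import math

# Parent index of each joint in a 17-joint skeleton (assumed layout; -1 marks the root).
PARENTS = [-1, 0, 1, 2, 0, 4, 5, 0, 7, 8, 9, 8, 11, 12, 8, 14, 15]


def normalized_adjacency(parents):
    """Return D^-1/2 (A + I) D^-1/2 for the skeleton graph as a list of lists."""
    n = len(parents)
    # Start from the identity so every joint also keeps its own features.
    a = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for child, parent in enumerate(parents):
        if parent >= 0:
            a[child][parent] = a[parent][child] = 1.0
    deg = [sum(row) for row in a]
    return [[a[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)] for i in range(n)]


def matmul(x, y):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(xi * yj for xi, yj in zip(row, col)) for col in zip(*y)] for row in x]


def gcn_refine(pose, adj, weight):
    """Apply one graph-convolution step and add the result to the input pose.

    pose:   J x 3 joint coordinates from any upstream 3D HPE model
    adj:    J x J normalized adjacency matrix of the skeleton
    weight: 3 x 3 matrix standing in for learned parameters (hypothetical)
    """
    update = matmul(matmul(adj, pose), weight)
    return [[p + u for p, u in zip(prow, urow)] for prow, urow in zip(pose, update)]


ADJ = normalized_adjacency(PARENTS)
```

Because the step takes a 3D pose in and returns a 3D pose of the same shape, it can sit after any estimator that outputs joints in this layout, which is the sense in which such a block is plug-and-play.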

By training the GCN module with the occluded data from BlendMimic3D, the researchers demonstrated significant improvements in resolving occluded poses, while maintaining comparable results for non-occluded poses. This suggests that the GCN-based refinement approach is effective at handling occlusions in 3D human pose estimation.
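
One common way to check a claim like this is to report the mean per-joint position error (MPJPE) separately for occluded and visible joints. The sketch below does that in plain Python; the use of MPJPE here, the per-joint occlusion flags, and the function names are assumptions for illustration, and the paper's own evaluation protocol may differ.

```python
import math


def mpjpe(pred, target):
    """Mean Euclidean distance between corresponding joints of two poses."""
    dists = [math.dist(p, t) for p, t in zip(pred, target)]
    return sum(dists) / len(dists)


def mpjpe_by_visibility(pred, target, occluded):
    """Return MPJPE over occluded joints and over visible joints.

    occluded holds one boolean per joint (hypothetical annotation format).
    A group with no joints is reported as None.
    """
    groups = {"occluded": [], "visible": []}
    for p, t, hidden in zip(pred, target, occluded):
        groups["occluded" if hidden else "visible"].append(math.dist(p, t))
    return {name: (sum(d) / len(d) if d else None) for name, d in groups.items()}


# Toy example: three joints, the last one flagged as occluded.
target = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
pred = [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 3.0, 4.0]]
scores = mpjpe_by_visibility(pred, target, [False, False, True])
```

Comparing these two numbers before and after refinement would show whether the gain is concentrated on occluded joints while visible joints stay roughly unchanged.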

Critical Analysis

The researchers acknowledge the limitation that the BlendMimic3D dataset, while designed to mimic real-world occlusions, may not fully capture the complexity and variability of occlusions observed in actual environments. Further research is needed to evaluate the performance of the proposed approach on more diverse and challenging real-world datasets.

Additionally, the researchers do not provide a detailed analysis of the trade-offs or computational costs associated with integrating the GCN-based refinement block into existing 3D HPE frameworks. The impact on inference speed and overall system complexity should be considered when deploying this solution in practical applications.

While the results demonstrate improvements in handling occluded poses, it would be valuable to see a more comprehensive evaluation of the approach's robustness across a wider range of occlusion scenarios, including varying degrees of occlusion, different types of occluding objects, and diverse human poses.

Conclusion

This work addresses an important challenge in 3D human pose estimation by introducing the BlendMimic3D dataset and a novel 3D pose refinement block using a Graph Convolutional Network. The GCN-based refinement module is shown to be effective at improving the handling of occluded poses, without requiring retraining of the entire 3D HPE system.

This research is a step toward more robust 3D HPE algorithms, which matters for practical applications such as 3D human reconstruction in the wild, multi-person 3D pose estimation, and uncertainty-aware 3D human pose estimation. The proposed approach offers a promising direction for further work on handling real-world occlusions.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Multi-hop graph transformer network for 3D human pose estimation

Zaedul Islam, A. Ben Hamza

Accurate 3D human pose estimation is a challenging task due to occlusion and depth ambiguity. In this paper, we introduce a multi-hop graph transformer network designed for 2D-to-3D human pose estimation in videos by leveraging the strengths of multi-head self-attention and multi-hop graph convolutional networks with disentangled neighborhoods to capture spatio-temporal dependencies and handle long-range interactions. The proposed network architecture consists of a graph attention block composed of stacked layers of multi-head self-attention and graph convolution with learnable adjacency matrix, and a multi-hop graph convolutional block comprised of multi-hop convolutional and dilated convolutional layers. The combination of multi-head self-attention and multi-hop graph convolutional layers enables the model to capture both local and global dependencies, while the integration of dilated convolutional layers enhances the model's ability to handle spatial details required for accurate localization of the human body joints. Extensive experiments demonstrate the effectiveness and generalization ability of our model, achieving competitive performance on benchmark datasets.

5/7/2024

cs.CV

Quater-GCN: Enhancing 3D Human Pose Estimation with Orientation and Semi-supervised Training

Xingyu Song, Zhan Li, Shi Chen, Kazuyuki Demachi

3D human pose estimation is a vital task in computer vision, involving the prediction of human joint positions from images or videos to reconstruct a skeleton of a human in three-dimensional space. This technology is pivotal in various fields, including animation, security, human-computer interaction, and automotive safety, where it promotes both technological progress and enhanced human well-being. The advent of deep learning significantly advances the performance of 3D pose estimation by incorporating temporal information for predicting the spatial positions of human joints. However, traditional methods often fall short as they primarily focus on the spatial coordinates of joints and overlook the orientation and rotation of the connecting bones, which are crucial for a comprehensive understanding of human pose in 3D space. To address these limitations, we introduce Quater-GCN (Q-GCN), a directed graph convolutional network tailored to enhance pose estimation by orientation. Q-GCN excels by not only capturing the spatial dependencies among node joints through their coordinates but also integrating the dynamic context of bone rotations in 2D space. This approach enables a more sophisticated representation of human poses by also regressing the orientation of each bone in 3D space, moving beyond mere coordinate prediction. Furthermore, we complement our model with a semi-supervised training strategy that leverages unlabeled data, addressing the challenge of limited orientation ground truth data. Through comprehensive evaluations, Q-GCN has demonstrated outstanding performance against current state-of-the-art methods.

5/1/2024

cs.CV

Occlusion Handling in 3D Human Pose Estimation with Perturbed Positional Encoding

Niloofar Azizi, Mohsen Fayyaz, Horst Bischof

Understanding human behavior fundamentally relies on accurate 3D human pose estimation. Graph Convolutional Networks (GCNs) have recently shown promising advancements, delivering state-of-the-art performance with rather lightweight architectures. In the context of graph-structured data, leveraging the eigenvectors of the graph Laplacian matrix for positional encoding is effective. Yet, the approach does not specify how to handle scenarios where edges in the input graph are missing. To this end, we propose a novel positional encoding technique, PerturbPE, that extracts consistent and regular components from the eigenbasis. Our method involves applying multiple perturbations and taking their average to extract the consistent and regular component from the eigenbasis. PerturbPE leverages the Rayleigh-Schrödinger Perturbation Theorem (RSPT) for calculating the perturbed eigenvectors. Employing this labeling technique enhances the robustness and generalizability of the model. Our results support our theoretical findings, e.g. our experimental analysis observed a performance enhancement of up to 12% on the Human3.6M dataset in instances where occlusion resulted in the absence of one edge. Furthermore, our novel approach significantly enhances performance in scenarios where two edges are missing, setting a new benchmark for state-of-the-art.

5/28/2024

cs.CV

HOIMotion: Forecasting Human Motion During Human-Object Interactions Using Egocentric 3D Object Bounding Boxes

Zhiming Hu, Zheming Yin, Daniel Haeufle, Syn Schmitt, Andreas Bulling

We present HOIMotion - a novel approach for human motion forecasting during human-object interactions that integrates information about past body poses and egocentric 3D object bounding boxes. Human motion forecasting is important in many augmented reality applications but most existing methods have only used past body poses to predict future motion. HOIMotion first uses an encoder-residual graph convolutional network (GCN) and multi-layer perceptrons to extract features from body poses and egocentric 3D object bounding boxes, respectively. Our method then fuses pose and object features into a novel pose-object graph and uses a residual-decoder GCN to forecast future body motion. We extensively evaluate our method on the Aria digital twin (ADT) and MoGaze datasets and show that HOIMotion consistently outperforms state-of-the-art methods by a large margin of up to 8.7% on ADT and 7.2% on MoGaze in terms of mean per joint position error. Complementing these evaluations, we report a human study (N=20) that shows that the improvements achieved by our method result in forecasted poses being perceived as both more precise and more realistic than those of existing methods. Taken together, these results reveal the significant information content available in egocentric 3D object bounding boxes for human motion forecasting and the effectiveness of our method in exploiting this information.

7/4/2024

cs.CV