Unsupervised View-Invariant Human Posture Representation

Read original: arXiv:2109.08730 - Published 7/9/2024 by Faegheh Sardari, Bjorn Ommer, Majid Mirmehdi

Overview

The paper presents a novel unsupervised approach for learning view-invariant 3D human pose representations from 2D image data, without the need for 3D skeleton annotations.
The model is trained to exploit the intrinsic view-invariance and equivariance properties of human poses across different viewpoints and augmented frames.
The learned representations are evaluated on two downstream tasks: cross-view action classification and cross-view human movement quality assessment.
The approach outperforms state-of-the-art unsupervised methods for cross-view action recognition and achieves the first-ever unsupervised cross-view and cross-subject results on a human movement quality dataset.

Plain English Explanation

The paper introduces a new way to teach a machine learning model to understand human poses and actions without needing 3D data, which can be difficult to collect. The model learns to extract view-invariant features from 2D images, meaning it can recognize the same pose or action even when the camera angle changes.

The key idea is to train the model using two types of data: 1) simultaneous frames from different viewpoints, which show the same pose and should therefore map to the same representation (view-invariance), and 2) augmented frames from the same viewpoint, for which the representation should change in a predictable way that mirrors the augmentation (equivariance). By learning from these cues, the model can build a robust representation of human pose that works across different views.

The authors then show this learned representation is useful for two real-world applications: 1) classifying human actions in a cross-view setting, where the camera angle changes, and 2) assessing the quality of human movements, which is important for applications like physical therapy. The results demonstrate significant improvements over previous unsupervised methods and, on the movement quality dataset, marginally improve on the state-of-the-art supervised results.

Technical Explanation

The paper proposes a novel unsupervised approach for learning view-invariant 3D human pose representations from 2D image data, without relying on 3D skeleton annotations. The key idea is to exploit the intrinsic view-invariant properties of human poses between simultaneous frames from different viewpoints, as well as the equivariant properties of poses between augmented frames from the same viewpoint.
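The two properties can be written down compactly. The notation below is an assumption made for illustration and is not taken from the paper: f is the encoder, x_t^{(v)} is the frame captured at time t from viewpoint v, g is an image augmentation, and \rho_g is the corresponding transform in representation space.

```latex
% Assumed notation (not from the paper): f = encoder, x_t^{(v)} = frame at time t
% from viewpoint v, g = image augmentation, \rho_g = induced transform on features.
\begin{align}
  f\bigl(x_t^{(v_1)}\bigr) &\approx f\bigl(x_t^{(v_2)}\bigr)
    && \text{(view invariance: same instant, different cameras)} \\
  f\bigl(g(x_t^{(v)})\bigr) &\approx \rho_g\bigl(f(x_t^{(v)})\bigr)
    && \text{(equivariance: same camera, augmented frame)}
\end{align}
```

Invariance asks the output to stay fixed when the camera changes, while equivariance asks the output to move in step with the augmentation, so the two constraints complement rather than duplicate each other.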

The model architecture consists of an encoder network that takes 2D images as input and outputs a view-invariant 3D pose representation. This encoder is trained using a combination of self-supervised losses that encourage the learned representations to be view-invariant and equivariant. Specifically, the authors use a contrastive loss to pull together representations of the same pose from different viewpoints, and a reconstruction loss to ensure the representations can be used to accurately predict 2D joint locations under different augmentations.
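To make the view-invariance objective concrete, here is a minimal sketch of a contrastive loss over paired embeddings from two cameras. The summary above does not give the exact formulation, so the InfoNCE-style form, the `temperature` parameter, and the function names are all assumptions chosen for illustration, not the authors' implementation.

```python
import math


def l2_normalize(vector):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def info_nce(view_a, view_b, temperature=0.1):
    """Hypothetical InfoNCE-style loss (an assumed form, not the paper's).

    view_a[i] and view_b[i] are embeddings of the same time instant seen from
    two different cameras (the positive pair); every other row of view_b acts
    as a negative for view_a[i].
    """
    a = [l2_normalize(v) for v in view_a]
    b = [l2_normalize(v) for v in view_b]
    total = 0.0
    for i, anchor in enumerate(a):
        logits = [dot(anchor, other) / temperature for other in b]
        # Numerically stable log-sum-exp over all candidates.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
        # Cross-entropy with the matching index as the correct class.
        total += log_denominator - logits[i]
    return total / len(a)


# Toy embeddings: two poses, each observed from two viewpoints.
camera_1 = [[1.0, 0.0], [0.0, 1.0]]
camera_2_aligned = [[0.9, 0.1], [0.1, 0.9]]   # same pose order as camera_1
camera_2_swapped = [[0.1, 0.9], [0.9, 0.1]]   # poses paired incorrectly

loss_aligned = info_nce(camera_1, camera_2_aligned)
loss_swapped = info_nce(camera_1, camera_2_swapped)
```

In this toy case the correctly paired embeddings give a much smaller loss than the swapped ones, which is the behaviour a view-invariance term is meant to reward.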

The authors evaluate the learned representations on two downstream tasks. First, they demonstrate significant improvements in unsupervised cross-view action classification accuracy on the NTU RGB+D dataset, on both RGB and depth images, compared to previous state-of-the-art methods. Second, they show that the representations learned on NTU RGB+D can be transferred to obtain the first-ever unsupervised cross-view and cross-subject rank correlation results on the multi-view movement quality dataset QMAR, marginally outperforming the state-of-the-art supervised results.
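The paper's abstract reports the QMAR results as rank correlations between predicted and reference quality scores. The sketch below shows how such a score could be computed; Spearman's coefficient is assumed here as a common choice, and the function names and toy scores are hypothetical.

```python
import math


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next sorted value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; assumes neither input is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)


def spearman(predicted, reference):
    """Spearman rank correlation: Pearson correlation of the two rank vectors."""
    return pearson(average_ranks(predicted), average_ranks(reference))


# Hypothetical scores for four movement sequences.
predicted_scores = [0.10, 0.40, 0.35, 0.80]
reference_scores = [1, 3, 2, 4]
rho = spearman(predicted_scores, reference_scores)
```

Because only the ordering matters, a model can obtain a high rank correlation even when its raw scores are on a different scale from the reference scores.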

The authors also conduct ablation studies to examine the contributions of the different components of their proposed network, such as the view-invariance and equivariance losses.

Critical Analysis

The key strength of this work is the ability to learn view-invariant 3D pose representations from 2D image data alone, without relying on expensive and hard-to-obtain 3D skeleton annotations. This is a significant advancement over previous approaches that required such 3D data.

However, the paper does not discuss potential limitations or failure cases of the proposed method. For example, it is unclear how the model would perform in highly cluttered scenes or with significant occlusions, where accurately estimating 2D joint locations may be challenging. Additionally, the evaluation is focused on constrained datasets, and further research is needed to assess the real-world applicability of the approach.

It would also be interesting to see how the learned representations compare to those obtained from 3D pose estimation models trained on 3D data, in terms of downstream task performance and generalization. A more comprehensive comparison could provide valuable insights into the tradeoffs between supervised and unsupervised approaches for pose representation learning.

Overall, the paper presents a compelling and technically sound approach that demonstrates the potential of unsupervised methods for learning robust 3D pose representations from 2D data. However, further research is needed to fully understand the limitations and broader applicability of the proposed technique.

Conclusion

This paper introduces a novel unsupervised approach for learning view-invariant 3D human pose representations from 2D image data, without the need for 3D skeleton annotations. The key idea is to exploit the inherent view-invariance and equivariance properties of human poses across different viewpoints and augmented frames during training.

The learned representations are shown to be effective for two downstream tasks: unsupervised cross-view action classification and unsupervised cross-view human movement quality assessment. The results significantly outperform previous state-of-the-art unsupervised methods on NTU RGB+D and marginally improve on the state-of-the-art supervised results on QMAR.

This work highlights the potential of unsupervised methods for building robust 3D pose understanding from 2D data, which could have important implications for a wide range of applications, from action recognition to physical rehabilitation. Further research is needed to fully understand the limitations and real-world applicability of the proposed approach, but this paper represents an important step forward in this direction.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Unsupervised View-Invariant Human Posture Representation

Faegheh Sardari, Bjorn Ommer, Majid Mirmehdi

Most recent view-invariant action recognition and performance assessment approaches rely on a large amount of annotated 3D skeleton data to extract view-invariant features. However, acquiring 3D skeleton data can be cumbersome, if not impractical, in in-the-wild scenarios. To overcome this problem, we present a novel unsupervised approach that learns to extract view-invariant 3D human pose representation from a 2D image without using 3D joint data. Our model is trained by exploiting the intrinsic view-invariant properties of human pose between simultaneous frames from different viewpoints and their equivariant properties between augmented frames from the same viewpoint. We evaluate the learned view-invariant pose representations for two downstream tasks. We perform comparative experiments that show improvements on the state-of-the-art unsupervised cross-view action classification accuracy on NTU RGB+D by a significant margin, on both RGB and depth images. We also show the efficiency of transferring the learned representations from NTU RGB+D to obtain the first ever unsupervised cross-view and cross-subject rank correlation results on the multi-view human movement quality dataset, QMAR, and marginally improve on the state-of-the-art supervised results for this dataset. We also carry out ablation studies to examine the contributions of the different components of our proposed network.

7/9/2024

Markerless Multi-view 3D Human Pose Estimation: a survey

Ana Filipa Rodrigues Nogueira, Hélder P. Oliveira, Luís F. Teixeira

3D human pose estimation aims to reconstruct the human skeleton of all the individuals in a scene by detecting several body joints. The creation of accurate and efficient methods is required for several real-world applications including animation, human-robot interaction, surveillance systems or sports, among many others. However, several obstacles such as occlusions, random camera perspectives, or the scarcity of 3D labelled data, have been hampering the models' performance and limiting their deployment in real-world scenarios. The higher availability of cameras has led researchers to explore multi-view solutions due to the advantage of being able to exploit different perspectives to reconstruct the pose. Thus, the goal of this survey is to present an overview of the methodologies used to estimate the 3D pose in multi-view settings, understand what were the strategies found to address the various challenges and also, identify their limitations. Based on the reviewed articles, it was possible to find that no method is yet capable of solving all the challenges associated with the reconstruction of the 3D pose. Due to the existing trade-off between complexity and performance, the best method depends on the application scenario. Therefore, further research is still required to develop an approach capable of quickly inferring a highly accurate 3D pose with bearable computation cost. To this goal, techniques such as active learning, methods that learn with a low level of supervision, the incorporation of temporal consistency, view selection, estimation of depth information and multi-modal approaches might be interesting strategies to keep in mind when developing a new methodology to solve this task.

7/8/2024

Mask as Supervision: Leveraging Unified Mask Information for Unsupervised 3D Pose Estimation

Yuchen Yang, Yu Qiao, Xiao Sun

Automatic estimation of 3D human pose from monocular RGB images is a challenging and unsolved problem in computer vision. In a supervised manner, approaches heavily rely on laborious annotations and present hampered generalization ability due to the limited diversity of 3D pose datasets. To address these challenges, we propose a unified framework that leverages mask as supervision for unsupervised 3D pose estimation. With general unsupervised segmentation algorithms, the proposed model employs skeleton and physique representations that exploit accurate pose information from coarse to fine. Compared with previous unsupervised approaches, we organize the human skeleton in a fully unsupervised way which enables the processing of annotation-free data and provides ready-to-use estimation results. Comprehensive experiments demonstrate our state-of-the-art pose estimation performance on Human3.6M and MPI-INF-3DHP datasets. Further experiments on in-the-wild datasets also illustrate the capability to access more data to boost our model. Code will be available at https://github.com/Charrrrrlie/Mask-as-Supervision.

7/9/2024

Multi-person 3D pose estimation from unlabelled data

Daniel Rodriguez-Criado, Pilar Bachiller, George Vogiatzis, Luis J. Manso

Its numerous applications make multi-human 3D pose estimation a remarkably impactful area of research. Nevertheless, assuming a multiple-view system composed of several regular RGB cameras, 3D multi-pose estimation presents several challenges. First of all, each person must be uniquely identified in the different views to separate the 2D information provided by the cameras. Secondly, the 3D pose estimation process from the multi-view 2D information of each person must be robust against noise and potential occlusions in the scenario. In this work, we address these two challenges with the help of deep learning. Specifically, we present a model based on Graph Neural Networks capable of predicting the cross-view correspondence of the people in the scenario along with a Multilayer Perceptron that takes the 2D points to yield the 3D poses of each person. These two models are trained in a self-supervised manner, thus avoiding the need for large datasets with 3D annotations.

4/10/2024