MPM: A Unified 2D-3D Human Pose Representation via Masked Pose Modeling

Read original: arXiv:2306.17201 - Published 7/16/2024 by Zhenyu Zhang, Wenhao Chai, Zhongyu Jiang, Tian Ye, Mingli Song, Jenq-Neng Hwang, Gaoang Wang

Overview

This paper proposes MPM, a unified 2D-3D human pose representation framework that uses masked pose modeling to learn 2D and 3D poses in a shared feature space.
MPM treats 2D and 3D poses as two modalities and feeds them to a single-stream transformer, which is pre-trained to reconstruct masked 2D and 3D poses and then fine-tuned with full supervision.
The authors report that one framework handles 3D pose estimation, 3D pose estimation from occluded 2D pose, and 3D pose completion, with state-of-the-art results on MPI-INF-3DHP.

Plain English Explanation

The paper introduces a new way to represent human poses in both 2D (flat image coordinates) and 3D (with depth) using a single neural network. The framework, called MPM, is trained by masked pose modeling: it learns to recover the full pose from a partial, or "masked," set of joints and frames.

The key insight is that forcing the model to "fill in the blanks" and recover the missing parts of the pose helps it learn a more robust and unified understanding of human pose in 2D and 3D. This is beneficial because 2D and 3D pose estimation are often treated as separate problems, but in reality, they are closely related.

By learning 2D and 3D pose representations jointly, one MPM network can serve several tasks that would otherwise need separate models: estimating 3D pose from 2D pose, doing so when some 2D joints are occluded, and completing a partial 3D pose. The authors report state-of-the-art performance on the MPI-INF-3DHP benchmark.

The Masked Pose Modeling technique is similar in spirit to the Unified Masked Autoencoder and Semi-Supervised 2D Human Pose Estimation approaches, which also leverage partial information to learn more generalizable pose representations. The Multimodal Sense-Informed Prediction and Graph-Skipped Transformer models also explore ways to jointly model 2D and 3D human pose.

Technical Explanation

The key technical contribution of this paper is MPM, a single-stream transformer-based architecture that learns a unified 2D-3D human pose representation through masked pose modeling.

The network takes 2D and 3D pose sequences as input, treating them as two modalities in the way vision-language models treat images and text. During pre-training, a large share of the joints is masked (71.8% in total, sampled with a spatio-temporal strategy), and the network must reconstruct the masked 2D and 3D poses. It is then fine-tuned with full supervision.
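
To make the masking step concrete, the following is a minimal sketch of one possible spatio-temporal mask sampler. It is not the authors' implementation: the function names, the two-stage scheme (hide whole frames first, then hide joints in the remaining frames), and the 0.40 / 0.53 split are all assumptions, chosen only so that the combined ratio lands near the 71.8% total quoted in the paper's abstract.

```python
import random


def sample_spatiotemporal_mask(num_frames, num_joints,
                               temporal_ratio, spatial_ratio, seed=0):
    """Return a num_frames x num_joints grid of booleans (True = masked).

    Hypothetical two-stage scheme, not taken from the paper:
      1. temporal: hide every joint in a random subset of frames;
      2. spatial: in each remaining frame, hide a random subset of joints.
    """
    rng = random.Random(seed)
    n_masked_frames = round(num_frames * temporal_ratio)
    n_masked_joints = round(num_joints * spatial_ratio)
    masked_frames = set(rng.sample(range(num_frames), n_masked_frames))

    mask = []
    for t in range(num_frames):
        if t in masked_frames:
            mask.append([True] * num_joints)
        else:
            hidden = set(rng.sample(range(num_joints), n_masked_joints))
            mask.append([j in hidden for j in range(num_joints)])
    return mask


def masked_fraction(mask):
    """Fraction of joint positions that are masked across all frames."""
    total = sum(len(row) for row in mask)
    return sum(sum(row) for row in mask) / total


# Illustrative ratios only; the split between the two stages is an assumption.
mask = sample_spatiotemporal_mask(num_frames=10, num_joints=17,
                                  temporal_ratio=0.40, spatial_ratio=0.53)
print(f"masked fraction: {masked_fraction(mask):.3f}")
```

With 10 frames of 17 joints, this sketch hides 4 whole frames and 9 joints in each remaining frame, which is 122 of 170 joint positions (about 71.8%). Under this assumed scheme the total ratio is temporal_ratio + (1 - temporal_ratio) * spatial_ratio, up to rounding.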

By training the model to recover the complete pose from partial inputs, the authors hypothesize that the encoder will learn a more robust and generalizable pose representation that captures the inherent 2D-3D structure of human bodies.
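
The training signal described above can be sketched as a reconstruction loss computed only on the hidden joints. The summary does not state which loss MPM uses; mean squared error over masked positions is a common choice in masked-modeling setups and is assumed here, and the function name and data layout are hypothetical.

```python
def masked_reconstruction_loss(pred, target, mask):
    """Mean squared error over masked joints only (assumed loss, see lead-in).

    pred, target: list of frames, each a list of joint coordinate tuples.
    mask: same frame/joint layout; True where the joint was hidden from
          the model and therefore has to be reconstructed.
    """
    total, count = 0.0, 0
    for pred_frame, target_frame, mask_frame in zip(pred, target, mask):
        for p, g, hidden in zip(pred_frame, target_frame, mask_frame):
            if hidden:
                total += sum((a - b) ** 2 for a, b in zip(p, g))
                count += 1
    return total / count if count else 0.0


# One frame with two 2D joints; only the second joint was masked.
target = [[(0.0, 0.0), (1.0, 1.0)]]
pred = [[(5.0, 5.0), (1.0, 3.0)]]
mask = [[False, True]]
print(masked_reconstruction_loss(pred, target, mask))  # 4.0
```

The large error on the visible first joint does not contribute, so the model is rewarded only for filling in what it could not see.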

The authors evaluate MPM on several widely used human pose datasets and report state-of-the-art performance on MPI-INF-3DHP. The experiments include ablation studies of the framework's design choices.
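
This summary does not say which metric the authors report. Mean per-joint position error (MPJPE), the metric named in the MPL abstract under Related Papers, is a common choice on 3D pose benchmarks, so the sketch below assumes it. The function name and the sample coordinates are illustrative only.

```python
import math


def mpjpe(pred_joints, gt_joints):
    """Mean Euclidean distance between predicted and ground-truth joints.

    Both arguments are equal-length sequences of (x, y, z) coordinates
    expressed in the same unit (typically millimetres).
    """
    if len(pred_joints) != len(gt_joints):
        raise ValueError("pred_joints and gt_joints must have the same length")
    distances = [math.dist(p, g) for p, g in zip(pred_joints, gt_joints)]
    return sum(distances) / len(distances)


# Two joints: one is off by 3 units along z, the other by 4 units along y.
pred = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
gt = [(0.0, 0.0, 3.0), (1.0, 4.0, 0.0)]
print(mpjpe(pred, gt))  # 3.5
```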

Because both modalities are pre-trained in one feature space (the masked 3D pose modeling task does use 3D data), a single MPM network can handle 3D human pose estimation, 3D pose estimation from occluded 2D pose, and 3D pose completion.
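
The abstract quoted under Related Papers lists three tasks handled in a single framework: 3D pose estimation, 3D pose estimation from occluded 2D pose, and 3D pose completion. One way to read this is that each task corresponds to a different masking pattern over the 2D and 3D joint tokens. The sketch below illustrates that reading; the task names, the function, and the choice to supply no 2D pose for completion are assumptions, not details from the paper.

```python
def task_masks(task, num_joints, occluded_2d=(), missing_3d=()):
    """Return (mask_2d, mask_3d); True marks a joint token hidden from the model."""
    all_hidden = [True] * num_joints
    none_hidden = [False] * num_joints
    if task == "lift":
        # 3D pose estimation from a complete 2D pose.
        return none_hidden, all_hidden
    if task == "lift_occluded":
        # 3D pose estimation when some 2D joints are occluded.
        return [j in occluded_2d for j in range(num_joints)], all_hidden
    if task == "complete_3d":
        # 3D pose completion. Assumption: no 2D pose is supplied here.
        return all_hidden, [j in missing_3d for j in range(num_joints)]
    raise ValueError(f"unknown task: {task!r}")


mask_2d, mask_3d = task_masks("lift_occluded", num_joints=3, occluded_2d={1})
print(mask_2d, mask_3d)
```

In every case the model's job is the same as in pre-training: predict whatever is marked as hidden from whatever is visible.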

Critical Analysis

The paper makes a compelling case for the benefits of a unified 2D-3D pose representation, and the MPM model demonstrates strong performance on benchmark tasks. However, there are a few potential limitations and areas for further research:

Dependence on Partial Inputs: While the masked input approach is a key innovation, it may limit the model's ability to perform well on fully visible inputs. The authors could explore ways to seamlessly integrate both partial and full inputs during training and inference.
Generalization to Diverse Datasets: The experiments are conducted on widely used, well-curated benchmarks such as MPI-INF-3DHP. It would be valuable to evaluate the model on more diverse and challenging real-world data to assess its practical applicability.
Interpretability of Learned Representations: The paper provides some analysis of the learned representations, but a more in-depth exploration of what the model has learned about the 2D-3D pose relationship could yield additional insights.
Computational Efficiency: The transformer-based architecture may be costly to deploy on resource-constrained devices. Potential avenues for improving efficiency, such as model compression or knowledge distillation, could be investigated.

Overall, the Masked Pose Modeling approach is a promising step towards unified 2D-3D human pose estimation, and the authors have demonstrated its effectiveness on standard benchmarks. Further research to address the limitations mentioned above could lead to even more practical and impactful applications of this technology.

Conclusion

The MPM paper presents a unified 2D-3D human pose representation that uses masked pose modeling to learn a more robust and generalizable pose model. By learning 2D and 3D poses in a single architecture, MPM handles several pose tasks in one framework and achieves state-of-the-art performance on MPI-INF-3DHP.

This work contributes to the broader effort to develop more versatile and data-efficient human pose estimation systems, which have numerous applications in areas like human-computer interaction, robotics, and animation. The masked input approach and the insights gained about the correspondence between 2D and 3D pose representations could inspire further advancements in this important field of computer vision and machine learning.

Related Papers

MPM: A Unified 2D-3D Human Pose Representation via Masked Pose Modeling

Zhenyu Zhang, Wenhao Chai, Zhongyu Jiang, Tian Ye, Mingli Song, Jenq-Neng Hwang, Gaoang Wang

Estimating 3D human poses only from a 2D human pose sequence is thoroughly explored in recent years. Yet, prior to this, no such work has attempted to unify 2D and 3D pose representations in the shared feature space. In this paper, we propose MPM, a unified 2D-3D human pose representation framework via masked pose modeling. We treat 2D and 3D poses as two different modalities like vision and language and build a single-stream transformer-based architecture. We apply two pretext tasks, which are masked 2D pose modeling, and masked 3D pose modeling to pre-train our network and use full-supervision to perform further fine-tuning. A high masking ratio of 71.8% in total with a spatio-temporal mask sampling strategy leads to better relation modeling both in spatial and temporal domains. MPM can handle multiple tasks including 3D human pose estimation, 3D pose estimation from occluded 2D pose, and 3D pose completion in a single framework. We conduct extensive experiments and ablation studies on several widely used human pose datasets and achieve state-of-the-art performance on MPI-INF-3DHP.

7/16/2024

MPL: Lifting 3D Human Pose from Multi-view 2D Poses

Seyed Abolfazl Ghasemzadeh, Alexandre Alahi, Christophe De Vleeschouwer

Estimating 3D human poses from 2D images is challenging due to occlusions and projective acquisition. Learning-based approaches have been largely studied to address this challenge, both in single and multi-view setups. These solutions however fail to generalize to real-world cases due to the lack of (multi-view) 'in-the-wild' images paired with 3D poses for training. For this reason, we propose combining 2D pose estimation, for which large and rich training datasets exist, and 2D-to-3D pose lifting, using a transformer-based network that can be trained from synthetic 2D-3D pose pairs. Our experiments demonstrate decreases up to 45% in MPJPE errors compared to the 3D pose obtained by triangulating the 2D poses. The framework's source code is available at https://github.com/aghasemzadeh/OpenMPL .

8/21/2024

Mask as Supervision: Leveraging Unified Mask Information for Unsupervised 3D Pose Estimation

Yuchen Yang, Yu Qiao, Xiao Sun

Automatic estimation of 3D human pose from monocular RGB images is a challenging and unsolved problem in computer vision. In a supervised manner, approaches heavily rely on laborious annotations and present hampered generalization ability due to the limited diversity of 3D pose datasets. To address these challenges, we propose a unified framework that leverages mask as supervision for unsupervised 3D pose estimation. With general unsupervised segmentation algorithms, the proposed model employs skeleton and physique representations that exploit accurate pose information from coarse to fine. Compared with previous unsupervised approaches, we organize the human skeleton in a fully unsupervised way which enables the processing of annotation-free data and provides ready-to-use estimation results. Comprehensive experiments demonstrate our state-of-the-art pose estimation performance on Human3.6M and MPI-INF-3DHP datasets. Further experiments on in-the-wild datasets also illustrate the capability to access more data to boost our model. Code will be available at https://github.com/Charrrrrlie/Mask-as-Supervision.

7/9/2024

PoseMamba: Monocular 3D Human Pose Estimation with Bidirectional Global-Local Spatio-Temporal State Space Model

Yunlong Huang, Junshuo Liu, Ke Xian, Robert Caiming Qiu

Transformers have significantly advanced the field of 3D human pose estimation (HPE). However, existing transformer-based methods primarily use self-attention mechanisms for spatio-temporal modeling, leading to a quadratic complexity, unidirectional modeling of spatio-temporal relationships, and insufficient learning of spatial-temporal correlations. Recently, the Mamba architecture, utilizing the state space model (SSM), has exhibited superior long-range modeling capabilities in a variety of vision tasks with linear complexity. In this paper, we propose PoseMamba, a novel purely SSM-based approach with linear complexity for 3D human pose estimation in monocular video. Specifically, we propose a bidirectional global-local spatio-temporal SSM block that comprehensively models human joint relations within individual frames as well as temporal correlations across frames. Within this bidirectional global-local spatio-temporal SSM block, we introduce a reordering strategy to enhance the local modeling capability of the SSM. This strategy provides a more logical geometric scanning order and integrates it with the global SSM, resulting in a combined global-local spatial scan. We have quantitatively and qualitatively evaluated our approach using two benchmark datasets: Human3.6M and MPI-INF-3DHP. Extensive experiments demonstrate that PoseMamba achieves state-of-the-art performance on both datasets while maintaining a smaller model size and reducing computational costs. The code and models will be released.

8/9/2024