SkelFormer: Markerless 3D Pose and Shape Estimation using Skeletal Transformers

Read original: arXiv:2404.12625 - Published 4/22/2024 by Vandad Davoodnia, Saeed Ghorbani, Alexandre Messier, Ali Etemad

Overview

This paper introduces SkelFormer, a novel markerless 3D pose and shape estimation method that uses skeletal transformers.
SkelFormer aims to accurately capture the 3D pose and shape of human subjects without the need for specialized motion capture equipment.
The method first obtains 3D joint positions from off-the-shelf 2D keypoint estimators applied to multiple camera views, then uses an inverse-kinematic skeletal transformer to map those joint positions to full 3D body pose and shape.

Plain English Explanation

SkelFormer is a new system that can estimate the 3D position and shape of a person's body just from video footage, without the need for any special sensors or markers. Traditional motion capture systems require the person to wear a suit with reflective markers, which can be cumbersome and expensive. SkelFormer instead uses advanced artificial intelligence (AI) techniques to analyze the video and infer the 3D pose and shape of the person's body.

At the core of SkelFormer is a type of AI model called a "transformer." Transformers are a powerful machine learning architecture that can efficiently process and understand complex sequential data, like the movements of a person's body over time. SkelFormer uses a specialized transformer model that is designed to work with skeletal data - the key joints and bones that make up the human body structure.

By feeding video footage into SkelFormer, the system can automatically detect the person's body and estimate the 3D position of all the major joints, as well as the overall shape and proportions of the body. This information can then be used for a variety of applications, such as robotics, sports analytics, or animation and visual effects.

Technical Explanation

The key innovation in SkelFormer is the use of a transformer-based architecture to process the skeletal data. Transformers have shown great success in natural language processing tasks, and the researchers hypothesized that their ability to model long-range dependencies would also be beneficial for 3D pose estimation.

SkelFormer takes multi-view video frames as input and first uses off-the-shelf 2D keypoint estimators to locate the major body joints in each view. These 2D keypoints are then lifted into 3D joint positions using a technique called unprojection. The resulting 3D skeletal data, which can be heavily noisy, is fed into a series of transformer encoder and decoder layers that map the joint positions to pose and shape representations.
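
One way to picture the lifting step is as ray intersection: each 2D keypoint is unprojected into a ray through its camera, and the 3D joint is taken as the point closest to all rays in the least-squares sense. The sketch below is a minimal illustration under stated assumptions (calibrated pinhole cameras with known centers, one shared focal length, and camera axes aligned with the world axes). The names `unproject_pixel` and `triangulate_rays` are hypothetical, and this is not claimed to be the authors' implementation.

```python
from typing import List, Sequence


def unproject_pixel(pixel: Sequence[float], focal: float,
                    principal: Sequence[float]) -> List[float]:
    """Turn a 2D pixel into a unit ray direction (pinhole model).

    Assumption: the camera axes are aligned with the world axes, so the
    camera-frame direction can be used directly as a world-frame direction.
    """
    x = (pixel[0] - principal[0]) / focal
    y = (pixel[1] - principal[1]) / focal
    norm = (x * x + y * y + 1.0) ** 0.5
    return [x / norm, y / norm, 1.0 / norm]


def _det3(m: Sequence[Sequence[float]]) -> float:
    """Determinant of a 3x3 matrix."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))


def _solve3(a: Sequence[Sequence[float]], b: Sequence[float]) -> List[float]:
    """Solve the 3x3 system a @ x = b with Cramer's rule."""
    d = _det3(a)
    if abs(d) < 1e-12:
        raise ValueError("rays are (nearly) parallel; point is not determined")
    solution = []
    for col in range(3):
        # Replace column `col` of a with b, then take the determinant ratio.
        m = [[b[r] if c == col else a[r][c] for c in range(3)] for r in range(3)]
        solution.append(_det3(m) / d)
    return solution


def triangulate_rays(centers: Sequence[Sequence[float]],
                     directions: Sequence[Sequence[float]]) -> List[float]:
    """Return the 3D point minimizing squared distance to all rays.

    Each ray i contributes (I - u_i u_i^T) x = (I - u_i u_i^T) c_i, where
    c_i is the camera center and u_i the unit ray direction.
    """
    a = [[0.0] * 3 for _ in range(3)]
    b = [0.0] * 3
    for c, d in zip(centers, directions):
        n = sum(v * v for v in d) ** 0.5
        u = [v / n for v in d]
        for i in range(3):
            for j in range(3):
                p = (1.0 if i == j else 0.0) - u[i] * u[j]
                a[i][j] += p
                b[i] += p * c[j]
    return _solve3(a, b)


if __name__ == "__main__":
    # Two cameras one unit either side of the origin, both seeing one joint.
    cams = [(-1.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
    rays = [unproject_pixel((1040.0, 360.0), 1000.0, (640.0, 360.0)),
            unproject_pixel((240.0, 360.0), 1000.0, (640.0, 360.0))]
    print(triangulate_rays(cams, rays))
```

With noisy 2D detections the rays no longer meet at a single point, which is why a least-squares solution is used here rather than an exact intersection.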

The transformer modules in SkelFormer are designed to effectively capture the kinematic structure and dynamics of the human body. This allows the model to better understand how the different body parts should move in relation to each other, leading to more accurate and robust 3D pose predictions.
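
To make the idea of relating body parts to one another concrete, here is a minimal sketch of single-head scaled dot-product self-attention in which each token is one joint's 3D position. It is generic transformer attention, not SkelFormer's actual layer design; the function `self_attention` and the weight matrices `w_q`, `w_k`, `w_v` are illustrative assumptions.

```python
import math
from typing import List, Tuple

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain matrix product of a (n x m) and b (m x p)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def softmax(row: List[float]) -> List[float]:
    """Numerically stable softmax over one row of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [v / total for v in exps]


def self_attention(tokens: Matrix, w_q: Matrix, w_k: Matrix,
                   w_v: Matrix) -> Tuple[Matrix, Matrix]:
    """Single-head self-attention over joint tokens.

    tokens: one row per joint (for example its x, y, z position).
    Returns the attended features and the joint-to-joint attention weights.
    """
    q = matmul(tokens, w_q)
    k = matmul(tokens, w_k)
    v = matmul(tokens, w_v)
    scale = 1.0 / math.sqrt(len(q[0]))
    # scores[i][j]: how strongly joint i attends to joint j.
    scores = [[scale * sum(a * b for a, b in zip(qi, kj)) for kj in k]
              for qi in q]
    weights = [softmax(r) for r in scores]
    return matmul(weights, v), weights


if __name__ == "__main__":
    # Three hypothetical joints (hip, knee, ankle) with identity projections.
    joints = [[0.0, 1.0, 0.0], [0.0, 0.5, 0.1], [0.0, 0.0, 0.0]]
    eye = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    features, attn = self_attention(joints, eye, eye, eye)
    for row in attn:
        print([round(w, 3) for w in row])
```

Because every joint can attend to every other joint, a noisy or occluded joint can borrow evidence from the rest of the skeleton, which is the intuition behind the robustness described above.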

SkelFormer is evaluated on three public datasets for 3D human pose and shape estimation, in both in-distribution and out-of-distribution settings. The results show strong performance relative to prior methods, particularly in challenging scenarios with noisy inputs or heavy occlusions, and ablation experiments measure the contribution of each module.
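
This summary does not say which error metric is reported. A common choice on 3D pose benchmarks is the mean per-joint position error (MPJPE), the average Euclidean distance between predicted and ground-truth joints. The sketch below assumes that metric purely for illustration; the function name `mpjpe` and the sample coordinates are made up.

```python
import math
from typing import Sequence


def mpjpe(predicted: Sequence[Sequence[float]],
          ground_truth: Sequence[Sequence[float]]) -> float:
    """Mean per-joint position error, in the units of the inputs."""
    if not predicted or len(predicted) != len(ground_truth):
        raise ValueError("need two non-empty joint lists of equal length")
    # Average the Euclidean distance over all joints.
    distances = [math.dist(p, g) for p, g in zip(predicted, ground_truth)]
    return sum(distances) / len(distances)


if __name__ == "__main__":
    # Hypothetical two-joint skeleton, coordinates in millimetres.
    pred = [(0.0, 0.0, 0.0), (100.0, 0.0, 0.0)]
    truth = [(0.0, 0.0, 0.0), (100.0, 30.0, 40.0)]
    print(mpjpe(pred, truth))
```

Lower values are better, and comparing the metric on clean versus corrupted inputs is one simple way to quantify robustness to noise and occlusion.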

Critical Analysis

One potential limitation of SkelFormer is that it relies on accurate 2D keypoint detection as a first step. While the 2D keypoint extraction is performed by a separate model, any errors or inaccuracies in this stage could propagate through to the final 3D pose estimation. The researchers acknowledge this and suggest future work to improve the integration of the 2D and 3D components.

Additionally, the computational complexity of the transformer-based architecture may limit the real-time performance of SkelFormer, especially on resource-constrained devices. The authors mention that they are exploring ways to optimize the model for faster inference without sacrificing accuracy.

Despite these caveats, SkelFormer is a notable step forward for markerless 3D pose estimation. Separating 3D keypoint detection from the inverse-kinematic problem helps the system generalize to unseen, noisy data, and it shows strong performance relative to prior work on benchmark datasets.

Conclusion

In summary, the SkelFormer system introduced in this paper demonstrates a novel approach to markerless 3D pose and shape estimation using skeletal transformers. By combining advanced AI techniques with a deep understanding of human biomechanics, SkelFormer is able to accurately estimate the 3D pose and shape of a person's body from standard video footage, without the need for specialized motion capture equipment.

This approach has the potential to enable a wide range of applications in areas such as robotics, sports analytics, and animation and visual effects. As the research in this field continues to advance, we can expect to see even more innovative and practical applications of markerless 3D pose estimation technology.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

SkelFormer: Markerless 3D Pose and Shape Estimation using Skeletal Transformers

Vandad Davoodnia, Saeed Ghorbani, Alexandre Messier, Ali Etemad

We introduce SkelFormer, a novel markerless motion capture pipeline for multi-view human pose and shape estimation. Our method first uses off-the-shelf 2D keypoint estimators, pre-trained on large-scale in-the-wild data, to obtain 3D joint positions. Next, we design a regression-based inverse-kinematic skeletal transformer that maps the joint positions to pose and shape representations from heavily noisy observations. This module integrates prior knowledge about pose space and infers the full pose state at runtime. Separating the 3D keypoint detection and inverse-kinematic problems, along with the expressive representations learned by our skeletal transformer, enhance the generalization of our method to unseen noisy data. We evaluate our method on three public datasets in both in-distribution and out-of-distribution settings using three datasets, and observe strong performance with respect to prior works. Moreover, ablation experiments demonstrate the impact of each of the modules of our architecture. Finally, we study the performance of our method in dealing with noise and heavy occlusions and find considerable robustness with respect to other solutions.

4/22/2024

EgoPoseFormer: A Simple Baseline for Stereo Egocentric 3D Human Pose Estimation

Chenhongyi Yang, Anastasia Tkach, Shreyas Hampali, Linguang Zhang, Elliot J. Crowley, Cem Keskin

We present EgoPoseFormer, a simple yet effective transformer-based model for stereo egocentric human pose estimation. The main challenge in egocentric pose estimation is overcoming joint invisibility, which is caused by self-occlusion or a limited field of view (FOV) of head-mounted cameras. Our approach overcomes this challenge by incorporating a two-stage pose estimation paradigm: in the first stage, our model leverages the global information to estimate each joint's coarse location, then in the second stage, it employs a DETR style transformer to refine the coarse locations by exploiting fine-grained stereo visual features. In addition, we present a Deformable Stereo Attention operation to enable our transformer to effectively process multi-view features, which enables it to accurately localize each joint in the 3D world. We evaluate our method on the stereo UnrealEgo dataset and show it significantly outperforms previous approaches while being computationally efficient: it improves MPJPE by 27.4mm (45% improvement) with only 7.9% model parameters and 13.1% FLOPs compared to the state-of-the-art. Surprisingly, with proper training settings, we find that even our first-stage pose proposal network can achieve superior performance compared to previous arts. We also show that our method can be seamlessly extended to monocular settings, which achieves state-of-the-art performance on the SceneEgo dataset, improving MPJPE by 25.5mm (21% improvement) compared to the best existing method with only 60.7% model parameters and 36.4% FLOPs. Code is available at: https://github.com/ChenhongyiYang/egoposeformer .

8/16/2024

Unsupervised View-Invariant Human Posture Representation

Faegheh Sardari, Bjorn Ommer, Majid Mirmehdi

Most recent view-invariant action recognition and performance assessment approaches rely on a large amount of annotated 3D skeleton data to extract view-invariant features. However, acquiring 3D skeleton data can be cumbersome, if not impractical, in in-the-wild scenarios. To overcome this problem, we present a novel unsupervised approach that learns to extract view-invariant 3D human pose representation from a 2D image without using 3D joint data. Our model is trained by exploiting the intrinsic view-invariant properties of human pose between simultaneous frames from different viewpoints and their equivariant properties between augmented frames from the same viewpoint. We evaluate the learned view-invariant pose representations for two downstream tasks. We perform comparative experiments that show improvements on the state-of-the-art unsupervised cross-view action classification accuracy on NTU RGB+D by a significant margin, on both RGB and depth images. We also show the efficiency of transferring the learned representations from NTU RGB+D to obtain the first ever unsupervised cross-view and cross-subject rank correlation results on the multi-view human movement quality dataset, QMAR, and marginally improve on the-state-of-the-art supervised results for this dataset. We also carry out ablation studies to examine the contributions of the different components of our proposed network.

7/9/2024

Graph and Skipped Transformer: Exploiting Spatial and Temporal Modeling Capacities for Efficient 3D Human Pose Estimation

Mengmeng Cui, Kunbo Zhang, Zhenan Sun

In recent years, 2D-to-3D pose uplifting in monocular 3D Human Pose Estimation (HPE) has attracted widespread research interest. GNN-based methods and Transformer-based methods have become mainstream architectures due to their advanced spatial and temporal feature learning capacities. However, existing approaches typically construct joint-wise and frame-wise attention alignments in spatial and temporal domains, resulting in dense connections that introduce considerable local redundancy and computational overhead. In this paper, we take a global approach to exploit spatio-temporal information and realise efficient 3D HPE with a concise Graph and Skipped Transformer architecture. Specifically, in Spatial Encoding stage, coarse-grained body parts are deployed to construct Spatial Graph Network with a fully data-driven adaptive topology, ensuring model flexibility and generalizability across various poses. In Temporal Encoding and Decoding stages, a simple yet effective Skipped Transformer is proposed to capture long-range temporal dependencies and implement hierarchical feature aggregation. A straightforward Data Rolling strategy is also developed to introduce dynamic information into 2D pose sequence. Extensive experiments are conducted on Human3.6M, MPI-INF-3DHP and Human-Eva benchmarks. G-SFormer series methods achieve superior performances compared with previous state-of-the-arts with only around ten percent of parameters and significantly reduced computational complexity. Additionally, G-SFormer also exhibits outstanding robustness to inaccuracies in detected 2D poses.

7/4/2024