SMPLer: Taming Transformers for Monocular 3D Human Shape and Pose Estimation

2404.15276

Published 4/24/2024 by Xiangyu Xu, Lijuan Liu, Shuicheng Yan

Abstract

Existing Transformers for monocular 3D human shape and pose estimation typically have a quadratic computation and memory complexity with respect to the feature length, which hinders the exploitation of fine-grained information in high-resolution features that is beneficial for accurate reconstruction. In this work, we propose an SMPL-based Transformer framework (SMPLer) to address this issue. SMPLer incorporates two key ingredients: a decoupled attention operation and an SMPL-based target representation, which allow effective utilization of high-resolution features in the Transformer. In addition, based on these two designs, we also introduce several novel modules including a multi-scale attention and a joint-aware attention to further boost the reconstruction performance. Extensive experiments demonstrate the effectiveness of SMPLer against existing 3D human shape and pose estimation methods both quantitatively and qualitatively. Notably, the proposed algorithm achieves an MPJPE of 45.2 mm on the Human3.6M dataset, improving upon Mesh Graphormer by more than 10% with fewer than one-third of the parameters. Code and pretrained models are available at https://github.com/xuxy09/SMPLer.

Overview

This paper presents SMPLer, a new method for monocular 3D human shape and pose estimation that uses transformers.
The framework rests on a decoupled attention operation and an SMPL-based target representation, and adds multi-scale attention and joint-aware attention modules on top of them.
The method reports an MPJPE of 45.2 mm on Human3.6M, improving on Mesh Graphormer by more than 10% with fewer than one-third of the parameters.

Plain English Explanation

The paper introduces a new system called SMPLer that can take a single 2D image of a person and estimate their 3D body shape and pose. This is an important task in computer vision with applications in areas like animation, AR/VR, and human-computer interaction.

Transformers, a type of deep learning model that learns relationships between different parts of the input, have already been applied to this problem, but the authors point out that their computation and memory costs grow quadratically with the length of the image features, which makes fine-grained, high-resolution features expensive to use. SMPLer uses a transformer architecture designed to reduce that cost and to pay special attention to the relationships between different body parts, which helps it better model the complex 3D structure of the human form.

The authors also use a multi-scale approach, where the transformer processes the image at multiple levels of detail, allowing it to capture both high-level and fine-grained information. This combination of joint-aware attention and multi-scale processing enables SMPLer to outperform previous state-of-the-art methods on standard 3D human pose and shape estimation benchmarks.

Overall, this work demonstrates how transformers can be a powerful tool for tackling 3D human modeling tasks, and the authors' innovations around joint-aware attention and multi-scale processing provide a blueprint for how to effectively apply these models to this domain.

Technical Explanation

The core of the SMPLer model is a transformer-based architecture that takes a monocular RGB image as input and outputs the parameters of an SMPL [<a href="https://aimodels.fyi/papers/arxiv/skelformer-markerless-3d-pose-shape-estimation-using">1</a>] body model, representing the 3D shape and pose of the person in the image.
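To make the output format concrete, here is a minimal sketch of a container for SMPL parameters. It assumes the commonly used SMPL layout of 24 joint rotations in axis-angle form (72 pose values) and 10 shape coefficients; the names `SMPLParams`, `joint_rotation`, and `rest_pose` are hypothetical and are not taken from the SMPLer code base.

```python
from dataclasses import dataclass
from typing import List

NUM_SMPL_JOINTS = 24             # 23 body joints plus the global root orientation
POSE_DIM = NUM_SMPL_JOINTS * 3   # one axis-angle rotation (3 values) per joint
SHAPE_DIM = 10                   # shape coefficients commonly used with SMPL


@dataclass
class SMPLParams:
    """Hypothetical container for one person's SMPL parameters."""

    pose: List[float]   # flattened axis-angle rotations, length POSE_DIM
    shape: List[float]  # shape coefficients, length SHAPE_DIM

    def __post_init__(self) -> None:
        # Reject vectors that do not match the assumed SMPL layout.
        if len(self.pose) != POSE_DIM:
            raise ValueError(f"expected {POSE_DIM} pose values, got {len(self.pose)}")
        if len(self.shape) != SHAPE_DIM:
            raise ValueError(f"expected {SHAPE_DIM} shape values, got {len(self.shape)}")

    def joint_rotation(self, joint_index: int) -> List[float]:
        """Return the 3-value axis-angle rotation of a single joint."""
        if not 0 <= joint_index < NUM_SMPL_JOINTS:
            raise IndexError(f"joint index {joint_index} is out of range")
        start = joint_index * 3
        return self.pose[start:start + 3]


def rest_pose() -> SMPLParams:
    """All-zero parameters: the template pose with the mean body shape."""
    return SMPLParams(pose=[0.0] * POSE_DIM, shape=[0.0] * SHAPE_DIM)
```

A network that regresses these 82 numbers predicts far fewer values than one that regresses every mesh vertex, which is one reason a parametric target can be attractive.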

The key innovations in the SMPLer architecture are:

Joint-Aware Attention: The transformer uses a specialized attention mechanism that explicitly models the relationships between different body joints, rather than treating the image as a generic 2D grid. This joint-aware attention helps the model better capture the complex 3D structure of the human body.
Multi-Scale Processing: SMPLer processes the input image at multiple scales using a cascade of transformer blocks. This allows the model to capture both high-level and fine-grained visual features, further enhancing its ability to estimate accurate 3D shape and pose.
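The two ideas above can be illustrated with a deliberately simplified sketch in plain Python. It assumes that each body joint is represented by one query vector that attends over image feature tokens, and that results from several feature resolutions are simply averaged. Both assumptions are illustrative only: the paper's actual modules use learned projections and their own fusion scheme, and the function names here are hypothetical.

```python
import math
from typing import List

Vector = List[float]


def softmax(scores: Vector) -> Vector:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a: Vector, b: Vector) -> float:
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def cross_attention(queries: List[Vector], features: List[Vector]) -> List[Vector]:
    """Each query attends over all feature tokens (used as both keys and values)."""
    dim = len(queries[0])
    scale = 1.0 / math.sqrt(dim)
    outputs = []
    for q in queries:
        weights = softmax([dot(q, f) * scale for f in features])
        # Weighted average of the feature tokens for this query.
        outputs.append(
            [sum(w * f[d] for w, f in zip(weights, features)) for d in range(dim)]
        )
    return outputs


def multi_scale_joint_attention(
    joint_queries: List[Vector], feature_scales: List[List[Vector]]
) -> List[Vector]:
    """Attend to each feature scale separately, then average per joint (an assumption)."""
    per_scale = [cross_attention(joint_queries, feats) for feats in feature_scales]
    dim = len(joint_queries[0])
    return [
        [sum(out[j][d] for out in per_scale) / len(per_scale) for d in range(dim)]
        for j in range(len(joint_queries))
    ]
```

Because the queries are per-joint vectors rather than image tokens, the number of attention scores grows with the number of joints times the number of feature tokens at each scale.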

The authors demonstrate the effectiveness of SMPLer through extensive experiments on standard benchmarks for 3D human pose and shape estimation, including Human3.6M (see also [<a href="https://aimodels.fyi/papers/arxiv/mansformer-efficient-transformer-mixed-attention-image-deblurring">2</a>] and [<a href="https://aimodels.fyi/papers/arxiv/ktpformer-kinematics-trajectory-prior-knowledge-enhanced-transformer">3</a>] for other transformer-based work). SMPLer outperforms previous state-of-the-art methods, showing the advantages of the transformer-based approach and the authors' novel architectural choices.
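The abstract reports accuracy as MPJPE (mean per-joint position error), which is the average Euclidean distance between predicted and ground-truth 3D joints. The sketch below shows that calculation. It assumes both joint sets are already expressed in the same coordinate frame (the metric is commonly computed after aligning the root joint), and the function name is hypothetical.

```python
import math
from typing import List, Sequence

Point3D = Sequence[float]


def mpjpe(predicted: List[Point3D], ground_truth: List[Point3D]) -> float:
    """Mean Euclidean distance between corresponding joints, in the input unit."""
    if len(predicted) != len(ground_truth) or not predicted:
        raise ValueError("inputs must be non-empty and contain the same number of joints")
    distances = [math.dist(p, g) for p, g in zip(predicted, ground_truth)]
    return sum(distances) / len(distances)


# Two joints: one predicted exactly, one off by 5 units.
example_error = mpjpe([(0.0, 0.0, 0.0), (3.0, 4.0, 0.0)],
                      [(0.0, 0.0, 0.0), (0.0, 0.0, 0.0)])
print(example_error)  # 2.5
```

If the joints are given in millimetres, the result is directly comparable in unit to the 45.2 mm figure quoted in the abstract.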

Critical Analysis

The authors acknowledge several limitations of their work. First, SMPLer is currently only designed for single-person images and may not generalize well to more complex scenes with multiple people. Additionally, the model's performance may degrade in the presence of severe occlusions or challenging viewing angles.

Another potential concern is the computational cost of the multi-scale transformer architecture, which could limit its deployment on resource-constrained devices. The authors mention that they have taken steps to optimize the model's efficiency, but further work may be needed to make it more practical for real-world applications.
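A rough count of attention scores shows why cost matters at high resolution. The abstract names a decoupled attention operation without defining it in this summary, so the sketch assumes one plausible reading: instead of running self-attention over a concatenated sequence of target tokens and feature tokens, the targets attend to the features and to each other separately. The token counts (24 targets, 7x7 and 56x56 feature maps) are illustrative, not values from the paper.

```python
def feature_tokens(height: int, width: int) -> int:
    """Number of tokens obtained by flattening a feature map."""
    return height * width


def full_attention_pairs(num_targets: int, num_features: int) -> int:
    """Score entries for self-attention over the concatenated sequence."""
    length = num_targets + num_features
    return length * length


def decoupled_attention_pairs(num_targets: int, num_features: int) -> int:
    """Score entries for target-to-feature plus target-to-target attention."""
    return num_targets * num_features + num_targets * num_targets


if __name__ == "__main__":
    targets = 24  # illustrative number of target tokens
    for side in (7, 56):
        n = feature_tokens(side, side)
        print(side, full_attention_pairs(targets, n), decoupled_attention_pairs(targets, n))
```

Under these assumptions, a 56x56 feature map yields 9,985,600 scores for full attention but 75,840 for the decoupled form, which grows linearly rather than quadratically in the number of feature tokens.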

It would also be interesting to see how SMPLer compares to other recent transformer-based approaches to related 3D vision and human motion tasks, such as [<a href="https://aimodels.fyi/papers/arxiv/sgformer-spherical-geometry-transformer-360-depth-estimation">4</a>] and [<a href="https://aimodels.fyi/papers/arxiv/robust-human-motion-forecasting-using-transformer-based">5</a>]. Exploring the relative strengths and weaknesses of different transformer architectures and attention mechanisms could lead to further advancements in this field.

Conclusion

The SMPLer model presented in this paper demonstrates the power of transformers for monocular 3D human shape and pose estimation. By incorporating joint-aware attention and multi-scale processing, the authors have created a state-of-the-art system that outperforms previous approaches.

While the current version of SMPLer has some limitations, the authors' innovations around transformer-based 3D human modeling open up exciting new directions for future research. As transformer models continue to evolve and become more efficient, we can expect to see even more advanced and practical solutions for this important computer vision task, with potential applications in a wide range of industries and domains.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

SkelFormer: Markerless 3D Pose and Shape Estimation using Skeletal Transformers

Vandad Davoodnia, Saeed Ghorbani, Alexandre Messier, Ali Etemad

We introduce SkelFormer, a novel markerless motion capture pipeline for multi-view human pose and shape estimation. Our method first uses off-the-shelf 2D keypoint estimators, pre-trained on large-scale in-the-wild data, to obtain 3D joint positions. Next, we design a regression-based inverse-kinematic skeletal transformer that maps the joint positions to pose and shape representations from heavily noisy observations. This module integrates prior knowledge about pose space and infers the full pose state at runtime. Separating the 3D keypoint detection and inverse-kinematic problems, along with the expressive representations learned by our skeletal transformer, enhances the generalization of our method to unseen noisy data. We evaluate our method on three public datasets in both in-distribution and out-of-distribution settings, and observe strong performance with respect to prior works. Moreover, ablation experiments demonstrate the impact of each of the modules of our architecture. Finally, we study the performance of our method in dealing with noise and heavy occlusions and find considerable robustness with respect to other solutions.

4/22/2024

cs.CV

SMPLX-Lite: A Realistic and Drivable Avatar Benchmark with Rich Geometry and Texture Annotations

Yujiao Jiang, Qingmin Liao, Zhaolong Wang, Xiangru Lin, Zongqing Lu, Yuxi Zhao, Hanqing Wei, Jingrui Ye, Yu Zhang, Zhijing Shao

Recovering photorealistic and drivable full-body avatars is crucial for numerous applications, including virtual reality, 3D games, and tele-presence. Most methods, whether reconstruction or generation, require large numbers of human motion sequences and corresponding textured meshes. To easily learn a drivable avatar, a reasonable parametric body model with unified topology is paramount. However, existing human body datasets either have images or textured models and lack parametric models which fit clothes well. We propose a new parametric model SMPLX-Lite-D, which can fit detailed geometry of the scanned mesh while maintaining stable geometry in the face, hand and foot regions. We present SMPLX-Lite dataset, the most comprehensive clothing avatar dataset with multi-view RGB sequences, keypoints annotations, textured scanned meshes, and textured SMPLX-Lite-D models. With the SMPLX-Lite dataset, we train a conditional variational autoencoder model that takes human pose and facial keypoints as input, and generates a photorealistic drivable human avatar.

5/31/2024

cs.CV cs.GR

Champ: Controllable and Consistent Human Image Animation with 3D Parametric Guidance

Shenhao Zhu, Junming Leo Chen, Zuozhuo Dai, Qingkun Su, Yinghui Xu, Xun Cao, Yao Yao, Hao Zhu, Siyu Zhu

In this study, we introduce a methodology for human image animation by leveraging a 3D human parametric model within a latent diffusion framework to enhance shape alignment and motion guidance in current human generative techniques. The methodology utilizes the SMPL (Skinned Multi-Person Linear) model as the 3D human parametric model to establish a unified representation of body shape and pose. This facilitates the accurate capture of intricate human geometry and motion characteristics from source videos. Specifically, we incorporate rendered depth images, normal maps, and semantic maps obtained from SMPL sequences, alongside skeleton-based motion guidance, to enrich the conditions to the latent diffusion model with comprehensive 3D shape and detailed pose attributes. A multi-layer motion fusion module, integrating self-attention mechanisms, is employed to fuse the shape and motion latent representations in the spatial domain. By representing the 3D human parametric model as the motion guidance, we can perform parametric shape alignment of the human body between the reference image and the source video motion. Experimental evaluations conducted on benchmark datasets demonstrate the methodology's superior ability to generate high-quality human animations that accurately capture both pose and shape variations. Furthermore, our approach also exhibits superior generalization capabilities on the proposed in-the-wild dataset. Project page: https://fudan-generative-vision.github.io/champ.

6/4/2024

cs.CV

Multiple Prior Representation Learning for Self-Supervised Monocular Depth Estimation via Hybrid Transformer

Guodong Sun, Junjie Liu, Mingxuan Liu, Moyun Liu, Yang Zhang

Self-supervised monocular depth estimation aims to infer depth information without relying on labeled data. However, the lack of labeled information poses a significant challenge to the model's representation, limiting its ability to capture the intricate details of the scene accurately. Prior information can potentially mitigate this issue, enhancing the model's understanding of scene structure and texture. Nevertheless, solely relying on a single type of prior information often falls short when dealing with complex scenes, necessitating improvements in generalization performance. To address these challenges, we introduce a novel self-supervised monocular depth estimation model that leverages multiple priors to bolster representation capabilities across spatial, context, and semantic dimensions. Specifically, we employ a hybrid transformer and a lightweight pose network to obtain long-range spatial priors in the spatial dimension. Then, the context prior attention is designed to improve generalization, particularly in complex structures or untextured areas. In addition, semantic priors are introduced by leveraging semantic boundary loss, and semantic prior attention is supplemented, further refining the semantic features extracted by the decoder. Experiments on three diverse datasets demonstrate the effectiveness of the proposed model. It integrates multiple priors to comprehensively enhance the representation ability, improving the accuracy and reliability of depth estimation. Codes are available at: https://github.com/MVME-HBUT/MPRLNet

6/14/2024

cs.CV eess.IV