Multi-HMR: Multi-Person Whole-Body Human Mesh Recovery in a Single Shot

Read original: arXiv:2402.14654 - Published 7/25/2024 by Fabien Baradel, Matthieu Armando, Salma Galaaoui, Romain Br'egier, Philippe Weinzaepfel, Gr'egory Rogez, Thomas Lucas

📉

Overview

Multi-HMR is a single-shot model for recovering 3D human meshes from a single RGB image.
It can predict the whole-body pose, shape, and 3D location of multiple people in the image.
The model uses a Vision Transformer (ViT) backbone and a new cross-attention module called the Human Prediction Head (HPH) to make these predictions.
The model is trained on a new dataset called CUFFS, which contains close-up frames of full-body subjects with diverse hand poses.
Multi-HMR can optionally use camera intrinsics to improve performance.

Plain English Explanation

Multi-HMR is a powerful computer vision model that can take a single photograph and determine the 3D positions and poses of all the people in the image. This includes not just the overall body shape and pose, but also fine details like the hands and facial expressions.

The model works by first detecting where the people are in the image, using a Vision Transformer (a type of deep learning architecture). It then uses a new module called the Human Prediction Head (HPH) to analyze the features around each detected person and predict their full 3D body, including hands and face.

One challenge is that existing datasets don't have enough examples of people with diverse hand poses. To address this, the researchers created a new dataset called CUFFS, which contains close-up images of people's full bodies in different poses. Incorporating this dataset into the training process helps the model make more accurate predictions, especially for the hands.

The model can also optionally use information about the camera that took the photo, such as the lens properties, to further improve its 3D predictions. Overall, this single-shot model is able to quickly and accurately recover the full 3D body, hands, and face of multiple people in a single image.

Technical Explanation

Multi-HMR is a deep learning model for multi-person 3D human mesh recovery from a single RGB image. The model is designed to predict the whole-body pose, shape, and 3D location of people in the image, including their hands and facial expressions.

The model uses a Vision Transformer (ViT) as the backbone to extract visual features from the input image. It then employs a new module called the Human Prediction Head (HPH), which uses cross-attention to make predictions for each detected person. The HPH attends to the entire set of features for each person, allowing it to capture their full-body pose and shape.

To address the challenge of modeling fine-grained hand and facial poses, the researchers introduced a new dataset called CUFFS (Close-Up Frames of Full-Body Subjects). CUFFS contains images of people's full bodies, including their hands, captured at close range. Incorporating this dataset into the training process helps the model make more accurate predictions, especially for the hands.

The model can also optionally take into account the camera intrinsics, if available, by encoding the camera ray directions for each image token. This additional information can further improve the 3D predictions.

The key innovation in Multi-HMR is its ability to perform accurate, single-shot 3D human mesh recovery for multiple people in an image, including their hands and facial expressions, using a simple and efficient design.

Critical Analysis

The researchers acknowledge that directly predicting fine-grained hand and facial poses in a single shot is a challenging task, given the limited data available in existing datasets. The introduction of the CUFFS dataset is a valuable contribution to address this limitation, and the results show that it helps improve the model's performance, particularly for the hands.

However, the paper does not provide a detailed analysis of the model's performance on different body parts or discuss any potential biases or limitations in the dataset. It would be interesting to see how the model handles diverse body types, poses, and occlusions, and whether there are any systematic errors or failure cases.

Additionally, while the model can optionally use camera intrinsics, the paper does not explore the impact of this feature in depth or discuss the practical implications of requiring this information in real-world applications.

Overall, Multi-HMR presents a promising approach to whole-body 3D human mesh recovery, but further research and evaluation could provide deeper insights into the model's strengths, weaknesses, and potential use cases.

Conclusion

Multi-HMR is a powerful single-shot model for 3D human mesh recovery that can predict the whole-body pose, shape, and location of multiple people in a single image. By using a Vision Transformer backbone and a novel cross-attention module, as well as leveraging the new CUFFS dataset, the model is able to make accurate predictions, including for the hands and facial expressions.

This technology has exciting applications in areas like computer animation, virtual reality, and human-computer interaction, where accurately capturing the 3D structure and motion of the human body is crucial. As the field of 3D human pose estimation continues to advance, models like Multi-HMR will play an increasingly important role in enabling more natural and intuitive user experiences.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Follow @aimodelsfyi on 𝕏 →

Related Papers

📉

Multi-HMR: Multi-Person Whole-Body Human Mesh Recovery in a Single Shot

Fabien Baradel, Matthieu Armando, Salma Galaaoui, Romain Br'egier, Philippe Weinzaepfel, Gr'egory Rogez, Thomas Lucas

We present Multi-HMR, a strong sigle-shot model for multi-person 3D human mesh recovery from a single RGB image. Predictions encompass the whole body, i.e., including hands and facial expressions, using the SMPL-X parametric model and 3D location in the camera coordinate system. Our model detects people by predicting coarse 2D heatmaps of person locations, using features produced by a standard Vision Transformer (ViT) backbone. It then predicts their whole-body pose, shape and 3D location using a new cross-attention module called the Human Prediction Head (HPH), with one query attending to the entire set of features for each detected person. As direct prediction of fine-grained hands and facial poses in a single shot, i.e., without relying on explicit crops around body parts, is hard to learn from existing data, we introduce CUFFS, the Close-Up Frames of Full-Body Subjects dataset, containing humans close to the camera with diverse hand poses. We show that incorporating it into the training data further enhances predictions, particularly for hands. Multi-HMR also optionally accounts for camera intrinsics, if available, by encoding camera ray directions for each image token. This simple design achieves strong performance on whole-body and body-only benchmarks simultaneously: a ViT-S backbone on $448{times}448$ images already yields a fast and competitive model, while larger models and higher resolutions obtain state-of-the-art results.

7/25/2024

Human Mesh Recovery from Arbitrary Multi-view Images

Xiaoben Li, Mancheng Meng, Ziyan Wu, Terrence Chen, Fan Yang, Dinggang Shen

Human mesh recovery from arbitrary multi-view images involves two characteristics: the arbitrary camera poses and arbitrary number of camera views. Because of the variability, designing a unified framework to tackle this task is challenging. The challenges can be summarized as the dilemma of being able to simultaneously estimate arbitrary camera poses and recover human mesh from arbitrary multi-view images while maintaining flexibility. To solve this dilemma, we propose a divide and conquer framework for Unified Human Mesh Recovery (U-HMR) from arbitrary multi-view images. In particular, U-HMR consists of a decoupled structure and two main components: camera and body decoupling (CBD), camera pose estimation (CPE), and arbitrary view fusion (AVF). As camera poses and human body mesh are independent of each other, CBD splits the estimation of them into two sub-tasks for two individual sub-networks (ie, CPE and AVF) to handle respectively, thus the two sub-tasks are disentangled. In CPE, since each camera pose is unrelated to the others, we adopt a shared MLP to process all views in a parallel way. In AVF, in order to fuse multi-view information and make the fusion operation independent of the number of views, we introduce a transformer decoder with a SMPL parameters query token to extract cross-view features for mesh recovery. To demonstrate the efficacy and flexibility of the proposed framework and effect of each component, we conduct extensive experiments on three public datasets: Human3.6M, MPI-INF-3DHP, and TotalCapture.

6/18/2024

MEGA: Masked Generative Autoencoder for Human Mesh Recovery

Gu'enol'e Fiche, Simon Leglaive, Xavier Alameda-Pineda, Francesc Moreno-Noguer

Human Mesh Recovery (HMR) from a single RGB image is a highly ambiguous problem, as similar 2D projections can correspond to multiple 3D interpretations. Nevertheless, most HMR methods overlook this ambiguity and make a single prediction without accounting for the associated uncertainty. A few approaches generate a distribution of human meshes, enabling the sampling of multiple predictions; however, none of them is competitive with the latest single-output model when making a single prediction. This work proposes a new approach based on masked generative modeling. By tokenizing the human pose and shape, we formulate the HMR task as generating a sequence of discrete tokens conditioned on an input image. We introduce MEGA, a MaskEd Generative Autoencoder trained to recover human meshes from images and partial human mesh token sequences. Given an image, our flexible generation scheme allows us to predict a single human mesh in deterministic mode or to generate multiple human meshes in stochastic mode. MEGA enables us to propose multiple outputs and to evaluate the uncertainty of the predictions. Experiments on in-the-wild benchmarks show that MEGA achieves state-of-the-art performance in deterministic and stochastic modes, outperforming single-output and multi-output approaches.

6/3/2024

New!PSHuman: Photorealistic Single-view Human Reconstruction using Cross-Scale Diffusion

Peng Li, Wangguandong Zheng, Yuan Liu, Tao Yu, Yangguang Li, Xingqun Qi, Mengfei Li, Xiaowei Chi, Siyu Xia, Wei Xue, Wenhan Luo, Qifeng Liu, Yike Guo

Detailed and photorealistic 3D human modeling is essential for various applications and has seen tremendous progress. However, full-body reconstruction from a monocular RGB image remains challenging due to the ill-posed nature of the problem and sophisticated clothing topology with self-occlusions. In this paper, we propose PSHuman, a novel framework that explicitly reconstructs human meshes utilizing priors from the multiview diffusion model. It is found that directly applying multiview diffusion on single-view human images leads to severe geometric distortions, especially on generated faces. To address it, we propose a cross-scale diffusion that models the joint probability distribution of global full-body shape and local facial characteristics, enabling detailed and identity-preserved novel-view generation without any geometric distortion. Moreover, to enhance cross-view body shape consistency of varied human poses, we condition the generative model on parametric models like SMPL-X, which provide body priors and prevent unnatural views inconsistent with human anatomy. Leveraging the generated multi-view normal and color images, we present SMPLX-initialized explicit human carving to recover realistic textured human meshes efficiently. Extensive experimental results and quantitative evaluations on CAPE and THuman2.1 datasets demonstrate PSHumans superiority in geometry details, texture fidelity, and generalization capability.

9/17/2024