MeshPose: Unifying DensePose and 3D Body Mesh reconstruction

Read original: arXiv:2406.10180 - Published 6/17/2024 by Eric-Tuan L^e, Antonis Kakolyris, Petros Koutras, Himmy Tam, Efstratios Skordos, George Papandreou, R{i}za Alp Guler, Iasonas Kokkinos
Total Score

0

MeshPose: Unifying DensePose and 3D Body Mesh reconstruction

Sign in to get full access

or

If you already have an account, we'll log you in

Overview

  • Presents a novel method called MeshPose that unifies DensePose and 3D body mesh reconstruction
  • Aims to recover accurate 3D human body shape and pose from a single image
  • Leverages both 2D surface correspondences and 3D shape constraints to achieve robust performance

Plain English Explanation

MeshPose is a new technique that combines two important tasks in computer vision: DensePose and 3D body mesh reconstruction. DensePose involves mapping 2D image pixels to corresponding points on a 3D human body model, while 3D body mesh reconstruction generates a complete 3D mesh representation of the human body.

By unifying these two capabilities, MeshPose can recover an accurate 3D model of a person's body shape and pose from a single 2D image. This is achieved by using both the 2D surface correspondences identified by DensePose as well as constraints on the 3D shape of the body. The researchers show that this combined approach leads to more robust and reliable 3D human body reconstruction compared to previous methods that tackled these problems independently.

The ability to accurately reconstruct 3D human body shape and pose from single images has many potential applications, such as human-computer interaction, motion capture, and virtual try-on. MeshPose represents an important step forward in unifying these key computer vision tasks to enable more realistic and practical 3D human modeling from simple 2D images.

Technical Explanation

The core of the MeshPose approach is to leverage both 2D surface correspondences from DensePose and 3D shape constraints to jointly optimize for an accurate 3D body mesh reconstruction. The method takes a single 2D image as input and produces a 3D mesh representation of the person's body, including their pose and shape.

Specifically, MeshPose first uses a convolutional neural network to predict a DensePose map, which establishes correspondences between 2D image pixels and a template 3D body model. It then uses another network to regress the 3D mesh vertices and joint locations directly from the image. By combining these 2D and 3D cues, MeshPose can resolve ambiguities and achieve more robust 3D reconstruction compared to prior methods that relied on one type of information alone, such as TokenHMR or HiPose.

The researchers evaluate MeshPose on standard benchmarks for 3D human pose and shape estimation, demonstrating state-of-the-art performance. They also show that the unified architecture leads to improvements over sequential pipelines that first predict 2D correspondences and then infer 3D, as in Synergistic.

Critical Analysis

The MeshPose paper presents a well-designed and effective approach for jointly recovering 3D human body shape and pose from 2D images. By combining 2D surface correspondences and 3D shape cues, the method is able to achieve more robust and accurate 3D reconstruction compared to previous techniques.

However, the paper does not extensively discuss potential limitations or failure cases of the MeshPose approach. For example, it is not clear how the method would perform on challenging scenarios like occluded bodies, unusual poses, or diverse clothing and accessories. Additionally, the computational efficiency of the unified architecture is not compared to more sequential pipelines.

Further research could explore ways to make MeshPose more generalizable and efficient, perhaps by incorporating probabilistic approaches or leveraging additional cues beyond just 2D and 3D. Nonetheless, the core idea of unifying 2D and 3D modeling for human body reconstruction is a significant contribution that could have wide-ranging applications in computer vision and related fields.

Conclusion

The MeshPose paper presents a novel method that unifies the tasks of DensePose and 3D body mesh reconstruction, enabling accurate 3D human shape and pose recovery from single 2D images. By combining 2D surface correspondences and 3D shape constraints, the approach achieves state-of-the-art performance on standard benchmarks.

This work represents an important step forward in enabling robust and practical 3D human modeling from simple 2D inputs, with potential applications in areas like human-computer interaction, motion capture, and virtual try-on. While the paper does not extensively discuss limitations, the core idea of jointly leveraging 2D and 3D cues is a significant contribution that could inspire further research and development in this active area of computer vision.



This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Follow @aimodelsfyi on š• ā†’

Related Papers

MeshPose: Unifying DensePose and 3D Body Mesh reconstruction
Total Score

0

MeshPose: Unifying DensePose and 3D Body Mesh reconstruction

Eric-Tuan L^e, Antonis Kakolyris, Petros Koutras, Himmy Tam, Efstratios Skordos, George Papandreou, R{i}za Alp Guler, Iasonas Kokkinos

DensePose provides a pixel-accurate association of images with 3D mesh coordinates, but does not provide a 3D mesh, while Human Mesh Reconstruction (HMR) systems have high 2D reprojection error, as measured by DensePose localization metrics. In this work we introduce MeshPose to jointly tackle DensePose and HMR. For this we first introduce new losses that allow us to use weak DensePose supervision to accurately localize in 2D a subset of the mesh vertices ('VertexPose'). We then lift these vertices to 3D, yielding a low-poly body mesh ('MeshPose'). Our system is trained in an end-to-end manner and is the first HMR method to attain competitive DensePose accuracy, while also being lightweight and amenable to efficient inference, making it suitable for real-time AR applications.

Read more

6/17/2024

Deep learning for 3D human pose estimation and mesh recovery: A survey
Total Score

0

Deep learning for 3D human pose estimation and mesh recovery: A survey

Yang Liu, Changzhen Qiu, Zhiyong Zhang

3D human pose estimation and mesh recovery have attracted widespread research interest in many areas, such as computer vision, autonomous driving, and robotics. Deep learning on 3D human pose estimation and mesh recovery has recently thrived, with numerous methods proposed to address different problems in this area. In this paper, to stimulate future research, we present a comprehensive review of recent progress over the past five years in deep learning methods for this area by delving into over 200 references. To the best of our knowledge, this survey is arguably the first to comprehensively cover deep learning methods for 3D human pose estimation, including both single-person and multi-person approaches, as well as human mesh recovery, encompassing methods based on explicit models and implicit representations. We also present comparative results on several publicly available datasets, together with insightful observations and inspiring future research directions. A regularly updated project page can be found at https://github.com/liuyangme/SOTA-3DHPE-HMR.

Read more

7/4/2024

šŸ“‰

Total Score

0

Multi-HMR: Multi-Person Whole-Body Human Mesh Recovery in a Single Shot

Fabien Baradel, Matthieu Armando, Salma Galaaoui, Romain Br'egier, Philippe Weinzaepfel, Gr'egory Rogez, Thomas Lucas

We present Multi-HMR, a strong sigle-shot model for multi-person 3D human mesh recovery from a single RGB image. Predictions encompass the whole body, i.e., including hands and facial expressions, using the SMPL-X parametric model and 3D location in the camera coordinate system. Our model detects people by predicting coarse 2D heatmaps of person locations, using features produced by a standard Vision Transformer (ViT) backbone. It then predicts their whole-body pose, shape and 3D location using a new cross-attention module called the Human Prediction Head (HPH), with one query attending to the entire set of features for each detected person. As direct prediction of fine-grained hands and facial poses in a single shot, i.e., without relying on explicit crops around body parts, is hard to learn from existing data, we introduce CUFFS, the Close-Up Frames of Full-Body Subjects dataset, containing humans close to the camera with diverse hand poses. We show that incorporating it into the training data further enhances predictions, particularly for hands. Multi-HMR also optionally accounts for camera intrinsics, if available, by encoding camera ray directions for each image token. This simple design achieves strong performance on whole-body and body-only benchmarks simultaneously: a ViT-S backbone on $448{times}448$ images already yields a fast and competitive model, while larger models and higher resolutions obtain state-of-the-art results.

Read more

7/25/2024

Human Mesh Recovery from Arbitrary Multi-view Images
Total Score

0

Human Mesh Recovery from Arbitrary Multi-view Images

Xiaoben Li, Mancheng Meng, Ziyan Wu, Terrence Chen, Fan Yang, Dinggang Shen

Human mesh recovery from arbitrary multi-view images involves two characteristics: the arbitrary camera poses and arbitrary number of camera views. Because of the variability, designing a unified framework to tackle this task is challenging. The challenges can be summarized as the dilemma of being able to simultaneously estimate arbitrary camera poses and recover human mesh from arbitrary multi-view images while maintaining flexibility. To solve this dilemma, we propose a divide and conquer framework for Unified Human Mesh Recovery (U-HMR) from arbitrary multi-view images. In particular, U-HMR consists of a decoupled structure and two main components: camera and body decoupling (CBD), camera pose estimation (CPE), and arbitrary view fusion (AVF). As camera poses and human body mesh are independent of each other, CBD splits the estimation of them into two sub-tasks for two individual sub-networks (ie, CPE and AVF) to handle respectively, thus the two sub-tasks are disentangled. In CPE, since each camera pose is unrelated to the others, we adopt a shared MLP to process all views in a parallel way. In AVF, in order to fuse multi-view information and make the fusion operation independent of the number of views, we introduce a transformer decoder with a SMPL parameters query token to extract cross-view features for mesh recovery. To demonstrate the efficacy and flexibility of the proposed framework and effect of each component, we conduct extensive experiments on three public datasets: Human3.6M, MPI-INF-3DHP, and TotalCapture.

Read more

6/18/2024