MEGA: Masked Generative Autoencoder for Human Mesh Recovery

Read original: arXiv:2405.18839 - Published 6/3/2024 by Gu'enol'e Fiche, Simon Leglaive, Xavier Alameda-Pineda, Francesc Moreno-Noguer

MEGA: Masked Generative Autoencoder for Human Mesh Recovery

Overview

This paper proposes a new method called MEGA (Masked Generative Autoencoder) for recovering 3D human mesh models from arbitrary multi-view images.
MEGA uses a masked autoencoder architecture to take in a partially occluded or cropped human image and output a complete 3D mesh representation of the full human body.
The method leverages prior knowledge about human body structure and appearance to fill in missing information and produce high-quality mesh reconstructions.

Plain English Explanation

MEGA: Masked Generative Autoencoder for Human Mesh Recovery is a new AI system that can create 3D models of human bodies from regular 2D photos or videos. Even if parts of the person are hidden or the image is cropped, the system can still generate a complete 3D mesh representation of the full body.

The key idea is to use a special type of neural network called a "masked autoencoder." This network is trained on a large dataset of 3D human body scans and 2D images. It learns to recognize patterns in human appearance and structure, so that when it sees a new partial image, it can intelligently fill in the missing information to reconstruct the full 3D shape.

This is a significant advance over previous methods that required multiple camera views or could only handle unoccluded frontal poses. MEGA can work with arbitrary camera angles and partial occlusions, making it much more practical for real-world applications like animation, virtual try-on, and healthcare.

The MEGA system is also efficient, fast, and generalizes well to new data, thanks to its innovative neural network architecture and training approach. Overall, this research represents an important step forward in the field of 3D human body modeling from 2D imagery.

Technical Explanation

MEGA uses a masked autoencoder architecture, where the input image is partially masked, and the network must learn to predict the missing parts. This forces the model to capture the underlying structure and appearance of the human body, rather than just memorizing specific image-to-mesh mappings.

The encoder network first processes the input image into a compact latent representation. A masking module then selectively drops out parts of this representation, simulating occlusions. The decoder then takes the masked latent code and predicts the full 3D mesh of the human body, as well as other outputs like body parameters and camera viewpoint.

Key innovations include the use of a hybrid mesh representation that combines a template mesh with per-vertex displacement offsets, a Gaussian head model for realistic facial reconstructions, and a pipeline that can handle both static images and video sequences.

The MEGA model is trained on a large dataset of 3D scans and corresponding 2D images, using a combination of reconstruction, adversarial, and other losses to ensure high-fidelity mesh predictions.

Critical Analysis

The authors acknowledge several limitations of their approach. First, the method currently assumes a single person in the input image and does not handle occlusions between people. Extensions to multi-person scenarios would be an important area for future work.

Additionally, while the paper demonstrates impressive results on benchmark datasets, the real-world performance may be affected by factors like image quality, lighting, clothing, and background clutter that were not fully explored in the experiments.

Some readers may also wonder about the ethical implications of such powerful 3D body modeling technology, particularly around privacy, consent, and potential misuse. The authors do not delve into these societal considerations in depth.

Overall, the MEGA method represents a significant technical advancement, but further research is needed to enhance its robustness and address broader implications.

Conclusion

In summary, the MEGA paper introduces a novel masked autoencoder approach for recovering detailed 3D human mesh models from partial or occluded 2D images. This work advances the state-of-the-art in human body reconstruction and has numerous applications in areas like computer graphics, virtual try-on, and healthcare.

While the technical achievements are impressive, the authors acknowledge limitations around multi-person handling and the need to consider real-world deployment challenges and ethical concerns. Nevertheless, this research represents an important step forward in the field of 3D human modeling from 2D data, with the potential to enable more immersive and personalized digital experiences.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Follow @aimodelsfyi on 𝕏 →

Related Papers

MEGA: Masked Generative Autoencoder for Human Mesh Recovery

Gu'enol'e Fiche, Simon Leglaive, Xavier Alameda-Pineda, Francesc Moreno-Noguer

Human Mesh Recovery (HMR) from a single RGB image is a highly ambiguous problem, as similar 2D projections can correspond to multiple 3D interpretations. Nevertheless, most HMR methods overlook this ambiguity and make a single prediction without accounting for the associated uncertainty. A few approaches generate a distribution of human meshes, enabling the sampling of multiple predictions; however, none of them is competitive with the latest single-output model when making a single prediction. This work proposes a new approach based on masked generative modeling. By tokenizing the human pose and shape, we formulate the HMR task as generating a sequence of discrete tokens conditioned on an input image. We introduce MEGA, a MaskEd Generative Autoencoder trained to recover human meshes from images and partial human mesh token sequences. Given an image, our flexible generation scheme allows us to predict a single human mesh in deterministic mode or to generate multiple human meshes in stochastic mode. MEGA enables us to propose multiple outputs and to evaluate the uncertainty of the predictions. Experiments on in-the-wild benchmarks show that MEGA achieves state-of-the-art performance in deterministic and stochastic modes, outperforming single-output and multi-output approaches.

6/3/2024

📉

Multi-HMR: Multi-Person Whole-Body Human Mesh Recovery in a Single Shot

Fabien Baradel, Matthieu Armando, Salma Galaaoui, Romain Br'egier, Philippe Weinzaepfel, Gr'egory Rogez, Thomas Lucas

We present Multi-HMR, a strong sigle-shot model for multi-person 3D human mesh recovery from a single RGB image. Predictions encompass the whole body, i.e., including hands and facial expressions, using the SMPL-X parametric model and 3D location in the camera coordinate system. Our model detects people by predicting coarse 2D heatmaps of person locations, using features produced by a standard Vision Transformer (ViT) backbone. It then predicts their whole-body pose, shape and 3D location using a new cross-attention module called the Human Prediction Head (HPH), with one query attending to the entire set of features for each detected person. As direct prediction of fine-grained hands and facial poses in a single shot, i.e., without relying on explicit crops around body parts, is hard to learn from existing data, we introduce CUFFS, the Close-Up Frames of Full-Body Subjects dataset, containing humans close to the camera with diverse hand poses. We show that incorporating it into the training data further enhances predictions, particularly for hands. Multi-HMR also optionally accounts for camera intrinsics, if available, by encoding camera ray directions for each image token. This simple design achieves strong performance on whole-body and body-only benchmarks simultaneously: a ViT-S backbone on $448{times}448$ images already yields a fast and competitive model, while larger models and higher resolutions obtain state-of-the-art results.

7/25/2024

Human Mesh Recovery from Arbitrary Multi-view Images

Xiaoben Li, Mancheng Meng, Ziyan Wu, Terrence Chen, Fan Yang, Dinggang Shen

Human mesh recovery from arbitrary multi-view images involves two characteristics: the arbitrary camera poses and arbitrary number of camera views. Because of the variability, designing a unified framework to tackle this task is challenging. The challenges can be summarized as the dilemma of being able to simultaneously estimate arbitrary camera poses and recover human mesh from arbitrary multi-view images while maintaining flexibility. To solve this dilemma, we propose a divide and conquer framework for Unified Human Mesh Recovery (U-HMR) from arbitrary multi-view images. In particular, U-HMR consists of a decoupled structure and two main components: camera and body decoupling (CBD), camera pose estimation (CPE), and arbitrary view fusion (AVF). As camera poses and human body mesh are independent of each other, CBD splits the estimation of them into two sub-tasks for two individual sub-networks (ie, CPE and AVF) to handle respectively, thus the two sub-tasks are disentangled. In CPE, since each camera pose is unrelated to the others, we adopt a shared MLP to process all views in a parallel way. In AVF, in order to fuse multi-view information and make the fusion operation independent of the number of views, we introduce a transformer decoder with a SMPL parameters query token to extract cross-view features for mesh recovery. To demonstrate the efficacy and flexibility of the proposed framework and effect of each component, we conduct extensive experiments on three public datasets: Human3.6M, MPI-INF-3DHP, and TotalCapture.

6/18/2024

MeGA: Hybrid Mesh-Gaussian Head Avatar for High-Fidelity Rendering and Head Editing

Cong Wang, Di Kang, He-Yi Sun, Shen-Han Qian, Zi-Xuan Wang, Linchao Bao, Song-Hai Zhang

Creating high-fidelity head avatars from multi-view videos is a core issue for many AR/VR applications. However, existing methods usually struggle to obtain high-quality renderings for all different head components simultaneously since they use one single representation to model components with drastically different characteristics (e.g., skin vs. hair). In this paper, we propose a Hybrid Mesh-Gaussian Head Avatar (MeGA) that models different head components with more suitable representations. Specifically, we select an enhanced FLAME mesh as our facial representation and predict a UV displacement map to provide per-vertex offsets for improved personalized geometric details. To achieve photorealistic renderings, we obtain facial colors using deferred neural rendering and disentangle neural textures into three meaningful parts. For hair modeling, we first build a static canonical hair using 3D Gaussian Splatting. A rigid transformation and an MLP-based deformation field are further applied to handle complex dynamic expressions. Combined with our occlusion-aware blending, MeGA generates higher-fidelity renderings for the whole head and naturally supports more downstream tasks. Experiments on the NeRSemble dataset demonstrate the effectiveness of our designs, outperforming previous state-of-the-art methods and supporting various editing functionalities, including hairstyle alteration and texture editing.

5/1/2024