Mirror-Aware Neural Humans

2309.04750

Published 5/17/2024 by Daniel Ajisafe, James Tang, Shih-Yang Su, Bastian Wandt, Helge Rhodin


Abstract

Human motion capture either requires multi-camera systems or is unreliable when using single-view input due to depth ambiguities. Meanwhile, mirrors are readily available in urban environments and form an affordable alternative by recording two views with only a single camera. However, the mirror setting poses the additional challenge of handling occlusions of real and mirror image. Going beyond existing mirror approaches for 3D human pose estimation, we utilize mirrors for learning a complete body model, including shape and dense appearance. Our main contributions are extending articulated neural radiance fields to include a notion of a mirror, making it sample-efficient over potential occlusion regions. Together, our contributions realize a consumer-level 3D motion capture system that starts from off-the-shelf 2D poses by automatically calibrating the camera, estimating mirror orientation, and subsequently lifting 2D keypoint detections to 3D skeleton pose that is used to condition the mirror-aware NeRF. We empirically demonstrate the benefit of learning a body model and accounting for occlusion in challenging mirror scenes.


Overview

Current human motion capture systems have limitations, such as requiring multiple cameras or struggling with depth ambiguities from single-view input.
Mirrors offer an affordable alternative by providing two views with a single camera, but they introduce the challenge of handling occlusions of real and mirror image.
This paper goes beyond existing mirror-based 3D human pose estimation approaches by learning a complete body model, including shape and dense appearance, using mirrors.

Plain English Explanation

Capturing the 3D motion of a person's body is an important technology with many applications, such as virtual reality, robotics, and animation. However, current systems have limitations. Multi-camera systems can accurately track 3D motion, but they are expensive and complex to set up. Single-camera systems are more affordable, but they struggle to determine depth, because a single 2D image does not fix how far each body part is from the camera.

An interesting alternative is to use mirrors to capture two views of the person with a single camera. This provides depth information without the need for multiple cameras. However, the mirror setting also introduces a new challenge: the camera can see both the real person and their reflection in the mirror, which can sometimes overlap and occlude each other.
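To make the "second view" idea concrete, here is a minimal sketch of the underlying geometry: the mirror image of a 3D point is its reflection across the mirror plane. The plane parameters, the joint position, and the function name `reflect_points` are illustrative assumptions, not values or code from the paper.

```python
import numpy as np

def reflect_points(points, normal, offset):
    """Reflect 3D points across the plane {x : normal . x + offset = 0}."""
    n = np.asarray(normal, dtype=float)
    scale = np.linalg.norm(n)
    n, offset = n / scale, offset / scale      # normalize the plane equation
    pts = np.asarray(points, dtype=float)      # shape (N, 3)
    signed_dist = pts @ n + offset             # signed distance of each point to the plane
    return pts - 2.0 * signed_dist[:, None] * n

# Hypothetical setup: the mirror is the plane z = 3 (normal along +z, offset -3).
joint = np.array([[0.2, 1.5, 1.0]])
mirrored = reflect_points(joint, normal=[0.0, 0.0, 1.0], offset=-3.0)
print(mirrored)  # the joint appears as far behind the mirror as it stands in front of it
```

Reflecting twice returns the original point, which is why the mirror image can be treated as a consistent second observation of the same body.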

This paper goes a step further by not just using mirrors for 3D pose estimation, but actually learning a complete 3D model of the person's body, including their shape and appearance. The system lifts the 2D keypoints detected in the camera and mirror views to a 3D skeleton pose, and the body model is conditioned on that pose, allowing it to reconstruct the 3D body even in the presence of occlusions.

Technical Explanation

The key technical contributions of this paper are:

Extending Articulated Neural Radiance Fields (NeRFs) to Include a Mirror: The researchers developed a NeRF-based model that can account for the mirror view in addition to the real-world view. This mirror-aware NeRF is able to efficiently sample the potential occlusion regions between the real person and their reflection.
Learning a Complete 3D Body Model: Unlike previous mirror-based 3D pose estimation approaches, this paper learns a full 3D model of the person's body shape and dense appearance, not just their skeleton pose. This richer model is conditioned on the 3D skeleton pose that is lifted from the 2D keypoints detected in both the camera and mirror views.
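One plausible way to picture a "notion of a mirror" inside a radiance field is sketched below. It is an illustration under stated assumptions, not the authors' implementation: a camera ray that hits the mirror is replaced by its reflection, so the same body model can be queried for the mirror image, and samples are combined with the standard NeRF alpha-compositing quadrature so that whatever is closer occludes what lies behind it. The scene layout, sample densities, and the names `reflect_ray` and `composite` are hypothetical.

```python
import numpy as np

def reflect_ray(origin, direction, normal, offset):
    """Mirror a ray across the plane {x : normal . x + offset = 0} (unit normal assumed)."""
    n = np.asarray(normal, dtype=float)
    o = np.asarray(origin, dtype=float)
    d = np.asarray(direction, dtype=float)
    o_ref = o - 2.0 * (o @ n + offset) * n   # the origin is a point: the offset applies
    d_ref = d - 2.0 * (d @ n) * n            # the direction is a vector: no offset
    return o_ref, d_ref

def composite(densities, colors, step):
    """Standard NeRF quadrature: alpha-composite ordered samples along one ray."""
    alpha = 1.0 - np.exp(-densities * step)
    # Transmittance: fraction of light that survives all earlier samples.
    trans = np.cumprod(np.concatenate([[1.0], 1.0 - alpha[:-1]]))
    return (trans * alpha) @ colors

# Hypothetical scene: camera at the origin looking along +z, mirror plane z = 3.
n, off = np.array([0.0, 0.0, 1.0]), -3.0
o, d = np.zeros(3), np.array([0.0, 0.0, 1.0])
o_virtual, d_virtual = reflect_ray(o, d, n, off)

# Toy samples on one ray: an opaque "real body" sample (red) lies in front of
# a "mirror image" sample (blue), so the real body should occlude the reflection.
densities = np.array([0.0, 50.0, 0.0, 50.0])
colors = np.array([[0, 0, 0], [1, 0, 0], [0, 0, 0], [0, 0, 1]], dtype=float)
pixel = composite(densities, colors, step=1.0)
print(o_virtual, d_virtual, pixel)
```

In this toy case the composited pixel is dominated by the nearer sample, which is the behavior an occlusion-aware model needs where the real and mirrored person overlap in the image.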

The overall system works as follows:

2D Pose Estimation: Off-the-shelf 2D pose detectors are used to find the locations of the person's body joints in the camera and mirror views.
Camera and Mirror Calibration: The system automatically calibrates the camera and estimates the orientation of the mirror.
3D Pose and Body Model Estimation: The 2D keypoint detections are lifted to a 3D skeleton pose, which is then used to condition the mirror-aware NeRF that models the person's body shape and dense appearance.
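The lifting step can be illustrated with textbook two-view geometry. The sketch below is an assumption-laden example rather than the paper's procedure: it treats the mirror as a virtual second camera (the real projection matrix composed with a reflection) and recovers one joint by linear (DLT) triangulation from its real and mirrored 2D locations. The intrinsics `K`, the mirror plane, the joint position, and all function names are hypothetical.

```python
import numpy as np

def reflection_matrix(normal, offset):
    """4x4 homogeneous reflection across {x : normal . x + offset = 0} (unit normal)."""
    n = np.asarray(normal, dtype=float)
    S = np.eye(4)
    S[:3, :3] -= 2.0 * np.outer(n, n)
    S[:3, 3] = -2.0 * offset * n
    return S

def project(P, X):
    """Project a 3D point with a 3x4 camera matrix to pixel coordinates."""
    x = P @ np.append(X, 1.0)
    return x[:2] / x[2]

def triangulate(P_real, P_virtual, x_real, x_mirror):
    """Linear (DLT) triangulation of one joint from its real and mirrored detections."""
    rows = []
    for P, (u, v) in ((P_real, x_real), (P_virtual, x_mirror)):
        rows.append(u * P[2] - P[0])
        rows.append(v * P[2] - P[1])
    _, _, Vt = np.linalg.svd(np.stack(rows))
    X = Vt[-1]                      # null vector of the stacked constraints
    return X[:3] / X[3]

# Hypothetical calibration: pinhole camera at the origin, tilted mirror plane.
K = np.array([[1000.0, 0.0, 640.0], [0.0, 1000.0, 360.0], [0.0, 0.0, 1.0]])
P_real = K @ np.hstack([np.eye(3), np.zeros((3, 1))])
S = reflection_matrix([0.6, 0.0, 0.8], -2.4)
P_virtual = P_real @ S              # seeing the mirror image equals a reflected camera

X_true = np.array([0.2, 0.1, 1.5])  # one joint, in metres (illustrative)
x_real = project(P_real, X_true)    # stands in for a 2D detection of the real person
x_mirror = project(P_virtual, X_true)  # stands in for a detection of the mirror image
X_hat = triangulate(P_real, P_virtual, x_real, x_mirror)
print(X_hat)
```

With noise-free inputs the triangulated joint matches the ground truth; with real detections, a refinement step would typically follow.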

The researchers empirically demonstrate the benefit of learning a body model and accounting for occlusion in challenging mirror scenes.

Critical Analysis

The paper presents an interesting and novel approach to 3D human motion capture using a single camera and mirrors. The key strength is the ability to learn a complete 3D body model, including shape and appearance, while handling the occlusion challenges introduced by the mirror setting.

However, the paper does not address certain limitations and potential areas for further research:

Although camera calibration and mirror orientation are estimated automatically, both steps start from off-the-shelf 2D pose detections, so detection errors may propagate and limit practical deployment in real-world scenarios.
The experiments are conducted in controlled lab settings, and it's unclear how the approach would perform in more unstructured, real-world environments with varying lighting, backgrounds, and mirror placements.
The computational complexity of the mirror-aware NeRF model is not discussed, which could be a concern for real-time applications or deployment on resource-constrained devices.

Additionally, while the paper focuses on the technical contributions, it would be valuable to also consider the potential societal implications and ethical considerations of such a technology. For example, the ability to accurately reconstruct 3D human models from camera inputs raises privacy concerns that should be thoughtfully addressed.

Conclusion

This paper presents a novel approach to 3D human motion capture that leverages mirrors to provide depth information with a single camera. By extending articulated neural radiance fields to account for the mirror view and learning a complete 3D body model, the system can accurately reconstruct human motion in challenging scenes with occlusions.

The technical contributions demonstrate the potential for affordable and accessible 3D motion capture systems, with applications in virtual reality, robotics, and animation. However, the practical limitations and ethical implications of such technologies also need to be considered as they continue to develop.

