EPOCH: Jointly Estimating the 3D Pose of Cameras and Humans

Read original: arXiv:2406.19726 - Published 7/1/2024 by Nicola Garau, Giulia Martinelli, Niccolò Bisagno, Denis Tomè, Carsten Stoll

Overview

This paper presents EPOCH, a method for jointly estimating the 3D pose of cameras and humans in a scene.
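The key innovation is that EPOCH estimates the camera parameters and the 3D human pose together under a full perspective camera model, without requiring 3D-labeled training data; per the paper's abstract, only 2D poses are used as weak supervision.
This contrasts with monocular approaches that typically approximate the 2D-3D relationship with an orthographic or weak-perspective camera model instead of estimating the camera alongside the pose.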

Plain English Explanation

EPOCH is a technique that works out the 3D pose of a person and the parameters of the camera that captured them, both at the same time. Earlier monocular methods usually simplified the camera rather than estimating it together with the pose.

The key insight is that the 3D poses of the cameras and people are actually connected - the way the cameras are positioned and pointed affects how the people appear, and vice versa. By optimizing the 3D poses of both the cameras and the people simultaneously, EPOCH can get a much more accurate and complete understanding of the 3D geometry of the scene.

This matters for applications such as 3D human pose estimation from monocular video, 3D human perception from egocentric stereo, and multi-person 3D pose estimation from unlabeled data. By estimating the camera together with the person, EPOCH can provide richer and more accurate 3D information about the scene.

Technical Explanation

The core of the EPOCH method is a framework that estimates the camera parameters and the 3D human pose together, exploiting the inherent connection between the two. According to the paper's abstract, it comprises two networks: a pose regressor (RegNet), which starts from a single image and estimates the 2D pose and the camera parameters, and a pose lifter (LiftNet), which takes those estimates and produces the 3D pose under the full perspective camera model.

Specifically, the 2D keypoints constrain the possible 3D poses and camera parameters: RegNet internally predicts a 3D pose and projects it back to 2D using the estimated camera, so the two can be compared. The objective encourages the 3D poses to be consistent with the observed 2D keypoints, while also enforcing physical plausibility constraints such as joint angle limits and ground plane contact.
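The consistency between 3D poses and observed 2D keypoints can be illustrated with a small reprojection check. The following is a minimal sketch under stated assumptions, not the authors' implementation: it assumes a pinhole camera with a single focal length and a principal point, joints already expressed in the camera frame, and a plain mean pixel distance as the error. The function names and the toy values are hypothetical.

```python
import math


def project_perspective(point_3d, focal, principal_point):
    """Project a camera-frame 3D point (X, Y, Z) to pixel coordinates (u, v)."""
    x, y, z = point_3d
    if z <= 0:
        raise ValueError("point must lie in front of the camera (Z > 0)")
    cx, cy = principal_point
    return (focal * x / z + cx, focal * y / z + cy)


def reprojection_error(joints_3d, keypoints_2d, focal, principal_point):
    """Mean Euclidean pixel distance between projected joints and 2D keypoints."""
    total = 0.0
    for joint, (u_obs, v_obs) in zip(joints_3d, keypoints_2d):
        u, v = project_perspective(joint, focal, principal_point)
        total += math.hypot(u - u_obs, v - v_obs)
    return total / len(joints_3d)


# Toy data (assumed values): three joints, focal length and principal point in pixels.
joints_3d = [(0.0, 0.0, 4.0), (0.5, -1.0, 4.0), (-0.5, 1.0, 5.0)]
focal, principal_point = 1000.0, (500.0, 500.0)

# Keypoints generated from the same camera reproject with zero error.
keypoints_2d = [project_perspective(j, focal, principal_point) for j in joints_3d]
perfect = reprojection_error(joints_3d, keypoints_2d, focal, principal_point)

# Shifting every keypoint by (3, 4) pixels gives a mean error of 5 pixels.
shifted_2d = [(u + 3.0, v + 4.0) for u, v in keypoints_2d]
shifted = reprojection_error(joints_3d, shifted_2d, focal, principal_point)
print(perfect, shifted)
```

In a learning setup, an error of this kind would be minimized with respect to both the predicted 3D joints and the predicted camera parameters, which is what ties the two estimates together.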

By jointly optimizing the camera and human poses, EPOCH is able to resolve inherent ambiguities that arise when considering the 3D geometry of the scene. For example, uncertainties in the camera parameters can be mitigated by the constraints imposed by the human pose, and vice versa. This leads to more robust and accurate 3D pose estimates compared to previous approaches that considered the camera and human pose estimation problems in isolation.
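The kind of ambiguity at stake can be made concrete with a toy calculation. The sketch below is illustrative only and assumes a pinhole camera with the principal point at the origin; the focal length, the points, and the reference depth are hypothetical values. It shows two things: different 3D points on the same viewing ray project to the same pixel, and a weak-perspective projection (one shared depth for all joints) disagrees with the full perspective projection for a joint that is not at the reference depth.

```python
def project_perspective(point_3d, focal):
    """Full perspective: divide by each point's own depth Z."""
    x, y, z = point_3d
    return (focal * x / z, focal * y / z)


def project_weak_perspective(point_3d, focal, reference_depth):
    """Weak perspective: divide every point by one shared reference depth."""
    x, y, _ = point_3d
    return (focal * x / reference_depth, focal * y / reference_depth)


focal = 1000.0  # hypothetical focal length in pixels

# Two different 3D points on the same viewing ray land on the same pixel.
near = (0.5, 0.25, 2.0)
far = (1.5, 0.75, 6.0)  # 'near' scaled by 3 along the ray
same_pixel = project_perspective(near, focal) == project_perspective(far, focal)

# A joint 0.5 units closer to the camera than the shared reference depth.
wrist = (0.5, 0.0, 2.0)
full = project_perspective(wrist, focal)
weak = project_weak_perspective(wrist, focal, reference_depth=2.5)
gap_px = abs(full[0] - weak[0])

print(same_pixel, full, weak, gap_px)
```

With these assumed numbers the two camera models disagree by 50 pixels for the same joint, which is the sort of discrepancy a full perspective model with estimated camera parameters is meant to avoid.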

The EPOCH method is also designed to work in a self-supervised manner, without requiring any labeled 3D training data. This is a significant advantage, as collecting high-quality 3D annotations can be extremely challenging and expensive. Instead, EPOCH leverages readily available 2D keypoint annotations to drive the joint optimization process.

Critical Analysis

The EPOCH method represents a significant advance in the field of 3D human pose estimation, as it estimates the camera and the 3D human pose together. This is an important step forward, as previous monocular techniques typically approximated the camera with an orthographic or weak-perspective model.

That said, the authors acknowledge that EPOCH has some limitations. For example, the method assumes that the camera intrinsic parameters are known a priori, which may not always be the case in real-world scenarios. Additionally, the current formulation is limited to single-person scenes, and would need to be extended to handle multiple people interacting in the same environment.

Another potential concern is the reliance on 2D keypoint detections as the primary input. While the authors show that EPOCH is robust to noisy 2D annotations, there may be cases where the 2D detections are unreliable or ambiguous, which could lead to errors in the estimated 3D poses.

Future research could explore ways to further improve the robustness and generalization of the EPOCH framework, such as by incorporating additional sensing modalities (e.g. depth, inertial data) or by developing more sophisticated optimization techniques. Additionally, extending the method to handle multi-person scenarios would be an important next step to enable its use in more realistic applications.

Overall, the EPOCH method represents an exciting advance in the field of 3D human pose estimation, and the authors' focus on jointly estimating camera and human poses is a promising direction for future research in this area.

Conclusion

The EPOCH method presented in this paper is a novel approach for jointly estimating the 3D poses of both cameras and humans in a scene. By formulating the problem as a joint optimization over the camera parameters and human poses, EPOCH is able to resolve inherent ambiguities and produce more accurate 3D pose estimates compared to previous techniques that considered these problems in isolation.

EPOCH also works in a self-supervised manner: it requires no labeled 3D training data and relies instead on readily available 2D keypoint annotations, which avoids the cost of collecting high-quality 3D annotations.

The authors report state-of-the-art 3D human pose estimation results on the Human3.6M and MPI-INF-3DHP datasets, which suggests potential for related areas such as 3D human pose estimation from monocular video, 3D human perception from egocentric stereo, and multi-person 3D pose estimation from unlabeled data.

While the EPOCH method represents an important step forward, the authors acknowledge that there is still room for improvement, particularly in terms of handling more complex scenarios with multiple people and additional sensing modalities. Nonetheless, this work provides a strong foundation for future research in the field of 3D human pose estimation and scene understanding.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

EPOCH: Jointly Estimating the 3D Pose of Cameras and Humans

Nicola Garau, Giulia Martinelli, Niccolò Bisagno, Denis Tomè, Carsten Stoll

Monocular Human Pose Estimation (HPE) aims at determining the 3D positions of human joints from a single 2D image captured by a camera. However, a single 2D point in the image may correspond to multiple points in 3D space. Typically, the uniqueness of the 2D-3D relationship is approximated using an orthographic or weak-perspective camera model. In this study, instead of relying on approximations, we advocate for utilizing the full perspective camera model. This involves estimating camera parameters and establishing a precise, unambiguous 2D-3D relationship. To do so, we introduce the EPOCH framework, comprising two main components: the pose lifter network (LiftNet) and the pose regressor network (RegNet). LiftNet utilizes the full perspective camera model to precisely estimate the 3D pose in an unsupervised manner. It takes a 2D pose and camera parameters as inputs and produces the corresponding 3D pose estimation. These inputs are obtained from RegNet, which starts from a single image and provides estimates for the 2D pose and camera parameters. RegNet utilizes only 2D pose data as weak supervision. Internally, RegNet predicts a 3D pose, which is then projected to 2D using the estimated camera parameters. This process enables RegNet to establish the unambiguous 2D-3D relationship. Our experiments show that modeling the lifting as an unsupervised task with a camera in-the-loop results in better generalization to unseen data. We obtain state-of-the-art results for the 3D HPE on the Human3.6M and MPI-INF-3DHP datasets. Our code is available at: [Github link upon acceptance, see supplementary materials].

7/1/2024

Multi-person 3D pose estimation from unlabelled data

Daniel Rodriguez-Criado, Pilar Bachiller, George Vogiatzis, Luis J. Manso

Its numerous applications make multi-human 3D pose estimation a remarkably impactful area of research. Nevertheless, assuming a multiple-view system composed of several regular RGB cameras, 3D multi-pose estimation presents several challenges. First of all, each person must be uniquely identified in the different views to separate the 2D information provided by the cameras. Secondly, the 3D pose estimation process from the multi-view 2D information of each person must be robust against noise and potential occlusions in the scenario. In this work, we address these two challenges with the help of deep learning. Specifically, we present a model based on Graph Neural Networks capable of predicting the cross-view correspondence of the people in the scenario along with a Multilayer Perceptron that takes the 2D points to yield the 3D poses of each person. These two models are trained in a self-supervised manner, thus avoiding the need for large datasets with 3D annotations.

4/10/2024

3D Human Pose Perception from Egocentric Stereo Videos

Hiroyasu Akada, Jian Wang, Vladislav Golyanik, Christian Theobalt

While head-mounted devices are becoming more compact, they provide egocentric views with significant self-occlusions of the device user. Hence, existing methods often fail to accurately estimate complex 3D poses from egocentric views. In this work, we propose a new transformer-based framework to improve egocentric stereo 3D human pose estimation, which leverages the scene information and temporal context of egocentric stereo videos. Specifically, we utilize 1) depth features from our 3D scene reconstruction module with uniformly sampled windows of egocentric stereo frames, and 2) human joint queries enhanced by temporal features of the video inputs. Our method is able to accurately estimate human poses even in challenging scenarios, such as crouching and sitting. Furthermore, we introduce two new benchmark datasets, i.e., UnrealEgo2 and UnrealEgo-RW (RealWorld). The proposed datasets offer a much larger number of egocentric stereo views with a wider variety of human motions than the existing datasets, allowing comprehensive evaluation of existing and upcoming methods. Our extensive experiments show that the proposed approach significantly outperforms previous methods. We will release UnrealEgo2, UnrealEgo-RW, and trained models on our project page.

5/16/2024

MPL: Lifting 3D Human Pose from Multi-view 2D Poses

Seyed Abolfazl Ghasemzadeh, Alexandre Alahi, Christophe De Vleeschouwer

Estimating 3D human poses from 2D images is challenging due to occlusions and projective acquisition. Learning-based approaches have been largely studied to address this challenge, both in single and multi-view setups. These solutions however fail to generalize to real-world cases due to the lack of (multi-view) 'in-the-wild' images paired with 3D poses for training. For this reason, we propose combining 2D pose estimation, for which large and rich training datasets exist, and 2D-to-3D pose lifting, using a transformer-based network that can be trained from synthetic 2D-3D pose pairs. Our experiments demonstrate decreases up to 45% in MPJPE errors compared to the 3D pose obtained by triangulating the 2D poses. The framework's source code is available at https://github.com/aghasemzadeh/OpenMPL .

8/21/2024