Improving 2D Human Pose Estimation in Rare Camera Views with Synthetic Data

Read original: arXiv:2307.06737 - Published 4/23/2024 by Miroslav Purkrabek, Jiri Matas

Overview

Current human pose estimation methods focus on side and front-view scenarios
This paper introduces RePoGen, a method for generating synthetic humans with diverse poses and views
Experiments show that adding RePoGen data to existing datasets improves performance on top and bottom-view pose estimation without hurting common view performance
The paper also finds that anatomical plausibility, a focus of prior research, is not necessary for effective performance

Plain English Explanation

Researchers have typically focused on estimating human poses in side and front-view scenarios. This paper proposes a new method called RePoGen that can generate synthetic humans with a wide range of poses and viewpoints. By adding this synthetic data to existing datasets, the researchers were able to improve the performance of pose estimation models on top and bottom-view scenarios, without negatively impacting their ability to handle common side and front views.

Interestingly, the paper also found that ensuring the synthetic humans look anatomically plausible, which was a focus of previous research efforts like EgoGen, is not actually necessary to get good performance. This suggests that pose estimation models can handle more diverse, and potentially less realistic, training data as long as it captures the key variations in human pose and viewpoint.

Technical Explanation

The core of this paper is the introduction of RePoGen, a method for generating synthetic humans with diverse poses and views using the SMPL body model. RePoGen allows for fine-grained control over factors like limb positions, torso orientation, and camera viewpoint.
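
To make the idea of viewpoint control concrete, here is a minimal sketch of sampling camera positions uniformly over a sphere around the subject, so that top and bottom views occur as often as side views. This is not RePoGen's actual sampler: the function names, the radius, and the z-up world convention are all assumptions made for illustration.

```python
import numpy as np

def sample_camera_positions(n, radius=3.0, seed=0):
    """Sample n camera positions uniformly on a sphere (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    # Uniform sampling on a sphere: z uniform in [-1, 1], azimuth uniform in [0, 2*pi).
    z = rng.uniform(-1.0, 1.0, size=n)
    azimuth = rng.uniform(0.0, 2.0 * np.pi, size=n)
    r_xy = np.sqrt(1.0 - z**2)
    directions = np.stack(
        [r_xy * np.cos(azimuth), r_xy * np.sin(azimuth), z], axis=1
    )
    return radius * directions

def look_at_rotation(cam_pos, target=np.zeros(3), up=np.array([0.0, 0.0, 1.0])):
    """World-to-camera rotation for a camera at cam_pos looking at target.

    Rows are the camera's right, up, and backward axes (the camera looks
    along its negative z axis), expressed in world coordinates.
    """
    forward = target - cam_pos
    forward = forward / np.linalg.norm(forward)
    # Directly above or below the subject the default up vector is parallel
    # to the viewing direction, so fall back to a different reference axis.
    if abs(np.dot(forward, up)) > 0.999:
        up = np.array([0.0, 1.0, 0.0])
    right = np.cross(forward, up)
    right = right / np.linalg.norm(right)
    true_up = np.cross(right, forward)
    return np.stack([right, true_up, -forward], axis=0)

positions = sample_camera_positions(5)
rotations = [look_at_rotation(p) for p in positions]
```

The fallback for straight-up and straight-down views matters here because those are exactly the rare viewpoints the paper targets.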

The researchers then conduct experiments to evaluate the impact of adding RePoGen data to the COCO dataset. Evaluated on existing top-view datasets and on a new dataset of real images with diverse poses, the combined training data improves performance on top-view and bottom-view scenarios without hurting performance on more common side and front views.
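
As a rough illustration of what "adding synthetic data to COCO" can look like in practice, the sketch below merges two COCO-style keypoint annotation dictionaries while keeping image and annotation IDs unique. It assumes the synthetic annotations are already in COCO keypoint format; the function name and the `synthetic` flag are hypothetical and not taken from the paper's code.

```python
def merge_coco_keypoints(real, synthetic):
    """Merge two COCO-style annotation dicts, remapping synthetic IDs."""
    merged = {
        "images": list(real["images"]),
        "annotations": list(real["annotations"]),
        "categories": real["categories"],
    }
    # Start synthetic IDs after the largest real ID to avoid collisions.
    image_offset = max((im["id"] for im in real["images"]), default=0) + 1
    ann_offset = max((a["id"] for a in real["annotations"]), default=0) + 1

    image_id_map = {}
    for k, image in enumerate(synthetic["images"]):
        new_id = image_offset + k
        image_id_map[image["id"]] = new_id
        # "synthetic" is a hypothetical extra field marking the data source.
        merged["images"].append({**image, "id": new_id, "synthetic": True})

    for k, ann in enumerate(synthetic["annotations"]):
        merged["annotations"].append(
            {**ann, "id": ann_offset + k, "image_id": image_id_map[ann["image_id"]]}
        )
    return merged
```

Marking the source of each image makes it possible to later report results separately or to adjust the real-to-synthetic sampling ratio during training.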

An ablation study also shows that anatomical plausibility, a property that prior research has focused on, is not a prerequisite for effective performance. This suggests that pose estimation models can handle more diverse, and potentially less realistic, training data as long as it captures the key variations in human pose and viewpoint.
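
One way to picture this kind of ablation is a pose sampler with an optional plausibility constraint. The sketch below draws axis-angle rotations for the 23 SMPL body joints and clamps them to per-joint limits only when limits are supplied. The sampling distribution, the limit values, and the function name are assumptions for illustration; the paper's actual sampling procedure may differ.

```python
import numpy as np

NUM_BODY_JOINTS = 23  # SMPL body joints, excluding the global root orientation

def sample_body_pose(rng, max_angle=np.pi, joint_limits=None):
    """Sample an axis-angle body pose of shape (23, 3).

    With joint_limits=None every component is drawn freely, so the pose may
    be anatomically implausible. With joint_limits=(lower, upper), each
    component is clamped to the given range (hypothetical limits).
    """
    pose = rng.uniform(-max_angle, max_angle, size=(NUM_BODY_JOINTS, 3))
    if joint_limits is not None:
        lower, upper = joint_limits
        pose = np.clip(pose, lower, upper)
    return pose

rng = np.random.default_rng(0)
unconstrained = sample_body_pose(rng)
limits = (np.full((NUM_BODY_JOINTS, 3), -0.5), np.full((NUM_BODY_JOINTS, 3), 0.5))
constrained = sample_body_pose(rng, joint_limits=limits)
```

Training one model on each variant and comparing accuracy on the same real test set would mirror the structure of the comparison described above.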

Critical Analysis

The paper makes a compelling case for the value of synthetic data generation techniques like RePoGen in expanding the diversity of poses and views seen by human pose estimation models. However, it's worth noting that the real-world dataset used for evaluation is still relatively small, so further testing on larger-scale real-world data would be helpful to fully validate the approach.

Additionally, while the paper shows that anatomical plausibility is not strictly necessary, it would be interesting to explore whether models trained on more anatomically realistic synthetic data exhibit any other benefits, such as improved generalization or sample efficiency. There may be a balance to strike between realism and diversity in the synthetic training data.

Finally, the paper focuses solely on 2D human pose estimation, but the insights around the value of diverse synthetic data generation could potentially extend to other pose-related computer vision tasks, such as relative pose regression for visual re-localization. Exploring these broader applications could further demonstrate the significance of this research.

Conclusion

This paper introduces RePoGen, a method for generating synthetic humans with diverse poses and viewpoints, and shows that adding this data to existing human pose estimation datasets improves performance on challenging top- and bottom-view scenarios without harming performance on common views. The paper also finds that anatomical plausibility is not a strict requirement for effective pose estimation, which suggests that models can handle more diverse, and potentially less realistic, training data as long as it captures key variations in human pose and viewpoint. These insights point to synthetic data generation as a practical way to make human pose estimation systems more robust across viewpoints.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Improving 2D Human Pose Estimation in Rare Camera Views with Synthetic Data

Miroslav Purkrabek, Jiri Matas

Methods and datasets for human pose estimation focus predominantly on side- and front-view scenarios. We overcome the limitation by leveraging synthetic data and introduce RePoGen (RarE POses GENerator), an SMPL-based method for generating synthetic humans with comprehensive control over pose and view. Experiments on top-view datasets and a new dataset of real images with diverse poses show that adding the RePoGen data to the COCO dataset outperforms previous approaches to top- and bottom-view pose estimation without harming performance on common views. An ablation study shows that anatomical plausibility, a property prior research focused on, is not a prerequisite for effective performance. The introduced dataset and the corresponding code are available on https://mirapurkrabek.github.io/RePoGen-paper/ .

4/23/2024

Diversifying Human Pose in Synthetic Data for Aerial-view Human Detection

Yi-Ting Shen, Hyungtae Lee, Heesung Kwon, Shuvra S. Bhattacharyya

We present a framework for diversifying human poses in a synthetic dataset for aerial-view human detection. Our method firstly constructs a set of novel poses using a pose generator and then alters images in the existing synthetic dataset to assume the novel poses while maintaining the original style using an image translator. Since images corresponding to the novel poses are not available in training, the image translator is trained to be applicable only when the input and target poses are similar, thus training does not require the novel poses and their corresponding images. Next, we select a sequence of target novel poses from the novel pose set, using Dijkstra's algorithm to ensure that poses closer to each other are located adjacently in the sequence. Finally, we repeatedly apply the image translator to each target pose in sequence to produce a group of novel pose images representing a variety of different limited body movements from the source pose. Experiments demonstrate that, regardless of how the synthetic data is used for training or the data size, leveraging the pose-diversified synthetic dataset in training generally presents remarkably better accuracy than using the original synthetic dataset on three aerial-view human detection benchmarks (VisDrone, Okutama-Action, and ICG) in the few-shot regime.

5/28/2024

3D Human Reconstruction in the Wild with Synthetic Data Using Generative Models

Yongtao Ge, Wenjia Wang, Yongfan Chen, Hao Chen, Chunhua Shen

In this work, we show that synthetic data created by generative models is complementary to computer graphics (CG) rendered data for achieving remarkable generalization performance on diverse real-world scenes for 3D human pose and shape estimation (HPS). Specifically, we propose an effective approach based on recent diffusion models, termed HumanWild, which can effortlessly generate human images and corresponding 3D mesh annotations. We first collect a large-scale human-centric dataset with comprehensive annotations, e.g., text captions and surface normal images. Then, we train a customized ControlNet model upon this dataset to generate diverse human images and initial ground-truth labels. At the core of this step is that we can easily obtain numerous surface normal images from a 3D human parametric model, e.g., SMPL-X, by rendering the 3D mesh onto the image plane. As there exists inevitable noise in the initial labels, we then apply an off-the-shelf foundation segmentation model, i.e., SAM, to filter negative data samples. Our data generation pipeline is flexible and customizable to facilitate different real-world tasks, e.g., ego-centric scenes and perspective-distortion scenes. The generated dataset comprises 0.79M images with corresponding 3D annotations, covering versatile viewpoints, scenes, and human identities. We train various HPS regressors on top of the generated data and evaluate them on a wide range of benchmarks (3DPW, RICH, EgoBody, AGORA, SSP-3D) to verify the effectiveness of the generated data. By exclusively employing generative models, we generate large-scale in-the-wild human images and high-quality annotations, eliminating the need for real-world data collection.

4/12/2024

On the power of data augmentation for head pose estimation

Michael Welter

Deep learning has been impressively successful in the last decade in predicting human head poses from monocular images. For in-the-wild inputs, the research community has predominantly relied on a single training set of semi-synthetic nature. This paper suggests the combination of different flavors of synthetic data in order to achieve better generalization to natural images. Moreover, additional expansion of the data volume using traditional out-of-plane rotation synthesis is considered. Together with a novel combination of losses and a network architecture with a standard feature-extractor, a competitive model is obtained, both in accuracy and efficiency, which allows full 6 DoF pose estimation in practical real-time applications.

7/12/2024