RTMW: Real-Time Multi-Person 2D and 3D Whole-body Pose Estimation

Read original: arXiv:2407.08634 - Published 7/12/2024 by Tao Jiang, Xinchen Xie, Yining Li

Overview

This paper presents a real-time multi-person 2D and 3D whole-body pose estimation system called RTMW.
The system can accurately detect and estimate the 2D and 3D poses of multiple people in a single image or video frame.
It reports strong results on whole-body pose benchmarks, including 70.2 mAP on COCO-Wholebody for the RTMW-l model, while keeping inference efficient.

Plain English Explanation

RTMW is a computer vision system that automatically locates multiple people in an image or video frame and estimates each person's pose. "Whole-body" means it predicts keypoints not only for the torso and limbs but also for the hands, face, and feet, in both 2D and 3D.

This is a challenging task because people can appear in all sorts of different poses and orientations, and there may be multiple people in the same scene. RTMW uses advanced deep learning algorithms to tackle this problem quickly and accurately.

RTMW is designed for efficient inference and easy deployment, so it can process images and video in real time. This makes it useful for a variety of applications, such as sports analysis, surveillance, or interactive entertainment.

According to the paper's abstract, the key ingredients in RTMW are an efficient network based on the RTMPose architecture, extended with FPN and a Hierarchical Encoding Module (HEM) to capture body parts at different scales; training on a large collection of open-source keypoint datasets with manually aligned annotations; and a two-stage distillation strategy. Together these let RTMW-l reach 70.2 mAP on the COCO-Wholebody benchmark.

Technical Explanation

RTMW builds on prior work in real-time pose estimation, specifically the RTMPose model architecture, and extends it to whole-body pose estimation in both 2D and 3D.

The core of the RTMW system is a neural network that takes an image of a person and outputs the 2D locations of whole-body keypoints: body, hands, face, and feet. Because these parts appear at very different scales, the authors add FPN and HEM (Hierarchical Encoding Module) to the RTMPose architecture to better capture pose information from each of them.

To estimate the 3D pose, the abstract describes image-based monocular 3D whole-body pose estimation performed in a coordinate classification manner. In other words, each coordinate of a keypoint is predicted as a classification over discretized positions along an axis rather than regressed directly as a number.
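The paper's abstract describes the 3D estimation as done "in a coordinate classification manner". The sketch below illustrates that general idea only: each axis is split into bins, the network scores every bin, and the highest-scoring bin is converted back to a coordinate. The function name, the array layout, and the `split_ratio` value are assumptions made for illustration and are not taken from the RTMW code.

```python
import numpy as np


def decode_coordinate_classification(logits_per_axis, split_ratio=2.0):
    """Turn per-axis classification scores into continuous coordinates.

    logits_per_axis: list of arrays, one per axis (x, y, optionally z),
        each of shape (num_keypoints, num_bins_for_that_axis).
    split_ratio: assumed number of bins per unit along the input axis.
    Returns (coords, scores) where coords has shape
    (num_keypoints, num_axes) and scores has shape (num_keypoints,).
    """
    coords = []
    peak_scores = []
    for logits in logits_per_axis:
        best_bin = np.argmax(logits, axis=1)          # most likely bin per keypoint
        coords.append(best_bin / split_ratio)         # bin index -> coordinate
        peak_scores.append(np.max(logits, axis=1))    # score of that bin
    coords = np.stack(coords, axis=1)
    # Use the weakest axis as a conservative per-keypoint confidence.
    scores = np.min(np.stack(peak_scores, axis=1), axis=1)
    return coords, scores
```

Passing two arrays yields 2D keypoints and passing three yields 3D keypoints, which is one reason this formulation extends naturally from 2D to 3D.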

Furthermore, RTMW addresses the challenge of detecting and localizing multiple people in the same image. It does this by using a person detection module to first identify the locations of individual people, and then applying the pose estimation network to each person independently.
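The detect-then-estimate flow described above can be sketched as follows. This is a simplified illustration, not the authors' pipeline: `detect_people` and `estimate_pose` are hypothetical callables standing in for a person detector and a pose network, and a real system would typically also resize each crop to a fixed input size before inference.

```python
import numpy as np


def estimate_poses_top_down(image, detect_people, estimate_pose):
    """Run a two-stage (top-down) multi-person pose pipeline.

    detect_people(image) -> list of integer boxes (x0, y0, x1, y1).
    estimate_pose(crop)  -> array of shape (num_keypoints, 2) holding
                            (x, y) positions in crop pixel coordinates.
    Both callables are placeholders supplied by the caller.
    Returns one (num_keypoints, 2) array per person, in image coordinates.
    """
    poses = []
    for x0, y0, x1, y1 in detect_people(image):
        crop = image[y0:y1, x0:x1]                    # stage 1: isolate one person
        keypoints = np.asarray(estimate_pose(crop), dtype=float)
        # Stage 2 output is relative to the crop; shift back to the full image.
        poses.append(keypoints + np.array([x0, y0], dtype=float))
    return poses
```

One consequence of this design is that the pose network runs once per detected person, so total cost grows with the number of people in the frame.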

The authors evaluate RTMW on multiple whole-body pose estimation benchmarks. The headline result in the abstract is on COCO-Wholebody, where RTMW-l reaches 70.2 mAP, which the authors describe as the first open-source model to exceed 70 mAP on that benchmark. They also report that the models maintain high inference efficiency and are easy to deploy.
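COCO-style keypoint benchmarks compute average precision from Object Keypoint Similarity (OKS), which plays the role that IoU plays in object detection. The sketch below follows the standard OKS definition; the function name and argument layout are illustrative choices, and the per-keypoint `sigmas` must come from the benchmark definition (the values used in any example call are placeholders, not the official constants).

```python
import numpy as np


def object_keypoint_similarity(pred, gt, visibility, area, sigmas):
    """Compute OKS between one predicted pose and one ground-truth pose.

    pred, gt:   arrays of shape (num_keypoints, 2) with (x, y) positions.
    visibility: array of shape (num_keypoints,); > 0 means the keypoint
                is labeled in the ground truth.
    area:       ground-truth object area in pixels (scale normalizer).
    sigmas:     per-keypoint falloff constants defined by the benchmark.
    """
    pred = np.asarray(pred, dtype=float)
    gt = np.asarray(gt, dtype=float)
    labeled = np.asarray(visibility) > 0
    if not labeled.any():
        return 0.0
    squared_dist = np.sum((pred - gt) ** 2, axis=1)
    variances = (2.0 * np.asarray(sigmas, dtype=float)) ** 2
    # Normalize the error by object scale and per-keypoint tolerance.
    error = squared_dist / variances / (area + np.spacing(1)) / 2.0
    # Average the similarity over labeled keypoints only.
    return float(np.mean(np.exp(-error[labeled])))
```

A prediction counts as correct at a given threshold (for example 0.5) when its OKS exceeds that threshold, and mAP averages precision over a range of such thresholds.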

Critical Analysis

The authors present a compelling real-time multi-person pose estimation system in RTMW. The technical choices, such as the efficient network architecture and the approach to 3D pose estimation, are well-justified and appear to deliver strong empirical results.

However, the paper does not address some potential limitations of the approach. For example, the system may struggle with occlusions, where one person's body parts are hidden by another person or object in the scene. Handling such cases could require additional techniques like multi-view or temporal reasoning.

Additionally, the evaluation is focused primarily on standard benchmarks, which may not fully capture the challenges of real-world deployment. Further testing on more diverse and unconstrained datasets could help identify areas for improvement.

Overall, RTMW represents an impressive step forward in real-time human pose estimation. With continued research and refinement, systems like this could enable a wide range of applications in fields like sports, healthcare, and human-computer interaction.

Conclusion

The RTMW system presented in this paper demonstrates state-of-the-art performance in real-time multi-person 2D and 3D whole-body pose estimation. By combining efficient network design, novel 3D pose estimation techniques, and effective multi-person handling, the authors have created a system that can accurately track the movements of multiple people in a scene while running in real-time on consumer hardware.

This work has significant implications for a variety of applications, from sports analytics and surveillance to interactive entertainment and assistive technology. As the field of human pose estimation continues to advance, systems like RTMW will play an increasingly important role in enabling machines to better understand and interact with humans in the real world.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

RTMW: Real-Time Multi-Person 2D and 3D Whole-body Pose Estimation

Tao Jiang, Xinchen Xie, Yining Li

Whole-body pose estimation is a challenging task that requires simultaneous prediction of keypoints for the body, hands, face, and feet. Whole-body pose estimation aims to predict fine-grained pose information for the human body, including the face, torso, hands, and feet, which plays an important role in the study of human-centric perception and generation and in various applications. In this work, we present RTMW (Real-Time Multi-person Whole-body pose estimation models), a series of high-performance models for 2D/3D whole-body pose estimation. We incorporate RTMPose model architecture with FPN and HEM (Hierarchical Encoding Module) to better capture pose information from different body parts with various scales. The model is trained with a rich collection of open-source human keypoint datasets with manually aligned annotations and further enhanced via a two-stage distillation strategy. RTMW demonstrates strong performance on multiple whole-body pose estimation benchmarks while maintaining high inference efficiency and deployment friendliness. We release three sizes: m/l/x, with RTMW-l achieving a 70.2 mAP on the COCO-Wholebody benchmark, making it the first open-source model to exceed 70 mAP on this benchmark. Meanwhile, we explored the performance of RTMW in the task of 3D whole-body pose estimation, conducting image-based monocular 3D whole-body pose estimation in a coordinate classification manner. We hope this work can benefit both academic research and industrial applications. The code and models have been made publicly available at: https://github.com/open-mmlab/mmpose/tree/main/projects/rtmpose

7/12/2024

RTMO: Towards High-Performance One-Stage Real-Time Multi-Person Pose Estimation

Peng Lu, Tao Jiang, Yining Li, Xiangtai Li, Kai Chen, Wenming Yang

Real-time multi-person pose estimation presents significant challenges in balancing speed and precision. While two-stage top-down methods slow down as the number of people in the image increases, existing one-stage methods often fail to simultaneously deliver high accuracy and real-time performance. This paper introduces RTMO, a one-stage pose estimation framework that seamlessly integrates coordinate classification by representing keypoints using dual 1-D heatmaps within the YOLO architecture, achieving accuracy comparable to top-down methods while maintaining high speed. We propose a dynamic coordinate classifier and a tailored loss function for heatmap learning, specifically designed to address the incompatibilities between coordinate classification and dense prediction models. RTMO outperforms state-of-the-art one-stage pose estimators, achieving 1.1% higher AP on COCO while operating about 9 times faster with the same backbone. Our largest model, RTMO-l, attains 74.8% AP on COCO val2017 and 141 FPS on a single V100 GPU, demonstrating its efficiency and accuracy. The code and models are available at https://github.com/open-mmlab/mmpose/tree/main/projects/rtmo.

4/9/2024

Multi-HMR: Multi-Person Whole-Body Human Mesh Recovery in a Single Shot

Fabien Baradel, Matthieu Armando, Salma Galaaoui, Romain Brégier, Philippe Weinzaepfel, Grégory Rogez, Thomas Lucas

We present Multi-HMR, a strong single-shot model for multi-person 3D human mesh recovery from a single RGB image. Predictions encompass the whole body, i.e., including hands and facial expressions, using the SMPL-X parametric model and 3D location in the camera coordinate system. Our model detects people by predicting coarse 2D heatmaps of person locations, using features produced by a standard Vision Transformer (ViT) backbone. It then predicts their whole-body pose, shape and 3D location using a new cross-attention module called the Human Prediction Head (HPH), with one query attending to the entire set of features for each detected person. As direct prediction of fine-grained hands and facial poses in a single shot, i.e., without relying on explicit crops around body parts, is hard to learn from existing data, we introduce CUFFS, the Close-Up Frames of Full-Body Subjects dataset, containing humans close to the camera with diverse hand poses. We show that incorporating it into the training data further enhances predictions, particularly for hands. Multi-HMR also optionally accounts for camera intrinsics, if available, by encoding camera ray directions for each image token. This simple design achieves strong performance on whole-body and body-only benchmarks simultaneously: a ViT-S backbone on $448 \times 448$ images already yields a fast and competitive model, while larger models and higher resolutions obtain state-of-the-art results.

7/25/2024

RT-Pose: A 4D Radar Tensor-based 3D Human Pose Estimation and Localization Benchmark

Yuan-Hao Ho, Jen-Hao Cheng, Sheng Yao Kuan, Zhongyu Jiang, Wenhao Chai, Hsiang-Wei Huang, Chih-Lung Lin, Jenq-Neng Hwang

Traditional methods for human localization and pose estimation (HPE), which mainly rely on RGB images as an input modality, confront substantial limitations in real-world applications due to privacy concerns. In contrast, radar-based HPE methods emerge as a promising alternative, characterized by distinctive attributes such as through-wall recognition and privacy-preserving, rendering the method more conducive to practical deployments. This paper presents a Radar Tensor-based human pose (RT-Pose) dataset and an open-source benchmarking framework. The RT-Pose dataset comprises 4D radar tensors, LiDAR point clouds, and RGB images, and is collected for a total of 72k frames across 240 sequences with six different complexity-level actions. The 4D radar tensor provides raw spatio-temporal information, differentiating it from other radar point cloud-based datasets. We develop an annotation process using RGB images and LiDAR point clouds to accurately label 3D human skeletons. In addition, we propose HRRadarPose, the first single-stage architecture that extracts the high-resolution representation of 4D radar tensors in 3D space to aid human keypoint estimation. HRRadarPose outperforms previous radar-based HPE work on the RT-Pose benchmark. The overall HRRadarPose performance on the RT-Pose dataset, as reflected in a mean per joint position error (MPJPE) of 9.91cm, indicates the persistent challenges in achieving accurate HPE in complex real-world scenarios. RT-Pose is available at https://huggingface.co/datasets/uwipl/RT-Pose.

7/22/2024