MovePose: A High-performance Human Pose Estimation Algorithm on Mobile and Edge Devices

Read original: arXiv:2308.09084 - Published 7/25/2024 by Dongyang Yu, Haoyue Zhang, Ruisheng Zhao, Guoqi Chen, Wangpeng An, Yanhong Yang

Overview

Presents MovePose, an optimized lightweight convolutional neural network for real-time body pose estimation on mobile devices
Current solutions lack satisfactory accuracy and speed for human posture estimation on mobile devices
MovePose aims to maintain real-time performance while improving accuracy of human posture estimation
Achieved a Mean Average Precision (mAP) score of 68.0 on the COCO validation set
Demonstrated efficiency with 69+ frames per second (fps) on an Intel i9-10920x CPU and 452+ fps on an NVIDIA RTX3090 GPU
Reached over 11 fps on an Android phone with a Snapdragon 8 + 4G processor

Plain English Explanation

MovePose is a new machine learning model designed to accurately estimate the poses of people in real-time, even on mobile devices like smartphones. Current solutions for this task often struggle to be both fast and accurate enough for practical use on mobile. MovePose aims to address this by using specialized techniques to maintain high performance while also improving the accuracy of the pose estimates.

The key innovations in MovePose include deconvolution, large kernel convolution, and coordinate classification. These enhance the model's ability to understand human poses by increasing its "field of view" and its capacity to learn relevant features. As a result, MovePose achieved strong performance: a Mean Average Precision (mAP) score of 68.0 on the COCO benchmark, while running at over 69 frames per second on a high-end desktop CPU and over 11 frames per second on a modern smartphone.

This high-speed, high-accuracy pose estimation could enable a variety of applications, such as interactive fitness tracking, motion-controlled interfaces, and augmented reality experiences, all running directly on the user's mobile device without the need for a powerful server. By making this technology more accessible, MovePose represents an important step forward for mobile human pose estimation.

Technical Explanation

MovePose is a convolutional neural network architecture designed specifically for real-time body pose estimation on CPU-based mobile devices. To address the shortcomings of current solutions, the researchers incorporated three key techniques:

Deconvolution: Instead of basic upsampling, MovePose uses trainable deconvolution layers to improve the model's capacity and receptive field. This allows it to better capture the spatial relationships in human poses.
Large Kernel Convolution: Employing large, multi-scale convolutional kernels strengthens the model's ability to perceive body parts and their configurations, while maintaining computational efficiency.
Coordinate Classification: MovePose formulates pose estimation as a classification problem over discretized coordinate spaces, rather than direct regression. This simplifies the learning task and improves accuracy.
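
Two of the techniques above can be sketched in plain Python. This is an illustrative toy, not the MovePose implementation: the function names, the 1-D signal, the kernels, and the bins_per_pixel discretization factor are all assumptions chosen for the example. It shows that nearest-neighbour upsampling is a special case of deconvolution with a fixed kernel (a network would learn the kernel instead), and how a coordinate can be read off as the best-scoring bin along each axis.

```python
# Illustrative sketch only: plain-Python stand-ins, not the MovePose code.

def nearest_upsample_1d(x, factor=2):
    """Fixed (non-trainable) upsampling: repeat every sample `factor` times."""
    return [v for v in x for _ in range(factor)]


def transposed_conv_1d(x, weights, stride=2):
    """1-D transposed convolution (deconvolution) without padding.

    Each input sample scatters a copy of `weights`, scaled by its value,
    into the output starting at offset i * stride. In a network, `weights`
    would be learned during training.
    """
    out = [0.0] * ((len(x) - 1) * stride + len(weights))
    for i, v in enumerate(x):
        for j, w in enumerate(weights):
            out[i * stride + j] += v * w
    return out


def decode_coordinate_classification(x_logits, y_logits, bins_per_pixel=2.0):
    """Turn per-axis classification scores into an (x, y) pixel coordinate.

    x_logits and y_logits hold one score per discretized bin along each axis.
    The prediction is the index of the best-scoring bin, mapped back to
    pixels. `bins_per_pixel` is a hypothetical discretization factor; more
    than one bin per pixel gives sub-pixel resolution.
    """
    x_bin = max(range(len(x_logits)), key=x_logits.__getitem__)
    y_bin = max(range(len(y_logits)), key=y_logits.__getitem__)
    return x_bin / bins_per_pixel, y_bin / bins_per_pixel


signal = [1.0, 2.0, 3.0]

# With the fixed kernel [1, 1] and stride 2, deconvolution reproduces
# nearest-neighbour upsampling exactly.
upsampled = nearest_upsample_1d(signal)
deconv_same = transposed_conv_1d(signal, [1.0, 1.0])

# A different kernel yields a different (here smoother) upsampling.
deconv_other = transposed_conv_1d(signal, [0.5, 1.0, 0.5])

# Toy scores for one keypoint: 8 bins along x, 6 bins along y.
x_scores = [0.1, 0.2, 0.4, 1.0, 2.5, 3.1, 0.9, 0.2]
y_scores = [0.3, 1.2, 2.8, 0.7, 0.1, 0.0]
keypoint = decode_coordinate_classification(x_scores, y_scores)

print(upsampled)
print(deconv_same)
print(deconv_other)
print(keypoint)
```

Because the bins are finer than the pixel grid in this toy setup, the decoded x coordinate can fall between pixels, which is one reason a classification formulation can stay precise at low input resolutions.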

On the COCO validation set, MovePose achieved a Mean Average Precision (mAP) score of 68.0, indicating strong performance on this standard benchmark. The model also demonstrated impressive efficiency, running at 69+ fps on an Intel i9-10920x CPU and 452+ fps on an NVIDIA RTX3090 GPU. Even on an Android phone with a Snapdragon 8 + 4G processor, MovePose achieved over 11 fps.
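
To put the throughput figures in per-frame terms, the short sketch below converts the quoted frame rates into time budgets. It assumes frames are processed one at a time, so that frame time is simply 1000 / fps; the paper's measurement setup is not described here, and the helper name frame_time_ms is only illustrative. Since the quoted rates are lower bounds ("69+", "452+", "above 11"), the resulting times are upper bounds.

```python
# Frame rates quoted in the summary, treated as lower bounds.
reported_fps = {
    "Intel i9-10920x CPU": 69,
    "NVIDIA RTX3090 GPU": 452,
    "Android phone (Snapdragon 8 + 4G)": 11,
}


def frame_time_ms(fps):
    """Time available per frame, in milliseconds, at a given frame rate.

    Assumes one frame is processed at a time (no batching or pipelining).
    """
    return 1000.0 / fps


latency_ms = {device: frame_time_ms(fps) for device, fps in reported_fps.items()}

for device, ms in latency_ms.items():
    print(f"{device}: at most about {ms:.1f} ms per frame")
```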

These results show that MovePose can provide high-accuracy, real-time human pose estimation suitable for deployment on a variety of CPU-based devices, including smartphones. This advance in mobile pose estimation could enable new interactive experiences and applications in domains such as fitness tracking, user interfaces, and augmented reality.

Critical Analysis

The MovePose paper provides a comprehensive technical evaluation of the model's performance on various hardware platforms. However, the authors do not extensively discuss potential limitations or areas for future work.

One aspect that could be explored further is the model's robustness to challenging real-world conditions, such as occlusions, diverse body types and poses, and varying lighting and environmental factors. The reported accuracy comes from the COCO benchmark, so assessing MovePose's generalization to other data and less controlled settings would be valuable.

Additionally, the paper does not compare MovePose's accuracy and efficiency to other state-of-the-art mobile pose estimation approaches. Providing a more detailed benchmarking analysis against competing methods would help contextualize the significance of MovePose's achievements.

While the authors mention potential applications for MovePose, they do not delve into the societal implications or potential ethical concerns around deploying such technology, particularly in sensitive domains like healthcare or surveillance. Addressing these considerations could strengthen the overall narrative and impact of the research.

Overall, the MovePose paper presents an impressive technical advancement in mobile pose estimation, but further analysis of its limitations, comparisons to alternatives, and discussion of real-world considerations would enhance the depth and robustness of the work.

Conclusion

MovePose is a novel convolutional neural network architecture that addresses the challenge of providing accurate and efficient human pose estimation on mobile devices. By incorporating specialized techniques like deconvolution, large kernel convolution, and coordinate classification, the model is able to maintain real-time performance while achieving strong accuracy on standard benchmarks.

The results demonstrate MovePose's potential to enable a wide range of interactive applications on CPU-based mobile platforms, from fitness tracking to augmented reality experiences. As mobile computing continues to advance, innovations like MovePose will play a crucial role in bringing powerful computer vision capabilities directly to users' fingertips.

While the paper presents a solid technical contribution, further research is needed to assess MovePose's robustness, compare it to competing methods, and consider the broader implications of deploying such pose estimation technology. Nonetheless, the core achievements of MovePose mark an important milestone in the ongoing quest to bring high-performance human understanding to mobile devices.

Related Papers

MovePose: A High-performance Human Pose Estimation Algorithm on Mobile and Edge Devices

Dongyang Yu, Haoyue Zhang, Ruisheng Zhao, Guoqi Chen, Wangpeng An, Yanhong Yang

We present MovePose, an optimized lightweight convolutional neural network designed specifically for real-time body pose estimation on CPU-based mobile devices. The current solutions do not provide satisfactory accuracy and speed for human posture estimation, and MovePose addresses this gap. It aims to maintain real-time performance while improving the accuracy of human posture estimation for mobile devices. Our MovePose algorithm has attained a Mean Average Precision (mAP) score of 68.0 on the COCO validation dataset. The MovePose algorithm displayed efficiency with a performance of 69+ frames per second (fps) when run on an Intel i9-10920x CPU. Additionally, it showcased an increased performance of 452+ fps on an NVIDIA RTX3090 GPU. On an Android phone equipped with a Snapdragon 8 + 4G processor, the fps reached above 11. To enhance accuracy, we incorporated three techniques: deconvolution, large kernel convolution, and coordinate classification methods. Compared to basic upsampling, deconvolution is trainable, improves model capacity, and enhances the receptive field. Large kernel convolution strengthens these properties at a decreased computational cost. In summary, MovePose provides high accuracy and real-time performance, marking it a potential tool for a variety of applications, including those focused on mobile-side human posture estimation. The code and models for this algorithm will be made publicly accessible.

7/25/2024

Efficient Human Pose Estimation: Leveraging Advanced Techniques with MediaPipe

Sandeep Singh Sengar, Abhishek Kumar, Owen Singh

This study presents significant enhancements in human pose estimation using the MediaPipe framework. The research focuses on improving accuracy, computational efficiency, and real-time processing capabilities by comprehensively optimising the underlying algorithms. Novel modifications are introduced that substantially enhance pose estimation accuracy across challenging scenarios, such as dynamic movements and partial occlusions. The improved framework is benchmarked against traditional models, demonstrating considerable precision and computational speed gains. The advancements have wide-ranging applications in augmented reality, sports analytics, and healthcare, enabling more immersive experiences, refined performance analysis, and advanced patient monitoring. The study also explores the integration of these enhancements within mobile and embedded systems, addressing the need for computational efficiency and broader accessibility. The implications of this research set a new benchmark for real-time human pose estimation technologies and pave the way for future innovations in the field. The implementation code for the paper is available at https://github.com/avhixd/Human_pose_estimation.

7/16/2024

A Lightweight Human Pose Estimation Approach for Edge Computing-Enabled Metaverse with Compressive Sensing

Nguyen Quang Hieu, Dinh Thai Hoang, Diep N. Nguyen

The ability to estimate 3D movements of users over edge computing-enabled networks, such as 5G/6G networks, is a key enabler for the new era of extended reality (XR) and Metaverse applications. Recent advancements in deep learning have shown advantages over optimization techniques for estimating 3D human poses given sparse measurements from sensor signals, i.e., inertial measurement unit (IMU) sensors attached to the XR devices. However, the existing works lack applicability to wireless systems, where transmitting the IMU signals over noisy wireless networks poses significant challenges. Furthermore, the potential redundancy of the IMU signals has not been considered, resulting in highly redundant transmissions. In this work, we propose a novel approach for redundancy removal and lightweight transmission of IMU signals over noisy wireless environments. Our approach utilizes a random Gaussian matrix to transform the original signal into a lower-dimensional space. By leveraging the compressive sensing theory, we have proved that the designed Gaussian matrix can project the signal into a lower-dimensional space and preserve the Set-Restricted Eigenvalue condition, subject to a power transmission constraint. Furthermore, we develop a deep generative model at the receiver to recover the original IMU signals from noisy compressed data, thus enabling the creation of 3D human body movements at the receiver for XR and Metaverse applications. Simulation results on a real-world IMU dataset show that our framework can achieve highly accurate 3D human poses of the user using only 82% of the measurements from the original signals. This is comparable to an optimization-based approach, i.e., Lasso, but is an order of magnitude faster.

9/4/2024

Automatic infant 2D pose estimation from videos: comparing seven deep neural network methods

Filipe Gama, Matej Misar, Lukas Navara, Sergiu T. Popescu, Matej Hoffmann

Automatic markerless estimation of infant posture and motion from ordinary videos carries great potential for movement studies in the wild, facilitating understanding of motor development and massively increasing the chances of early diagnosis of disorders. There is rapid development of human pose estimation methods in computer vision thanks to advances in deep learning and machine learning. However, these methods are trained on datasets featuring adults in different contexts. This work tests and compares seven popular methods (AlphaPose, DeepLabCut/DeeperCut, Detectron2, HRNet, MediaPipe/BlazePose, OpenPose, and ViTPose) on videos of infants in supine position. Surprisingly, all methods except DeepLabCut and MediaPipe have competitive performance without additional finetuning, with ViTPose performing best. Next to standard performance metrics (object keypoint similarity, average precision and recall), we introduce errors expressed in the neck-mid-hip ratio and additionally study missed and redundant detections and the reliability of the internal confidence ratings of the different methods, which are relevant for downstream tasks. Among the networks with competitive performance, only AlphaPose could run close to real time (27 fps) on our machine. We provide documented Docker containers or instructions for all the methods we used, our analysis scripts, and processed data at https://hub.docker.com/u/humanoidsctu and https://osf.io/x465b/.

6/28/2024