Video-Based Human Pose Regression via Decoupled Space-Time Aggregation

Read original: arXiv:2403.19926 - Published 4/1/2024 by Jijie He, Wenwu Yang

Introduction

Human pose estimation aims to locate key anatomical points (such as elbows and knees) on human bodies in images or videos. It is an important task with applications in motion capture, activity analysis, and human-robot interaction. Recent advances in deep learning, particularly convolutional neural networks and Transformer networks, have led to significant progress. However, most current methods focus on static images, even though temporal information and consistency across video frames are highly valuable for handling challenges such as motion blur and occlusion. The paper therefore argues that temporal cues in video sequences should be exploited more fully for human pose estimation.

Figure 1: (a) Compared to our proposed video-based regression method, previous image-based regression methods of RLE [20] and Poseur [25] have a substantial performance decline when processing video input, e.g., the dataset of PoseTrack2017 [16]. (b) Despite the intrinsic spatial correlations among human body joints, each joint exhibits independent motion trajectories temporally.

The paper distinguishes two main categories of human pose estimation methods: heatmap-based and regression-based. Heatmap-based methods generate a likelihood heatmap for each joint, while regression-based methods directly map the input to the output joint coordinates.

Heatmap-based methods have shown superior performance, particularly in video-based approaches. However, their high computation and storage requirements limit their use in 3D (temporal) contexts and real-time video applications. In contrast, regression-based methods are more flexible and efficient.

Unfortunately, existing regression-based approaches are designed for static images and neglect the temporal dependency between video frames, so their performance declines on video input. To address this, the paper introduces Decoupled Space-Time Aggregation (DSTA), a novel video-based human pose regression framework.

DSTA models the spatial structure between adjacent joints and the temporal dynamics of each joint separately, avoiding the conflation of spatiotemporal information. It uses a Joint-centric Feature Decoder (JFD) module to establish feature tokens for each joint, which are then used in the Space-Time Decoupling (STD) module to capture the spatiotemporal relations of pose joints.

Evaluated on video-based benchmarks for human pose estimation, DSTA shows a notable improvement over previous regression-based methods and either surpasses or is on par with state-of-the-art heatmap-based methods. DSTA is also more efficient in computation and storage than heatmap-based multi-frame human pose estimation methods, making it better suited to real-time video applications and easier to deploy, particularly on edge devices.

Related Work

This paper discusses two main approaches for human pose estimation: heatmap-based and regression-based methods.

Heatmap-based methods have become the predominant approach, using likelihood heatmaps to represent joint positions. Top-down methods first detect person bounding boxes and then estimate the pose within those regions, while bottom-up methods first detect individual keypoints and then cluster them into distinct persons. Recent work has leveraged temporal information from adjacent frames to enhance performance for video sequences.

However, heatmap-based methods have drawbacks, including quantization errors and high computational/storage demands for high-resolution heatmaps.
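The quantization error mentioned above can be illustrated with a small sketch. It assumes the simplest possible decoding (take the highest-scoring heatmap cell and map its center back to image pixels); the function names and the sample coordinates are hypothetical, and real heatmap pipelines typically add sub-pixel refinement that this sketch omits.

```python
import math


def decode_heatmap_argmax(true_x, true_y, image_size, heatmap_size):
    """Simulate argmax decoding of an ideal heatmap.

    The true joint is snapped to the heatmap cell that contains it, and the
    center of that cell is mapped back to image pixel coordinates.
    """
    stride = image_size / heatmap_size  # image pixels covered by one cell
    cell_x = min(int(true_x // stride), heatmap_size - 1)
    cell_y = min(int(true_y // stride), heatmap_size - 1)
    return (cell_x + 0.5) * stride, (cell_y + 0.5) * stride


def quantization_error(true_x, true_y, image_size, heatmap_size):
    """Distance in image pixels between the true and the decoded position."""
    pred_x, pred_y = decode_heatmap_argmax(true_x, true_y, image_size, heatmap_size)
    return math.hypot(pred_x - true_x, pred_y - true_y)


# Hypothetical joint location in a 256-pixel crop, decoded at two resolutions.
error_fine = quantization_error(101.0, 57.3, image_size=256, heatmap_size=64)
error_coarse = quantization_error(101.0, 57.3, image_size=256, heatmap_size=16)
print(f"64x64 heatmap error: {error_fine:.2f} px")
print(f"16x16 heatmap error: {error_coarse:.2f} px")
```

In this toy setting the error is bounded by half the diagonal of one heatmap cell, so it grows as the heatmap gets coarser. A regression head that outputs continuous coordinates is not tied to a grid in this way.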

Regression-based methods bypass heatmaps and directly map the input to joint coordinates. While less accurate than heatmap-based methods historically, recent advancements have elevated regression-based performance. However, these static image-based regression methods struggle to capture temporal dependencies when applied to video.

This paper presents a regression-based approach for multi-person human pose estimation in video sequences that outperforms or matches state-of-the-art heatmap-based methods for video.

Method

This paper proposes a Decoupled Space-Time Aggregation (DSTA) method for estimating the locations of human pose joints in video frames. The key aspects are:

Overview: Given a video frame containing multiple persons, the method leverages temporal information from a sequence of consecutive frames to enhance pose estimation for the current frame. It follows a top-down approach, first detecting individuals and then estimating their poses.
Regression-based Pose Estimation: The method adopts a regression-based approach, which directly produces joint coordinates instead of heatmaps. This reduces computation and storage compared to heatmap-based methods.
Joint-centric Feature Decoder (JFD): JFD extracts feature embeddings for each individual joint from the global feature maps produced by the CNN backbone.
Space-Time Decoupling (STD): STD models the temporal dynamic dependencies and spatial structure dependencies of joints separately, producing aggregated space-time features for the current frame.
Local-awareness Attention: The method uses a local-awareness attention mechanism, where each joint only attends to those that are structurally or temporally relevant, reducing computational complexity compared to global attention.
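Loss Computation: The model is trained end-to-end using a residual log-likelihood estimation (RLE) loss, which learns the distribution of regression errors with a normalizing flow instead of fixing it in advance, as conventional L1 or L2 regression losses implicitly do.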
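To make the local-awareness idea in the list above more concrete, here is a minimal, dependency-free sketch of the attention masks it implies. It is not the authors' implementation: the token indexing, the function names, and the five-joint chain skeleton are all assumptions made for illustration, and the paper's exact definition of "structurally or temporally relevant" may differ.

```python
def build_local_masks(num_joints, num_frames, skeleton_edges):
    """Build boolean attention masks over tokens indexed as t * num_joints + j.

    spatial[q][k] is True when k is the same joint as q, or a skeleton
    neighbor of q, in the same frame. temporal[q][k] is True when k is the
    same joint as q in any frame.
    """
    neighbors = {j: {j} for j in range(num_joints)}
    for a, b in skeleton_edges:
        neighbors[a].add(b)
        neighbors[b].add(a)

    n = num_joints * num_frames
    spatial = [[False] * n for _ in range(n)]
    temporal = [[False] * n for _ in range(n)]
    for t in range(num_frames):
        for j in range(num_joints):
            q = t * num_joints + j
            for k in neighbors[j]:  # adjacent joints, same frame
                spatial[q][t * num_joints + k] = True
            for s in range(num_frames):  # same joint, every frame
                temporal[q][s * num_joints + j] = True
    return spatial, temporal


def count_allowed(mask):
    """Number of (query, key) pairs the mask lets through."""
    return sum(sum(row) for row in mask)


# Hypothetical toy skeleton: five joints in a chain, three frames.
TOY_EDGES = [(0, 1), (1, 2), (2, 3), (3, 4)]
spatial_mask, temporal_mask = build_local_masks(5, 3, TOY_EDGES)
global_pairs = (5 * 3) ** 2  # every token attends to every token

print("spatial pairs:", count_allowed(spatial_mask))
print("temporal pairs:", count_allowed(temporal_mask))
print("global pairs:", global_pairs)
```

For this toy setup the spatial mask allows 39 pairs and the temporal mask 45, against 225 for global attention over all joint tokens. Keeping the two masks separate also mirrors the decoupling of spatial structure from per-joint temporal dynamics.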

The proposed DSTA method aims to effectively capture the unique spatiotemporal characteristics of human poses, leading to improved pose estimation performance.

Experiments

The authors evaluated the proposed DSTA model on three widely used video-based benchmarks for human pose estimation: PoseTrack2017, PoseTrack2018, and PoseTrack21. Performance was measured with the Average Precision (AP) metric, reported as the mean AP across all joints (mAP).

The DSTA method, a video-based regression approach, was compared to existing image-based regression methods. The results showed that the video-based DSTA significantly outperformed the image-based methods across different backbone networks. This demonstrates the importance of incorporating temporal information from neighboring frames.

The paper further compared DSTA with state-of-the-art heatmap-based video pose estimation methods. DSTA achieved comparable or superior performance on the PoseTrack datasets at a much lower computational cost.

Experiments also showed that DSTA maintained its advantage over heatmap-based methods even with low-resolution inputs, outperforming the heatmap-based approach by a large margin at 64x64 resolution.

An ablation study was conducted to analyze the impact of different components of the DSTA model. The results demonstrated the importance of modeling temporal dependencies at the joint level, rather than using global pose features. The study also found that the temporal dependency module had a greater impact on performance than the spatial dependency module.

Conclusion

The paper proposes a new regression framework for estimating human poses from video. The framework, called Decoupled Space-Time Aggregation network (DSTA), efficiently uses temporal information in video sequences to improve pose estimation while reducing computational and storage requirements. Extensive experiments show that the method outperforms image-based regression methods and matches or surpasses heatmap-based approaches, suggesting new possibilities for real-time video applications.

Acknowledgment

This work was supported by the "Pioneer" and "Leading Goose" R&D Program of Zhejiang Province (No. 2024C01167).

Appendix

Additional Implementation Details: The paper provides further details on the implementation of the proposed method.

Computation Complexity on More Backbones: The paper evaluates the computational complexity of the proposed method on additional model backbones.

Experiments on PoseTrack2018/21 Datasets: The paper presents experiments conducted on the PoseTrack2018 and PoseTrack21 datasets.

Additional Ablation Study: The paper includes an additional ablation study to further analyze the proposed method.

Qualitative Results: The paper includes qualitative results to complement the quantitative evaluations.

A. Additional Implementation Details

The paper describes how joint embeddings are extracted with the HRNet architecture. In the Joint-centric Feature Decoder (JFD) module, a feature embedding is extracted for each joint from the global feature maps. The HRNet-W48 variant is used, with the high-resolution branch followed by a 1x1 convolutional layer and a joint-wise fully connected feed-forward network. This produces a 32-dimensional feature embedding for each joint.

The models are evaluated on three video-based human pose estimation benchmarks: PoseTrack2017, PoseTrack2018, and PoseTrack21. These datasets contain a varying number of video clips and pose annotations.

During training, data augmentation techniques are used, including random rotation, scaling, truncation, and flipping. The AdamW optimizer is employed, with a base learning rate of 2e-4 that is reduced twice during the 40-epoch training.
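The training schedule above can be sketched as a simple step decay. Only the base learning rate (2e-4), the number of reductions (two), and the 40-epoch length come from the summary; the milestone epochs (20 and 30) and the decay factor (0.1) are hypothetical placeholders, not values reported here. The function behaves like PyTorch's `torch.optim.lr_scheduler.MultiStepLR` but is written in plain Python so it runs without dependencies.

```python
def step_lr(epoch, base_lr=2e-4, milestones=(20, 30), gamma=0.1):
    """Learning rate at a given epoch under step decay.

    milestones and gamma are assumed values for illustration only: the rate
    is multiplied by gamma once for every milestone already reached.
    """
    drops = sum(1 for m in milestones if epoch >= m)
    return base_lr * gamma ** drops


# Print the rate at the start of training and right after each assumed drop.
for epoch in (0, 20, 30, 39):
    print(f"epoch {epoch:2d}: lr = {step_lr(epoch):.1e}")
```

With these assumed settings the rate takes three values over the 40 epochs, which matches the "reduced twice" description.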

B. Computation Complexity on More Backbones

This appendix presents additional comparisons of computation complexity between the regression-based method and the heatmap-based methods on the PoseTrack2017 validation set, using the ResNet-50 and MobileNet-V2 backbones.

The experiments utilized the official open-source codes for the heatmap-based PoseWarper and DCPose methods, which employed 3 deconvolution layers in their network heads to generate high-resolution heatmaps from the backbones.

The results show that the regression-based method outperforms the heatmap-based methods in both backbones, while utilizing significantly lower computation complexity and fewer model parameters. Compared to the HRNet backbone results in the main paper, the regression-based method achieves even greater savings in computational costs and model parameters on the smaller ResNet-50 and MobileNet-V2 backbones.

For the MobileNet-V2 backbone, the regression-based network has only 2.4 million parameters, whereas the heatmap-based networks require 14.8 and 11.3 million parameters. For the ResNet-50 backbone, the FLOPs of the regression-based head are almost negligible, accounting for just 1/9030 or 1/2170 of those required by the heatmap-based heads.
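As a quick check on what these figures imply, the arithmetic can be worked through directly. The numbers are the ones quoted above; the variable names are illustrative, and the summary does not say which heatmap-based method each figure belongs to, so they are left unlabeled.

```python
# Parameter counts in millions for the MobileNet-V2 backbone, as quoted above.
regression_params_m = 2.4
heatmap_params_m = (14.8, 11.3)

# How many times larger each heatmap-based network is.
param_ratios = [h / regression_params_m for h in heatmap_params_m]

# ResNet-50 head FLOPs: the regression head as a share of each heatmap head.
head_flops_share_pct = [100.0 / d for d in (9030, 2170)]

for ratio in param_ratios:
    print(f"heatmap network is {ratio:.2f}x larger in parameters")
for share in head_flops_share_pct:
    print(f"regression head uses {share:.3f}% of the heatmap head FLOPs")
```

So the heatmap-based networks are roughly five to six times larger in parameters on MobileNet-V2, and the regression head costs well under 0.1% of the heatmap heads' FLOPs on ResNet-50.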

The superior computational and storage efficiency of the proposed regression framework holds significant value for the industry, especially for edge devices and real-time video applications.

C. Experiments on PoseTrack2018/21 Datasets

This appendix compares the proposed method with other state-of-the-art methods on the PoseTrack2018 and PoseTrack21 validation sets. The results show that the proposed regression-based method performs at least as well as, and in some cases better than, the state-of-the-art heatmap-based methods.

D. Additional Ablation Study

The paper examines the influence of the size of the joint tokens, which represent the feature embedding for each joint in a pose estimation task. Experiments were conducted on the PoseTrack2017 validation set, and the results showed that as the joint token size increases, the performance gradually improves. However, beyond a size of 16, the performance tends to plateau, indicating that further increases in token size do not yield significant improvements. This suggests that each pose joint requires a sufficiently large feature token to store relevant information, but a token that is too large can lead to spatial redundancy. The researchers have chosen to use a token size of 32, as it provides a balance between capturing sufficient feature information and avoiding unnecessary spatial redundancy.

E. Qualitative Results

The paper includes additional qualitative results on the PoseTrack datasets, shown in Figure 4. The paper also provides additional results in the accompanying video material.

Figure 4: Additional qualitative results of our DSTA on the PoseTrack datasets.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!


Related Papers

Video-Based Human Pose Regression via Decoupled Space-Time Aggregation

Jijie He, Wenwu Yang

By leveraging temporal dependency in video sequences, multi-frame human pose estimation algorithms have demonstrated remarkable results in complicated situations, such as occlusion, motion blur, and video defocus. These algorithms are predominantly based on heatmaps, resulting in high computation and storage requirements per frame, which limits their flexibility and real-time application in video scenarios, particularly on edge devices. In this paper, we develop an efficient and effective video-based human pose regression method, which bypasses intermediate representations such as heatmaps and instead directly maps the input to the output joint coordinates. Despite the inherent spatial correlation among adjacent joints of the human pose, the temporal trajectory of each individual joint exhibits relative independence. In light of this, we propose a novel Decoupled Space-Time Aggregation network (DSTA) to separately capture the spatial contexts between adjacent joints and the temporal cues of each individual joint, thereby avoiding the conflation of spatiotemporal dimensions. Concretely, DSTA learns a dedicated feature token for each joint to facilitate the modeling of their spatiotemporal dependencies. With the proposed joint-wise local-awareness attention mechanism, our method is capable of efficiently and flexibly utilizing the spatial dependency of adjacent joints and the temporal dependency of each joint itself. Extensive experiments demonstrate the superiority of our method. Compared to previous regression-based single-frame human pose estimation methods, DSTA significantly enhances performance, achieving an 8.9 mAP improvement on PoseTrack2017. Furthermore, our approach either surpasses or is on par with the state-of-the-art heatmap-based multi-frame human pose estimation methods. Project page: https://github.com/zgspose/DSTA.

4/1/2024
