HDBN: A Novel Hybrid Dual-branch Network for Robust Skeleton-based Action Recognition

Read original: arXiv:2404.15719 - Published 4/26/2024 by Jinfu Liu, Baiqiao Yin, Jiaying Lin, Jiajun Wen, Yue Li, Mengyuan Liu

Overview

Introduces a novel network architecture called HDBN (Hybrid Dual-Branch Network) for robust skeleton-based action recognition
Combines a graph convolutional network (GCN) branch, MixGCN, with a Transformer branch, MixFormer, to exploit the complementary strengths of the two architectures
Aims to improve the performance and robustness of skeleton-based action recognition tasks

Plain English Explanation

HDBN is a new type of artificial intelligence (AI) model designed to recognize human actions based on the movement of the body's skeleton.

The model works by taking in information about the positions and movements of the joints in the body, and then using two different types of neural networks, a graph convolutional network (GCN) and a Transformer, to analyze this data.

The GCN part of the model treats the skeleton as a graph of connected joints, which matches how the body is structured, while the Transformer part is good at relating information across the whole input. By combining these two approaches, the HDBN model avoids depending on the weaknesses of any single backbone.

The researchers who developed HDBN believe this hybrid approach leads to more robust action recognition than relying on a single backbone. This could be useful for applications like video analysis, motion capture, and human-robot interaction, where being able to reliably identify what someone is doing is important.

Technical Explanation

The HDBN architecture consists of two parallel trunk branches: MixGCN, a GCN-based branch, and MixFormer, a Transformer-based branch.

The MixGCN branch relies on graph convolutional networks, which are well suited to graph-structured data such as the joints and bones of a skeleton. The MixFormer branch relies on Transformers, which are strong at modeling global information. According to the paper's abstract, the two branches model both 2D and 3D skeletal modalities.
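
The paper's abstract credits graph convolutional networks with handling graph-structured data well. As a rough illustration of what that means for a skeleton, the sketch below applies a single generic graph convolution step to a toy pose. It is not the MixGCN implementation: the five-joint skeleton, the `EDGES` list, the weight values, and all function names are hypothetical and chosen only to keep the example small.

```python
# Hypothetical 5-joint toy skeleton (not the UAV-Human joint layout):
# 0 = torso, 1 = head, 2 = left hand, 3 = right hand, 4 = foot
EDGES = [(0, 1), (0, 2), (0, 3), (0, 4)]
NUM_JOINTS = 5


def normalized_adjacency(edges, n):
    """Return D^-1/2 (A + I) D^-1/2 as nested lists (self-loops included)."""
    a = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for i, j in edges:
        a[i][j] = 1.0
        a[j][i] = 1.0
    deg = [sum(row) for row in a]
    return [
        [a[i][j] / (deg[i] ** 0.5 * deg[j] ** 0.5) for j in range(n)]
        for i in range(n)
    ]


def matmul(x, y):
    """Plain nested-list matrix product."""
    return [
        [sum(x[i][k] * y[k][j] for k in range(len(y))) for j in range(len(y[0]))]
        for i in range(len(x))
    ]


def graph_conv(features, adj, weight):
    """One generic graph convolution layer: ReLU(adj @ features @ weight)."""
    mixed = matmul(adj, features)      # each joint averages over its neighbours
    projected = matmul(mixed, weight)  # shared linear map on the channels
    return [[max(0.0, v) for v in row] for row in projected]


# 2D coordinates for one frame, one (x, y) row per joint; values are made up.
joints_xy = [[0.0, 0.0], [0.0, 1.0], [-1.0, 0.5], [1.0, 0.5], [0.0, -1.0]]
weight = [[1.0, 0.0, 0.5], [0.0, 1.0, -0.5]]  # arbitrary 2 -> 3 channel map

adj = normalized_adjacency(EDGES, NUM_JOINTS)
out = graph_conv(joints_xy, adj, weight)
```

Each output row is a feature vector for one joint that now mixes in information from the joints it is connected to, which is the property that makes graph convolutions a natural fit for skeleton data.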

The predictions of the two branches are then combined to produce the final action class. This hybrid approach allows the model to leverage both the GCN's strength on graph-structured data and the Transformer's ability to model global information.
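
One simple way to combine two branches is to average their class probabilities. The sketch below illustrates only that general idea, under stated assumptions: the fusion rule, the `gcn_weight` parameter, the function names, and the example logits are all hypothetical and are not taken from the paper or its code.

```python
import math


def softmax(logits):
    """Convert raw class scores into probabilities (numerically stable)."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_scores(gcn_logits, former_logits, gcn_weight=0.5):
    """Weighted average of the two branches' class probabilities.

    gcn_weight is an assumed hyperparameter in [0, 1]; the remaining
    weight goes to the Transformer branch.
    """
    p_gcn = softmax(gcn_logits)
    p_former = softmax(former_logits)
    return [
        gcn_weight * a + (1.0 - gcn_weight) * b
        for a, b in zip(p_gcn, p_former)
    ]


def predict(gcn_logits, former_logits, gcn_weight=0.5):
    """Return the index of the class with the highest fused probability."""
    fused = fuse_scores(gcn_logits, former_logits, gcn_weight)
    return max(range(len(fused)), key=fused.__getitem__)


# Made-up scores for a 3-class problem: the branches disagree.
gcn_logits = [2.0, 1.0, 0.1]
former_logits = [0.5, 2.5, 0.3]

fused = fuse_scores(gcn_logits, former_logits)
label = predict(gcn_logits, former_logits)
```

When the branches disagree, as in the made-up scores here, the fused prediction follows whichever branch is more confident, and shifting `gcn_weight` toward 0 or 1 recovers a single-branch prediction.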

The researchers also introduce several optimization techniques, such as a dedicated loss function and adaptive feature fusion, to further improve the performance and robustness of the HDBN model.

Critical Analysis

The HDBN paper presents a model for skeleton-based action recognition built from two complementary backbones. The authors report accuracies of 47.95% and 75.36% on two benchmarks of the UAV-Human dataset, outperforming most existing methods, and the model was one of the top solutions in the Multi-Modal Video Reasoning and Analyzing Competition (MMVRAC) of the 2024 ICME Grand Challenge.

However, the paper does not address some potential limitations of the HDBN approach. For example, the model may not be as effective in handling occlusions or missing joint data, which can be common in real-world scenarios. Additionally, the computational complexity of the hybrid architecture could be a concern, especially for deployments on resource-constrained devices.

Further research could explore related ideas, such as fine-grained side information guided dual prompts, and their impact on the HDBN model's performance and robustness. Incorporating additional context or prior knowledge could help the model generalize better to more challenging action recognition tasks.

Conclusion

The HDBN model presented in this paper is a promising step forward in the field of skeleton-based action recognition. By leveraging the complementary strengths of GCNs and Transformers, the HDBN architecture reaches accuracies of 47.95% and 75.36% on two benchmarks of the UAV-Human dataset, outperforming most existing methods.

The researchers' emphasis on improving the model's robustness and generalization capabilities is particularly noteworthy, as it represents an important step towards real-world applications of action recognition technology. Further advancements in this direction could lead to exciting developments in areas such as video surveillance, human-computer interaction, and assistive robotics.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

HDBN: A Novel Hybrid Dual-branch Network for Robust Skeleton-based Action Recognition

Jinfu Liu, Baiqiao Yin, Jiaying Lin, Jiajun Wen, Yue Li, Mengyuan Liu

Skeleton-based action recognition has gained considerable traction thanks to its utilization of succinct and robust skeletal representations. Nonetheless, current methodologies often lean towards utilizing a solitary backbone to model skeleton modality, which can be limited by inherent flaws in the network backbone. To address this and fully leverage the complementary characteristics of various network architectures, we propose a novel Hybrid Dual-Branch Network (HDBN) for robust skeleton-based action recognition, which benefits from the graph convolutional network's proficiency in handling graph-structured data and the powerful modeling capabilities of Transformers for global information. In detail, our proposed HDBN is divided into two trunk branches: MixGCN and MixFormer. The two branches utilize GCNs and Transformers to model both 2D and 3D skeletal modalities respectively. Our proposed HDBN emerged as one of the top solutions in the Multi-Modal Video Reasoning and Analyzing Competition (MMVRAC) of 2024 ICME Grand Challenge, achieving accuracies of 47.95% and 75.36% on two benchmarks of the UAV-Human dataset by outperforming most existing methods. Our code will be publicly available at: https://github.com/liujf69/ICMEW2024-Track10.

4/26/2024

MK-SGN: A Spiking Graph Convolutional Network with Multimodal Fusion and Knowledge Distillation for Skeleton-based Action Recognition

Naichuan Zheng, Hailun Xia, Zeyu Liang

In recent years, skeleton-based action recognition, leveraging multimodal Graph Convolutional Networks (GCN), has achieved remarkable results. However, due to their deep structure and reliance on continuous floating-point operations, GCN-based methods are energy-intensive. To address this issue, we propose an innovative Spiking Graph Convolutional Network with Multimodal Fusion and Knowledge Distillation (MK-SGN). By merging the energy efficiency of Spiking Neural Network (SNN) with the graph representation capability of GCN, the proposed MK-SGN reduces energy consumption while maintaining recognition accuracy. Firstly, we convert GCN into Spiking Graph Convolutional Network (SGN) and construct a foundational Base-SGN for skeleton-based action recognition, establishing a new benchmark and paving the way for future research exploration. Secondly, we further propose a Spiking Multimodal Fusion module (SMF), leveraging mutual information to process multimodal data more efficiently. Additionally, we introduce a spiking attention mechanism and design a Spatio Graph Convolution module with a Spatial Global Spiking Attention mechanism (SA-SGC), enhancing feature learning capability. Furthermore, we delve into knowledge distillation methods from multimodal GCN to SGN and propose a novel, integrated method that simultaneously focuses on both intermediate layer distillation and soft label distillation to improve the performance of SGN. On two challenging datasets for skeleton-based action recognition, MK-SGN outperforms the state-of-the-art GCN-like frameworks in reducing computational load and energy consumption. In contrast, typical GCN methods typically consume more than 35mJ per action sample, while MK-SGN reduces energy consumption by more than 98%.

4/17/2024

High-Performance Inference Graph Convolutional Networks for Skeleton-Based Action Recognition

Ziao Li, Junyi Wang, Bangli Liu, Haibin Cai, Mohamad Saada, Qinggang Meng

Recently, significant achievements have been made in skeleton-based human action recognition with the emergence of graph convolutional networks (GCNs). However, the state-of-the-art (SOTA) models used for this task focus on constructing more complex higher-order connections between joint nodes to describe skeleton information, which leads to complex inference processes and high computational costs. To address the slow inference speed caused by overly complex model structures, we introduce re-parameterization and over-parameterization techniques to GCNs and propose two novel high-performance inference GCNs, namely HPI-GCN-RP and HPI-GCN-OP. After the completion of model training, model parameters are fixed. HPI-GCN-RP adopts the re-parameterization technique to transform the high-performance training model into a fast inference model through linear transformations, which achieves a higher inference speed with competitive model performance. HPI-GCN-OP further utilizes the over-parameterization technique to achieve higher performance improvement by introducing additional inference parameters, albeit with slightly decreased inference speed. The experimental results on the two skeleton-based action recognition datasets demonstrate the effectiveness of our approach. Our HPI-GCN-OP achieves performance comparable to the current SOTA models, with inference speeds five times faster. Specifically, our HPI-GCN-OP achieves an accuracy of 93% on the cross-subject split of the NTU-RGB+D 60 dataset, and 90.1% on the cross-subject benchmark of the NTU-RGB+D 120 dataset. Code is available at github.com/lizaowo/HPI-GCN.

6/19/2024

GCN-DevLSTM: Path Development for Skeleton-Based Action Recognition

Lei Jiang, Weixin Yang, Xin Zhang, Hao Ni

Skeleton-based action recognition (SAR) in videos is an important but challenging task in computer vision. The recent state-of-the-art (SOTA) models for SAR are primarily based on graph convolutional neural networks (GCNs), which are powerful in extracting the spatial information of skeleton data. However, it is not yet clear whether such GCN-based models can effectively capture the temporal dynamics of human action sequences. To this end, we propose the G-Dev layer, which exploits the path development -- a principled and parsimonious representation for sequential data by leveraging the Lie group structure. By integrating the G-Dev layer, the hybrid G-DevLSTM module enhances the traditional LSTM to reduce the time dimension while retaining high-frequency information. It can be conveniently applied to any temporal graph data, complementing existing advanced GCN-based models. Our empirical studies on the NTU60, NTU120 and Chalearn2013 datasets demonstrate that our proposed GCN-DevLSTM network consistently improves the strong GCN baseline models and achieves SOTA results with superior robustness in SAR tasks. The code is available at https://github.com/DeepIntoStreams/GCN-DevLSTM.

5/28/2024