SiT-MLP: A Simple MLP with Point-wise Topology Feature Learning for Skeleton-based Action Recognition

Read original: arXiv:2308.16018 - Published 4/9/2024 by Shaojie Zhang, Jianqin Yin, Yonghao Dang, Jiajun Fu


Overview

Graph Convolution Networks (GCNs) have shown promising performance in skeleton-based action recognition tasks.
However, previous GCN-based methods rely heavily on elaborate human priors and complex feature aggregation mechanisms, which can limit their generalizability and effectiveness.
To address these issues, the authors propose a novel Spatial Topology Gating Unit (STGU), an MLP-based variant that can capture co-occurrence topology features without using extra priors.
The authors also introduce the first MLP-based model, SiT-MLP, for skeleton-based action recognition, which achieves competitive performance with significantly fewer parameters.

Plain English Explanation

The paper introduces a new technique called the Spatial Topology Gating Unit (STGU) as a simpler alternative to Graph Convolution Networks (GCNs) for skeleton-based action recognition. Previous GCN-based methods relied heavily on hand-crafted human knowledge and complex feature aggregation, which made them less flexible and effective.

The STGU is a more straightforward, MLP-based (Multi-Layer Perceptron) approach that can capture the spatial relationships between different joints in the skeleton without needing as much prior information. It uses a new "gate-based" mechanism to learn these spatial dependencies directly from the input data.

Building on the STGU, the authors also propose the first MLP-based model, called SiT-MLP, for skeleton-based action recognition. This new model performs competitively with previous, more complex GCN-based methods while using significantly fewer parameters, which makes it more efficient.

The key idea is to find a way to understand the spatial relationships between different body parts (joints) without relying too heavily on pre-defined rules or human expertise. This makes the models more flexible and able to generalize better to new situations.

Technical Explanation

The paper proposes a novel Spatial Topology Gating Unit (STGU), an MLP-based variant that captures co-occurrence topology features encoding the spatial dependency across all joints of the skeleton. Unlike previous GCN-based methods, which rely on elaborate human priors and complex feature aggregation mechanisms, the STGU learns point-wise topology features through a new gate-based feature interaction mechanism.
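To make "human priors" concrete, the sketch below shows a generic graph-convolution layer (in the style of Kipf and Welling, not any specific method discussed in the paper) in which joints exchange information only along a hand-specified bone list. The 5-joint chain, the function names, and the weight shapes are hypothetical and chosen purely for illustration.

```python
import numpy as np

def normalized_adjacency(edges, num_joints):
    """Build D^{-1/2} (A + I) D^{-1/2} from a hand-written bone list."""
    a = np.eye(num_joints)                 # self-loops
    for i, j in edges:
        a[i, j] = a[j, i] = 1.0            # undirected bone between joints i and j
    d_inv_sqrt = 1.0 / np.sqrt(a.sum(axis=1))
    return a * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

def gcn_layer(x, a_norm, w):
    """One graph-convolution step: aggregate neighbours, project, apply ReLU."""
    return np.maximum(a_norm @ x @ w, 0.0)

# Hypothetical toy skeleton: a chain of 5 joints (0-1-2-3-4).
toy_edges = [(0, 1), (1, 2), (2, 3), (3, 4)]
a_norm = normalized_adjacency(toy_edges, num_joints=5)

x = np.ones((5, 3))                        # 5 joints, 3 input channels
w = np.ones((3, 2))                        # project to 2 output channels
out = gcn_layer(x, a_norm, w)              # shape (5, 2)
```

Here the prior is the edge list itself: joints 0 and 2 share no bone, so `a_norm[0, 2]` is zero and they cannot interact within a single layer. An approach without such priors has to infer which joints matter to each other from the data.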

The gate-based mechanism generates an attention map from the input sample to activate the features point-to-point, allowing the model to learn the spatial dependencies directly from the data without requiring additional human knowledge or hand-crafted features.
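A minimal NumPy sketch of this kind of sample-dependent gating follows. It is not the authors' implementation: the channel split (borrowed from gMLP-style gating units), the dot-product softmax attention, and the names `w_q`, `w_k`, and `gating_unit_sketch` are all assumptions made for illustration. It only shows the general pattern of computing a joint-to-joint map from the input itself and using the result to gate features element-wise.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def gating_unit_sketch(x, w_q, w_k):
    """Illustrative gate for one frame.

    x:   (V, C) features for V joints, C even.
    w_q: (C/2, D) hypothetical query projection.
    w_k: (C/2, D) hypothetical key projection.
    Returns the gated features (V, C/2) and the attention map (V, V).
    """
    u, v = np.split(x, 2, axis=-1)         # two halves along the channel axis
    q = v @ w_q                            # (V, D)
    k = v @ w_k                            # (V, D)
    # Joint-to-joint map computed from this sample only, with no skeleton graph.
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]), axis=-1)
    mixed = attn @ v                       # mix information across joints
    return u * mixed, attn                 # element-wise (point-to-point) gate

rng = np.random.default_rng(0)
x = rng.standard_normal((25, 8))           # e.g. 25 joints, 8 channels
w_q = rng.standard_normal((4, 4))
w_k = rng.standard_normal((4, 4))
gated, attn = gating_unit_sketch(x, w_q, w_k)
```

Because `attn` is recomputed for every input, the mixing pattern between joints can differ from one action sample to the next, in contrast to a fixed, hand-defined adjacency matrix.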

Based on the STGU, the authors introduce the first MLP-based model, SiT-MLP, for skeleton-based action recognition. Compared to previous GCN-based methods on three large-scale datasets, SiT-MLP achieves competitive performance while significantly reducing the number of parameters.

The authors' approach differs from other recent work, such as StepNet, which uses a more complex part-aware network, and MLP-based Transformer learners, which rely on self-attention mechanisms. The STGU and SiT-MLP suggest that a simpler MLP-based design can be effective for skeleton-based action recognition.

Critical Analysis

The paper presents a promising approach to improving the generalizability and efficiency of skeleton-based action recognition models. By introducing the STGU and the SiT-MLP model, the authors have shown that a more straightforward, MLP-based architecture can achieve competitive performance compared to previous, more complex GCN-based methods.

However, the paper does not thoroughly analyze the limitations of the proposed approach. For example, it would be useful to know how STGU and SiT-MLP compare with other recent advances in the field, such as GVT, a Graph-based Vision Transformer, or biologically plausible spiking neural networks, which aim to capture spatial dependencies in a more neurally inspired way.

The paper would also be stronger if it reported how the STGU and SiT-MLP models perform on more challenging or diverse datasets, how robust they are to noisy or incomplete data, and how well they generalize to other types of action recognition tasks.

Conclusion

The paper presents a novel Spatial Topology Gating Unit (STGU) and the first MLP-based model, SiT-MLP, for skeleton-based action recognition. The key innovation is the use of a gate-based feature interaction mechanism to capture the spatial dependencies between joints without relying on elaborate human priors or complex feature aggregation methods.

The results show that the simpler, MLP-based approach can achieve competitive performance compared to previous GCN-based methods, while significantly reducing the number of parameters. This suggests that the STGU and SiT-MLP models could be more generalizable and efficient alternatives for skeleton-based action recognition tasks.

Overall, the paper makes a valuable contribution to the field by demonstrating the potential of MLP-based architectures in this domain and providing a new, more streamlined approach to modeling spatial relationships in skeleton data.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!


Related Papers


SiT-MLP: A Simple MLP with Point-wise Topology Feature Learning for Skeleton-based Action Recognition

Shaojie Zhang, Jianqin Yin, Yonghao Dang, Jiajun Fu

Graph convolution networks (GCNs) have achieved remarkable performance in skeleton-based action recognition. However, previous GCN-based methods rely on elaborate human priors excessively and construct complex feature aggregation mechanisms, which limits the generalizability and effectiveness of networks. To solve these problems, we propose a novel Spatial Topology Gating Unit (STGU), an MLP-based variant without extra priors, to capture the co-occurrence topology features that encode the spatial dependency across all joints. In STGU, to learn the point-wise topology features, a new gate-based feature interaction mechanism is introduced to activate the features point-to-point by the attention map generated from the input sample. Based on the STGU, we propose the first MLP-based model, SiT-MLP, for skeleton-based action recognition in this work. Compared with previous methods on three large-scale datasets, SiT-MLP achieves competitive performance. In addition, SiT-MLP reduces the parameters significantly with favorable results. The code will be available at https://github.com/BUPTSJZhang/SiT-MLP.

4/9/2024

Multi-Scale Spatial-Temporal Self-Attention Graph Convolutional Networks for Skeleton-based Action Recognition

Ikuo Nakamura

Skeleton-based gesture recognition methods have achieved high success using Graph Convolutional Network (GCN). In addition, context-dependent adaptive topology as a neighborhood vertex information and attention mechanism leverages a model to better represent actions. In this paper, we propose self-attention GCN hybrid model, Multi-Scale Spatial-Temporal self-attention (MSST)-GCN to effectively improve modeling ability to achieve state-of-the-art results on several datasets. We utilize spatial self-attention module with adaptive topology to understand intra-frame interactions within a frame among different body parts, and temporal self-attention module to examine correlations between frames of a node. These two are followed by multi-scale convolution network with dilations, which not only captures the long-range temporal dependencies of joints but also the long-range spatial dependencies (i.e., long-distance dependencies) of node temporal behaviors. They are combined into high-level spatial-temporal representations and output the predicted action with the softmax classifier.

4/4/2024


MK-SGN: A Spiking Graph Convolutional Network with Multimodal Fusion and Knowledge Distillation for Skeleton-based Action Recognition

Naichuan Zheng, Hailun Xia, Zeyu Liang

In recent years, skeleton-based action recognition, leveraging multimodal Graph Convolutional Networks (GCN), has achieved remarkable results. However, due to their deep structure and reliance on continuous floating-point operations, GCN-based methods are energy-intensive. To address this issue, we propose an innovative Spiking Graph Convolutional Network with Multimodal Fusion and Knowledge Distillation (MK-SGN). By merging the energy efficiency of Spiking Neural Network (SNN) with the graph representation capability of GCN, the proposed MK-SGN reduces energy consumption while maintaining recognition accuracy. Firstly, we convert GCN into Spiking Graph Convolutional Network (SGN) and construct a foundational Base-SGN for skeleton-based action recognition, establishing a new benchmark and paving the way for future research exploration. Secondly, we further propose a Spiking Multimodal Fusion module (SMF), leveraging mutual information to process multimodal data more efficiently. Additionally, we introduce a spiking attention mechanism and design a Spatio Graph Convolution module with a Spatial Global Spiking Attention mechanism (SA-SGC), enhancing feature learning capability. Furthermore, we delve into knowledge distillation methods from multimodal GCN to SGN and propose a novel, integrated method that simultaneously focuses on both intermediate layer distillation and soft label distillation to improve the performance of SGN. On two challenging datasets for skeleton-based action recognition, MK-SGN outperforms the state-of-the-art GCN-like frameworks in reducing computational load and energy consumption. In contrast, typical GCN methods typically consume more than 35mJ per action sample, while MK-SGN reduces energy consumption by more than 98%.

4/17/2024


High-Performance Inference Graph Convolutional Networks for Skeleton-Based Action Recognition

Ziao Li, Junyi Wang, Bangli Liu, Haibin Cai, Mohamad Saada, Qinggang Meng

Recently, the significant achievements have been made in skeleton-based human action recognition with the emergence of graph convolutional networks (GCNs). However, the state-of-the-art (SOTA) models used for this task focus on constructing more complex higher-order connections between joint nodes to describe skeleton information, which leads to complex inference processes and high computational costs. To address the slow inference speed caused by overly complex model structures, we introduce re-parameterization and over-parameterization techniques to GCNs and propose two novel high-performance inference GCNs, namely HPI-GCN-RP and HPI-GCN-OP. After the completion of model training, model parameters are fixed. HPI-GCN-RP adopts re-parameterization technique to transform high-performance training model into fast inference model through linear transformations, which achieves a higher inference speed with competitive model performance. HPI-GCN-OP further utilizes over-parameterization technique to achieve higher performance improvement by introducing additional inference parameters, albeit with slightly decreased inference speed. The experimental results on the two skeleton-based action recognition datasets demonstrate the effectiveness of our approach. Our HPI-GCN-OP achieves performance comparable to the current SOTA models, with inference speeds five times faster. Specifically, our HPI-GCN-OP achieves an accuracy of 93% on the cross-subject split of the NTU-RGB+D 60 dataset, and 90.1% on the cross-subject benchmark of the NTU-RGB+D 120 dataset. Code is available at github.com/lizaowo/HPI-GCN.

6/19/2024