A Survey on Backbones for Deep Video Action Recognition

arXiv:2405.05584

Published 5/10/2024 by Zixuan Tang, Youjun Zhao, Yuhang Wen, Mengyuan Liu


Abstract

Action recognition is a key technology in building interactive metaverses. With the rapid development of deep learning, methods in action recognition have also achieved great advancement. Researchers design and implement the backbones referring to multiple standpoints, which leads to the diversity of methods and encountering new challenges. This paper reviews several action recognition methods based on deep neural networks. We introduce these methods in three parts: 1) Two-Streams networks and their variants, which, specifically in this paper, use RGB video frame and optical flow modality as input; 2) 3D convolutional networks, which make efforts in taking advantage of RGB modality directly while extracting different motion information is no longer necessary; 3) Transformer-based methods, which introduce the model from natural language processing into computer vision and video understanding. We offer objective sights in this review and hopefully provide a reference for future research.


Overview

This paper reviews several deep learning-based methods for action recognition, a key technology for building interactive metaverses.
The methods are categorized into three main groups: Two-Stream networks, 3D convolutional networks, and Transformer-based methods.
The paper aims to provide an objective overview of these approaches and serve as a reference for future research in this field.

Plain English Explanation

Action recognition is a technology that allows computers to identify and understand the actions and behaviors of people in video footage. This is a crucial capability for building interactive metaverses, which are virtual worlds where people can interact with each other and their environment in real time.

Over the past several years, deep learning has led to significant advancements in action recognition. Researchers have explored different approaches, each with its own strengths and weaknesses. This paper categorizes these methods into three main groups:

Two-Stream Networks: These models take two types of input - RGB video frames and optical flow (a representation of motion) - and process them separately before combining the results. This allows the model to capture both visual and motion information.
3D Convolutional Networks: These models directly process the RGB video frames, without the need for a separate optical flow input. They learn spatial and temporal features jointly from the frames, so motion information no longer has to be extracted in a separate step.
Transformer-Based Methods: These models borrow techniques from natural language processing, using Transformer architectures to process the video data. This approach has shown promising results in video understanding tasks.

By reviewing these diverse approaches, the paper aims to provide a comprehensive understanding of the current state-of-the-art in action recognition, and to serve as a reference for researchers working to further advance this important technology.

Technical Explanation

Two-Stream Networks: Two-Stream networks use two separate input modalities: RGB video frames and optical flow. The RGB frames capture the visual appearance of the action, while the optical flow represents the motion information. The two streams are processed independently, often using convolutional neural networks, and their outputs are then combined to make the final prediction. This approach allows the model to leverage both visual and motion cues for improved action recognition.
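The combination step described above can be sketched as a weighted average of the two streams' class probabilities. This is only an illustration: averaging is one common fusion choice and the summary does not say which rule any particular model uses. The function names, the `flow_weight` parameter, and the example logits are hypothetical.

```python
import math


def softmax(logits):
    """Turn raw class scores into probabilities (shifted for numerical stability)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def late_fusion(rgb_logits, flow_logits, flow_weight=0.5):
    """Assumed fusion rule: weighted average of per-stream class probabilities."""
    rgb_probs = softmax(rgb_logits)    # appearance (RGB) stream
    flow_probs = softmax(flow_logits)  # motion (optical flow) stream
    return [
        (1.0 - flow_weight) * r + flow_weight * f
        for r, f in zip(rgb_probs, flow_probs)
    ]


# Hypothetical per-class scores from two independently trained streams.
rgb_logits = [2.0, 1.0, 0.1]   # the RGB stream prefers class 0
flow_logits = [0.5, 2.5, 0.2]  # the flow stream prefers class 1
fused = late_fusion(rgb_logits, flow_logits)
predicted = max(range(len(fused)), key=fused.__getitem__)
```

In this toy case the flow stream is more confident than the RGB stream, so the fused prediction follows the motion cue.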

3D Convolutional Networks: In contrast to Two-Stream networks, 3D convolutional networks directly process the RGB video frames, without the need for a separate optical flow input. These models use 3D convolutions, which can capture both spatial and temporal information within the video data. By learning features directly from the raw video, 3D convolutional networks can potentially extract richer motion representations and avoid the need for explicit optical flow computation.
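To make the idea of a 3D convolution concrete, here is a minimal single-channel sketch in plain Python. It slides a kernel over time, height, and width with no padding and stride 1, and, as deep learning frameworks do, it computes a cross-correlation (the kernel is not flipped). The function name, the tiny clip, and the hand-written kernel are illustrative assumptions; real networks learn many such kernels over multi-channel inputs.

```python
def conv3d_valid(clip, kernel):
    """Naive 3D cross-correlation over a clip indexed as clip[t][y][x]."""
    T, H, W = len(clip), len(clip[0]), len(clip[0][0])
    kt, kh, kw = len(kernel), len(kernel[0]), len(kernel[0][0])
    out = []
    for t in range(T - kt + 1):
        plane = []
        for y in range(H - kh + 1):
            row = []
            for x in range(W - kw + 1):
                acc = 0.0
                # Sum over the temporal and both spatial extents of the kernel.
                for dt in range(kt):
                    for dy in range(kh):
                        for dx in range(kw):
                            acc += clip[t + dt][y + dy][x + dx] * kernel[dt][dy][dx]
                row.append(acc)
            plane.append(row)
        out.append(plane)
    return out


# Three 2x2 frames: a bright pixel moves right between frames 0 and 1, then stays.
clip = [
    [[1, 0], [0, 0]],
    [[0, 1], [0, 0]],
    [[0, 1], [0, 0]],
]
# A 2x1x1 kernel that subtracts the previous frame from the current one.
temporal_diff = [[[-1]], [[1]]]
motion = conv3d_valid(clip, temporal_diff)
```

Because the kernel spans two frames, its response is non-zero only where the pixel moved, which is the kind of motion cue a 3D network can pick up from raw RGB frames without precomputed optical flow.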

Transformer-Based Methods: The recent success of Transformer architectures in natural language processing has inspired researchers to apply similar techniques to video understanding tasks, including action recognition. Transformer-based models for action recognition leverage the self-attention mechanism to capture long-range dependencies in the video data, which can be important for understanding complex actions and activities.
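The self-attention mechanism mentioned above can be sketched as scaled dot-product attention over a sequence of per-frame feature vectors. For brevity this sketch assumes identity query, key, and value projections and a single head, whereas real models learn these projections. The token values and function names are hypothetical.

```python
import math


def softmax(scores):
    """Normalize scores into attention weights (shifted for numerical stability)."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Single-head scaled dot-product self-attention with identity projections."""
    dim = len(tokens[0])
    outputs, weights = [], []
    for query in tokens:
        # Similarity of this token to every token, regardless of temporal distance.
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in tokens
        ]
        w = softmax(scores)
        # Each output is a weighted mix of all tokens in the sequence.
        mixed = [
            sum(w[j] * tokens[j][c] for j in range(len(tokens)))
            for c in range(dim)
        ]
        weights.append(w)
        outputs.append(mixed)
    return outputs, weights


# Hypothetical frame features: frames 0 and 3 look alike, frames 1 and 2 differ.
frame_tokens = [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0], [1.0, 0.0]]
outputs, weights = self_attention(frame_tokens)
```

Here frame 0 attends to the distant frame 3 exactly as strongly as to itself, and more strongly than to its immediate neighbor, which illustrates how attention can relate frames that are far apart in time.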

Critical Analysis

The paper provides a comprehensive overview of the current state-of-the-art in deep learning-based action recognition, highlighting the key strengths and tradeoffs of the different approaches. However, it does not delve into the specific limitations or potential issues with each method.

For example, while Two-Stream networks can effectively capture both visual and motion information, they require the additional computation and storage of optical flow data, which can be costly. 3D convolutional networks avoid that optical flow step, but 3D convolutions are themselves computationally expensive, and their local receptive fields may struggle to capture complex, long-range dependencies in the video data.

Similarly, the paper does not discuss potential biases or fairness concerns that may arise from these action recognition models, nor does it address the ethical implications of deploying such technologies in real-world applications, such as surveillance or human-robot interaction.

Conclusion

This paper provides a valuable overview of the current approaches to deep learning-based action recognition, a critical technology for building interactive metaverses. By categorizing the methods into Two-Stream networks, 3D convolutional networks, and Transformer-based models, the authors offer a comprehensive understanding of the state of the art in this rapidly evolving field.

While the paper does not delve deeply into the limitations or potential issues with these methods, it serves as a useful reference for researchers and practitioners working to advance the capabilities of action recognition systems. As the development of metaverses and other interactive virtual environments continues, the insights and trends highlighted in this review will likely play an important role in guiding future research and development in this crucial area of computer vision.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!
