Semi-supervised Active Learning for Video Action Detection

2312.07169

Published 4/4/2024 by Ayush Singh, Aayush J Rana, Akash Kumar, Shruti Vyas, Yogesh Singh Rawat

Abstract

In this work, we focus on label efficient learning for video action detection. We develop a novel semi-supervised active learning approach which utilizes both labeled as well as unlabeled data along with informative sample selection for action detection. Video action detection requires spatio-temporal localization along with classification, which poses several challenges for both active learning informative sample selection as well as semi-supervised learning pseudo label generation. First, we propose NoiseAug, a simple augmentation strategy which effectively selects informative samples for video action detection. Next, we propose fft-attention, a novel technique based on high-pass filtering which enables effective utilization of pseudo labels for SSL in video action detection by emphasizing the relevant activity region within a video. We evaluate the proposed approach on three different benchmark datasets, UCF101-24, JHMDB-21, and YouTube-VOS. First, we demonstrate its effectiveness on video action detection where the proposed approach outperforms prior works in semi-supervised and weakly-supervised learning along with several baseline approaches in both UCF101-24 and JHMDB-21. Next, we also show its effectiveness on YouTube-VOS for video object segmentation, demonstrating its generalization capability for other dense prediction tasks in videos. The code and models are publicly available at: https://github.com/AKASH2907/semi-sup-active-learning

Overview

This paper proposes a semi-supervised active learning approach for video action detection, which aims to train accurate detectors from a limited number of labeled samples.
The key idea is to train a model on both labeled and unlabeled data, and then actively select informative samples for human annotation so the model improves iteratively.
The authors evaluate the approach on UCF101-24 and JHMDB-21, where it outperforms prior semi-supervised and weakly-supervised methods along with several baselines, and on YouTube-VOS for video object segmentation.

Plain English Explanation

The paper focuses on the challenge of annotating video data for action detection, which is a common task in computer vision. Annotating videos can be very time-consuming, as it requires humans to carefully label all the actions that occur in each video.

The researchers' solution is to use a semi-supervised approach, which means they leverage both labeled video data (where actions have already been annotated) as well as unlabeled video data (where no annotations exist yet). By using both types of data, the model can learn patterns and features without needing as much labeled data.

Additionally, the researchers propose an "active learning" approach. This means the model can actively identify the video frames or clips that would be most informative for a human annotator to label. By focusing annotation efforts on the most useful samples, the model can be iteratively improved with fewer total annotations.

The key benefit of this semi-supervised active learning approach is that it can achieve high performance on video action detection tasks while minimizing the amount of expensive, manual annotation work required. This could save significant time and effort compared to a traditional fully-supervised approach.

Technical Explanation

The paper presents a semi-supervised active learning framework for video action detection. It consists of three main components:

Semi-supervised learning: The model is trained on both labeled and unlabeled video data. For the unlabeled data, pseudo-labels generated by the current model provide the supervision. The abstract introduces fft-attention, a technique based on high-pass filtering, to make better use of these pseudo-labels by emphasizing the relevant activity region within a video.
Active learning: The model selects the most informative video clips for a human annotator to label, prioritizing the unlabeled clips it is least confident about. The abstract names the selection strategy NoiseAug, a simple augmentation strategy.
Iterative refinement: The annotated clips are added to the training set, and the model is fine-tuned. This process repeats over multiple iterations, with the model progressively improving as it gains access to more labeled data.
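
The selection and pseudo-labeling steps above can be illustrated with a minimal sketch. It is not the paper's NoiseAug procedure: it assumes plain entropy-based uncertainty over clip-level class probabilities and ignores spatio-temporal localization. The function names, the `budget` parameter, and the `pseudo_threshold` cut-off are hypothetical and chosen only for illustration.

```python
import math


def entropy(probs):
    """Shannon entropy of a class-probability vector (higher = less certain)."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def split_unlabeled(predictions, budget, pseudo_threshold=0.9):
    """Split unlabeled clips into an annotation queue and a pseudo-labeled set.

    predictions: dict mapping clip id -> list of class probabilities.
    budget: number of clips sent to the human annotator this round (assumed).
    pseudo_threshold: hypothetical confidence cut-off for keeping a pseudo-label.
    """
    # Rank clips from most to least uncertain.
    ranked = sorted(predictions, key=lambda cid: entropy(predictions[cid]), reverse=True)
    to_annotate = ranked[:budget]

    # Remaining clips keep a pseudo-label only if the model is confident enough.
    pseudo_labels = {}
    for cid in ranked[budget:]:
        probs = predictions[cid]
        best = max(range(len(probs)), key=probs.__getitem__)
        if probs[best] >= pseudo_threshold:
            pseudo_labels[cid] = best
    return to_annotate, pseudo_labels


# Toy predictions for four unlabeled clips over three action classes.
predictions = {
    "clip_a": [0.34, 0.33, 0.33],
    "clip_b": [0.95, 0.03, 0.02],
    "clip_c": [0.60, 0.30, 0.10],
    "clip_d": [0.05, 0.92, 0.03],
}
to_annotate, pseudo_labels = split_unlabeled(predictions, budget=1)
print(to_annotate)    # clips queued for human annotation
print(pseudo_labels)  # confident clips with their predicted class index
```

In an iterative setup, the clips in `to_annotate` would be labeled and moved into the training set, the model would be retrained on labeled plus pseudo-labeled data, and the split would be recomputed for the next round.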

The authors evaluate their approach on three benchmark datasets: UCF101-24 and JHMDB-21 for video action detection, and YouTube-VOS for video object segmentation. On UCF101-24 and JHMDB-21, the method outperforms prior semi-supervised and weakly-supervised approaches along with several baselines while requiring fewer human annotations. The YouTube-VOS results indicate that it generalizes to other dense prediction tasks in videos.
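
The abstract describes fft-attention as a technique based on high-pass filtering. The following is a minimal NumPy sketch of a generic frequency-domain high-pass filter applied to a single grayscale frame. It is not the paper's fft-attention module: the `high_pass` function, the circular `cutoff` radius, and the toy frame are assumptions made only to show what high-pass filtering does to an image.

```python
import numpy as np


def high_pass(frame, cutoff):
    """Zero out spatial frequencies within `cutoff` of the DC component.

    frame: 2D array (one grayscale video frame).
    cutoff: radius, in frequency bins, of the low-frequency region to remove.
    """
    h, w = frame.shape
    # Move the zero-frequency (DC) term to the center of the spectrum.
    spectrum = np.fft.fftshift(np.fft.fft2(frame))
    yy, xx = np.ogrid[:h, :w]
    dist = np.sqrt((yy - h // 2) ** 2 + (xx - w // 2) ** 2)
    # Suppress low frequencies, which carry smooth background content.
    spectrum[dist <= cutoff] = 0
    # Back to the spatial domain; the imaginary part is numerical noise.
    return np.real(np.fft.ifft2(np.fft.ifftshift(spectrum)))


# Toy frame: flat background with a brighter square "activity" region.
frame = np.full((32, 32), 0.5)
frame[12:20, 12:20] = 1.0
filtered = high_pass(frame, cutoff=2)
print(filtered.shape, float(abs(filtered.mean())))
```

Because the DC term is removed, the filtered frame has zero mean and a perfectly flat frame maps to zeros, so what remains is driven by edges and fine detail. How such a response is turned into an attention signal over pseudo-labels is specific to the paper and is not reproduced here.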

Critical Analysis

The paper presents a well-designed and thorough evaluation of the proposed semi-supervised active learning framework. The authors clearly articulate the motivation and challenges around video action detection, and their approach seems like a practical solution to address the annotation bottleneck.

One potential limitation is that the active learning component relies on the model's own uncertainty estimates, which could be biased or miscalibrated, especially in the early iterations of the training process. The authors do not discuss how they address this issue or evaluate the reliability of the uncertainty estimates.

Additionally, the paper focuses on improving overall action detection performance, but does not provide much insight into how the active learning process affects the diversity of the annotated samples or the model's ability to detect rare or unusual actions. These aspects could be worth investigating further.

Overall, the paper makes a compelling case for the benefits of semi-supervised active learning for video action detection, and the results suggest it is a promising direction for further research and real-world applications.

Conclusion

This paper presents a novel semi-supervised active learning approach for video action detection, which aims to address the challenge of annotating large-scale video datasets. By leveraging both labeled and unlabeled data, and actively selecting the most informative samples for annotation, the proposed framework outperforms prior semi-supervised and weakly-supervised methods on UCF101-24 and JHMDB-21 while reducing the manual annotation effort required.

The key insights and contributions of this work include the effective integration of semi-supervised learning and active learning techniques, as well as the demonstration of their practical benefits for video understanding tasks. This research has the potential to greatly streamline the data annotation process and open up new possibilities for large-scale video analysis in a wide range of applications.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

ActNetFormer: Transformer-ResNet Hybrid Method for Semi-Supervised Action Recognition in Videos

Sharana Dharshikgan Suresh Dass, Hrishav Bakul Barua, Ganesh Krishnasamy, Raveendran Paramesran, Raphael C.-W. Phan

Human action or activity recognition in videos is a fundamental task in computer vision with applications in surveillance and monitoring, self-driving cars, sports analytics, human-robot interaction and many more. Traditional supervised methods require large annotated datasets for training, which are expensive and time-consuming to acquire. This work proposes a novel approach using Cross-Architecture Pseudo-Labeling with contrastive learning for semi-supervised action recognition. Our framework leverages both labeled and unlabelled data to robustly learn action representations in videos, combining pseudo-labeling with contrastive learning for effective learning from both types of samples. We introduce a novel cross-architecture approach where 3D Convolutional Neural Networks (3D CNNs) and video transformers (VIT) are utilised to capture different aspects of action representations; hence we call it ActNetFormer. The 3D CNNs excel at capturing spatial features and local dependencies in the temporal domain, while VIT excels at capturing long-range dependencies across frames. By integrating these complementary architectures within the ActNetFormer framework, our approach can effectively capture both local and global contextual information of an action. This comprehensive representation learning enables the model to achieve better performance in semi-supervised action recognition tasks by leveraging the strengths of each of these architectures. Experimental results on standard action recognition datasets demonstrate that our approach performs better than the existing methods, achieving state-of-the-art performance with only a fraction of labeled data. The official website of this work is available at: https://github.com/rana2149/ActNetFormer.

4/10/2024

cs.CV cs.AI cs.HC cs.LG cs.MM

Learning Discriminative Spatio-temporal Representations for Semi-supervised Action Recognition

Yu Wang, Sanping Zhou, Kun Xia, Le Wang

Semi-supervised action recognition aims to improve spatio-temporal reasoning ability with a few labeled data in conjunction with a large amount of unlabeled data. Despite recent advancements, existing powerful methods are still prone to making ambiguous predictions under scarce labeled data, embodied as the limitation of distinguishing different actions with similar spatio-temporal information. In this paper, we approach this problem by empowering the model with two capabilities, namely discriminative spatial modeling and temporal structure modeling, for learning discriminative spatio-temporal representations. Specifically, we propose an Adaptive Contrastive Learning (ACL) strategy. It assesses the confidence of all unlabeled samples by the class prototypes of the labeled data, and adaptively selects positive-negative samples from a pseudo-labeled sample bank to construct contrastive learning. Additionally, we introduce a Multi-scale Temporal Learning (MTL) strategy. It could highlight informative semantics from long-term clips and integrate them into the short-term clip while suppressing noisy information. Afterwards, these two new techniques are integrated in a unified framework to encourage the model to make accurate predictions. Extensive experiments on UCF101, HMDB51 and Kinetics400 show the superiority of our method over prior state-of-the-art approaches.

4/26/2024

cs.CV

A Semantic and Motion-Aware Spatiotemporal Transformer Network for Action Detection

Matthew Korban, Peter Youngs, Scott T. Acton

This paper presents a novel spatiotemporal transformer network that introduces several original components to detect actions in untrimmed videos. First, the multi-feature selective semantic attention model calculates the correlations between spatial and motion features to model spatiotemporal interactions between different action semantics properly. Second, the motion-aware network encodes the locations of action semantics in video frames utilizing the motion-aware 2D positional encoding algorithm. Such a motion-aware mechanism memorizes the dynamic spatiotemporal variations in action frames that current methods cannot exploit. Third, the sequence-based temporal attention model captures the heterogeneous temporal dependencies in action frames. In contrast to standard temporal attention used in natural language processing, primarily aimed at finding similarities between linguistic words, the proposed sequence-based temporal attention is designed to determine both the differences and similarities between video frames that jointly define the meaning of actions. The proposed approach outperforms the state-of-the-art solutions on four spatiotemporal action datasets: AVA 2.2, AVA 2.1, UCF101-24, and EPIC-Kitchens.

5/15/2024

cs.CV

Towards Weakly Supervised End-to-end Learning for Long-video Action Recognition

Jiaming Zhou, Hanjun Li, Kun-Yu Lin, Junwei Liang

Developing end-to-end action recognition models on long videos is fundamental and crucial for long-video action understanding. Due to the unaffordable cost of end-to-end training on the whole long videos, existing works generally train models on short clips trimmed from long videos. However, this "trimming-then-training" practice requires action interval annotations for clip-level supervision, i.e., knowing which actions are trimmed into the clips. Unfortunately, collecting such annotations is very expensive and prevents model training at scale. To this end, this work aims to build a weakly supervised end-to-end framework for training recognition models on long videos, with only video-level action category labels. Without knowing the precise temporal locations of actions in long videos, our proposed weakly supervised framework, namely AdaptFocus, estimates where and how likely the actions will occur to adaptively focus on informative action clips for end-to-end training. The effectiveness of the proposed AdaptFocus framework is demonstrated on three long-video datasets. Furthermore, for downstream long-video tasks, our AdaptFocus framework provides a weakly supervised feature extraction pipeline for extracting more robust long-video features, such that the state-of-the-art methods on downstream tasks are significantly advanced. We will release the code and models.

5/27/2024

cs.CV