Towards Adaptive Pseudo-label Learning for Semi-Supervised Temporal Action Localization

Read original: arXiv:2407.07673 - Published 7/26/2024 by Feixiang Zhou, Bryan Williams, Hossein Rahmani

Overview

This paper proposes Adaptive Pseudo-label Learning (APL), a framework for semi-supervised temporal action localization (SS-TAL) that uses both labeled and unlabeled video to improve action detection.
The key idea is to select pseudo labels more precisely: rather than assessing classification and localization quality separately, the method scores them jointly and selects pseudo labels dynamically based on that joint score.
The authors report state-of-the-art performance on THUMOS14 and ActivityNet v1.3 under various semi-supervised settings.

Plain English Explanation

The paper introduces a new technique for temporal action localization, the task of finding when actions start and end in a video and classifying them. The researchers' goal is to improve the accuracy of action detection by using both labeled and unlabeled video data.

Typically, training action detection models requires a large amount of labeled video data, which can be time-consuming and expensive to collect. To overcome this, the researchers propose an adaptive pseudo-label learning approach. This method automatically generates "pseudo-labels" for the unlabeled video data, which can then be used to supplement the labeled data during training.

The key innovation is that pseudo-labels are selected adaptively rather than by a fixed rule. The model judges both how confident it is about the action class and how reliable the predicted time boundaries are, and keeps only the pseudo-labels that score well on both. This helps ensure that only reliable pseudo-labels are used, which can improve the overall performance of the action detection model.
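
One simple way to picture adaptive selection is a confidence cutoff that moves with the model's recent predictions instead of staying fixed. The sketch below is an illustration only, not the authors' method: the function names, the exponential-moving-average update, and the `momentum` and `floor` parameters are all assumptions made for the example.

```python
def update_threshold(prev_threshold, confidences, momentum=0.9, floor=0.5):
    """Move the cutoff toward the mean confidence of the current batch.

    Hypothetical update rule (an exponential moving average); the paper's
    actual selection mechanism is not reproduced here.
    """
    if not confidences:
        return prev_threshold
    batch_mean = sum(confidences) / len(confidences)
    new_threshold = momentum * prev_threshold + (1.0 - momentum) * batch_mean
    # Never let the cutoff fall below a fixed floor.
    return max(floor, new_threshold)


def select_pseudo_labels(predictions, threshold):
    """Keep (label, confidence) pairs whose confidence meets the cutoff."""
    return [(label, conf) for label, conf in predictions if conf >= threshold]
```

With a cutoff like this, a model that grows more confident over training raises its own bar for what counts as a usable pseudo-label.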

The researchers evaluate their method on several standard benchmarks for temporal action localization and show that it outperforms other state-of-the-art semi-supervised approaches. This suggests that their adaptive pseudo-label learning technique is an effective way to leverage unlabeled data and improve the accuracy of action detection in videos.

Technical Explanation

The paper presents Adaptive Pseudo-label Learning (APL), a framework for semi-supervised temporal action localization. It targets a weakness the authors identify in existing methods: pseudo labels are filtered under strict conditions, but classification and localization quality are assessed separately, so some selected positives are inaccurate while some reliable candidates are wrongly assigned to negatives.

Specifically, the method consists of three main components:

Adaptive Label Quality Assessment (ALQA): Jointly learns classification confidence and localization reliability, then dynamically selects pseudo labels based on the joint score.
Instance-level Consistency Discriminator (ICD): Eliminates ambiguous positives and mines potential positives based on inter-instance intrinsic consistency.
Action-aware Contrastive Pre-training (ACP): A general unsupervised pre-training step that enhances discrimination both within actions and between actions and backgrounds.
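
The paper's abstract describes ranking pseudo labels by a joint score that combines classification confidence with localization reliability. A minimal sketch of that idea follows. It is not the authors' formulation: the `Proposal` type, the product used as the joint score, and the mean-based cutoff controlled by `ratio` are assumptions chosen to keep the example small.

```python
from dataclasses import dataclass


@dataclass
class Proposal:
    """A candidate action segment predicted on an unlabeled video."""
    start: float     # segment start, in seconds
    end: float       # segment end, in seconds
    label: str       # predicted action class
    cls_conf: float  # classification confidence in [0, 1]
    loc_rel: float   # localization reliability in [0, 1]


def joint_score(p):
    # Assumption: combine the two qualities by a simple product.
    return p.cls_conf * p.loc_rel


def select_by_joint_score(proposals, ratio=1.0):
    """Keep proposals scoring at least `ratio` times the mean joint score.

    The cutoff depends on the current proposals, so it is dynamic rather
    than a fixed constant. Results are sorted from best to worst.
    """
    if not proposals:
        return []
    scores = [joint_score(p) for p in proposals]
    cutoff = ratio * sum(scores) / len(scores)
    kept = [p for p, s in zip(proposals, scores) if s >= cutoff]
    return sorted(kept, key=joint_score, reverse=True)
```

Under this scoring, a proposal with a very confident class prediction but poorly placed boundaries ranks below one that is moderately good on both, which is the kind of reordering that separate assessment can miss.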

During training, the model is optimized on both the labeled data and the pseudo-labeled data. Careful pseudo-label selection plays a crucial role, because errors from unreliable pseudo-labels can accumulate and degrade the model's performance.
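
Training on labeled and pseudo-labeled data together is commonly written as a supervised loss plus a weighted loss over the pseudo-labels that survive selection. The sketch below shows that generic pattern under stated assumptions: `combined_loss`, `keep_mask`, and `unsup_weight` are hypothetical names, and the per-sample losses are taken as already computed.

```python
def combined_loss(sup_losses, unsup_losses, keep_mask, unsup_weight=1.0):
    """Supervised mean loss plus a weighted mean over kept pseudo-labels.

    sup_losses:   per-sample losses on labeled data (non-empty)
    unsup_losses: per-sample losses against pseudo-labels
    keep_mask:    True where the pseudo-label passed selection
    """
    sup = sum(sup_losses) / len(sup_losses)
    kept = [loss for loss, keep in zip(unsup_losses, keep_mask) if keep]
    # Rejected pseudo-labels contribute nothing, so their errors
    # cannot feed back into the model.
    unsup = sum(kept) / len(kept) if kept else 0.0
    return sup + unsup_weight * unsup
```

For example, `combined_loss([1.0, 3.0], [2.0, 10.0], [True, False], unsup_weight=0.5)` ignores the rejected pseudo-label with loss 10.0.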

The researchers evaluate their method on two temporal action localization benchmarks, THUMOS14 and ActivityNet v1.3, and report state-of-the-art performance under various semi-supervised settings.
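
Temporal action localization benchmarks are typically scored with mean average precision at one or more temporal IoU (tIoU) thresholds, where tIoU measures how much a predicted segment overlaps a ground-truth segment. A small sketch of the overlap computation is given here for orientation. The function name and the `(start, end)` tuple representation are assumptions for the example.

```python
def temporal_iou(seg_a, seg_b):
    """Intersection over union of two (start, end) time segments."""
    start_a, end_a = seg_a
    start_b, end_b = seg_b
    # Overlap is zero when the segments do not intersect.
    inter = max(0.0, min(end_a, end_b) - max(start_a, start_b))
    union = (end_a - start_a) + (end_b - start_b) - inter
    return inter / union if union > 0 else 0.0
```

A prediction is usually counted as correct only when its class matches and its tIoU with a ground-truth segment reaches the chosen threshold.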

Critical Analysis

The paper presents a well-designed and empirically validated approach for leveraging unlabeled data to improve temporal action localization. Its adaptive pseudo-label selection directly targets unreliable pseudo-labels, a common challenge in semi-supervised learning.

One potential limitation of the proposed method is that it relies on the assumption that the model's prediction confidence is a reliable indicator of the pseudo-label quality. In cases where the model's confidence is miscalibrated, the adaptive threshold might not be able to effectively filter out the noisy pseudo-labels. Addressing this potential issue could be an interesting direction for future research.
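
Miscalibration of this kind can be measured. One standard diagnostic is expected calibration error (ECE), which bins predictions by confidence and compares each bin's average confidence with its accuracy. The sketch below is a generic illustration and is not taken from the paper. Here `correct` is assumed to mark whether each prediction matched a ground-truth label on held-out labeled data.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted average gap between confidence and accuracy per bin."""
    n = len(confidences)
    if n == 0:
        return 0.0
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # A confidence of exactly 1.0 falls into the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / n) * abs(avg_conf - accuracy)
    return ece
```

A large value means that confidence is a poor proxy for correctness, which is the situation in which confidence-based pseudo-label filtering is most likely to let noisy labels through.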

Additionally, the paper focuses on the semi-supervised setting, where a small amount of labeled data is available. It would be valuable to investigate the performance of the proposed method in a fully unsupervised setting, where no labeled data is provided, or in a weakly-supervised setting, where only partial labels are available.

Overall, the paper presents a compelling and well-executed approach to leveraging unlabeled data for temporal action localization. The adaptive pseudo-label learning technique is a promising contribution to the field, and the results demonstrate its potential to advance the state of the art in this important computer vision task.

Conclusion

This paper introduces Adaptive Pseudo-label Learning (APL) for semi-supervised temporal action localization. By jointly assessing classification confidence and localization reliability and selecting pseudo labels dynamically, the approach leverages unlabeled data to improve action detection performance.

The key innovation is more precise pseudo-label selection. Adaptive Label Quality Assessment (ALQA) improves how pseudo labels are ranked, the Instance-level Consistency Discriminator (ICD) removes ambiguous positives and recovers potential ones, and Action-aware Contrastive Pre-training (ACP) provides unsupervised pre-training that benefits the semi-supervised stage.

The authors report state-of-the-art results on THUMOS14 and ActivityNet v1.3 under various semi-supervised settings. This suggests that the technique is a promising direction for temporal action localization, particularly in settings where labeled data is scarce.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Towards Adaptive Pseudo-label Learning for Semi-Supervised Temporal Action Localization

Feixiang Zhou, Bryan Williams, Hossein Rahmani

Alleviating noisy pseudo labels remains a key challenge in Semi-Supervised Temporal Action Localization (SS-TAL). Existing methods often filter pseudo labels based on strict conditions, but they typically assess classification and localization quality separately, leading to suboptimal pseudo-label ranking and selection. In particular, there might be inaccurate pseudo labels within selected positives, alongside reliable counterparts erroneously assigned to negatives. To tackle these problems, we propose a novel Adaptive Pseudo-label Learning (APL) framework to facilitate better pseudo-label selection. Specifically, to improve the ranking quality, Adaptive Label Quality Assessment (ALQA) is proposed to jointly learn classification confidence and localization reliability, followed by dynamically selecting pseudo labels based on the joint score. Additionally, we propose an Instance-level Consistency Discriminator (ICD) for eliminating ambiguous positives and mining potential positives simultaneously based on inter-instance intrinsic consistency, thereby leading to a more precise selection. We further introduce a general unsupervised Action-aware Contrastive Pre-training (ACP) to enhance the discrimination both within actions and between actions and backgrounds, which benefits SS-TAL. Extensive experiments on THUMOS14 and ActivityNet v1.3 demonstrate that our method achieves state-of-the-art performance under various semi-supervised settings.

7/26/2024

Full-Stage Pseudo Label Quality Enhancement for Weakly-supervised Temporal Action Localization

Qianhan Feng, Wenshuo Li, Tong Lin, Xinghao Chen

Weakly-supervised Temporal Action Localization (WSTAL) aims to localize actions in untrimmed videos using only video-level supervision. Latest WSTAL methods introduce pseudo label learning framework to bridge the gap between classification-based training and inferencing targets at localization, and achieve cutting-edge results. In these frameworks, a classification-based model is used to generate pseudo labels for a regression-based student model to learn from. However, the quality of pseudo labels in the framework, which is a key factor to the final result, is not carefully studied. In this paper, we propose a set of simple yet efficient pseudo label quality enhancement mechanisms to build our FuSTAL framework. FuSTAL enhances pseudo label quality at three stages: cross-video contrastive learning at proposal Generation-Stage, prior-based filtering at proposal Selection-Stage and EMA-based distillation at Training-Stage. These designs enhance pseudo label quality at different stages in the framework, and help produce more informative, less false and smoother action proposals. With the help of these comprehensive designs at all stages, FuSTAL achieves an average mAP of 50.8% on THUMOS'14, outperforming the previous best method by 1.2%, and becomes the first method to reach the milestone of 50%.

7/15/2024

Learning Discriminative Spatio-temporal Representations for Semi-supervised Action Recognition

Yu Wang, Sanping Zhou, Kun Xia, Le Wang

Semi-supervised action recognition aims to improve spatio-temporal reasoning ability with a few labeled data in conjunction with a large amount of unlabeled data. Albeit recent advancements, existing powerful methods are still prone to making ambiguous predictions under scarce labeled data, embodied as the limitation of distinguishing different actions with similar spatio-temporal information. In this paper, we approach this problem by empowering the model two aspects of capability, namely discriminative spatial modeling and temporal structure modeling for learning discriminative spatio-temporal representations. Specifically, we propose an Adaptive Contrastive Learning (ACL) strategy. It assesses the confidence of all unlabeled samples by the class prototypes of the labeled data, and adaptively selects positive-negative samples from a pseudo-labeled sample bank to construct contrastive learning. Additionally, we introduce a Multi-scale Temporal Learning (MTL) strategy. It could highlight informative semantics from long-term clips and integrate them into the short-term clip while suppressing noisy information. Afterwards, both of these two new techniques are integrated in a unified framework to encourage the model to make accurate predictions. Extensive experiments on UCF101, HMDB51 and Kinetics400 show the superiority of our method over prior state-of-the-art approaches.

4/26/2024

Semi-Supervised Variational Adversarial Active Learning via Learning to Rank and Agreement-Based Pseudo Labeling

Zongyao Lyu, William J. Beksi

Active learning aims to alleviate the amount of labor involved in data labeling by automating the selection of unlabeled samples via an acquisition function. For example, variational adversarial active learning (VAAL) leverages an adversarial network to discriminate unlabeled samples from labeled ones using latent space information. However, VAAL has the following shortcomings: (i) it does not exploit target task information, and (ii) unlabeled data is only used for sample selection rather than model training. To address these limitations, we introduce novel techniques that significantly improve the use of abundant unlabeled data during training and take into account the task information. Concretely, we propose an improved pseudo-labeling algorithm that leverages information from all unlabeled data in a semi-supervised manner, thus allowing a model to explore a richer data space. In addition, we develop a ranking-based loss prediction module that converts predicted relative ranking information into a differentiable ranking loss. This loss can be embedded as a rank variable into the latent space of a variational autoencoder and then trained with a discriminator in an adversarial fashion for sample selection. We demonstrate the superior performance of our approach over the state of the art on various image classification and segmentation benchmark datasets.

8/26/2024