Full-Stage Pseudo Label Quality Enhancement for Weakly-supervised Temporal Action Localization

Read original: arXiv:2407.08971 - Published 7/15/2024 by Qianhan Feng, Wenshuo Li, Tong Lin, Xinghao Chen

Full-Stage Pseudo Label Quality Enhancement for Weakly-supervised Temporal Action Localization

Overview

This paper presents a method for improving the quality of pseudo-labels used in weakly-supervised temporal action localization.
Weakly-supervised learning uses only video-level labels, rather than precise temporal annotations, to train models.
The authors propose a "full-stage" approach that enhances pseudo-labels throughout the training process, leading to better localization performance.

Plain English Explanation

Temporal action localization is the task of identifying when specific actions occur in a video. Traditional approaches require detailed annotations of the start and end times of each action, which is time-consuming and expensive to obtain. Towards Adaptive Pseudo-Label Learning for Semi-Supervised Temporal Action Localization and Test-Time Zero-Shot Temporal Action Localization have explored weakly-supervised methods that only need video-level labels, which are easier to collect.

The key idea in this paper is to improve the quality of the "pseudo-labels" - the localization predictions made by the model during training, which are used as targets even though they are not perfect. The authors propose a "full-stage" approach that refines the pseudo-labels throughout the training process, rather than just at the end. This helps the model learn better representations and make more accurate predictions. Exploring Scalability of Self-Training for Open-Vocabulary Temporal Action Localization and POTLOC: Pseudo-Label Oriented Transformer for Point-Supervised Temporal Action Localization have also looked at ways to leverage pseudo-labels more effectively.

The key innovation is that the pseudo-label enhancement happens at multiple stages, not just at the end. This allows the model to learn better representations and make more accurate predictions, leading to improved temporal action localization performance.

Technical Explanation

The paper proposes a "full-stage pseudo label quality enhancement" (FS-PLQE) approach for weakly-supervised temporal action localization. The key components are:

Pseudo-Label Generation: The model first generates initial pseudo-labels by predicting action boundaries in the training videos.
Pseudo-Label Enhancement: The pseudo-labels are then refined at multiple stages during training. This is done by:
- Temporal Attention Refinement: Attending to the most informative temporal regions to improve localization.
- Confidence-Aware Masking: Adaptively masking low-confidence pseudo-labels to avoid learning from noisy data.
- Inter-Class Separation: Enforcing separation between different action classes to improve discrimination.
Multi-Stage Training: The model is trained end-to-end, with pseudo-label enhancement happening at multiple stages to progressively improve the quality of the pseudo-labels.

The authors evaluate their approach on several benchmark datasets and show significant improvements over previous weakly-supervised methods. The "full-stage" refinement of pseudo-labels is the key innovation that enables better learning of action localization.

Critical Analysis

The paper presents a thorough and well-designed approach to improving weakly-supervised temporal action localization. The authors have addressed several important challenges, such as noisy pseudo-labels and the need for better inter-class separation.

One potential limitation is that the method relies on having access to video-level action labels during training. In some real-world scenarios, even this level of supervision may not be available. Towards Generalizable Temporal Action Localization has explored methods for zero-shot localization without any labeled training data.

Additionally, the computational complexity of the full-stage pseudo-label enhancement may be a concern, especially for large-scale datasets. The authors could investigate ways to make the process more efficient, or explore trade-offs between computational cost and localization performance.

Overall, this paper makes a valuable contribution to the field of weakly-supervised temporal action localization, and the proposed techniques could be further extended and refined in future work.

Conclusion

This paper presents a novel approach for enhancing the quality of pseudo-labels used in weakly-supervised temporal action localization. By refining the pseudo-labels at multiple stages during training, the model is able to learn better representations and make more accurate predictions, leading to improved localization performance.

The key innovations include temporal attention refinement, confidence-aware masking, and inter-class separation, all of which work together to progressively improve the pseudo-labels. This "full-stage" approach is a significant advancement over previous weakly-supervised methods.

The techniques described in this paper could have wider implications for other areas of computer vision and machine learning, where leveraging weak or noisy labels is a common challenge. As the field continues to explore more efficient and scalable learning approaches, ideas like those presented here will likely play an important role.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Follow @aimodelsfyi on 𝕏 →

Related Papers

Full-Stage Pseudo Label Quality Enhancement for Weakly-supervised Temporal Action Localization

Qianhan Feng, Wenshuo Li, Tong Lin, Xinghao Chen

Weakly-supervised Temporal Action Localization (WSTAL) aims to localize actions in untrimmed videos using only video-level supervision. Latest WSTAL methods introduce pseudo label learning framework to bridge the gap between classification-based training and inferencing targets at localization, and achieve cutting-edge results. In these frameworks, a classification-based model is used to generate pseudo labels for a regression-based student model to learn from. However, the quality of pseudo labels in the framework, which is a key factor to the final result, is not carefully studied. In this paper, we propose a set of simple yet efficient pseudo label quality enhancement mechanisms to build our FuSTAL framework. FuSTAL enhances pseudo label quality at three stages: cross-video contrastive learning at proposal Generation-Stage, prior-based filtering at proposal Selection-Stage and EMA-based distillation at Training-Stage. These designs enhance pseudo label quality at different stages in the framework, and help produce more informative, less false and smoother action proposals. With the help of these comprehensive designs at all stages, FuSTAL achieves an average mAP of 50.8% on THUMOS'14, outperforming the previous best method by 1.2%, and becomes the first method to reach the milestone of 50%.

7/15/2024

STAT: Towards Generalizable Temporal Action Localization

Yangcen Liu, Ziyi Liu, Yuanhao Zhai, Wen Li, David Doerman, Junsong Yuan

Weakly-supervised temporal action localization (WTAL) aims to recognize and localize action instances with only video-level labels. Despite the significant progress, existing methods suffer from severe performance degradation when transferring to different distributions and thus may hardly adapt to real-world scenarios . To address this problem, we propose the Generalizable Temporal Action Localization task (GTAL), which focuses on improving the generalizability of action localization methods. We observed that the performance decline can be primarily attributed to the lack of generalizability to different action scales. To address this problem, we propose STAT (Self-supervised Temporal Adaptive Teacher), which leverages a teacher-student structure for iterative refinement. Our STAT features a refinement module and an alignment module. The former iteratively refines the model's output by leveraging contextual information and helps adapt to the target scale. The latter improves the refinement process by promoting a consensus between student and teacher models. We conduct extensive experiments on three datasets, THUMOS14, ActivityNet1.2, and HACS, and the results show that our method significantly improves the Baseline methods under the cross-distribution evaluation setting, even approaching the same-distribution evaluation performance.

4/23/2024

Towards Adaptive Pseudo-label Learning for Semi-Supervised Temporal Action Localization

Feixiang Zhou, Bryan Williams, Hossein Rahmani

Alleviating noisy pseudo labels remains a key challenge in Semi-Supervised Temporal Action Localization (SS-TAL). Existing methods often filter pseudo labels based on strict conditions, but they typically assess classification and localization quality separately, leading to suboptimal pseudo-label ranking and selection. In particular, there might be inaccurate pseudo labels within selected positives, alongside reliable counterparts erroneously assigned to negatives. To tackle these problems, we propose a novel Adaptive Pseudo-label Learning (APL) framework to facilitate better pseudo-label selection. Specifically, to improve the ranking quality, Adaptive Label Quality Assessment (ALQA) is proposed to jointly learn classification confidence and localization reliability, followed by dynamically selecting pseudo labels based on the joint score. Additionally, we propose an Instance-level Consistency Discriminator (ICD) for eliminating ambiguous positives and mining potential positives simultaneously based on inter-instance intrinsic consistency, thereby leading to a more precise selection. We further introduce a general unsupervised Action-aware Contrastive Pre-training (ACP) to enhance the discrimination both within actions and between actions and backgrounds, which benefits SS-TAL. Extensive experiments on THUMOS14 and ActivityNet v1.3 demonstrate that our method achieves state-of-the-art performance under various semi-supervised settings.

7/26/2024

Probabilistic Vision-Language Representation for Weakly Supervised Temporal Action Localization

Geuntaek Lim, Hyunwoo Kim, Joonsoo Kim, Yukyung Choi

Weakly supervised temporal action localization (WTAL) aims to detect action instances in untrimmed videos using only video-level annotations. Since many existing works optimize WTAL models based on action classification labels, they encounter the task discrepancy problem (i.e., localization-by-classification). To tackle this issue, recent studies have attempted to utilize action category names as auxiliary semantic knowledge through vision-language pre-training (VLP). However, there are still areas where existing research falls short. Previous approaches primarily focused on leveraging textual information from language models but overlooked the alignment of dynamic human action and VLP knowledge in a joint space. Furthermore, the deterministic representation employed in previous studies struggles to capture fine-grained human motions. To address these problems, we propose a novel framework that aligns human action knowledge and VLP knowledge in a probabilistic embedding space. Moreover, we propose intra- and inter-distribution contrastive learning to enhance the probabilistic embedding space based on statistical similarities. Extensive experiments and ablation studies reveal that our method significantly outperforms all previous state-of-the-art methods. Code is available at https://github.com/sejong-rcv/PVLR.

8/13/2024