Weakly Supervised Video Anomaly Detection and Localization with Spatio-Temporal Prompts

Read original: arXiv:2408.05905 - Published 8/14/2024 by Peng Wu, Xuerong Zhou, Guansong Pang, Zhiwei Yang, Qingsen Yan, Peng Wang, Yanning Zhang


Overview

This research paper presents STPrompt, an approach to weakly supervised video anomaly detection and localization that learns spatio-temporal prompt embeddings.
The method builds on pre-trained vision-language models (VLMs) to detect anomalies and localize them within frames, using only coarse video-level labels rather than frame-level or region-level annotations.
The key contributions include a two-stream network (one stream temporal, one spatial), prompt embeddings aligned with spatio-temporal regions of a video, and an evaluation on three public benchmarks.

Plain English Explanation

The paper introduces a new way to detect and pinpoint unusual or anomalous events in videos without requiring detailed labels. The researchers use an approach called "weakly supervised learning": each training video is labeled only as a whole (does it contain an anomaly or not), with no indication of which frames or which parts of a frame are anomalous.

To guide the model in this task, the researchers learn "spatio-temporal prompts": prompt embeddings, built on top of a pre-trained vision-language model, that are aligned with specific regions of the video such as patches of individual frames. Because the vision-language model already relates language to visual content, the learned prompts help the model tie an anomaly to the local region where it occurs, instead of being misled by the background.

This approach has several advantages over traditional video anomaly detection methods. First, it avoids frame-level and region-level labeling, which is time-consuming and costly. Second, because most anomalies occur in a small part of the frame, matching prompts against local regions lets the model localize the anomaly and reduces the influence of dominant background content. The researchers evaluated their method on three public benchmarks and report state-of-the-art performance on the weakly supervised detection and localization task.

Technical Explanation

The paper introduces STPrompt, a weakly supervised video anomaly detection and localization framework that learns spatio-temporal prompt embeddings on top of pre-trained vision-language models (VLMs). Unlike fully supervised approaches, it needs no frame-level or spatial annotations, only a video-level label for each training video.
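To make "video-level supervision" concrete, here is a minimal sketch of one common way to train frame-level scores from a single label per video: average the k highest frame scores and apply binary cross-entropy against the video label. This top-k pooling is a standard multiple-instance-learning device in this area and is an assumption here; the summary does not confirm that it is the paper's actual loss. The names `video_score`, `video_level_bce`, and the parameter `k` are hypothetical.

```python
import math


def video_score(frame_scores, k):
    """Average of the k highest frame scores, used as the video-level score.

    Assumption: top-k pooling, a common choice, not confirmed as the paper's.
    """
    if not frame_scores:
        raise ValueError("frame_scores must not be empty")
    top = sorted(frame_scores, reverse=True)[:k]
    return sum(top) / len(top)


def video_level_bce(frame_scores, video_label, k=3, eps=1e-7):
    """Binary cross-entropy between the pooled score and the video label.

    frame_scores: per-frame anomaly probabilities in [0, 1]
    video_label:  1 if the video contains an anomaly, 0 otherwise
    """
    # Clamp to avoid log(0) when the pooled score is exactly 0 or 1.
    p = min(max(video_score(frame_scores, k), eps), 1.0 - eps)
    return -(video_label * math.log(p) + (1 - video_label) * math.log(1.0 - p))
```

Calling `video_level_bce(scores, 1)` on a video whose highest-scoring frames already look anomalous gives a small loss, while the same scores with label 0 give a large one. No frame is ever labeled individually; the gradient reaches frames only through the pooled score.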

The key components of the proposed method are:

Weakly Supervised Training: The model is trained with coarse video-level annotations only. Each training video is marked as normal or anomalous, with no frame-level or region-level labels, which greatly reduces the labeling burden.
Spatio-Temporal Prompts: The method learns prompt embeddings on top of a pre-trained VLM and aligns them with spatio-temporal regions of the video, such as patches of individual frames. It also incorporates motion priors from the raw video.
Anomaly Detection and Localization: A two-stream network handles the two tasks, with one stream focused on the temporal dimension (which frames are anomalous) and the other primarily on the spatial dimension (where in the frame). Matching local regions against the prompts reduces the influence of background information, without detailed spatio-temporal annotations or auxiliary object detection or tracking.
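The idea of matching prompts against local regions can be sketched as follows. Each frame patch gets an embedding, which is compared by cosine similarity with a "normal" and an "abnormal" prompt embedding; a softmax over the two similarities gives a per-patch anomaly score, and the highest-scoring patch is the localization. This is only an illustration under stated assumptions: the two-prompt setup, the `temperature` value, and every function name are hypothetical, and in practice the embeddings would come from a pre-trained vision-language model rather than hand-written vectors.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def patch_anomaly_map(patch_embeddings, normal_prompt, abnormal_prompt,
                      temperature=0.1):
    """Per-patch probability assigned to the 'abnormal' prompt.

    Assumption: one normal and one abnormal prompt embedding; the
    temperature is an illustrative value, not taken from the paper.
    """
    scores = []
    for patch in patch_embeddings:
        logits = [cosine(patch, normal_prompt) / temperature,
                  cosine(patch, abnormal_prompt) / temperature]
        scores.append(softmax(logits)[1])
    return scores


def most_anomalous_patch(scores):
    """Index of the patch with the highest anomaly score."""
    return max(range(len(scores)), key=scores.__getitem__)
```

With toy two-dimensional embeddings such as `normal_prompt = [1.0, 0.0]` and `abnormal_prompt = [0.0, 1.0]`, a patch embedded near `[0.1, 0.9]` scores close to 1 while background patches near `[0.9, 0.1]` score close to 0, which is how region-level scoring can keep a large, normal background from drowning out a small anomalous region.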

The researchers evaluate their method on three public benchmarks and report state-of-the-art performance on the weakly supervised video anomaly detection and localization task.
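This summary does not say which metric the benchmark results use. Frame-level ROC AUC is widely reported for video anomaly detection, so the sketch below shows how it can be computed from per-frame ground-truth labels and predicted scores. It uses the pairwise definition of AUC (the probability that a random anomalous frame scores higher than a random normal frame, with ties counted as half); the function name `roc_auc` is hypothetical, and the quadratic loop is chosen for clarity, not speed.

```python
def roc_auc(labels, scores):
    """ROC AUC via pairwise comparison of anomalous and normal frames.

    labels: 1 for anomalous frames, 0 for normal frames
    scores: predicted anomaly score per frame (higher = more anomalous)
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one anomalous and one normal frame")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # anomalous frame ranked correctly above normal
            elif p == n:
                wins += 0.5   # ties count as half a win
    return wins / (len(pos) * len(neg))
```

A value of 1.0 means every anomalous frame outranks every normal frame, and 0.5 corresponds to random ranking.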

Critical Analysis

The paper presents a promising approach for video anomaly detection that avoids detailed frame-level and spatial annotations. Learning spatio-temporal prompts is a novel way to guide the model, drawing on the knowledge in pre-trained vision-language models to compensate for coarse supervision.

However, the paper does not address the potential limitations of this approach. For example, the quality and coverage of the prompts may significantly impact the model's performance, and designing appropriate prompts could be challenging in practice. Additionally, the paper does not discuss the generalization of the method to more diverse and complex video anomaly scenarios.

Further research could explore ways to automatically generate or refine the spatio-temporal prompts, as well as investigate the robustness of the approach to different types of anomalies and video data. Incorporating additional contextual information, such as scene semantics or object interactions, could also enhance the model's ability to detect and localize anomalies.

Conclusion

This research paper presents a novel weakly supervised video anomaly detection and localization framework that learns spatio-temporal prompts to guide the model. By avoiding detailed frame-level and spatial annotations, this approach could make video anomaly detection more practical for real-world applications. The evaluation on three public benchmarks reports state-of-the-art performance for the proposed method. Further research is needed to address potential limitations and to improve the model's robustness and generalization.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!


Related Papers

Weakly Supervised Video Anomaly Detection and Localization with Spatio-Temporal Prompts

Peng Wu, Xuerong Zhou, Guansong Pang, Zhiwei Yang, Qingsen Yan, Peng Wang, Yanning Zhang

The current weakly supervised video anomaly detection (WSVAD) task aims to achieve frame-level anomalous event detection with only coarse video-level annotations available. Existing works typically involve extracting global features from full-resolution video frames and training frame-level classifiers to detect anomalies in the temporal dimension. However, most anomalous events tend to occur in localized spatial regions rather than the entire video frames, which implies existing frame-level feature-based works may be misled by the dominant background information and lack the interpretation of the detected anomalies. To address this dilemma, this paper introduces a novel method called STPrompt that learns spatio-temporal prompt embeddings for weakly supervised video anomaly detection and localization (WSVADL) based on pre-trained vision-language models (VLMs). Our proposed method employs a two-stream network structure, with one stream focusing on the temporal dimension and the other primarily on the spatial dimension. By leveraging the learned knowledge from pre-trained VLMs and incorporating natural motion priors from raw videos, our model learns prompt embeddings that are aligned with spatio-temporal regions of videos (e.g., patches of individual frames) to identify specific local regions of anomalies, enabling accurate video anomaly detection while mitigating the influence of background information. Without relying on detailed spatio-temporal annotations or auxiliary object detection/tracking, our method achieves state-of-the-art performance on three public benchmarks for the WSVADL task.

8/14/2024

Learn Suspected Anomalies from Event Prompts for Video Anomaly Detection

Chenchen Tao, Xiaohao Peng, Chong Wang, Jiafei Wu, Puning Zhao, Jun Wang, Jiangbo Qian

Most models for weakly supervised video anomaly detection (WS-VAD) rely on multiple instance learning, aiming to distinguish normal and abnormal snippets without specifying the type of anomaly. However, the ambiguous nature of anomaly definitions across contexts may introduce inaccuracy in discriminating abnormal and normal events. To show the model what is anomalous, a novel framework is proposed to guide the learning of suspected anomalies from event prompts. Given a textual prompt dictionary of potential anomaly events and the captions generated from anomaly videos, the semantic anomaly similarity between them could be calculated to identify the suspected events for each video snippet. It enables a new multi-prompt learning process to constrain the visual-semantic features across all videos, as well as provides a new way to label pseudo anomalies for self-training. To demonstrate its effectiveness, comprehensive experiments and detailed ablation studies are conducted on four datasets, namely XD-Violence, UCF-Crime, TAD, and ShanghaiTech. Our proposed model outperforms most state-of-the-art methods in terms of AP or AUC (86.5%, 90.4%, 94.4%, and 97.4%). Furthermore, it shows promising performance in open-set and cross-dataset cases. The data, code, and models can be found at: https://github.com/shiwoaz/lap.

9/4/2024

Text Prompt with Normality Guidance for Weakly Supervised Video Anomaly Detection

Zhiwei Yang, Jing Liu, Peng Wu

Weakly supervised video anomaly detection (WSVAD) is a challenging task. Generating fine-grained pseudo-labels based on weak labels and then self-training a classifier is currently a promising solution. However, existing methods use only the RGB visual modality and neglect category text information, which limits the generation of more accurate pseudo-labels and affects the performance of self-training. Inspired by the manual labeling process based on the event description, in this paper, we propose a novel pseudo-label generation and self-training framework based on Text Prompt with Normality Guidance (TPWNG) for WSVAD. Our idea is to transfer the rich language-visual knowledge of the contrastive language-image pre-training (CLIP) model for aligning the video event description text and corresponding video frames to generate pseudo-labels. Specifically, we first fine-tune the CLIP for domain adaptation by designing two ranking losses and a distributional inconsistency loss. Further, we propose a learnable text prompt mechanism with the assistance of a normality visual prompt to further improve the matching accuracy of video event description text and video frames. Then, we design a pseudo-label generation module based on the normality guidance to infer reliable frame-level pseudo-labels. Finally, we introduce a temporal context self-adaptive learning module to learn the temporal dependencies of different video events more flexibly and accurately. Extensive experiments show that our method achieves state-of-the-art performance on two benchmark datasets, UCF-Crime and XD-Viole

4/15/2024

A Lightweight Video Anomaly Detection Model with Weak Supervision and Adaptive Instance Selection

Yang Wang, Jiaogen Zhou, Jihong Guan

Video anomaly detection aims to determine whether there are any abnormal events, behaviors or objects in a given video, which enables effective and intelligent public safety management. As video anomaly labeling is both time-consuming and expensive, most existing works employ unsupervised or weakly supervised learning methods. This paper focuses on weakly supervised video anomaly detection, in which the training videos are labeled as to whether or not they contain any anomalies, but there is no information about which frames contain them. However, the uncertainty of weakly labeled data and the large model size prevent existing methods from wide deployment in real scenarios, especially in resource-limited situations such as edge computing. In this paper, we develop a lightweight video anomaly detection model. On the one hand, we propose an adaptive instance selection strategy, which is based on the model's current status to select confident instances, thereby mitigating the uncertainty of weakly labeled data and subsequently promoting the model's performance. On the other hand, we design a lightweight multi-level temporal correlation attention module and an hourglass-shaped fully connected layer to construct the model, which can reduce the model parameters to only 0.56% of the existing methods (e.g. RTFM). Our extensive experiments on two public datasets, UCF-Crime and ShanghaiTech, show that our model can achieve comparable or even superior AUC scores compared to the state-of-the-art methods, with a significantly reduced number of model parameters.

7/8/2024