Text Prompt with Normality Guidance for Weakly Supervised Video Anomaly Detection

Read original: arXiv:2404.08531 - Published 4/15/2024 by Zhiwei Yang, Jing Liu, Peng Wu

Text Prompt with Normality Guidance for Weakly Supervised Video Anomaly Detection

Overview

This paper proposes a novel approach for weakly supervised video anomaly detection using text prompts and normality guidance.
The key ideas are to use language models to generate text prompts that capture the concept of normality, and then use these prompts to guide the training of a video anomaly detection model.
This method aims to address the challenge of obtaining high-quality labeled anomaly data for training, which is often difficult and expensive.

Plain English Explanation

The paper introduces a new way to detect unusual or anomalous events in videos, without needing a lot of labeled training data. The basic idea is to use language models to create text descriptions that capture what "normal" looks like. These text prompts are then used to guide the training of a video anomaly detection model, helping it learn to recognize normal activities and flag anything unusual.

The key advantage of this approach is that it can work with limited labeled data. Instead of needing lots of examples of both normal and abnormal behavior, the model can learn from just the "normal" prompts generated by the language model. This makes the training process more efficient and practical, as collecting comprehensive labeled data for anomaly detection is often very difficult and time-consuming.

Technical Explanation

The paper proposes a "Text Prompt with Normality Guidance" (PromptAD) framework for weakly supervised video anomaly detection. The core idea is to leverage large language models (LLMs) to generate text prompts that describe normal activities, and then use these prompts to guide the training of a video anomaly detection model.

Specifically, the authors first train an LLM on a large corpus of text data to learn a representation of "normality". They then use this LLM to generate text prompts that describe normal behaviors in a given video domain (e.g. "people walking normally on a sidewalk"). These prompts are used to create pseudo-normal video samples, which are combined with any available labeled normal data to train the video anomaly detection model.

The video anomaly detection model is a spatio-temporal neural network that learns to predict the generated normal prompts for each video frame. Frames that deviate significantly from the predicted prompts are flagged as anomalous. The authors show that this "normality guidance" approach outperforms prior weakly supervised methods on several video anomaly detection benchmarks.

Critical Analysis

The key strength of this work is its ability to leverage language models to address the data scarcity challenge in video anomaly detection. By generating synthetic normal samples from text prompts, the approach can train effective models with limited labeled anomaly data. This is an important practical advantage, as obtaining comprehensive anomaly data is often very difficult.

However, the paper does not extensively explore the limitations of the approach. For example, the quality and diversity of the generated text prompts may impact the model's ability to capture the full range of normal behaviors. Additionally, the video anomaly detection model may struggle to generalize to anomalies that are not well-represented by the provided prompts.

Further research is needed to understand the robustness of this approach across different video domains and anomaly types. Evaluating the method's sensitivity to the choice of language model and prompt engineering techniques would also be valuable.

Conclusion

The "Text Prompt with Normality Guidance" framework represents a promising direction for advancing weakly supervised video anomaly detection. By leveraging large language models to generate normality descriptions, the approach can train effective anomaly detection models with limited labeled data. This could significantly improve the practicality and scalability of anomaly detection systems in real-world applications.

While the paper demonstrates the effectiveness of this approach on several benchmarks, further research is needed to fully understand its limitations and robustness. Continued exploration of this intersection between language models and video analysis could lead to important advancements in the field of anomaly detection.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Follow @aimodelsfyi on 𝕏 →

Related Papers

Text Prompt with Normality Guidance for Weakly Supervised Video Anomaly Detection

Zhiwei Yang, Jing Liu, Peng Wu

Weakly supervised video anomaly detection (WSVAD) is a challenging task. Generating fine-grained pseudo-labels based on weak-label and then self-training a classifier is currently a promising solution. However, since the existing methods use only RGB visual modality and the utilization of category text information is neglected, thus limiting the generation of more accurate pseudo-labels and affecting the performance of self-training. Inspired by the manual labeling process based on the event description, in this paper, we propose a novel pseudo-label generation and self-training framework based on Text Prompt with Normality Guidance (TPWNG) for WSVAD. Our idea is to transfer the rich language-visual knowledge of the contrastive language-image pre-training (CLIP) model for aligning the video event description text and corresponding video frames to generate pseudo-labels. Specifically, We first fine-tune the CLIP for domain adaptation by designing two ranking losses and a distributional inconsistency loss. Further, we propose a learnable text prompt mechanism with the assist of a normality visual prompt to further improve the matching accuracy of video event description text and video frames. Then, we design a pseudo-label generation module based on the normality guidance to infer reliable frame-level pseudo-labels. Finally, we introduce a temporal context self-adaptive learning module to learn the temporal dependencies of different video events more flexibly and accurately. Extensive experiments show that our method achieves state-of-the-art performance on two benchmark datasets, UCF-Crime and XD-Viole

4/15/2024

Weakly Supervised Video Anomaly Detection and Localization with Spatio-Temporal Prompts

Peng Wu, Xuerong Zhou, Guansong Pang, Zhiwei Yang, Qingsen Yan, Peng Wang, Yanning Zhang

Current weakly supervised video anomaly detection (WSVAD) task aims to achieve frame-level anomalous event detection with only coarse video-level annotations available. Existing works typically involve extracting global features from full-resolution video frames and training frame-level classifiers to detect anomalies in the temporal dimension. However, most anomalous events tend to occur in localized spatial regions rather than the entire video frames, which implies existing frame-level feature based works may be misled by the dominant background information and lack the interpretation of the detected anomalies. To address this dilemma, this paper introduces a novel method called STPrompt that learns spatio-temporal prompt embeddings for weakly supervised video anomaly detection and localization (WSVADL) based on pre-trained vision-language models (VLMs). Our proposed method employs a two-stream network structure, with one stream focusing on the temporal dimension and the other primarily on the spatial dimension. By leveraging the learned knowledge from pre-trained VLMs and incorporating natural motion priors from raw videos, our model learns prompt embeddings that are aligned with spatio-temporal regions of videos (e.g., patches of individual frames) for identify specific local regions of anomalies, enabling accurate video anomaly detection while mitigating the influence of background information. Without relying on detailed spatio-temporal annotations or auxiliary object detection/tracking, our method achieves state-of-the-art performance on three public benchmarks for the WSVADL task.

8/14/2024

Learn Suspected Anomalies from Event Prompts for Video Anomaly Detection

Chenchen Tao, Xiaohao Peng, Chong Wang, Jiafei Wu, Puning Zhao, Jun Wang, Jiangbo Qian

Most models for weakly supervised video anomaly detection (WS-VAD) rely on multiple instance learning, aiming to distinguish normal and abnormal snippets without specifying the type of anomaly. However, the ambiguous nature of anomaly definitions across contexts may introduce inaccuracy in discriminating abnormal and normal events. To show the model what is anomalous, a novel framework is proposed to guide the learning of suspected anomalies from event prompts. Given a textual prompt dictionary of potential anomaly events and the captions generated from anomaly videos, the semantic anomaly similarity between them could be calculated to identify the suspected events for each video snippet. It enables a new multi-prompt learning process to constrain the visual-semantic features across all videos, as well as provides a new way to label pseudo anomalies for self-training. To demonstrate its effectiveness, comprehensive experiments and detailed ablation studies are conducted on four datasets, namely XD-Violence, UCF-Crime, TAD, and ShanghaiTech. Our proposed model outperforms most state-of-the-art methods in terms of AP or AUC (86.5%, hl{90.4}%, 94.4%, and 97.4%). Furthermore, it shows promising performance in open-set and cross-dataset cases. The data, code, and models can be found at: url{https://github.com/shiwoaz/lap}.

9/4/2024

✨

Multimodal Attention-Enhanced Feature Fusion-based Weekly Supervised Anomaly Violence Detection

Yuta Kaneko, Abu Saleh Musa Miah, Najmul Hassan, Hyoun-Sup Lee, Si-Woong Jang, Jungpil Shin

Weakly supervised video anomaly detection (WS-VAD) is a crucial area in computer vision for developing intelligent surveillance systems. This system uses three feature streams: RGB video, optical flow, and audio signals, where each stream extracts complementary spatial and temporal features using an enhanced attention module to improve detection accuracy and robustness. In the first stream, we employed an attention-based, multi-stage feature enhancement approach to improve spatial and temporal features from the RGB video where the first stage consists of a ViT-based CLIP module, with top-k features concatenated in parallel with I3D and Temporal Contextual Aggregation (TCA) based rich spatiotemporal features. The second stage effectively captures temporal dependencies using the Uncertainty-Regulated Dual Memory Units (UR-DMU) model, which learns representations of normal and abnormal data simultaneously, and the third stage is employed to select the most relevant spatiotemporal features. The second stream extracted enhanced attention-based spatiotemporal features from the flow data modality-based feature by taking advantage of the integration of the deep learning and attention module. The audio stream captures auditory cues using an attention module integrated with the VGGish model, aiming to detect anomalies based on sound patterns. These streams enrich the model by incorporating motion and audio signals often indicative of abnormal events undetectable through visual analysis alone. The concatenation of the multimodal fusion leverages the strengths of each modality, resulting in a comprehensive feature set that significantly improves anomaly detection accuracy and robustness across three datasets. The extensive experiment and high performance with the three benchmark datasets proved the effectiveness of the proposed system over the existing state-of-the-art system.

9/18/2024