Lightning Fast Video Anomaly Detection via Adversarial Knowledge Distillation

Read original: arXiv:2211.15597 - Published 7/18/2024 by Florinel-Alin Croitoru, Nicolae-Catalin Ristea, Dana Dascalescu, Radu Tudor Ionescu, Fahad Shahbaz Khan, Mubarak Shah

Overview

Proposes a very fast frame-level model for anomaly detection in video
Learns to detect anomalies by distilling knowledge from multiple highly accurate object-level teacher models
Introduces adversarial distillation to improve the fidelity of the student model
Achieves exceptional speed of 1480 FPS while maintaining comparable accuracy to recent methods

Plain English Explanation

The researchers have developed a video anomaly detection model that runs at 1480 frames per second (FPS). This is over 7 times faster than the fastest competing method, and 28 to 62 times faster than object-centric models.
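To put these speed figures in perspective, the short sketch below converts them into per-frame latency and into the throughput they imply for the competing methods. It uses only the numbers quoted above; the function names are illustrative, and the derived values are simple arithmetic rather than measurements reported by the authors.

```python
# Back-of-the-envelope arithmetic using only the figures quoted in the summary.
STUDENT_FPS = 1480.0


def ms_per_frame(fps: float) -> float:
    """Convert throughput (frames per second) to per-frame latency in milliseconds."""
    return 1000.0 / fps


def implied_fps(student_fps: float, speedup: float) -> float:
    """Throughput of a method that the student outpaces by the given factor."""
    return student_fps / speedup


student_latency = ms_per_frame(STUDENT_FPS)
# The speedup is "over 7 times", so this value is an upper bound.
fastest_competitor_fps = implied_fps(STUDENT_FPS, 7)
# Object-centric models are 28 to 62 times slower than the student.
object_centric_fps = (implied_fps(STUDENT_FPS, 62), implied_fps(STUDENT_FPS, 28))

print(f"student latency: {student_latency:.3f} ms per frame")
print(f"fastest competitor: under {fastest_competitor_fps:.0f} FPS")
print(
    "object-centric models: roughly "
    f"{object_centric_fps[0]:.0f} to {object_centric_fps[1]:.0f} FPS"
)
```

In other words, the student spends well under a millisecond on each frame, while the quoted speedups place the fastest competitor below about 211 FPS and the object-centric models at roughly 24 to 53 FPS.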

The key innovation is that the model "distills" or learns from multiple highly accurate teacher models that operate at the object level, while the student itself processes whole frames. By combining the knowledge of these expert teachers, the student model achieves accuracy comparable to recent methods at a much higher speed.

To further improve the student model's performance, the researchers use a technique called adversarial distillation: a discriminator is introduced for each teacher and trained to tell the teacher's anomaly maps apart from the student's, which pushes the student to mimic the teachers' output more closely.
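The two sides of this adversarial game can be sketched in a few lines. The example below is a minimal illustration under stated assumptions, not the authors' implementation: the linear discriminator, its weights, and the toy anomaly maps are all hypothetical, and binary cross-entropy is assumed as the adversarial objective.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def discriminator_prob(anomaly_map, weights, bias):
    """Hypothetical linear discriminator: probability that a map comes from the teacher."""
    logit = sum(w * v for w, v in zip(weights, anomaly_map)) + bias
    return sigmoid(logit)


def discriminator_loss(teacher_map, student_map, weights, bias):
    """Binary cross-entropy with teacher maps labeled 1 (target) and student maps 0 (generated)."""
    p_teacher = discriminator_prob(teacher_map, weights, bias)
    p_student = discriminator_prob(student_map, weights, bias)
    return -(math.log(p_teacher) + math.log(1.0 - p_student))


def student_adversarial_loss(student_map, weights, bias):
    """The student's loss is low when the discriminator mistakes its map for a teacher map."""
    return -math.log(discriminator_prob(student_map, weights, bias))


# Toy flattened 2x2 anomaly maps (illustrative values only).
teacher_map = [0.1, 0.9, 0.2, 0.8]
poor_student_map = [0.5, 0.5, 0.5, 0.5]
good_student_map = [0.1, 0.85, 0.25, 0.8]
weights, bias = [-1.0, 1.0, -1.0, 1.0], -0.5  # assumed, fixed discriminator parameters

poor_loss = student_adversarial_loss(poor_student_map, weights, bias)
good_loss = student_adversarial_loss(good_student_map, weights, bias)
print(f"student adversarial loss, poor map: {poor_loss:.3f}")
print(f"student adversarial loss, good map: {good_loss:.3f}")
```

With these toy values, the student map that resembles the teacher's map receives the lower adversarial loss, which is the pressure that adversarial distillation adds on top of directly matching the teacher. In practice the discriminator and the student would be trained in alternation.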

The result is a highly efficient and accurate video anomaly detection system that could have significant practical applications, such as real-time monitoring and surveillance.

Technical Explanation

The researchers propose a frame-level video anomaly detection model that achieves exceptional speed by distilling knowledge from multiple object-level teacher models. The key components of their approach are:

Knowledge Distillation: The student model learns to detect anomalies by distilling knowledge from a set of highly accurate teacher models that operate at the object level. This allows the student to benefit from the expertise of the teachers while running much faster.
Adversarial Distillation: To further improve the fidelity of the student model, the researchers apply adversarial distillation jointly with standard distillation. An adversarial discriminator is introduced for each teacher and learns to distinguish between the target anomaly maps produced by that teacher and the anomaly maps generated by the student.
Comprehensive Evaluation: The researchers evaluate their model on three benchmark datasets (Avenue, ShanghaiTech, UCSD Ped2) and show that it is over 7 times faster than the fastest competing method, and between 28 and 62 times faster than object-centric models, while maintaining comparable accuracy.
Ablation Study: The researchers conduct a detailed ablation study to justify their architectural design choices and the effectiveness of the individual components of their approach.
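The way standard and adversarial distillation combine across several teachers can be sketched as a single training loss for the student. The code below is a hedged sketch: mean squared error stands in for the standard distillation term, `adv_weight` is a hypothetical balancing factor, and `make_toy_discriminator` is a made-up stand-in for a trained discriminator. None of these choices is confirmed by the summary.

```python
import math
from typing import Callable, Sequence

AnomalyMap = Sequence[float]  # a flattened low-resolution anomaly map


def mse(pred: AnomalyMap, target: AnomalyMap) -> float:
    """Mean squared error, assumed here as the standard distillation term."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target)


def joint_distillation_loss(
    student_maps: Sequence[AnomalyMap],
    teacher_maps: Sequence[AnomalyMap],
    discriminators: Sequence[Callable[[AnomalyMap], float]],
    adv_weight: float = 0.1,  # hypothetical balancing factor
) -> float:
    """Average over teachers of a standard term plus a weighted adversarial term."""
    total = 0.0
    for s_map, t_map, disc in zip(student_maps, teacher_maps, discriminators):
        standard = mse(s_map, t_map)
        # Clamp the discriminator's probability so the logarithm stays finite.
        p_target = min(max(disc(s_map), 1e-7), 1.0 - 1e-7)
        adversarial = -math.log(p_target)
        total += standard + adv_weight * adversarial
    return total / len(teacher_maps)


def make_toy_discriminator(reference: AnomalyMap) -> Callable[[AnomalyMap], float]:
    """Stand-in discriminator: scores a map by its closeness to one teacher's map."""
    return lambda candidate: math.exp(-mse(candidate, reference))


# Two teachers, each with its own target map and its own discriminator.
teacher_maps = [[0.0, 1.0, 0.0, 0.0], [0.1, 0.8, 0.1, 0.0]]
discriminators = [make_toy_discriminator(t) for t in teacher_maps]

close_student = [[0.1, 0.9, 0.0, 0.0], [0.1, 0.7, 0.1, 0.1]]
far_student = [[0.5, 0.5, 0.5, 0.5], [0.5, 0.5, 0.5, 0.5]]

loss_close = joint_distillation_loss(close_student, teacher_maps, discriminators)
loss_far = joint_distillation_loss(far_student, teacher_maps, discriminators)
print(f"loss when student maps are close to the teachers: {loss_close:.4f}")
print(f"loss when student maps are far from the teachers: {loss_far:.4f}")
```

The student here outputs one anomaly map per teacher, and each map is judged both by its direct distance to the teacher's map and by that teacher's discriminator. A real discriminator would be a learned network updated alongside the student rather than a fixed function.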

Critical Analysis

The researchers have presented a highly innovative and efficient approach to video anomaly detection. By leveraging the knowledge of multiple expert teacher models through distillation, they have been able to create a student model that is exceptionally fast without sacrificing too much accuracy.

One potential limitation of the approach, as mentioned in the paper, is that the performance of the student model is still somewhat dependent on the quality of the teacher models. If the teacher models are not sufficiently accurate or diverse, the student may not be able to fully benefit from the distillation process.

Additionally, the researchers note that their method assumes the availability of labeled anomaly data for training the teacher models. In real-world scenarios, obtaining such labeled data can be challenging and costly, which could limit the practical applicability of the approach.

Further research could explore ways to reduce the reliance on labeled data, perhaps through the use of unsupervised or weakly supervised learning techniques. Investigating the robustness of the distillation process to noisy or imperfect teacher models would also be a valuable area of study.

Conclusion

The researchers have developed a highly efficient and accurate video anomaly detection model by leveraging knowledge distillation from multiple object-level teacher models. Their approach achieves exceptional speed while maintaining comparable accuracy to recent state-of-the-art methods, making it a promising candidate for real-world applications such as surveillance and monitoring.

The use of adversarial distillation to improve the fidelity of the student model is a particularly novel and effective contribution. While the approach has some limitations, the researchers have demonstrated the potential of distillation-based methods for building fast and accurate computer vision systems.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Lightning Fast Video Anomaly Detection via Adversarial Knowledge Distillation

Florinel-Alin Croitoru, Nicolae-Catalin Ristea, Dana Dascalescu, Radu Tudor Ionescu, Fahad Shahbaz Khan, Mubarak Shah

We propose a very fast frame-level model for anomaly detection in video, which learns to detect anomalies by distilling knowledge from multiple highly accurate object-level teacher models. To improve the fidelity of our student, we distill the low-resolution anomaly maps of the teachers by jointly applying standard and adversarial distillation, introducing an adversarial discriminator for each teacher to distinguish between target and generated anomaly maps. We conduct experiments on three benchmarks (Avenue, ShanghaiTech, UCSD Ped2), showing that our method is over 7 times faster than the fastest competing method, and between 28 and 62 times faster than object-centric models, while obtaining comparable results to recent methods. Our evaluation also indicates that our model achieves the best trade-off between speed and accuracy, due to its previously unheard-of speed of 1480 FPS. In addition, we carry out a comprehensive ablation study to justify our architectural design choices. Our code is freely available at: https://github.com/ristea/fast-aed.

7/18/2024

Distilling Aggregated Knowledge for Weakly-Supervised Video Anomaly Detection

Jash Dalvi, Ali Dabouei, Gunjan Dhanuka, Min Xu

Video anomaly detection aims to develop automated models capable of identifying abnormal events in surveillance videos. The benchmark setup for this task is extremely challenging due to: i) the limited size of the training sets, ii) weak supervision provided in terms of video-level labels, and iii) intrinsic class imbalance induced by the scarcity of abnormal events. In this work, we show that distilling knowledge from aggregated representations of multiple backbones into a relatively simple model achieves state-of-the-art performance. In particular, we develop a bi-level distillation approach along with a novel disentangled cross-attention-based feature aggregation network. Our proposed approach, DAKD (Distilling Aggregated Knowledge with Disentangled Attention), demonstrates superior performance compared to existing methods across multiple benchmark datasets. Notably, we achieve significant improvements of 1.36%, 0.78%, and 7.02% on the UCF-Crime, ShanghaiTech, and XD-Violence datasets, respectively.

6/6/2024

A Lightweight Video Anomaly Detection Model with Weak Supervision and Adaptive Instance Selection

Yang Wang, Jiaogen Zhou, Jihong Guan

Video anomaly detection is to determine whether there are any abnormal events, behaviors or objects in a given video, which enables effective and intelligent public safety management. As video anomaly labeling is both time-consuming and expensive, most existing works employ unsupervised or weakly supervised learning methods. This paper focuses on weakly supervised video anomaly detection, in which the training videos are labeled whether or not they contain any anomalies, but there is no information about which frames the anomalies are located. However, the uncertainty of weakly labeled data and the large model size prevent existing methods from wide deployment in real scenarios, especially the resource-limit situations such as edge-computing. In this paper, we develop a lightweight video anomaly detection model. On the one hand, we propose an adaptive instance selection strategy, which is based on the model's current status to select confident instances, thereby mitigating the uncertainty of weakly labeled data and subsequently promoting the model's performance. On the other hand, we design a lightweight multi-level temporal correlation attention module and an hourglass-shaped fully connected layer to construct the model, which can reduce the model parameters to only 0.56% of the existing methods (e.g. RTFM). Our extensive experiments on two public datasets UCF-Crime and ShanghaiTech show that our model can achieve comparable or even superior AUC score compared to the state-of-the-art methods, with a significantly reduced number of model parameters.

7/8/2024

Object-Centric Diffusion for Efficient Video Editing

Kumara Kahatapitiya, Adil Karjauv, Davide Abati, Fatih Porikli, Yuki M. Asano, Amirhossein Habibian

Diffusion-based video editing have reached impressive quality and can transform either the global style, local structure, and attributes of given video inputs, following textual edit prompts. However, such solutions typically incur heavy memory and computational costs to generate temporally-coherent frames, either in the form of diffusion inversion and/or cross-frame attention. In this paper, we conduct an analysis of such inefficiencies, and suggest simple yet effective modifications that allow significant speed-ups whilst maintaining quality. Moreover, we introduce Object-Centric Diffusion, to fix generation artifacts and further reduce latency by allocating more computations towards foreground edited regions, arguably more important for perceptual quality. We achieve this by two novel proposals: i) Object-Centric Sampling, decoupling the diffusion steps spent on salient or background regions and spending most on the former, and ii) Object-Centric Token Merging, which reduces cost of cross-frame attention by fusing redundant tokens in unimportant background regions. Both techniques are readily applicable to a given video editing model without retraining, and can drastically reduce its memory and computational cost. We evaluate our proposals on inversion-based and control-signal-based editing pipelines, and show a latency reduction up to 10x for a comparable synthesis quality. Project page: qualcomm-ai-research.github.io/object-centric-diffusion.

9/2/2024