JOSENet: A Joint Stream Embedding Network for Violence Detection in Surveillance Videos

Read original: arXiv:2405.02961 - Published 8/6/2024 by Pietro Nardelli, Danilo Comminiello

Overview

Video surveillance cameras are increasingly available, and there is a growing need for crime prevention, leading to more research on violence detection in surveillance videos.
Compared to other action recognition tasks, violence detection in surveillance videos poses additional challenges, such as the presence of a wide variety of real-life fight scenes and small available datasets.
Surveillance videos also have varying backgrounds and people, and violent actions need to be detected quickly to prevent consequences, requiring models with reduced memory usage and computational costs.
To address these issues, the authors introduce JOSENet, a novel self-supervised framework that they report achieves strong violence detection performance in surveillance videos.

Plain English Explanation

The paper focuses on the problem of detecting violence in surveillance videos, which is becoming more important as video cameras become more widespread and there is a growing need to prevent crime. Compared to other tasks in action recognition, violence detection in surveillance videos has some additional challenges:

There is a huge variety of real-life fight scenes that the system needs to be able to recognize, which makes it harder to build accurate models.
The available datasets for training these models are quite small compared to datasets for other action recognition tasks.
In surveillance videos, the background and the people in the scenes are always changing, which makes it harder for the models to learn.
Violent actions need to be detected quickly in surveillance videos to prevent bad things from happening, so the models need to be efficient and not use too much memory or computing power.

To address these challenges, the researchers developed a new self-supervised learning approach called JOSENet. This framework takes in two different video inputs - the actual video frames (RGB) and optical flow, which describes how pixels move between frames - and trains the model with a new regularized self-supervised learning technique. According to the authors, JOSENet outperforms other self-supervised methods while using only a quarter of the frames per video segment and a lower frame rate, which makes it cheaper to run.

Technical Explanation

The authors introduce JOSENet, a novel self-supervised framework for violence detection in surveillance videos. The model takes in two spatiotemporal video streams - RGB frames and optical flows - and uses a new regularized self-supervised learning approach.
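To make the idea of a regularized joint-embedding objective over two streams more concrete, here is a minimal sketch. It assumes a VICReg-style loss (invariance, variance, and covariance terms), which is one well-known regularized self-supervised objective; the summary above does not say which objective the authors use, so this should not be read as their method. The function names, the default weights, and the toy embeddings are illustrative, and the inputs stand in for the outputs of hypothetical RGB and optical-flow encoders.

```python
import math
import random


def _centered(z):
    """Subtract the per-dimension batch mean from every embedding."""
    n, d = len(z), len(z[0])
    means = [sum(row[j] for row in z) / n for j in range(d)]
    return [[row[j] - means[j] for j in range(d)] for row in z]


def invariance_term(z_a, z_b):
    """Mean squared distance between paired embeddings of the two streams."""
    n = len(z_a)
    return sum(
        sum((a - b) ** 2 for a, b in zip(row_a, row_b))
        for row_a, row_b in zip(z_a, z_b)
    ) / n


def variance_term(z, gamma=1.0, eps=1e-4):
    """Hinge on each dimension's standard deviation; penalizes collapse."""
    n, d = len(z), len(z[0])
    c = _centered(z)
    total = 0.0
    for j in range(d):
        var = sum(c[i][j] ** 2 for i in range(n)) / (n - 1)
        total += max(0.0, gamma - math.sqrt(var + eps))
    return total / d


def covariance_term(z):
    """Sum of squared off-diagonal covariances, scaled by the dimension."""
    n, d = len(z), len(z[0])
    c = _centered(z)
    total = 0.0
    for j in range(d):
        for k in range(d):
            if j != k:
                cov = sum(c[i][j] * c[i][k] for i in range(n)) / (n - 1)
                total += cov ** 2
    return total / d


def joint_embedding_loss(z_rgb, z_flow, lam=25.0, mu=25.0, nu=1.0):
    """VICReg-style loss; the weights are assumed defaults, not the paper's."""
    return (
        lam * invariance_term(z_rgb, z_flow)
        + mu * (variance_term(z_rgb) + variance_term(z_flow))
        + nu * (covariance_term(z_rgb) + covariance_term(z_flow))
    )


# Toy embeddings standing in for encoder outputs (batch of 8, dimension 4).
rng = random.Random(0)
spread = [[rng.gauss(0.0, 1.0) for _ in range(4)] for _ in range(8)]
collapsed = [[0.5] * 4 for _ in range(8)]

print("agreeing, spread embeddings:", joint_embedding_loss(spread, spread))
print("collapsed embeddings:", joint_embedding_loss(collapsed, collapsed))
```

The variance and covariance terms are the "regularization" in this sketch: without them, both encoders could trivially agree by mapping every clip to the same vector, which is the collapsed case above.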

JOSENet is designed to address the unique challenges of violence detection in surveillance videos, such as the presence of diverse real-life fight scenes, small available datasets, and the need for efficient models that can quickly detect violent actions.

Compared to other state-of-the-art self-supervised methods, JOSENet demonstrates improved performance while using only one-fourth of the frames per video segment and a reduced frame rate. This lowers the computational cost per segment, an important consideration for real-time surveillance applications.
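As a rough illustration of what "one-fourth of the frames at a reduced frame rate" can mean in practice, the sketch below keeps every fourth frame of a segment. The segment length (64 frames), source frame rate (30 fps), and stride (4) are assumed values chosen for the example; the summary does not state the actual settings, and the helper names are hypothetical.

```python
def subsample_segment(frames, stride, max_frames):
    """Keep every `stride`-th frame, truncated to at most `max_frames`."""
    if stride < 1 or max_frames < 1:
        raise ValueError("stride and max_frames must be positive")
    return frames[::stride][:max_frames]


def effective_fps(src_fps, stride):
    """Frame rate seen by the model after keeping every `stride`-th frame."""
    return src_fps / stride


# Assumed example values, not taken from the paper.
SRC_FPS = 30
SEGMENT_LEN = 64
STRIDE = 4

segment = list(range(SEGMENT_LEN))  # frame indices stand in for real frames
kept = subsample_segment(segment, STRIDE, SEGMENT_LEN // STRIDE)

print("frames kept:", len(kept), "of", len(segment))
print("effective frame rate:", effective_fps(SRC_FPS, STRIDE), "fps")
print("kept indices:", kept)
```

With a fixed stride, the subsampled segment covers nearly the same time span as the original one, so the model sees the same stretch of the scene with fewer frames to process.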

The authors provide the source code and instructions to reproduce their experiments on GitHub at https://github.com/ispamm/JOSENet.

Critical Analysis

The paper introduces a promising approach to violence detection in surveillance videos, but there are a few potential limitations and areas for further research:

The authors note that available datasets for this task are quite small compared to other action recognition datasets. While JOSENet demonstrates good performance, more research may be needed to understand how the model would scale and generalize to larger, more diverse datasets.
The paper does not provide much analysis on the types of violent actions the model is able to accurately detect. Further research could explore the model's capabilities and limitations in recognizing different manifestations of violence.
As mentioned in the paper on gait recognition from compressed videos, the use of reduced frame rates and video segments could potentially impact the model's ability to capture important temporal information. The tradeoffs between efficiency and performance should be examined more closely.
The ActNetFormer and Unifying Global-Local Scene Entities papers discuss semi-supervised and multi-modal approaches that could potentially be combined with the self-supervised JOSENet framework to further improve violence detection capabilities.
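The efficiency tradeoff raised above can be made concrete with a small calculation. The sketch below assumes frames are sampled at a fixed stride (indices 0, stride, 2*stride, ...) and counts how many sampled frames fall inside a short event in the worst case. The function names and the stride of 4 are illustrative assumptions, not values from the paper.

```python
def frames_sampled_during(event_start, event_len, stride):
    """Count sampled frames inside [event_start, event_start + event_len)."""
    first = -(-event_start // stride) * stride  # first sampled index >= start
    end = event_start + event_len
    if first >= end:
        return 0
    return (end - 1 - first) // stride + 1


def worst_case_samples(event_len, stride):
    """Fewest sampled frames an event of this length can contain."""
    return min(
        frames_sampled_during(start, event_len, stride)
        for start in range(stride)
    )


STRIDE = 4  # assumed: keep every fourth frame
for event_len in (3, 4, 8):
    print(event_len, "frame event ->", worst_case_samples(event_len, STRIDE))
```

Under these assumptions, an event shorter than the stride can fall entirely between two sampled frames, so a very brief motion may leave no trace in the subsampled segment. Whether this matters for real fight scenes, which usually last far longer than a few frames, is an empirical question the paper's experiments would have to answer.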

Conclusion

The JOSENet framework introduced in this paper represents an important step forward in addressing the challenges of violence detection in surveillance videos. By leveraging a novel self-supervised learning approach, the model is able to achieve strong performance while being more computationally efficient than other state-of-the-art methods.

As video surveillance becomes more prevalent and the need for effective crime prevention grows, tools like JOSENet could play a crucial role in helping to quickly identify and respond to violent incidents. However, continued research is needed to further refine these techniques, expand their capabilities, and ensure they are developed and deployed responsibly and ethically, as discussed in the paper on semi-supervised active learning for video action detection.

Related Papers

JOSENet: A Joint Stream Embedding Network for Violence Detection in Surveillance Videos

Pietro Nardelli, Danilo Comminiello

The increasing proliferation of video surveillance cameras and the escalating demand for crime prevention have intensified interest in the task of violence detection within the research community. Compared to other action recognition tasks, violence detection in surveillance videos presents additional issues, such as the wide variety of real fight scenes. Unfortunately, existing datasets for violence detection are relatively small in comparison to those for other action recognition tasks. Moreover, surveillance footage often features different individuals in each video and varying backgrounds for each camera. In addition, fast detection of violent actions in real-life surveillance videos is crucial to prevent adverse outcomes, thus necessitating models that are optimized for reduced memory usage and computational costs. These challenges complicate the application of traditional action recognition methods. To tackle all these issues, we introduce JOSENet, a novel self-supervised framework that provides outstanding performance for violence detection in surveillance videos. The proposed model processes two spatiotemporal video streams, namely RGB frames and optical flows, and incorporates a new regularized self-supervised learning approach for videos. JOSENet demonstrates improved performance compared to state-of-the-art methods, while utilizing only one-fourth of the frames per video segment and operating at a reduced frame rate. The source code is available at https://github.com/ispamm/JOSENet.

8/6/2024

CUE-Net: Violence Detection Video Analytics with Spatial Cropping, Enhanced UniformerV2 and Modified Efficient Additive Attention

Damith Chamalke Senadeera, Xiaoyun Yang, Dimitrios Kollias, Gregory Slabaugh

In this paper we introduce CUE-Net, a novel architecture designed for automated violence detection in video surveillance. As surveillance systems become more prevalent due to technological advances and decreasing costs, the challenge of efficiently monitoring vast amounts of video data has intensified. CUE-Net addresses this challenge by combining spatial Cropping with an enhanced version of the UniformerV2 architecture, integrating convolutional and self-attention mechanisms alongside a novel Modified Efficient Additive Attention mechanism (which reduces the quadratic time complexity of self-attention) to effectively and efficiently identify violent activities. This approach aims to overcome traditional challenges such as capturing distant or partially obscured subjects within video frames. By focusing on both local and global spatiotemporal features, CUE-Net achieves state-of-the-art performance on the RWF-2000 and RLVS datasets, surpassing existing methods.

5/1/2024

Violence detection in videos using deep recurrent and convolutional neural networks

Abdarahmane Traoré, Moulay A. Akhloufi

Research on violence and abnormal behavior detection has seen increased interest in recent years, mainly due to a rise in crime in large cities worldwide. In this work, we propose a deep learning architecture for violence detection which combines both recurrent neural networks (RNNs) and 2-dimensional convolutional neural networks (2D CNNs). In addition to video frames, we use optical flow computed from the captured sequences. The CNN extracts spatial characteristics in each frame, while the RNN extracts temporal characteristics. The use of optical flow makes it possible to encode the movements in the scenes. The proposed approaches reach the same level as the state-of-the-art techniques and sometimes surpass them. They were validated on 3 databases, achieving good results.

9/14/2024

2D bidirectional gated recurrent unit convolutional Neural networks for end-to-end violence detection In videos

Abdarahmane Traoré, Moulay A. Akhloufi

Abnormal behavior detection, action recognition, and fight and violence detection in videos are areas that have attracted a lot of interest in recent years. In this work, we propose an architecture that combines a Bidirectional Gated Recurrent Unit (BiGRU) and a 2D Convolutional Neural Network (CNN) to detect violence in video sequences. A CNN is used to extract spatial characteristics from each frame, while the BiGRU extracts temporal and local motion characteristics using CNN-extracted features from multiple frames. The proposed end-to-end deep learning network is tested on three public datasets with varying scene complexities. The proposed network achieves accuracies up to 98%. The obtained results are promising and show the performance of the proposed end-to-end approach.

9/14/2024