Event Stream based Human Action Recognition: A High-Definition Benchmark Dataset and Algorithms

Read original: arXiv:2408.09764 - Published 8/20/2024 by Xiao Wang, Shiao Wang, Pengpeng Shao, Bo Jiang, Lin Zhu, Yonghong Tian

Overview

This paper introduces CeleX-HAR, a high-definition benchmark dataset for event stream-based human action recognition, together with a new recognition model.
The dataset contains 124,625 video sequences covering 150 action categories, recorded at 1280 × 800 resolution with a CeleX-V event camera.
The authors also benchmark more than 20 mainstream action recognition models on the dataset and propose EVMamba, a Mamba-based vision backbone that reports favorable results across multiple datasets.

Plain English Explanation

The paper describes a new dataset and machine learning models for recognizing human actions from a special type of video data called "event streams." Event-based vision sensors capture information about changes in a scene, rather than recording complete images like a regular camera. This allows them to capture fast motions and events with very high temporal resolution.

The dataset introduced in this paper, CeleX-HAR, contains 124,625 event video sequences spanning 150 commonly occurring action categories, recorded at 1280 × 800 resolution. Event data of this kind is often grouped into frames, where each frame represents the changes that occurred in a scene over a very short time period. This provides a rich dataset for training algorithms to recognize human actions from event stream data.
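To make the idea of an event frame concrete, here is a minimal sketch of how events inside a short time window could be summed into a single frame. It is an illustration only, not the paper's preprocessing: the `Event` tuple layout, the microsecond timestamps, and the `accumulate_frame` helper are all assumptions made for this example.

```python
from typing import List, Tuple

# Assumed event layout: (x, y, timestamp in microseconds, polarity +1 or -1).
Event = Tuple[int, int, int, int]


def accumulate_frame(events: List[Event], width: int, height: int,
                     t_start: int, t_end: int) -> List[List[int]]:
    """Sum event polarities per pixel over the window [t_start, t_end)."""
    frame = [[0] * width for _ in range(height)]
    for x, y, t, p in events:
        # Skip events outside the time window or outside the sensor area.
        if t_start <= t < t_end and 0 <= x < width and 0 <= y < height:
            frame[y][x] += p
    return frame


# Four toy events on a 3x2 sensor; the last one falls outside the window.
events = [(1, 0, 100, 1), (1, 0, 400, 1), (2, 1, 900, -1), (0, 0, 1500, 1)]
frame = accumulate_frame(events, width=3, height=2, t_start=0, t_end=1000)
print(frame)  # [[0, 2, 0], [0, 0, -1]]
```

Pixels where nothing changed stay at zero, which is why event data is described as sparse.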

The authors also propose a new deep learning model, EVMamba, designed to process and learn from event stream data. It scans the spatial plane in multiple directions and scans event voxels along time, which lets it encode both where and when changes occur. The paper reports favorable results for EVMamba across multiple datasets. By using this specialized data and a tailored model, the researchers aim to advance the field of human action recognition using event-based vision sensors.

Technical Explanation

The paper introduces a high-definition benchmark dataset called CeleX-HAR for event stream-based human action recognition. It was recorded with a CeleX-V event camera at 1280 × 800 resolution, whereas most existing event-based HAR datasets are low resolution (346 × 260). The dataset comprises 124,625 video sequences across 150 action categories, and the recordings account for factors such as multiple views, illumination, action speed, and occlusion.

The authors also propose EVMamba, a Mamba vision backbone network for event stream-based human action recognition. It combines multi-directional scanning over the spatial plane with a voxel temporal scanning mechanism, so that the network encodes and mines the spatio-temporal information in the event stream.
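The paper's abstract mentions a voxel-based temporal scanning mechanism. A voxel grid is a common way to expose the temporal structure of an event stream to a network, and the sketch below shows one simple way to build such a grid and read a pixel's history across time bins. It is a generic illustration under stated assumptions, not the authors' architecture: the function names, the equal-width time bins, and the polarity-sum encoding are all hypothetical choices, and events are assumed to lie within the sensor bounds.

```python
from typing import List, Tuple

# Assumed event layout: (x, y, timestamp, polarity +1 or -1).
Event = Tuple[int, int, int, int]
VoxelGrid = List[List[List[int]]]  # indexed as grid[time_bin][y][x]


def events_to_voxel_grid(events: List[Event], width: int, height: int,
                         num_bins: int) -> VoxelGrid:
    """Split the stream into num_bins equal time slices of polarity sums."""
    grid = [[[0] * width for _ in range(height)] for _ in range(num_bins)]
    if not events:
        return grid
    t_min = min(e[2] for e in events)
    t_max = max(e[2] for e in events)
    span = max(t_max - t_min, 1)  # avoid division by zero for a single timestamp
    for x, y, t, p in events:
        # Map the timestamp to a bin index; clamp the latest event into the last bin.
        b = min((t - t_min) * num_bins // span, num_bins - 1)
        grid[b][y][x] += p
    return grid


def temporal_scan(grid: VoxelGrid, x: int, y: int) -> List[int]:
    """Read one pixel's values across the time bins, oldest bin first."""
    return [time_slice[y][x] for time_slice in grid]


events = [(0, 0, 0, 1), (0, 0, 10, 1), (1, 1, 50, -1), (0, 0, 99, 1)]
grid = events_to_voxel_grid(events, width=2, height=2, num_bins=4)
print(temporal_scan(grid, 0, 0))  # [2, 0, 0, 1]
print(temporal_scan(grid, 1, 1))  # [0, 0, -1, 0]
```

Reading the grid along the time axis yields an ordered sequence per location, which is the kind of input a sequence model can consume.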

To make the benchmark more comprehensive, the authors report results for more than 20 mainstream HAR models on CeleX-HAR as baselines for future work. EVMamba is reported to achieve favorable results across multiple datasets.

Critical Analysis

The paper provides a valuable contribution to the field of event-based vision and action recognition. The CeleX-HAR dataset fills an important gap, as most existing event-based HAR benchmarks are low resolution.

However, any single benchmark covers only a limited set of recording conditions. Further research would be needed to assess how well the proposed model generalizes to more varied real-world scenarios.

Additionally, while the deep learning architectures show promising results, there may be opportunities to further improve performance by exploring alternative network designs or incorporating additional contextual information beyond the event stream data alone.

Overall, this work represents an important step forward in the development of robust and efficient event-based action recognition systems, which could have applications in fields like robotics, surveillance, and human-computer interaction.

Conclusion

This paper presents a new high-definition benchmark dataset and a deep learning model for event stream-based human action recognition. The CeleX-HAR dataset provides a comprehensive resource for training and evaluating action recognition systems using event-based vision sensors, which offer advantages over traditional frame-based cameras such as low energy consumption and high dynamic range.

The authors' proposed model, EVMamba, reports favorable results across multiple datasets by encoding the spatio-temporal information in event streams. This work represents an important advancement in the field of event-based vision and its application to human action recognition, with potential impacts on a wide range of real-world applications.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Event Stream based Human Action Recognition: A High-Definition Benchmark Dataset and Algorithms

Xiao Wang, Shiao Wang, Pengpeng Shao, Bo Jiang, Lin Zhu, Yonghong Tian

Human Action Recognition (HAR) stands as a pivotal research domain in both computer vision and artificial intelligence, with RGB cameras dominating as the preferred tool for investigation and innovation in this field. However, in real-world applications, RGB cameras encounter numerous challenges, including light conditions, fast motion, and privacy concerns. Consequently, bio-inspired event cameras have garnered increasing attention due to their advantages of low energy consumption, high dynamic range, etc. Nevertheless, most existing event-based HAR datasets are low resolution ($346 \times 260$). In this paper, we propose a large-scale, high-definition ($1280 \times 800$) human action recognition dataset based on the CeleX-V event camera, termed CeleX-HAR. It encompasses 150 commonly occurring action categories, comprising a total of 124,625 video sequences. Various factors such as multi-view, illumination, action speed, and occlusion are considered when recording these data. To build a more comprehensive benchmark dataset, we report over 20 mainstream HAR models for future works to compare. In addition, we also propose a novel Mamba vision backbone network for event stream based HAR, termed EVMamba, which equips the spatial plane multi-directional scanning and novel voxel temporal scanning mechanism. By encoding and mining the spatio-temporal information of event streams, our EVMamba has achieved favorable results across multiple datasets. Both the dataset and source code will be released on https://github.com/Event-AHU/CeleX-HAR

8/20/2024

HabitAction: A Video Dataset for Human Habitual Behavior Recognition

Hongwu Li, Zhenliang Zhang, Wei Wang

Human Action Recognition (HAR) is a very crucial task in computer vision. It helps to carry out a series of downstream tasks, like understanding human behaviors. Due to the complexity of human behaviors, many highly valuable behaviors are not yet encompassed within the available datasets for HAR, e.g., human habitual behaviors (HHBs). HHBs hold significant importance for analyzing a person's personality, habits, and psychological changes. To solve these problems, in this work, we build a novel video dataset to demonstrate various HHBs. These behaviors in the proposed dataset are able to reflect internal mental states and specific emotions of the characters, e.g., crossing arms suggests to shield oneself from perceived threats. The dataset contains 30 categories of habitual behaviors including more than 300,000 frames and 6,899 action instances. Since these behaviors usually appear at small local parts of human action videos, it is difficult for existing action recognition methods to handle these local features. Therefore, we also propose a two-stream model using both human skeletons and RGB appearances. Experimental results demonstrate that our proposed method has much better performance in action recognition than the existing methods on the proposed dataset.

8/27/2024

Real-Time Human Action Recognition on Embedded Platforms

Ruiqi Wang, Zichen Wang, Peiqi Gao, Mingzhen Li, Jaehwan Jeong, Yihang Xu, Yejin Lee, Carolyn M. Baum, Lisa Tabor Connor, Chenyang Lu

With advancements in computer vision and deep learning, video-based human action recognition (HAR) has become practical. However, due to the complexity of the computation pipeline, running HAR on live video streams incurs excessive delays on embedded platforms. This work tackles the real-time performance challenges of HAR with four contributions: 1) an experimental study identifying a standard Optical Flow (OF) extraction technique as the latency bottleneck in a state-of-the-art HAR pipeline, 2) an exploration of the latency-accuracy tradeoff between the standard and deep learning approaches to OF extraction, which highlights the need for a novel, efficient motion feature extractor, 3) the design of Integrated Motion Feature Extractor (IMFE), a novel single-shot neural network architecture for motion feature extraction with drastic improvement in latency, 4) the development of RT-HARE, a real-time HAR system tailored for embedded platforms. Experimental results on an Nvidia Jetson Xavier NX platform demonstrated that RT-HARE realizes real-time HAR at a video frame rate of 30 frames per second while delivering high levels of recognition accuracy.

9/12/2024

A Comprehensive Methodological Survey of Human Activity Recognition Across Divers Data Modalities

Jungpil Shin, Najmul Hassan, Abu Saleh Musa Miah, Satoshi Nishimura

Human Activity Recognition (HAR) systems aim to understand human behaviour and assign a label to each action, attracting significant attention in computer vision due to their wide range of applications. HAR can leverage various data modalities, such as RGB images and video, skeleton, depth, infrared, point cloud, event stream, audio, acceleration, and radar signals. Each modality provides unique and complementary information suited to different application scenarios. Consequently, numerous studies have investigated diverse approaches for HAR using these modalities. This paper presents a comprehensive survey of the latest advancements in HAR from 2014 to 2024, focusing on machine learning (ML) and deep learning (DL) approaches categorized by input data modalities. We review both single-modality and multi-modality techniques, highlighting fusion-based and co-learning frameworks. Additionally, we cover advancements in hand-crafted action features, methods for recognizing human-object interactions, and activity detection. Our survey includes a detailed dataset description for each modality and a summary of the latest HAR systems, offering comparative results on benchmark datasets. Finally, we provide insightful observations and propose effective future research directions in HAR.

9/17/2024