A Lightweight Spatiotemporal Network for Online Eye Tracking with Event Camera

Read original: arXiv:2404.08858 - Published 4/16/2024 by Yan Ru Pei, Sasskia Bruers, S'ebastien Crouzet, Douglas McLelland, Olivier Coenen

A Lightweight Spatiotemporal Network for Online Eye Tracking with Event Camera

Overview

This paper presents a lightweight spatiotemporal network for online eye tracking using event cameras.
Event cameras are a novel type of sensor that capture changes in pixel intensity rather than full frame images, which can provide high-speed, low-latency tracking capabilities.
The proposed network architecture is designed to be efficient for real-time applications while maintaining robust eye tracking performance.

Plain English Explanation

Event cameras are a new type of visual sensor that work differently from traditional cameras. Instead of capturing full images at a fixed frame rate, event cameras only record changes in pixel brightness. This allows them to track movement and events at extremely fast speeds with very low latency.

The researchers in this paper developed a lightweight neural network that can use the data from an event camera to track a person's eye movements in real-time. This is useful for applications like human-computer interaction, where fast and accurate eye tracking is important.

The key innovation is creating a neural network architecture that is computationally efficient, so it can run quickly on small devices, while still being effective at precisely locating the user's eye. This involves [link to https://aimodels.fyi/papers/arxiv/state-space-models-event-cameras]using spatiotemporal techniques[/link] to process the unique data format from the event camera sensor.

Overall, this research aims to make high-speed, low-latency eye tracking more practical for real-world applications by developing a lightweight, efficient neural network solution.

Technical Explanation

The paper proposes a [link to https://aimodels.fyi/papers/arxiv/deep-learning-event-based-vision-comprehensive-survey]deep learning[/link] architecture called LightEyetrack that is designed for online eye tracking using event cameras. Event cameras capture per-pixel brightness changes rather than full frame images, which enables ultra-fast, low-latency sensing compared to traditional cameras.

The LightEyetrack network uses a spatiotemporal convolutional structure to effectively process the event-based data. It takes as input a sequence of event frames, which are formed by accumulating events over a short time window. The network then applies 3D convolutions to extract spatiotemporal features, followed by fully connected layers to regress the 2D eye position.

A key aspect of the architecture is its lightweight design, with only 0.23 million parameters. This makes it efficient enough to run in real-time on embedded devices. The researchers evaluate the model on two public event-based eye tracking datasets, demonstrating state-of-the-art performance while maintaining low computational cost.

Critical Analysis

The paper makes a compelling case for the advantages of using event cameras and lightweight neural networks for practical eye tracking applications. The proposed LightEyetrack architecture appears to be an effective solution, achieving high accuracy while being computationally efficient enough for online, real-time use.

However, the paper does not address some potential limitations and areas for further research. For example, it's unclear how the system would perform in more unconstrained real-world environments with complex backgrounds and lighting conditions. [link to https://aimodels.fyi/papers/arxiv/eventsleep-sleep-activity-recognition-event-cameras]Event-based sensors can be sensitive to such factors[/link], so additional robustness testing would be useful.

Additionally, the paper only evaluates the model on existing public datasets. Validating the approach on a wider range of custom datasets, including those with more diverse users and use cases, could help strengthen the generalizability claims.

[link to https://aimodels.fyi/papers/arxiv/long-term-frame-event-visual-tracking-benchmark]Longer-term, real-world tracking performance[/link] is another area that could benefit from further investigation, as the current evaluations are relatively short-term.

Overall, this is a promising piece of research that demonstrates the potential of lightweight, event-based neural networks for practical eye tracking applications. However, additional validation and exploration of the approach's real-world limitations and robustness would help solidify the findings.

Conclusion

This paper presents a novel lightweight spatiotemporal network for online eye tracking using event cameras. Event cameras offer high-speed, low-latency sensing capabilities that can be leveraged for practical real-time applications like human-computer interaction.

The proposed LightEyetrack architecture achieves state-of-the-art eye tracking performance while maintaining a small, efficient neural network design. This makes it well-suited for deployment on embedded devices and in scenarios where computational resources are constrained.

[link to https://aimodels.fyi/papers/arxiv/spikenvs-enhancing-novel-view-synthesis-from-blurry]The use of event-based sensors and spatiotemporal processing techniques[/link] appears to be a promising direction for advancing the field of eye tracking technology. Further research to improve the robustness and generalizability of the approach could unlock even more real-world applications.

Overall, this work demonstrates the potential of lightweight, efficient neural networks to enable high-performance, low-latency computer vision capabilities on resource-constrained platforms, which has important implications for the future of human-machine interaction.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Follow @aimodelsfyi on 𝕏 →

Related Papers

A Lightweight Spatiotemporal Network for Online Eye Tracking with Event Camera

Yan Ru Pei, Sasskia Bruers, S'ebastien Crouzet, Douglas McLelland, Olivier Coenen

Event-based data are commonly encountered in edge computing environments where efficiency and low latency are critical. To interface with such data and leverage their rich temporal features, we propose a causal spatiotemporal convolutional network. This solution targets efficient implementation on edge-appropriate hardware with limited resources in three ways: 1) deliberately targets a simple architecture and set of operations (convolutions, ReLU activations) 2) can be configured to perform online inference efficiently via buffering of layer outputs 3) can achieve more than 90% activation sparsity through regularization during training, enabling very significant efficiency gains on event-based processors. In addition, we propose a general affine augmentation strategy acting directly on the events, which alleviates the problem of dataset scarcity for event-based systems. We apply our model on the AIS 2024 event-based eye tracking challenge, reaching a score of 0.9916 p10 accuracy on the Kaggle private testset.

4/16/2024

Co-designing a Sub-millisecond Latency Event-based Eye Tracking System with Submanifold Sparse CNN

Baoheng Zhang, Yizhao Gao, Jingyuan Li, Hayden Kwok-Hay So

Eye-tracking technology is integral to numerous consumer electronics applications, particularly in the realm of virtual and augmented reality (VR/AR). These applications demand solutions that excel in three crucial aspects: low-latency, low-power consumption, and precision. Yet, achieving optimal performance across all these fronts presents a formidable challenge, necessitating a balance between sophisticated algorithms and efficient backend hardware implementations. In this study, we tackle this challenge through a synergistic software/hardware co-design of the system with an event camera. Leveraging the inherent sparsity of event-based input data, we integrate a novel sparse FPGA dataflow accelerator customized for submanifold sparse convolution neural networks (SCNN). The SCNN implemented on the accelerator can efficiently extract the embedding feature vector from each representation of event slices by only processing the non-zero activations. Subsequently, these vectors undergo further processing by a gated recurrent unit (GRU) and a fully connected layer on the host CPU to generate the eye centers. Deployment and evaluation of our system reveal outstanding performance metrics. On the Event-based Eye-Tracking-AIS2024 dataset, our system achieves 81% p5 accuracy, 99.5% p10 accuracy, and 3.71 Mean Euclidean Distance with 0.7 ms latency while only consuming 2.29 mJ per inference. Notably, our solution opens up opportunities for future eye-tracking systems. Code is available at https://github.com/CASR-HKU/ESDA/tree/eye_tracking.

4/23/2024

Evaluating Image-Based Face and Eye Tracking with Event Cameras

Khadija Iddrisu, Waseem Shariff, Noel E. OConnor, Joseph Lemley, Suzanne Little

Event Cameras, also known as Neuromorphic sensors, capture changes in local light intensity at the pixel level, producing asynchronously generated data termed ``events''. This distinct data format mitigates common issues observed in conventional cameras, like under-sampling when capturing fast-moving objects, thereby preserving critical information that might otherwise be lost. However, leveraging this data often necessitates the development of specialized, handcrafted event representations that can integrate seamlessly with conventional Convolutional Neural Networks (CNNs), considering the unique attributes of event data. In this study, We evaluate event-based Face and Eye tracking. The core objective of our study is to showcase the viability of integrating conventional algorithms with event-based data, transformed into a frame format while preserving the unique benefits of event cameras. To validate our approach, we constructed a frame-based event dataset by simulating events between RGB frames derived from the publicly accessible Helen Dataset. We assess its utility for face and eye detection tasks through the application of GR-YOLO -- a pioneering technique derived from YOLOv3. This evaluation includes a comparative analysis with results derived from training the dataset with YOLOv8. Subsequently, the trained models were tested on real event streams from various iterations of Prophesee's event cameras and further evaluated on the Faces in Event Stream (FES) benchmark dataset. The models trained on our dataset shows a good prediction performance across all the datasets obtained for validation with the best results of a mean Average precision score of 0.91. Additionally, The models trained demonstrated robust performance on real event camera data under varying light conditions.

8/21/2024

🌐

A Novel Spike Transformer Network for Depth Estimation from Event Cameras via Cross-modality Knowledge Distillation

Xin Zhang, Liangxiu Han, Tam Sobeih, Lianghao Han, Darren Dancey

Depth estimation is crucial for interpreting complex environments, especially in areas such as autonomous vehicle navigation and robotics. Nonetheless, obtaining accurate depth readings from event camera data remains a formidable challenge. Event cameras operate differently from traditional digital cameras, continuously capturing data and generating asynchronous binary spikes that encode time, location, and light intensity. Yet, the unique sampling mechanisms of event cameras render standard image based algorithms inadequate for processing spike data. This necessitates the development of innovative, spike-aware algorithms tailored for event cameras, a task compounded by the irregularity, continuity, noise, and spatial and temporal characteristics inherent in spiking data.Harnessing the strong generalization capabilities of transformer neural networks for spatiotemporal data, we propose a purely spike-driven spike transformer network for depth estimation from spiking camera data. To address performance limitations with Spiking Neural Networks (SNN), we introduce a novel single-stage cross-modality knowledge transfer framework leveraging knowledge from a large vision foundational model of artificial neural networks (ANN) (DINOv2) to enhance the performance of SNNs with limited data. Our experimental results on both synthetic and real datasets show substantial improvements over existing models, with notable gains in Absolute Relative and Square Relative errors (49% and 39.77% improvements over the benchmark model Spike-T, respectively). Besides accuracy, the proposed model also demonstrates reduced power consumptions, a critical factor for practical applications.

5/2/2024