Co-designing a Sub-millisecond Latency Event-based Eye Tracking System with Submanifold Sparse CNN

Read original: arXiv:2404.14279 - Published 4/23/2024 by Baoheng Zhang, Yizhao Gao, Jingyuan Li, Hayden Kwok-Hay So

Overview

This paper describes the design and development of a sub-millisecond latency event-based eye tracking system using a Submanifold Sparse Convolutional Neural Network (SSCNN).
The system aims to provide low-latency eye tracking capabilities for applications like augmented reality, virtual reality, and human-computer interaction.
The key innovation is a co-design approach that combines an event camera, a submanifold sparse neural network, and a custom sparse FPGA dataflow accelerator to reach a reported latency of 0.7 ms.

Plain English Explanation

The researchers have created a new eye tracking system that follows the movement of a person's eyes extremely quickly, producing each estimate in less than a millisecond. This is much faster than traditional eye tracking systems.

The system uses a special type of camera sensor that only records changes in the image, rather than capturing full frames like a regular camera. This allows the system to react to eye movements very rapidly. The researchers then use a specialized neural network model called a "Submanifold Sparse Convolutional Neural Network" to process the sensor data and determine where the person's eyes are looking.

By carefully designing the hardware, software, and neural network together, the researchers were able to create an eye tracking system that is both extremely fast and accurate. This could be very useful for applications like augmented reality (AR), virtual reality (VR), and other interfaces that need to respond quickly to a person's eye movements.

Technical Explanation

The paper describes the co-design of an event-based eye tracking system that achieves sub-millisecond latency using Submanifold Sparse Convolutional Neural Networks (SSCNN). The key technical elements include:

Event-based sensors: The system uses event-based vision sensors that only capture changes in the visual scene, rather than full video frames. This allows for much faster reaction times compared to traditional cameras.
Submanifold Sparse CNNs: The system uses a submanifold sparse convolutional neural network (SSCNN) to process the sparse, event-based sensor data. By computing only on non-zero activations, the SSCNN extracts an embedding feature vector from each event slice at low computational cost.
Hardware-software co-design: The SSCNN runs on a novel sparse FPGA dataflow accelerator built for this kind of network. A gated recurrent unit (GRU) and a fully connected layer on the host CPU then turn the feature vectors into eye-center coordinates. This co-design approach is critical for achieving sub-millisecond latency.
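
To make the idea of submanifold sparse convolution more concrete, here is a minimal single-channel sketch in plain Python. It is not the authors' implementation, which targets an FPGA dataflow accelerator. The event tuple layout `(x, y, timestamp, polarity)`, the signed-count slice representation, and the function names `events_to_sparse_slice` and `submanifold_conv3x3` are assumptions made for illustration. The point it shows is that outputs are computed only at sites that were already active in the input, so the set of active pixels does not grow from layer to layer.

```python
from typing import Dict, List, Sequence, Tuple

Coord = Tuple[int, int]               # (row, col) of an active pixel
Event = Tuple[int, int, float, int]   # (x, y, timestamp in seconds, polarity); assumed layout


def events_to_sparse_slice(events: Sequence[Event], t_start: float, t_end: float) -> Dict[Coord, float]:
    """Accumulate events with t_start <= t < t_end into a sparse map {(row, col): signed count}.

    Pixels with no events in the window are simply absent from the map.
    """
    sparse: Dict[Coord, float] = {}
    for x, y, t, polarity in events:
        if t_start <= t < t_end:
            key = (y, x)
            sparse[key] = sparse.get(key, 0.0) + (1.0 if polarity > 0 else -1.0)
    return sparse


def submanifold_conv3x3(sparse: Dict[Coord, float], kernel: List[List[float]], bias: float = 0.0) -> Dict[Coord, float]:
    """3x3 submanifold sparse convolution on a single-channel sparse map.

    An output is produced only at sites that are active in the input, and only
    active neighbours contribute, so inactive pixels cost nothing.
    """
    out: Dict[Coord, float] = {}
    for (r, c) in sparse:
        acc = bias
        for dr in (-1, 0, 1):
            for dc in (-1, 0, 1):
                neighbour = sparse.get((r + dr, c + dc))
                if neighbour is not None:
                    acc += kernel[dr + 1][dc + 1] * neighbour
        out[(r, c)] = acc
    return out


# Toy example: three events fall in a 1 ms window, one arrives later and is ignored.
events = [(2, 1, 0.0001, 1), (2, 1, 0.0002, 1), (3, 1, 0.0003, -1), (9, 9, 0.0020, 1)]
event_slice = events_to_sparse_slice(events, 0.0, 0.001)
ones_kernel = [[1.0, 1.0, 1.0] for _ in range(3)]
features = submanifold_conv3x3(event_slice, ones_kernel)
print(sorted(features.items()))  # [((1, 2), 1.0), ((1, 3), 1.0)]
```

A regular dense convolution would also produce non-zero outputs at every pixel next to an active one, so activity would spread with each layer. Restricting outputs to the input's active sites keeps the work proportional to the number of events rather than to the image size.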

The paper presents experimental results on the Event-based Eye-Tracking-AIS2024 dataset: 81% p5 accuracy, 99.5% p10 accuracy, and a mean Euclidean distance of 3.71, at 0.7 ms latency and 2.29 mJ per inference. This makes the technology well-suited for applications like augmented reality, virtual reality, and other human-computer interaction systems that require rapid, real-time eye tracking.
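
The paper's abstract reports accuracy as p5 and p10 accuracy together with a mean Euclidean distance. The sketch below shows one common way such metrics are computed. It assumes that pK accuracy is the fraction of predicted eye centers lying within K pixels of the ground truth, that the threshold is inclusive, and that distances are measured in pixels. These conventions and the helper names are assumptions for illustration, not taken from the paper or the benchmark code.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float]  # (x, y) eye-center position in pixels


def euclidean_distances(predicted: Sequence[Point], truth: Sequence[Point]) -> List[float]:
    """Per-sample Euclidean distance between predicted and ground-truth centers."""
    if len(predicted) != len(truth):
        raise ValueError("predicted and truth must have the same length")
    return [math.hypot(px - tx, py - ty) for (px, py), (tx, ty) in zip(predicted, truth)]


def p_accuracy(distances: Sequence[float], tolerance: float) -> float:
    """Fraction of samples whose error is within `tolerance` pixels (inclusive, by assumption)."""
    return sum(d <= tolerance for d in distances) / len(distances)


def mean_euclidean_distance(distances: Sequence[float]) -> float:
    """Average error over all samples."""
    return sum(distances) / len(distances)


# Toy data chosen so the per-sample errors are exactly 5, 6, 0 and 12 pixels.
predicted = [(10.0, 10.0), (20.0, 20.0), (30.0, 30.0), (40.0, 40.0)]
truth = [(13.0, 14.0), (20.0, 26.0), (30.0, 30.0), (52.0, 40.0)]
errors = euclidean_distances(predicted, truth)

print(p_accuracy(errors, 5))            # 0.5
print(p_accuracy(errors, 10))           # 0.75
print(mean_euclidean_distance(errors))  # 5.75
```

Read this way, a high p10 score with a lower p5 score means that nearly every prediction is close to the target, while a smaller share falls within the tighter tolerance.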

Critical Analysis

The paper presents a compelling technical solution for achieving extremely low-latency eye tracking using event-based sensors and specialized neural network architectures. However, some potential limitations and areas for further research are:

The reported results come from a single benchmark dataset (Event-based Eye-Tracking-AIS2024), so the system's performance in real-world, dynamic environments is still unknown. Further testing would be needed to assess its robustness.
The hardware requirements, including the event-based sensors and specialized accelerators, may limit the scalability and accessibility of the technology, especially for consumer applications.
While the sub-millisecond latency is impressive, the practical benefits of such ultra-low latency for many applications are still unclear and would require further user studies to validate.
The reported energy cost is 2.29 mJ per inference. How that figure translates into sustained power draw and battery life, which are crucial factors for mobile and wearable applications, is a further question for deployment.

Overall, the research represents an innovative step forward in event-based computer vision and low-latency eye tracking. However, additional work is needed to fully understand the real-world implications and limitations of this technology.

Conclusion

This paper presents a novel eye tracking system that achieves sub-millisecond latency through the co-design of event-based sensors, sparse neural networks, and specialized hardware. The key innovations include the use of Submanifold Sparse Convolutional Neural Networks to efficiently process the sparse, event-based sensor data, and the tight integration of the hardware and software components.

The extremely low latency of this system could enable new applications in augmented reality, virtual reality, and human-computer interaction that require rapid, real-time eye tracking. While the technology shows promise, further research is needed to assess its performance in realistic environments and address potential scalability and power consumption challenges.

Overall, this work represents an important advance in the field of event-based computer vision and demonstrates the potential of co-design approaches to create highly optimized sensing and perception systems.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Co-designing a Sub-millisecond Latency Event-based Eye Tracking System with Submanifold Sparse CNN

Baoheng Zhang, Yizhao Gao, Jingyuan Li, Hayden Kwok-Hay So

Eye-tracking technology is integral to numerous consumer electronics applications, particularly in the realm of virtual and augmented reality (VR/AR). These applications demand solutions that excel in three crucial aspects: low-latency, low-power consumption, and precision. Yet, achieving optimal performance across all these fronts presents a formidable challenge, necessitating a balance between sophisticated algorithms and efficient backend hardware implementations. In this study, we tackle this challenge through a synergistic software/hardware co-design of the system with an event camera. Leveraging the inherent sparsity of event-based input data, we integrate a novel sparse FPGA dataflow accelerator customized for submanifold sparse convolution neural networks (SCNN). The SCNN implemented on the accelerator can efficiently extract the embedding feature vector from each representation of event slices by only processing the non-zero activations. Subsequently, these vectors undergo further processing by a gated recurrent unit (GRU) and a fully connected layer on the host CPU to generate the eye centers. Deployment and evaluation of our system reveal outstanding performance metrics. On the Event-based Eye-Tracking-AIS2024 dataset, our system achieves 81% p5 accuracy, 99.5% p10 accuracy, and 3.71 Mean Euclidean Distance with 0.7 ms latency while only consuming 2.29 mJ per inference. Notably, our solution opens up opportunities for future eye-tracking systems. Code is available at https://github.com/CASR-HKU/ESDA/tree/eye_tracking.

4/23/2024

A Lightweight Spatiotemporal Network for Online Eye Tracking with Event Camera

Yan Ru Pei, Sasskia Bruers, Sébastien Crouzet, Douglas McLelland, Olivier Coenen

Event-based data are commonly encountered in edge computing environments where efficiency and low latency are critical. To interface with such data and leverage their rich temporal features, we propose a causal spatiotemporal convolutional network. This solution targets efficient implementation on edge-appropriate hardware with limited resources in three ways: 1) it deliberately targets a simple architecture and set of operations (convolutions, ReLU activations); 2) it can be configured to perform online inference efficiently via buffering of layer outputs; 3) it can achieve more than 90% activation sparsity through regularization during training, enabling very significant efficiency gains on event-based processors. In addition, we propose a general affine augmentation strategy acting directly on the events, which alleviates the problem of dataset scarcity for event-based systems. We apply our model on the AIS 2024 event-based eye tracking challenge, reaching a score of 0.9916 p10 accuracy on the Kaggle private test set.

4/16/2024

Evaluating Image-Based Face and Eye Tracking with Event Cameras

Khadija Iddrisu, Waseem Shariff, Noel E. O'Connor, Joseph Lemley, Suzanne Little

Event cameras, also known as neuromorphic sensors, capture changes in local light intensity at the pixel level, producing asynchronously generated data termed "events". This distinct data format mitigates common issues observed in conventional cameras, like under-sampling when capturing fast-moving objects, thereby preserving critical information that might otherwise be lost. However, leveraging this data often necessitates the development of specialized, handcrafted event representations that can integrate seamlessly with conventional Convolutional Neural Networks (CNNs), considering the unique attributes of event data. In this study, we evaluate event-based face and eye tracking. The core objective of our study is to showcase the viability of integrating conventional algorithms with event-based data, transformed into a frame format while preserving the unique benefits of event cameras. To validate our approach, we constructed a frame-based event dataset by simulating events between RGB frames derived from the publicly accessible Helen Dataset. We assess its utility for face and eye detection tasks through the application of GR-YOLO -- a pioneering technique derived from YOLOv3. This evaluation includes a comparative analysis with results derived from training the dataset with YOLOv8. Subsequently, the trained models were tested on real event streams from various iterations of Prophesee's event cameras and further evaluated on the Faces in Event Stream (FES) benchmark dataset. The models trained on our dataset show good prediction performance across all the datasets obtained for validation, with a best mean Average Precision score of 0.91. Additionally, the trained models demonstrated robust performance on real event camera data under varying light conditions.

8/21/2024

Smartphone-based Eye Tracking System using Edge Intelligence and Model Optimisation

Nishan Gunawardena, Gough Yumu Lui, Jeewani Anupama Ginige, Bahman Javadi

A significant limitation of current smartphone-based eye-tracking algorithms is their low accuracy when applied to video-type visual stimuli, as they are typically trained on static images. Also, the increasing demand for real-time interactive applications like games, VR, and AR on smartphones requires overcoming the limitations posed by resource constraints such as limited computational power, battery life, and network bandwidth. Therefore, we developed two new smartphone eye-tracking techniques for video-type visuals by combining Convolutional Neural Networks (CNN) with two different Recurrent Neural Networks (RNN), namely Long Short Term Memory (LSTM) and Gated Recurrent Unit (GRU). Our CNN+LSTM and CNN+GRU models achieved an average Root Mean Square Error of 0.955cm and 1.091cm, respectively. To address the computational constraints of smartphones, we developed an edge intelligence architecture to enhance the performance of smartphone-based eye tracking. We applied various optimisation methods like quantisation and pruning to deep learning models for better energy, CPU, and memory usage on edge devices, focusing on real-time processing. Using model quantisation, the model inference time in the CNN+LSTM and CNN+GRU models was reduced by 21.72% and 19.50%, respectively, on edge devices.

8/23/2024