Effect of Kernel Size on CNN-Vision-Transformer-Based Gaze Prediction Using Electroencephalography Data

Read original: arXiv:2408.03478 - Published 8/9/2024 by Chuhui Qiu, Bugao Liang, Matthew L Key

Effect of Kernel Size on CNN-Vision-Transformer-Based Gaze Prediction Using Electroencephalography Data

Overview

The provided paper investigates the effect of kernel size on a CNN-Vision-Transformer-based model for predicting gaze using electroencephalography (EEG) data.
The researchers explore how adjusting the kernel size in the CNN component of the model impacts its performance in estimating where a person is looking.
EEG data, which captures electrical activity in the brain, is used as the input to the model to predict gaze without relying on eye-tracking hardware.

Plain English Explanation

The paper looks at a machine learning model that can predict where someone is looking, based on data collected from sensors that measure the brain's electrical activity. Specifically, the model uses a combination of convolutional neural networks (CNNs) and transformers, two common types of deep learning architectures.

The key part the researchers investigated was the size of the "kernels" used in the CNN component of the model. Kernels are mathematical filters that the CNN uses to extract relevant features from the input data (in this case, the EEG signals). The researchers tried out different kernel sizes to see how that affected the model's ability to accurately predict where the person was gazing.

Predicting gaze, or where someone is looking, is an important task for various applications like human-computer interaction, virtual/augmented reality, and assistive technologies. By using EEG data instead of eye-tracking hardware, this approach offers a more convenient and potentially more accessible way to estimate gaze compared to traditional methods.

Technical Explanation

The paper presents a CNN-Vision-Transformer-based model for predicting gaze using electroencephalography (EEG) data. The model combines a convolutional neural network (CNN) and a vision transformer to extract relevant features from the EEG signals and make the gaze predictions.

The key focus of the research was to investigate the effect of kernel size in the CNN component of the model. The researchers experimented with different kernel sizes, ranging from 3x3 to 11x11, to understand how this hyperparameter impacts the model's gaze prediction performance.

The experiments were conducted on a publicly available EEG dataset collected during a visual attention task. The model's performance was evaluated using metrics like accuracy, precision, recall, and F1-score.

The results showed that the kernel size had a significant influence on the model's performance. Larger kernel sizes, such as 9x9 and 11x11, generally led to better gaze prediction accuracy compared to smaller kernel sizes like 3x3 and 5x5. The researchers hypothesize that the larger kernels are better able to capture the spatial and temporal relationships in the EEG data, which are crucial for predicting gaze.

Critical Analysis

The paper provides a thorough investigation of the impact of kernel size on a CNN-Vision-Transformer-based model for EEG-based gaze prediction. The researchers have carefully designed their experiments and provided a detailed analysis of the results.

One potential limitation of the study is the use of a single EEG dataset, which may limit the generalizability of the findings. It would be valuable to validate the results on additional datasets to ensure the robustness of the observed trends.

Additionally, the paper does not explore the potential trade-offs between kernel size and other model characteristics, such as computational complexity, training time, or memory requirements. Investigating these aspects could provide a more comprehensive understanding of the practical implications of kernel size selection.

Further research could also explore the combination of different kernel sizes within the same CNN architecture, or the integration of additional preprocessing techniques or feature engineering approaches to enhance the model's performance.

Conclusion

The paper presents an insightful investigation into the effect of kernel size on a CNN-Vision-Transformer-based model for predicting gaze using EEG data. The results demonstrate that larger kernel sizes, such as 9x9 and 11x11, generally lead to improved gaze prediction accuracy compared to smaller kernel sizes.

These findings contribute to the ongoing research in EEG-based gaze prediction and have potential implications for the development of more accurate and user-friendly gaze estimation systems that can leverage brain activity data without the need for specialized eye-tracking hardware. Further research in this direction could help advance the field of human-computer interaction and enable novel applications in areas like assistive technology, virtual/augmented reality, and cognitive neuroscience.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Follow @aimodelsfyi on 𝕏 →

Related Papers

Effect of Kernel Size on CNN-Vision-Transformer-Based Gaze Prediction Using Electroencephalography Data

Chuhui Qiu, Bugao Liang, Matthew L Key

In this paper, we present an algorithm of gaze prediction from Electroencephalography (EEG) data. EEG-based gaze prediction is a new research topic that can serve as an alternative to traditional video-based eye-tracking. Compared to the existing state-of-the-art (SOTA) method, we improved the root mean-squared-error of EEG-based gaze prediction to 53.06 millimeters, while reducing the training time to less than 33% of its original duration. Our source code can be found at https://github.com/AmCh-Q/CSCI6907Project

8/9/2024

Advancing EEG-Based Gaze Prediction Using Depthwise Separable Convolution and Enhanced Pre-Processing

Matthew L Key, Tural Mehtiyev, Xiaodong Qu

In the field of EEG-based gaze prediction, the application of deep learning to interpret complex neural data poses significant challenges. This study evaluates the effectiveness of pre-processing techniques and the effect of additional depthwise separable convolution on EEG vision transformers (ViTs) in a pretrained model architecture. We introduce a novel method, the EEG Deeper Clustered Vision Transformer (EEG-DCViT), which combines depthwise separable convolutional neural networks (CNNs) with vision transformers, enriched by a pre-processing strategy involving data clustering. The new approach demonstrates superior performance, establishing a new benchmark with a Root Mean Square Error (RMSE) of 51.6 mm. This achievement underscores the impact of pre-processing and model refinement in enhancing EEG-based applications.

8/9/2024

EEGMobile: Enhancing Speed and Accuracy in EEG-Based Gaze Prediction with Advanced Mobile Architectures

Teng Liang, Andrews Damoah

Electroencephalography (EEG) analysis is an important domain in the realm of Brain-Computer Interface (BCI) research. To ensure BCI devices are capable of providing practical applications in the real world, brain signal processing techniques must be fast, accurate, and resource-conscious to deliver low-latency neural analytics. This study presents a model that leverages a pre-trained MobileViT alongside Knowledge Distillation (KD) for EEG regression tasks. Our results showcase that this model is capable of performing at a level comparable (only 3% lower) to the previous State-Of-The-Art (SOTA) on the EEGEyeNet Absolute Position Task while being 33% faster and 60% smaller. Our research presents a cost-effective model applicable to resource-constrained devices and contributes to expanding future research on lightweight, mobile-friendly models for EEG regression.

8/9/2024

Smartphone-based Eye Tracking System using Edge Intelligence and Model Optimisation

Nishan Gunawardena, Gough Yumu Lui, Jeewani Anupama Ginige, Bahman Javadi

A significant limitation of current smartphone-based eye-tracking algorithms is their low accuracy when applied to video-type visual stimuli, as they are typically trained on static images. Also, the increasing demand for real-time interactive applications like games, VR, and AR on smartphones requires overcoming the limitations posed by resource constraints such as limited computational power, battery life, and network bandwidth. Therefore, we developed two new smartphone eye-tracking techniques for video-type visuals by combining Convolutional Neural Networks (CNN) with two different Recurrent Neural Networks (RNN), namely Long Short Term Memory (LSTM) and Gated Recurrent Unit (GRU). Our CNN+LSTM and CNN+GRU models achieved an average Root Mean Square Error of 0.955cm and 1.091cm, respectively. To address the computational constraints of smartphones, we developed an edge intelligence architecture to enhance the performance of smartphone-based eye tracking. We applied various optimisation methods like quantisation and pruning to deep learning models for better energy, CPU, and memory usage on edge devices, focusing on real-time processing. Using model quantisation, the model inference time in the CNN+LSTM and CNN+GRU models was reduced by 21.72% and 19.50%, respectively, on edge devices.

8/23/2024