Spatio-Temporal Attention and Gaussian Processes for Personalized Video Gaze Estimation

2404.05215

Published 4/11/2024 by Swati Jindal, Mohit Yadav, Roberto Manduchi

Abstract

Gaze is an essential prompt for analyzing human behavior and attention. Recently, there has been an increasing interest in determining gaze direction from facial videos. However, video gaze estimation faces significant challenges, such as understanding the dynamic evolution of gaze in video sequences, dealing with static backgrounds, and adapting to variations in illumination. To address these challenges, we propose a simple and novel deep learning model designed to estimate gaze from videos, incorporating a specialized attention module. Our method employs a spatial attention mechanism that tracks spatial dynamics within videos. This technique enables accurate gaze direction prediction through a temporal sequence model, adeptly transforming spatial observations into temporal insights, thereby significantly improving gaze estimation accuracy. Additionally, our approach integrates Gaussian processes to include individual-specific traits, facilitating the personalization of our model with just a few labeled samples. Experimental results confirm the efficacy of the proposed approach, demonstrating its success in both within-dataset and cross-dataset settings. Specifically, our proposed approach achieves state-of-the-art performance on the Gaze360 dataset, improving by $2.5^\circ$ without personalization. Further, by personalizing the model with just three samples, we achieved an additional improvement of $0.8^\circ$. The code and pre-trained models are available at https://github.com/jswati31/stage.

Overview

This paper proposes a deep learning model that estimates gaze direction from facial videos, combining spatio-temporal attention with Gaussian processes for personalization.
The approach aims to improve gaze estimation accuracy by modeling how gaze evolves across a video sequence while coping with static backgrounds and variations in illumination.
A spatial attention mechanism tracks spatial dynamics within the video, a temporal sequence model turns those observations into gaze predictions, and Gaussian processes adapt the model to an individual using only a few labeled samples.

Plain English Explanation

The paper discusses a new way to work out where a person is looking from a video of their face. Tracking someone's gaze, or where their eyes are focused, can be useful for things like user interface design, advertising, and understanding human behavior. However, gaze behavior can be complex and vary from person to person.

The researchers developed a system that uses "spatio-temporal attention" to pick out the informative regions of each video frame and follow how they change over time. This is similar to how some language models use attention mechanisms to capture relevant parts of the input. The system also uses "Gaussian processes" to personalize the gaze estimates for each individual. Gaussian processes are a flexible machine learning tool that can help account for individual differences, similar to how some avatar generation models use them to create personalized avatars.

The goal is to create a more accurate and personalized way to estimate where people are looking from video, which could have applications in areas like user interface design, advertising, and understanding human behavior and intentions.

Technical Explanation

The paper introduces a novel method for personalized video gaze estimation that combines spatio-temporal attention and Gaussian processes. The key elements of the approach are:

Spatio-Temporal Attention: A spatial attention mechanism tracks spatial dynamics within the video, and a temporal sequence model converts these per-frame observations into a gaze direction prediction. This targets the challenges the authors name: the dynamic evolution of gaze, static backgrounds, and variations in illumination.
Gaussian Processes: The system employs Gaussian processes to capture individual-specific traits. Gaussian processes are a flexible probabilistic framework that can account for the unique characteristics of each person's gaze behavior.
Personalization: By integrating the attention and Gaussian process components, the method provides personalized video gaze estimation that adapts to an individual using just a few labeled samples.
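The attention-then-sequence-model pipeline described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes per-frame feature maps of shape (T, H, W, C) coming from some backbone, uses hypothetical weight vectors `w_score` and `w_out`, and swaps the paper's temporal sequence model for a simple exponential moving average to stay dependency-light.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def spatial_attention_pool(feats, w_score):
    """Attention-weighted pooling over the spatial grid of each frame.

    feats:   (T, H, W, C) per-frame feature maps (assumed backbone output).
    w_score: (C,) hypothetical scoring vector.
    Returns pooled features (T, C) and attention maps (T, H, W).
    """
    T, H, W, C = feats.shape
    flat = feats.reshape(T, H * W, C)
    scores = flat @ w_score                    # (T, H*W) one score per location
    attn = softmax(scores, axis=1)             # weights sum to 1 within each frame
    pooled = np.einsum("tn,tnc->tc", attn, flat)
    return pooled, attn.reshape(T, H, W)

def temporal_smooth(seq, alpha=0.5):
    """Exponential moving average over time (stand-in for a sequence model)."""
    out = np.empty_like(seq)
    out[0] = seq[0]
    for t in range(1, len(seq)):
        out[t] = alpha * seq[t] + (1 - alpha) * out[t - 1]
    return out

def predict_gaze(feats, w_score, w_out):
    """Return per-frame 2D gaze outputs (T, 2) and attention maps (T, H, W)."""
    pooled, attn = spatial_attention_pool(feats, w_score)
    hidden = temporal_smooth(pooled)
    return hidden @ w_out, attn
```

In a trained model the weights would be learned end to end; here they are placeholders, so only the shapes and the data flow are meaningful.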

The researchers evaluate their approach in both within-dataset and cross-dataset settings. On the Gaze360 dataset, it achieves state-of-the-art performance, improving by 2.5° without personalization, and personalizing the model with just three samples yields a further improvement of 0.8°. The results highlight the benefits of modeling the dynamic, personalized nature of human gaze behavior for video-based applications.
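Gaze estimation results are reported in degrees. A minimal sketch of the metric follows, assuming (as is common in gaze estimation, though this summary does not spell it out) that the error is the angle between the predicted and ground-truth 3D gaze direction vectors. The function name is hypothetical.

```python
import numpy as np

def angular_error_deg(pred, target):
    """Angle in degrees between gaze vectors along the last axis.

    pred, target: arrays of shape (..., 3); they need not be unit length.
    """
    pred = np.asarray(pred, dtype=float)
    target = np.asarray(target, dtype=float)
    pred = pred / np.linalg.norm(pred, axis=-1, keepdims=True)
    target = target / np.linalg.norm(target, axis=-1, keepdims=True)
    # Clip guards against values slightly outside [-1, 1] from rounding.
    cos = np.clip(np.sum(pred * target, axis=-1), -1.0, 1.0)
    return np.degrees(np.arccos(cos))
```

Averaging this value over a test set gives a mean angular error, the kind of figure in which a 2.5° improvement would be expressed.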

Critical Analysis

The paper presents a promising approach to improving video gaze estimation, but it does acknowledge some limitations and areas for further research. For example, the method relies on the availability of ground truth gaze data for training, which may not always be easily obtained. Additionally, as with many deep learning-based approaches, the model's performance may be sensitive to the quality and diversity of the training data.

Another potential issue is the computational complexity of the Gaussian process component, which could limit the scalability of the approach for large-scale applications. The authors suggest exploring alternative personalization techniques or ways to optimize the Gaussian process computations as future research directions.
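To make the cost concrete, here is a minimal sketch of Gaussian process personalization. It is not the authors' formulation: it assumes the Gaussian process models the residual between a frozen base model's prediction and the true gaze label as a function of some feature vector, with an RBF kernel. All names and hyperparameters are hypothetical. Exact Gaussian process regression needs a Cholesky factorization that scales cubically with the number of labeled samples, which is negligible for a handful of samples and becomes the bottleneck only as that set grows.

```python
import numpy as np

def rbf_kernel(a, b, length_scale=1.0, variance=1.0):
    """Squared-exponential kernel between rows of a (n, d) and b (m, d)."""
    sq = np.sum(a**2, axis=1)[:, None] + np.sum(b**2, axis=1)[None, :] - 2 * a @ b.T
    return variance * np.exp(-0.5 * np.maximum(sq, 0.0) / length_scale**2)

def fit_gp_residual(x_few, y_few, base_pred_few, noise=1e-2, **kern):
    """Fit a GP to the residuals of a few labeled samples.

    x_few:         (n, d) features of the labeled samples (assumed input).
    y_few:         (n, k) ground-truth gaze labels.
    base_pred_few: (n, k) predictions of the non-personalized model.
    """
    resid = y_few - base_pred_few
    K = rbf_kernel(x_few, x_few, **kern) + noise * np.eye(len(x_few))
    L = np.linalg.cholesky(K)                  # O(n^3) in the number of samples
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, resid))
    return x_few, alpha, kern

def personalize(model, x_new, base_pred_new):
    """Add the GP posterior-mean correction to the base predictions."""
    x_few, alpha, kern = model
    return base_pred_new + rbf_kernel(x_new, x_few, **kern) @ alpha
```

With three labeled samples the kernel matrix is 3 by 3, so the correction is cheap to fit and apply; far from any labeled sample the correction decays to zero and the base prediction is returned unchanged.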

Overall, the paper makes a valuable contribution to the field of gaze estimation, but further research is needed to address the identified limitations and explore the real-world applicability of the method in diverse settings.

Conclusion

This paper presents a novel approach to personalized video gaze estimation that combines spatio-temporal attention and Gaussian processes. By modeling the dynamic, personalized nature of human gaze behavior, the method demonstrates improved performance compared to existing techniques. The proposed system has potential applications in areas such as user interface design, advertising, and understanding human intentions, but further research is needed to address the identified limitations and optimize the approach for practical deployment. The work represents an exciting step forward in enhancing the accuracy and personalization of eye tracking technology.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!
