SCOUT+: Towards Practical Task-Driven Drivers' Gaze Prediction

2404.08756

Published 4/16/2024 by Iuliia Kotseruba, John K. Tsotsos

Abstract

Accurate prediction of drivers' gaze is an important component of vision-based driver monitoring and assistive systems. Of particular interest are safety-critical episodes, such as performing maneuvers or crossing intersections. In such scenarios, drivers' gaze distribution changes significantly and becomes difficult to predict, especially if the task and context information is represented implicitly, as is common in many state-of-the-art models. However, explicit modeling of top-down factors affecting drivers' attention often requires additional information and annotations that may not be readily available. In this paper, we address the challenge of effective modeling of task and context with common sources of data for use in practical systems. To this end, we introduce SCOUT+, a task- and context-aware model for drivers' gaze prediction, which leverages route and map information inferred from commonly available GPS data. We evaluate our model on two datasets, DR(eye)VE and BDD-A, and demonstrate that using maps improves results compared to bottom-up models and reaches performance comparable to the top-down model SCOUT which relies on privileged ground truth information. Code is available at https://github.com/ykotseruba/SCOUT.

Overview

This paper presents SCOUT+, a task- and context-aware model for predicting drivers' gaze.
The model represents task and context explicitly, using route and map information inferred from commonly available GPS data.
The researchers evaluate SCOUT+ on two datasets, DR(eye)VE and BDD-A, and compare it to bottom-up models and to the top-down model SCOUT.

Plain English Explanation

When we drive, our eyes are constantly scanning the road, dashboard, and other objects in our surroundings. Predicting where a driver will look can be very useful for designing safer and more intuitive in-vehicle technologies. SCOUT+: Towards Practical Task-Driven Drivers' Gaze Prediction introduces a new model called SCOUT+ that tries to forecast a driver's gaze behavior based on the specific task they are performing and the driving scenario they are in.

The key idea is that a driver's visual attention is influenced not just by what they can see, but also by what they are trying to do. For example, when a driver performs a maneuver or crosses an intersection, their gaze distribution changes significantly and becomes harder to predict. SCOUT+ aims to capture these interactions between the driver's task, the driving context, and where their eyes end up looking.

The researchers evaluate SCOUT+ on two datasets of driver behavior, DR(eye)VE and BDD-A. They show that using maps improves results compared to bottom-up models and reaches performance comparable to the top-down model SCOUT, which relies on privileged ground truth information. This suggests SCOUT+ could be a practical option for driver monitoring and assistive systems, because it needs only commonly available data.

Technical Explanation

SCOUT+: Towards Practical Task-Driven Drivers' Gaze Prediction presents a new deep learning model for forecasting where a driver will look based on the current driving scenario and the task they are trying to accomplish.

The core of SCOUT+ is a transformer-based architecture that takes in visual features of the driving environment together with task and context information. Rather than relying on manual annotations, it obtains this information from route and map data inferred from commonly available GPS data. It then outputs a probability distribution over where the driver is likely to direct their gaze in the near future.
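The abstract states that SCOUT+ uses route and map information inferred from GPS data. As a rough illustration of what such an input could look like, the sketch below converts a short GPS trace into a small north-up route image. This is not the authors' pipeline: the function names, grid size, cell resolution, and the equirectangular approximation are all assumptions made for this example.

```python
import numpy as np

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius, used by the flat-Earth approximation


def gps_to_local_xy(lat, lon, lat0, lon0):
    """Approximate metres east (x) and north (y) of the reference point (lat0, lon0)."""
    lat = np.asarray(lat, dtype=float)
    lon = np.asarray(lon, dtype=float)
    x = np.radians(lon - lon0) * EARTH_RADIUS_M * np.cos(np.radians(lat0))
    y = np.radians(lat - lat0) * EARTH_RADIUS_M
    return x, y


def rasterize_route(lat, lon, size=64, metres_per_cell=2.0):
    """Draw a GPS route into a size x size grid centred on the first point (north up)."""
    x, y = gps_to_local_xy(lat, lon, lat[0], lon[0])
    grid = np.zeros((size, size), dtype=np.float32)
    for i in range(len(x) - 1):
        # Sample each segment densely enough that no grid cell along it is skipped.
        seg_len = np.hypot(x[i + 1] - x[i], y[i + 1] - y[i])
        n = max(2, 2 * int(np.ceil(seg_len / metres_per_cell)) + 1)
        xs = np.linspace(x[i], x[i + 1], n)
        ys = np.linspace(y[i], y[i + 1], n)
        cols = size // 2 + np.floor(xs / metres_per_cell).astype(int)
        rows = size // 2 - np.floor(ys / metres_per_cell).astype(int)
        keep = (rows >= 0) & (rows < size) & (cols >= 0) & (cols < size)
        grid[rows[keep], cols[keep]] = 1.0
    return grid


# Hypothetical trace: drive north for roughly 33 m, then turn east.
route_map = rasterize_route(
    lat=[43.0, 43.0003, 43.0003],
    lon=[-79.0, -79.0, -78.9997],
)
```

An image like `route_map` could be fed to a model alongside the video frames, so that an upcoming turn is visible in the input instead of having to be inferred from the scene alone.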

To train and evaluate SCOUT+, the researchers used two datasets of driver behavior, DR(eye)VE and BDD-A. They compared SCOUT+ to bottom-up gaze prediction models and to the top-down model SCOUT. Using maps improved results over the bottom-up models and reached performance comparable to SCOUT, which relies on privileged ground truth information. Related work on this problem includes Data Limitations for Modeling Top-Down Effects on Drivers' Attention and Driver Attention Tracking and Analysis.
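The summary does not name the evaluation metrics. The related SCOUT abstract listed further down reports KLD and NSS, so the sketch below assumes those two and shows one common way to compute them for a predicted gaze map. The function names and the `eps` smoothing constant are choices made for this example, not taken from the authors' code.

```python
import numpy as np


def kld(pred, gt, eps=1e-7):
    """KL divergence between ground-truth and predicted maps (lower is better)."""
    p = pred / (pred.sum() + eps)  # normalise prediction to a distribution
    q = gt / (gt.sum() + eps)      # normalise ground truth to a distribution
    return float(np.sum(q * np.log(eps + q / (p + eps))))


def nss(pred, fixations, eps=1e-7):
    """Normalized Scanpath Saliency: mean z-scored prediction at fixated pixels."""
    z = (pred - pred.mean()) / (pred.std() + eps)
    return float(z[fixations.astype(bool)].mean())


# Toy 4x4 example with a single fixated location.
gt = np.zeros((4, 4))
gt[1, 2] = 1.0
fixations = gt > 0

good_pred = gt + 0.01         # peaked at the fixated location
uniform_pred = np.ones((4, 4))  # ignores the fixation entirely

kld_good, kld_uniform = kld(good_pred, gt), kld(uniform_pred, gt)
nss_good = nss(good_pred, fixations)
```

On this toy input the peaked prediction gets a lower KLD than the uniform one and a positive NSS, which is the direction of improvement both metrics are meant to reward.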

One key aspect of SCOUT+ is its ability to model how a driver's visual attention shifts with the task being performed and the objects in the driving scene. Its transformer-based architecture allows the model to capture these spatio-temporal dependencies.

Critical Analysis

The authors of SCOUT+: Towards Practical Task-Driven Drivers' Gaze Prediction acknowledge several limitations of their work. First, the model was trained and evaluated on relatively small datasets, which may limit its generalization to more diverse driving scenarios.

Additionally, the authors note that their approach does not explicitly model the driver's cognitive state or internal decision-making processes. Incorporating these higher-level factors could potentially improve the model's ability to predict gaze behavior in complex, real-world driving situations.

Another area for further research is exploring how SCOUT+ could be deployed in practical in-vehicle applications. The authors suggest integrating the model with real-time sensor data and exploring ways to personalize it to individual drivers' habits and preferences.

Overall, SCOUT+ represents a promising step towards more accurate and context-aware models of driver attention. However, continued research and refinement will be necessary to realize the potential of task-driven gaze prediction for improving driver safety and in-vehicle technology design.

Conclusion

SCOUT+: Towards Practical Task-Driven Drivers' Gaze Prediction introduces a new deep learning model that aims to forecast where a driver will look based on the current driving context and the task they are trying to accomplish. The key innovation of SCOUT+ is its ability to capture the complex interplay between visual attention, task-level goals, and the surrounding environment.

The researchers demonstrate on two datasets, DR(eye)VE and BDD-A, that SCOUT+ improves results compared to bottom-up models and reaches performance comparable to the top-down model SCOUT, which relies on privileged ground truth information. This suggests the model could be a valuable tool for designing more intuitive and driver-centric in-vehicle technologies, with potential applications in areas like driver monitoring, augmented reality displays, and adaptive user interfaces.

While SCOUT+ represents an important step forward, the authors identify several avenues for future work, such as expanding the model's training data, incorporating higher-level cognitive factors, and developing practical deployment strategies. Continued research in this direction could lead to significant improvements in our ability to understand and predict human visual attention in complex, dynamic environments like driving.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Understanding and Modeling the Effects of Task and Context on Drivers' Gaze Allocation

Iuliia Kotseruba, John K. Tsotsos

To further advance driver monitoring and assistance systems, it is important to understand how drivers allocate their attention, in other words, where do they tend to look and why. Traditionally, factors affecting human visual attention have been divided into bottom-up (involuntary attraction to salient regions) and top-down (driven by the demands of the task being performed). Although both play a role in directing drivers' gaze, most of the existing models for drivers' gaze prediction apply techniques developed for bottom-up saliency and do not consider influences of the drivers' actions explicitly. Likewise, common driving attention benchmarks lack relevant annotations for drivers' actions and the context in which they are performed. Therefore, to enable analysis and modeling of these factors for drivers' gaze prediction, we propose the following: 1) we correct the data processing pipeline used in DR(eye)VE to reduce noise in the recorded gaze data; 2) we then add per-frame labels for driving task and context; 3) we benchmark a number of baseline and SOTA models for saliency and driver gaze prediction and use new annotations to analyze how their performance changes in scenarios involving different tasks; and, lastly, 4) we develop a novel model that modulates drivers' gaze prediction with explicit action and context information. While reducing noise in the DR(eye)VE gaze data improves results of all models, we show that using task information in our proposed model boosts performance even further compared to bottom-up models on the cleaned up data, both overall (by 24% KLD and 89% NSS) and on scenarios that involve performing safety-critical maneuvers and crossing intersections (by up to 10-30% KLD). Extended annotations and code are available at https://github.com/ykotseruba/SCOUT.

4/16/2024

cs.CV

Data Limitations for Modeling Top-Down Effects on Drivers' Attention

Iuliia Kotseruba, John K. Tsotsos

Driving is a visuomotor task, i.e., there is a connection between what drivers see and what they do. While some models of drivers' gaze account for top-down effects of drivers' actions, the majority learn only bottom-up correlations between human gaze and driving footage. The crux of the problem is lack of public data with annotations that could be used to train top-down models and evaluate how well models of any kind capture effects of task on attention. As a result, top-down models are trained and evaluated on private data and public benchmarks measure only the overall fit to human data. In this paper, we focus on data limitations by examining four large-scale public datasets, DR(eye)VE, BDD-A, MAAD, and LBW, used to train and evaluate algorithms for drivers' gaze prediction. We define a set of driving tasks (lateral and longitudinal maneuvers) and context elements (intersections and right-of-way) known to affect drivers' attention, augment the datasets with annotations based on the said definitions, and analyze the characteristics of data recording and processing pipelines w.r.t. capturing what the drivers see and do. In sum, the contributions of this work are: 1) quantifying biases of the public datasets, 2) examining performance of the SOTA bottom-up models on subsets of the data involving non-trivial drivers' actions, 3) linking shortcomings of the bottom-up models to data limitations, and 4) recommendations for future data collection and processing. The new annotations and code for reproducing the results are available at https://github.com/ykotseruba/SCOUT.

4/16/2024

cs.CV

Driver Attention Tracking and Analysis

Dat Viet Thanh Nguyen, Anh Tran, Hoai Nam Vu, Cuong Pham, Minh Hoai

We propose a novel method to estimate a driver's points-of-gaze using a pair of ordinary cameras mounted on the windshield and dashboard of a car. This is a challenging problem due to the dynamics of traffic environments with 3D scenes of unknown depths. This problem is further complicated by the volatile distance between the driver and the camera system. To tackle these challenges, we develop a novel convolutional network that simultaneously analyzes the image of the scene and the image of the driver's face. This network has a camera calibration module that can compute an embedding vector that represents the spatial configuration between the driver and the camera system. This calibration module improves the overall network's performance, which can be jointly trained end to end. We also address the lack of annotated data for training and evaluation by introducing a large-scale driving dataset with point-of-gaze annotations. This is an in situ dataset of real driving sessions in an urban city, containing synchronized images of the driving scene as well as the face and gaze of the driver. Experiments on this dataset show that the proposed method outperforms various baseline methods, having the mean prediction error of 29.69 pixels, which is relatively small compared to the $1280 \times 720$ resolution of the scene camera.

4/12/2024

cs.CV

Guiding Attention in End-to-End Driving Models

Diego Porres, Yi Xiao, Gabriel Villalonga, Alexandre Levy, Antonio M. López

Vision-based end-to-end driving models trained by imitation learning can lead to affordable solutions for autonomous driving. However, training these well-performing models usually requires a huge amount of data, while still lacking explicit and intuitive activation maps to reveal the inner workings of these models while driving. In this paper, we study how to guide the attention of these models to improve their driving quality and obtain more intuitive activation maps by adding a loss term during training using salient semantic maps. In contrast to previous work, our method does not require these salient semantic maps to be available during testing time, as well as removing the need to modify the model's architecture to which it is applied. We perform tests using perfect and noisy salient semantic maps with encouraging results in both, the latter of which is inspired by possible errors encountered with real data. Using CIL++ as a representative state-of-the-art model and the CARLA simulator with its standard benchmarks, we conduct experiments that show the effectiveness of our method in training better autonomous driving models, especially when data and computational resources are scarce.

5/2/2024

cs.CV cs.AI