Understanding and Modeling the Effects of Task and Context on Drivers' Gaze Allocation

2310.09275

Published 4/16/2024 by Iuliia Kotseruba, John K. Tsotsos


Abstract

To further advance driver monitoring and assistance systems, it is important to understand how drivers allocate their attention, in other words, where do they tend to look and why. Traditionally, factors affecting human visual attention have been divided into bottom-up (involuntary attraction to salient regions) and top-down (driven by the demands of the task being performed). Although both play a role in directing drivers' gaze, most of the existing models for drivers' gaze prediction apply techniques developed for bottom-up saliency and do not consider influences of the drivers' actions explicitly. Likewise, common driving attention benchmarks lack relevant annotations for drivers' actions and the context in which they are performed. Therefore, to enable analysis and modeling of these factors for drivers' gaze prediction, we propose the following: 1) we correct the data processing pipeline used in DR(eye)VE to reduce noise in the recorded gaze data; 2) we then add per-frame labels for driving task and context; 3) we benchmark a number of baseline and SOTA models for saliency and driver gaze prediction and use new annotations to analyze how their performance changes in scenarios involving different tasks; and, lastly, 4) we develop a novel model that modulates drivers' gaze prediction with explicit action and context information. While reducing noise in the DR(eye)VE gaze data improves results of all models, we show that using task information in our proposed model boosts performance even further compared to bottom-up models on the cleaned-up data, both overall (by 24% KLD and 89% NSS) and on scenarios that involve performing safety-critical maneuvers and crossing intersections (by up to 10-30% KLD). Extended annotations and code are available at https://github.com/ykotseruba/SCOUT.


Overview

The paper aims to better understand how drivers allocate their visual attention, which is crucial for developing improved driver monitoring and assistance systems.
It proposes a novel model that considers the driver's current actions and the surrounding context to predict their gaze behavior, going beyond traditional bottom-up saliency models.
The paper also improves the quality of the DR(eye)VE dataset, a common benchmark for driver attention research, by correcting processing issues and adding annotations for driving tasks and context.

Plain English Explanation

When we drive, our eyes are constantly scanning the road, looking at different things for different reasons. Understanding this visual attention is crucial for developing systems that can monitor drivers and provide assistance when needed.

The researchers in this paper wanted to take a closer look at where drivers tend to look and why. Traditionally, there are two main factors that influence human visual attention: bottom-up (being automatically drawn to salient, eye-catching regions) and top-down (focusing on things relevant to the task at hand).

Most existing models for predicting drivers' gaze focus on the bottom-up factors, without considering the impact of the driver's actions and the driving context. To address this, the researchers proposed a new model that explicitly incorporates information about the driver's current task and the surrounding environment.

The researchers also improved the quality of a widely used dataset called DR(eye)VE, which contains recordings of drivers' gaze behavior. They cleaned up some issues in the data processing and added new annotations about the driving tasks and context.

By using this enhanced dataset, the researchers showed that their new model outperforms traditional bottom-up saliency models, especially in scenarios involving safety-critical maneuvers and intersections. This suggests that considering the driver's actions and the driving context is crucial for accurately predicting where they will focus their attention.

Technical Explanation

The paper first identifies a key limitation in existing models for predicting drivers' visual attention: they primarily rely on bottom-up saliency, without explicitly considering the driver's actions and the driving context.

To address this, the researchers propose a novel model that modulates gaze prediction based on the driver's current task and the surrounding environment. They start by correcting issues in the data processing pipeline of the DR(eye)VE dataset, a common benchmark for driver attention research, to reduce noise in the recorded gaze data.
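To make the idea of "modulating" a prediction concrete, here is a toy sketch. It is not the architecture from the paper, which is a learned model; it only illustrates the general principle with a hand-written multiplicative prior. The function name, the 1x3 "scene", and the prior values are all assumptions made for illustration.

```python
def modulate_with_task_prior(saliency, task_prior):
    """Multiply a bottom-up saliency map by a task prior, then renormalize to sum to 1.

    Both arguments are 2D maps given as lists of rows with identical shapes.
    This is a simplified stand-in for a learned top-down mechanism.
    """
    if len(saliency) != len(task_prior) or any(
        len(s_row) != len(t_row) for s_row, t_row in zip(saliency, task_prior)
    ):
        raise ValueError("saliency and task_prior must have the same shape")
    weighted = [
        [s * t for s, t in zip(s_row, t_row)]
        for s_row, t_row in zip(saliency, task_prior)
    ]
    total = sum(sum(row) for row in weighted)
    if total == 0:
        raise ValueError("modulated map is all zeros and cannot be normalized")
    return [[v / total for v in row] for row in weighted]


# Hypothetical 1x3 scene: left, centre, right.
bottom_up = [[0.2, 0.6, 0.2]]
# Assumed prior: while turning left, the left region is three times as relevant.
left_turn_prior = [[3.0, 1.0, 1.0]]
print(modulate_with_task_prior(bottom_up, left_turn_prior))
```

With these made-up numbers, the predicted gaze mass shifts toward the left of the scene while the map remains a valid distribution. A learned model would infer such task-dependent weighting from data rather than from a fixed prior.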

The researchers then add per-frame annotations for driving tasks and context to the DR(eye)VE dataset. With this enhanced dataset, they benchmark a range of baseline and state-of-the-art models for saliency and driver gaze prediction, analyzing how their performance changes in scenarios involving different driving tasks.
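Per-frame labels make it possible to break a single overall score into per-scenario scores. The sketch below shows one simple way to do that grouping; the label names and score values are hypothetical and are not taken from the paper or its annotation files.

```python
from collections import defaultdict


def mean_score_by_label(frame_scores, frame_labels):
    """Average a per-frame metric separately for each annotation label."""
    if len(frame_scores) != len(frame_labels):
        raise ValueError("need exactly one label per frame")
    grouped = defaultdict(list)
    for score, label in zip(frame_scores, frame_labels):
        grouped[label].append(score)
    return {label: sum(vals) / len(vals) for label, vals in grouped.items()}


# Hypothetical per-frame KLD values and task labels (illustrative only).
scores = [0.5, 0.7, 1.5, 1.3, 0.6]
labels = ["drive_straight", "drive_straight", "turn_left", "turn_left", "drive_straight"]
print(mean_score_by_label(scores, labels))
```

In this made-up example the error on turning frames is much higher than on straight driving, a difference that an overall average would hide.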

The results show that the researchers' proposed model, which incorporates explicit action and context information, outperforms the bottom-up saliency models, both overall (by 24% KLD and 89% NSS) and on scenarios with safety-critical maneuvers and intersection crossings (by up to 10-30% KLD). This suggests that considering the driver's actions and the driving context is crucial for accurately predicting their visual attention.
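For readers unfamiliar with the two metrics, the following is a minimal sketch of how KLD (Kullback-Leibler divergence, lower is better) and NSS (normalized scanpath saliency, higher is better) are commonly computed in saliency evaluation. It follows the usual textbook definitions and may differ in detail from the evaluation code used in the paper; the function names and the toy 3x3 maps are assumptions.

```python
import math

EPS = 1e-7  # small constant to avoid division by zero and log(0)


def normalize_to_distribution(saliency_map):
    """Flatten a 2D map (list of rows) and scale it so the values sum to about 1."""
    flat = [v for row in saliency_map for v in row]
    total = sum(flat)
    return [v / (total + EPS) for v in flat]


def kld(predicted, ground_truth):
    """KL divergence between ground-truth and predicted gaze distributions."""
    p = normalize_to_distribution(predicted)
    q = normalize_to_distribution(ground_truth)
    return sum(qi * math.log(EPS + qi / (pi + EPS)) for pi, qi in zip(p, q))


def nss(predicted, fixations):
    """Mean z-scored prediction value at fixated pixels.

    `fixations` is a list of (row, col) gaze locations.
    """
    flat = [v for row in predicted for v in row]
    mean = sum(flat) / len(flat)
    std = math.sqrt(sum((v - mean) ** 2 for v in flat) / len(flat))
    z_at_fixations = [(predicted[r][c] - mean) / (std + EPS) for r, c in fixations]
    return sum(z_at_fixations) / len(z_at_fixations)


# Toy 3x3 example: the prediction peaks at the centre, where the driver looked.
pred = [[0.0, 0.1, 0.0], [0.1, 0.6, 0.1], [0.0, 0.1, 0.0]]
gt = [[0.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 0.0]]
print("KLD:", kld(pred, gt))
print("NSS:", nss(pred, [(1, 1)]))
```

Because KLD falls and NSS rises as predictions improve, an "improvement" on KLD is a reduction and an improvement on NSS is an increase.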

Critical Analysis

The paper takes a valuable step towards understanding the complex factors that influence driver attention, going beyond traditional bottom-up saliency models. By incorporating information about the driver's actions and the driving context, the proposed model demonstrates improved performance, particularly in safety-critical scenarios.

However, the paper also acknowledges some limitations in the DR(eye)VE dataset, such as the potential for noise in the gaze data and the need for more diverse driving scenarios. Additionally, the paper does not explore the generalizability of the proposed model to other datasets or real-world driving conditions.

Further research could investigate how the model's performance scales with larger and more diverse datasets, as well as how it might be integrated into practical driver monitoring and assistance systems. Exploring the use of spatio-temporal attention mechanisms could also be a fruitful avenue for improving the model's ability to capture the dynamic nature of driver attention.

Overall, this paper makes a valuable contribution to the field of driver monitoring and attention modeling, and its findings highlight the importance of considering higher-level, top-down factors in understanding and predicting human visual attention in complex, real-world tasks.

Conclusion

This paper presents a novel approach to modeling driver visual attention that goes beyond traditional bottom-up saliency models. By incorporating information about the driver's current actions and the surrounding driving context, the researchers' proposed model demonstrates improved performance, particularly in safety-critical scenarios.

The study also includes important improvements to the widely used DR(eye)VE dataset, which should benefit future research in this area. While the paper acknowledges some limitations, its findings underscore the need to consider top-down, task-driven factors when developing advanced driver monitoring and assistance systems.

Overall, this work represents a significant step forward in understanding the complex interplay of factors that influence human visual attention in dynamic, real-world environments like driving. The insights and methodologies presented here have the potential to inform the development of more reliable and effective driver assistance technologies, ultimately enhancing road safety for all.

This summary was produced with help from an AI and may contain inaccuracies; consult the original paper to verify details.

Related Papers

SCOUT+: Towards Practical Task-Driven Drivers' Gaze Prediction

Iuliia Kotseruba, John K. Tsotsos

Accurate prediction of drivers' gaze is an important component of vision-based driver monitoring and assistive systems. Of particular interest are safety-critical episodes, such as performing maneuvers or crossing intersections. In such scenarios, drivers' gaze distribution changes significantly and becomes difficult to predict, especially if the task and context information is represented implicitly, as is common in many state-of-the-art models. However, explicit modeling of top-down factors affecting drivers' attention often requires additional information and annotations that may not be readily available. In this paper, we address the challenge of effective modeling of task and context with common sources of data for use in practical systems. To this end, we introduce SCOUT+, a task- and context-aware model for drivers' gaze prediction, which leverages route and map information inferred from commonly available GPS data. We evaluate our model on two datasets, DR(eye)VE and BDD-A, and demonstrate that using maps improves results compared to bottom-up models and reaches performance comparable to the top-down model SCOUT which relies on privileged ground truth information. Code is available at https://github.com/ykotseruba/SCOUT.

4/16/2024

cs.CV

Data Limitations for Modeling Top-Down Effects on Drivers' Attention

Iuliia Kotseruba, John K. Tsotsos

Driving is a visuomotor task, i.e., there is a connection between what drivers see and what they do. While some models of drivers' gaze account for top-down effects of drivers' actions, the majority learn only bottom-up correlations between human gaze and driving footage. The crux of the problem is lack of public data with annotations that could be used to train top-down models and evaluate how well models of any kind capture effects of task on attention. As a result, top-down models are trained and evaluated on private data and public benchmarks measure only the overall fit to human data. In this paper, we focus on data limitations by examining four large-scale public datasets, DR(eye)VE, BDD-A, MAAD, and LBW, used to train and evaluate algorithms for drivers' gaze prediction. We define a set of driving tasks (lateral and longitudinal maneuvers) and context elements (intersections and right-of-way) known to affect drivers' attention, augment the datasets with annotations based on the said definitions, and analyze the characteristics of data recording and processing pipelines w.r.t. capturing what the drivers see and do. In sum, the contributions of this work are: 1) quantifying biases of the public datasets, 2) examining performance of the SOTA bottom-up models on subsets of the data involving non-trivial drivers' actions, 3) linking shortcomings of the bottom-up models to data limitations, and 4) recommendations for future data collection and processing. The new annotations and code for reproducing the results are available at https://github.com/ykotseruba/SCOUT.

4/16/2024

cs.CV

Driver Attention Tracking and Analysis

Dat Viet Thanh Nguyen, Anh Tran, Hoai Nam Vu, Cuong Pham, Minh Hoai

We propose a novel method to estimate a driver's points-of-gaze using a pair of ordinary cameras mounted on the windshield and dashboard of a car. This is a challenging problem due to the dynamics of traffic environments with 3D scenes of unknown depths. This problem is further complicated by the volatile distance between the driver and the camera system. To tackle these challenges, we develop a novel convolutional network that simultaneously analyzes the image of the scene and the image of the driver's face. This network has a camera calibration module that can compute an embedding vector that represents the spatial configuration between the driver and the camera system. This calibration module improves the overall network's performance, which can be jointly trained end to end. We also address the lack of annotated data for training and evaluation by introducing a large-scale driving dataset with point-of-gaze annotations. This is an in situ dataset of real driving sessions in an urban city, containing synchronized images of the driving scene as well as the face and gaze of the driver. Experiments on this dataset show that the proposed method outperforms various baseline methods, having the mean prediction error of 29.69 pixels, which is relatively small compared to the $1280 \times 720$ resolution of the scene camera.

4/12/2024

cs.CV

Guiding Attention in End-to-End Driving Models

Diego Porres, Yi Xiao, Gabriel Villalonga, Alexandre Levy, Antonio M. López

Vision-based end-to-end driving models trained by imitation learning can lead to affordable solutions for autonomous driving. However, training these well-performing models usually requires a huge amount of data, while still lacking explicit and intuitive activation maps to reveal the inner workings of these models while driving. In this paper, we study how to guide the attention of these models to improve their driving quality and obtain more intuitive activation maps by adding a loss term during training using salient semantic maps. In contrast to previous work, our method does not require these salient semantic maps to be available during testing time, as well as removing the need to modify the model's architecture to which it is applied. We perform tests using perfect and noisy salient semantic maps with encouraging results in both, the latter of which is inspired by possible errors encountered with real data. Using CIL++ as a representative state-of-the-art model and the CARLA simulator with its standard benchmarks, we conduct experiments that show the effectiveness of our method in training better autonomous driving models, especially when data and computational resources are scarce.

5/2/2024

cs.CV cs.AI