OAT: Object-Level Attention Transformer for Gaze Scanpath Prediction

Read original: arXiv:2407.13335 - Published 7/19/2024 by Yini Fang, Jingling Yu, Haozheng Zhang, Ralf van der Lans, Bertram Shi

Overview

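This paper introduces OAT (Object-Level Attention Transformer), a deep learning model that predicts human gaze scanpaths during visual search, where a viewer looks for a target object within a cluttered scene of distractors.
The model works at the object level rather than the pixel level: it encodes the position and appearance of the objects in the image, together with the target, and predicts the scanpath as a sequence of object fixations.
The authors evaluate OAT on the Amazon book cover dataset and on a new visual search dataset they collected, where its predicted scanpaths align more closely with human gaze patterns than those of algorithms based on spatial attention.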

Plain English Explanation

When we look at an image, our eyes tend to jump around from one interesting object or area to another in a specific pattern called a "scanpath." Predicting these scanpaths can be useful for applications like improving user interfaces, advertising, and virtual reality.

The OAT model proposed in this paper tries to mimic how humans focus their attention on different objects in an image and how that guides their gaze. It does this by using a type of deep learning architecture called a "transformer" that can model the relationships between the objects.

By understanding these object-level relationships, the OAT model can more accurately predict where a person's eyes will move when searching a new image. This is an improvement over previous methods that modelled attention at the pixel level, for example with a saliency map, without treating objects as the units that guide attention.

The authors show that OAT's predicted scanpaths match human gaze more closely than spatial-attention algorithms on the Amazon book cover dataset and on a newly collected visual search dataset. This suggests the model is capturing important aspects of human visual attention that were missing from earlier approaches.

Technical Explanation

The key innovation in the OAT model is the use of "object-level attention" to predict gaze scanpaths. Rather than treating the image as a grid of pixels, the model works with the individual objects in the scene: its encoder captures the position and appearance of each object, along with information about the search target.

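OAT uses a transformer encoder-decoder architecture to model the relationships between these objects. The decoder predicts the scanpath as a sequence of object fixations by integrating output features from both the encoder and the decoder, and the authors also propose a new positional encoding that better reflects the spatial relationships between objects. This lets the model capture how the viewer's attention shifts from one object to the next, which is a crucial aspect of scanpath prediction.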
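To make the idea of attention between objects concrete, here is a minimal sketch of single-head scaled dot-product self-attention over a handful of object tokens. It is an illustration under stated assumptions, not the authors' implementation: the token layout (`[x, y, appearance_1, appearance_2]`), the function names, and the absence of learned projections are all simplifications introduced here.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def object_self_attention(tokens):
    """Single-head scaled dot-product self-attention over object tokens.

    Assumption for brevity: queries, keys, and values are the tokens
    themselves. A real transformer applies learned projections first.
    Returns (weights, outputs); weights[i][j] is how strongly object i
    attends to object j.
    """
    d = len(tokens[0])
    scale = math.sqrt(d)
    weights, outputs = [], []
    for q in tokens:
        w = softmax([dot(q, k) / scale for k in tokens])
        weights.append(w)
        # Each output is the attention-weighted average of all tokens.
        outputs.append(
            [sum(w_j * v[i] for w_j, v in zip(w, tokens)) for i in range(d)]
        )
    return weights, outputs


# Hypothetical tokens: [x, y, appearance_1, appearance_2]
tokens = [
    [0.1, 0.2, 1.0, 0.0],  # object 0
    [0.8, 0.2, 0.9, 0.1],  # object 1, looks similar to object 0
    [0.5, 0.9, 0.0, 1.0],  # object 2, looks different
]
weights, outputs = object_self_attention(tokens)
```

In this toy example object 0 attends more strongly to the similar-looking object 1 than to object 2, which is the kind of pairwise relationship an object-level model can exploit.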

Like other transformers, the model relies on attention, which lets it weight the objects most relevant to the next fixation given the fixations made so far. This dynamic, object-centric mechanism is a departure from previous approaches that relied on static, pixel-level representations such as saliency maps.
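One way to picture predicting a scanpath one object fixation at a time is a greedy decoding loop. The sketch below is illustrative only: `greedy_scanpath`, `toy_score`, and the stopping rule are hypothetical names and choices made here, and the hand-written score (prefer nearby, unvisited objects) merely stands in for whatever a trained decoder would compute.

```python
import math


def greedy_scanpath(objects, target_idx, score_fn, start_idx=0, max_steps=10):
    """Decode a scanpath as a list of object indices, one fixation per step.

    score_fn(objects, history, candidate) -> float is a stand-in for a
    learned decoder. Decoding stops when the target object is fixated
    or when max_steps fixations have been produced.
    """
    path = [start_idx]
    while len(path) < max_steps and path[-1] != target_idx:
        candidates = [i for i in range(len(objects)) if i != path[-1]]
        next_idx = max(candidates, key=lambda i: score_fn(objects, path, i))
        path.append(next_idx)
    return path


def toy_score(objects, history, candidate):
    """Hypothetical score: prefer close objects and penalize revisits."""
    cx, cy = objects[history[-1]]
    x, y = objects[candidate]
    distance = math.hypot(x - cx, y - cy)
    revisit_penalty = 10.0 if candidate in history else 0.0
    return -distance - revisit_penalty


# Hypothetical object centers (x, y); the search target is object 3.
objects = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (2.0, 1.0)]
path = greedy_scanpath(objects, target_idx=3, score_fn=toy_score)
print(path)  # → [0, 1, 2, 3]
```

Because the output is a sequence of object indices rather than pixel coordinates, each step is a choice among a small set of candidates, which is the main practical difference from pixel-level prediction.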

The authors evaluate OAT on the Amazon book cover dataset and on a new visual search dataset that they collected. They report that OAT's predicted scanpaths align more closely with human gaze patterns than predictions from algorithms based on spatial attention, on both established metrics and a novel behavior-based metric, and that the model generalizes to unseen layouts and target objects.
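Comparing a predicted scanpath with a human one requires some notion of sequence similarity. The sketch below assumes scanpaths are lists of object indices and uses plain Levenshtein edit distance, normalized by the longer sequence, as an illustrative measure. It is not necessarily one of the metrics the authors report, and the function names are introduced here for illustration.

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences of object indices."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(
                prev[j] + 1,         # delete x from a
                curr[j - 1] + 1,     # insert y into a
                prev[j - 1] + cost,  # substitute x with y (free if equal)
            ))
        prev = curr
    return prev[-1]


def scanpath_similarity(predicted, human):
    """Similarity in [0, 1]; 1.0 means the two sequences are identical."""
    longest = max(len(predicted), len(human))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(predicted, human) / longest


# Hypothetical scanpaths over object indices.
predicted = [0, 1, 2, 3]
human = [0, 2, 3]
similarity = scanpath_similarity(predicted, human)
print(similarity)  # → 0.75
```

Here the prediction contains one extra fixation (object 1), so one edit out of a maximum length of four gives a similarity of 0.75.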

Critical Analysis

A key advantage of the OAT model is its ability to model the interactions between objects in a scene, which aligns well with how humans shift their visual attention. However, the paper does not provide a deep analysis of the specific object relationships and attention patterns learned by the model.

Additionally, although the authors report that OAT generalizes to unseen layouts and target objects, the evaluation covers only two visual search datasets, so it is unclear how well the model would transfer to more diverse or complex real-world scenarios. The datasets used for evaluation may not fully capture the nuances of natural viewing behavior.

Further research could investigate the model's interpretability, examining which object features and relationships are most influential for accurate scanpath prediction. Integrating additional cues, such as semantic or contextual information, may also improve the model's performance and robustness.

Conclusion

The OAT model represents a promising advance in gaze scanpath prediction by leveraging object-level attention mechanisms. By modeling the relationships between salient elements in an image, the model can more accurately capture the dynamic nature of human visual attention.

The reported results suggest that object-centric approaches like OAT have the potential to enhance a wide range of applications, from user interface design to advertising and entertainment. As the field of gaze prediction continues to evolve, further research on interpretable, context-aware models could lead to even more accurate and insightful predictions of human visual behavior.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

OAT: Object-Level Attention Transformer for Gaze Scanpath Prediction

Yini Fang, Jingling Yu, Haozheng Zhang, Ralf van der Lans, Bertram Shi

Visual search is important in our daily life. The efficient allocation of visual attention is critical to effectively complete visual search tasks. Prior research has predominantly modelled the spatial allocation of visual attention in images at the pixel level, e.g. using a saliency map. However, emerging evidence shows that visual attention is guided by objects rather than pixel intensities. This paper introduces the Object-level Attention Transformer (OAT), which predicts human scanpaths as they search for a target object within a cluttered scene of distractors. OAT uses an encoder-decoder architecture. The encoder captures information about the position and appearance of the objects within an image and about the target. The decoder predicts the gaze scanpath as a sequence of object fixations, by integrating output features from both the encoder and decoder. We also propose a new positional encoding that better reflects spatial relationships between objects. We evaluated OAT on the Amazon book cover dataset and a new dataset for visual search that we collected. OAT's predicted gaze scanpaths align more closely with human gaze patterns, compared to predictions by algorithms based on spatial attention on both established metrics and a novel behavioural-based metric. Our results demonstrate the generalization ability of OAT, as it accurately predicts human scanpaths for unseen layouts and target objects.

7/19/2024

Look Hear: Gaze Prediction for Speech-directed Human Attention

Sounak Mondal, Seoyoung Ahn, Zhibo Yang, Niranjan Balasubramanian, Dimitris Samaras, Gregory Zelinsky, Minh Hoai

For computer systems to effectively interact with humans using spoken language, they need to understand how the words being generated affect the users' moment-by-moment attention. Our study focuses on the incremental prediction of attention as a person is seeing an image and hearing a referring expression defining the object in the scene that should be fixated by gaze. To predict the gaze scanpaths in this incremental object referral task, we developed the Attention in Referral Transformer model or ART, which predicts the human fixations spurred by each word in a referring expression. ART uses a multimodal transformer encoder to jointly learn gaze behavior and its underlying grounding tasks, and an autoregressive transformer decoder to predict, for each word, a variable number of fixations based on fixation history. To train ART, we created RefCOCO-Gaze, a large-scale dataset of 19,738 human gaze scanpaths, corresponding to 2,094 unique image-expression pairs, from 220 participants performing our referral task. In our quantitative and qualitative analyses, ART not only outperforms existing methods in scanpath prediction, but also appears to capture several human attention patterns, such as waiting, scanning, and verification.

9/11/2024

Beyond Average: Individualized Visual Scanpath Prediction

Xianyu Chen, Ming Jiang, Qi Zhao

Understanding how attention varies across individuals has significant scientific and societal impacts. However, existing visual scanpath models treat attention uniformly, neglecting individual differences. To bridge this gap, this paper focuses on individualized scanpath prediction (ISP), a new attention modeling task that aims to accurately predict how different individuals shift their attention in diverse visual tasks. It proposes an ISP method featuring three novel technical components: (1) an observer encoder to characterize and integrate an observer's unique attention traits, (2) an observer-centric feature integration approach that holistically combines visual features, task guidance, and observer-specific characteristics, and (3) an adaptive fixation prioritization mechanism that refines scanpath predictions by dynamically prioritizing semantic feature maps based on individual observers' attention traits. These novel components allow scanpath models to effectively address the attention variations across different observers. Our method is generally applicable to different datasets, model architectures, and visual tasks, offering a comprehensive tool for transforming general scanpath models into individualized ones. Comprehensive evaluations using value-based and ranking-based metrics verify the method's effectiveness and generalizability.

4/22/2024

A Robotics-Inspired Scanpath Model Reveals the Importance of Uncertainty and Semantic Object Cues for Gaze Guidance in Dynamic Scenes

Vito Mengers, Nicolas Roth, Oliver Brock, Klaus Obermayer, Martin Rolfs

How we perceive objects around us depends on what we actively attend to, yet our eye movements depend on the perceived objects. Still, object segmentation and gaze behavior are typically treated as two independent processes. Drawing on an information processing pattern from robotics, we present a mechanistic model that simulates these processes for dynamic real-world scenes. Our image-computable model uses the current scene segmentation for object-based saccadic decision-making while using the foveated object to refine its scene segmentation recursively. To model this refinement, we use a Bayesian filter, which also provides an uncertainty estimate for the segmentation that we use to guide active scene exploration. We demonstrate that this model closely resembles observers' free viewing behavior, measured by scanpath statistics, including foveation duration and saccade amplitude distributions used for parameter fitting and higher-level statistics not used for fitting. These include how object detections, inspections, and returns are balanced and a delay of returning saccades without an explicit implementation of such temporal inhibition of return. Extensive simulations and ablation studies show that uncertainty promotes balanced exploration and that semantic object cues are crucial to form the perceptual units used in object-based attention. Moreover, we show how our model's modular design allows for extensions, such as incorporating saccadic momentum or pre-saccadic attention, to further align its output with human scanpaths.

8/6/2024