iCub Detecting Gazed Objects: A Pipeline Estimating Human Attention

Read original: arXiv:2308.13318 - Published 5/10/2024 by Shiva Hanifi, Elisa Maiettini, Maria Lombardi, Lorenzo Natale
Total Score

0

🌀

Sign in to get full access

or

If you already have an account, we'll log you in

Overview

  • This research paper explores the use of eye gaze in human-robot interactions.
  • It proposes a learning system that allows a robot to detect and interpret the objects a human is looking at, using only visual feedback.
  • The system combines face detection, human attention prediction, and online object detection to enable the robot to accurately perceive and interpret human gaze.
  • A novel dataset of over 22,000 images from 10 participants gazing at different annotated objects is introduced as a benchmark for human gaze estimation in table-top human-robot interaction (HRI) contexts.
  • The proposed system is deployed on the humanoid robot iCub, and its functionality is demonstrated in a supplementary video.

Plain English Explanation

The paper focuses on helping robots better understand and respond to human eye gaze during interactions. The researchers developed a system that allows a robot to detect and interpret what a human is looking at, using only visual information. This is an important capability for robots to have, as it can help them establish "joint attention" with their human partners, where both the robot and human are focused on the same object or task.

The system works by combining several key technologies: face detection to find the human's face, human attention prediction to determine where the person is looking, and object detection to identify the objects in the robot's view. By putting these pieces together, the robot can accurately perceive and interpret the human's gaze, which is a crucial step for building more natural and responsive interactions between humans and robots.

To test and validate the system, the researchers created a new dataset of over 22,000 images showing people gazing at different objects on a tabletop. This dataset serves as a benchmark for evaluating gaze estimation in table-top HRI scenarios, which are common in many collaborative tasks. The researchers used this dataset to assess the performance of their proposed system.

Finally, the developed system was deployed on the humanoid robot iCub, and a video was provided to demonstrate its functionality. Overall, this work represents an important step towards enhancing the social awareness and responsiveness of robots, which can improve their ability to collaborate and assist humans in various scenarios.

Technical Explanation

The paper proposes a learning system that enables a robot to detect and interpret the objects a human is looking at, using only visual feedback. The system combines three key components: face detection to locate the human's face, human attention prediction to estimate where the person is looking, and online object detection to identify the objects in the robot's view.

By integrating these components, the robot can accurately perceive and interpret the human's gaze, which is crucial for establishing joint attention between the human and robot partners. To evaluate the system, the researchers introduced a novel dataset of over 22,000 images from 10 participants gazing at different annotated objects on a tabletop. This dataset serves as a benchmark for human gaze estimation in table-top HRI contexts and was used to assess the performance of the proposed pipeline.

Furthermore, the developed system was deployed on the humanoid robot iCub, and a supplementary video showcased its functionality. The results demonstrate the potential of the proposed approach as a first step towards enhancing social awareness and responsiveness in social robotics, as well as improving assistance and support in collaborative scenarios, which can promote efficient human-robot collaboration.

Critical Analysis

The paper presents a novel and promising approach for enabling robots to perceive and interpret human gaze during interactive scenarios. The use of a comprehensive pipeline that combines face detection, human attention prediction, and object detection is a well-designed and logical approach to address this challenge. The introduction of a benchmark dataset for human gaze estimation in table-top HRI settings is also a valuable contribution to the research community.

However, the paper does not address some potential limitations or areas for further research. For example, the system's performance in real-world, dynamic environments with multiple moving objects and distractions is not evaluated. Additionally, the paper does not discuss how the system might handle cases where the human's gaze is ambiguous or divided between multiple objects.

Further research could also explore the integration of this gaze estimation system with other modalities, such as speech or gesture recognition, to create a more holistic understanding of human intent and attention. This could lead to more natural and responsive interactions between humans and robots in collaborative tasks.

Conclusion

This research paper presents a novel learning system that enables robots to detect and interpret human gaze during interactive scenarios. By combining face detection, human attention prediction, and online object detection, the proposed system allows robots to accurately perceive and respond to the objects a human is looking at, which is a crucial step for establishing joint attention and promoting efficient human-robot collaboration.

The introduction of a benchmark dataset for table-top HRI scenarios and the successful deployment of the system on the iCub robot are valuable contributions to the field of social robotics. While the paper demonstrates the potential of this approach, further research is needed to address limitations and explore the integration of gaze estimation with other modalities to create more holistic and responsive human-robot interactions.



This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Follow @aimodelsfyi on 𝕏 →

Related Papers

🌀

Total Score

0

iCub Detecting Gazed Objects: A Pipeline Estimating Human Attention

Shiva Hanifi, Elisa Maiettini, Maria Lombardi, Lorenzo Natale

This research report explores the role of eye gaze in human-robot interactions and proposes a learning system for detecting objects gazed at by humans using solely visual feedback. The system leverages face detection, human attention prediction, and online object detection, and it allows the robot to perceive and interpret human gaze accurately, paving the way for establishing joint attention with human partners. Additionally, a novel dataset collected with the humanoid robot iCub is introduced, comprising over 22,000 images from ten participants gazing at different annotated objects. This dataset serves as a benchmark for the field of human gaze estimation in table-top human-robot interaction (HRI) contexts. In this work, we use it to evaluate the performance of the proposed pipeline and examine the performance of each component. Furthermore, the developed system is deployed on the iCub, and a supplementary video showcases its functionality. The results demonstrate the potential of the proposed approach as a first step to enhance social awareness and responsiveness in social robotics, as well as improve assistance and support in collaborative scenarios, promoting efficient human-robot collaboration.

Read more

5/10/2024

Learning from Observer Gaze:Zero-Shot Attention Prediction Oriented by Human-Object Interaction Recognition
Total Score

0

Learning from Observer Gaze:Zero-Shot Attention Prediction Oriented by Human-Object Interaction Recognition

Yuchen Zhou, Linkai Liu, Chao Gou

Most existing attention prediction research focuses on salient instances like humans and objects. However, the more complex interaction-oriented attention, arising from the comprehension of interactions between instances by human observers, remains largely unexplored. This is equally crucial for advancing human-machine interaction and human-centered artificial intelligence. To bridge this gap, we first collect a novel gaze fixation dataset named IG, comprising 530,000 fixation points across 740 diverse interaction categories, capturing visual attention during human observers cognitive processes of interactions. Subsequently, we introduce the zero-shot interaction-oriented attention prediction task ZeroIA, which challenges models to predict visual cues for interactions not encountered during training. Thirdly, we present the Interactive Attention model IA, designed to emulate human observers cognitive processes to tackle the ZeroIA problem. Extensive experiments demonstrate that the proposed IA outperforms other state-of-the-art approaches in both ZeroIA and fully supervised settings. Lastly, we endeavor to apply interaction-oriented attention to the interaction recognition task itself. Further experimental results demonstrate the promising potential to enhance the performance and interpretability of existing state-of-the-art HOI models by incorporating real human attention data from IG and attention labels generated by IA.

Read more

5/17/2024

👁️

Total Score

0

Gaze-Based Intention Recognition for Human-Robot Collaboration

Valerio Belcamino, Miwa Takase, Mariya Kilina, Alessandro Carf`i, Akira Shimada, Sota Shimizu, Fulvio Mastrogiovanni

This work aims to tackle the intent recognition problem in Human-Robot Collaborative assembly scenarios. Precisely, we consider an interactive assembly of a wooden stool where the robot fetches the pieces in the correct order and the human builds the parts following the instruction manual. The intent recognition is limited to the idle state estimation and it is needed to ensure a better synchronization between the two agents. We carried out a comparison between two distinct solutions involving wearable sensors and eye tracking integrated into the perception pipeline of a flexible planning architecture based on Hierarchical Task Networks. At runtime, the wearable sensing module exploits the raw measurements from four 9-axis Inertial Measurement Units positioned on the wrists and hands of the user as an input for a Long Short-Term Memory Network. On the other hand, the eye tracking relies on a Head Mounted Display and Unreal Engine. We tested the effectiveness of the two approaches with 10 participants, each of whom explored both options in alternate order. We collected explicit metrics about the attractiveness and efficiency of the two techniques through User Experience Questionnaires as well as implicit criteria regarding the classification time and the overall assembly time. The results of our work show that the two methods can reach comparable performances both in terms of effectiveness and user preference. Future development could aim at joining the two approaches two allow the recognition of more complex activities and to anticipate the user actions.

Read more

5/14/2024

Look Hear: Gaze Prediction for Speech-directed Human Attention
Total Score

0

Look Hear: Gaze Prediction for Speech-directed Human Attention

Sounak Mondal, Seoyoung Ahn, Zhibo Yang, Niranjan Balasubramanian, Dimitris Samaras, Gregory Zelinsky, Minh Hoai

For computer systems to effectively interact with humans using spoken language, they need to understand how the words being generated affect the users' moment-by-moment attention. Our study focuses on the incremental prediction of attention as a person is seeing an image and hearing a referring expression defining the object in the scene that should be fixated by gaze. To predict the gaze scanpaths in this incremental object referral task, we developed the Attention in Referral Transformer model or ART, which predicts the human fixations spurred by each word in a referring expression. ART uses a multimodal transformer encoder to jointly learn gaze behavior and its underlying grounding tasks, and an autoregressive transformer decoder to predict, for each word, a variable number of fixations based on fixation history. To train ART, we created RefCOCO-Gaze, a large-scale dataset of 19,738 human gaze scanpaths, corresponding to 2,094 unique image-expression pairs, from 220 participants performing our referral task. In our quantitative and qualitative analyses, ART not only outperforms existing methods in scanpath prediction, but also appears to capture several human attention patterns, such as waiting, scanning, and verification.

Read more

9/11/2024