Modeling State Shifting via Local-Global Distillation for Event-Frame Gaze Tracking

Read original: arXiv:2404.00548 - Published 7/1/2024 by Jiading Li, Zhiyu Zhu, Jinhui Hou, Junhui Hou, Jinjian Wu

Overview

This paper addresses passive gaze estimation from combined event and frame data.
The authors reformulate gaze estimation as measuring the shift from the current eye state to several previously registered anchor states, and they introduce a denoising distillation technique that aligns a group of local experts with a single student network.
According to the abstract, the method surpasses state-of-the-art approaches by a margin of 15%, while keeping the low latency and low power consumption associated with event-based vision sensors.

Plain English Explanation

The paper is about improving the accuracy of gaze tracking systems that use event-based cameras. Unlike a traditional camera, which captures full frames at a fixed rate, an event-based camera records only the brightness changes at individual pixels. This makes it more efficient and responsive, but the data it produces is sparser, noisier, and more challenging to work with.
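To make the idea of recording only changes concrete, here is a minimal sketch (not taken from the paper) of how a stream of events might be accumulated into a frame-like grid. The `(x, y, polarity)` tuple layout and the function name `accumulate_events` are assumptions for illustration; real sensors also attach a timestamp to each event.

```python
def accumulate_events(events, height, width):
    """Sum signed event polarities per pixel into a 2D grid.

    `events` is an iterable of hypothetical (x, y, polarity) tuples,
    where polarity is +1 (pixel got brighter) or -1 (pixel got darker).
    """
    frame = [[0] * width for _ in range(height)]
    for x, y, polarity in events:
        frame[y][x] += polarity  # pixels with no events stay at 0
    return frame


# Four made-up events on a 3x4 sensor; two of them hit the same pixel.
events = [(0, 0, 1), (1, 2, 1), (1, 2, 1), (3, 1, -1)]
frame = accumulate_events(events, height=3, width=4)

for row in frame:
    print(row)
```

Most cells in the grid remain zero, which illustrates why event data is sparse and why a model has to cope with isolated, possibly spurious, events.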

The researchers developed a technique called "denoising distillation" to help event-frame transformers (a type of machine learning model) overcome the noise in event-based data and match traditional frame-based models at gaze tracking. Essentially, they train the event-frame transformer to learn from a more accurate, frame-based model while also learning to remove the noise in the event-based data.

This is significant because event-based cameras have the potential to enable faster, more energy-efficient gaze tracking systems, which could have important applications in areas like human-computer interaction, virtual reality, and robotics. The denoising distillation technique developed in this paper helps bring the accuracy of event-based gaze tracking up to par with traditional approaches.

Technical Explanation

The key technical contributions of the paper are:

Denoising Distillation: The authors propose a novel training procedure called "denoising distillation" that enables event-frame transformers to achieve state-of-the-art gaze estimation accuracy. This involves training the event-frame transformer to not only predict gaze, but also to denoise the event-based input data by learning from a more accurate, frame-based gaze estimation model.
Event-Frame Transformer Architecture: The paper introduces a transformer-based architecture that can effectively process the sparse, event-based input data to perform gaze estimation. The model consists of an event feature encoder, a denoising module, and a gaze prediction head.
Evaluation on Benchmark Datasets: The authors evaluate their proposed approach on multiple public gaze estimation datasets, including FNIRS and CVSW. The abstract reports that the method surpasses state-of-the-art methods by a margin of 15%, while the event-based input offers low latency and low power consumption.
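The training idea described above can be sketched as a combined objective. This is a generic knowledge-distillation loss with an added denoising term, written under stated assumptions; it is not the authors' formulation, which per the abstract relies on denoising diffusion and a group of local experts. The function names, the mean-squared-error terms, and the weights `alpha` and `beta` are all hypothetical.

```python
def mse(a, b):
    """Mean squared error between two equal-length sequences."""
    return sum((p - q) ** 2 for p, q in zip(a, b)) / len(a)


def distillation_loss(student_gaze, teacher_gaze, true_gaze,
                      denoised_feat, teacher_feat,
                      alpha=0.5, beta=0.5):
    """Hypothetical objective: task loss + distillation + denoising.

    task    : student prediction vs. ground-truth gaze
    distill : student prediction vs. teacher prediction
    denoise : student's denoised features vs. teacher's cleaner features
    """
    task = mse(student_gaze, true_gaze)
    distill = mse(student_gaze, teacher_gaze)
    denoise = mse(denoised_feat, teacher_feat)
    return task + alpha * distill + beta * denoise


# Made-up values purely for illustration.
loss = distillation_loss(
    student_gaze=[0.5, 0.5],
    teacher_gaze=[0.5, 0.25],
    true_gaze=[0.5, 0.0],
    denoised_feat=[1.0, 2.0],
    teacher_feat=[1.0, 1.0],
)
print(loss)
```

The weights control how strongly the student is pulled toward the teacher's outputs and features relative to the ground-truth labels; in practice they would be tuned on a validation set.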

Critical Analysis

The paper presents a well-designed and thorough evaluation of the proposed denoising distillation technique for event-frame transformers in gaze estimation tasks. The authors acknowledge several limitations of their work, including the need for further research on the generalization of the approach to different event-based sensor configurations and the potential impact of environmental factors on gaze estimation performance.

One area that could be explored further is the interpretability of the denoising module within the event-frame transformer. Understanding how the model learns to denoise the event-based input data could provide valuable insights for improving the robustness and adaptability of the approach.

Additionally, the paper does not discuss the potential privacy implications of using gaze tracking systems, particularly in applications like human-computer interaction. As these technologies become more prevalent, it will be important for researchers to consider the ethical and societal impacts of their work.

Conclusion

This research paper presents an advance in event-based vision for gaze estimation. Its anchor-state formulation and denoising distillation technique are reported to surpass state-of-the-art methods by a margin of 15%, while offering the low latency and low power consumption associated with event-based sensors.

The findings of this work have important implications for the development of efficient, real-time gaze tracking systems with applications in areas such as human-computer interaction, virtual reality, and robotics. The authors have made a valuable contribution to the ongoing efforts to bridge the gap between event-based and frame-based computer vision approaches.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Modeling State Shifting via Local-Global Distillation for Event-Frame Gaze Tracking

Jiading Li, Zhiyu Zhu, Jinhui Hou, Junhui Hou, Jinjian Wu

This paper tackles the problem of passive gaze estimation using both event and frame data. Considering the inherently different physiological structures, it is intractable to accurately estimate gaze purely based on a given state. Thus, we reformulate gaze estimation as the quantification of the state shifting from the current state to several prior registered anchor states. Specifically, we propose a two-stage learning-based gaze estimation framework that divides the whole gaze estimation process into a coarse-to-fine approach involving anchor state selection and final gaze location. Moreover, to improve the generalization ability, instead of learning a large gaze estimation network directly, we align a group of local experts with a student network, where a novel denoising distillation algorithm is introduced to utilize denoising diffusion techniques to iteratively remove inherent noise in event data. Extensive experiments demonstrate the effectiveness of the proposed method, which surpasses state-of-the-art methods by a large margin of 15%. The code will be publicly available at https://github.com/jdjdli/Denoise_distill_EF_gazetracker.

7/1/2024

Global-Local Distillation Network-Based Audio-Visual Speaker Tracking with Incomplete Modalities

Yidi Li, Yihan Li, Yixin Guo, Bin Ren, Zhenhuan Xu, Hao Guo, Hong Liu, Nicu Sebe

In speaker tracking research, integrating and complementing multi-modal data is a crucial strategy for improving the accuracy and robustness of tracking systems. However, tracking with incomplete modalities remains a challenging issue due to noisy observations caused by occlusion, acoustic noise, and sensor failures. Especially when there is missing data in multiple modalities, the performance of existing multi-modal fusion methods tends to decrease. To this end, we propose a Global-Local Distillation-based Tracker (GLDTracker) for robust audio-visual speaker tracking. GLDTracker is driven by a teacher-student distillation model, enabling the flexible fusion of incomplete information from each modality. The teacher network processes global signals captured by camera and microphone arrays, and the student network handles local information subject to visual occlusion and missing audio channels. By transferring knowledge from teacher to student, the student network can better adapt to complex dynamic scenes with incomplete observations. In the student network, a global feature reconstruction module based on the generative adversarial network is constructed to reconstruct global features from feature embedding with missing local information. Furthermore, a multi-modal multi-level fusion attention is introduced to integrate the incomplete feature and the reconstructed feature, leveraging the complementarity and consistency of audio-visual and global-local features. Experimental results on the AV16.3 dataset demonstrate that the proposed GLDTracker outperforms existing state-of-the-art audio-visual trackers and achieves leading performance on both standard and incomplete modalities datasets, highlighting its superiority and robustness in complex conditions. The code and models will be available.

8/28/2024

Object-Centric Diffusion for Efficient Video Editing

Kumara Kahatapitiya, Adil Karjauv, Davide Abati, Fatih Porikli, Yuki M. Asano, Amirhossein Habibian

Diffusion-based video editing have reached impressive quality and can transform either the global style, local structure, and attributes of given video inputs, following textual edit prompts. However, such solutions typically incur heavy memory and computational costs to generate temporally-coherent frames, either in the form of diffusion inversion and/or cross-frame attention. In this paper, we conduct an analysis of such inefficiencies, and suggest simple yet effective modifications that allow significant speed-ups whilst maintaining quality. Moreover, we introduce Object-Centric Diffusion, to fix generation artifacts and further reduce latency by allocating more computations towards foreground edited regions, arguably more important for perceptual quality. We achieve this by two novel proposals: i) Object-Centric Sampling, decoupling the diffusion steps spent on salient or background regions and spending most on the former, and ii) Object-Centric Token Merging, which reduces cost of cross-frame attention by fusing redundant tokens in unimportant background regions. Both techniques are readily applicable to a given video editing model without retraining, and can drastically reduce its memory and computational cost. We evaluate our proposals on inversion-based and control-signal-based editing pipelines, and show a latency reduction up to 10x for a comparable synthesis quality. Project page: qualcomm-ai-research.github.io/object-centric-diffusion.

9/2/2024

Domain-Adaptive Full-Face Gaze Estimation via Novel-View-Synthesis and Feature Disentanglement

Jiawei Qin, Takuru Shimoyama, Xucong Zhang, Yusuke Sugano

Along with the recent development of deep neural networks, appearance-based gaze estimation has succeeded considerably when training and testing within the same domain. Compared to the within-domain task, the variance of different domains makes the cross-domain performance drop severely, preventing gaze estimation deployment in real-world applications. Among all the factors, ranges of head pose and gaze are believed to play significant roles in the final performance of gaze estimation, while collecting large ranges of data is expensive. This work proposes an effective model training pipeline consisting of a training data synthesis and a gaze estimation model for unsupervised domain adaptation. The proposed data synthesis leverages the single-image 3D reconstruction to expand the range of the head poses from the source domain without requiring a 3D facial shape dataset. To bridge the inevitable gap between synthetic and real images, we further propose an unsupervised domain adaptation method suitable for synthetic full-face data. We propose a disentangling autoencoder network to separate gaze-related features and introduce background augmentation consistency loss to utilize the characteristics of the synthetic source domain. Through comprehensive experiments, it shows that the model using only our synthetic training data can perform comparably to real data extended with a large label range. Our proposed domain adaptation approach further improves the performance on multiple target domains. The code and data will be available at https://github.com/ut-vision/AdaptiveGaze.

7/9/2024