Long-term Frame-Event Visual Tracking: Benchmark Dataset and Baseline

2403.05839

Published 4/4/2024 by Xiao Wang, Ju Huang, Shiao Wang, Chuanming Tang, Bo Jiang, Yonghong Tian, Jin Tang, Bin Luo

Abstract

Current event-/frame-event based trackers are evaluated on short-term tracking datasets. Real-world scenarios, however, involve long-term tracking, and the performance of existing tracking algorithms in these scenarios remains unclear. In this paper, we first propose a new long-term and large-scale frame-event single object tracking dataset, termed FELT. It contains 742 videos and 1,594,474 RGB frame and event stream pairs, making it the largest frame-event tracking dataset to date. We re-train and evaluate 15 baseline trackers on our dataset for future works to compare. More importantly, we find that the RGB frames and event streams are naturally incomplete due to the influence of challenging factors and spatially sparse event flow. In response to this, we propose a novel associative memory Transformer network as a unified backbone by introducing modern Hopfield layers into multi-head self-attention blocks to fuse both RGB and event data. Extensive experiments on RGB-Event (FELT), RGB-Thermal (RGBT234, LasHeR), and RGB-Depth (DepthTrack) datasets fully validated the effectiveness of our model. The dataset and source code can be found at https://github.com/Event-AHU/FELT_SOT_Benchmark.

Overview

Introduces a new benchmark dataset and baseline approach for long-term frame-event visual tracking
Focuses on using both frame and event-based data from cameras to improve tracking performance over long time periods
Proposes a transformer-based model that combines frame and event information to track objects effectively

Plain English Explanation

This paper presents a new benchmark dataset and baseline approach for improving long-term visual object tracking using a combination of frame-based and event-based camera data. Traditional tracking methods often struggle to maintain accurate object location over extended time periods due to factors like occlusion, illumination changes, and camera motion. The researchers address this challenge by leveraging the complementary strengths of frame-based and event-based cameras.

Event-based cameras are a novel type of sensor that capture sparse, asynchronous data about changes in the visual scene, rather than full image frames. This allows them to operate with very low latency and power consumption, making them well-suited for tracking dynamic objects. The proposed approach combines event-based data with traditional frame-based information to create a more robust and long-lasting tracking system.
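To make the idea of sparse, asynchronous event data concrete, the following minimal sketch accumulates a handful of events into a dense two-channel count image that could be paired with an RGB frame. The `(x, y, t, polarity)` event layout and the function name `events_to_frame` are assumptions for illustration, not the format or preprocessing used by the FELT dataset.

```python
import numpy as np

def events_to_frame(events, height, width):
    """Accumulate events into a 2-channel count image.

    `events` is an (N, 4) array with columns (x, y, t, polarity).
    This column layout is an assumption made for illustration only.
    Channel 0 counts positive-polarity events, channel 1 negative ones.
    """
    frame = np.zeros((2, height, width), dtype=np.int32)
    x = events[:, 0].astype(int)
    y = events[:, 1].astype(int)
    channel = np.where(events[:, 3] > 0, 0, 1)
    # np.add.at accumulates correctly when the same pixel fires repeatedly.
    np.add.at(frame, (channel, y, x), 1)
    return frame

# Three hypothetical events: two positive at pixel (x=1, y=2), one negative at (x=3, y=0).
events = np.array([
    [1, 2, 0.001, 1],
    [1, 2, 0.002, 1],
    [3, 0, 0.004, -1],
], dtype=float)

frame = events_to_frame(events, height=4, width=5)
print(frame[0, 2, 1], frame[1, 0, 3])
```

Most pixels in the resulting image stay at zero, which illustrates the spatial sparsity that the abstract identifies as one source of incomplete information.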

The key innovation is a transformer-based neural network model that can effectively fuse the event and frame data to maintain accurate object locations over long time periods, even in the presence of challenging conditions like occlusions or illumination shifts. This builds on recent advances in event-based visual processing and long-term scene flow estimation.

Technical Explanation

The authors introduce a new benchmark dataset called FELT, which contains 742 videos and 1,594,474 paired RGB frames and event streams, and which the authors describe as the largest frame-event tracking dataset to date. This provides a realistic testbed for evaluating long-term visual tracking approaches that can leverage both modalities.

To establish a strong baseline, the researchers propose an associative memory Transformer network that serves as a unified backbone for both modalities. According to the abstract, modern Hopfield layers are introduced into the multi-head self-attention blocks to fuse RGB and event data. The motivation is that both RGB frames and event streams are naturally incomplete, owing to challenging factors and spatially sparse event flow, and an associative memory is suited to recovering missing information. The fused features are then used to predict the bounding box location of the target object in each frame.
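The abstract describes introducing modern Hopfield layers into multi-head self-attention blocks. As a rough illustration of the underlying idea, the sketch below implements the standard modern Hopfield retrieval update, in which a query is replaced by a softmax-weighted combination of stored patterns. It shows how an incomplete pattern can be completed from memory. The function name `hopfield_retrieve`, the value of `beta`, and the toy patterns are assumptions; this is not the authors' network.

```python
import numpy as np

def softmax(z, axis=-1):
    """Numerically stable softmax."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def hopfield_retrieve(queries, stored, beta=8.0, steps=1):
    """Modern Hopfield retrieval: state <- softmax(beta * state @ stored.T) @ stored.

    queries: (n, d) array of possibly incomplete patterns.
    stored:  (m, d) array of memorized patterns.
    beta and steps are illustrative choices, not values from the paper.
    """
    state = queries
    for _ in range(steps):
        attention = softmax(beta * state @ stored.T, axis=-1)
        state = attention @ stored
    return state

# Two hypothetical stored patterns.
stored = np.array([
    [1.0, 0.0, 0.0, 1.0],
    [0.0, 1.0, 1.0, 0.0],
])

# A partial version of the first pattern: its last component is missing.
query = np.array([[1.0, 0.0, 0.0, 0.0]])

retrieved = hopfield_retrieve(query, stored)
print(np.round(retrieved, 3))
```

With a single update step the rule has the same form as scaled dot-product attention with the stored patterns acting as both keys and values, which is why such layers fit naturally inside self-attention blocks.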

The proposed model is evaluated on FELT alongside 15 re-trained baseline trackers, and additionally on RGB-Thermal (RGBT234, LasHeR) and RGB-Depth (DepthTrack) datasets. The authors report that these experiments validate the effectiveness of the model. Ablation studies are said to demonstrate the importance of the event-based data and the Transformer-based fusion approach for maintaining accurate tracking over extended time periods, even in challenging conditions.
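Tracking results of this kind are commonly summarized by a success rate (SR), based on bounding-box overlap, and a precision rate (PR), based on center distance. The sketch below computes both under one common convention, assuming `(x, y, w, h)` boxes, an IoU threshold of 0.5, and a 20-pixel distance threshold. These thresholds and function names are assumptions; benchmark toolkits often sweep thresholds and report the area under the curve instead.

```python
import numpy as np

def iou_xywh(pred, gt):
    """Per-frame IoU for boxes given as (x, y, w, h) rows."""
    x1 = np.maximum(pred[:, 0], gt[:, 0])
    y1 = np.maximum(pred[:, 1], gt[:, 1])
    x2 = np.minimum(pred[:, 0] + pred[:, 2], gt[:, 0] + gt[:, 2])
    y2 = np.minimum(pred[:, 1] + pred[:, 3], gt[:, 1] + gt[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    union = pred[:, 2] * pred[:, 3] + gt[:, 2] * gt[:, 3] - inter
    return inter / union

def center_error(pred, gt):
    """Per-frame Euclidean distance between box centers."""
    pred_center = pred[:, :2] + pred[:, 2:] / 2
    gt_center = gt[:, :2] + gt[:, 2:] / 2
    return np.linalg.norm(pred_center - gt_center, axis=1)

def success_and_precision(pred, gt, iou_thr=0.5, dist_thr=20.0):
    """Fraction of frames passing the overlap and distance thresholds."""
    sr = float(np.mean(iou_xywh(pred, gt) > iou_thr))
    pr = float(np.mean(center_error(pred, gt) <= dist_thr))
    return sr, pr

# Two hypothetical frames: a perfect prediction and a complete miss.
gt = np.array([[10, 10, 20, 20], [50, 50, 20, 20]], dtype=float)
pred = np.array([[10, 10, 20, 20], [90, 90, 20, 20]], dtype=float)

sr, pr = success_and_precision(pred, gt)
print(sr, pr)
```

In long sequences a tracker that loses the target and never recovers accumulates failed frames under both measures, which is one reason long-term benchmarks are harder than short-term ones.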

Critical Analysis

The authors evaluate their model on the new FELT benchmark and compare it against 15 re-trained baseline trackers, providing valuable insights into the benefits and limitations of their approach. However, the paper does not delve deeply into the potential real-world applications or limitations of the technique.

For instance, the FELT dataset, while more realistic than previous benchmarks, may still not capture the full complexity of tracking in unconstrained environments. Additionally, the computational and memory requirements of the Transformer-based architecture may limit its deployment on resource-constrained edge devices, where event-based cameras are often used.

Further research could explore more efficient fusion architectures, related event-based tasks such as sleep activity recognition, or the integration of the tracker with simultaneous localization and mapping (SLAM) systems to enable robust, long-term tracking in dynamic scenes.

Conclusion

This paper advances long-term visual object tracking by leveraging the complementary strengths of frame-based and event-based cameras. The proposed associative memory Transformer, trained on the new FELT benchmark dataset, is reported to maintain accurate object location over extended time periods, even in challenging conditions.

The work highlights the potential of hybrid sensor fusion approaches to address the limitations of traditional tracking methods, paving the way for more robust and adaptive visual perception systems. As event-based cameras become more widely adopted, techniques like this could enable a new generation of intelligent applications that require reliable, long-term object tracking in dynamic environments.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Mamba-FETrack: Frame-Event Tracking via State Space Model

Ju Huang, Shiao Wang, Shuai Wang, Zhe Wu, Xiao Wang, Bo Jiang

RGB-Event based tracking is an emerging research topic, focusing on how to effectively integrate heterogeneous multi-modal data (synchronized exposure video frames and asynchronous pulse Event stream). Existing works typically employ Transformer based networks to handle these modalities and achieve decent accuracy through input-level or feature-level fusion on multiple datasets. However, these trackers require significant memory consumption and computational complexity due to the use of self-attention mechanism. This paper proposes a novel RGB-Event tracking framework, Mamba-FETrack, based on the State Space Model (SSM) to achieve high-performance tracking while effectively reducing computational costs and realizing more efficient tracking. Specifically, we adopt two modality-specific Mamba backbone networks to extract the features of RGB frames and Event streams. Then, we also propose to boost the interactive learning between the RGB and Event features using the Mamba network. The fused features will be fed into the tracking head for target object localization. Extensive experiments on FELT and FE108 datasets fully validated the efficiency and effectiveness of our proposed tracker. Specifically, our Mamba-based tracker achieves 43.5/55.6 on the SR/PR metric, while the ViT-S based tracker (OSTrack) obtains 40.0/50.9. The GPU memory cost of ours and ViT-S based tracker is 13.98GB and 15.44GB, which decreased about 9.5%. The FLOPs and parameters of ours/ViT-S based OSTrack are 59GB/1076GB and 7MB/60MB, which decreased about 94.5% and 88.3%, respectively. We hope this work can bring some new insights to the tracking field and greatly promote the application of the Mamba architecture in tracking. The source code of this work will be released on https://github.com/Event-AHU/Mamba_FETrack.

4/30/2024

cs.CV cs.AI

TENet: Targetness Entanglement Incorporating with Multi-Scale Pooling and Mutually-Guided Fusion for RGB-E Object Tracking

Pengcheng Shao, Tianyang Xu, Zhangyong Tang, Linze Li, Xiao-Jun Wu, Josef Kittler

There is currently strong interest in improving visual object tracking by augmenting the RGB modality with the output of a visual event camera that is particularly informative about the scene motion. However, existing approaches perform event feature extraction for RGB-E tracking using traditional appearance models, which have been optimised for RGB only tracking, without adapting it for the intrinsic characteristics of the event data. To address this problem, we propose an Event backbone (Pooler), designed to obtain a high-quality feature representation that is cognisant of the innate characteristics of the event data, namely its sparsity. In particular, Multi-Scale Pooling is introduced to capture all the motion feature trends within event data through the utilisation of diverse pooling kernel sizes. The association between the derived RGB and event representations is established by an innovative module performing adaptive Mutually Guided Fusion (MGF). Extensive experimental results show that our method significantly outperforms state-of-the-art trackers on two widely used RGB-E tracking datasets, including VisEvent and COESOT, where the precision and success rates on COESOT are improved by 4.9% and 5.2%, respectively. Our code will be available at https://github.com/SSSpc333/TENet.

5/9/2024

cs.CV

Seeing Motion at Nighttime with an Event Camera

Haoyue Liu, Shihan Peng, Lin Zhu, Yi Chang, Hanyu Zhou, Luxin Yan

We focus on a very challenging task: imaging of nighttime dynamic scenes. Most previous methods rely on the low-light enhancement of a conventional RGB camera. However, they would inevitably face a dilemma between the long exposure time of nighttime and the motion blur of dynamic scenes. Event cameras react to dynamic changes with higher temporal resolution (microsecond) and higher dynamic range (120dB), offering an alternative solution. In this work, we present a novel nighttime dynamic imaging method with an event camera. Specifically, we discover that the event at nighttime exhibits temporal trailing characteristics and spatial non-stationary distribution. Consequently, we propose a nighttime event reconstruction network (NER-Net) which mainly includes a learnable event timestamps calibration module (LETC) to align the temporal trailing events and a non-uniform illumination aware module (NIAM) to stabilize the spatiotemporal distribution of events. Moreover, we construct a paired real low-light event dataset (RLED) through a co-axial imaging system, including 64,200 spatially and temporally aligned image GTs and low-light events. Extensive experiments demonstrate that the proposed method outperforms state-of-the-art methods in terms of visual quality and generalization ability on real-world nighttime datasets. The project is available at: https://github.com/Liu-haoyue/NER-Net.

4/19/2024

cs.CV

eTraM: Event-based Traffic Monitoring Dataset

Aayush Atul Verma, Bharatesh Chakravarthi, Arpitsinh Vaghela, Hua Wei, Yezhou Yang

Event cameras, with their high temporal and dynamic range and minimal memory usage, have found applications in various fields. However, their potential in static traffic monitoring remains largely unexplored. To facilitate this exploration, we present eTraM - a first-of-its-kind, fully event-based traffic monitoring dataset. eTraM offers 10 hr of data from different traffic scenarios in various lighting and weather conditions, providing a comprehensive overview of real-world situations. Providing 2M bounding box annotations, it covers eight distinct classes of traffic participants, ranging from vehicles to pedestrians and micro-mobility. eTraM's utility has been assessed using state-of-the-art methods for traffic participant detection, including RVT, RED, and YOLOv8. We quantitatively evaluate the ability of event-based models to generalize on nighttime and unseen scenes. Our findings substantiate the compelling potential of leveraging event cameras for traffic monitoring, opening new avenues for research and application. eTraM is available at https://eventbasedvision.github.io/eTraM

4/3/2024

cs.CV