TENet: Targetness Entanglement Incorporating with Multi-Scale Pooling and Mutually-Guided Fusion for RGB-E Object Tracking

Read original: arXiv:2405.05004 - Published 5/9/2024 by Pengcheng Shao, Tianyang Xu, Zhangyong Tang, Linze Li, Xiao-Jun Wu, Josef Kittler

🌐

Overview

This paper explores improving visual object tracking by combining RGB (color) data with the output of a visual event camera, which is particularly informative about scene motion.
Existing approaches use traditional appearance models for RGB-E (RGB + event) tracking, which are optimized for RGB-only tracking and do not account for the unique characteristics of event data.
The authors propose a novel "Event Backbone (Pooler)" that extracts high-quality feature representations tailored to the sparsity of event data, using Multi-Scale Pooling to capture motion feature trends.
An innovative "Mutually Guided Fusion" module is introduced to associate the derived RGB and event representations.
Extensive experiments show the proposed method significantly outperforms state-of-the-art RGB-E trackers on two widely used datasets.

Plain English Explanation

Visual object tracking is the task of continuously following the location of an object in a video. Existing tracking methods use standard camera RGB (color) data, but recent event cameras can provide additional useful information about the scene's motion.

The researchers in this paper recognized that current approaches don't properly utilize the unique characteristics of event data when combining it with RGB for improved tracking. To address this, they developed a new "Event Backbone" that extracts high-quality motion features from the sparse event data, using techniques like "Multi-Scale Pooling" to capture different scales of movement.

This event-specific feature representation is then combined with the RGB features in an innovative "Mutually Guided Fusion" module, enabling the tracker to leverage the complementary information from both data sources.

Through extensive testing on benchmark datasets, the authors show their method significantly outperforms existing state-of-the-art RGB-E trackers, improving precision and success rates by over 5% on one dataset. This demonstrates the value of designing components specifically tailored to the strengths of event data, rather than simply using it as an add-on to traditional RGB-based tracking.

Technical Explanation

The authors propose a novel RGB-E tracking framework that learns high-quality feature representations from the event data, in contrast to existing methods that use traditional appearance models optimized only for RGB.

At the core of their approach is the "Event Backbone (Pooler)", a module designed to extract features that capture the inherent sparsity and motion information present in event data. This is achieved through the use of "Multi-Scale Pooling", which applies pooling kernels of varying sizes to the event stream to capture motion features at different scales.

The resulting event features are then combined with the RGB features through an "Adaptive Mutually Guided Fusion" (MGF) module. This innovative fusion technique adaptively weights and integrates the complementary information from the two modalities, allowing the tracker to leverage the strengths of both.

Extensive experiments on the VisEvent and COESOT RGB-E tracking benchmarks demonstrate the effectiveness of the proposed approach. Compared to state-of-the-art methods like FOTS and MAMBA-FETRack, the authors' tracker achieves significant improvements in precision (4.9%) and success rate (5.2%) on the COESOT dataset.

Critical Analysis

The paper presents a compelling approach to leveraging event camera data for improved visual object tracking, with a strong focus on designing components that effectively harness the unique characteristics of event information.

One potential limitation is the lack of a detailed analysis of the individual contributions of the Event Backbone and Mutually Guided Fusion modules. While the overall system demonstrates impressive performance gains, it would be helpful to understand the relative importance and impact of these two key innovations.

Additionally, the paper does not address potential issues with the scalability or computational efficiency of the proposed method, which could be important considerations for real-world deployment. Further research could explore ways to optimize the architecture or explore lightweight variants without sacrificing tracking accuracy.

It would also be valuable to see the method evaluated on a broader range of tracking scenarios, such as low-light conditions or long-term tracking, to better understand its strengths and limitations across different use cases.

Overall, the paper presents a well-designed and effective approach to RGB-E visual object tracking, demonstrating the value of tailoring feature extraction and fusion techniques to the unique properties of event data. The results are promising and provide a solid foundation for further research in this area.

Conclusion

This paper introduces a novel RGB-E tracking framework that addresses the limitations of existing methods by designing components specifically tailored to the characteristics of event data. The proposed "Event Backbone (Pooler)" and "Adaptive Mutually Guided Fusion" module enable the tracker to effectively leverage the complementary information from both RGB and event modalities, leading to significant performance improvements on benchmark datasets.

The research highlights the importance of adapting computer vision techniques to the unique properties of emerging sensor technologies, such as event cameras, in order to unlock their full potential. The findings of this work have important implications for the field of visual tracking, suggesting that further advancements may come from a deeper understanding and exploitation of the intrinsic strengths of different data sources.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Follow @aimodelsfyi on 𝕏 →

Related Papers

🌐

TENet: Targetness Entanglement Incorporating with Multi-Scale Pooling and Mutually-Guided Fusion for RGB-E Object Tracking

Pengcheng Shao, Tianyang Xu, Zhangyong Tang, Linze Li, Xiao-Jun Wu, Josef Kittler

There is currently strong interest in improving visual object tracking by augmenting the RGB modality with the output of a visual event camera that is particularly informative about the scene motion. However, existing approaches perform event feature extraction for RGB-E tracking using traditional appearance models, which have been optimised for RGB only tracking, without adapting it for the intrinsic characteristics of the event data. To address this problem, we propose an Event backbone (Pooler), designed to obtain a high-quality feature representation that is cognisant of the innate characteristics of the event data, namely its sparsity. In particular, Multi-Scale Pooling is introduced to capture all the motion feature trends within event data through the utilisation of diverse pooling kernel sizes. The association between the derived RGB and event representations is established by an innovative module performing adaptive Mutually Guided Fusion (MGF). Extensive experimental results show that our method significantly outperforms state-of-the-art trackers on two widely used RGB-E tracking datasets, including VisEvent and COESOT, where the precision and success rates on COESOT are improved by 4.9% and 5.2%, respectively. Our code will be available at https://github.com/SSSpc333/TENet.

5/9/2024

Long-term Frame-Event Visual Tracking: Benchmark Dataset and Baseline

Xiao Wang, Ju Huang, Shiao Wang, Chuanming Tang, Bo Jiang, Yonghong Tian, Jin Tang, Bin Luo

Current event-/frame-event based trackers undergo evaluation on short-term tracking datasets, however, the tracking of real-world scenarios involves long-term tracking, and the performance of existing tracking algorithms in these scenarios remains unclear. In this paper, we first propose a new long-term and large-scale frame-event single object tracking dataset, termed FELT. It contains 742 videos and 1,594,474 RGB frames and event stream pairs and has become the largest frame-event tracking dataset to date. We re-train and evaluate 15 baseline trackers on our dataset for future works to compare. More importantly, we find that the RGB frames and event streams are naturally incomplete due to the influence of challenging factors and spatially sparse event flow. In response to this, we propose a novel associative memory Transformer network as a unified backbone by introducing modern Hopfield layers into multi-head self-attention blocks to fuse both RGB and event data. Extensive experiments on RGB-Event (FELT), RGB-Thermal (RGBT234, LasHeR), and RGB-Depth (DepthTrack) datasets fully validated the effectiveness of our model. The dataset and source code can be found at url{https://github.com/Event-AHU/FELT_SOT_Benchmark}.

4/4/2024

🔎

New!Enhancing Traffic Object Detection in Variable Illumination with RGB-Event Fusion

Zhanwen Liu, Nan Yang, Yang Wang, Yuke Li, Xiangmo Zhao, Fei-Yue Wang

Traffic object detection under variable illumination is challenging due to the information loss caused by the limited dynamic range of conventional frame-based cameras. To address this issue, we introduce bio-inspired event cameras and propose a novel Structure-aware Fusion Network (SFNet) that extracts sharp and complete object structures from the event stream to compensate for the lost information in images through cross-modality fusion, enabling the network to obtain illumination-robust representations for traffic object detection. Specifically, to mitigate the sparsity or blurriness issues arising from diverse motion states of traffic objects in fixed-interval event sampling methods, we propose the Reliable Structure Generation Network (RSGNet) to generate Speed Invariant Frames (SIF), ensuring the integrity and sharpness of object structures. Next, we design a novel Adaptive Feature Complement Module (AFCM) which guides the adaptive fusion of two modality features to compensate for the information loss in the images by perceiving the global lightness distribution of the images, thereby generating illumination-robust representations. Finally, considering the lack of large-scale and high-quality annotations in the existing event-based object detection datasets, we build a DSEC-Det dataset, which consists of 53 sequences with 63,931 images and more than 208,000 labels for 8 classes. Extensive experimental results demonstrate that our proposed SFNet can overcome the perceptual boundaries of conventional cameras and outperform the frame-based method by 8.0% in mAP50 and 5.9% in mAP50:95. Our code and dataset will be available at https://github.com/YN-Yang/SFNet.

9/17/2024

Mamba-FETrack: Frame-Event Tracking via State Space Model

Ju Huang, Shiao Wang, Shuai Wang, Zhe Wu, Xiao Wang, Bo Jiang

RGB-Event based tracking is an emerging research topic, focusing on how to effectively integrate heterogeneous multi-modal data (synchronized exposure video frames and asynchronous pulse Event stream). Existing works typically employ Transformer based networks to handle these modalities and achieve decent accuracy through input-level or feature-level fusion on multiple datasets. However, these trackers require significant memory consumption and computational complexity due to the use of self-attention mechanism. This paper proposes a novel RGB-Event tracking framework, Mamba-FETrack, based on the State Space Model (SSM) to achieve high-performance tracking while effectively reducing computational costs and realizing more efficient tracking. Specifically, we adopt two modality-specific Mamba backbone networks to extract the features of RGB frames and Event streams. Then, we also propose to boost the interactive learning between the RGB and Event features using the Mamba network. The fused features will be fed into the tracking head for target object localization. Extensive experiments on FELT and FE108 datasets fully validated the efficiency and effectiveness of our proposed tracker. Specifically, our Mamba-based tracker achieves 43.5/55.6 on the SR/PR metric, while the ViT-S based tracker (OSTrack) obtains 40.0/50.9. The GPU memory cost of ours and ViT-S based tracker is 13.98GB and 15.44GB, which decreased about $9.5%$. The FLOPs and parameters of ours/ViT-S based OSTrack are 59GB/1076GB and 7MB/60MB, which decreased about $94.5%$ and $88.3%$, respectively. We hope this work can bring some new insights to the tracking field and greatly promote the application of the Mamba architecture in tracking. The source code of this work will be released on url{https://github.com/Event-AHU/Mamba_FETrack}.

4/30/2024