RGBT Tracking via All-layer Multimodal Interactions with Progressive Fusion Mamba

Read original: arXiv:2408.08827 - Published 8/19/2024 by Andong Lu, Wanyu Wang, Chenglong Li, Jin Tang, Bin Luo

Overview

This paper presents AINet (All-layer multimodal Interaction Network), a method for tracking objects across paired RGB and thermal (infrared) video feeds.
The key innovations are a Difference-based Fusion Mamba (DFM), which fuses the two modalities using the difference between their features at linear complexity, and an Order-dynamic Fusion Mamba (OFM), which lets features from all layers interact by dynamically adjusting the order in which the layers are scanned.
The authors evaluate the method on four public RGBT tracking datasets and report leading performance against existing state-of-the-art approaches.

Plain English Explanation

AINet, the tracker proposed in this paper, follows objects in video that pairs regular color (RGB) frames with thermal (infrared) frames. This is useful in scenarios like surveillance or self-driving cars, where the two sensors together give a more complete picture of the environment than either does alone.

The key idea is to let RGB and thermal features from all layers of the model interact with one another, rather than fusing the two modalities one layer at a time as earlier trackers do. This lets the model exploit the complementary strengths of the two modalities throughout the tracking process. For example, thermal data may be better at detecting people in dark or occluded areas, while RGB data provides richer color and texture information.

AINet also fuses progressively. It merges RGB and thermal features using the difference between them as a cue to what each modality contributes, and it combines features from all layers in an order that is adjusted dynamically rather than fixed in advance.

The paper reports that AINet outperforms existing state-of-the-art RGBT trackers on four public benchmark datasets. This suggests that all-layer interaction is an effective way to use multimodal video data for robust object tracking.

Technical Explanation

AINet performs feature interactions across both modalities and all layers inside a progressive fusion Mamba. For cross-modal fusion it uses a Difference-based Fusion Mamba (DFM): because the feature discrepancy between RGB and thermal inputs reflects their complementary information to some extent, DFM uses that discrepancy to enhance the fusion, and it does so with linear complexity.
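To make the idea of difference-guided fusion concrete, here is a minimal sketch. It is not the paper's fusion module: the Mamba block is replaced by a plain element-wise sigmoid gate, and the function name `difference_gated_fusion` and the toy feature vectors are assumptions made for illustration.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def difference_gated_fusion(rgb, tir):
    """Fuse two equal-length feature vectors with a gate driven by their difference.

    Illustrative stand-in only: a simple gate, not a Mamba (state-space) block.
    """
    if len(rgb) != len(tir):
        raise ValueError("feature vectors must have the same length")
    fused = []
    for r, t in zip(rgb, tir):
        gate = sigmoid(r - t)  # above 0.5 when the RGB response is stronger
        fused.append(gate * r + (1.0 - gate) * t)
    return fused


# Hypothetical per-channel activations for one token.
rgb_feat = [0.9, 0.1, 0.5]
tir_feat = [0.1, 0.8, 0.5]
fused_feat = difference_gated_fusion(rgb_feat, tir_feat)
print(fused_feat)
```

Where the two modalities agree, the gate sits at 0.5 and the output is their average; where they disagree, the output leans toward the stronger response. The difference between the features therefore decides how much each modality contributes.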

Interacting across all layers produces a long token sequence (3840 tokens in the paper), so the computational burden is large. To handle this, the authors design an Order-dynamic Fusion Mamba (OFM), which dynamically adjusts the order in which the layers are scanned by Mamba.
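A recurrent scan reads tokens one after another, so its output depends on the order in which the layers are presented. The sketch below illustrates only that order sensitivity. It uses a fixed scalar recurrence in place of Mamba's selective state-space update, and it takes the layer order as an input, since this summary does not describe how the paper's network chooses it. All names and values are hypothetical.

```python
def linear_scan(tokens, decay=0.9):
    """Run the recurrence h_t = decay * h_{t-1} + x_t once over the sequence."""
    state = 0.0
    outputs = []
    for x in tokens:
        state = decay * state + x
        outputs.append(state)
    return outputs


def scan_layers(layer_tokens, order, decay=0.9):
    """Concatenate per-layer token lists in the given order, then scan them."""
    sequence = [x for idx in order for x in layer_tokens[idx]]
    return linear_scan(sequence, decay)


# Three toy "layers", each contributing two scalar tokens.
layers = [[1.0, 0.0], [0.0, 2.0], [3.0, 0.0]]
shallow_first = scan_layers(layers, order=[0, 1, 2])
deep_first = scan_layers(layers, order=[2, 1, 0])
print(shallow_first[-1], deep_first[-1])
```

The scan visits each token once, so its cost grows linearly with sequence length, and the two orderings end in different final states. Tokens scanned late are discounted less by the decay, which is one reason the choice of scan order can matter.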

The authors evaluate AINet on four public RGBT tracking datasets. The results show leading performance against existing state-of-the-art RGBT tracking methods on key metrics such as success rate and precision.
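For readers unfamiliar with these metrics, the sketch below computes them in their usual form. Precision is the fraction of frames whose predicted box center lies within a pixel threshold of the ground-truth center, and success is the fraction of frames whose overlap (IoU) exceeds a threshold. The thresholds of 20 pixels and 0.5 are common defaults used here as assumptions, not values taken from the paper, and the boxes are made-up examples. Benchmarks often report success as the area under the curve over many overlap thresholds rather than at one.

```python
import math


def iou(box_a, box_b):
    """Intersection over union of two (x, y, w, h) boxes."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    inter_w = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    inter_h = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = inter_w * inter_h
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0


def center_error(box_a, box_b):
    """Euclidean distance in pixels between the centers of two boxes."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    return math.hypot((ax + aw / 2) - (bx + bw / 2),
                      (ay + ah / 2) - (by + bh / 2))


def precision_rate(preds, gts, pixel_threshold=20.0):
    """Fraction of frames with center error within the pixel threshold."""
    hits = sum(center_error(p, g) <= pixel_threshold for p, g in zip(preds, gts))
    return hits / len(gts)


def success_rate(preds, gts, overlap_threshold=0.5):
    """Fraction of frames with IoU above the overlap threshold."""
    hits = sum(iou(p, g) > overlap_threshold for p, g in zip(preds, gts))
    return hits / len(gts)


# Two frames: a perfect prediction and a complete miss.
gts = [(10, 10, 20, 20), (30, 30, 20, 20)]
preds = [(10, 10, 20, 20), (100, 100, 20, 20)]
print(precision_rate(preds, gts), success_rate(preds, gts))
```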

Critical Analysis

The paper provides a thorough evaluation of AINet and highlights its advantages over other approaches. However, the authors do not discuss any significant limitations or potential drawbacks of the proposed technique.

One area that could be explored further is the generalization of the method to different types of target objects and environments. The evaluation is focused on relatively standard RGBT tracking scenarios, and it's unclear how well the approach would perform in more challenging or diverse settings.

Efficiency is a stated design goal: the abstract cites linear complexity for DFM and the 3840-token sequence that OFM is built to handle. This summary does not cover measured runtime or real-time performance, however. That information would be important for understanding the practical deployment of the method, especially in time-sensitive applications like autonomous vehicles.
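As a rough back-of-the-envelope illustration of why sequence length matters, the sketch below compares the number of token-pair interactions in full self-attention with the number of steps in a single linear scan, for the 3840 tokens quoted in the paper's abstract. It counts interactions only and ignores constants, feature dimensions, and hardware effects, so it says nothing about actual runtime. The helper names are hypothetical.

```python
def pairwise_interactions(num_tokens: int) -> int:
    """Token pairs compared by full self-attention (quadratic in length)."""
    return num_tokens * num_tokens


def scan_steps(num_tokens: int) -> int:
    """Recurrence steps in one pass of a linear scan (linear in length)."""
    return num_tokens


tokens = 3840  # sequence length reported in the abstract
ratio = pairwise_interactions(tokens) // scan_steps(tokens)
print(pairwise_interactions(tokens), scan_steps(tokens), ratio)
```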

Conclusion

AINet presents a novel and effective approach for combining RGB and thermal video data for object tracking. The Difference-based Fusion Mamba and Order-dynamic Fusion Mamba allow the method to leverage the complementary strengths of the two modalities across all layers, resulting in leading tracking performance on four public benchmarks.

While the paper provides a strong technical foundation, further research is needed to understand the broader applicability and practical considerations of AINet. Nonetheless, this work represents an important step forward in the field of multimodal video analysis and object tracking.

Related Papers

RGBT Tracking via All-layer Multimodal Interactions with Progressive Fusion Mamba

Andong Lu, Wanyu Wang, Chenglong Li, Jin Tang, Bin Luo

Existing RGBT tracking methods often design various interaction models to perform cross-modal fusion of each layer, but can not execute the feature interactions among all layers, which plays a critical role in robust multimodal representation, due to large computational burden. To address this issue, this paper presents a novel All-layer multimodal Interaction Network, named AINet, which performs efficient and effective feature interactions of all modalities and layers in a progressive fusion Mamba, for robust RGBT tracking. Even though modality features in different layers are known to contain different cues, it is always challenging to build multimodal interactions in each layer due to struggling in balancing interaction capabilities and efficiency. Meanwhile, considering that the feature discrepancy between RGB and thermal modalities reflects their complementary information to some extent, we design a Difference-based Fusion Mamba (DFM) to achieve enhanced fusion of different modalities with linear complexity. When interacting with features from all layers, a huge number of token sequences (3840 tokens in this work) are involved and the computational burden is thus large. To handle this problem, we design an Order-dynamic Fusion Mamba (OFM) to execute efficient and effective feature interactions of all layers by dynamically adjusting the scan order of different layers in Mamba. Extensive experiments on four public RGBT tracking datasets show that AINet achieves leading performance against existing state-of-the-art methods.

8/19/2024

Why mamba is effective? Exploit Linear Transformer-Mamba Network for Multi-Modality Image Fusion

Chenguang Zhu, Shan Gao, Huafeng Chen, Guangqian Guo, Chaowei Wang, Yaoxing Wang, Chen Shu Lei, Quanjiang Fan

Multi-modality image fusion aims to integrate the merits of images from different sources and render high-quality fusion images. However, existing feature extraction and fusion methods are either constrained by inherent local reduction bias and static parameters during inference (CNN) or limited by quadratic computational complexity (Transformers), and cannot effectively extract and fuse features. To solve this problem, we propose a dual-branch image fusion network called Tmamba. It consists of linear Transformer and Mamba, which has global modeling capabilities while maintaining linear complexity. Due to the difference between the Transformer and Mamba structures, the features extracted by the two branches carry channel and position information respectively. T-M interaction structure is designed between the two branches, using global learnable parameters and convolutional layers to transfer position and channel information respectively. We further propose cross-modal interaction at the attention level to obtain cross-modal attention. Experiments show that our Tmamba achieves promising results in multiple fusion tasks, including infrared-visible image fusion and medical image fusion. Code with checkpoints will be available after the peer-review process.

9/6/2024

FusionMamba: Dynamic Feature Enhancement for Multimodal Image Fusion with Mamba

Xinyu Xie, Yawen Cui, Chio-In Ieong, Tao Tan, Xiaozhi Zhang, Xubin Zheng, Zitong Yu

Multi-modal image fusion aims to combine information from different modes to create a single image with comprehensive information and detailed textures. However, fusion models based on convolutional neural networks encounter limitations in capturing global image features due to their focus on local convolution operations. Transformer-based models, while excelling in global feature modeling, confront computational challenges stemming from their quadratic complexity. Recently, the Selective Structured State Space Model has exhibited significant potential for long-range dependency modeling with linear complexity, offering a promising avenue to address the aforementioned dilemma. In this paper, we propose FusionMamba, a novel dynamic feature enhancement method for multimodal image fusion with Mamba. Specifically, we devise an improved efficient Mamba model for image fusion, integrating efficient visual state space model with dynamic convolution and channel attention. This refined model not only upholds the performance of Mamba and global modeling capability but also diminishes channel redundancy while enhancing local enhancement capability. Additionally, we devise a dynamic feature fusion module (DFFM) comprising two dynamic feature enhancement modules (DFEM) and a cross modality fusion mamba module (CMFM). The former serves for dynamic texture enhancement and dynamic difference perception, whereas the latter enhances correlation features between modes and suppresses redundant intermodal information. FusionMamba has yielded state-of-the-art (SOTA) performance across various multimodal medical image fusion tasks (CT-MRI, PET-MRI, SPECT-MRI), infrared and visible image fusion task (IR-VIS) and multimodal biomedical image fusion dataset (GFP-PC), which is proved that our model has generalization ability. The code for FusionMamba is available at https://github.com/millieXie/FusionMamba.

4/23/2024

Fusion-Mamba for Cross-modality Object Detection

Wenhao Dong, Haodong Zhu, Shaohui Lin, Xiaoyan Luo, Yunhang Shen, Xuhui Liu, Juan Zhang, Guodong Guo, Baochang Zhang

Cross-modality fusing complementary information from different modalities effectively improves object detection performance, making it more useful and robust for a wider range of applications. Existing fusion strategies combine different types of images or merge different backbone features through elaborated neural network modules. However, these methods neglect that modality disparities affect cross-modality fusion performance, as different modalities with different camera focal lengths, placements, and angles are hardly fused. In this paper, we investigate cross-modality fusion by associating cross-modal features in a hidden state space based on an improved Mamba with a gating mechanism. We design a Fusion-Mamba block (FMB) to map cross-modal features into a hidden state space for interaction, thereby reducing disparities between cross-modal features and enhancing the representation consistency of fused features. FMB contains two modules: the State Space Channel Swapping (SSCS) module facilitates shallow feature fusion, and the Dual State Space Fusion (DSSF) enables deep fusion in a hidden state space. Through extensive experiments on public datasets, our proposed approach outperforms the state-of-the-art methods on mAP by 5.9% on M3FD and 4.9% on FLIR-Aligned datasets, demonstrating superior object detection performance. To the best of our knowledge, this is the first work to explore the potential of Mamba for cross-modal fusion and establish a new baseline for cross-modality object detection.

4/16/2024