Removal and Selection: Improving RGB-Infrared Object Detection via Coarse-to-Fine Fusion






Published 5/8/2024 by Tianyi Zhao, Maoxun Yuan, Feng Jiang, Nan Wang, Xingxing Wei


Object detection in visible (RGB) and infrared (IR) images has been widely applied in recent years. Leveraging the complementary characteristics of RGB and IR images, the object detector provides reliable and robust object localization from day to night. Most existing fusion strategies feed RGB and IR images directly into deep neural networks, leading to inferior detection performance. Because the RGB and IR features carry modality-specific noise, these strategies degrade the fused features as the noise propagates through the network. Inspired by the mechanism of the human brain processing multimodal information, in this paper, we introduce a new coarse-to-fine perspective to purify and fuse two modality features. Specifically, following this perspective, we design a Redundant Spectrum Removal module to coarsely remove interfering information within each modality and a Dynamic Feature Selection module to finely select the desired features for feature fusion. To verify the effectiveness of the coarse-to-fine fusion strategy, we construct a new object detector called the Removal and Selection Detector (RSDet). Extensive experiments on three RGB-IR object detection datasets verify the superior performance of our method.



  • The paper presents a new method for improving RGB-infrared object detection using a coarse-to-fine fusion approach.
  • The proposed method aims to effectively leverage both RGB and infrared data to enhance object detection performance.
  • The authors introduce a Redundant Spectrum Removal module and a Dynamic Feature Selection module, and combine them in a detector called the Removal and Selection Detector (RSDet).

Plain English Explanation

The paper introduces a new technique for improving object detection in images that combine regular color (RGB) information with infrared data. Object detection is the process of identifying and locating objects of interest within an image. By combining RGB and infrared data, the researchers hope to create a more robust and accurate object detection system.

The key innovation in this work is a pair of modules: a Redundant Spectrum Removal module that coarsely strips interfering information from each modality, and a Dynamic Feature Selection module that finely picks the features worth fusing. This allows the system to focus on the most relevant information from each modality, rather than simply combining everything.

The coarse-to-fine approach means the features are cleaned in two stages: interfering information is first removed from each modality on its own, and only then are the remaining features selected and fused. This keeps modality-specific noise from spreading into the fused features.

Overall, the goal of this research is to improve the performance of RGB-infrared object detection by smartly combining the complementary information provided by the two data sources. This could have applications in areas like low-light object detection or other scenarios where both color and thermal data are available.

Technical Explanation

The paper proposes two modules for fusing RGB and infrared features in object detection. The Redundant Spectrum Removal module coarsely removes interfering information within each modality, and the Dynamic Feature Selection module finely selects the desired features for fusion. Together they form the Removal and Selection Detector (RSDet).
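To make the coarse removal idea concrete, here is a minimal sketch that assumes "spectrum removal" can be read as masking frequency components of a feature signal. This is an illustrative assumption, not the authors' implementation: the function names (`dft`, `idft`, `remove_spectrum`), the 1-D toy signal, and the hand-written mask are all hypothetical, and in a real detector the mask would presumably be learned rather than fixed.

```python
import cmath

def dft(signal):
    """Naive discrete Fourier transform of a 1-D sequence."""
    n = len(signal)
    return [
        sum(signal[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]

def idft(spectrum):
    """Inverse of dft; keeps only the real part of each sample."""
    n = len(spectrum)
    return [
        (sum(spectrum[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)) / n).real
        for t in range(n)
    ]

def remove_spectrum(signal, keep_mask):
    """Zero out the frequency bins where keep_mask is 0, then transform back."""
    spectrum = dft(signal)
    filtered = [c * m for c, m in zip(spectrum, keep_mask)]
    return idft(filtered)

# Hypothetical 1-D "feature row": a constant level of 1.0 plus a fast
# alternating pattern (+0.5, -0.5, ...) standing in for interfering content.
row = [1.0 + (0.5 if i % 2 == 0 else -0.5) for i in range(8)]

# Hand-written mask (an assumption): drop bin 4, which holds the alternation.
mask = [1, 1, 1, 1, 0, 1, 1, 1]

cleaned = remove_spectrum(row, mask)
```

In this toy case the alternating pattern lives entirely in one frequency bin, so masking that bin leaves only the constant level. Real features would not separate this cleanly, which is why a learned, input-dependent mask is the more plausible design.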

The approach follows a coarse-to-fine strategy: features from each modality are first purified, then fused. The motivation is that RGB and IR features carry modality-specific noise, and feeding both modalities directly into a network lets that noise degrade the fused features as it propagates.
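The selection side of the fusion can be sketched in a similarly hedged way. The example below assumes that selection amounts to weighting each modality per feature element with a softmax over relevance scores. The paper's summary does not specify the mechanism, so `select_and_fuse`, the score vectors, and the toy feature values are hypothetical stand-ins.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a short list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def select_and_fuse(rgb_feat, ir_feat, rgb_score, ir_score):
    """Fuse two feature vectors element-wise using per-element softmax weights."""
    fused = []
    for r_val, i_val, r_s, i_s in zip(rgb_feat, ir_feat, rgb_score, ir_score):
        w_rgb, w_ir = softmax([r_s, i_s])
        fused.append(w_rgb * r_val + w_ir * i_val)
    return fused

# Hypothetical features and relevance scores (in practice the scores would
# come from a small learned network, which is assumed here, not shown).
rgb = [0.9, 0.1, 0.4]
ir = [0.2, 0.8, 0.4]
rgb_score = [2.0, -2.0, 0.0]   # RGB trusted at position 0, distrusted at 1
ir_score = [-2.0, 2.0, 0.0]    # IR trusted at position 1, distrusted at 0

fused = select_and_fuse(rgb, ir, rgb_score, ir_score)
```

Where one modality scores much higher, the fused value stays close to that modality's feature; where the scores tie, the result is a plain average. A soft weighting like this keeps the operation differentiable, which is one reason it is a common choice for learned fusion.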

The authors evaluate their method on three RGB-IR object detection datasets and report superior performance over existing methods. They also conduct ablation studies to analyze the contributions of the different components of their proposed approach.

Critical Analysis

The paper presents a well-designed and thoroughly evaluated method for improving RGB-infrared object detection. The removal-then-selection design appears to be a novel and effective way to purify and combine features from the two modalities.

One potential limitation is that the method relies on having access to both RGB and infrared data, which may not always be available in real-world scenarios. The authors mention this and suggest exploring implicit multi-spectral fusion as a future direction to address this.

Additionally, while the paper demonstrates strong performance on existing benchmarks, it would be valuable to see the method evaluated in more diverse and challenging real-world settings to assess its robustness and practical applicability.


Conclusion

This paper introduces a coarse-to-fine fusion strategy for RGB-infrared object detection, built from a Redundant Spectrum Removal module and a Dynamic Feature Selection module. The resulting detector, RSDet, is reported to outperform prior methods on three RGB-IR object detection datasets.

The proposed method has the potential to enhance object detection in a variety of applications, particularly those involving low-light or thermal imaging. Future research could explore ways to make the approach more widely applicable, such as through the use of implicit multi-spectral fusion techniques.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Rethinking Early-Fusion Strategies for Improved Multispectral Object Detection


Xue Zhang, Si-Yuan Cao, Fang Wang, Runmin Zhang, Zhe Wu, Xiaohan Zhang, Xiaokai Bai, Hui-Liang Shen





Most recent multispectral object detectors employ a two-branch structure to extract features from RGB and thermal images. While the two-branch structure achieves better performance than a single-branch structure, it overlooks inference efficiency. This conflict is growing sharper, as recent works pursue higher performance alone rather than both performance and efficiency. In this paper, we address this issue by improving the performance of efficient single-branch structures. We revisit the reasons causing the performance gap between these structures. For the first time, we reveal the information interference problem in the naive early-fusion strategy adopted by previous single-branch structures. Besides, we find that the domain gap between multispectral images and the weak feature representation of the single-branch structure are also key obstacles to performance. Focusing on these three problems, we propose corresponding solutions, including a novel shape-priority early-fusion strategy, a weakly supervised learning method, and a core knowledge distillation technique. Experiments demonstrate that single-branch networks equipped with these three contributions achieve significant performance enhancements while retaining high efficiency. Our code will be made available.



NIR-Assisted Image Denoising: A Selective Fusion Approach and A Real-World Benchmark Dataset


Rongjian Xu, Zhilu Zhang, Renlong Wu, Wangmeng Zuo





Despite the significant progress in image denoising, it is still challenging to restore fine-scale details while removing noise, especially in extremely low-light environments. Leveraging near-infrared (NIR) images to assist visible RGB image denoising shows the potential to address this issue, making it a promising technology. Nonetheless, existing works still struggle to take advantage of NIR information effectively for real-world image denoising, due to the content inconsistency between NIR-RGB images and the scarcity of real-world paired datasets. To alleviate the problem, we propose an efficient Selective Fusion Module (SFM), which can be plugged into advanced denoising networks to merge the deep NIR-RGB features. Specifically, we sequentially perform the global and local modulation for NIR and RGB features, and then integrate the two modulated features. Furthermore, we present a Real-world NIR-Assisted Image Denoising (Real-NAID) dataset, which covers diverse scenarios as well as various noise levels. Extensive experiments on both synthetic and our real-world datasets demonstrate that the proposed method achieves better results than state-of-the-art ones. The dataset, codes, and pre-trained models will be publicly available at



RGB-T Object Detection via Group Shuffled Multi-receptive Attention and Multi-modal Supervision


Jinzhong Wang, Xuetao Tian, Shun Dai, Tao Zhuo, Haorui Zeng, Hongjuan Liu, Jiaqi Liu, Xiuwei Zhang, Yanning Zhang





Multispectral object detection, utilizing both visible (RGB) and thermal infrared (T) modalities, has garnered significant attention for its robust performance across diverse weather and lighting conditions. However, effectively exploiting the complementarity between the RGB-T modalities while maintaining efficiency remains a critical challenge. In this paper, a very simple Group Shuffled Multi-receptive Attention (GSMA) module is proposed to extract and combine multi-scale RGB and thermal features. Then, the extracted multi-modal features are directly integrated with a multi-level path aggregation neck, which significantly improves the fusion effect and efficiency. Meanwhile, multi-modal object detection often adopts union annotations for both modalities. This kind of supervision is insufficient and unfair, since objects observed in one modality may not be seen in the other. To solve this issue, Multi-modal Supervision (MS) is proposed to sufficiently supervise RGB-T object detection. Comprehensive experiments on two challenging benchmarks, KAIST and DroneVehicle, demonstrate that the proposed model achieves state-of-the-art accuracy while maintaining competitive efficiency.




UniRGB-IR: A Unified Framework for Visible-Infrared Downstream Tasks via Adapter Tuning

Maoxun Yuan, Bo Cui, Tianyi Zhao, Xingxing Wei





Semantic analysis on visible (RGB) and infrared (IR) images has gained attention for its ability to be more accurate and robust under low-illumination and complex weather conditions. Due to the lack of pre-trained foundation models on the large-scale infrared image datasets, existing methods prefer to design task-specific frameworks and directly fine-tune them with pre-trained foundation models on their RGB-IR semantic relevance datasets, which results in poor scalability and limited generalization. In this work, we propose a scalable and efficient framework called UniRGB-IR to unify RGB-IR downstream tasks, in which a novel adapter is developed to efficiently introduce richer RGB-IR features into the pre-trained RGB-based foundation model. Specifically, our framework consists of a vision transformer (ViT) foundation model, a Multi-modal Feature Pool (MFP) module and a Supplementary Feature Injector (SFI) module. The MFP and SFI modules cooperate with each other as an adapter to effectively complement the ViT features with the contextual multi-scale features. During training, we freeze the entire foundation model to inherit prior knowledge and only optimize the MFP and SFI modules. Furthermore, to verify the effectiveness of our framework, we utilize the ViT-Base as the pre-trained foundation model to perform extensive experiments. Experimental results on various RGB-IR downstream tasks demonstrate that our method can achieve state-of-the-art performance. The source code and results are available at

