Alignment-Free RGBT Salient Object Detection: Semantics-guided Asymmetric Correlation Network and A Unified Benchmark

Read original: arXiv:2406.00917 - Published 6/4/2024 by Kunpeng Wang, Danying Lin, Chenglong Li, Zhengzheng Tu, Bin Luo

Overview

This paper proposes an alignment-free RGBT (RGB and Thermal) salient object detection framework, the Semantics-guided Asymmetric Correlation Network (SACNet), which is built from an asymmetric correlation module and an associated feature sampling module.
The framework aims to exploit the complementary information in RGB and thermal images as they are originally captured, without the manual alignment that existing methods rely on and that is labor-intensive to produce.
The authors also introduce UVT2000, a unified benchmark dataset of 2000 unaligned RGB and thermal image pairs captured in real-world scenes, to facilitate further research in this area.

Plain English Explanation

The paper discusses a new approach for detecting important objects in images that combine color (RGB) and thermal information (T). Traditional methods often require the color and thermal images to be precisely aligned, which can be difficult to achieve in real-world applications. To address this, the researchers developed an alignment-free RGBT salient object detection framework that can effectively utilize the complementary information from both types of images without needing them to be perfectly aligned.

The key components of their framework are:

Asymmetric correlation module: This module uses semantics-guided attention to work out how the RGB and thermal features relate to each other, concentrating on salient regions that do not line up between the two images.
Associated feature sampling module: This module picks out the thermal features that are relevant to each corresponding RGB feature, so that the two modalities can be combined.

By combining these innovative techniques, the researchers were able to create a system that can accurately detect salient objects in RGBT images without requiring precise alignment of the color and thermal data. They also developed a new benchmark dataset to help further research in this area.

Technical Explanation

The proposed alignment-free RGBT salient object detection framework consists of two key components:

Asymmetric correlation module: This module uses semantics-guided attention to model cross-modal correlations specific to unaligned salient regions. Guiding the attention with semantic information lets the network relate salient regions that sit at different positions in the RGB and thermal images.
Associated feature sampling module: This module samples the relevant thermal features according to the corresponding RGB features. The sampled thermal features are then integrated with the RGB features for multi-modal fusion.
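The idea of correlating features across modalities without relying on shared pixel coordinates can be illustrated with a minimal sketch. This is not the authors' implementation: it is plain scaled dot-product cross-attention in which RGB features act as queries and thermal features as keys and values, which is only one plausible reading of "asymmetric". The function name asymmetric_correlation, the toy feature vectors, and the residual addition are all assumptions made for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def asymmetric_correlation(rgb_feats, thermal_feats):
    """Hypothetical sketch: each RGB position attends over ALL thermal positions.

    rgb_feats and thermal_feats are lists of feature vectors of equal dimension.
    Returns the fused features and the attention weights per RGB position.
    """
    dim = len(rgb_feats[0])
    scale = 1.0 / math.sqrt(dim)
    fused, all_weights = [], []
    for query in rgb_feats:
        # Correlate this RGB feature with every thermal feature, wherever it sits.
        weights = softmax([dot(query, key) * scale for key in thermal_feats])
        # Gather thermal features as a weighted sum (a soft form of sampling).
        gathered = [
            sum(w * key[d] for w, key in zip(weights, thermal_feats))
            for d in range(dim)
        ]
        # Assumed fusion rule: add the gathered thermal feature to the RGB feature.
        fused.append([q + g for q, g in zip(query, gathered)])
        all_weights.append(weights)
    return fused, all_weights


# Toy data: the thermal "image" holds the same content shifted by one position.
rgb = [[4.0, 0.0, 0.0], [0.0, 4.0, 0.0], [0.0, 0.0, 4.0]]
thermal = [rgb[2], rgb[0], rgb[1]]

fused, weights = asymmetric_correlation(rgb, thermal)
# For each RGB position, the thermal position that receives the most attention.
peaks = [w.index(max(w)) for w in weights]
print(peaks)  # [1, 2, 0]
```

In this toy case the attention peaks follow the one-position shift between the two inputs, which a fusion rule that pairs features at identical coordinates would miss. The paper's module additionally uses semantic guidance, which this sketch does not model.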

The authors also introduce a unified benchmark dataset, UVT2000, which contains 2000 RGB and thermal image pairs captured directly from various real-world scenes without any alignment. The dataset is intended to facilitate research on alignment-free RGBT salient object detection, and the authors report experiments on both aligned and unaligned datasets.

Critical Analysis

The alignment-free RGBT salient object detection approach proposed in this paper addresses an important challenge in real-world applications, where RGB and thermal images are captured unaligned and manual alignment is labor-intensive. With the asymmetric correlation and associated feature sampling modules, the researchers present a promising way to use multi-modal information without alignment.

However, the paper does not provide a detailed discussion of the computational efficiency and runtime performance of the proposed framework. In practical applications, the inference speed of the model may be a crucial factor, and this aspect could be further explored in future research.

Additionally, the authors mention that their benchmark dataset is designed to facilitate research in this area, but they do not provide a comprehensive analysis of the dataset's diversity, complexity, or potential biases. A more in-depth evaluation of the dataset's characteristics and its impact on the research community would be valuable.

Conclusion

This paper presents an alignment-free RGBT salient object detection framework, SACNet, that uses multi-modal information without requiring manually aligned image pairs. Its asymmetric correlation and associated feature sampling modules offer a novel approach to this problem, and the authors report superior performance on both aligned and unaligned datasets. The UVT2000 benchmark dataset is also a valuable contribution to the research community. Further research could examine the framework's computational efficiency and provide a more comprehensive analysis of the benchmark dataset.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Alignment-Free RGBT Salient Object Detection: Semantics-guided Asymmetric Correlation Network and A Unified Benchmark

Kunpeng Wang, Danying Lin, Chenglong Li, Zhengzheng Tu, Bin Luo

RGB and Thermal (RGBT) Salient Object Detection (SOD) aims to achieve high-quality saliency prediction by exploiting the complementary information of visible and thermal image pairs, which are initially captured in an unaligned manner. However, existing methods are tailored for manually aligned image pairs, which are labor-intensive, and directly applying these methods to original unaligned image pairs could significantly degrade their performance. In this paper, we make the first attempt to address RGBT SOD for initially captured RGB and thermal image pairs without manual alignment. Specifically, we propose a Semantics-guided Asymmetric Correlation Network (SACNet) that consists of two novel components: 1) an asymmetric correlation module utilizing semantics-guided attention to model cross-modal correlations specific to unaligned salient regions; 2) an associated feature sampling module to sample relevant thermal features according to the corresponding RGB features for multi-modal feature integration. In addition, we construct a unified benchmark dataset called UVT2000, containing 2000 RGB and thermal image pairs directly captured from various real-world scenes without any alignment, to facilitate research on alignment-free RGBT SOD. Extensive experiments on both aligned and unaligned datasets demonstrate the effectiveness and superior performance of our method. The dataset and code are available at https://github.com/Angknpng/SACNet.

6/4/2024

Visible-Thermal Tiny Object Detection: A Benchmark Dataset and Baselines

Xinyi Ying, Chao Xiao, Ruojing Li, Xu He, Boyang Li, Zhaoxu Li, Yingqian Wang, Mingyuan Hu, Qingyu Xu, Zaiping Lin, Miao Li, Shilin Zhou, Wei An, Weidong Sheng, Li Liu

Small object detection (SOD) has been a longstanding yet challenging task for decades, with numerous datasets and algorithms being developed. However, they mainly focus on either visible or thermal modality, while visible-thermal (RGBT) bimodality is rarely explored. Although some RGBT datasets have been developed recently, the insufficient quantity, limited category, misaligned images and large target size cannot provide an impartial benchmark to evaluate multi-category visible-thermal small object detection (RGBT SOD) algorithms. In this paper, we build the first large-scale benchmark with high diversity for RGBT SOD (namely RGBT-Tiny), including 115 paired sequences, 93K frames and 1.2M manual annotations. RGBT-Tiny contains abundant targets (7 categories) and high-diversity scenes (8 types that cover different illumination and density variations). Note that, over 81% of targets are smaller than 16x16, and we provide paired bounding box annotations with tracking ID to offer an extremely challenging benchmark with wide-range applications, such as RGBT fusion, detection and tracking. In addition, we propose a scale adaptive fitness (SAFit) measure that exhibits high robustness on both small and large targets. The proposed SAFit can provide reasonable performance evaluation and promote detection performance. Based on the proposed RGBT-Tiny dataset and SAFit measure, extensive evaluations have been conducted, including 23 recent state-of-the-art algorithms that cover four different types (i.e., visible generic detection, visible SOD, thermal SOD and RGBT object detection). Project is available at https://github.com/XinyiYing24/RGBT-Tiny.

6/21/2024

Salient Object Detection in RGB-D Videos

Ao Mou, Yukang Lu, Jiahao He, Dingyao Min, Keren Fu, Qijun Zhao

Given the widespread adoption of depth-sensing acquisition devices, RGB-D videos and related data/media have gained considerable traction in various aspects of daily life. Consequently, conducting salient object detection (SOD) in RGB-D videos presents a highly promising and evolving avenue. Despite the potential of this area, SOD in RGB-D videos remains somewhat under-explored, with RGB-D SOD and video SOD (VSOD) traditionally studied in isolation. To explore this emerging field, this paper makes two primary contributions: the dataset and the model. On one front, we construct the RDVS dataset, a new RGB-D VSOD dataset with realistic depth and characterized by its diversity of scenes and rigorous frame-by-frame annotations. We validate the dataset through comprehensive attribute and object-oriented analyses, and provide training and testing splits. Moreover, we introduce DCTNet+, a three-stream network tailored for RGB-D VSOD, which emphasizes the RGB modality and treats depth and optical flow as auxiliary modalities. In pursuit of effective feature enhancement, refinement, and fusion for precise final prediction, we propose two modules: the multi-modal attention module (MAM) and the refinement fusion module (RFM). To enhance interaction and fusion within RFM, we design a universal interaction module (UIM) and then integrate holistic multi-modal attentive paths (HMAPs) for refining multi-modal low-level features before reaching RFMs. Comprehensive experiments, conducted on pseudo RGB-D video datasets alongside our RDVS, highlight the superiority of DCTNet+ over 17 VSOD models and 14 RGB-D SOD models. Ablation experiments were performed on both pseudo and realistic RGB-D video datasets to demonstrate the advantages of individual modules as well as the necessity of introducing realistic depth. Our code together with RDVS dataset will be available at https://github.com/kerenfu/RDVS/.

5/22/2024

ViDSOD-100: A New Dataset and a Baseline Model for RGB-D Video Salient Object Detection

Junhao Lin, Lei Zhu, Jiaxing Shen, Huazhu Fu, Qing Zhang, Liansheng Wang

With the rapid development of depth sensors, more and more RGB-D videos can be obtained. Identifying the foreground in RGB-D videos is a fundamental and important task. However, the existing salient object detection (SOD) works only focus on either static RGB-D images or RGB videos, ignoring the combination of RGB-D and video information. In this paper, we first collect a new annotated RGB-D video SOD (ViDSOD-100) dataset, which contains 100 videos with a total of 9,362 frames, acquired from diverse natural scenes. All the frames in each video are manually annotated with high-quality saliency annotations. Moreover, we propose a new baseline model, named attentive triple-fusion network (ATF-Net), for RGB-D video salient object detection. Our method aggregates the appearance information from an input RGB image, spatio-temporal information from an estimated motion map, and the geometry information from the depth map by devising three modality-specific branches and a multi-modality integration branch. The modality-specific branches extract the representation of different inputs, while the multi-modality integration branch combines the multi-level modality-specific features by introducing the encoder feature aggregation (MEA) modules and decoder feature aggregation (MDA) modules. The experimental findings conducted on both our newly introduced ViDSOD-100 dataset and the well-established DAVSOD dataset highlight the superior performance of the proposed ATF-Net. This performance enhancement is demonstrated both quantitatively and qualitatively, surpassing the capabilities of current state-of-the-art techniques across various domains, including RGB-D saliency detection, video saliency detection, and video object segmentation. Our data and our code are available at github.com/jhl-Det/RGBD_Video_SOD.

6/19/2024