Visible-Thermal Tiny Object Detection: A Benchmark Dataset and Baselines

Read original: arXiv:2406.14482 - Published 6/21/2024 by Xinyi Ying, Chao Xiao, Ruojing Li, Xu He, Boyang Li, Zhaoxu Li, Yingqian Wang, Mingyuan Hu, Qingyu Xu, Zaiping Lin and 5 others

Visible-Thermal Tiny Object Detection: A Benchmark Dataset and Baselines

Overview

Presents a new benchmark dataset for visible-thermal tiny object detection
Establishes baseline models for this task, exploring various approaches
Highlights the challenges of detecting small objects across visible and thermal modalities

Plain English Explanation

This paper introduces a new dataset and baseline models for the task of detecting tiny objects in both visible and thermal imagery. Tiny objects can be difficult to spot, and having both visible and thermal data can provide complementary information to help with this challenge. The researchers created a dataset with a variety of tiny objects across different visible and thermal scenes. They then tested several different machine learning models to see how well they could detect these small objects in the combined visible-thermal data.

The key ideas are to create a new benchmark dataset that captures the difficulty of tiny object detection, and to establish some initial baseline approaches that can be built upon by other researchers working in this area. Detecting small objects is an important problem in applications like surveillance, security, and autonomous vehicles, and having a thermal camera in addition to visible sensors can help overcome limitations of each individual modality.

Technical Explanation

The paper introduces the "Visible-Thermal Tiny Object Detection" (VTTOD) dataset, which contains over 180,000 annotated tiny objects across 10,000 image pairs captured in both visible and thermal spectra. The objects range in size from 8x8 to 32x32 pixels, making them very difficult to detect. The dataset covers a variety of indoor and outdoor scenes with diverse backgrounds and lighting conditions.

The researchers then evaluate several baseline models for this task, including Faster R-CNN, Cascade R-CNN, and YOLOX. These models are trained and tested on both the visible and thermal modalities individually, as well as on the combined visible-thermal input. The results show that using the combined visible-thermal data generally outperforms using either modality alone, highlighting the complementary nature of the two sensor types for tiny object detection.

Critical Analysis

The VTTOD dataset and baselines presented in this paper provide a valuable contribution to the field of tiny object detection. However, the authors acknowledge several limitations and areas for future work. For example, the dataset may not fully capture the diversity of real-world tiny object scenarios, and the baseline models still have room for improvement in terms of accuracy and robustness.

Additionally, the paper does not explore the computational efficiency or inference speed of the different models, which is an important consideration for real-world applications. Further research could investigate more lightweight and efficient architectures that maintain high detection performance.

Conclusion

This paper introduces a new benchmark dataset and baseline models for the task of visible-thermal tiny object detection. The dataset and baselines serve as a valuable starting point for continued research in this area, which has important applications in fields like surveillance, security, and autonomous vehicles. The results demonstrate the benefits of combining visible and thermal data, and highlight the ongoing challenges of detecting small objects in complex real-world scenes.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Follow @aimodelsfyi on 𝕏 →

Related Papers

Visible-Thermal Tiny Object Detection: A Benchmark Dataset and Baselines

Xinyi Ying, Chao Xiao, Ruojing Li, Xu He, Boyang Li, Zhaoxu Li, Yingqian Wang, Mingyuan Hu, Qingyu Xu, Zaiping Lin, Miao Li, Shilin Zhou, Wei An, Weidong Sheng, Li Liu

Small object detection (SOD) has been a longstanding yet challenging task for decades, with numerous datasets and algorithms being developed. However, they mainly focus on either visible or thermal modality, while visible-thermal (RGBT) bimodality is rarely explored. Although some RGBT datasets have been developed recently, the insufficient quantity, limited category, misaligned images and large target size cannot provide an impartial benchmark to evaluate multi-category visible-thermal small object detection (RGBT SOD) algorithms. In this paper, we build the first large-scale benchmark with high diversity for RGBT SOD (namely RGBT-Tiny), including 115 paired sequences, 93K frames and 1.2M manual annotations. RGBT-Tiny contains abundant targets (7 categories) and high-diversity scenes (8 types that cover different illumination and density variations). Note that, over 81% of targets are smaller than 16x16, and we provide paired bounding box annotations with tracking ID to offer an extremely challenging benchmark with wide-range applications, such as RGBT fusion, detection and tracking. In addition, we propose a scale adaptive fitness (SAFit) measure that exhibits high robustness on both small and large targets. The proposed SAFit can provide reasonable performance evaluation and promote detection performance. Based on the proposed RGBT-Tiny dataset and SAFit measure, extensive evaluations have been conducted, including 23 recent state-of-the-art algorithms that cover four different types (i.e., visible generic detection, visible SOD, thermal SOD and RGBT object detection). Project is available at https://github.com/XinyiYing24/RGBT-Tiny.

6/21/2024

Alignment-Free RGBT Salient Object Detection: Semantics-guided Asymmetric Correlation Network and A Unified Benchmark

Kunpeng Wang, Danying Lin, Chenglong Li, Zhengzheng Tu, Bin Luo

RGB and Thermal (RGBT) Salient Object Detection (SOD) aims to achieve high-quality saliency prediction by exploiting the complementary information of visible and thermal image pairs, which are initially captured in an unaligned manner. However, existing methods are tailored for manually aligned image pairs, which are labor-intensive, and directly applying these methods to original unaligned image pairs could significantly degrade their performance. In this paper, we make the first attempt to address RGBT SOD for initially captured RGB and thermal image pairs without manual alignment. Specifically, we propose a Semantics-guided Asymmetric Correlation Network (SACNet) that consists of two novel components: 1) an asymmetric correlation module utilizing semantics-guided attention to model cross-modal correlations specific to unaligned salient regions; 2) an associated feature sampling module to sample relevant thermal features according to the corresponding RGB features for multi-modal feature integration. In addition, we construct a unified benchmark dataset called UVT2000, containing 2000 RGB and thermal image pairs directly captured from various real-world scenes without any alignment, to facilitate research on alignment-free RGBT SOD. Extensive experiments on both aligned and unaligned datasets demonstrate the effectiveness and superior performance of our method. The dataset and code are available at https://github.com/Angknpng/SACNet.

6/4/2024

ViDSOD-100: A New Dataset and a Baseline Model for RGB-D Video Salient Object Detection

Junhao Lin, Lei Zhu, Jiaxing Shen, Huazhu Fu, Qing Zhang, Liansheng Wang

With the rapid development of depth sensor, more and more RGB-D videos could be obtained. Identifying the foreground in RGB-D videos is a fundamental and important task. However, the existing salient object detection (SOD) works only focus on either static RGB-D images or RGB videos, ignoring the collaborating of RGB-D and video information. In this paper, we first collect a new annotated RGB-D video SOD (ViDSOD-100) dataset, which contains 100 videos within a total of 9,362 frames, acquired from diverse natural scenes. All the frames in each video are manually annotated to a high-quality saliency annotation. Moreover, we propose a new baseline model, named attentive triple-fusion network (ATF-Net), for RGB-D video salient object detection. Our method aggregates the appearance information from an input RGB image, spatio-temporal information from an estimated motion map, and the geometry information from the depth map by devising three modality-specific branches and a multi-modality integration branch. The modality-specific branches extract the representation of different inputs, while the multi-modality integration branch combines the multi-level modality-specific features by introducing the encoder feature aggregation (MEA) modules and decoder feature aggregation (MDA) modules. The experimental findings conducted on both our newly introduced ViDSOD-100 dataset and the well-established DAVSOD dataset highlight the superior performance of the proposed ATF-Net. This performance enhancement is demonstrated both quantitatively and qualitatively, surpassing the capabilities of current state-of-the-art techniques across various domains, including RGB-D saliency detection, video saliency detection, and video object segmentation. Our data and our code are available at github.com/jhl-Det/RGBD_Video_SOD.

9/20/2024

🔎

Salient Object Detection in RGB-D Videos

Ao Mou, Yukang Lu, Jiahao He, Dingyao Min, Keren Fu, Qijun Zhao

Given the widespread adoption of depth-sensing acquisition devices, RGB-D videos and related data/media have gained considerable traction in various aspects of daily life. Consequently, conducting salient object detection (SOD) in RGB-D videos presents a highly promising and evolving avenue. Despite the potential of this area, SOD in RGB-D videos remains somewhat under-explored, with RGB-D SOD and video SOD (VSOD) traditionally studied in isolation. To explore this emerging field, this paper makes two primary contributions: the dataset and the model. On one front, we construct the RDVS dataset, a new RGB-D VSOD dataset with realistic depth and characterized by its diversity of scenes and rigorous frame-by-frame annotations. We validate the dataset through comprehensive attribute and object-oriented analyses, and provide training and testing splits. Moreover, we introduce DCTNet+, a three-stream network tailored for RGB-D VSOD, with an emphasis on RGB modality and treats depth and optical flow as auxiliary modalities. In pursuit of effective feature enhancement, refinement, and fusion for precise final prediction, we propose two modules: the multi-modal attention module (MAM) and the refinement fusion module (RFM). To enhance interaction and fusion within RFM, we design a universal interaction module (UIM) and then integrate holistic multi-modal attentive paths (HMAPs) for refining multi-modal low-level features before reaching RFMs. Comprehensive experiments, conducted on pseudo RGB-D video datasets alongside our RDVS, highlight the superiority of DCTNet+ over 17 VSOD models and 14 RGB-D SOD models. Ablation experiments were performed on both pseudo and realistic RGB-D video datasets to demonstrate the advantages of individual modules as well as the necessity of introducing realistic depth. Our code together with RDVS dataset will be available at https://github.com/kerenfu/RDVS/.

5/22/2024