ESOD: Efficient Small Object Detection on High-Resolution Images

Read original: arXiv:2407.16424 - Published 7/24/2024 by Kai Liu, Zhihang Fu, Sheng Jin, Ze Chen, Fan Zhou, Rongxin Jiang, Yaowu Chen, Jieping Ye

Overview

Focuses on the challenge of detecting small objects in high-resolution images
Proposes a novel "Filter-Then-Detect" approach to efficiently detect small objects
Develops a sparse detection mechanism to reduce computational complexity

Plain English Explanation

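This research paper tackles the problem of detecting small objects in high-resolution images. Enlarging the input image is a straightforward way to make small objects easier to detect, but it is expensive in both computation and GPU memory. Because small objects are usually sparsely distributed and locally clustered, much of that computation is spent on background regions that contain no targets. The researchers propose an approach called "Filter-Then-Detect" to address this.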

The key idea is to first use a lightweight filter to identify areas of interest where small objects are likely to be present. This helps focus the detection process on the most relevant regions, rather than scanning the entire image. Then, a more sophisticated object detection model is applied only to these filtered regions.
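The filter-then-detect idea can be illustrated with a minimal sketch. It is not the authors' implementation: the patch grid, the objectness scores, the 0.5 threshold, and the names select_patches and filter_then_detect are all hypothetical, chosen only to show how a cheap score map can gate an expensive detector.

```python
from typing import Callable, Dict, List, Tuple

Patch = Tuple[int, int]  # (row, col) index of a patch in a coarse grid


def select_patches(objectness: List[List[float]], threshold: float = 0.5) -> List[Patch]:
    """Keep only the grid cells whose coarse objectness score reaches the threshold."""
    return [
        (r, c)
        for r, row in enumerate(objectness)
        for c, score in enumerate(row)
        if score >= threshold
    ]


def filter_then_detect(
    objectness: List[List[float]],
    detect_patch: Callable[[Patch], List[str]],
    threshold: float = 0.5,
) -> Dict[Patch, List[str]]:
    """Run the expensive detector only on patches the cheap filter kept."""
    kept = select_patches(objectness, threshold)
    return {patch: detect_patch(patch) for patch in kept}


# Hypothetical 3x4 objectness map: most of the image is background.
scores = [
    [0.02, 0.05, 0.91, 0.10],
    [0.01, 0.03, 0.77, 0.04],
    [0.00, 0.02, 0.06, 0.01],
]

# Stand-in detector that only labels the patch it was called on.
results = filter_then_detect(scores, lambda p: [f"object@{p}"])
print(sorted(results))  # [(0, 2), (1, 2)]
```

In this toy example the detector is called on 2 of 12 patches. The threshold trades speed against the risk of discarding a patch that contains an object.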

Additionally, the researchers developed a sparse detection mechanism to further reduce the computational complexity of the detection process. This involves selectively processing only the most promising areas, rather than analyzing the entire image in a dense manner.
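As a rough illustration of sparse versus dense processing, the sketch below evaluates a per-location head either everywhere or only at a list of active cells. The feature values, the linear head, and the function names are assumptions made for illustration; the paper's sparse detection head is not reproduced here.

```python
from typing import Callable, Dict, List, Tuple

Cell = Tuple[int, int]  # (row, col) position in a feature map


def sparse_head(
    features: List[List[float]],
    active: List[Cell],
    head: Callable[[float], float],
) -> Dict[Cell, float]:
    """Evaluate the head only at active cells; every other cell is skipped."""
    return {(r, c): head(features[r][c]) for r, c in active}


def dense_head(
    features: List[List[float]],
    head: Callable[[float], float],
) -> Dict[Cell, float]:
    """Baseline: evaluate the head at every cell of the feature map."""
    return {
        (r, c): head(value)
        for r, row in enumerate(features)
        for c, value in enumerate(row)
    }


calls = {"n": 0}  # counts how often the head is evaluated


def counting_head(x: float) -> float:
    """Hypothetical head: a fixed linear map that also records each call."""
    calls["n"] += 1
    return 2.0 * x + 1.0


features = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]
active = [(0, 2), (1, 0)]  # cells assumed to have been kept by the filter

sparse_out = sparse_head(features, active, counting_head)
sparse_calls = calls["n"]

calls["n"] = 0
dense_out = dense_head(features, counting_head)
dense_calls = calls["n"]

print(sparse_calls, dense_calls)  # 2 6
```

The sparse outputs match the dense outputs at the active cells, so the saving comes purely from skipping locations, not from changing the computation at the locations that remain.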

By combining these techniques, the researchers were able to achieve efficient small object detection on high-resolution images without sacrificing accuracy.

Technical Explanation

The paper proposes an Efficient Small Object Detection (ESOD) framework that consists of two key components:

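Filter-Then-Detect: The first stage is a lightweight filter that quickly identifies regions where small objects are likely to be present. Rather than adding an extra network, ESOD reuses the detector's backbone for feature-level object-seeking and patch-slicing, which avoids redundant feature extraction. The subsequent, more computationally intensive detection stage then runs only on the most relevant areas.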
Sparse Detection: The object detection model is applied in a sparse manner, only processing the most promising regions identified by the filter, rather than analyzing the entire image densely. This reduces the overall computational complexity.
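A back-of-the-envelope cost model can make the intended saving concrete. It rests on assumptions that do not come from the paper: cost is taken to grow with pixel count (the square of the upscaling factor), a fixed share of the network is assumed to run densely to locate objects, and the remainder is assumed to run only on the kept fraction of the image. The function name and all numbers are hypothetical.

```python
def relative_cost(scale: float, kept_fraction: float, shared_fraction: float = 0.0) -> float:
    """Estimated cost relative to dense detection at the original resolution.

    scale:           upscaling factor per image side (cost assumed ~ scale**2)
    kept_fraction:   share of the image area passed on to the heavy stages
    shared_fraction: share of the network that always runs on the full image
    """
    if not 0.0 <= kept_fraction <= 1.0:
        raise ValueError("kept_fraction must be in [0, 1]")
    if not 0.0 <= shared_fraction <= 1.0:
        raise ValueError("shared_fraction must be in [0, 1]")
    dense_cost = scale ** 2
    return dense_cost * (shared_fraction + (1.0 - shared_fraction) * kept_fraction)


# Hypothetical numbers: 2x upscaling, 25% of the area kept, 20% of the network shared.
dense = relative_cost(2.0, kept_fraction=1.0, shared_fraction=0.2)
sparse = relative_cost(2.0, kept_fraction=0.25, shared_fraction=0.2)
print(f"dense: {dense:.2f}x, sparse: {sparse:.2f}x")  # dense: 4.00x, sparse: 1.60x
```

Under these assumed numbers, doubling the resolution costs 4x when processed densely but about 1.6x when only a quarter of the area reaches the heavy stages. Real savings depend on the detector and on how sparse the scene actually is.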

The authors evaluate their ESOD framework on the VisDrone, UAVDT, and TinyPerson benchmarks. They report that ESOD consistently surpasses state-of-the-art detectors by a large margin (e.g., 8% gains in AP) while saving computation and GPU memory, and that the framework can be applied to both CNN- and ViT-based detectors.

Critical Analysis

The researchers acknowledge that their approach may be less effective for small objects that are highly occluded or have very low contrast. They suggest that further improvements could be made by incorporating additional cues, such as context information, to better handle these challenging cases.

Additionally, the performance of the filter stage is crucial to the overall effectiveness of the ESOD framework. If the filter fails to accurately identify relevant regions, the subsequent detection stage may miss important small objects. Careful optimization of the filter's design and parameters would be important for practical deployment.
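One way to quantify this dependency is to measure the filter's recall: the share of ground-truth objects whose centers land in a patch the filter kept. The sketch below assumes square patches on a regular grid and uses made-up coordinates; filter_recall is a hypothetical helper, not a metric defined in the paper.

```python
from typing import List, Tuple


def filter_recall(
    object_centers: List[Tuple[float, float]],
    kept_patches: List[Tuple[int, int]],
    patch_size: int,
) -> float:
    """Fraction of object centers (x, y) that fall inside a kept (row, col) patch."""
    if not object_centers:
        return 1.0  # nothing to miss
    kept = set(kept_patches)
    covered = sum(
        1
        for x, y in object_centers
        if (int(y // patch_size), int(x // patch_size)) in kept
    )
    return covered / len(object_centers)


# Hypothetical data: three objects, 64-pixel patches, two patches kept by the filter.
centers = [(10, 5), (70, 20), (130, 90)]
kept = [(0, 0), (0, 1)]
recall = filter_recall(centers, kept, patch_size=64)
print(f"{recall:.2f}")  # 0.67
```

Any object outside the kept patches cannot be recovered by the later detection stage, so this recall acts as an upper bound on the recall of the full pipeline.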

Conclusion

This paper presents a novel "Filter-Then-Detect" approach to efficiently detect small objects in high-resolution images. By combining a lightweight filtering stage with a sparse detection mechanism, the ESOD framework can achieve state-of-the-art performance while being more computationally efficient than traditional dense detection methods. This research represents an important step forward in addressing the challenge of small object detection, which has significant implications for a wide range of applications, from autonomous vehicles to video surveillance.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

ESOD: Efficient Small Object Detection on High-Resolution Images

Kai Liu, Zhihang Fu, Sheng Jin, Ze Chen, Fan Zhou, Rongxin Jiang, Yaowu Chen, Jieping Ye

Enlarging input images is a straightforward and effective approach to promote small object detection. However, simple image enlargement is significantly expensive on both computations and GPU memory. In fact, small objects are usually sparsely distributed and locally clustered. Therefore, massive feature extraction computations are wasted on the non-target background area of images. Recent works have tried to pick out target-containing regions using an extra network and perform conventional object detection, but the newly introduced computation limits their final performance. In this paper, we propose to reuse the detector's backbone to conduct feature-level object-seeking and patch-slicing, which can avoid redundant feature extraction and reduce the computation cost. Incorporating a sparse detection head, we are able to detect small objects on high-resolution inputs (e.g., 1080P or larger) for superior performance. The resulting Efficient Small Object Detection (ESOD) approach is a generic framework, which can be applied to both CNN- and ViT-based detectors to save the computation and GPU memory costs. Extensive experiments demonstrate the efficacy and efficiency of our method. In particular, our method consistently surpasses the SOTA detectors by a large margin (e.g., 8% gains on AP) on the representative VisDrone, UAVDT, and TinyPerson datasets. Code will be made public soon.

7/24/2024

SOD-YOLOv8 -- Enhancing YOLOv8 for Small Object Detection in Traffic Scenes

Boshra Khalili, Andrew W. Smyth

Object detection as part of computer vision can be crucial for traffic management, emergency response, autonomous vehicles, and smart cities. Despite significant advances in object detection, detecting small objects in images captured by distant cameras remains challenging due to their size, distance from the camera, varied shapes, and cluttered backgrounds. To address these challenges, we propose Small Object Detection YOLOv8 (SOD-YOLOv8), a novel model specifically designed for scenarios involving numerous small objects. Inspired by Efficient Generalized Feature Pyramid Networks (GFPN), we enhance multi-path fusion within YOLOv8 to integrate features across different levels, preserving details from shallower layers and improving small object detection accuracy. Also, a fourth detection layer is added to leverage high-resolution spatial information effectively. The Efficient Multi-Scale Attention Module (EMA) in the C2f-EMA module enhances feature extraction by redistributing weights and prioritizing relevant features. We introduce Powerful-IoU (PIoU) as a replacement for CIoU, focusing on moderate-quality anchor boxes and adding a penalty based on differences between predicted and ground truth bounding box corners. This approach simplifies calculations, speeds up convergence, and enhances detection accuracy. SOD-YOLOv8 significantly improves small object detection, surpassing widely used models in various metrics, without substantially increasing computational cost or latency compared to YOLOv8s. Specifically, it increases recall from 40.1% to 43.9%, precision from 51.2% to 53.9%, $\text{mAP}_{0.5}$ from 40.6% to 45.1%, and $\text{mAP}_{0.5:0.95}$ from 24% to 26.6%. In dynamic real-world traffic scenes, SOD-YOLOv8 demonstrated notable improvements in diverse conditions, proving its reliability and effectiveness in detecting small objects even in challenging environments.

8/12/2024

Better Sampling, towards Better End-to-end Small Object Detection

Zile Huang, Chong Zhang, Mingyu Jin, Fangyu Wu, Chengzhi Liu, Xiaobo Jin

While deep learning-based general object detection has made significant strides in recent years, the effectiveness and efficiency of small object detection remain unsatisfactory. This is primarily attributed not only to the limited characteristics of such small targets but also to the high density and mutual overlap among these targets. The existing transformer-based small object detectors do not leverage the gap between accuracy and inference speed. To address challenges, we propose methods enhancing sampling within an end-to-end framework. Sample Points Refinement (SPR) constrains localization and attention, preserving meaningful interactions in the region of interest and filtering out misleading information. Scale-aligned Target (ST) integrates scale information into target confidence, improving classification for small object detection. A task-decoupled Sample Reweighting (SR) mechanism guides attention toward challenging positive examples, utilizing a weight generator module to assess the difficulty and adjust classification loss based on decoder layer outcomes. Comprehensive experiments across various benchmarks reveal that our proposed detector excels in detecting small objects. Our model demonstrates a significant enhancement, achieving a 2.9% increase in average precision (AP) over the state-of-the-art (SOTA) on the VisDrone dataset and a 1.7% improvement on the SODA-D dataset.

7/9/2024

PGNeXt: High-Resolution Salient Object Detection via Pyramid Grafting Network

Changqun Xia, Chenxi Xie, Zhentao He, Tianshu Yu, Jia Li

We present an advanced study on more challenging high-resolution salient object detection (HRSOD) from both dataset and network framework perspectives. To compensate for the lack of HRSOD dataset, we thoughtfully collect a large-scale high resolution salient object detection dataset, called UHRSD, containing 5,920 images from real-world complex scenarios at 4K-8K resolutions. All the images are finely annotated in pixel-level, far exceeding previous low-resolution SOD datasets. Aiming at overcoming the contradiction between the sampling depth and the receptive field size in the past methods, we propose a novel one-stage framework for HR-SOD task using pyramid grafting mechanism. In general, transformer-based and CNN-based backbones are adopted to extract features from different resolution images independently and then these features are grafted from transformer branch to CNN branch. An attention-based Cross-Model Grafting Module (CMGM) is proposed to enable CNN branch to combine broken detailed information more holistically, guided by different source feature during decoding process. Moreover, we design an Attention Guided Loss (AGL) to explicitly supervise the attention matrix generated by CMGM to help the network better interact with the attention from different branches. Comprehensive experiments on UHRSD and widely-used SOD datasets demonstrate that our method can simultaneously locate salient object and preserve rich details, outperforming state-of-the-art methods. To verify the generalization ability of the proposed framework, we apply it to the camouflaged object detection (COD) task. Notably, our method performs superior to most state-of-the-art COD methods without bells and whistles.

8/6/2024