OVA-DETR: Open Vocabulary Aerial Object Detection Using Image-Text Alignment and Fusion

Read original: arXiv:2408.12246 - Published 8/23/2024 by Guoting Wei, Xia Yuan, Yu Liu, Zhenhao Shang, Kelu Yao, Chao Li, Qingsen Yan, Chunxia Zhao, Haokui Zhang, Rong Xiao

Overview

OVA-DETR is a new open-vocabulary aerial object detection model that aligns images and text to identify objects.
It combines image and text encoders to learn object representations from both visual and textual data.
OVA-DETR can detect a wide range of objects without retraining, making it more flexible than traditional object detectors.

Plain English Explanation

OVA-DETR is a machine learning model designed to detect objects in aerial images, such as those taken from drones or satellites. What makes this model unique is its ability to recognize a wide variety of objects without needing to be retrained on new data.

Traditionally, object detectors are trained on a fixed set of object classes and cannot identify objects outside of that set. OVA-DETR overcomes this limitation by aligning the image data with corresponding text descriptions. This allows the model to learn about objects from both visual and language-based information.

The key innovation is the fusion of an image encoder and a text encoder. The image encoder extracts visual features from the input image, while the text encoder understands the meanings and relationships in the associated text descriptions. By combining these two encoders, OVA-DETR can learn rich representations of objects that go beyond just their visual appearance.

This open-vocabulary approach means OVA-DETR can detect objects named by a text description, including categories it was not explicitly trained on, without retraining the model for each new class. This makes it more flexible and adaptable for aerial object detection than fixed-category detectors.
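The idea of letting text stand in for a fixed class list can be sketched as a nearest-text lookup, assuming region and text embeddings already live in a shared space. The vectors, category names, and the `classify_region` helper below are hypothetical toy values, not outputs of OVA-DETR's encoders.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def classify_region(region_embedding, text_embeddings):
    """Return (label, score) for the text embedding most similar to the region."""
    scores = {label: cosine(region_embedding, emb)
              for label, emb in text_embeddings.items()}
    best = max(scores, key=scores.get)
    return best, scores[best]

# Hypothetical embeddings; a real system would obtain these from trained encoders.
vocabulary = {
    "airplane": [0.9, 0.1, 0.0],
    "ship": [0.1, 0.9, 0.1],
}
region = [0.2, 0.8, 0.1]
label, score = classify_region(region, vocabulary)

# Extending the vocabulary only adds another text embedding; nothing is retrained.
vocabulary["windmill"] = [0.0, 0.2, 0.9]
label_after, _ = classify_region(region, vocabulary)
```

In this toy setup the category list is just a dictionary of text embeddings, so adding a class is a data change rather than a training run.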

Technical Explanation

OVA-DETR builds on the DETR architecture, which uses transformers to directly predict bounding boxes and object classes from an input image. OVA-DETR extends this by adding a text encoder that learns representations from associated textual descriptions.

The model consists of three main components:

Visual Encoder: This is a convolutional neural network that extracts visual features from the input image.
Text Encoder: This is a transformer-based language model that encodes the meaning and relationships in the textual descriptions.
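Fusion Module: This combines the visual and textual representations to produce the final object detections. The paper calls it Bidirectional Vision-Language Fusion (Bi-VLF); it comprises a dual-attention fusion encoder and a multi-level text-guided fusion decoder, the latter aimed at improving detection of small objects.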
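One common way to fuse two modalities is cross-attention, where each image feature attends over the text embeddings. The sketch below is a generic single-head version with no learned projections. It is an assumption for illustration, not the paper's fusion module, and every name and value in it is hypothetical.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def cross_attend(queries, keys_values):
    """Each query attends over keys_values; returns fused vectors and attention weights.

    Assumes queries and keys_values share the same feature dimension.
    """
    scale = math.sqrt(len(keys_values[0]))
    fused, all_weights = [], []
    for q in queries:
        weights = softmax([dot(q, k) / scale for k in keys_values])
        mixed = [sum(w * kv[i] for w, kv in zip(weights, keys_values))
                 for i in range(len(q))]
        # Residual connection: keep the original feature and add the attended context.
        fused.append([qi + mi for qi, mi in zip(q, mixed)])
        all_weights.append(weights)
    return fused, all_weights

# Toy features: two image locations and two text tokens (hypothetical values).
image_feats = [[1.0, 0.0], [0.0, 1.0]]
text_feats = [[2.0, 0.0], [0.0, 2.0]]
fused, attn = cross_attend(image_feats, text_feats)
```

Calling `cross_attend(text_feats, image_feats)` runs the same operation in the opposite direction, with text attending to image features.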

During training, OVA-DETR learns to align image regions with text using a region-text contrastive loss, which replaces the fixed-category loss of a traditional detection framework. This lets the model build object representations that capture both visual and semantic information, and enables it to detect a wide range of objects, even ones not seen during training.
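An alignment objective of this kind can be sketched as a softmax cross-entropy over region-to-text similarities, which is a generic contrastive formulation. The paper's exact loss may differ; the temperature value, function name, and embeddings below are assumptions made for illustration.

```python
import math

def region_text_contrastive_loss(region_embs, text_embs, targets, temperature=0.1):
    """Mean cross-entropy where each region's logits are its similarities to every text embedding.

    targets[i] is the index of the text embedding that matches region i.
    Embeddings are assumed to be roughly unit-normalized.
    """
    total = 0.0
    for region, target in zip(region_embs, targets):
        logits = [sum(r * t for r, t in zip(region, text)) / temperature
                  for text in text_embs]
        m = max(logits)
        log_sum_exp = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_sum_exp - logits[target]
    return total / len(region_embs)

# Toy text embeddings for two categories and two region embeddings (hypothetical).
text_embs = [[1.0, 0.0], [0.0, 1.0]]
regions = [[0.9, 0.1], [0.2, 0.8]]

aligned_loss = region_text_contrastive_loss(regions, text_embs, targets=[0, 1])
misaligned_loss = region_text_contrastive_loss(regions, text_embs, targets=[1, 0])
```

The loss is small when each region sits close to its matching text embedding and large when the pairing is swapped, which is the pressure that pulls the two modalities into one space.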

The experiments in the paper cover three widely used aerial object detection benchmarks. In zero-shot detection on DIOR, for example, the abstract reports that OVA-DETR outperforms DescReg and YOLO-World by 37.4% and 33.1%, respectively, while running at 87 FPS.

Critical Analysis

The paper provides a comprehensive evaluation of OVA-DETR, examining its performance on a variety of aerial object detection tasks. The results show that the model's open-vocabulary capability leads to significant improvements over previous approaches.

However, the paper does acknowledge some limitations. For example, the text descriptions used for training may not always be comprehensive or accurate, which could introduce biases into the learned representations. Additionally, the fusion of visual and textual features is a complex process, and further research may be needed to understand the best ways to combine these modalities.

Another potential issue is computational cost, since transformer-based architectures can be resource-intensive. The abstract reports 87 FPS inference but does not say on what hardware, so deployment in real-world applications with strict latency or hardware constraints remains an open question.

Overall, the paper presents a promising approach to open-vocabulary object detection that could have significant implications for a wide range of aerial imaging applications. However, further research is needed to address the model's limitations and explore its practical deployment.

Conclusion

OVA-DETR is a novel open-vocabulary object detection model that leverages the alignment of image and text data to learn robust object representations. By combining visual and textual encoders, the model can detect a wide range of objects without requiring retraining, making it a more flexible and adaptable tool for aerial imaging tasks.

The paper's experimental results demonstrate the effectiveness of this approach, with OVA-DETR outperforming previous open-vocabulary and fixed-vocabulary detectors. While the model has some limitations, the authors have made a significant contribution to the field of object detection, paving the way for more versatile and powerful visual recognition systems.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

OVA-DETR: Open Vocabulary Aerial Object Detection Using Image-Text Alignment and Fusion

Guoting Wei, Xia Yuan, Yu Liu, Zhenhao Shang, Kelu Yao, Chao Li, Qingsen Yan, Chunxia Zhao, Haokui Zhang, Rong Xiao

Aerial object detection has been a hot topic for many years due to its wide application requirements. However, most existing approaches can only handle predefined categories, which limits their applicability for open scenarios in the real world. In this paper, we extend aerial object detection to open scenarios by exploiting the relationship between image and text, and propose OVA-DETR, a high-efficiency open-vocabulary detector for aerial images. Specifically, based on the idea of image-text alignment, we propose region-text contrastive loss to replace the category regression loss in the traditional detection framework, which breaks the category limitation. Then, we propose Bidirectional Vision-Language Fusion (Bi-VLF), which includes a dual-attention fusion encoder and a multi-level text-guided Fusion Decoder. The dual-attention fusion encoder enhances the feature extraction process in the encoder part. The multi-level text-guided Fusion Decoder is designed to improve the detection ability for small objects, which frequently appear in aerial object detection scenarios. Experimental results on three widely used benchmark datasets show that our proposed method significantly improves the mAP and recall, while enjoying faster inference speed. For instance, in zero-shot detection experiments on DIOR, the proposed OVA-DETR outperforms DescReg and YOLO-World by 37.4% and 33.1%, respectively, while achieving 87 FPS inference speed, which is 7.9x faster than DescReg and 3x faster than YOLO-World. The code is available at https://github.com/GT-Wei/OVA-DETR.

8/23/2024

OV-DQUO: Open-Vocabulary DETR with Denoising Text Query Training and Open-World Unknown Objects Supervision

Junjie Wang, Bin Chen, Bin Kang, Yulin Li, YiChi Chen, Weizhi Xian, Huifeng Chang, Yong Xu

Open-vocabulary detection aims to detect objects from novel categories beyond the base categories on which the detector is trained. However, existing open-vocabulary detectors trained on base category data tend to assign higher confidence to trained categories and confuse novel categories with the background. To resolve this, we propose OV-DQUO, an Open-Vocabulary DETR with Denoising text Query training and open-world Unknown Objects supervision. Specifically, we introduce a wildcard matching method. This method enables the detector to learn from pairs of unknown objects recognized by the open-world detector and text embeddings with general semantics, mitigating the confidence bias between base and novel categories. Additionally, we propose a denoising text query training strategy. It synthesizes foreground and background query-box pairs from open-world unknown objects to train the detector through contrastive learning, enhancing its ability to distinguish novel objects from the background. We conducted extensive experiments on the challenging OV-COCO and OV-LVIS benchmarks, achieving new state-of-the-art results of 45.6 AP50 and 39.3 mAP on novel categories respectively, without the need for additional training data. Models and code are released at https://github.com/xiaomoguhz/OV-DQUO.

8/22/2024

OVLW-DETR: Open-Vocabulary Light-Weighted Detection Transformer

Yu Wang, Xiangbo Su, Qiang Chen, Xinyu Zhang, Teng Xi, Kun Yao, Errui Ding, Gang Zhang, Jingdong Wang

Open-vocabulary object detection focuses on detecting novel categories guided by natural language. In this report, we propose Open-Vocabulary Light-Weighted Detection Transformer (OVLW-DETR), a deployment-friendly open-vocabulary detector with strong performance and low latency. Building upon OVLW-DETR, we provide an end-to-end training recipe that transfers knowledge from a vision-language model (VLM) to the object detector with simple alignment. We align the detector with the text encoder of the VLM by replacing the fixed classification layer weights in the detector with the class-name embeddings extracted from the text encoder. Without an additional fusion module, OVLW-DETR is flexible and deployment friendly, making it easier to implement and modulate. improving the efficiency of interleaved attention computation. Experimental results demonstrate that the proposed approach is superior to existing real-time open-vocabulary detectors on the standard Zero-Shot LVIS benchmark. Source code and pre-trained models are available at https://github.com/Atten4Vis/LW-DETR.

7/16/2024

Toward Open Vocabulary Aerial Object Detection with CLIP-Activated Student-Teacher Learning

Yan Li, Weiwei Guo, Xue Yang, Ning Liao, Dunyun He, Jiaqi Zhou, Wenxian Yu

An increasingly massive number of remote-sensing images spurs the development of extensible object detectors that can detect objects beyond training categories without the costly collection of new labeled data. In this paper, we aim to develop an open-vocabulary object detection (OVD) technique for aerial images that scales up object vocabulary size beyond training data. The performance of OVD greatly relies on the quality of class-agnostic region proposals and pseudo-labels for novel object categories. To simultaneously generate high-quality proposals and pseudo-labels, we propose CastDet, a CLIP-activated student-teacher open-vocabulary object detection framework. Our end-to-end framework, following the student-teacher self-learning mechanism, employs the RemoteCLIP model as an extra omniscient teacher with rich knowledge. By doing so, our approach boosts not only novel object proposals but also classification. Furthermore, we devise a dynamic label queue strategy to maintain high-quality pseudo-labels during batch training. We conduct extensive experiments on multiple existing aerial object detection datasets, which are set up for the OVD task. Experimental results demonstrate that CastDet achieves superior open-vocabulary detection performance, e.g., reaching 46.5% mAP on VisDroneZSD novel categories, which outperforms the state-of-the-art open-vocabulary detectors by 21.0% mAP. To the best of our knowledge, this is the first work to apply and develop the open-vocabulary object detection technique for aerial images. The code is available at https://github.com/lizzy8587/CastDet.

8/13/2024