OV-DINO: Unified Open-Vocabulary Detection with Language-Aware Selective Fusion

Read original: arXiv:2407.07844 - Published 7/23/2024 by Hao Wang, Pengzhen Ren, Zequn Jie, Xiao Dong, Chengjian Feng, Yinlong Qian, Lin Ma, Dongmei Jiang, Yaowei Wang, Xiangyuan Lan and 1 other

Overview

This paper introduces OV-DINO, a unified open-vocabulary object detection model that uses language-aware selective fusion to improve performance across a wide range of object categories.
OV-DINO builds on the DINO transformer-based detection architecture and incorporates novel techniques to enable open-vocabulary detection.
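The key innovations are a Language-Aware Selective Fusion (LASF) module, which improves cross-modality alignment through language-aware query selection and fusion, and a Unified Data Integration (UniDI) pipeline, which converts different data sources into a detection-centric format so the model can be trained end to end without noise from pseudo-label generation.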

Plain English Explanation

OV-DINO is an object detection model that can recognize a wide variety of objects, including ones it was never explicitly trained on. It does this by combining visual information from the image with language information from the class names it is asked to find.

Traditionally, object detectors are trained on a fixed set of object categories, limiting their ability to recognize novel or uncommon objects. OV-DINO overcomes this by using a "language-aware" fusion module that learns to selectively combine visual and textual features. This allows the model to understand the relationships between words and visual concepts, enabling it to detect objects it hasn't seen before.

The first key component is the Language-Aware Selective Fusion (LASF) module, which uses the text input to select the relevant object queries and then fuses language information into them. The second is a Unified Data Integration (UniDI) pipeline, which converts different kinds of training data into a single detection-style format so the model can be trained end to end without relying on noisy pseudo-labels.
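
As a rough illustration of what a unified, detection-centric data format might look like, the sketch below converts a detection sample and an image-caption pair into one shared record type. The paper's abstract describes this idea as the Unified Data Integration (UniDI) pipeline. The record layout, the function names, and the choice to treat a caption as a single category whose box covers the whole image are assumptions made for illustration, not the authors' code.

```python
from dataclasses import dataclass
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max) in pixels


@dataclass
class DetectionRecord:
    """Hypothetical detection-centric record shared by all data sources."""
    image_id: str
    width: int
    height: int
    boxes: List[Box]
    phrases: List[str]  # one text label per box


def from_detection(image_id, width, height, boxes, class_names):
    """Detection data already provides one class name per box."""
    if len(boxes) != len(class_names):
        raise ValueError("each box needs exactly one class name")
    return DetectionRecord(image_id, width, height, list(boxes), list(class_names))


def from_caption(image_id, width, height, caption):
    """Assumption: a caption acts as one category whose box spans the image."""
    full_image_box = (0.0, 0.0, float(width), float(height))
    return DetectionRecord(image_id, width, height, [full_image_box], [caption])


det = from_detection("img_001", 640, 480, [(10.0, 20.0, 200.0, 300.0)], ["dog"])
cap = from_caption("img_002", 800, 600, "a child flying a kite on a beach")
for record in (det, cap):
    print(record.image_id, record.boxes, record.phrases)
```

Because both sources end up as the same record type, a single detection loss can in principle consume them without a separate pseudo-labeling stage, which is the motivation the abstract gives for unifying the data.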

OV-DINO's open-vocabulary abilities could be useful in a wide range of applications, such as robotics, autonomous vehicles, and image search, where the ability to recognize a diverse set of objects is crucial. By bridging the gap between visual and language understanding, this research represents an important step towards more versatile and capable object detection systems.

Technical Explanation

OV-DINO builds on the DINO transformer-based detection architecture, which has shown strong performance on standard object detection benchmarks. To enable open-vocabulary detection, the authors introduce two key innovations:

Language-Aware Selective Fusion (LASF): This module performs language-aware query selection and fusion. It selects the visual features most relevant to the text input and fuses textual features into them, which strengthens region-level alignment between the visual and textual modalities.
Unified Data Integration (UniDI): Different data sources are converted into a detection-centric data format, which enables end-to-end training and eliminates the noise introduced by pseudo-label generation. This unified approach helps the model generalize better to unseen object categories.
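
The selection-then-fusion idea behind LASF can be sketched in a few lines of plain Python. This is a toy illustration under stated assumptions, not the paper's implementation: the function names `select_queries` and `fuse_text`, the use of dot-product similarity, the top-k rule, and the residual attention-style fusion are all choices made here for clarity, and the 2-D embeddings are made up.

```python
import math
from typing import List

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def softmax(scores: List[float]) -> List[float]:
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def select_queries(image_tokens: List[Vector], text_embeds: List[Vector], k: int) -> List[int]:
    """Selection: keep the k image tokens that best match any text prompt."""
    best_match = [max(dot(tok, txt) for txt in text_embeds) for tok in image_tokens]
    ranked = sorted(range(len(image_tokens)), key=lambda i: best_match[i], reverse=True)
    return ranked[:k]


def fuse_text(query: Vector, text_embeds: List[Vector]) -> Vector:
    """Fusion: add an attention-weighted mix of text embeddings to the query."""
    scale = math.sqrt(len(query))
    weights = softmax([dot(query, txt) / scale for txt in text_embeds])
    context = [sum(w * txt[d] for w, txt in zip(weights, text_embeds))
               for d in range(len(query))]
    return [q + c for q, c in zip(query, context)]  # residual connection


# Toy 2-D embeddings: two text prompts and four image tokens (all made up).
text_embeds = [[1.0, 0.0], [0.0, 1.0]]
image_tokens = [[0.9, 0.1], [0.1, 0.2], [0.0, 0.8], [0.3, 0.3]]

selected = select_queries(image_tokens, text_embeds, k=2)
fused = [fuse_text(image_tokens[i], text_embeds) for i in selected]
print(selected)
```

Tokens 0 and 2 are kept because each aligns strongly with one of the two prompts. The fused queries are then nudged toward the text embedding they already resemble, which is one simple way to make the queries language-aware.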

The authors evaluate OV-DINO on popular open-vocabulary detection benchmarks. In a zero-shot setting it reaches 50.6% AP on COCO and 40.1% AP on LVIS, and after fine-tuning on COCO it reaches 58.4% AP, which the authors report as outperforming many existing methods with the same backbone.
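
Open-vocabulary evaluation is possible because the label set is supplied as text at inference time rather than fixed during training. The sketch below shows the general idea of scoring a detected region against embedded class names with cosine similarity. The embeddings are made-up numbers standing in for the outputs of a text encoder and a detector, and `classify_region` is a hypothetical helper, not part of OV-DINO's code.

```python
import math
from typing import Dict, List, Tuple

Vector = List[float]


def cosine(a: Vector, b: Vector) -> float:
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)


def classify_region(region: Vector, class_embeds: Dict[str, Vector]) -> Tuple[str, float]:
    """Return the class name whose text embedding is closest to the region."""
    scores = {name: cosine(region, emb) for name, emb in class_embeds.items()}
    best = max(scores, key=scores.get)
    return best, scores[best]


# Made-up embeddings standing in for a text encoder's output.
vocabulary = {"dog": [1.0, 0.0, 0.0], "bicycle": [0.0, 1.0, 0.0]}
region = [0.1, 0.2, 0.9]  # made-up embedding of one detected region

first_label, _ = classify_region(region, vocabulary)
print(first_label)

# Extending the vocabulary needs only a new name and its embedding.
vocabulary["kite"] = [0.0, 0.1, 1.0]
second_label, _ = classify_region(region, vocabulary)
print(second_label)
```

With only "dog" and "bicycle" available, the region is assigned to the nearer of the two. Once "kite" is added to the vocabulary, the same region is relabeled without any retraining, which is the behavior a zero-shot benchmark measures.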

Critical Analysis

The researchers acknowledge several limitations and areas for future work:

The current version of OV-DINO relies on a pre-trained language model, which could limit its performance and adaptability to different domains. Developing a more integrated language module could further improve the model's open-vocabulary abilities.
The evaluation is primarily focused on static image datasets, and the model's performance on more dynamic, real-world scenarios remains to be explored.
While the unified approach is a strength, the trade-offs of combining different data sources in a single detection-centric format could be investigated more deeply.

Additionally, one could question whether the model's open-vocabulary capabilities are truly generalizable or if there are still some biases or limitations in the linguistic knowledge it can leverage. Further research is needed to fully understand the model's robustness and generalization across diverse object categories and real-world environments.

Conclusion

OV-DINO represents an important advancement in open-vocabulary object detection, demonstrating how the strategic fusion of visual and linguistic information can enable more versatile and capable detection models. By bridging the gap between image understanding and language understanding, this research paves the way for object detection systems that can adapt to a wide range of object categories and real-world scenarios. While there are still areas for improvement, OV-DINO's innovations showcase the potential of leveraging both visual and textual cues for more robust and open-ended object recognition.

Related Papers

OV-DINO: Unified Open-Vocabulary Detection with Language-Aware Selective Fusion

Hao Wang, Pengzhen Ren, Zequn Jie, Xiao Dong, Chengjian Feng, Yinlong Qian, Lin Ma, Dongmei Jiang, Yaowei Wang, Xiangyuan Lan, Xiaodan Liang

Open-vocabulary detection is a challenging task due to the requirement of detecting objects based on class names, including those not encountered during training. Existing methods have shown strong zero-shot detection capabilities through pre-training and pseudo-labeling on diverse large-scale datasets. However, these approaches encounter two main challenges: (i) how to effectively eliminate data noise from pseudo-labeling, and (ii) how to efficiently leverage the language-aware capability for region-level cross-modality fusion and alignment. To address these challenges, we propose a novel unified open-vocabulary detection method called OV-DINO, which is pre-trained on diverse large-scale datasets with language-aware selective fusion in a unified framework. Specifically, we introduce a Unified Data Integration (UniDI) pipeline to enable end-to-end training and eliminate noise from pseudo-label generation by unifying different data sources into detection-centric data format. In addition, we propose a Language-Aware Selective Fusion (LASF) module to enhance the cross-modality alignment through a language-aware query selection and fusion process. We evaluate the performance of the proposed OV-DINO on popular open-vocabulary detection benchmarks, achieving state-of-the-art results with an AP of 50.6% on the COCO benchmark and 40.1% on the LVIS benchmark in a zero-shot manner, demonstrating its strong generalization ability. Furthermore, the fine-tuned OV-DINO on COCO achieves 58.4% AP, outperforming many existing methods with the same backbone. The code for OV-DINO is available at https://github.com/wanghao9610/OV-DINO.

7/23/2024

OV-DQUO: Open-Vocabulary DETR with Denoising Text Query Training and Open-World Unknown Objects Supervision

Junjie Wang, Bin Chen, Bin Kang, Yulin Li, YiChi Chen, Weizhi Xian, Huifeng Chang, Yong Xu

Open-vocabulary detection aims to detect objects from novel categories beyond the base categories on which the detector is trained. However, existing open-vocabulary detectors trained on base category data tend to assign higher confidence to trained categories and confuse novel categories with the background. To resolve this, we propose OV-DQUO, an Open-Vocabulary DETR with Denoising text Query training and open-world Unknown Objects supervision. Specifically, we introduce a wildcard matching method. This method enables the detector to learn from pairs of unknown objects recognized by the open-world detector and text embeddings with general semantics, mitigating the confidence bias between base and novel categories. Additionally, we propose a denoising text query training strategy. It synthesizes foreground and background query-box pairs from open-world unknown objects to train the detector through contrastive learning, enhancing its ability to distinguish novel objects from the background. We conducted extensive experiments on the challenging OV-COCO and OV-LVIS benchmarks, achieving new state-of-the-art results of 45.6 AP50 and 39.3 mAP on novel categories respectively, without the need for additional training data. Models and code are released at https://github.com/xiaomoguhz/OV-DQUO.

8/22/2024

On the Potential of Open-Vocabulary Models for Object Detection in Unusual Street Scenes

Sadia Ilyas, Ido Freeman, Matthias Rottmann

Out-of-distribution (OOD) object detection is a critical task focused on detecting objects that originate from a data distribution different from that of the training data. In this study, we investigate to what extent state-of-the-art open-vocabulary object detectors can detect unusual objects in street scenes, which are considered as OOD or rare scenarios with respect to common street scene datasets. Specifically, we evaluate their performance on the OoDIS Benchmark, which extends RoadAnomaly21 and RoadObstacle21 from SegmentMeIfYouCan, as well as LostAndFound, which was recently extended to object level annotations. The objective of our study is to uncover short-comings of contemporary object detectors in challenging real-world, and particularly in open-world scenarios. Our experiments reveal that open vocabulary models are promising for OOD object detection scenarios, however far from perfect. Substantial improvements are required before they can be reliably deployed in real-world applications. We benchmark four state-of-the-art open-vocabulary object detection models on three different datasets. Noteworthily, Grounding DINO achieves the best results on RoadObstacle21 and LostAndFound in our study with an AP of 48.3% and 25.4% respectively. YOLO-World excels on RoadAnomaly21 with an AP of 21.2%.

8/22/2024

Unlocking Textual and Visual Wisdom: Open-Vocabulary 3D Object Detection Enhanced by Comprehensive Guidance from Text and Image

Pengkun Jiao, Na Zhao, Jingjing Chen, Yu-Gang Jiang

Open-vocabulary 3D object detection (OV-3DDet) aims to localize and recognize both seen and previously unseen object categories within any new 3D scene. While language and vision foundation models have achieved success in handling various open-vocabulary tasks with abundant training data, OV-3DDet faces a significant challenge due to the limited availability of training data. Although some pioneering efforts have integrated vision-language models (VLM) knowledge into OV-3DDet learning, the full potential of these foundational models has yet to be fully exploited. In this paper, we unlock the textual and visual wisdom to tackle the open-vocabulary 3D detection task by leveraging the language and vision foundation models. We leverage a vision foundation model to provide image-wise guidance for discovering novel classes in 3D scenes. Specifically, we utilize a object detection vision foundation model to enable the zero-shot discovery of objects in images, which serves as the initial seeds and filtering guidance to identify novel 3D objects. Additionally, to align the 3D space with the powerful vision-language space, we introduce a hierarchical alignment approach, where the 3D feature space is aligned with the vision-language feature space using a pre-trained VLM at the instance, category, and scene levels. Through extensive experimentation, we demonstrate significant improvements in accuracy and generalization, highlighting the potential of foundation models in advancing open-vocabulary 3D object detection in real-world scenarios.

7/18/2024