Grounding DINO 1.5: Advance the Edge of Open-Set Object Detection

Read original: arXiv:2405.10300 - Published 6/4/2024 by Tianhe Ren, Qing Jiang, Shilong Liu, Zhaoyang Zeng, Wenlong Liu, Han Gao, Hongjie Huang, Zhengyu Ma, Xiaoke Jiang, Yihao Chen and 6 others

Introduction

This research paper, titled "Grounding DINO 1.5: Advance the "Edge" of Open-Set Object Detection," presents advances in open-set object detection, a challenging computer vision task that aims to detect objects from categories that were not seen during training. The authors build on the original Grounding DINO, an open-set detector that combines the Transformer-based detector DINO with grounded pre-training.

Model Training

Grounding DINO 1.5

The key aspects of the model training process are as follows:

Grounding DINO 1.5 is a suite of two open-set object detection models developed by IDEA Research: Grounding DINO 1.5 Pro, designed for stronger generalization across a wide range of scenarios, and Grounding DINO 1.5 Edge, designed for the faster speed needed in edge deployment.
The Pro model scales up the architecture of its predecessor, integrates an enhanced vision backbone, and is trained on an expanded dataset of over 20 million images with grounding annotations.
The Edge model uses reduced feature scales for efficiency and is trained on the same dataset.

Plain English Explanation

The researchers have developed a new version of the Grounding DINO object detection model, called Grounding DINO 1.5, that is better at detecting objects from categories it wasn't trained on. This is a challenging task, as most object detection models can only find the fixed set of categories they saw many examples of during training.

The main changes in Grounding DINO 1.5 Pro are a larger model, an enhanced vision backbone, and a much larger training set of over 20 million images with grounding annotations, which together help the model recognize a wider variety of objects. A second model, Grounding DINO 1.5 Edge, is optimized for speed so that it can run in edge computing scenarios.

With these changes, Grounding DINO 1.5 performs better on "open-set" object detection, which means it can detect objects that weren't part of its original training categories. This is an important capability, as it makes the model more widely applicable in real-world scenarios where the objects encountered may not be the same as those seen during training.

Technical Explanation

Grounding DINO 1.5 builds on Grounding DINO, an open-set detector that pairs the Transformer-based detector DINO with grounded pre-training so that it can detect arbitrary objects specified by category names or referring expressions. The Pro model advances its predecessor by scaling up the model architecture, integrating an enhanced vision backbone, and expanding the training dataset to over 20 million images with grounding annotations. The Edge model is designed for efficiency with reduced feature scales and is trained on the same dataset.

The authors report that Grounding DINO 1.5 Pro attains 54.3 AP on the COCO detection benchmark and 55.7 AP on the LVIS-minival zero-shot transfer benchmark. Grounding DINO 1.5 Edge, when optimized with TensorRT, reaches 75.2 FPS with a zero-shot performance of 36.2 AP on LVIS-minival.
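As a rough illustration of the open-set idea discussed in this section, the sketch below labels candidate regions by comparing their feature vectors with embeddings of free-text category prompts, so the label set is chosen at inference time rather than fixed during training. It is a toy example and not the Grounding DINO 1.5 architecture: the function names, the cosine-similarity scoring, the 0.5 threshold, and the hand-written feature vectors are all assumptions made for illustration.

```python
import numpy as np


def l2_normalize(x, eps=1e-8):
    """Scale each row of x to (approximately) unit length."""
    x = np.asarray(x, dtype=float)
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


def open_set_match(region_feats, text_feats, labels, threshold=0.5):
    """Assign each region the best-matching text prompt.

    region_feats: (R, D) array, one feature vector per candidate region.
    text_feats:   (T, D) array, one embedding per text prompt.
    labels:       list of T prompt strings.
    Returns a list of (label or None, score); the label is None when the
    best cosine similarity is below the threshold (hypothetical default).
    """
    sims = l2_normalize(region_feats) @ l2_normalize(text_feats).T  # (R, T)
    best = sims.argmax(axis=1)
    scores = sims[np.arange(sims.shape[0]), best]
    return [
        (labels[j] if s >= threshold else None, float(s))
        for j, s in zip(best, scores)
    ]


if __name__ == "__main__":
    # Hand-written 2-D features standing in for real model outputs.
    regions = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
    prompts = [[1.0, 0.0], [0.0, 1.0]]
    for label, score in open_set_match(regions, prompts, ["cat", "dog"]):
        print(label, round(score, 3))
```

Because the prompts are ordinary inputs, swapping in a different list of category names changes what the detector looks for without retraining, which is the property that separates open-set from closed-set detection.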

Critical Analysis

The paper presents a promising approach to advancing the state of the art in open-set object detection. By scaling up the model architecture and its training data, the authors have improved Grounding DINO's ability to detect objects that were not part of its original training categories.

However, the paper does not provide a detailed analysis of the model's limitations or potential issues. It would be helpful to understand the specific challenges encountered in training Grounding DINO 1.5 and any trade-offs made to achieve the reported performance improvements.

Additionally, the authors could have delved deeper into the potential real-world implications and applications of Grounding DINO 1.5, as well as its performance compared to other state-of-the-art open-set object detection approaches.

Conclusion

In this paper, the researchers have developed a new version of the Grounding DINO object detection model, called Grounding DINO 1.5, that is better at detecting objects that were not part of its original training categories. By scaling up the model and its training data, and by adding an efficient variant for edge deployment, the authors have advanced the "edge" of open-set object detection, making the model more widely applicable in real-world scenarios.

While the paper presents a promising approach, further research is needed to fully understand the model's limitations and potential areas for improvement. Nonetheless, the advances made in Grounding DINO 1.5 represent an important step forward in computer vision and object detection.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Grounding DINO 1.5: Advance the Edge of Open-Set Object Detection

Tianhe Ren, Qing Jiang, Shilong Liu, Zhaoyang Zeng, Wenlong Liu, Han Gao, Hongjie Huang, Zhengyu Ma, Xiaoke Jiang, Yihao Chen, Yuda Xiong, Hao Zhang, Feng Li, Peijun Tang, Kent Yu, Lei Zhang

This paper introduces Grounding DINO 1.5, a suite of advanced open-set object detection models developed by IDEA Research, which aims to advance the Edge of open-set object detection. The suite encompasses two models: Grounding DINO 1.5 Pro, a high-performance model designed for stronger generalization capability across a wide range of scenarios, and Grounding DINO 1.5 Edge, an efficient model optimized for faster speed demanded in many applications requiring edge deployment. The Grounding DINO 1.5 Pro model advances its predecessor by scaling up the model architecture, integrating an enhanced vision backbone, and expanding the training dataset to over 20 million images with grounding annotations, thereby achieving a richer semantic understanding. The Grounding DINO 1.5 Edge model, while designed for efficiency with reduced feature scales, maintains robust detection capabilities by being trained on the same comprehensive dataset. Empirical results demonstrate the effectiveness of Grounding DINO 1.5, with the Grounding DINO 1.5 Pro model attaining a 54.3 AP on the COCO detection benchmark and a 55.7 AP on the LVIS-minival zero-shot transfer benchmark, setting new records for open-set object detection. Furthermore, the Grounding DINO 1.5 Edge model, when optimized with TensorRT, achieves a speed of 75.2 FPS while attaining a zero-shot performance of 36.2 AP on the LVIS-minival benchmark, making it more suitable for edge computing scenarios. Model examples and demos with API will be released at https://github.com/IDEA-Research/Grounding-DINO-1.5-API

6/4/2024

Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection

Shilong Liu, Zhaoyang Zeng, Tianhe Ren, Feng Li, Hao Zhang, Jie Yang, Qing Jiang, Chunyuan Li, Jianwei Yang, Hang Su, Jun Zhu, Lei Zhang

In this paper, we present an open-set object detector, called Grounding DINO, by marrying Transformer-based detector DINO with grounded pre-training, which can detect arbitrary objects with human inputs such as category names or referring expressions. The key solution of open-set object detection is introducing language to a closed-set detector for open-set concept generalization. To effectively fuse language and vision modalities, we conceptually divide a closed-set detector into three phases and propose a tight fusion solution, which includes a feature enhancer, a language-guided query selection, and a cross-modality decoder for cross-modality fusion. While previous works mainly evaluate open-set object detection on novel categories, we propose to also perform evaluations on referring expression comprehension for objects specified with attributes. Grounding DINO performs remarkably well on all three settings, including benchmarks on COCO, LVIS, ODinW, and RefCOCO/+/g. Grounding DINO achieves a 52.5 AP on the COCO detection zero-shot transfer benchmark, i.e., without any training data from COCO. It sets a new record on the ODinW zero-shot benchmark with a mean 26.1 AP. Code will be available at https://github.com/IDEA-Research/GroundingDINO.

7/22/2024

OpenOOD v1.5: Enhanced Benchmark for Out-of-Distribution Detection

Jingyang Zhang, Jingkang Yang, Pengyun Wang, Haoqi Wang, Yueqian Lin, Haoran Zhang, Yiyou Sun, Xuefeng Du, Yixuan Li, Ziwei Liu, Yiran Chen, Hai Li

Out-of-Distribution (OOD) detection is critical for the reliable operation of open-world intelligent systems. Despite the emergence of an increasing number of OOD detection methods, the evaluation inconsistencies present challenges for tracking the progress in this field. OpenOOD v1 initiated the unification of the OOD detection evaluation but faced limitations in scalability and usability. In response, this paper presents OpenOOD v1.5, a significant improvement from its predecessor that ensures accurate, standardized, and user-friendly evaluation of OOD detection methodologies. Notably, OpenOOD v1.5 extends its evaluation capabilities to large-scale datasets such as ImageNet, investigates full-spectrum OOD detection which is important yet underexplored, and introduces new features including an online leaderboard and an easy-to-use evaluator. This work also contributes in-depth analysis and insights derived from comprehensive experimental results, thereby enriching the knowledge pool of OOD detection methodologies. With these enhancements, OpenOOD v1.5 aims to drive advancements and offer a more robust and comprehensive evaluation benchmark for OOD detection research.

9/25/2024

OV-DINO: Unified Open-Vocabulary Detection with Language-Aware Selective Fusion

Hao Wang, Pengzhen Ren, Zequn Jie, Xiao Dong, Chengjian Feng, Yinlong Qian, Lin Ma, Dongmei Jiang, Yaowei Wang, Xiangyuan Lan, Xiaodan Liang

Open-vocabulary detection is a challenging task due to the requirement of detecting objects based on class names, including those not encountered during training. Existing methods have shown strong zero-shot detection capabilities through pre-training and pseudo-labeling on diverse large-scale datasets. However, these approaches encounter two main challenges: (i) how to effectively eliminate data noise from pseudo-labeling, and (ii) how to efficiently leverage the language-aware capability for region-level cross-modality fusion and alignment. To address these challenges, we propose a novel unified open-vocabulary detection method called OV-DINO, which is pre-trained on diverse large-scale datasets with language-aware selective fusion in a unified framework. Specifically, we introduce a Unified Data Integration (UniDI) pipeline to enable end-to-end training and eliminate noise from pseudo-label generation by unifying different data sources into detection-centric data format. In addition, we propose a Language-Aware Selective Fusion (LASF) module to enhance the cross-modality alignment through a language-aware query selection and fusion process. We evaluate the performance of the proposed OV-DINO on popular open-vocabulary detection benchmarks, achieving state-of-the-art results with an AP of 50.6% on the COCO benchmark and 40.1% on the LVIS benchmark in a zero-shot manner, demonstrating its strong generalization ability. Furthermore, the fine-tuned OV-DINO on COCO achieves 58.4% AP, outperforming many existing methods with the same backbone. The code for OV-DINO is available at https://github.com/wanghao9610/OV-DINO.

7/23/2024