A Survey on Open-Vocabulary Detection and Segmentation: Past, Present, and Future

Read original: arXiv:2307.09220 - Published 4/16/2024 by Chaoyang Zhu, Long Chen

Overview

This paper provides a comprehensive review of recent developments in the field of Open-Vocabulary Detection (OVD) and Open-Vocabulary Segmentation (OVS).
OVD and OVS aim to build models that can classify objects beyond pre-defined categories, overcoming the limitations of traditional fully-supervised approaches that are constrained by the categories in existing datasets.
The paper develops a taxonomy to organize different tasks and methodologies in OVD and OVS, highlighting the role of weak supervision signals and how they can be utilized.
The review covers a wide range of applications, including object detection, semantic/instance/panoptic segmentation, and 3D and video understanding.

Plain English Explanation

Object detection and segmentation are fundamental tasks in computer vision that have seen significant progress with the rise of deep learning. However, the existing datasets used to train these models often have a limited number of pre-defined object categories, which restricts the models' ability to recognize objects beyond those categories.

To address this limitation, researchers have focused on developing Open-Vocabulary Detection (OVD) and Open-Vocabulary Segmentation (OVS) models. These models can classify objects that are not part of the pre-defined categories, expanding the range of objects they can recognize.

The paper provides a comprehensive review of recent advances in OVD and OVS. It develops a taxonomy to organize the different tasks and methodologies in this field, highlighting how weak supervision signals, such as textual descriptions paired with images, can help train these models without expensive manual labeling of every category.

The review covers a wide range of applications, from 3D open-vocabulary panoptic segmentation to open-world video instance segmentation. It also provides a detailed analysis of the key design principles, challenges, and development trends in this area, as well as the strengths and weaknesses of different methodologies.

Overall, this work aims to help researchers and practitioners better understand the current state of OVD and OVS, and to inspire future research directions in this exciting field of computer vision.

Technical Explanation

The paper begins by highlighting the importance of object detection and segmentation as fundamental tasks in scene understanding, and the limitations of traditional fully-supervised approaches that are constrained by the pre-defined categories in existing datasets.

To address this, the authors introduce the concepts of Open-Vocabulary Detection (OVD) and Open-Vocabulary Segmentation (OVS), where the goal is to build models that can classify objects beyond the pre-defined categories.
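To make "beyond the pre-defined categories" concrete, here is a minimal sketch of one common mechanism: a region's visual embedding is compared against text embeddings of category names, so the vocabulary can be extended at inference time by adding a new name. The 3-d vectors and the names `classify_region`, `base_vocab`, and `open_vocab` are hypothetical and hand-written for illustration; a real system would obtain embeddings from a pretrained vision-language model, and this is not claimed to be the exact procedure of any method in the survey.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def classify_region(region_embedding, text_embeddings):
    """Return the category whose text embedding is most similar to the region."""
    scores = {name: cosine(region_embedding, emb)
              for name, emb in text_embeddings.items()}
    best = max(scores, key=scores.get)
    return best, scores

# Toy embeddings (assumed values, not produced by any real model).
base_vocab = {"cat": [1.0, 0.0, 0.0], "dog": [0.0, 1.0, 0.0]}
# Extending the vocabulary only requires a text embedding for the new name.
open_vocab = dict(base_vocab, zebra=[0.0, 0.0, 1.0])

region = [0.1, 0.2, 0.9]  # stand-in for the visual feature of one region

closed_label, _ = classify_region(region, base_vocab)
open_label, _ = classify_region(region, open_vocab)
print(closed_label, open_label)
```

With the closed vocabulary the region is forced into the nearest known category; once "zebra" is added, the same region is assigned to the new category without retraining the classifier.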

A key contribution of the paper is a taxonomy that organizes the different tasks and methodologies in OVD and OVS. The authors find that whether weak supervision signals, such as textual descriptions paired with images, are permitted, and how they are used, is a key factor that distinguishes the methodologies from one another.

The reviewed methodologies include:

Visual-semantic space mapping: Linking visual features to textual descriptions to enable classification of novel objects.
Novel visual feature synthesis: Generating visual features for novel objects based on textual descriptions.
Region-aware training: Aligning image regions with words or phrases from paired text during training to learn object representations.
Pseudo-labeling, knowledge distillation, and transfer learning: Generating labels for unannotated objects, distilling knowledge from a pretrained model, or reusing pretrained representations to improve performance on novel objects.
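As an illustration of the pseudo-labeling idea in the list above, the following sketch keeps the top-scoring category for each unlabeled region only when its score passes a confidence threshold. The function name `pseudo_label`, the threshold of 0.5, and the score values are all assumptions made for this example; the methods reviewed in the survey differ in how they score and filter regions.

```python
def pseudo_label(region_scores, threshold=0.5):
    """Assign a pseudo-label to each region whose best score is confident enough.

    region_scores: list of dicts mapping category name -> similarity score.
    Returns a list of (region_index, category, score) tuples.
    """
    labels = []
    for idx, scores in enumerate(region_scores):
        category = max(scores, key=scores.get)
        if scores[category] >= threshold:
            labels.append((idx, category, scores[category]))
    return labels

# Hypothetical similarity scores for three unlabeled region proposals.
scores = [
    {"zebra": 0.82, "giraffe": 0.10},
    {"zebra": 0.30, "giraffe": 0.35},  # ambiguous, so it is discarded
    {"zebra": 0.05, "giraffe": 0.71},
]
pseudo = pseudo_label(scores, threshold=0.5)
print(pseudo)
```

The retained tuples could then serve as extra training targets for novel categories; the threshold trades label coverage against label noise.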

The paper provides a detailed analysis of the main design principles, key challenges, development trends, and the strengths and weaknesses of each methodology. It also includes benchmarking results for the different tasks and components of the reviewed methods.

Critical Analysis

The paper provides a comprehensive and insightful review of the current state of OVD and OVS research. The proposed taxonomy is a valuable contribution, as it helps organize the diverse set of methodologies in this field and highlights the importance of weak supervision signals.

One potential limitation is that the review focuses primarily on recent developments, and does not explore the historical context and evolution of research in this area. Additionally, while the paper covers a wide range of applications, some specific tasks, such as fine-grained open-vocabulary segmentation, could have been discussed in greater depth.

Furthermore, the paper does not delve into the potential societal implications or ethical considerations of open-vocabulary object recognition, such as the risks of misclassification or bias in real-world applications.

Nonetheless, the paper is a valuable resource for researchers and practitioners working in the field of computer vision, as it provides a comprehensive overview of the current state of the art and identifies promising directions for future research.

Conclusion

This paper presents a comprehensive review of recent developments in Open-Vocabulary Detection (OVD) and Open-Vocabulary Segmentation (OVS), two important areas of computer vision research. By developing a taxonomy to organize different tasks and methodologies, the authors highlight the key role of weak supervision signals in training models that can recognize objects beyond pre-defined categories.

The review covers a wide range of applications, from 3D open-vocabulary panoptic segmentation to open-world video instance segmentation, and provides a detailed analysis of the design principles, challenges, and trends in this field. The benchmarking results and identified areas for future research make this paper a valuable resource for the computer vision community.

Overall, the work represents a significant contribution to the understanding and advancement of open-vocabulary object recognition, with the potential to pave the way for more flexible and versatile computer vision systems in the future.

Related Papers

A Survey on Open-Vocabulary Detection and Segmentation: Past, Present, and Future

Chaoyang Zhu, Long Chen

As the most fundamental scene understanding tasks, object detection and segmentation have made tremendous progress in deep learning era. Due to the expensive manual labeling cost, the annotated categories in existing datasets are often small-scale and pre-defined, i.e., state-of-the-art fully-supervised detectors and segmentors fail to generalize beyond the closed vocabulary. To resolve this limitation, in the last few years, the community has witnessed an increasing attention toward Open-Vocabulary Detection (OVD) and Segmentation (OVS). By "open-vocabulary", we mean that the models can classify objects beyond pre-defined categories. In this survey, we provide a comprehensive review on recent developments of OVD and OVS. A taxonomy is first developed to organize different tasks and methodologies. We find that the permission and usage of weak supervision signals can well discriminate different methodologies, including: visual-semantic space mapping, novel visual feature synthesis, region-aware training, pseudo-labeling, knowledge distillation, and transfer learning. The proposed taxonomy is universal across different tasks, covering object detection, semantic/instance/panoptic segmentation, 3D and video understanding. The main design principles, key challenges, development routes, methodology strengths, and weaknesses are thoroughly analyzed. In addition, we benchmark each task along with the vital components of each method in appendix and updated online at https://github.com/seanzhuh/awesome-open-vocabulary-detection-and-segmentation. Finally, several promising directions are provided and discussed to stimulate future research.

4/16/2024

Unlocking Textual and Visual Wisdom: Open-Vocabulary 3D Object Detection Enhanced by Comprehensive Guidance from Text and Image

Pengkun Jiao, Na Zhao, Jingjing Chen, Yu-Gang Jiang

Open-vocabulary 3D object detection (OV-3DDet) aims to localize and recognize both seen and previously unseen object categories within any new 3D scene. While language and vision foundation models have achieved success in handling various open-vocabulary tasks with abundant training data, OV-3DDet faces a significant challenge due to the limited availability of training data. Although some pioneering efforts have integrated vision-language models (VLM) knowledge into OV-3DDet learning, the full potential of these foundational models has yet to be fully exploited. In this paper, we unlock the textual and visual wisdom to tackle the open-vocabulary 3D detection task by leveraging the language and vision foundation models. We leverage a vision foundation model to provide image-wise guidance for discovering novel classes in 3D scenes. Specifically, we utilize an object detection vision foundation model to enable the zero-shot discovery of objects in images, which serves as the initial seeds and filtering guidance to identify novel 3D objects. Additionally, to align the 3D space with the powerful vision-language space, we introduce a hierarchical alignment approach, where the 3D feature space is aligned with the vision-language feature space using a pre-trained VLM at the instance, category, and scene levels. Through extensive experimentation, we demonstrate significant improvements in accuracy and generalization, highlighting the potential of foundation models in advancing open-vocabulary 3D object detection in real-world scenarios.

7/18/2024

Open-Vocabulary Camouflaged Object Segmentation

Youwei Pang, Xiaoqi Zhao, Jiaming Zuo, Lihe Zhang, Huchuan Lu

Recently, the emergence of the large-scale vision-language model (VLM), such as CLIP, has opened the way towards open-world object perception. Many works have explored the utilization of pre-trained VLM for the challenging open-vocabulary dense prediction task that requires perceiving diverse objects with novel classes at inference time. Existing methods construct experiments based on the public datasets of related tasks, which are not tailored for open vocabulary and rarely involve imperceptible objects camouflaged in complex scenes due to data collection bias and annotation costs. To fill in the gaps, we introduce a new task, open-vocabulary camouflaged object segmentation (OVCOS), and construct a large-scale complex scene dataset (OVCamo) containing 11,483 hand-selected images with fine annotations and corresponding object classes. Further, we build a strong single-stage open-vocabulary camouflaged object segmentation transformer baseline OVCoser attached to the parameter-fixed CLIP with iterative semantic guidance and structure enhancement. By integrating the guidance of class semantic knowledge and the supplement of visual structure cues from the edge and depth information, the proposed method can efficiently capture camouflaged objects. Moreover, this effective framework also surpasses previous state-of-the-arts of open-vocabulary semantic image segmentation by a large margin on our OVCamo dataset. With the proposed dataset and baseline, we hope that this new task with more practical value can further expand the research on open-vocabulary dense prediction tasks. Our code and data can be found at https://github.com/lartpang/OVCamo.

7/8/2024

Open-Vocabulary Audio-Visual Semantic Segmentation

Ruohao Guo, Liao Qu, Dantong Niu, Yanyu Qi, Wenzhen Yue, Ji Shi, Bowei Xing, Xianghua Ying

Audio-visual semantic segmentation (AVSS) aims to segment and classify sounding objects in videos with acoustic cues. However, most approaches operate on the close-set assumption and only identify pre-defined categories from training data, lacking the generalization ability to detect novel categories in practical applications. In this paper, we introduce a new task: open-vocabulary audio-visual semantic segmentation, extending AVSS task to open-world scenarios beyond the annotated label space. This is a more challenging task that requires recognizing all categories, even those that have never been seen nor heard during training. Moreover, we propose the first open-vocabulary AVSS framework, OV-AVSS, which mainly consists of two parts: 1) a universal sound source localization module to perform audio-visual fusion and locate all potential sounding objects and 2) an open-vocabulary classification module to predict categories with the help of the prior knowledge from large-scale pre-trained vision-language models. To properly evaluate the open-vocabulary AVSS, we split zero-shot training and testing subsets based on the AVSBench-semantic benchmark, namely AVSBench-OV. Extensive experiments demonstrate the strong segmentation and zero-shot generalization ability of our model on all categories. On the AVSBench-OV dataset, OV-AVSS achieves 55.43% mIoU on base categories and 29.14% mIoU on novel categories, exceeding the state-of-the-art zero-shot method by 41.88%/20.61% and open-vocabulary method by 10.2%/11.6%. The code is available at https://github.com/ruohaoguo/ovavss.

8/1/2024