Taming Self-Training for Open-Vocabulary Object Detection

Read original: arXiv:2308.06412 - Published 4/16/2024 by Shiyu Zhao, Samuel Schulter, Long Zhao, Zhixing Zhang, Vijay Kumar B. G, Yumin Suh, Manmohan Chandraker, Dimitris N. Metaxas


Overview

Recent studies show promising performance in open-vocabulary object detection (OVD) using pseudo labels (PLs) from pre-trained vision and language models (VLMs).
However, teacher-student self-training, a powerful technique for leveraging PLs, is rarely explored for OVD.
This work identifies two challenges in using self-training for OVD: noisy PLs from VLMs and frequent changes in the distribution of PLs.
To address these challenges, the researchers propose SAS-Det, a method that tames self-training for OVD.

Plain English Explanation

Object detection is the task of identifying and localizing objects in an image. Open-vocabulary object detection aims to detect a wide range of objects, even ones not seen during training. Recent work has shown that pseudo labels (PLs), i.e. predictions from pre-trained vision and language models used as training targets, can boost the performance of open-vocabulary object detectors.

One powerful technique for leveraging these PLs is teacher-student self-training. In this setup, a teacher model produces pseudo labels that a student model learns from, and the teacher is itself updated as training progresses. However, self-training has rarely been used for open-vocabulary object detection. The researchers identified two key challenges:

The PLs from vision and language models can be noisy and inaccurate, which can degrade the model's performance if used directly.
In open-vocabulary detection, the distribution of PLs is determined entirely by the teacher model, so it shifts every time the teacher is updated. Frequent updates therefore make the self-training process unstable.

To address these issues, the researchers developed a new method called SAS-Det. The key ideas are:

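A "split-and-fusion" head, which splits the standard detection head into an "open" branch for novel objects and a "closed" branch for familiar objects. This reduces the impact of noisy PLs, and the two branches learn complementary knowledge that improves results when their outputs are fused.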
A strategy to periodically update the teacher model, which reduces the frequency of changes in the PL distribution and stabilizes the training process.

Through extensive experiments, the researchers showed that SAS-Det outperforms other open-vocabulary detection models and achieves strong performance on challenging benchmarks like COCO and LVIS.

Technical Explanation

The researchers propose SAS-Det, a method that tames self-training for open-vocabulary object detection (OVD). SAS-Det addresses two key challenges in using self-training for OVD:

Noisy Pseudo Labels (PLs): PLs from pre-trained vision and language models (VLMs) can be noisy and inaccurate, which can degrade the model's performance if used directly.
Frequent Distribution Changes of PLs: Unlike in closed-set tasks, the PL distributions in OVD are solely determined by the teacher model. Every update to the teacher therefore changes the PL distribution, and frequent updates make the self-training process unstable.

To address these challenges, SAS-Det introduces two key components:

Split-and-Fusion (SAF) Head: The detection head is split into an "open" branch and a "closed" branch. The open branch focuses on learning from noisy PLs for novel objects, while the closed branch learns from clean ground-truth labels for familiar objects. The outputs of the two branches are then fused to leverage complementary knowledge.
Periodic Update Strategy: Instead of continuously updating the teacher model, SAS-Det updates the teacher model periodically. This reduces the frequency of changes in the PL distribution, stabilizing the self-training process.
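The two components can be illustrated with a minimal plain-Python sketch. It is not the authors' implementation: the weighted geometric mean in `fuse_scores`, the `alpha` weight, the `PeriodicTeacher` class, and the idea that an update copies the student's weights into the teacher are all assumptions made for illustration, since the summary states only that branch outputs are fused and that the teacher is updated periodically.

```python
from dataclasses import dataclass
from typing import Dict


def fuse_scores(closed: float, open_: float, alpha: float = 0.5) -> float:
    """Fuse per-class scores from the two branches.

    Assumed rule: weighted geometric mean, where alpha is the weight
    given to the open branch (alpha=0 returns the closed-branch score).
    """
    return (closed ** (1.0 - alpha)) * (open_ ** alpha)


@dataclass
class PeriodicTeacher:
    """Teacher that is overwritten by the student every `period` steps."""

    weights: Dict[str, float]
    period: int            # hypothetical update interval, in iterations
    num_updates: int = 0

    def maybe_update(self, step: int, student_weights: Dict[str, float]) -> bool:
        # Between updates the teacher is frozen, so the pseudo labels it
        # produces come from one fixed distribution.
        if step > 0 and step % self.period == 0:
            self.weights = dict(student_weights)  # copy, do not alias
            self.num_updates += 1
            return True
        return False


if __name__ == "__main__":
    student = {"w": 0.0}
    teacher = PeriodicTeacher(weights=dict(student), period=100)
    for step in range(1, 301):
        student["w"] += 0.01  # stand-in for one gradient step
        teacher.maybe_update(step, student)
    print("teacher updates in 300 steps:", teacher.num_updates)
    print("fused score:", fuse_scores(0.9, 0.4))
```

With a per-iteration update rule the teacher would change 300 times over the same loop; with `period=100` it changes only three times, which is the reduction in PL distribution changes that the periodic strategy aims for.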

The researchers evaluate SAS-Det on the COCO and LVIS benchmarks, and show that it outperforms recent open-vocabulary object detection and self-training methods. SAS-Det achieves 37.4 AP50 and 29.1 APr on the novel categories of COCO and LVIS, respectively.
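For readers unfamiliar with the metrics: AP50 is average precision with a prediction counted as correct when its box overlaps a ground-truth box of the same class with intersection-over-union (IoU) of at least 0.5, and APr is average precision on the rare categories of LVIS. The sketch below shows only the box-overlap test, not the full metric, and assumes boxes in `(x1, y1, x2, y2)` corner format; the function names are illustrative.

```python
from typing import Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), assumed corner format


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def is_match_at_50(pred: Box, gt: Box) -> bool:
    """True if the predicted box passes the IoU >= 0.5 overlap test."""
    return iou(pred, gt) >= 0.5
```

A full AP50 computation additionally requires matching class labels, ranks predictions by confidence, and averages precision over recall levels.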

Critical Analysis

The researchers have identified and addressed two important challenges in using self-training for open-vocabulary object detection, which is a promising but underexplored area. The proposed SAS-Det method shows strong performance on challenging benchmarks, demonstrating the effectiveness of the split-and-fusion head and periodic update strategy.

However, the paper does not provide a deep analysis of the trade-offs and limitations of the approach. For example, it's unclear how the split-and-fusion head impacts the model complexity and inference time. Additionally, the periodic update strategy may not be optimal for all scenarios, and the researchers could explore more adaptive approaches to updating the teacher model.

Further research could also investigate the generalization of SAS-Det to other open-vocabulary tasks, such as segmentation, and explore ways to make the method more robust to noisy and dynamic pseudo labels.

Conclusion

This work presents SAS-Det, a method that tames self-training for open-vocabulary object detection. By addressing the challenges of noisy pseudo labels and frequent distribution changes, SAS-Det outperforms recent models of the same scale, reaching 37.4 AP50 on the novel categories of COCO and 29.1 APr on LVIS. The split-and-fusion detection head and periodic update strategy are its two key contributions, and they could inform future research in open-vocabulary learning and self-training.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!


Related Papers


Taming Self-Training for Open-Vocabulary Object Detection

Shiyu Zhao, Samuel Schulter, Long Zhao, Zhixing Zhang, Vijay Kumar B. G, Yumin Suh, Manmohan Chandraker, Dimitris N. Metaxas

Recent studies have shown promising performance in open-vocabulary object detection (OVD) by utilizing pseudo labels (PLs) from pretrained vision and language models (VLMs). However, teacher-student self-training, a powerful and widely used paradigm to leverage PLs, is rarely explored for OVD. This work identifies two challenges of using self-training in OVD: noisy PLs from VLMs and frequent distribution changes of PLs. To address these challenges, we propose SAS-Det that tames self-training for OVD from two key perspectives. First, we present a split-and-fusion (SAF) head that splits a standard detection into an open-branch and a closed-branch. This design can reduce noisy supervision from pseudo boxes. Moreover, the two branches learn complementary knowledge from different training data, significantly enhancing performance when fused together. Second, in our view, unlike in closed-set tasks, the PL distributions in OVD are solely determined by the teacher model. We introduce a periodic update strategy to decrease the number of updates to the teacher, thereby decreasing the frequency of changes in PL distributions, which stabilizes the training process. Extensive experiments demonstrate SAS-Det is both efficient and effective. SAS-Det outperforms recent models of the same scale by a clear margin and achieves 37.4 AP50 and 29.1 APr on novel categories of the COCO and LVIS benchmarks, respectively. Code is available at https://github.com/xiaofeng94/SAS-Det.

4/16/2024

Scaling Open-Vocabulary Object Detection

Matthias Minderer, Alexey Gritsenko, Neil Houlsby

Open-vocabulary object detection has benefited greatly from pretrained vision-language models, but is still limited by the amount of available detection training data. While detection training data can be expanded by using Web image-text pairs as weak supervision, this has not been done at scales comparable to image-level pretraining. Here, we scale up detection data with self-training, which uses an existing detector to generate pseudo-box annotations on image-text pairs. Major challenges in scaling self-training are the choice of label space, pseudo-annotation filtering, and training efficiency. We present the OWLv2 model and OWL-ST self-training recipe, which address these challenges. OWLv2 surpasses the performance of previous state-of-the-art open-vocabulary detectors already at comparable training scales (~10M examples). However, with OWL-ST, we can scale to over 1B examples, yielding further large improvement: With an L/14 architecture, OWL-ST improves AP on LVIS rare classes, for which the model has seen no human box annotations, from 31.2% to 44.6% (43% relative improvement). OWL-ST unlocks Web-scale training for open-world localization, similar to what has been seen for image classification and language modelling.

5/24/2024

MarvelOVD: Marrying Object Recognition and Vision-Language Models for Robust Open-Vocabulary Object Detection

Kuo Wang, Lechao Cheng, Weikai Chen, Pingping Zhang, Liang Lin, Fan Zhou, Guanbin Li

Learning from pseudo-labels that generated with VLMs (Vision Language Models) has been shown as a promising solution to assist open vocabulary detection (OVD) in recent studies. However, due to the domain gap between VLM and vision-detection tasks, pseudo-labels produced by the VLMs are prone to be noisy, while the training design of the detector further amplifies the bias. In this work, we investigate the root cause of VLMs' biased prediction under the OVD context. Our observations lead to a simple yet effective paradigm, coded MarvelOVD, that generates significantly better training targets and optimizes the learning procedure in an online manner by marrying the capability of the detector with the vision-language model. Our key insight is that the detector itself can act as a strong auxiliary guidance to accommodate VLM's inability of understanding both the "background" and the context of a proposal within the image. Based on it, we greatly purify the noisy pseudo-labels via Online Mining and propose Adaptive Reweighting to effectively suppress the biased training boxes that are not well aligned with the target object. In addition, we also identify a neglected "base-novel-conflict" problem and introduce stratified label assignments to prevent it. Extensive experiments on COCO and LVIS datasets demonstrate that our method outperforms the other state-of-the-arts by significant margins. Codes are available at https://github.com/wkfdb/MarvelOVD

8/1/2024


Toward Open Vocabulary Aerial Object Detection with CLIP-Activated Student-Teacher Learning

Yan Li, Weiwei Guo, Xue Yang, Ning Liao, Dunyun He, Jiaqi Zhou, Wenxian Yu

An increasingly massive number of remote-sensing images spurs the development of extensible object detectors that can detect objects beyond training categories without costly collecting new labeled data. In this paper, we aim to develop open-vocabulary object detection (OVD) technique in aerial images that scales up object vocabulary size beyond training data. The performance of OVD greatly relies on the quality of class-agnostic region proposals and pseudo-labels for novel object categories. To simultaneously generate high-quality proposals and pseudo-labels, we propose CastDet, a CLIP-activated student-teacher open-vocabulary object Detection framework. Our end-to-end framework following the student-teacher self-learning mechanism employs the RemoteCLIP model as an extra omniscient teacher with rich knowledge. By doing so, our approach boosts not only novel object proposals but also classification. Furthermore, we devise a dynamic label queue strategy to maintain high-quality pseudo labels during batch training. We conduct extensive experiments on multiple existing aerial object detection datasets, which are set up for the OVD task. Experimental results demonstrate our CastDet achieving superior open-vocabulary detection performance, e.g., reaching 46.5% mAP on VisDroneZSD novel categories, which outperforms the state-of-the-art open-vocabulary detectors by 21.0% mAP. To our best knowledge, this is the first work to apply and develop the open-vocabulary object detection technique for aerial images. The code is available at https://github.com/lizzy8587/CastDet.

8/13/2024