Aligned Unsupervised Pretraining of Object Detectors with Self-training

Read original: arXiv:2307.15697 - Published 7/9/2024 by Ioannis Maniadis Metaxas, Adrian Bulat, Ioannis Patras, Brais Martinez, Georgios Tzimiropoulos


Overview

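This paper proposes a framework for unsupervised pretraining of object detectors that addresses two limitations of existing methods: object proposals defined from low-level visual cues, and a task gap between pretraining and downstream detection.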
The key components of the framework include: (i) using richer initial proposals that encode high-level semantics, (ii) class pseudo-labeling through clustering, and (iii) self-training to iteratively improve the object proposals.
The authors show that this approach can achieve state-of-the-art performance on object detection tasks, including in low-data regimes, across different detector architectures and datasets.
The framework also enables unsupervised representation learning using object detection as a pretext task, paving the way for more effective and scalable self-supervised learning.

Plain English Explanation

Object detectors are AI models that can identify and locate objects in images. Pretraining these models in an unsupervised way (without using labeled data) has become an important step, as it can improve their performance and speed up the final training process.

Existing unsupervised pretraining methods typically rely on low-level visual information to define the object proposals (potential object locations) used to train the detector. They then add high-level semantic information through an auxiliary loss function. This results in a complex training pipeline and a disconnect between the pretraining and the final object detection task.

The framework proposed in this paper addresses these limitations. It starts with richer initial object proposals that already encode high-level semantics. It then uses a technique called "class pseudo-labeling" to assign provisional class labels to these proposals based on clustering. This allows the pretraining to use a standard object detection training pipeline, aligning it more closely with the final task.
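To make the pseudo-labeling idea concrete, here is a minimal sketch in plain Python. It assumes each proposal has already been turned into a feature vector, and it uses k-means with a deterministic farthest-point initialisation purely as an illustrative clustering choice; this summary does not state which clustering algorithm or features the authors use. The function names and the toy feature values are hypothetical.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def farthest_point_init(features, num_clusters):
    """Pick initial centroids: the first feature, then the farthest remaining ones."""
    centroids = [list(features[0])]
    while len(centroids) < num_clusters:
        farthest = max(
            features,
            key=lambda f: min(squared_distance(f, c) for c in centroids),
        )
        centroids.append(list(farthest))
    return centroids


def kmeans_pseudo_labels(features, num_clusters, num_iters=20):
    """Cluster proposal features and return one pseudo-class id per proposal.

    Assumption: k-means stands in for whatever clustering the paper uses.
    """
    centroids = farthest_point_init(features, num_clusters)
    labels = [0] * len(features)
    for _ in range(num_iters):
        # Assignment step: each proposal takes the id of its nearest centroid.
        labels = [
            min(range(num_clusters), key=lambda k: squared_distance(f, centroids[k]))
            for f in features
        ]
        # Update step: move each centroid to the mean of its members.
        for k in range(num_clusters):
            members = [f for f, lab in zip(features, labels) if lab == k]
            if members:  # keep the old centroid if a cluster is empty
                centroids[k] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Toy example: four proposals whose (made-up) features form two clear groups.
proposal_features = [[0.0, 0.1], [0.1, 0.0], [5.0, 5.1], [5.1, 5.0]]
pseudo_labels = kmeans_pseudo_labels(proposal_features, num_clusters=2)
print(pseudo_labels)  # [0, 0, 1, 1]
```

The cluster ids carry no meaning of their own. They only need to be consistent across proposals so that a standard detection classification loss can be trained against them.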

Additionally, the framework employs self-training, where the model iteratively improves and refines the object proposals. This helps to further enrich the proposals with semantic information.

By aligning the pretraining and final tasks, the authors show that a simple detection pipeline without additional complexity can be used for both pretraining and the final model, and still achieve state-of-the-art performance. This holds even when little labeled data is available for the final task.

Importantly, the framework also works for pretraining the entire model, including the backbone neural network, from scratch. This paves the way for more effective and scalable unsupervised representation learning using object detection as the primary pretext task.

Technical Explanation

The proposed framework consists of three key components:

(i) Richer Initial Proposals: Rather than relying on low-level visual cues to define object proposals, the authors use a pre-trained object proposal generation model to obtain proposals that already encode high-level semantic information.
(ii) Class Pseudo-Labeling: Without access to ground truth class labels for the object proposals, the authors use a clustering-based approach to assign provisional "pseudo-labels" to the proposals. This allows the pretraining to use a standard object detection training pipeline, aligning the pretraining and downstream tasks.
(iii) Self-Training: The framework then employs an iterative self-training process, where the model progressively refines and improves the object proposals, further enriching them with semantic information.
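The self-training step can be pictured with the rough sketch below. It is not the authors' procedure: the summary only says that the model iteratively refines the proposals, so the confidence threshold, the overlap-based de-duplication, and the number of rounds are all assumptions made for illustration. `train_and_predict` is a hypothetical callable standing in for "train the detector on the current pseudo-labeled proposals, then run it on the unlabeled images".

```python
def iou(box_a, box_b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def refine_proposals(detections, score_threshold=0.5, iou_threshold=0.5):
    """Keep confident, non-duplicate detections as the next round's proposals.

    Each detection is a (box, score, pseudo_class) tuple.
    Assumption: thresholding plus greedy overlap suppression is one plausible
    refinement rule, not necessarily the one used in the paper.
    """
    confident = sorted(
        (d for d in detections if d[1] >= score_threshold),
        key=lambda d: d[1],
        reverse=True,
    )
    kept = []
    for det in confident:
        # Drop a detection that heavily overlaps a higher-scoring one.
        if all(iou(det[0], k[0]) < iou_threshold for k in kept):
            kept.append(det)
    return kept


def self_train(initial_proposals, train_and_predict, num_rounds=2):
    """Alternate between training on proposals and refining them."""
    proposals = initial_proposals
    for _ in range(num_rounds):
        detections = train_and_predict(proposals)  # hypothetical training step
        proposals = refine_proposals(detections)
    return proposals


# Toy stand-in for a trained detector: it always returns the same detections.
toy_detections = [
    ((0, 0, 10, 10), 0.9, 3),
    ((1, 1, 10, 10), 0.8, 3),    # near-duplicate of the first box
    ((20, 20, 30, 30), 0.3, 1),  # low confidence
    ((40, 40, 50, 50), 0.7, 2),
]
refined = self_train([], lambda proposals: toy_detections)
print(refined)
```

In the toy run, the near-duplicate and the low-confidence detection are dropped, leaving two proposals for the next round.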

The authors evaluate this framework on a variety of object detection benchmarks, including COCO, and show that it outperforms existing unsupervised pretraining methods by a significant margin, even in low-data regimes. Importantly, they demonstrate that their approach can also be used to pretrain the entire model, including the backbone network, from scratch, paving the way for more effective unsupervised representation learning using object detection as a pretext task.

Critical Analysis

The authors' framework addresses important limitations of existing unsupervised pretraining methods for object detectors. By aligning the pretraining and downstream tasks more closely, they are able to achieve state-of-the-art performance without the need for complex training pipelines or auxiliary losses.

However, the paper does not provide a deeper exploration of the underlying reasons for the performance gains. It would be interesting to understand the specific ways in which the richer initial proposals, class pseudo-labeling, and self-training contribute to the improved results. Additionally, the authors could have compared their approach to other self-supervised pretraining techniques to better contextualize the contributions of their framework.

Another potential area for further research is whether the framework extends beyond object detection, for example to self-supervised pretraining for text recognizers or to unsupervised domain adaptation. Exploring how the key ideas can be adapted to different domains and tasks could lead to more broadly applicable unsupervised pretraining strategies.

Conclusion

This paper presents a framework for unsupervised pretraining of object detectors built on three ingredients: richer initial proposals, class pseudo-labeling through clustering, and self-training. By aligning the pretraining and downstream tasks more closely, the authors achieve state-of-the-art performance on object detection benchmarks, including in low-data regimes. Their approach also enables unsupervised representation learning using object detection as a pretext task, paving the way for more effective and scalable self-supervised learning. While the paper could have explored the underlying reasons for the performance gains in more depth, it represents an important step forward for unsupervised pretraining in object detection and beyond.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!


Related Papers


Aligned Unsupervised Pretraining of Object Detectors with Self-training

Ioannis Maniadis Metaxas, Adrian Bulat, Ioannis Patras, Brais Martinez, Georgios Tzimiropoulos

The unsupervised pretraining of object detectors has recently become a key component of object detector training, as it leads to improved performance and faster convergence during the supervised fine-tuning stage. Existing unsupervised pretraining methods, however, typically rely on low-level information to define proposals that are used to train the detector. Furthermore, in the absence of class labels for these proposals, an auxiliary loss is used to add high-level semantics. This results in complex pipelines and a task gap between the pretraining and the downstream task. We propose a framework that mitigates this issue and consists of three simple yet key ingredients: (i) richer initial proposals that do encode high-level semantics, (ii) class pseudo-labeling through clustering, that enables pretraining using a standard object detection training pipeline, (iii) self-training to iteratively improve and enrich the object proposals. Once the pretraining and downstream tasks are aligned, a simple detection pipeline without further bells and whistles can be directly used for pretraining and, in fact, results in state-of-the-art performance on both the full and low data regimes, across detector architectures and datasets, by significant margins. We further show that our pretraining strategy is also capable of pretraining from scratch (including the backbone) and works on complex images like COCO, paving the path for unsupervised representation learning using object detection directly as a pretext task.

7/9/2024

Shelf-Supervised Multi-Modal Pre-Training for 3D Object Detection

Mehar Khurana, Neehar Peri, Deva Ramanan, James Hays

State-of-the-art 3D object detectors are often trained on massive labeled datasets. However, annotating 3D bounding boxes remains prohibitively expensive and time-consuming, particularly for LiDAR. Instead, recent works demonstrate that self-supervised pre-training with unlabeled data can improve detection accuracy with limited labels. Contemporary methods adapt best-practices for self-supervised learning from the image domain to point clouds (such as contrastive learning). However, publicly available 3D datasets are considerably smaller and less diverse than those used for image-based self-supervised learning, limiting their effectiveness. We do note, however, that such data is naturally collected in a multimodal fashion, often paired with images. Rather than pre-training with only self-supervised objectives, we argue that it is better to bootstrap point cloud representations using image-based foundation models trained on internet-scale image data. Specifically, we propose a shelf-supervised approach (e.g. supervised with off-the-shelf image foundation models) for generating zero-shot 3D bounding boxes from paired RGB and LiDAR data. Pre-training 3D detectors with such pseudo-labels yields significantly better semi-supervised detection accuracy than prior self-supervised pretext tasks. Importantly, we show that image-based shelf-supervision is helpful for training LiDAR-only and multi-modal (RGB + LiDAR) detectors. We demonstrate the effectiveness of our approach on nuScenes and WOD, significantly improving over prior work in limited data settings.

6/17/2024

UNION: Unsupervised 3D Object Detection using Object Appearance-based Pseudo-Classes

Ted Lentsch, Holger Caesar, Dariu M. Gavrila

Unsupervised 3D object detection methods have emerged to leverage vast amounts of data efficiently without requiring manual labels for training. Recent approaches rely on dynamic objects for learning to detect objects but penalize the detections of static instances during training. Multiple rounds of (self) training are used in which detected static instances are added to the set of training targets; this procedure to improve performance is computationally expensive. To address this, we propose the method UNION. We use spatial clustering and self-supervised scene flow to obtain a set of static and dynamic object proposals from LiDAR. Subsequently, object proposals' visual appearances are encoded to distinguish static objects in the foreground and background by selecting static instances that are visually similar to dynamic objects. As a result, static and dynamic foreground objects are obtained together, and existing detectors can be trained with a single training. In addition, we extend 3D object discovery to detection by using object appearance-based cluster labels as pseudo-class labels for training object classification. We conduct extensive experiments on the nuScenes dataset and increase the state-of-the-art performance for unsupervised object discovery, i.e. UNION more than doubles the average precision to 33.9. The code will be made publicly available.

5/27/2024

Towards Open-World Object-based Anomaly Detection via Self-Supervised Outlier Synthesis

Brian K. S. Isaac-Medina, Yona Falinie A. Gaus, Neelanjan Bhowmik, Toby P. Breckon

Object detection is a pivotal task in computer vision that has received significant attention in previous years. Nonetheless, the capability of a detector to localise objects out of the training distribution remains unexplored. Whilst recent approaches in object-level out-of-distribution (OoD) detection heavily rely on class labels, such approaches contradict truly open-world scenarios where the class distribution is often unknown. In this context, anomaly detection focuses on detecting unseen instances rather than classifying detections as OoD. This work aims to bridge this gap by leveraging an open-world object detector and an OoD detector via virtual outlier synthesis. This is achieved by using the detector backbone features to first learn object pseudo-classes via self-supervision. These pseudo-classes serve as the basis for class-conditional virtual outlier sampling of anomalous features that are classified by an OoD head. Our approach empowers our overall object detector architecture to learn anomaly-aware feature representations without relying on class labels, hence enabling truly open-world object anomaly detection. Empirical validation of our approach demonstrates its effectiveness across diverse datasets encompassing various imaging modalities (visible, infrared, and X-ray). Moreover, our method establishes state-of-the-art performance on object-level anomaly detection, achieving an average recall score improvement of over 5.4% for natural images and 23.5% for a security X-ray dataset compared to the current approaches. In addition, our method detects anomalies in datasets where current approaches fail. Code available at https://github.com/KostadinovShalon/oln-ssos.

7/23/2024