ProSub: Probabilistic Open-Set Semi-Supervised Learning with Subspace-Based Out-of-Distribution Detection

Read original: arXiv:2407.11735 - Published 7/17/2024 by Erik Wallin, Lennart Svensson, Fredrik Kahl, Lars Hammarstrand

Overview

The paper introduces a new approach called ProSub for open-set semi-supervised learning (OSSL), where unlabeled datasets may contain unknown classes.
Existing OSSL methods often use softmax confidence to classify data as in-distribution (ID) or out-of-distribution (OOD), and rely on ad-hoc thresholds without considering the problem's statistics.
The proposed ProSub framework uses a new score based on angles in feature space between data and an ID subspace, and estimates the conditional distributions of scores given ID or OOD data to enable probabilistic predictions of ID or OOD.
ProSub is shown to outperform state-of-the-art methods on several benchmark problems.

Plain English Explanation

In machine learning, there are situations where we have a dataset with both labeled and unlabeled data, and the unlabeled data may contain examples of classes that were not included in the labeled data. This is known as open-set semi-supervised learning (OSSL). Existing methods for OSSL often use a simple approach, like looking at how confident the model is in its predictions, to decide whether a piece of unlabeled data belongs to a known class (in-distribution) or an unknown class (out-of-distribution).

The researchers propose ProSub, which handles this problem in a more principled way. Instead of relying only on confidence scores, ProSub measures the angle between each sample's feature vector and a subspace associated with the known (in-distribution) classes, where the feature space is the high-dimensional representation the model uses to make its predictions. By modeling the statistical distribution of these angle-based scores for known and unknown data, ProSub can estimate the probability that an unlabeled example belongs to a known class rather than applying a hand-picked cutoff.

The researchers show that ProSub outperforms other state-of-the-art OSSL methods on several benchmark problems, demonstrating the value of their more sophisticated approach to this challenging machine learning problem.

Technical Explanation

The paper introduces a new framework called ProSub for open-set semi-supervised learning (OSSL), where unlabeled datasets may contain unknown classes. Existing OSSL methods often use the softmax confidence for classifying data as in-distribution (ID) or out-of-distribution (OOD), and rely on ad-hoc thresholds without considering the statistics of the problem.
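
For context, the softmax-confidence baseline described above can be sketched as follows. This is a minimal illustration of the general idea, not code from any particular OSSL method; the function names and the 0.95 threshold are assumptions chosen for the example.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def is_in_distribution(logits, threshold=0.95):
    """Baseline rule: call a sample ID when its top softmax probability
    reaches a hand-picked threshold (0.95 is an arbitrary, hypothetical value)."""
    return max(softmax(logits)) >= threshold


# One confidently classified sample and one ambiguous sample (made-up logits).
print(is_in_distribution([8.0, 0.5, 0.1]))
print(is_in_distribution([1.0, 0.9, 0.8]))
```

The threshold here is exactly the kind of ad-hoc choice the summary refers to: nothing in the rule reflects how confidence values are actually distributed for ID and OOD data.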

The key components of ProSub are:

A new score for ID/OOD classification based on angles in feature space between data and an ID subspace.
An approach to estimate the conditional distributions of scores given ID or OOD data, enabling probabilistic predictions of data being ID or OOD.
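
The first component can be illustrated with a small sketch that computes the cosine of the angle between a feature vector and a subspace. The summary does not say how ProSub constructs the ID subspace, so the example simply assumes it is the span of a few given vectors; all names and the toy vectors are hypothetical.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def orthonormal_basis(vectors, tol=1e-12):
    """Gram-Schmidt: orthonormal basis for the span of `vectors`.
    Vectors that are (numerically) linearly dependent are skipped."""
    basis = []
    for v in vectors:
        w = list(v)
        for b in basis:
            c = dot(w, b)
            w = [wi - c * bi for wi, bi in zip(w, b)]
        norm = math.sqrt(dot(w, w))
        if norm > tol:
            basis.append([wi / norm for wi in w])
    return basis


def subspace_cosine(feature, basis):
    """Cosine of the angle between `feature` and its orthogonal projection
    onto span(basis); 1.0 means the feature lies inside the subspace."""
    norm = math.sqrt(dot(feature, feature))
    if norm == 0.0:
        return 0.0
    proj_sq = sum(dot(feature, b) ** 2 for b in basis)
    return min(1.0, math.sqrt(proj_sq) / norm)


# Assumed ID subspace: the plane spanned by two toy vectors in 3-D.
id_basis = orthonormal_basis([[1.0, 0.0, 0.0], [1.0, 1.0, 0.0]])
print(subspace_cosine([1.0, 1.0, 0.0], id_basis))  # feature inside the plane
print(subspace_cosine([1.0, 0.0, 1.0], id_basis))  # feature 45 degrees off the plane
```

A score like this depends only on the direction of the feature vector, not its length, which is one reason an angle is a natural quantity to work with in feature space.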

These components are combined in the ProSub framework, which is shown to outperform state-of-the-art methods on several benchmark problems. The researchers demonstrate the effectiveness of their more principled approach compared to existing heuristic-based OSSL methods.
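
Once conditional score distributions are available, a probabilistic ID/OOD prediction follows from Bayes' rule. The sketch below assumes, purely for illustration, that scores lie in (0, 1) and follow Beta distributions with already-known parameters; the parameter values, the prior, and the function names are hypothetical, and the estimation of the distributions themselves is not shown.

```python
import math


def beta_pdf(x, a, b):
    """Density of a Beta(a, b) distribution at x in (0, 1)."""
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return math.exp(log_norm + (a - 1) * math.log(x) + (b - 1) * math.log(1 - x))


def posterior_id(score, id_params, ood_params, prior_id=0.5, eps=1e-6):
    """P(ID | score) via Bayes' rule from the two conditional densities.
    `id_params` and `ood_params` are (a, b) pairs for p(score | ID) and
    p(score | OOD); both are assumed to be given."""
    s = min(max(score, eps), 1 - eps)  # keep the logs finite at the ends
    p_id = beta_pdf(s, *id_params) * prior_id
    p_ood = beta_pdf(s, *ood_params) * (1 - prior_id)
    return p_id / (p_id + p_ood)


# Hypothetical setup: ID scores concentrate near 1, OOD scores near 0.
for s in (0.2, 0.5, 0.9):
    print(s, posterior_id(s, id_params=(8, 2), ood_params=(2, 8)))
```

Unlike a fixed cutoff, the output is a probability that can be used as a soft weight, and it shifts automatically if the estimated distributions or the prior change.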

Critical Analysis

The paper presents a thoughtful and well-designed approach to the challenging problem of open-set semi-supervised learning. The authors' use of feature space angles and probabilistic modeling is a significant advancement over previous heuristic-based methods.

One potential limitation of the work is that it relies on the assumption that the ID and OOD data can be well-separated in the feature space. In more complex real-world scenarios, this assumption may not always hold true, and the performance of ProSub could suffer. The authors acknowledge this issue and suggest further research into more robust feature representations.

Additionally, the paper does not explore the sensitivity of ProSub to the quality and quantity of labeled data. In many practical OSSL scenarios, the labeled data may be limited or of poor quality, and it would be important to understand how well the method performs in these more realistic conditions.

Despite these caveats, the ProSub framework represents an important step forward in open-set semi-supervised learning. The authors' rigorous approach and strong experimental results make a compelling case for the value of their contribution to the field.

Conclusion

The paper introduces ProSub, a new framework for open-set semi-supervised learning that uses a more principled approach to classifying unlabeled data as in-distribution or out-of-distribution. By modeling the statistical properties of feature space angles, ProSub is able to outperform existing heuristic-based methods on several benchmark problems.

This work represents an important advancement in open-world machine learning, where the ability to detect and handle unknown classes is crucial for deploying robust and reliable systems. The authors' thoughtful design and rigorous evaluation of ProSub suggest that their approach could have a significant impact on the future development of semi-supervised learning techniques.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

ProSub: Probabilistic Open-Set Semi-Supervised Learning with Subspace-Based Out-of-Distribution Detection

Erik Wallin, Lennart Svensson, Fredrik Kahl, Lars Hammarstrand

In open-set semi-supervised learning (OSSL), we consider unlabeled datasets that may contain unknown classes. Existing OSSL methods often use the softmax confidence for classifying data as in-distribution (ID) or out-of-distribution (OOD). Additionally, many works for OSSL rely on ad-hoc thresholds for ID/OOD classification, without considering the statistics of the problem. We propose a new score for ID/OOD classification based on angles in feature space between data and an ID subspace. Moreover, we propose an approach to estimate the conditional distributions of scores given ID or OOD data, enabling probabilistic predictions of data being ID or OOD. These components are put together in a framework for OSSL, termed ProSub, that is experimentally shown to reach SOTA performance on several benchmark problems. Our code is available at https://github.com/walline/prosub.

7/17/2024

Robust Semi-supervised Learning by Wisely Leveraging Open-set Data

Yang Yang, Nan Jiang, Yi Xu, De-Chuan Zhan

Open-set Semi-supervised Learning (OSSL) considers the realistic setting in which unlabeled data may come from classes unseen in the labeled set, i.e., out-of-distribution (OOD) data, which can degrade the performance of conventional SSL models. To handle this issue, some existing OSSL approaches employ, in addition to the traditional in-distribution (ID) classifier, an extra OOD detection module to avoid the potential negative impact of the OOD data. Nevertheless, these approaches typically employ the entire set of open-set data during their training process, which may contain data unfriendly to the OSSL task that can negatively influence the model performance. This inspires us to develop a robust open-set data selection strategy for OSSL. Through a theoretical understanding from the perspective of learning theory, we propose Wise Open-set Semi-supervised Learning (WiseOpen), a generic OSSL framework that selectively leverages the open-set data for training the model. By applying a gradient-variance-based selection mechanism, WiseOpen exploits a friendly subset instead of the whole open-set dataset to enhance the model's capability of ID classification. Moreover, to reduce the computational expense, we also propose two practical variants of WiseOpen by adopting low-frequency update and loss-based selection respectively. Extensive experiments demonstrate the effectiveness of WiseOpen in comparison with the state-of-the-art.

5/21/2024

Class-balanced Open-set Semi-supervised Object Detection for Medical Images

Zhanyun Lu, Renshu Gu, Huimin Cheng, Siyu Pang, Mingyu Xu, Peifang Xu, Yaqi Wang, Yuichiro Kinoshita, Juan Ye, Gangyong Jia, Qing Wu

Medical image datasets in the real world are often unlabeled and imbalanced, and Semi-Supervised Object Detection (SSOD) can utilize unlabeled data to improve an object detector. However, existing approaches predominantly assumed that the unlabeled data and test data do not contain out-of-distribution (OOD) classes. The few open-set semi-supervised object detection methods have two weaknesses: first, the class imbalance is not considered; second, the OOD instances are distinguished and simply discarded during pseudo-labeling. In this paper, we consider the open-set semi-supervised object detection problem which leverages unlabeled data that contain OOD classes to improve object detection for medical images. Our study incorporates two key innovations: Category Control Embed (CCE) and out-of-distribution Detection Fusion Classifier (OODFC). CCE is designed to tackle dataset imbalance by constructing a Foreground information Library, while OODFC tackles open-set challenges by integrating the "unknown" information into basic pseudo-labels. Our method outperforms the state-of-the-art SSOD performance, achieving a 4.25 mAP improvement on the public Parasite dataset.

8/23/2024

Out-of-distribution detection based on subspace projection of high-dimensional features output by the last convolutional layer

Qiuyu Zhu, Yiwei He

Out-of-distribution (OOD) detection, crucial for reliable pattern classification, discerns whether a sample originates outside the training distribution. This paper concentrates on the high-dimensional features output by the final convolutional layer, which contain rich image features. Our key idea is to project these high-dimensional features into two specific feature subspaces, leveraging the dimensionality reduction capacity of the network's linear layers, trained with Predefined Evenly-Distribution Class Centroids (PEDCC)-Loss. This involves calculating the cosines of three projection angles and the norm values of features, thereby identifying distinctive information for in-distribution (ID) and OOD data, which assists in OOD detection. Building upon this, we have modified the batch normalization (BN) and ReLU layer preceding the fully connected layer, diminishing their impact on the output feature distributions and thereby widening the distribution gap between ID and OOD data features. Our method requires only the training of the classification network model, eschewing any need for input pre-processing or specific OOD data pre-tuning. Extensive experiments on several benchmark datasets demonstrate that our approach delivers state-of-the-art performance. Our code is available at https://github.com/Hewell0/ProjOOD.

5/6/2024