Robust Semi-supervised Learning by Wisely Leveraging Open-set Data

Read original: arXiv:2405.06979 - Published 5/21/2024 by Yang Yang, Nan Jiang, Yi Xu, De-Chuan Zhan

Overview

This paper proposes a novel open-set semi-supervised learning (OSSL) framework called Wise Open-set Semi-supervised Learning (WiseOpen) that can effectively handle unlabeled data from unseen classes (out-of-distribution or OOD data).
Existing OSSL approaches often employ an extra OOD detection module, yet they still train on the entire set of open-set data, which may include samples that are unfriendly to the OSSL task and degrade model performance.
WiseOpen selectively leverages the open-set data by applying a gradient-variance-based selection mechanism to exploit a friendly subset, enhancing the model's capability for in-distribution (ID) classification.
Two practical variants of WiseOpen are also proposed to reduce computational expense by adopting low-frequency update and loss-based selection, respectively.

Plain English Explanation

In machine learning, it is common to have a set of labeled data (with known classes) and a larger set of unlabeled data. Open-set Semi-supervised Learning (OSSL) is a realistic scenario where the unlabeled data may come from classes that are not present in the labeled set (out-of-distribution or OOD data). This can cause issues for conventional semi-supervised learning (SSL) models, as they may struggle to handle this OOD data.

To address this, some existing OSSL approaches use an extra module to detect the OOD data and avoid its negative impact. However, these approaches still use the entire set of open-set data during training, which may include data that is not helpful for the OSSL task and can negatively influence the model's performance.

The authors of this paper propose a new framework called Wise Open-set Semi-supervised Learning (WiseOpen) that selectively uses the open-set data to train the model. By applying a gradient-variance-based selection mechanism, WiseOpen chooses a "friendly" subset of the open-set data to enhance the model's ability to classify the in-distribution (ID) data correctly. This helps to mitigate the negative impact of the OOD data.

Additionally, the authors propose two practical variants of WiseOpen that further reduce the computational expense by using low-frequency updates and loss-based selection, respectively.

Technical Explanation

The core idea behind WiseOpen is to train on a selected subset of the open-set data rather than all of it, since the full set may contain samples that are unfriendly to the OSSL task and degrade the model's performance.

WiseOpen consists of two main components:

In-distribution (ID) classifier: This is the traditional SSL classifier that learns to classify the labeled and unlabeled ID data.
Open-set data selection mechanism: This module selects a "friendly" subset of the open-set data to be used for training the ID classifier, based on a gradient-variance-based selection criterion.

The gradient-variance-based selection mechanism works as follows:

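For each unlabeled open-set data point, the method computes a gradient variance: a measure of how far the gradient of the ID classifier's loss on that sample (taken with respect to the model parameters) deviates from the gradient on the ID data.
Data points with lower gradient variance, whose gradients agree more closely with the ID data, are considered more "friendly" to the OSSL task and are selected for training the ID classifier.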

This selective use of the open-set data helps to enhance the ID classifier's capability and mitigate the negative impact of the OOD data.
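The selection step can be illustrated with a small NumPy sketch. It is not the authors' implementation: it assumes a linear softmax classifier, pseudo-labels taken from the model's own predictions, a gradient variance defined as the squared distance between a sample's gradient and the mean gradient over a labeled ID batch, and a quantile-based threshold that keeps the samples with the smallest scores. The function names and the `keep_fraction` parameter are hypothetical.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)   # subtract row max for stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def per_sample_grads(W, X, y):
    """Per-sample gradient of the cross-entropy loss w.r.t. W (linear softmax model)."""
    P = softmax(X @ W)                      # (n, C) class probabilities
    P[np.arange(len(y)), y] -= 1.0          # p - onehot(y)
    return X[:, :, None] * P[:, None, :]    # (n, d, C): outer product per sample

def gradient_variance(W, X_open, X_lab, y_lab):
    """ASSUMED definition: squared distance to the mean labeled-ID gradient."""
    pseudo = softmax(X_open @ W).argmax(axis=1)              # assumed pseudo-labels
    g_open = per_sample_grads(W, X_open, pseudo)
    g_ref = per_sample_grads(W, X_lab, y_lab).mean(axis=0)   # reference ID gradient
    return ((g_open - g_ref) ** 2).sum(axis=(1, 2))

def select_friendly(gv, keep_fraction=0.5):
    """ASSUMED rule: keep the samples whose score is at or below a quantile threshold."""
    return gv <= np.quantile(gv, keep_fraction)

# Toy data: 6 labeled ID samples, 8 unlabeled open-set samples, 5 features, 3 classes.
rng = np.random.default_rng(0)
W = rng.normal(size=(5, 3))
X_lab, y_lab = rng.normal(size=(6, 5)), rng.integers(0, 3, size=6)
X_open = rng.normal(size=(8, 5))

gv = gradient_variance(W, X_open, X_lab, y_lab)
mask = select_friendly(gv, keep_fraction=0.5)
print(gv.round(3), mask)
```

Only the samples where `mask` is true would contribute to the unlabeled part of the training loss in this sketch. Computing per-sample gradients for every open-set point is the expensive part, which is what motivates cheaper variants.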

To further reduce the computational expense, the authors also propose two practical variants of WiseOpen:

Low-frequency update: Instead of updating the open-set data selection at every iteration, this variant updates the selection at a lower frequency, reducing the computational cost.
Loss-based selection: This variant scores the open-set data by the ID classifier's loss instead of the gradient variance, because the loss is cheaper to compute.
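Both variants can be combined in a short training-loop sketch. As before, this is an illustration under stated assumptions rather than the paper's algorithm: the model is a linear softmax classifier, the loss of an open-set sample is its cross-entropy against its own pseudo-label, low-loss samples are the ones kept, and the names `update_every` and `keep_fraction` are hypothetical.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def loss_based_mask(W, X_open, keep_fraction=0.5):
    """ASSUMED rule: keep open-set samples with the lowest pseudo-label loss."""
    P = softmax(X_open @ W)
    loss = -np.log(P.max(axis=1))           # cross-entropy against the argmax class
    return loss <= np.quantile(loss, keep_fraction)

def train(W, X_lab, y_lab, X_open, steps=20, update_every=5, lr=0.1):
    mask = np.ones(len(X_open), dtype=bool)
    refreshes = 0
    for t in range(steps):
        # Low-frequency update: recompute the selection only every `update_every` steps.
        if t % update_every == 0:
            mask = loss_based_mask(W, X_open)
            refreshes += 1
        # One gradient step on the labeled data plus the selected open-set samples.
        X_sel = X_open[mask]
        y_sel = softmax(X_sel @ W).argmax(axis=1)   # assumed pseudo-labels
        X_all = np.vstack([X_lab, X_sel])
        y_all = np.concatenate([y_lab, y_sel])
        P = softmax(X_all @ W)
        P[np.arange(len(y_all)), y_all] -= 1.0      # p - onehot(y)
        W = W - lr * (X_all.T @ P) / len(y_all)
    return W, mask, refreshes

rng = np.random.default_rng(0)
W0 = rng.normal(size=(5, 3))
X_lab, y_lab = rng.normal(size=(6, 5)), rng.integers(0, 3, size=6)
X_open = rng.normal(size=(8, 5))

W1, final_mask, refreshes = train(W0, X_lab, y_lab, X_open)
print(refreshes, final_mask)
```

The selection needs only a forward pass here, and it runs on a fraction of the training steps, so the cost of choosing the subset is small compared with per-sample gradient computation.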

The authors conduct extensive experiments to demonstrate the effectiveness of WiseOpen and its variants compared to state-of-the-art OSSL approaches.

Critical Analysis

The paper presents an OSSL framework that addresses a realistic and important problem in semi-supervised learning. The authors motivate the selection mechanism with a learning-theory analysis and support it with extensive experiments against state-of-the-art OSSL approaches.

One potential limitation of the work is that the open-set data selection mechanism relies on gradient-based computations, which can be computationally expensive, especially for large-scale datasets. The authors' proposed variants, such as low-frequency update and loss-based selection, aim to address this issue, but further research may be needed to develop even more efficient open-set data selection strategies.

Additionally, the paper does not explore the potential impact of the open-set data selection on the OOD detection performance. It would be interesting to investigate whether the selective use of open-set data can also improve the model's ability to identify OOD samples, as this is an important aspect of OSSL.

Overall, the WiseOpen framework presents a promising approach to open-set semi-supervised learning and could inspire further research in this area, particularly on efficient open-set data selection and the integration of OOD detection capabilities.

Conclusion

The proposed WiseOpen framework addresses a crucial challenge in open-set semi-supervised learning by selectively leveraging the open-set data to enhance in-distribution classification performance. Its core idea, gradient-variance-based selection of open-set data, is complemented by two practical variants that reduce computational expense.

This research contributes to the ongoing efforts to improve open-set learning and could have significant implications for various real-world applications where unlabeled data may come from unseen classes. The paper's findings and the WiseOpen framework serve as a valuable foundation for further advancements in the field of open-set semi-supervised learning.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Robust Semi-supervised Learning by Wisely Leveraging Open-set Data

Yang Yang, Nan Jiang, Yi Xu, De-Chuan Zhan

Open-set Semi-supervised Learning (OSSL) holds a realistic setting that unlabeled data may come from classes unseen in the labeled set, i.e., out-of-distribution (OOD) data, which could cause performance degradation in conventional SSL models. To handle this issue, except for the traditional in-distribution (ID) classifier, some existing OSSL approaches employ an extra OOD detection module to avoid the potential negative impact of the OOD data. Nevertheless, these approaches typically employ the entire set of open-set data during their training process, which may contain data unfriendly to the OSSL task that can negatively influence the model performance. This inspires us to develop a robust open-set data selection strategy for OSSL. Through a theoretical understanding from the perspective of learning theory, we propose Wise Open-set Semi-supervised Learning (WiseOpen), a generic OSSL framework that selectively leverages the open-set data for training the model. By applying a gradient-variance-based selection mechanism, WiseOpen exploits a friendly subset instead of the whole open-set dataset to enhance the model's capability of ID classification. Moreover, to reduce the computational expense, we also propose two practical variants of WiseOpen by adopting low-frequency update and loss-based selection respectively. Extensive experiments demonstrate the effectiveness of WiseOpen in comparison with the state-of-the-art.

5/21/2024

ProSub: Probabilistic Open-Set Semi-Supervised Learning with Subspace-Based Out-of-Distribution Detection

Erik Wallin, Lennart Svensson, Fredrik Kahl, Lars Hammarstrand

In open-set semi-supervised learning (OSSL), we consider unlabeled datasets that may contain unknown classes. Existing OSSL methods often use the softmax confidence for classifying data as in-distribution (ID) or out-of-distribution (OOD). Additionally, many works for OSSL rely on ad-hoc thresholds for ID/OOD classification, without considering the statistics of the problem. We propose a new score for ID/OOD classification based on angles in feature space between data and an ID subspace. Moreover, we propose an approach to estimate the conditional distributions of scores given ID or OOD data, enabling probabilistic predictions of data being ID or OOD. These components are put together in a framework for OSSL, termed ProSub, that is experimentally shown to reach SOTA performance on several benchmark problems. Our code is available at https://github.com/walline/prosub.

7/17/2024

Rethinking Open-World Semi-Supervised Learning: Distribution Mismatch and Inductive Inference

Seongheon Park, Hyuk Kwon, Kwanghoon Sohn, Kibok Lee

Open-world semi-supervised learning (OWSSL) extends conventional semi-supervised learning to open-world scenarios by taking account of novel categories in unlabeled datasets. Despite the recent advancements in OWSSL, the success often relies on the assumptions that 1) labeled and unlabeled datasets share the same balanced class prior distribution, which does not generally hold in real-world applications, and 2) unlabeled training datasets are utilized for evaluation, where such transductive inference might not adequately address challenges in the wild. In this paper, we aim to generalize OWSSL by addressing them. Our work suggests that practical OWSSL may require different training settings, evaluation methods, and learning strategies compared to those prevalent in the existing literature.

6/3/2024

Class-balanced Open-set Semi-supervised Object Detection for Medical Images

Zhanyun Lu, Renshu Gu, Huimin Cheng, Siyu Pang, Mingyu Xu, Peifang Xu, Yaqi Wang, Yuichiro Kinoshita, Juan Ye, Gangyong Jia, Qing Wu

Medical image datasets in the real world are often unlabeled and imbalanced, and Semi-Supervised Object Detection (SSOD) can utilize unlabeled data to improve an object detector. However, existing approaches predominantly assumed that the unlabeled data and test data do not contain out-of-distribution (OOD) classes. The few open-set semi-supervised object detection methods have two weaknesses: first, the class imbalance is not considered; second, the OOD instances are distinguished and simply discarded during pseudo-labeling. In this paper, we consider the open-set semi-supervised object detection problem which leverages unlabeled data that contain OOD classes to improve object detection for medical images. Our study incorporates two key innovations: Category Control Embed (CCE) and out-of-distribution Detection Fusion Classifier (OODFC). CCE is designed to tackle dataset imbalance by constructing a Foreground information Library, while OODFC tackles open-set challenges by integrating the "unknown" information into basic pseudo-labels. Our method outperforms the state-of-the-art SSOD performance, achieving a 4.25 mAP improvement on the public Parasite dataset.

8/23/2024