Positive Unlabeled Learning Selected Not At Random (PULSNAR): class proportion estimation when the SCAR assumption does not hold

Read original: arXiv:2303.08269 - Published 5/7/2024 by Praveen Kumar, Christophe G. Lambert

Overview

Positive and Unlabeled (PU) learning is a type of semi-supervised binary classification where the algorithm differentiates between a set of labeled positive instances and a set of unlabeled instances that contains both positives and negatives.
PU learning is useful when confirmed negatives are unavailable or difficult to obtain, but there is value in discovering positives among the unlabeled (e.g., finding viable drugs among untested compounds).
Most PU learning algorithms assume positives are selected completely at random (SCAR), meaning that whether a positive gets labeled does not depend on its features. In many real-world applications, positives are not SCAR (e.g., severe cases are more likely to be diagnosed).
When SCAR is violated, estimates of the proportion of positives among the unlabeled examples (α) are poor and the model is poorly calibrated, which makes the decision threshold for selecting positives uncertain.
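
One common way to write the SCAR assumption formally is sketched below. The notation is an assumption made for illustration and is not taken from this summary: y is the true class, s indicates whether an example is labeled, c is a constant labeling rate, and f_p, f_n, and f_u are the feature densities of positives, negatives, and unlabeled examples.

```latex
% SCAR: the chance that a positive gets labeled does not depend on its features x
P(s = 1 \mid x, y = 1) = P(s = 1 \mid y = 1) = c

% The unlabeled set is a mixture of positives and negatives with positive proportion \alpha
f_u(x) = \alpha \, f_p(x) + (1 - \alpha) \, f_n(x)
```

Under SCAR, the labeled positives are a representative sample from f_p, so α can be estimated by comparing the labeled and unlabeled sets. When labeling depends on x, the labeled positives no longer follow f_p, and that comparison can give a biased α.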

Plain English Explanation

Imagine you're a doctor trying to identify patients with a particular disease. Normally, you'd have a clear set of patients who have the disease (the "positive" cases) and a clear set of patients who don't (the "negative" cases). But in some situations, you may only have access to patients who are confirmed to have the disease, while the rest of the patients are a mix of those with and without the disease (the "unlabeled" cases).

PU learning is a technique that can help you identify the positive cases in this situation. It works by looking at the characteristics of the confirmed positive cases and trying to find similar patterns in the unlabeled group. However, most PU learning algorithms assume that the positive cases are selected completely at random, which isn't always true in real-world situations.

For example, in healthcare, severe cases are more likely to be diagnosed and therefore labeled positive. The labeled positives then stop being representative of the undiagnosed positives hidden in the unlabeled group, so an algorithm that assumes random selection can misjudge the proportion of positives (α) in that group, leading to inaccurate estimates and poor decision-making.
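
The effect of non-random selection can be illustrated with a small simulation. This is only a sketch under assumed numbers: the severity score, the 50% diagnosis rate, and the rule that the chance of diagnosis equals severity are all hypothetical choices made for illustration, not values from the paper.

```python
import random

random.seed(0)  # fixed seed so the run is repeatable

# Hypothetical positives: each has a severity score drawn uniformly from [0, 1).
severity = [random.random() for _ in range(20000)]


def mean(values):
    return sum(values) / len(values)


# SCAR labeling: every positive has the same 50% chance of being diagnosed.
scar_labeled = [s for s in severity if random.random() < 0.5]

# SNAR labeling: the chance of being diagnosed equals the severity itself.
snar_labeled = [s for s in severity if random.random() < s]

# Gap in mean severity between the labeled positives and all positives.
scar_gap = mean(scar_labeled) - mean(severity)
snar_gap = mean(snar_labeled) - mean(severity)

print(f"SCAR gap: {scar_gap:+.3f}")
print(f"SNAR gap: {snar_gap:+.3f}")
```

In expectation the SCAR gap is zero, while the SNAR gap is 2/3 - 1/2, or about 0.17: the diagnosed group is noticeably more severe than positives as a whole, so it is a biased picture of the positives still hidden among the unlabeled.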

Technical Explanation

The researchers propose two new PU learning algorithms, one for the SCAR setting and one for the setting where positives are selected not at random (SNAR):

PULSCAR (Positive Unlabeled Learning Selected Completely At Random): This algorithm estimates the proportion of positives (α) in the unlabeled set, assuming the positives are selected completely at random.
PULSNAR (Positive Unlabeled Learning Selected Not At Random): This algorithm takes a divide-and-conquer approach, first clustering the positives into subtypes and then estimating α for each subtype by applying PULSCAR to the positives from each cluster and the unlabeled set.
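
The divide-and-conquer idea can be sketched on toy one-dimensional data. Everything here is an assumption made for illustration rather than the authors' implementation: the data are synthetic, a minimal k-means stands in for the clustering step, a histogram min-ratio rule stands in for PULSCAR as the per-cluster SCAR estimator, and the per-subtype estimates are summed to obtain the overall α.

```python
from collections import Counter


def spread(lo, hi, n):
    """Return n evenly spaced points strictly inside [lo, hi)."""
    return [lo + (i + 0.5) * (hi - lo) / n for i in range(n)]


# Toy 1-D data (all values hypothetical). Two positive subtypes, A and B.
labeled_pos = spread(0, 3, 300) + spread(7, 10, 300)  # A and B labeled equally often
unlabeled = (
    spread(0, 3, 150)      # subtype A positives: 15% of the unlabeled set
    + spread(7, 10, 60)    # subtype B positives: 6%
    + spread(2, 8, 790)    # negatives: 79%
)
true_alpha = (150 + 60) / len(unlabeled)


def hist(xs):
    """Fraction of points falling in each unit-width bin."""
    counts = Counter(int(x) for x in xs)
    return {b: c / len(xs) for b, c in counts.items()}


def alpha_scar(pos, unl):
    """Stand-in SCAR estimator: largest a with u(bin) - a * p(bin) >= 0 in every bin."""
    p, u = hist(pos), hist(unl)
    return min(u.get(b, 0.0) / frac for b, frac in p.items())


def kmeans_1d(xs, k=2, iters=20):
    """Minimal 1-D k-means; returns a cluster index for each point."""
    lo, hi = min(xs), max(xs)
    centers = [lo + (hi - lo) * (j + 0.5) / k for j in range(k)]
    labels = [0] * len(xs)
    for _ in range(iters):
        labels = [min(range(k), key=lambda j: abs(x - centers[j])) for x in xs]
        for j in range(k):
            members = [x for x, lab in zip(xs, labels) if lab == j]
            if members:
                centers[j] = sum(members) / len(members)
    return labels


# Naive: treat all labeled positives as one SCAR sample.
alpha_naive = alpha_scar(labeled_pos, unlabeled)

# Divide and conquer: cluster the positives, estimate per subtype, then add up.
labels = kmeans_1d(labeled_pos)
alpha_snar = sum(
    alpha_scar([x for x, lab in zip(labeled_pos, labels) if lab == j], unlabeled)
    for j in set(labels)
)

print(f"true={true_alpha:.2f} naive={alpha_naive:.2f} divide-and-conquer={alpha_snar:.2f}")
```

On this toy data the pooled estimate comes out at 0.12, because subtype B is over-represented among the labeled positives relative to the unlabeled set. The per-subtype estimates are 0.15 and 0.06, which add up to the true 0.21. Real data are noisy, so a raw min-ratio rule like this would be unstable in practice; it is used here only to keep the sketch short.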

In their experiments, the researchers found that PULSNAR outperformed state-of-the-art PU learning approaches on both synthetic and real-world benchmark datasets. This suggests that accounting for the non-random selection of positives can lead to more accurate estimates of α and better classification performance.

Critical Analysis

The researchers acknowledge that their PULSNAR algorithm relies on the assumption that the positive instances can be meaningfully clustered into subtypes. In some applications, this may not be the case, and the algorithm may not perform as well.

Additionally, the paper does not explore how the PULSNAR algorithm would perform in situations where the unlabeled data comes from a different domain than the labeled positive data, which is a common challenge in real-world PU learning problems.

Further research could investigate the robustness of the PULSNAR algorithm to violations of its underlying assumptions, as well as its performance in cross-domain PU learning scenarios.

Conclusion

The researchers have proposed two new PU learning algorithms: PULSCAR, which estimates the proportion of positives (α) under the SCAR assumption, and PULSNAR, which builds on PULSCAR to handle positives that are not selected at random. PULSNAR, in particular, shows promising results in improving the estimation of α and the overall classification performance.

This work highlights the importance of considering how real-world data are actually generated when designing machine learning algorithms, since realistic assumptions can lead to more accurate and reliable models. The approach is relevant wherever PU learning applies, including healthcare, drug discovery, and other domains where confirmed negative instances are scarce or difficult to obtain.

Related Papers

Positive Unlabeled Learning Selected Not At Random (PULSNAR): class proportion estimation when the SCAR assumption does not hold

Praveen Kumar, Christophe G. Lambert

Positive and Unlabeled (PU) learning is a type of semi-supervised binary classification where the machine learning algorithm differentiates between a set of positive instances (labeled) and a set of both positive and negative instances (unlabeled). PU learning has broad applications in settings where confirmed negatives are unavailable or difficult to obtain, and there is value in discovering positives among the unlabeled (e.g., viable drugs among untested compounds). Most PU learning algorithms make the selected completely at random (SCAR) assumption, namely that positives are selected independently of their features. However, in many real-world applications, such as healthcare, positives are not SCAR (e.g., severe cases are more likely to be diagnosed), leading to a poor estimate of the proportion, α, of positives among unlabeled examples and poor model calibration, resulting in an uncertain decision threshold for selecting positives. PU learning algorithms vary; some estimate only the proportion, α, of positives in the unlabeled set, while others calculate the probability that each specific unlabeled instance is positive, and some can do both. We propose two PU learning algorithms to estimate α, calculate calibrated probabilities for PU instances, and improve classification metrics: i) PULSCAR (positive unlabeled learning selected completely at random), and ii) PULSNAR (positive unlabeled learning selected not at random). PULSNAR employs a divide-and-conquer approach to cluster SNAR positives into subtypes and estimates α for each subtype by applying PULSCAR to positives from each cluster and all unlabeled. In our experiments, PULSNAR outperformed state-of-the-art approaches on both synthetic and real-world benchmark datasets.

5/7/2024

Positive Unlabeled Contrastive Learning

Anish Acharya, Sujay Sanghavi, Li Jing, Bhargav Bhushanam, Dhruv Choudhary, Michael Rabbat, Inderjit Dhillon

Self-supervised pretraining on unlabeled data followed by supervised fine-tuning on labeled data is a popular paradigm for learning from limited labeled examples. We extend this paradigm to the classical positive unlabeled (PU) setting, where the task is to learn a binary classifier given only a few labeled positive samples, and (often) a large amount of unlabeled samples (which could be positive or negative). We first propose a simple extension of standard infoNCE family of contrastive losses, to the PU setting; and show that this learns superior representations, as compared to existing unsupervised and supervised approaches. We then develop a simple methodology to pseudo-label the unlabeled samples using a new PU-specific clustering scheme; these pseudo-labels can then be used to train the final (positive vs. negative) classifier. Our method handily outperforms state-of-the-art PU methods over several standard PU benchmark datasets, while not requiring a-priori knowledge of any class prior (which is a common assumption in other PU methods). We also provide a simple theoretical analysis that motivates our methods.

4/1/2024

Augmented prediction of a true class for Positive Unlabeled data under selection bias

Jan Mielniczuk, Adam Wawrzeńczyk

We introduce a new observational setting for Positive Unlabeled (PU) data where the observations at prediction time are also labeled. This occurs commonly in practice -- we argue that the additional information is important for prediction, and call this task augmented PU prediction. We allow for labeling to be feature dependent. In such scenario, Bayes classifier and its risk is established and compared with a risk of a classifier which for unlabeled data is based only on predictors. We introduce several variants of the empirical Bayes rule in such scenario and investigate their performance. We emphasise dangers (and ease) of applying classical classification rule in the augmented PU scenario -- due to no preexisting studies, an unaware researcher is prone to skewing the obtained predictions. We conclude that the variant based on recently proposed variational autoencoder designed for PU scenario works on par or better than other considered variants and yields advantage over feature-only based methods in terms of accuracy for unlabeled samples.

7/16/2024

PUAL: A Classifier on Trifurcate Positive-Unlabeled Data

Xiaoke Wang, Xiaochen Yang, Rui Zhu, Jing-Hao Xue

Positive-unlabeled (PU) learning aims to train a classifier using the data containing only labeled-positive instances and unlabeled instances. However, existing PU learning methods are generally hard to achieve satisfactory performance on trifurcate data, where the positive instances distribute on both sides of the negative instances. To address this issue, firstly we propose a PU classifier with asymmetric loss (PUAL), by introducing a structure of asymmetric loss on positive instances into the objective function of the global and local learning classifier. Then we develop a kernel-based algorithm to enable PUAL to obtain non-linear decision boundary. We show that, through experiments on both simulated and real-world datasets, PUAL can achieve satisfactory classification on trifurcate data.

6/3/2024