Leveraging Ensemble Diversity for Robust Self-Training in the Presence of Sample Selection Bias

2310.14814

Published 4/4/2024 by Ambroise Odonnat, Vasilii Feofanov, Ievgen Redko

Abstract

Self-training is a well-known approach for semi-supervised learning. It consists of iteratively assigning pseudo-labels to unlabeled data for which the model is confident and treating them as labeled examples. For neural networks, softmax prediction probabilities are often used as a confidence measure, although they are known to be overconfident, even for wrong predictions. This phenomenon is particularly intensified in the presence of sample selection bias, i.e., when data labeling is subject to some constraint. To address this issue, we propose a novel confidence measure, called $\mathcal{T}$-similarity, built upon the prediction diversity of an ensemble of linear classifiers. We provide the theoretical analysis of our approach by studying stationary points and describing the relationship between the diversity of the individual members and their performance. We empirically demonstrate the benefit of our confidence measure for three different pseudo-labeling policies on classification datasets of various data modalities. The code is available at https://github.com/ambroiseodt/tsim.

Overview

Self-training is a common approach for semi-supervised learning, where the model assigns "pseudo-labels" to unlabeled data it is confident about and then uses those as additional training examples.
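However, neural networks are known to be overconfident in their predictions, even when they are wrong, and the problem gets worse under sample selection bias, that is, when the choice of which examples get labeled is subject to some constraint.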
The paper proposes a new way to measure the model's confidence, called T-similarity, which looks at the diversity of an ensemble of simpler linear classifiers instead of just the neural network's softmax probabilities.
The authors provide theoretical analysis and show empirically that this new confidence measure can improve the performance of self-training on various classification tasks.

Plain English Explanation

Training machine learning models usually requires a lot of labeled data, which can be expensive and time-consuming to obtain. Semi-supervised learning is a way to get around this by also using unlabeled data to help train the model.

One popular semi-supervised technique is called self-training. With self-training, the model first makes predictions on the unlabeled data and then treats the predictions it's most confident about as if they were real labeled examples. The model then retrains on this "pseudo-labeled" data along with the original labeled data.
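The loop described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the nearest-centroid classifier, the function names, the fixed threshold of 0.8, and the stopping rule are all assumptions chosen to keep the example self-contained.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def fit_centroids(points, labels):
    """Toy stand-in for a model: one centroid (mean point) per class."""
    sums, counts = {}, {}
    for p, y in zip(points, labels):
        acc = sums.setdefault(y, [0.0] * len(p))
        for i, v in enumerate(p):
            acc[i] += v
        counts[y] = counts.get(y, 0) + 1
    return {y: [v / counts[y] for v in acc] for y, acc in sums.items()}

def predict_proba(centroids, point):
    """Softmax over negative squared distances to each class centroid."""
    classes = sorted(centroids)
    scores = [-sum((a - b) ** 2 for a, b in zip(point, centroids[c]))
              for c in classes]
    return classes, softmax(scores)

def self_train(labeled_x, labeled_y, unlabeled_x, threshold=0.8, max_rounds=10):
    x, y = list(labeled_x), list(labeled_y)
    pool = list(unlabeled_x)
    for _ in range(max_rounds):
        model = fit_centroids(x, y)          # retrain on labeled + pseudo-labeled data
        remaining = []
        for p in pool:
            classes, probs = predict_proba(model, p)
            best = max(range(len(probs)), key=probs.__getitem__)
            if probs[best] >= threshold:     # confident: accept the pseudo-label
                x.append(p)
                y.append(classes[best])
            else:                            # not confident: keep it unlabeled
                remaining.append(p)
        if len(remaining) == len(pool):      # nothing new was pseudo-labeled
            break
        pool = remaining
    return fit_centroids(x, y), pool

model, still_unlabeled = self_train(
    [(0.0, 0.0), (4.0, 4.0)], [0, 1],
    [(0.5, 0.2), (3.8, 4.1), (2.0, 2.0)],
)
```

In this toy run the two points near a labeled example are pseudo-labeled, while the point halfway between the classes never reaches the threshold and stays unlabeled. Everything hinges on the confidence score used in the threshold test, which is the part the paper sets out to replace.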

The problem is that neural network models, which are commonly used for this task, tend to be overconfident in their predictions, even when they're wrong. This overconfidence can lead to the model including incorrect pseudo-labels during self-training, which can actually hurt its performance.
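A small illustration of why the top softmax probability is a fragile confidence signal (this example is ours, not taken from the paper, and the logit values are hypothetical): multiplying the logits by a constant never changes which class is predicted, but it pushes the reported confidence toward 1.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def max_confidence(logits, scale):
    """Top softmax probability after multiplying every logit by `scale`."""
    return max(softmax([scale * z for z in logits]))

logits = [2.0, 1.0, 0.5]  # hypothetical class scores for one input
confidences = [max_confidence(logits, s) for s in (1, 5, 20)]
```

The predicted class is the same for all three scales, yet the confidence rises with the scale. A network whose logits grow large during training can therefore look very sure of itself whether or not the prediction is correct.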

The researchers in this paper propose a new way to measure the model's confidence, called T-similarity. Instead of just looking at the neural network's prediction probabilities, T-similarity looks at how diverse the predictions are from an ensemble of simpler linear models. The idea is that if those models disagree, the neural network shouldn't be too confident in its prediction.

Through both theoretical analysis and experiments on real-world datasets, the paper shows that this T-similarity confidence measure can lead to better performance in self-training compared to just using the neural network's own confidence scores. In other words, it helps the model be more cautious about which unlabeled examples it includes in the training process.

Technical Explanation

The key technical contribution of the paper is the introduction of a novel confidence measure called T-similarity. Whereas traditional self-training methods use the neural network's softmax prediction probabilities as a confidence score, the authors propose using the diversity of predictions from an ensemble of linear classifiers instead.

Specifically, the T-similarity of an unlabeled example is calculated as the average pairwise similarity (inner product) between the probability vectors that the linear classifiers in the ensemble predict for that example. The intuition is that if the linear models disagree a lot on the example, the score is low, and the prediction for that example should not be trusted as a pseudo-label.
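A minimal sketch of such a score, assuming it is computed per input as the average pairwise dot product between the ensemble members' predicted probability vectors. The function name and the example probabilities are hypothetical, and the authors' repository should be treated as the reference for the exact definition.

```python
from itertools import combinations

def t_similarity(member_probs):
    """Average pairwise dot product between the members' probability
    vectors for a single input (one vector per ensemble member)."""
    pairs = list(combinations(member_probs, 2))
    if not pairs:
        raise ValueError("need at least two ensemble members")
    total = sum(sum(a * b for a, b in zip(p, q)) for p, q in pairs)
    return total / len(pairs)

# Three members that agree on the input versus three that do not.
agree = [[0.9, 0.1], [0.9, 0.1], [0.9, 0.1]]
disagree = [[0.9, 0.1], [0.1, 0.9], [0.5, 0.5]]

score_agree = t_similarity(agree)
score_disagree = t_similarity(disagree)
```

With probability vectors as inputs, the score lies between 0 and 1. It is high only when the members are both individually confident and in agreement, so disagreement among members lowers it even if each member is confident on its own.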

The authors provide a theoretical analysis of this T-similarity measure, studying stationary points and the relationship between the diversity of the linear models and their individual performance. In their setup, the ensemble is trained to fit the labeled data while being encouraged to make diverse predictions on the unlabeled data, which corresponds to lowering, rather than raising, the members' agreement on unlabeled examples.
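The trade-off between fitting labeled data and staying diverse on unlabeled data can be sketched as a single objective. This is an assumption-laden illustration rather than the paper's exact loss: the cross-entropy term, the weight `gamma`, its default value, and all function names are hypothetical choices.

```python
import math
from itertools import combinations

def cross_entropy(probs, label):
    """Negative log-probability assigned to the true class."""
    return -math.log(max(probs[label], 1e-12))

def pairwise_agreement(member_probs):
    """Average pairwise dot product between members' probability vectors."""
    pairs = list(combinations(member_probs, 2))
    return sum(sum(a * b for a, b in zip(p, q)) for p, q in pairs) / len(pairs)

def ensemble_objective(labeled_preds, labels, unlabeled_preds, gamma=1.0):
    """labeled_preds[i][m] and unlabeled_preds[j][m] hold the probability
    vector that member m predicts for labeled example i / unlabeled example j.
    Lower is better: fit the labels, but disagree on unlabeled inputs."""
    n_members = len(labeled_preds[0])
    supervised = sum(
        cross_entropy(member, y)
        for preds, y in zip(labeled_preds, labels)
        for member in preds
    ) / (len(labels) * n_members)
    agreement = sum(pairwise_agreement(p) for p in unlabeled_preds) / len(unlabeled_preds)
    return supervised + gamma * agreement

labeled = [[[0.9, 0.1], [0.8, 0.2]]]       # one labeled example, two members
labels = [0]
agreeing = [[[0.9, 0.1], [0.9, 0.1]]]      # members agree on the unlabeled input
diverse = [[[0.9, 0.1], [0.1, 0.9]]]       # members disagree on it

loss_agreeing = ensemble_objective(labeled, labels, agreeing)
loss_diverse = ensemble_objective(labeled, labels, diverse)
```

With identical performance on the labeled example, the ensemble that disagrees on the unlabeled input gets the lower objective value. Under this sketch, that is the sense in which diversity is rewarded during training.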

Empirically, the authors evaluate the proposed confidence measure on three different pseudo-labeling policies across a variety of classification datasets. They demonstrate that using T-similarity can lead to significant improvements in self-training performance compared to relying solely on the neural network's softmax probabilities.

Critical Analysis

The paper makes a compelling case for the benefits of the T-similarity confidence measure, from both a theoretical and a practical standpoint. The theoretical analysis provides useful insights into the properties of this new metric.

That said, the paper does not delve too deeply into the potential limitations or caveats of the approach. For example, the ensemble of linear models adds computational overhead compared to simply using the neural network's softmax probabilities. The authors also do not explore how the T-similarity measure might perform in settings with different types of data distributions or label noise.

Additionally, while the empirical results are promising, the paper could have provided more discussion on the failure cases or edge cases where the T-similarity approach might not be as effective. A more thorough analysis of the strengths and weaknesses of the method would help readers better understand its real-world applicability.

Overall, the research represents a valuable contribution to the field of semi-supervised learning. The T-similarity concept is an innovative way to address the overconfidence issue in self-training, and the positive experimental results suggest it is a promising direction for further exploration and refinement.

Conclusion

This paper introduces a novel confidence measure called T-similarity that can improve the performance of self-training, a popular semi-supervised learning technique. By looking at the diversity of predictions from an ensemble of linear classifiers, rather than just the neural network's own confidence scores, the T-similarity approach helps the model be more cautious about which unlabeled examples to include in the training process.

The theoretical analysis and empirical results demonstrate the benefits of this new confidence measure across a range of classification tasks. While the paper does not extensively cover the potential limitations of the method, it represents an important step forward in addressing the overconfidence problem in self-training.

As machine learning models continue to be applied to real-world problems with limited labeled data, innovations like the T-similarity approach will become increasingly valuable. This research highlights the importance of developing robust confidence measures to ensure the reliability and safety of semi-supervised learning systems.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Self-Training: A Survey

Massih-Reza Amini, Vasilii Feofanov, Loic Pauletto, Lies Hadjadj, Emilie Devijver, Yury Maximov

Semi-supervised algorithms aim to learn prediction functions from a small set of labeled observations and a large set of unlabeled observations. Because this framework is relevant in many applications, they have received a lot of interest in both academia and industry. Among the existing techniques, self-training methods have undoubtedly attracted greater attention in recent years. These models are designed to find the decision boundary on low density regions without making additional assumptions about the data distribution, and use the unsigned output score of a learned classifier, or its margin, as an indicator of confidence. The working principle of self-training algorithms is to learn a classifier iteratively by assigning pseudo-labels to the set of unlabeled training samples with a margin greater than a certain threshold. The pseudo-labeled examples are then used to enrich the labeled training data and to train a new classifier in conjunction with the labeled training set. In this paper, we present self-training methods for binary and multi-class classification; as well as their variants and two related approaches, namely consistency-based approaches and transductive learning. We examine the impact of significant self-training features on various methods, using different general and image classification benchmarks, and we discuss our ideas for future research in self-training. To the best of our knowledge, this is the first thorough and complete survey on this subject.

5/28/2024

cs.LG

On Pretraining Data Diversity for Self-Supervised Learning

Hasan Abed Al Kader Hammoud, Tuhin Das, Fabio Pizzati, Philip Torr, Adel Bibi, Bernard Ghanem

We explore the impact of training with more diverse datasets, characterized by the number of unique samples, on the performance of self-supervised learning (SSL) under a fixed computational budget. Our findings consistently demonstrate that increasing pretraining data diversity enhances SSL performance, albeit only when the distribution distance to the downstream data is minimal. Notably, even with an exceptionally large pretraining data diversity achieved through methods like web crawling or diffusion-generated data, among other ways, the distribution shift remains a challenge. Our experiments are comprehensive with seven SSL methods using large-scale datasets such as ImageNet and YFCC100M amounting to over 200 GPU days. Code and trained models will be available at https://github.com/hammoudhasan/DiversitySSL .

4/9/2024

cs.CV cs.AI cs.LG

SimPro: A Simple Probabilistic Framework Towards Realistic Long-Tailed Semi-Supervised Learning

Chaoqun Du, Yizeng Han, Gao Huang

Recent advancements in semi-supervised learning have focused on a more realistic yet challenging task: addressing imbalances in labeled data while the class distribution of unlabeled data remains both unknown and potentially mismatched. Current approaches in this sphere often presuppose rigid assumptions regarding the class distribution of unlabeled data, thereby limiting the adaptability of models to only certain distribution ranges. In this study, we propose a novel approach, introducing a highly adaptable framework, designated as SimPro, which does not rely on any predefined assumptions about the distribution of unlabeled data. Our framework, grounded in a probabilistic model, innovatively refines the expectation-maximization (EM) algorithm by explicitly decoupling the modeling of conditional and marginal class distributions. This separation facilitates a closed-form solution for class distribution estimation during the maximization phase, leading to the formulation of a Bayes classifier. The Bayes classifier, in turn, enhances the quality of pseudo-labels in the expectation phase. Remarkably, the SimPro framework not only comes with theoretical guarantees but also is straightforward to implement. Moreover, we introduce two novel class distributions broadening the scope of the evaluation. Our method showcases consistent state-of-the-art performance across diverse benchmarks and data distribution scenarios. Our code is available at https://github.com/LeapLabTHU/SimPro.

6/4/2024

cs.LG cs.CV

Advanced Detection of Source Code Clones via an Ensemble of Unsupervised Similarity Measures

Jorge Martinez-Gil

The capability of accurately determining code similarity is crucial in many tasks related to software development. For example, it might be essential to identify code duplicates for performing software maintenance. This research introduces a novel ensemble learning approach for code similarity assessment, combining the strengths of multiple unsupervised similarity measures. The key idea is that the strengths of a diverse set of similarity measures can complement each other and mitigate individual weaknesses, leading to improved performance. Preliminary results show that while Transformers-based CodeBERT and its variant GraphCodeBERT are undoubtedly the best option in the presence of abundant training data, in the case of specific small datasets (up to 500 samples), our ensemble achieves similar results, without prejudice to the interpretability of the resulting solution, and with a much lower associated carbon footprint due to training. The source code of this novel approach can be downloaded from https://github.com/jorge-martinez-gil/ensemble-codesim.

5/6/2024

cs.SE cs.AI