Self-Training: A Survey

2202.12040

Published 5/28/2024 by Massih-Reza Amini, Vasilii Feofanov, Loic Pauletto, Lies Hadjadj, Emilie Devijver, Yury Maximov

Abstract

Semi-supervised algorithms aim to learn prediction functions from a small set of labeled observations and a large set of unlabeled observations. Because this framework is relevant in many applications, they have received a lot of interest in both academia and industry. Among the existing techniques, self-training methods have undoubtedly attracted greater attention in recent years. These models are designed to find the decision boundary on low density regions without making additional assumptions about the data distribution, and use the unsigned output score of a learned classifier, or its margin, as an indicator of confidence. The working principle of self-training algorithms is to learn a classifier iteratively by assigning pseudo-labels to the set of unlabeled training samples with a margin greater than a certain threshold. The pseudo-labeled examples are then used to enrich the labeled training data and to train a new classifier in conjunction with the labeled training set. In this paper, we present self-training methods for binary and multi-class classification; as well as their variants and two related approaches, namely consistency-based approaches and transductive learning. We examine the impact of significant self-training features on various methods, using different general and image classification benchmarks, and we discuss our ideas for future research in self-training. To the best of our knowledge, this is the first thorough and complete survey on this subject.

Overview

Semi-supervised learning aims to make predictions using a small set of labeled data and a larger set of unlabeled data.
Self-training is a popular semi-supervised technique that iteratively assigns "pseudo-labels" to unlabeled data and uses them to train a new model.
This paper provides a comprehensive survey of self-training methods for binary and multi-class classification, as well as related techniques like consistency-based approaches and transductive learning.

Plain English Explanation

In many real-world situations, we have access to a large amount of data, but only a small portion of it is labeled with the correct answers. Semi-supervised learning algorithms try to make use of this unlabeled data to improve the performance of a machine learning model.

One popular semi-supervised technique is called "self-training." The model first makes predictions on the unlabeled data and then selects the predictions it is most confident about (based on the model's "margin" or output score). These confident predictions are used as "pseudo-labels" to augment the original labeled dataset, and the model is retrained on the combined dataset. The process repeats iteratively, with the aim that the model improves as it learns from the pseudo-labeled examples.

The paper provides a thorough overview of self-training methods, including both binary and multi-class classification, as well as related approaches like consistency-based methods and transductive learning. The authors also examine how different design choices, such as the choice of pseudo-labels and ensemble diversity, can impact the performance of these self-training algorithms.

Technical Explanation

The paper presents a comprehensive survey of self-training methods for both binary and multi-class classification tasks. Self-training is a semi-supervised learning technique that iteratively assigns "pseudo-labels" to unlabeled data and uses them to train a new classifier.

The authors first describe the working principle of self-training algorithms. These models learn a classifier by assigning pseudo-labels to unlabeled samples that have a margin (output score or confidence) greater than a certain threshold. The pseudo-labeled examples are then used to enrich the labeled training data, and a new classifier is trained on the combined dataset.
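The working principle described above can be sketched as a short loop. This is a minimal illustration under stated assumptions, not the authors' implementation: the base classifier is a hypothetical nearest-centroid model, confidence is taken to be the largest softmax score over negative distances (one possible stand-in for the margin), and the names `fit_centroids`, `predict_proba`, `self_train`, `threshold`, and `max_iter` are all invented for this example.

```python
import math


def fit_centroids(X, y):
    """Return the mean point (centroid) of each class label."""
    sums, counts = {}, {}
    for x, label in zip(X, y):
        acc = sums.setdefault(label, [0.0] * len(x))
        for i, v in enumerate(x):
            acc[i] += v
        counts[label] = counts.get(label, 0) + 1
    return {c: [s / counts[c] for s in sums[c]] for c in sums}


def predict_proba(centroids, x):
    """Softmax over negative distances: a closer centroid gets a higher score."""
    scores = {c: -math.dist(x, mu) for c, mu in centroids.items()}
    top = max(scores.values())
    exp = {c: math.exp(s - top) for c, s in scores.items()}
    total = sum(exp.values())
    return {c: e / total for c, e in exp.items()}


def self_train(X_lab, y_lab, X_unlab, threshold=0.8, max_iter=10):
    """Iteratively pseudo-label confident unlabeled points and refit.

    Assumption: "confidence" is the top class probability; the survey
    discusses margin-based criteria more generally.
    """
    X_train, y_train = list(X_lab), list(y_lab)
    pool = list(X_unlab)
    centroids = fit_centroids(X_train, y_train)
    for _ in range(max_iter):
        remaining = []
        added = 0
        for x in pool:
            proba = predict_proba(centroids, x)
            label = max(proba, key=proba.get)
            if proba[label] >= threshold:
                # Confident prediction: keep it as a pseudo-label.
                X_train.append(x)
                y_train.append(label)
                added += 1
            else:
                remaining.append(x)
        if added == 0:
            break  # nothing new to learn from; stop early
        pool = remaining
        # Retrain on labeled plus pseudo-labeled examples.
        centroids = fit_centroids(X_train, y_train)
    return centroids, pool
```

Points that never exceed the threshold stay in the returned pool unlabeled. A fixed threshold is only one possible selection rule; lowering it admits more pseudo-labels at the cost of more label noise.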

The paper also covers two approaches related to self-training: consistency-based methods and transductive learning. Consistency-based methods aim to ensure that the model produces similar outputs for perturbed versions of the same input, while transductive learning aims to predict labels for the specific unlabeled examples available during training rather than to learn a general rule for unseen data.
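The consistency idea can be made concrete with a small penalty term computed on unlabeled data. The sketch below is an assumption-laden illustration rather than a method taken from the survey: the model is a hypothetical logistic classifier, the perturbation is Gaussian feature noise, and the penalty is a mean squared difference. The names `predict`, `perturb`, `consistency_loss`, and `noise_scale` are invented for this example.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict(weights, bias, x):
    """Hypothetical linear model: probability of the positive class."""
    return sigmoid(sum(w * v for w, v in zip(weights, x)) + bias)


def perturb(x, noise_scale, rng):
    """Add Gaussian noise to each feature (one possible perturbation)."""
    return [v + rng.gauss(0.0, noise_scale) for v in x]


def consistency_loss(weights, bias, X_unlab, noise_scale=0.1, seed=0):
    """Mean squared gap between predictions on clean and perturbed inputs.

    No labels are needed: the loss only asks the model to agree with itself.
    """
    if not X_unlab:
        return 0.0
    rng = random.Random(seed)
    total = 0.0
    for x in X_unlab:
        clean = predict(weights, bias, x)
        noisy = predict(weights, bias, perturb(x, noise_scale, rng))
        total += (clean - noisy) ** 2
    return total / len(X_unlab)
```

In practice a term like this would be added to the supervised loss on the labeled examples, so that the model is pushed toward predictions that are stable under small input changes.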

The authors examine the impact of various self-training features, such as the choice of pseudo-labels and the use of ensemble diversity, on the performance of these methods. They evaluate the different techniques using a range of general and image classification benchmarks.

Critical Analysis

The paper provides a thorough and insightful survey of self-training methods and related techniques. The authors have done a commendable job of covering the key ideas, design choices, and experimental results in a clear and comprehensive manner.

One potential limitation of the work is that it does not delve deeply into the theoretical underpinnings of self-training. While the authors discuss the general working principle, a more rigorous analysis of the convergence properties and optimality conditions of these algorithms could provide additional insights.

Additionally, the paper does not extensively explore the potential pitfalls and failure modes of self-training. The authors do mention the risk of "confirmation bias," where the model becomes overly confident in its initial predictions and fails to learn from the more challenging unlabeled examples, but a fuller discussion of such challenges and of potential remedies would further strengthen the survey.

Overall, this paper serves as an excellent starting point for researchers and practitioners interested in understanding the state-of-the-art in semi-supervised learning, particularly the active and rapidly evolving field of self-training algorithms.

Conclusion

This paper provides a comprehensive survey of self-training methods for semi-supervised learning, covering both binary and multi-class classification tasks, as well as related techniques like consistency-based approaches and transductive learning. The authors thoroughly examine the impact of various design choices on the performance of these algorithms, using a range of benchmark datasets.

The paper offers a valuable resource for researchers and practitioners working in the field of semi-supervised learning, highlighting the key ideas, strengths, and limitations of self-training methods. By synthesizing the existing literature and identifying areas for future research, the authors have contributed to the ongoing advancement of this important field.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Incremental Self-training for Semi-supervised Learning

Jifeng Guo, Zhulin Liu, Tong Zhang, C. L. Philip Chen

Semi-supervised learning provides a solution to reduce the dependency of machine learning on labeled data. As one of the efficient semi-supervised techniques, self-training (ST) has received increasing attention. Several advancements have emerged to address challenges associated with noisy pseudo-labels. Previous works on self-training acknowledge the importance of unlabeled data but have not delved into their efficient utilization, nor have they paid attention to the problem of high time consumption caused by iterative learning. This paper proposes Incremental Self-training (IST) for semi-supervised learning to fill these gaps. Unlike ST, which processes all data indiscriminately, IST processes data in batches and prioritizes assigning pseudo-labels to unlabeled samples with high certainty. Then, it processes the data around the decision boundary after the model is stabilized, enhancing classifier performance. Our IST is simple yet effective and fits existing self-training-based semi-supervised learning methods. We verify the proposed IST on five datasets and two types of backbone, effectively improving the recognition accuracy and learning speed. Significantly, it outperforms state-of-the-art competitors on three challenging image classification tasks.

4/22/2024

cs.LG

Self-Training for Sample-Efficient Active Learning for Text Classification with Pre-Trained Language Models

Christopher Schroder, Gerhard Heyer

Active learning is an iterative labeling process that is used to obtain a small labeled subset, despite the absence of labeled data, thereby enabling the training of a model for supervised tasks such as text classification. While active learning has made considerable progress in recent years due to improvements provided by pre-trained language models, there is untapped potential in the often neglected unlabeled portion of the data, although it is available in considerably larger quantities than the usually small set of labeled data. Here we investigate how self-training, a semi-supervised approach where a model is used to obtain pseudo-labels from the unlabeled data, can be used to improve the efficiency of active learning for text classification. Starting with an extensive reproduction of four previous self-training approaches, some of which are evaluated for the first time in the context of active learning or natural language processing, we devise HAST, a new and effective self-training strategy, which is evaluated on four text classification benchmarks, on which it outperforms the reproduced self-training approaches and reaches classification results comparable to previous experiments for three out of four datasets, using only 25% of the data.

6/14/2024

cs.CL cs.AI cs.LG

A review on discriminative self-supervised learning methods

Nikolaos Giakoumoglou, Tania Stathaki

In the field of computer vision, self-supervised learning has emerged as a method to extract robust features from unlabeled data, where models derive labels autonomously from the data itself, without the need for manual annotation. This paper provides a comprehensive review of discriminative approaches of self-supervised learning within the domain of computer vision, examining their evolution and current status. Through an exploration of various methods including contrastive, self-distillation, knowledge distillation, feature decorrelation, and clustering techniques, we investigate how these approaches leverage the abundance of unlabeled data. Finally, we compare self-supervised learning methods on the standard ImageNet classification benchmark.

5/9/2024

cs.CV cs.AI

Self-training via Metric Learning for Source-Free Domain Adaptation of Semantic Segmentation

Ibrahim Batuhan Akkaya, Ugur Halici

Unsupervised source-free domain adaptation methods aim to train a model for the target domain utilizing a pretrained source-domain model and unlabeled target-domain data, particularly when accessibility to source data is restricted due to intellectual property or privacy concerns. Traditional methods usually use self-training with pseudo-labeling, which is often subjected to thresholding based on prediction confidence. However, such thresholding limits the effectiveness of self-training due to insufficient supervision. This issue becomes more severe in a source-free setting, where supervision comes solely from the predictions of the pre-trained source model. In this study, we propose a novel approach by incorporating a mean-teacher model, wherein the student network is trained using all predictions from the teacher network. Instead of employing thresholding on predictions, we introduce a method to weight the gradients calculated from pseudo-labels based on the reliability of the teacher's predictions. To assess reliability, we introduce a novel approach using proxy-based metric learning. Our method is evaluated in synthetic-to-real and cross-city scenarios, demonstrating superior performance compared to existing state-of-the-art methods.

4/10/2024

cs.CV cs.LG