Incremental Self-training for Semi-supervised Learning

2404.12398

Published 4/22/2024 by Jifeng Guo, Zhulin Liu, Tong Zhang, C. L. Philip Chen

Abstract

Semi-supervised learning provides a solution to reduce the dependency of machine learning on labeled data. As one of the efficient semi-supervised techniques, self-training (ST) has received increasing attention. Several advancements have emerged to address challenges associated with noisy pseudo-labels. Previous works on self-training acknowledge the importance of unlabeled data but have not delved into their efficient utilization, nor have they paid attention to the problem of high time consumption caused by iterative learning. This paper proposes Incremental Self-training (IST) for semi-supervised learning to fill these gaps. Unlike ST, which processes all data indiscriminately, IST processes data in batches and prioritizes assigning pseudo-labels to unlabeled samples with high certainty. Then, it processes the data around the decision boundary after the model is stabilized, enhancing classifier performance. Our IST is simple yet effective and fits existing self-training-based semi-supervised learning methods. We verify the proposed IST on five datasets and two types of backbone, effectively improving the recognition accuracy and learning speed. Significantly, it outperforms state-of-the-art competitors on three challenging image classification tasks.

Overview

This paper proposes Incremental Self-training (IST), an incremental variant of self-training for semi-supervised learning.
IST processes unlabeled data in batches: it first assigns pseudo-labels to the samples the model is most certain about, then handles the samples near the decision boundary once the model has stabilized.
The authors evaluate IST on five datasets with two types of backbone and report gains in recognition accuracy and learning speed, including results that outperform state-of-the-art competitors on three image classification tasks.

Plain English Explanation

The paper describes a new way to train machine learning models using both labeled and unlabeled data. Typically, machine learning models are trained on labeled data, where the correct answers are provided. However, labeled data can be expensive and time-consuming to obtain.

Semi-supervised learning aims to address this by also using unlabeled data, which is often more readily available. The key idea is to use the model's own predictions on the unlabeled data to help guide the training process.

The authors' approach, called "incremental self-training," takes this a step further. Instead of using the unlabeled data all at once, it gradually incorporates the unlabeled data into the training process, iteratively updating the model and using its increasingly reliable predictions to expand the training set.

This allows the model to learn more effectively from the unlabeled data, as it can leverage the information from the labeled data to make better use of the unlabeled data at each step. The authors show that this leads to improved performance on a variety of benchmark tasks compared to standard semi-supervised approaches.

Technical Explanation

The key innovation of the proposed approach is the incremental nature of the self-training process. Rather than using all the unlabeled data at once, the method iteratively selects a subset of the unlabeled data based on the model's confidence in its predictions.
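As a rough illustration, a confidence-based selection rule of this kind is often written as follows. The notation is an assumption made for this summary, not taken from the paper: $p_\theta(c \mid x)$ is the model's predicted probability of class $c$, $U_t$ is the batch of unlabeled samples considered at round $t$, and $\tau$ is a hypothetical confidence threshold. The paper may define certainty differently.

```latex
\hat{y}_i = \arg\max_{c} \, p_\theta(c \mid x_i), \qquad
S_t = \bigl\{\, (x_i, \hat{y}_i) \;:\; x_i \in U_t,\ \max_{c} \, p_\theta(c \mid x_i) \ge \tau \,\bigr\}
```

Under this reading, $S_t$ is the set of pseudo-labeled samples added to the training set at round $t$, and samples that fall below $\tau$ are left for a later round.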

At each iteration, the model is trained on the labeled data and the high-confidence unlabeled data selected from the previous iteration. The model's predictions are then used to select a new set of high-confidence unlabeled examples to add to the training set for the next iteration.
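The loop described above can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the authors' implementation: a nearest-centroid classifier stands in for the neural backbone, and the function names, the `threshold` and `batch_size` parameters, and the two-stage split (confident samples first, deferred near-boundary samples last) are illustrative choices based only on the abstract's description.

```python
import math
import random


def fit_centroids(points, labels):
    """Fit the stand-in classifier: one mean point (centroid) per class."""
    sums = {}
    for (x, y), c in zip(points, labels):
        sx, sy, n = sums.get(c, (0.0, 0.0, 0))
        sums[c] = (sx + x, sy + y, n + 1)
    return {c: (sx / n, sy / n) for c, (sx, sy, n) in sums.items()}


def predict_proba(centroids, point):
    """Class probabilities: softmax over negative squared distances."""
    scores = {
        c: -((point[0] - cx) ** 2 + (point[1] - cy) ** 2)
        for c, (cx, cy) in centroids.items()
    }
    top = max(scores.values())  # subtract the max for numerical stability
    exps = {c: math.exp(s - top) for c, s in scores.items()}
    total = sum(exps.values())
    return {c: e / total for c, e in exps.items()}


def incremental_self_training(labeled, labels, unlabeled,
                              threshold=0.9, batch_size=20):
    """Hypothetical two-stage loop; threshold and batch_size are assumptions."""
    train_x, train_y = list(labeled), list(labels)
    deferred = []  # low-confidence samples, assumed to lie near the boundary

    # Stage 1: walk through the unlabeled data in batches, refit the model
    # before each batch, and pseudo-label only the confident samples.
    for start in range(0, len(unlabeled), batch_size):
        centroids = fit_centroids(train_x, train_y)
        for point in unlabeled[start:start + batch_size]:
            proba = predict_proba(centroids, point)
            guess = max(proba, key=proba.get)
            if proba[guess] >= threshold:
                train_x.append(point)
                train_y.append(guess)
            else:
                deferred.append(point)

    # Stage 2: once the model has seen all confident samples, label the
    # deferred samples with the refitted model.
    centroids = fit_centroids(train_x, train_y)
    for point in deferred:
        proba = predict_proba(centroids, point)
        train_x.append(point)
        train_y.append(max(proba, key=proba.get))

    return fit_centroids(train_x, train_y), train_x, train_y


# Toy usage: two Gaussian blobs and a single labeled point per class.
rng = random.Random(0)
pool = [(rng.gauss(-2, 1), rng.gauss(0, 1)) for _ in range(100)]
pool += [(rng.gauss(2, 1), rng.gauss(0, 1)) for _ in range(100)]
rng.shuffle(pool)
model, xs, ys = incremental_self_training([(-2.0, 0.0), (2.0, 0.0)], [0, 1], pool)
print(len(xs), model)
```

Deferring the low-confidence samples means the pseudo-labels assigned in the second stage come from a model that has already absorbed every confident sample, which is the intuition behind handling near-boundary data last.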

Because each round starts from a model already fitted to the labeled data and the earlier high-confidence samples, the pseudo-labels for later, harder samples come from a more stable classifier. According to the abstract, the authors verify the approach on five datasets with two types of backbone and report state-of-the-art results on three challenging image classification tasks.

Critical Analysis

The authors provide a thorough analysis of their incremental self-training approach, including extensive experiments and comparisons to other semi-supervised learning methods. The results appear to be strong, with the incremental approach outperforming standard semi-supervised techniques across multiple datasets and tasks.

However, the paper does not discuss potential limitations or caveats of the method. For example, the performance of the approach may depend on the quality and diversity of the unlabeled data, and it's not clear how robust the method would be to noisy or outlier predictions from the model.

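The computational cost of the iterative training process also deserves scrutiny. The abstract names the high time consumption of iterative learning as one of the gaps IST targets and reports improved learning speed, but the model is still updated repeatedly, so it is worth checking how the cost scales on large-scale problems.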

Further research could explore ways to improve the efficiency of the incremental approach, such as by selectively updating only parts of the model or by incorporating active learning strategies to reduce the number of model updates required. Overall, the paper presents a promising new direction for semi-supervised learning, but additional investigation into its limitations and potential improvements would be valuable.

Conclusion

This paper introduces a novel incremental self-training approach for semi-supervised learning, which gradually expands the training set by incorporating the model's high-confidence predictions on unlabeled data. The authors demonstrate the effectiveness of this method on various benchmark tasks, showing improvements over standard semi-supervised techniques.

The incremental nature of the approach allows the model to learn more effectively from the unlabeled data by leveraging the information from the labeled data at each step. While the paper does not address potential limitations, the proposed method represents an exciting advancement in the field of semi-supervised learning with promising applications in domains where labeled data is scarce.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Self-Training: A Survey

Massih-Reza Amini, Vasilii Feofanov, Loic Pauletto, Lies Hadjadj, Emilie Devijver, Yury Maximov

Semi-supervised algorithms aim to learn prediction functions from a small set of labeled observations and a large set of unlabeled observations. Because this framework is relevant in many applications, they have received a lot of interest in both academia and industry. Among the existing techniques, self-training methods have undoubtedly attracted greater attention in recent years. These models are designed to find the decision boundary on low density regions without making additional assumptions about the data distribution, and use the unsigned output score of a learned classifier, or its margin, as an indicator of confidence. The working principle of self-training algorithms is to learn a classifier iteratively by assigning pseudo-labels to the set of unlabeled training samples with a margin greater than a certain threshold. The pseudo-labeled examples are then used to enrich the labeled training data and to train a new classifier in conjunction with the labeled training set. In this paper, we present self-training methods for binary and multi-class classification; as well as their variants and two related approaches, namely consistency-based approaches and transductive learning. We examine the impact of significant self-training features on various methods, using different general and image classification benchmarks, and we discuss our ideas for future research in self-training. To the best of our knowledge, this is the first thorough and complete survey on this subject.

5/28/2024

cs.LG

A replica analysis of Self-Training of Linear Classifier

Takashi Takahashi

Self-training (ST) is a simple yet effective semi-supervised learning method. However, why and how ST improves generalization performance by using potentially erroneous pseudo-labels is still not well understood. To deepen the understanding of ST, we derive and analyze a sharp characterization of the behavior of iterative ST when training a linear classifier by minimizing the ridge-regularized convex loss on binary Gaussian mixtures, in the asymptotic limit where input dimension and data size diverge proportionally. The results show that ST improves generalization in different ways depending on the number of iterations. When the number of iterations is small, ST improves generalization performance by fitting the model to relatively reliable pseudo-labels and updating the model parameters by a large amount at each iteration. This suggests that ST works intuitively. On the other hand, with many iterations, ST can gradually improve the direction of the classification plane by updating the model parameters incrementally, using soft labels and small regularization. It is argued that this is because the small update of ST can extract information from the data in an almost noiseless way. However, in the presence of label imbalance, the generalization performance of ST underperforms supervised learning with true labels. To overcome this, two heuristics are proposed to enable ST to achieve nearly compatible performance with supervised learning even with significant label imbalance.

5/8/2024

stat.ML cs.LG

Rethinking Self-training for Semi-supervised Landmark Detection: A Selection-free Approach

Haibo Jin, Haoxuan Che, Hao Chen

Self-training is a simple yet effective method for semi-supervised learning, during which pseudo-label selection plays an important role for handling confirmation bias. Despite its popularity, applying self-training to landmark detection faces three problems: 1) The selected confident pseudo-labels often contain data bias, which may hurt model performance; 2) It is not easy to decide a proper threshold for sample selection as the localization task can be sensitive to noisy pseudo-labels; 3) coordinate regression does not output confidence, making selection-based self-training infeasible. To address the above issues, we propose Self-Training for Landmark Detection (STLD), a method that does not require explicit pseudo-label selection. Instead, STLD constructs a task curriculum to deal with confirmation bias, which progressively transitions from more confident to less confident tasks over the rounds of self-training. Pseudo pretraining and shrink regression are two essential components for such a curriculum, where the former is the first task of the curriculum for providing a better model initialization and the latter is further added in the later rounds to directly leverage the pseudo-labels in a coarse-to-fine manner. Experiments on three facial and one medical landmark detection benchmark show that STLD outperforms the existing methods consistently in both semi- and omni-supervised settings.

4/9/2024

cs.CV

Self-Training for Sample-Efficient Active Learning for Text Classification with Pre-Trained Language Models

Christopher Schroder, Gerhard Heyer

Active learning is an iterative labeling process that is used to obtain a small labeled subset, despite the absence of labeled data, thereby enabling the training of a model for supervised tasks such as text classification. While active learning has made considerable progress in recent years due to improvements provided by pre-trained language models, there is untapped potential in the often neglected unlabeled portion of the data, although it is available in considerably larger quantities than the usually small set of labeled data. Here we investigate how self-training, a semi-supervised approach where a model is used to obtain pseudo-labels from the unlabeled data, can be used to improve the efficiency of active learning for text classification. Starting with an extensive reproduction of four previous self-training approaches, some of which are evaluated for the first time in the context of active learning or natural language processing, we devise HAST, a new and effective self-training strategy, which is evaluated on four text classification benchmarks, on which it outperforms the reproduced self-training approaches and reaches classification results comparable to previous experiments for three out of four datasets, using only 25% of the data.

6/14/2024

cs.CL cs.AI cs.LG