Published 4/24/2024 by Thu Nguyen, Tuan L. Vo, Pål Halvorsen, Michael A. Riegler
Missing data is a common problem in practical settings. Various imputation methods have been developed to deal with missing data. However, even though the label is usually available in the training data, the common practice of imputation usually only relies on the input and ignores the label. In this work, we illustrate how stacking the label into the input can significantly improve the imputation of the input. In addition, we propose a classification strategy that initializes the predicted test label with missing values and stacks the label with the input for imputation. This allows imputing the label and the input at the same time. Also, the technique is capable of handling training data with missing labels without any prior imputation and is applicable to continuous, categorical, or mixed-type data. Experiments show promising results in terms of accuracy.

This paper presents an approach for imputing missing data that makes use of the training labels, together with a classification strategy based on label imputation. The key idea is to stack the available labels into the input so that the imputer can exploit them when filling in missing values, which can be particularly useful in scenarios with high rates of missing data.

Related Works

The paper situates its work within the broader context of missing data imputation techniques. It discusses how traditional imputation methods, such as mean imputation and multiple imputation, may not be adequate when dealing with complex, high-dimensional datasets. The authors also highlight the potential limitations of iterative graph-based imputation and the need for approaches that can handle different missing data mechanisms.

Preliminary: missForest algorithm

The paper builds upon the missForest algorithm, which is a random forest-based method for imputing missing values in mixed-type datasets. The authors explain how missForest works and how it can be extended to leverage available training labels for improved imputation.
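As a rough illustration of the iterative loop behind missForest, the sketch below starts from a column-mean fill, then repeatedly re-predicts each column's missing entries from the other columns, and stops once the change between successive imputations no longer shrinks. It is a simplified sketch under stated assumptions, not the authors' implementation: it handles numeric columns only, marks missing values with None, assumes every column has at least one observed value, and swaps in a small k-nearest-neighbour predictor (the hypothetical helper knn_predict) where missForest would fit a random forest. The function names and the parameters k and max_iter are illustrative.

```python
import math


def knn_predict(train_X, train_y, query, k):
    """Stand-in for the random forest: average the targets of the k nearest rows."""
    ranked = sorted(zip((math.dist(r, query) for r in train_X), train_y))
    nearest = [y for _, y in ranked[:k]]
    return sum(nearest) / len(nearest)


def iterative_impute(X, k=3, max_iter=10):
    """missForest-style loop on a list of numeric rows; None marks a missing cell."""
    n, p = len(X), len(X[0])
    missing = [[v is None for v in row] for row in X]

    # Step 1: initial guess, the mean of the observed values in each column.
    filled = [row[:] for row in X]
    for j in range(p):
        observed = [X[i][j] for i in range(n) if not missing[i][j]]
        mean = sum(observed) / len(observed)
        for i in range(n):
            if missing[i][j]:
                filled[i][j] = mean

    # Step 2: visit columns from the least to the most missing values.
    order = sorted(range(p), key=lambda j: sum(missing[i][j] for i in range(n)))

    def predictors(M, i, j):
        # All columns of row i except the one currently being imputed.
        return [M[i][c] for c in range(p) if c != j]

    prev_change = math.inf
    for _ in range(max_iter):
        new = [row[:] for row in filled]
        for j in order:
            miss_rows = [i for i in range(n) if missing[i][j]]
            if not miss_rows:
                continue
            obs_rows = [i for i in range(n) if not missing[i][j]]
            train_X = [predictors(new, i, j) for i in obs_rows]
            train_y = [X[i][j] for i in obs_rows]
            for i in miss_rows:
                new[i][j] = knn_predict(train_X, train_y, predictors(new, i, j), k)

        # Step 3: stop when the imputed cells stop moving closer together,
        # keeping the previous imputation in that case.
        change = sum(
            (new[i][j] - filled[i][j]) ** 2
            for i in range(n) for j in range(p) if missing[i][j]
        )
        if change >= prev_change:
            break
        filled, prev_change = new, change
    return filled
```

Calling iterative_impute(rows) on a list of equal-length rows returns a new list with every None filled in, and leaves the input untouched.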

Imputation using training labels and classification via label imputation

The core of the paper's contribution is the proposed approach, which pairs missForest-style imputation with label information. For imputation, the training labels are stacked with the input features as an additional column, so the imputer can use them when filling in missing feature values. For classification, the labels of the test samples are initialized as missing values and stacked with the input in the same way. The imputation step then fills in the features and the labels at the same time, and the imputed test labels serve as the predictions.
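The idea of stacking the label with the input, with test labels initialized as missing, can be sketched in a few lines. This is an illustration under stated assumptions rather than the paper's method: a simple single-pass k-nearest-neighbour imputer (the hypothetical knn_impute) is swapped in for missForest, features are assumed numeric and on comparable scales, None marks a missing value, and every column is assumed to have at least one observed value in some other comparable row. All function names and the parameter k are illustrative.

```python
import math
from collections import Counter


def _distance(a, b, label_col):
    """Mean discrepancy over the columns observed in both rows."""
    total, shared = 0.0, 0
    for c, (u, v) in enumerate(zip(a, b)):
        if u is None or v is None:
            continue
        # Labels contribute a 0/1 mismatch, numeric features a squared difference.
        total += float(u != v) if c == label_col else (u - v) ** 2
        shared += 1
    return total / shared if shared else math.inf


def knn_impute(Z, label_col, k=3):
    """Fill each missing cell from the k nearest rows that observe that column."""
    out = [row[:] for row in Z]
    for i, row in enumerate(Z):
        for j, value in enumerate(row):
            if value is not None:
                continue
            donors = sorted(
                (_distance(row, other, label_col), idx)
                for idx, other in enumerate(Z)
                if idx != i and other[j] is not None
            )[:k]
            values = [Z[idx][j] for _, idx in donors]
            if j == label_col:
                out[i][j] = Counter(values).most_common(1)[0][0]  # majority vote
            else:
                out[i][j] = sum(values) / len(values)  # neighbour mean
    return out


def classify_by_label_imputation(X_train, y_train, X_test, k=3):
    """Stack labels as an extra column, set test labels to missing, impute everything."""
    label_col = len(X_train[0])
    stacked = [x + [y] for x, y in zip(X_train, y_train)]
    stacked += [x + [None] for x in X_test]  # test labels start out missing
    imputed = knn_impute(stacked, label_col, k)

    n_train = len(X_train)
    X_train_imp = [r[:label_col] for r in imputed[:n_train]]
    X_test_imp = [r[:label_col] for r in imputed[n_train:]]
    y_pred = [r[label_col] for r in imputed[n_train:]]  # imputed labels = predictions
    return X_train_imp, X_test_imp, y_pred
```

Because the label is just another column of the stacked matrix, a training row whose label is None passes through the same imputation step without any separate preprocessing, and the imputed test labels double as the classifier's output.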

The authors provide a detailed technical explanation of the proposed method, including the algorithm steps and the underlying intuition. They also discuss the potential advantages of this approach, such as handling training data with missing labels without any prior imputation and applying to continuous, categorical, or mixed-type data.

Evaluation and Experiments

The paper presents an extensive evaluation of the proposed method using both synthetic and real-world datasets. The authors compare the performance of their approach to other state-of-the-art imputation techniques, demonstrating its effectiveness in terms of various evaluation metrics.


In conclusion, this paper introduces a novel approach for imputing missing data by leveraging available training labels and classification. The proposed method shows promising results in improving imputation accuracy, particularly in scenarios with high rates of missing data. The authors suggest that this approach could have significant practical applications in fields such as healthcare, finance, and social sciences, where missing data is a common challenge.

