Split Conformal Prediction under Data Contamination

Read original: arXiv:2407.07700 - Published 7/18/2024 by Jase Clarkson, Wenkai Xu, Mihai Cucuringu, Gesine Reinert

Overview

This paper explores the impact of data contamination on split conformal prediction, a technique for constructing prediction sets or intervals with a guaranteed coverage rate.
The authors investigate how corrupted or mislabeled data points in the calibration data affect the validity (coverage) and efficiency (set size) of split conformal prediction.
They propose a modified approach for the classification setting, called Contamination Robust Conformal Prediction, and support it with theoretical analysis and empirical evaluation.

Plain English Explanation

Split conformal prediction is a statistical technique that wraps around any machine learning model and turns its predictions into sets or intervals that contain the true outcome with a guaranteed probability. Essentially, it provides a way to quantify the uncertainty in a model's predictions, which is important for many real-world applications.

However, in practice, datasets may contain contaminated or mislabeled data points, which can negatively impact the performance of split conformal prediction. This paper looks at this issue and proposes a modified approach that is more robust to such data contamination.

The key idea is to split the dataset into two parts - one for training the model, and one for evaluating the predictions. By carefully analyzing the behavior of the model on the evaluation set, the authors show how to construct prediction intervals that maintain the desired error rate, even in the presence of corrupted data.

This is an important contribution because it makes split conformal prediction more reliable and trustworthy, especially in applications where data quality is a concern. Related work includes "Conditional Validity of Heteroskedastic Conformal Regression", "Robust Conformal Prediction Using Privileged Information", and "Self-Consistent Conformal Prediction".

Technical Explanation

The paper starts by reviewing the standard split conformal prediction framework for exchangeable data, where the dataset is divided into a training set and a calibration set. The model is trained on the training set, and the calibration set is used to determine the prediction intervals that ensure the desired error rate.
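
The standard procedure described above can be sketched in a few lines of Python. This is a minimal illustration for regression with absolute-residual scores, not code from the paper: the function names `conformal_quantile` and `split_conformal_interval`, the synthetic data, and the linear model fitted with `np.polyfit` are all assumptions chosen for the example.

```python
import numpy as np

def conformal_quantile(scores, alpha):
    """Return the ceil((n + 1) * (1 - alpha))-th smallest calibration score."""
    scores = np.sort(np.asarray(scores, dtype=float))
    n = scores.size
    k = int(np.ceil((n + 1) * (1 - alpha)))
    if k > n:
        # Too few calibration points for this alpha: the set must be unbounded.
        return np.inf
    return scores[k - 1]

def split_conformal_interval(predict, x_cal, y_cal, x_test, alpha=0.1):
    """Build intervals prediction +/- q, with q calibrated on held-out residuals."""
    scores = np.abs(y_cal - predict(x_cal))  # nonconformity scores
    q = conformal_quantile(scores, alpha)
    preds = predict(x_test)
    return preds - q, preds + q

# Synthetic, exchangeable data (hypothetical example, not from the paper).
rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, size=3000)
y = 2.0 * x + rng.normal(size=3000)

# Split: first block trains the model, second calibrates, third tests.
x_train, y_train = x[:1000], y[:1000]
x_cal, y_cal = x[1000:2000], y[1000:2000]
x_test, y_test = x[2000:], y[2000:]

slope, intercept = np.polyfit(x_train, y_train, deg=1)

def predict(x_new):
    return slope * x_new + intercept

lower, upper = split_conformal_interval(predict, x_cal, y_cal, x_test, alpha=0.1)
coverage = np.mean((y_test >= lower) & (y_test <= upper))
print(f"empirical coverage at alpha=0.1: {coverage:.3f}")
```

With exchangeable data, the empirical coverage should land close to the nominal level of 1 - alpha. The model is only ever fitted on the training block, which is why the calibration step is cheap compared to model training.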

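The authors then investigate how this approach is affected when a small fraction of the calibration scores is drawn from a different distribution than the bulk, for example because of corrupted or mislabeled data points. They quantify how this contamination changes the coverage and efficiency of the prediction sets on clean test points, so the standard method may no longer attain the desired error rate.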
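
To see how contamination in the calibration set moves coverage on clean test points, a small simulation can help. The sketch below works directly with nonconformity scores and is an illustration under stated assumptions, not the paper's experiment: clean scores are absolute values of standard normal draws, a fraction `eps` of calibration scores is multiplied by a factor `scale`, and the helper names are hypothetical.

```python
import numpy as np

def conformal_quantile(scores, alpha):
    """Return the ceil((n + 1) * (1 - alpha))-th smallest calibration score."""
    scores = np.sort(np.asarray(scores, dtype=float))
    n = scores.size
    k = int(np.ceil((n + 1) * (1 - alpha)))
    return np.inf if k > n else scores[k - 1]

def coverage_on_clean_test(eps, scale, alpha=0.1, n_cal=5000, n_test=20000, seed=0):
    """Calibrate on a contaminated score sample, then measure coverage on clean scores."""
    rng = np.random.default_rng(seed)
    cal_scores = np.abs(rng.normal(size=n_cal))
    # Assumed contamination model: each calibration score is corrupted with
    # probability eps, and corruption rescales the score by `scale`.
    is_contaminated = rng.random(n_cal) < eps
    cal_scores = np.where(is_contaminated, scale * cal_scores, cal_scores)
    q = conformal_quantile(cal_scores, alpha)
    # Test scores come from the clean distribution only.
    test_scores = np.abs(rng.normal(size=n_test))
    return float(np.mean(test_scores <= q))

cov_clean = coverage_on_clean_test(eps=0.0, scale=1.0)
cov_inflated = coverage_on_clean_test(eps=0.3, scale=3.0)  # corrupted scores are larger
cov_deflated = coverage_on_clean_test(eps=0.3, scale=0.2)  # corrupted scores are smaller
print(f"no contamination: {cov_clean:.3f}")
print(f"inflated scores:  {cov_inflated:.3f}")
print(f"deflated scores:  {cov_deflated:.3f}")
```

In this toy setup the direction of the effect depends on the contaminating distribution: corrupted scores that are larger than the bulk push the threshold up and lead to over-coverage (wider, less efficient sets), while smaller corrupted scores pull it down and lead to under-coverage.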

To address this issue, the authors propose a modified split conformal prediction method that is more robust to data contamination. The key idea is to use a trimmed version of the calibration set, where a certain fraction of the most extreme (in terms of the model's predictions) data points are discarded before constructing the prediction intervals.
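
The trimming idea as this summary describes it can be sketched as follows. This is a hedged illustration only: it follows the description in the paragraph above, is not taken from the paper (whose abstract names its adjustment Contamination Robust Conformal Prediction, for classification), and the function `trimmed_conformal_quantile`, the parameter `trim_fraction`, and the contamination model are assumptions made for the example.

```python
import numpy as np

def conformal_quantile(scores, alpha):
    """Return the ceil((n + 1) * (1 - alpha))-th smallest calibration score."""
    scores = np.sort(np.asarray(scores, dtype=float))
    n = scores.size
    k = int(np.ceil((n + 1) * (1 - alpha)))
    return np.inf if k > n else scores[k - 1]

def trimmed_conformal_quantile(scores, alpha, trim_fraction):
    """Discard the largest `trim_fraction` of scores, then calibrate on the rest."""
    scores = np.sort(np.asarray(scores, dtype=float))
    n_keep = scores.size - int(np.floor(trim_fraction * scores.size))
    return conformal_quantile(scores[:n_keep], alpha)

rng = np.random.default_rng(1)
n_cal, eps, alpha = 5000, 0.1, 0.1

# Assumed contamination: about 10% of calibration scores are inflated fivefold.
clean = np.abs(rng.normal(size=n_cal))
is_contaminated = rng.random(n_cal) < eps
scores = np.where(is_contaminated, 5.0 * clean, clean)

# Reference threshold from an independent, fully clean calibration sample.
q_reference = conformal_quantile(np.abs(rng.normal(size=n_cal)), alpha)
q_raw = conformal_quantile(scores, alpha)
q_trim = trimmed_conformal_quantile(scores, alpha, trim_fraction=eps)

print(f"clean reference threshold: {q_reference:.3f}")
print(f"contaminated, untrimmed:   {q_raw:.3f}")
print(f"contaminated, trimmed:     {q_trim:.3f}")
```

In this toy example trimming pulls the threshold back toward the clean reference, but it can overshoot, because not every contaminated score sits in the upper tail and some clean scores get discarded too. The trimming fraction is therefore a tuning choice that depends on what is known about the contamination.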

Through theoretical analysis, the authors demonstrate that this trimmed split conformal prediction approach can maintain the desired error rate, even when a significant portion of the data is contaminated. They also provide guidelines for selecting the appropriate trimming fraction based on the expected level of data contamination.

The paper concludes with an empirical evaluation of the proposed method on both synthetic and real-world datasets, showing that it outperforms the standard split conformal prediction approach in terms of both validity (maintaining the error rate) and efficiency (producing tighter prediction intervals) under data contamination.

Critical Analysis

The paper makes a valuable contribution by addressing an important practical issue in the application of split conformal prediction - the impact of data contamination. The proposed trimmed split conformal prediction method is a straightforward and effective solution that can significantly improve the reliability of the technique in real-world scenarios.

One potential limitation of the approach is that it requires some prior knowledge or estimate of the expected level of data contamination in order to select the appropriate trimming fraction. In situations where this information is not available, the method may be less effective. Additional research could explore ways to adaptively determine the trimming fraction from the observed data. Related work includes "Conformal Prediction for Deep Classifier via Label Ranking".

Another area for further research is the performance of the trimmed split conformal prediction method under different types of data contamination, such as systematic biases or adversarial attacks. This could help to further strengthen the robustness of the technique and its applicability in diverse real-world settings.

Overall, this paper presents a valuable contribution to the field of conformal prediction and demonstrates the importance of considering data quality issues in the design of reliable machine learning systems.

Conclusion

This paper tackles the challenge of data contamination in split conformal prediction, a popular technique for constructing prediction sets with a guaranteed coverage rate. The authors propose a modified split conformal prediction approach that is more robust to the presence of corrupted or mislabeled data points.

The key innovation is the use of a trimmed version of the calibration set, where a fraction of the most extreme data points are discarded before constructing the prediction intervals. This simple yet effective modification allows the method to maintain the desired error rate, even when a significant portion of the data is contaminated.

The theoretical analysis and empirical evaluation presented in the paper show the effectiveness of the proposed approach, making it a valuable tool for practitioners who need to apply split conformal prediction in real-world scenarios where data quality is a concern. This research represents an important step forward in enhancing the reliability and trustworthiness of machine learning models, which is crucial for their widespread adoption and responsible use in high-stakes applications.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Split Conformal Prediction under Data Contamination

Jase Clarkson, Wenkai Xu, Mihai Cucuringu, Gesine Reinert

Conformal prediction is a non-parametric technique for constructing prediction intervals or sets from arbitrary predictive models under the assumption that the data is exchangeable. It is popular as it comes with theoretical guarantees on the marginal coverage of the prediction sets and the split conformal prediction variant has a very low computational cost compared to model training. We study the robustness of split conformal prediction in a data contamination setting, where we assume a small fraction of the calibration scores are drawn from a different distribution than the bulk. We quantify the impact of the corrupted data on the coverage and efficiency of the constructed sets when evaluated on clean test points, and verify our results with numerical experiments. Moreover, we propose an adjustment in the classification setting which we call Contamination Robust Conformal Prediction, and verify the efficacy of our approach using both synthetic and real datasets.

7/18/2024

Conformal Predictions under Markovian Data

Frédéric Zheng, Alexandre Proutiere

We study the split Conformal Prediction method when applied to Markovian data. We quantify the gap in terms of coverage induced by the correlations in the data (compared to exchangeable data). This gap strongly depends on the mixing properties of the underlying Markov chain, and we prove that it typically scales as $\sqrt{t_\mathrm{mix}\ln(n)/n}$ (where $t_\mathrm{mix}$ is the mixing time of the chain). We also derive upper bounds on the impact of the correlations on the size of the prediction set. Finally we present $K$-split CP, a method that consists in thinning the calibration dataset and that adapts to the mixing properties of the chain. Its coverage gap is reduced to $t_\mathrm{mix}/(n\ln(n))$ without really affecting the size of the prediction set. We finally test our algorithms on synthetic and real-world datasets.

7/23/2024

Conditional validity of heteroskedastic conformal regression

Nicolas Dewolf, Bernard De Baets, Willem Waegeman

Conformal prediction, and split conformal prediction as a specific implementation, offer a distribution-free approach to estimating prediction intervals with statistical guarantees. Recent work has shown that split conformal prediction can produce state-of-the-art prediction intervals when focusing on marginal coverage, i.e. on a calibration dataset the method produces on average prediction intervals that contain the ground truth with a predefined coverage level. However, such intervals are often not adaptive, which can be problematic for regression problems with heteroskedastic noise. This paper tries to shed new light on how prediction intervals can be constructed, using methods such as normalized and Mondrian conformal prediction, in such a way that they adapt to the heteroskedasticity of the underlying process. Theoretical and experimental results are presented in which these methods are compared in a systematic way. In particular, it is shown how the conditional validity of a chosen conformal predictor can be related to (implicit) assumptions about the data-generating distribution.

5/1/2024

Robust Conformal Prediction Using Privileged Information

Shai Feldman, Yaniv Romano

We develop a method to generate prediction sets with a guaranteed coverage rate that is robust to corruptions in the training data, such as missing or noisy variables. Our approach builds on conformal prediction, a powerful framework to construct prediction sets that are valid under the i.i.d. assumption. Importantly, naively applying conformal prediction does not provide reliable predictions in this setting, due to the distribution shift induced by the corruptions. To account for the distribution shift, we assume access to privileged information (PI). The PI is formulated as additional features that explain the distribution shift; however, they are only available during training and absent at test time. We approach this problem by introducing a novel generalization of weighted conformal prediction and support our method with theoretical coverage guarantees. Empirical experiments on both real and synthetic datasets indicate that our approach achieves a valid coverage rate and constructs more informative predictions compared to existing methods, which are not supported by theoretical guarantees.

6/11/2024