On Fixing the Right Problems in Predictive Analytics: AUC Is Not the Problem

2404.06989

Published 4/11/2024 by Ryan S. Baker, Nigel Bosch, Stephen Hutt, Andres F. Zambrano, Alex J. Bowers

Abstract

Recently, ACM FAccT published an article by Kwegyir-Aggrey and colleagues (2023), critiquing the use of AUC ROC in predictive analytics in several domains. In this article, we offer a critique of that article. Specifically, we highlight technical inaccuracies in that paper's comparison of metrics, mis-specification of the interpretation and goals of AUC ROC, the article's use of the accuracy metric as a gold standard for comparison to AUC ROC, and the article's application of critiques solely to AUC ROC for concerns that would apply to the use of any metric. We conclude with a re-framing of the very valid concerns raised in that article, and discuss how the use of AUC ROC can remain a valid and appropriate practice in a well-informed predictive analytics approach taking those concerns into account. We conclude by discussing the combined use of multiple metrics, including machine learning bias metrics, and AUC ROC's place in such an approach. Like broccoli, AUC ROC is healthy, but also like broccoli, researchers and practitioners in our field shouldn't eat a diet of only AUC ROC.

Overview

The paper responds to an ACM FAccT article by Kwegyir-Aggrey and colleagues (2023) that critiqued the use of AUC ROC (Area Under the Receiver Operating Characteristic Curve) in predictive analytics across several domains.
The authors highlight technical inaccuracies in that article's comparison of metrics, its mis-specification of the interpretation and goals of AUC ROC, and its use of accuracy as a gold standard for comparison.
The authors argue that the critiques applied to AUC ROC in the previous paper are applicable to the use of any metric, not just AUC ROC.
The authors propose a re-framing of the valid concerns raised, and discuss how the use of AUC ROC can remain a valid and appropriate practice in a well-informed predictive analytics approach.

Plain English Explanation

The paper discusses AUC ROC (Area Under the Receiver Operating Characteristic Curve), a metric used in predictive analytics to measure how well a model's predictions separate one class from another. An earlier paper had criticized the use of AUC ROC, and this paper responds to those criticisms.

The authors of this paper say that the previous paper contained technical inaccuracies in how it compared different metrics, and that it misunderstood what AUC ROC is actually meant to measure. They also object to the previous paper's use of another metric, accuracy, as a "gold standard" to compare against AUC ROC. Finally, they argue that the concerns raised about AUC ROC would apply to the use of any metric, not just AUC ROC.

The authors of this paper suggest a different way of looking at the concerns raised in the previous paper, and explain how AUC ROC can still be a useful and appropriate tool in predictive analytics, as long as it is used thoughtfully and in combination with other metrics, including metrics that measure bias in machine learning models.

The key idea is that just like broccoli, AUC ROC is a healthy tool, but researchers and practitioners shouldn't rely on it exclusively. A balanced "diet" of different performance metrics, including AUC ROC, is important for getting a complete picture of a model's performance.

Technical Explanation

The paper provides a critique of a previous article published in ACM FAccT that criticized the use of AUC ROC (Area Under the Receiver Operating Characteristic Curve) in predictive analytics across several domains.

The authors of this paper highlight several technical inaccuracies in the previous paper's comparison of metrics. They argue that the previous paper misspecified the interpretation and goals of AUC ROC, and incorrectly used the accuracy metric as a "gold standard" for comparison to AUC ROC.
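To make the distinction between the two metrics concrete, here is a minimal sketch that is not taken from the paper. It assumes the standard rank-based reading of AUC ROC (the probability that a randomly chosen positive case scores higher than a randomly chosen negative case) and compares it with accuracy at a fixed threshold. The toy labels, scores, and function names are hypothetical.

```python
from itertools import product


def auc_roc(labels, scores):
    """Fraction of positive/negative pairs ranked correctly (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC ROC needs at least one positive and one negative")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p, n in product(pos, neg)
    )
    return wins / (len(pos) * len(neg))


def accuracy(labels, scores, threshold=0.5):
    """Share of correct predictions after cutting scores at `threshold`."""
    preds = [1 if s >= threshold else 0 for s in scores]
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)


# Hypothetical toy data: six cases with model scores in [0, 1].
labels = [0, 0, 0, 1, 1, 1]
scores = [0.10, 0.30, 0.60, 0.55, 0.70, 0.90]

# Halving every score keeps the ranking but moves scores relative to 0.5.
squashed = [s / 2 for s in scores]

print("AUC ROC  original:", auc_roc(labels, scores))
print("AUC ROC  squashed:", auc_roc(labels, squashed))
print("accuracy original:", accuracy(labels, scores))
print("accuracy squashed:", accuracy(labels, squashed))
```

In this sketch AUC ROC depends only on how the cases are ranked, so rescaling the scores leaves it unchanged, while accuracy at the 0.5 cut-off changes because it depends on both the threshold and the score scale. The two metrics answer different questions, which is one reason treating either as a gold standard for the other is questionable.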

Furthermore, the authors contend that the critiques applied to AUC ROC in the previous paper are actually applicable to the use of any metric, not just AUC ROC. A related tension between AUC and accuracy is explored in "Schroedinger's Threshold: When the AUC doesn't predict Accuracy".

The authors propose a re-framing of the valid concerns raised in the previous article, and discuss how the use of AUC ROC can remain a valid and appropriate practice in a well-informed predictive analytics approach. They suggest the combined use of multiple metrics, including machine learning bias metrics like those discussed in "From Algorithms to Action: Improving Patient Care", and explain AUC ROC's place within such a holistic approach.
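As an illustration of reporting several metrics side by side, the sketch below computes AUC ROC per group alongside one simple bias-oriented quantity, the gap in true positive rate between groups at a fixed threshold. This is an assumption-laden example, not the paper's procedure: the data, the group labels "a" and "b", the 0.5 threshold, and the helper names are all hypothetical.

```python
from itertools import product


def auc_roc(labels, scores):
    """Fraction of positive/negative pairs ranked correctly (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC ROC needs at least one positive and one negative")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p, n in product(pos, neg)
    )
    return wins / (len(pos) * len(neg))


def true_positive_rate(labels, scores, threshold):
    """Share of actual positives whose score reaches the threshold."""
    pos_scores = [s for y, s in zip(labels, scores) if y == 1]
    return sum(s >= threshold for s in pos_scores) / len(pos_scores)


def report_by_group(labels, scores, groups, threshold=0.5):
    """Per-group AUC ROC and true positive rate at a shared threshold."""
    report = {}
    for g in sorted(set(groups)):
        idx = [i for i, x in enumerate(groups) if x == g]
        y = [labels[i] for i in idx]
        s = [scores[i] for i in idx]
        report[g] = {
            "auc_roc": auc_roc(y, s),
            "tpr": true_positive_rate(y, s, threshold),
        }
    return report


# Hypothetical data: group "b" is ranked just as well but scored lower overall.
labels = [1, 1, 0, 0, 1, 1, 0, 0]
scores = [0.90, 0.80, 0.30, 0.20, 0.45, 0.40, 0.20, 0.10]
groups = ["a", "a", "a", "a", "b", "b", "b", "b"]

overall_auc = auc_roc(labels, scores)
report = report_by_group(labels, scores, groups)
tpr_values = [m["tpr"] for m in report.values()]
tpr_gap = max(tpr_values) - min(tpr_values)

print("overall AUC ROC:", overall_auc)
for g, metrics in report.items():
    print(g, metrics)
print("TPR gap between groups:", tpr_gap)
```

In this toy case the ranking is perfect overall and within each group, yet a single shared threshold flags every positive in one group and none in the other. AUC ROC alone would not surface that disparity, and the gap metric alone would not show that the ranking is sound, which is the kind of complementarity a multi-metric report is meant to capture.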

Critical Analysis

The authors make a compelling case for the continued use of AUC ROC in predictive analytics, while acknowledging the valid concerns raised in the previous paper. They highlight technical inaccuracies in the prior work and argue that the critiques apply to the use of any metric, not just AUC ROC.

However, the authors could have delved deeper into the specific limitations and potential drawbacks of AUC ROC that were mentioned in the previous paper. For example, they could have addressed the concern that AUC ROC may not accurately reflect the real-world performance of a model, as discussed in papers like "Rethinking Uncertainty Estimation in Semantic Segmentation".

Additionally, the authors could have explored specific alternative metrics and how they might complement or improve upon the use of AUC ROC in certain contexts.

Overall, the authors provide a well-reasoned defense of AUC ROC, while acknowledging the need for a more comprehensive and balanced approach to performance evaluation in predictive analytics.

Conclusion

This paper offers a thoughtful response to the earlier critique of AUC ROC, highlighting technical inaccuracies and arguing that the concerns raised apply to the use of any metric, not just AUC ROC. The authors propose a re-framing of the valid concerns and discuss how AUC ROC can remain a useful tool in predictive analytics when used as part of a broader, well-informed approach.

The key takeaway is that AUC ROC, like broccoli, is a healthy metric that should not be the sole focus of a predictive analytics "diet." A balanced approach, incorporating multiple performance metrics and considerations of bias, is necessary to gain a comprehensive understanding of model performance and ensure responsible use of these technologies.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

A Closer Look at AUROC and AUPRC under Class Imbalance

Matthew B. A. McDermott (Harvard Medical School), Lasse Hyldig Hansen (Aarhus University), Haoran Zhang (Massachusetts Institute of Technology), Giovanni Angelotti (IRCCS Humanitas Research Hospital), Jack Gallifant (Massachusetts Institute of Technology)

In machine learning (ML), a widespread adage is that the area under the precision-recall curve (AUPRC) is a superior metric for model comparison to the area under the receiver operating characteristic (AUROC) for binary classification tasks with class imbalance. This paper challenges this notion through novel mathematical analysis, illustrating that AUROC and AUPRC can be concisely related in probabilistic terms. We demonstrate that AUPRC, contrary to popular belief, is not superior in cases of class imbalance and might even be a harmful metric, given its inclination to unduly favor model improvements in subpopulations with more frequent positive labels. This bias can inadvertently heighten algorithmic disparities. Prompted by these insights, a thorough review of existing ML literature was conducted, utilizing large language models to analyze over 1.5 million papers from arXiv. Our investigation focused on the prevalence and substantiation of the purported AUPRC superiority. The results expose a significant deficit in empirical backing and a trend of misattributions that have fuelled the widespread acceptance of AUPRC's supposed advantages. Our findings represent a dual contribution: a significant technical advancement in understanding metric behaviors and a stark warning about unchecked assumptions in the ML community. All experiments are accessible at https://github.com/mmcdermott/AUC_is_all_you_need.

4/19/2024

cs.LG

Schroedinger's Threshold: When the AUC doesn't predict Accuracy

Juri Opitz

The Area Under Curve measure (AUC) seems apt to evaluate and compare diverse models, possibly without calibration. An important example of AUC application is the evaluation and benchmarking of models that predict faithfulness of generated text. But we show that the AUC yields an academic and optimistic notion of accuracy that can misalign with the actual accuracy observed in application, yielding significant changes in benchmark rankings. To paint a more realistic picture of downstream model performance (and prepare a model for actual application), we explore different calibration modes, testing calibration data and method.

4/5/2024

cs.CL

Multiclass ROC

Liang Wang, Luis Carvalho

Model evaluation is of crucial importance in modern statistics application. The construction of ROC and calculation of AUC have been widely used for binary classification evaluation. Recent research generalizing the ROC/AUC analysis to multi-class classification has problems in at least one of the four areas: 1. failure to provide sensible plots 2. being sensitive to imbalanced data 3. unable to specify mis-classification cost and 4. unable to provide evaluation uncertainty quantification. Borrowing from a binomial matrix factorization model, we provide an evaluation metric summarizing the pair-wise multi-class True Positive Rate (TPR) and False Positive Rate (FPR) with one-dimensional vector representation. Visualization on the representation vector measures the relative speed of increment between TPR and FPR across all the class pairs, which in turn provides a ROC plot for the multi-class counterpart. An integration over those factorized vectors provides a binary AUC-equivalent summary on the classifier performance. Mis-classification weights specification and bootstrapped confidence interval are also enabled to accommodate a variety of evaluation criteria. To support our findings, we conducted extensive simulation studies and compared our method to the pair-wise averaged AUC statistics on benchmark datasets.

4/23/2024

stat.ML cs.LG

Robust performance metrics for imbalanced classification problems

Hajo Holzmann, Bernhard Klar

We show that established performance metrics in binary classification, such as the F-score, the Jaccard similarity coefficient or Matthews' correlation coefficient (MCC), are not robust to class imbalance in the sense that if the proportion of the minority class tends to $0$, the true positive rate (TPR) of the Bayes classifier under these metrics tends to $0$ as well. Thus, in imbalanced classification problems, these metrics favour classifiers which ignore the minority class. To alleviate this issue we introduce robust modifications of the F-score and the MCC for which, even in strongly imbalanced settings, the TPR is bounded away from $0$. We numerically illustrate the behaviour of the various performance metrics in simulations as well as on a credit default data set. We also discuss connections to the ROC and precision-recall curves and give recommendations on how to combine their usage with performance metrics.

4/12/2024

stat.ML cs.LG