Questionable practices in machine learning

Read original: arXiv:2407.12220 - Published 7/18/2024 by Gavin Leech, Juan J. Vazquez, Misha Yagudin, Niclas Kupper, Laurence Aitchison

Overview

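This paper examines questionable research practices (QRPs) in machine learning (ML): practices that fall short of outright research fraud but can undermine reported results. The authors describe 43 such practices, with an emphasis on the evaluation of large language models (LLMs) on public benchmarks. This summary groups the concerns under overfitting, publication bias, and misleading evaluations.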
The authors highlight the importance of addressing these issues to ensure the reliability and integrity of ML-driven science.
They draw connections to related work on topics like unraveling overoptimism and publication bias in ML-driven science, lessons for reliable machine learning, and the importance of embracing negative results in ML.

Plain English Explanation

The paper discusses problematic practices that can creep into machine learning research. One issue is overfitting, where models perform exceptionally well on the data they were trained on, but fail to generalize to new, unseen data. This can lead to overconfident claims about a model's capabilities.
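
The gap between training and test performance can be shown with a small sketch. The example below is illustrative and not taken from the paper: it assumes a hypothetical "model" that simply memorizes its training examples, and labels that are pure coin flips, so there is nothing real to learn.

```python
import random

def make_data(start, n, rng):
    # Hypothetical data: unique integer IDs as features, coin-flip labels.
    return [(start + i, rng.randrange(2)) for i in range(n)]

def accuracy(memory, fallback, data):
    # Predict the memorized label if the input was seen, else the fallback.
    return sum(memory.get(x, fallback) == y for x, y in data) / len(data)

rng = random.Random(0)
train = make_data(0, 1000, rng)
test = make_data(1000, 1000, rng)   # IDs never seen during training

# "Training" here is pure memorization of (input, label) pairs.
memory = dict(train)
# For unseen inputs, fall back to the majority training label.
fallback = int(2 * sum(y for _, y in train) >= len(train))

train_acc = accuracy(memory, fallback, train)
test_acc = accuracy(memory, fallback, test)
print(f"train accuracy: {train_acc:.2f}")
print(f"test accuracy:  {test_acc:.2f}")
```

Training accuracy is perfect by construction, while accuracy on the unseen examples stays close to chance. Reporting only the first number would be the overconfident claim described above.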

Another concern is publication bias, where researchers are more likely to publish positive results that show their methods working well, while negative or inconclusive findings often go unpublished. This skews the literature and gives an unrealistic impression of the field's progress.
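
A small simulation can show how this skew arises. It is a sketch under stated assumptions, not the paper's analysis: every simulated study evaluates a method with the same true accuracy (`TRUE_ACCURACY`, assumed), and only studies whose observed accuracy clears a hypothetical `PUBLISH_THRESHOLD` are published.

```python
import random

TRUE_ACCURACY = 0.60      # assumed true accuracy, identical in every study
PUBLISH_THRESHOLD = 0.65  # assumed cut-off below which results go unpublished

def run_study(n_test, rng):
    # Observed accuracy on a test set of n_test examples.
    correct = sum(rng.random() < TRUE_ACCURACY for _ in range(n_test))
    return correct / n_test

def mean_published(n_test, n_studies, rng):
    # Average accuracy over only those studies that clear the threshold.
    results = [run_study(n_test, rng) for _ in range(n_studies)]
    published = [r for r in results if r >= PUBLISH_THRESHOLD]
    return sum(published) / len(published) if published else float("nan")

rng = random.Random(0)
small = mean_published(n_test=20, n_studies=2000, rng=rng)
large = mean_published(n_test=200, n_studies=2000, rng=rng)
print(f"true accuracy:                 {TRUE_ACCURACY:.2f}")
print(f"published mean, 20 test items:  {small:.2f}")
print(f"published mean, 200 test items: {large:.2f}")
```

Both published averages sit above the true accuracy, and the inflation is larger for the small studies, because noisy small-sample results clear the threshold by chance more often and by a wider margin.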

The paper also highlights misleading evaluations, where the metrics used to assess a model's performance may not actually capture its true capabilities or real-world applicability. For example, popular benchmarks for evaluating privacy defenses in ML have been shown to be unreliable.

The authors argue that addressing these practices would make machine learning research more rigorous, reliable, and transparent, which in turn would support advances such as better uncertainty quantification in large language models.

Technical Explanation

The paper begins by discussing the rise of machine learning as a powerful tool for scientific discovery, but notes that this has also led to the emergence of questionable research practices. The authors highlight three key issues:

Overfitting: The authors explain how machine learning models can become overly specialized to the training data, leading to inflated performance metrics that do not reflect real-world generalization. They draw connections to related work on unraveling overoptimism and publication bias in ML-driven science.
Publication bias: The paper discusses the tendency for positive results to be more likely to be published, while negative or inconclusive findings often go unreported. This can skew the scientific literature and give an unrealistic impression of progress in the field. The authors relate this to lessons for reliable machine learning and the importance of embracing negative results.
Misleading evaluations: The authors examine how the metrics used to assess machine learning models, particularly in the context of privacy defenses, can be misleading and fail to capture real-world performance. They discuss the issues with evaluating machine learning privacy defenses and the need for more robust evaluation methods.
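
One way a reported score can become inflated without any fraud is to pick the best of many runs using the test set itself. The sketch below is a hypothetical illustration, not an example from the paper: it assumes every candidate (a seed, a hyperparameter setting, a prompt variant) is a model with no real skill, and compares the best test score against the same kind of model on fresh data.

```python
import random

def random_guesser_accuracy(n_examples, rng):
    # A model with no skill: each prediction is right with probability 0.5.
    return sum(rng.random() < 0.5 for _ in range(n_examples)) / n_examples

rng = random.Random(0)
N_TEST = 100        # assumed benchmark size
N_CANDIDATES = 50   # assumed number of runs compared on the test set

# Score every candidate on the same test set and keep the best one.
test_scores = [random_guesser_accuracy(N_TEST, rng) for _ in range(N_CANDIDATES)]
reported = max(test_scores)

# The "winner" has no more skill than the rest, so on fresh data it
# behaves like any other random guesser.
fresh = random_guesser_accuracy(10_000, rng)

print(f"best-of-{N_CANDIDATES} test accuracy: {reported:.2f}")
print(f"accuracy on fresh data:   {fresh:.2f}")
```

The best-of-many score lands noticeably above chance even though no candidate can do better than guessing, which is why selection should happen on a separate validation set and the number of runs should be reported.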

Critical Analysis

The paper raises valid concerns about the potential for questionable practices to undermine the reliability and integrity of machine learning research. The authors provide a nuanced and well-reasoned critique, acknowledging the field's rapid progress while also highlighting important caveats and areas for improvement.

One potential limitation of the research is that it focuses primarily on issues within the machine learning research community, without delving deeply into the broader societal implications of these practices. For example, the authors could have explored how misleading evaluations and publication biases might impact real-world deployments of machine learning systems and their effects on individuals and communities.

Additionally, while the paper makes a strong case for addressing these problematic practices, it could be strengthened by providing more concrete recommendations or frameworks for how the research community can work to mitigate them. Further research in this direction could help translate the authors' insights into actionable steps for improving the reliability and transparency of machine learning-driven science.

Conclusion

This paper sheds important light on the emergence of questionable practices in machine learning research, such as overfitting, publication bias, and misleading evaluations. By drawing connections to related work and highlighting the need for more rigorous and transparent approaches, the authors make a compelling case for addressing these issues to ensure the integrity and reliability of ML-driven scientific discoveries.

As machine learning advances, researchers, practitioners, and the broader public will need to stay critical of the methods and findings presented. Addressing the practices outlined in this paper would make machine learning results, and the applications built on them, more trustworthy.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Questionable practices in machine learning

Gavin Leech, Juan J. Vazquez, Misha Yagudin, Niclas Kupper, Laurence Aitchison

Evaluating modern ML models is hard. The strong incentive for researchers and companies to report a state-of-the-art result on some metric often leads to questionable research practices (QRPs): bad practices which fall short of outright research fraud. We describe 43 such practices which can undermine reported results, giving examples where possible. Our list emphasises the evaluation of large language models (LLMs) on public benchmarks. We also discuss irreproducible research practices, i.e. decisions that make it difficult or impossible for other researchers to reproduce, build on or audit previous research.

7/18/2024

Unraveling overoptimism and publication bias in ML-driven science

Pouria Saidi, Gautam Dasarathy, Visar Berisha

Machine Learning (ML) is increasingly used across many disciplines with impressive reported results. However, recent studies suggest published performance of ML models are often overoptimistic. Validity concerns are underscored by findings of an inverse relationship between sample size and reported accuracy in published ML models, contrasting with the theory of learning curves where accuracy should improve or remain stable with increasing sample size. This paper investigates factors contributing to overoptimism in ML-driven science, focusing on overfitting and publication bias. We introduce a novel stochastic model for observed accuracy, integrating parametric learning curves and the aforementioned biases. We construct an estimator that corrects for these biases in observed data. Theoretical and empirical results show that our framework can estimate the underlying learning curve, providing realistic performance assessments from published results. Applying the model to meta-analyses of classifications of neurological conditions, we estimate the inherent limits of ML-based prediction in each domain.

7/15/2024

Between Randomness and Arbitrariness: Some Lessons for Reliable Machine Learning at Scale

A. Feder Cooper

To develop rigorous knowledge about ML models -- and the systems in which they are embedded -- we need reliable measurements. But reliable measurement is fundamentally challenging, and touches on issues of reproducibility, scalability, uncertainty quantification, epistemology, and more. This dissertation addresses criteria needed to take reliability seriously: both criteria for designing meaningful metrics, and for methodologies that ensure that we can dependably and efficiently measure these metrics at scale and in practice. In doing so, this dissertation articulates a research vision for a new field of scholarship at the intersection of machine learning, law, and policy. Within this frame, we cover topics that fit under three different themes: (1) quantifying and mitigating sources of arbitrariness in ML, (2) taming randomness in uncertainty estimation and optimization algorithms, in order to achieve scalability without sacrificing reliability, and (3) providing methods for evaluating generative-AI systems, with specific focuses on quantifying memorization in language models and training latent diffusion models on open-licensed data. By making contributions in these three themes, this dissertation serves as an empirical proof by example that research on reliable measurement for machine learning is intimately and inescapably bound up with research in law and policy. These different disciplines pose similar research questions about reliable measurement in machine learning. They are, in fact, two complementary sides of the same research vision, which, broadly construed, aims to construct machine-learning systems that cohere with broader societal values.

8/13/2024

How to avoid machine learning pitfalls: a guide for academic researchers

Michael A. Lones

Mistakes in machine learning practice are commonplace, and can result in a loss of confidence in the findings and products of machine learning. This guide outlines common mistakes that occur when using machine learning, and what can be done to avoid them. Whilst it should be accessible to anyone with a basic understanding of machine learning techniques, it focuses on issues that are of particular concern within academic research, such as the need to do rigorous comparisons and reach valid conclusions. It covers five stages of the machine learning process: what to do before model building, how to reliably build models, how to robustly evaluate models, how to compare models fairly, and how to report results.

8/30/2024