Least Squares Regression Can Exhibit Under-Parameterized Double Descent

2305.14689

Published 6/4/2024 by Xinyue Li, Rishi Sonthalia

Abstract

The relationship between the number of training data points, the number of parameters, and the generalization capabilities has been widely studied. Previous work has shown that double descent can occur in the over-parameterized regime, and it is commonly believed that the standard bias-variance trade-off holds in the under-parameterized regime. These works provide multiple reasons for the existence of the peak. We postulate that the location of the peak depends on the technical properties of both the spectrum and the eigenvectors of the sample covariance. We present two simple examples that provably exhibit double descent in the under-parameterized regime, and this double descent does not seem to occur for the reasons provided in prior work.

Overview

The paper explores the relationship between the number of training data points, the number of model parameters, and the generalization capabilities of machine learning models.
Previous work has shown that "double descent" can occur in the over-parameterized regime, and it is commonly believed that the standard bias-variance trade-off holds in the under-parameterized regime.
The authors postulate that the location of the "peak" in the double descent curve depends on the technical properties of both the spectrum and the eigenvectors of the sample covariance.
The paper presents two simple examples that provably exhibit double descent in the under-parameterized regime; according to the authors, this behavior does not seem to arise for the reasons provided in prior work.

Plain English Explanation

The research paper discusses the complex relationship between the size of a machine learning model, the amount of training data available, and the model's ability to generalize well to new, unseen data.

Previous studies have found that a phenomenon called "double descent" can occur as a model grows into the "over-parameterized" regime, where it has more parameters than training data points. As parameters are added, the model's error on new data first falls, then rises to a peak, and then falls again.
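The classical picture can be reproduced with a small simulation. The sketch below is an illustration under assumptions that are not taken from the paper: isotropic Gaussian features, a unit-norm true coefficient vector, and a model that only sees the first d features. The function name and all parameter values are hypothetical. It shows the familiar peak near the interpolation threshold (d equal to the number of training points), not the under-parameterized examples the paper constructs.

```python
import numpy as np

def double_descent_curve(n_train=40, n_total_features=120, noise_std=0.5,
                         n_test=2000, n_trials=20, seed=0):
    """Average test error of least squares as the number of used features grows."""
    rng = np.random.default_rng(seed)
    feature_counts = np.arange(5, n_total_features + 1, 5)
    errors = np.zeros(len(feature_counts))
    for _ in range(n_trials):
        # Assumed data model: y = X @ beta + Gaussian noise, with ||beta|| = 1.
        beta = rng.normal(size=n_total_features)
        beta /= np.linalg.norm(beta)
        X_train = rng.normal(size=(n_train, n_total_features))
        X_test = rng.normal(size=(n_test, n_total_features))
        y_train = X_train @ beta + noise_std * rng.normal(size=n_train)
        y_test = X_test @ beta  # noiseless targets, so this measures excess error
        for i, d in enumerate(feature_counts):
            # Fit on the first d features only. When d > n_train, lstsq
            # returns the minimum-norm solution that interpolates the data.
            coef, *_ = np.linalg.lstsq(X_train[:, :d], y_train, rcond=None)
            pred = X_test[:, :d] @ coef
            errors[i] += np.mean((pred - y_test) ** 2)
    return feature_counts, errors / n_trials

if __name__ == "__main__":
    counts, errs = double_descent_curve()
    for d, e in zip(counts, errs):
        print(f"d={d:3d}  test error={e:.3f}")
```

Printing the curve should show the error spiking around d = 40, which equals n_train in this setup, and dropping again for larger d.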

In the "under-parameterized" regime, where the model has fewer parameters than training examples, the standard "bias-variance trade-off" is believed to hold - adding more parameters reduces the model's bias but increases its variance.
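For reference, the trade-off can be written as the textbook decomposition of expected squared error at a fixed input x. This is a standard identity, not a result from the paper, and the symbols are assumptions introduced here: f is the true function, \hat f is the model fitted on a random training set, and y = f(x) + \varepsilon with zero-mean noise of variance \sigma^2 that is independent of the training set.

```latex
\mathbb{E}\big[(y - \hat f(x))^2\big]
  = \underbrace{\big(\mathbb{E}[\hat f(x)] - f(x)\big)^2}_{\text{bias}^2}
  + \underbrace{\mathrm{Var}\big(\hat f(x)\big)}_{\text{variance}}
  + \underbrace{\sigma^2}_{\text{noise}}
```

The expectation is taken over both the training set and the noise, so the first term shrinks and the second grows as the model becomes more flexible.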

The authors of this paper propose that the specific point where the double descent curve "peaks" depends on the spectrum (the eigenvalues) and the eigenvectors of the sample covariance matrix of the training data. They provide two simple examples that exhibit double descent even in the under-parameterized regime, which they claim do not seem to be explained by the reasons given in previous research.

Technical Explanation

The paper investigates the relationship between model complexity, as measured by the number of parameters, and the model's generalization performance. Previous work has shown that "double descent" can occur in the over-parameterized regime: as parameters are added, test error rises to a peak and then falls again. In the under-parameterized regime, the standard bias-variance trade-off is commonly believed to hold.
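As a point of reference, the two regimes can be written down for plain linear least squares. This is a generic textbook formulation assumed here for illustration; the paper's exact setup may differ. Let X \in \mathbb{R}^{n \times d} hold n training points with d parameters, let y \in \mathbb{R}^n be the targets, and let X^{+} denote the Moore-Penrose pseudoinverse.

```latex
\hat\beta = X^{+} y =
\begin{cases}
(X^\top X)^{-1} X^\top y, & d < n \ \text{(under-parameterized, } X \text{ of full column rank)} \\[4pt]
X^\top (X X^\top)^{-1} y, & d > n \ \text{(over-parameterized, } X \text{ of full row rank)}
\end{cases}
```

In the second case \hat\beta is the minimum-norm solution among all coefficient vectors that fit the training data exactly.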

The authors hypothesize that the location of the "peak" in the double descent curve depends on the technical properties of both the spectrum and the eigenvectors of the sample covariance matrix. In support of this, they present two simple examples that provably exhibit double descent in the under-parameterized regime, where the peak does not seem to occur for the reasons provided in prior work.
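The two objects named in this hypothesis are easy to compute for any data matrix. The following is a minimal sketch, assuming the uncentered convention X^T X / n for the sample covariance (the paper may use a different normalization); the function name is hypothetical.

```python
import numpy as np

def sample_covariance_eig(X):
    """Return eigenvalues (descending) and matching eigenvectors of X^T X / n."""
    n = X.shape[0]
    cov = X.T @ X / n  # assumed convention: uncentered sample covariance
    # eigh is meant for symmetric matrices and returns ascending eigenvalues.
    eigvals, eigvecs = np.linalg.eigh(cov)
    order = np.argsort(eigvals)[::-1]
    # Column k of the returned matrix is the eigenvector for eigenvalue k.
    return eigvals[order], eigvecs[:, order]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(50, 8))
    spectrum, vectors = sample_covariance_eig(X)
    print("spectrum:", np.round(spectrum, 3))
```

The eigenvalues are the "spectrum" referred to above, and the columns of the returned matrix are the corresponding eigenvectors.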

Critical Analysis

The paper provides an interesting new perspective on the complex relationship between model complexity and generalization performance. By exploring examples of double descent in the under-parameterized regime, the authors challenge some of the prevailing explanations for this phenomenon.

However, the paper does not delve deeply into the potential limitations or caveats of its analysis. For example, it is unclear how well the two examples generalize to more realistic machine learning problems. Additionally, the authors do not address potential issues with the technical assumptions or simplifications made in their analysis.

Further work could also investigate how model width and density affect generalization, or explore the connections between double descent and other interpolation phenomena studied in prior research.

Overall, the paper presents a thought-provoking new direction for understanding the enigma of double descent, but more research is needed to fully unravel this complex issue and its implications for common intuitions in transfer learning.

Conclusion

This research paper challenges some of the prevailing explanations for the phenomenon of "double descent" in machine learning models, where test error rises to a peak as model complexity increases and then falls again. The authors propose that the specific point at which the double descent curve "peaks" depends on the technical properties of both the spectrum and the eigenvectors of the sample covariance of the training data.

By presenting two simple examples that exhibit double descent in the under-parameterized regime, the paper opens up new avenues for exploring the complex relationship between model complexity, training data, and generalization capabilities. While more research is needed to fully understand the implications, this work represents an important step in unraveling the enigma of double descent and its potential impact on machine learning.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Unraveling the Enigma of Double Descent: An In-depth Analysis through the Lens of Learned Feature Space

Yufei Gu, Xiaoqing Zheng, Tomaso Aste

4/26/2024

cs.LG

Multiple Descents in Unsupervised Learning: The Role of Noise, Domain Shift and Anomalies

Kobi Rahimi, Tom Tirer, Ofir Lindenbaum

The phenomenon of double descent has recently gained attention in supervised learning. It challenges the conventional wisdom of the bias-variance trade-off by showcasing a surprising behavior. As the complexity of the model increases, the test error initially decreases until reaching a certain point where the model starts to overfit the train set, causing the test error to rise. However, deviating from classical theory, the error exhibits another decline when exceeding a certain degree of over-parameterization. We study the presence of double descent in unsupervised learning, an area that has received little attention and is not yet fully understood. We conduct extensive experiments using under-complete auto-encoders (AEs) for various applications, such as dealing with noisy data, domain shifts, and anomalies. We use synthetic and real data and identify model-wise, epoch-wise, and sample-wise double descent for all the aforementioned applications. Finally, we assessed the usability of the AEs for detecting anomalies and mitigating the domain shift between datasets. Our findings indicate that over-parameterized models can improve performance not only in terms of reconstruction, but also in enhancing capabilities for the downstream task.

6/18/2024

cs.LG stat.ML

Double Descent and Other Interpolation Phenomena in GANs

Lorenzo Luzi, Yehuda Dar, Richard Baraniuk

We study overparameterization in generative adversarial networks (GANs) that can interpolate the training data. We show that overparameterization can improve generalization performance and accelerate the training process. We study the generalization error as a function of latent space dimension and identify two main behaviors, depending on the learning setting. First, we show that overparameterized generative models that learn distributions by minimizing a metric or $f$-divergence do not exhibit double descent in generalization errors; specifically, all the interpolating solutions achieve the same generalization error. Second, we develop a novel pseudo-supervised learning approach for GANs where the training utilizes pairs of fabricated (noise) inputs in conjunction with real output samples. Our pseudo-supervised setting exhibits double descent (and in some cases, triple descent) of generalization errors. We combine pseudo-supervision with overparameterization (i.e., overly large latent space dimension) to accelerate training while matching or even surpassing generalization performance without pseudo-supervision. While our analysis focuses mostly on linear models, we also apply important insights for improving generalization of nonlinear, multilayer GANs.

5/2/2024

cs.LG

Class-wise Activation Unravelling the Engima of Deep Double Descent

Yufei Gu

Double descent presents a counter-intuitive aspect within the machine learning domain, and researchers have observed its manifestation in various models and tasks. While some theoretical explanations have been proposed for this phenomenon in specific contexts, an accepted theory for its occurring mechanism in deep learning remains yet to be established. In this study, we revisited the phenomenon of double descent and discussed the conditions of its occurrence. This paper introduces the concept of class-activation matrices and a methodology for estimating the effective complexity of functions, on which we unveil that over-parameterized models exhibit more distinct and simpler class patterns in hidden activations compared to under-parameterized ones. We further looked into the interpolation of noisy labelled data among clean representations and demonstrated overfitting w.r.t. expressive capacity. By comprehensively analysing hypotheses and presenting corresponding empirical evidence that either validates or contradicts these hypotheses, we aim to provide fresh insights into the phenomenon of double descent and benign over-parameterization and facilitate future explorations. The source code is available at https://github.com/Yufei-Gu-451/sparse-generalization.git.

5/14/2024

cs.LG