Beyond Implicit Bias: The Insignificance of SGD Noise in Online Learning

Read original: arXiv:2306.08590 - Published 6/10/2024 by Nikhil Vyas, Depen Morwani, Rosie Zhao, Gal Kaplun, Sham Kakade, Boaz Barak

Overview

This paper investigates the role of stochastic gradient descent (SGD) noise in online (single-epoch) learning, challenging the view that the noise from finite batch sizes supplies a beneficial implicit bias.
Through an empirical analysis of image and language data, the researchers find that small batch sizes do not confer any implicit bias advantage in online learning.
The findings suggest that the benefits of SGD noise in this regime are strictly computational: smaller batches make each gradient step cheaper.

Plain English Explanation

Stochastic gradient descent (SGD) is the standard way to train machine learning models. It makes small updates to the model's parameters based on small batches of training data, and because each batch is only a sample of the data, every update carries a certain amount of "noise". In online learning, the setting studied here, the model sees each training example only once (a single epoch) rather than cycling through the same data repeatedly.
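As a toy illustration of where this noise comes from, the following Python sketch measures how much a minibatch gradient fluctuates from batch to batch for two batch sizes. The synthetic data, the one-parameter linear model, and all function names are assumptions made for illustration; none of them come from the paper.

```python
import random
import statistics

def per_example_grad(w, x, y):
    # Gradient of 0.5 * (w*x - y)**2 with respect to w.
    return (w * x - y) * x

def minibatch_grad(w, batch):
    # Average of the per-example gradients over the batch.
    return sum(per_example_grad(w, x, y) for x, y in batch) / len(batch)

def grad_noise_std(w, data, batch_size, trials, rng):
    # Spread of the minibatch gradient across randomly drawn batches.
    grads = [minibatch_grad(w, rng.sample(data, batch_size))
             for _ in range(trials)]
    return statistics.pstdev(grads)

rng = random.Random(0)

# Hypothetical toy data: y = 2*x plus a little label noise.
data = []
for _ in range(2000):
    x = rng.gauss(0.0, 1.0)
    data.append((x, 2.0 * x + rng.gauss(0.0, 0.1)))

w = 0.0
full_grad = minibatch_grad(w, data)  # gradient over the whole dataset
noise_small = grad_noise_std(w, data, batch_size=4, trials=500, rng=rng)
noise_large = grad_noise_std(w, data, batch_size=256, trials=500, rng=rng)

print("full-data gradient:", full_grad)
print("gradient std, batch size 4:  ", noise_small)
print("gradient std, batch size 256:", noise_large)
```

In this sketch the smaller batch gives a visibly noisier gradient estimate, which is the quantity whose effect on training the paper examines.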

Prior work has credited this noise with an "implicit bias" that helps models trained with repeated passes over the data. The authors of this paper argue that in online learning the noise does not play that role. Their experiments indicate that smaller batch sizes, which increase the amount of noise, do not lead to better models; they only make each gradient step cheaper to compute.

This is an important finding because it challenges the common view that SGD noise is a key ingredient in the success of deep learning. If the noise offers no implicit bias advantage in online learning, it suggests that batch size in that regime can be chosen on computational grounds rather than tuned in the hope of improving the final model.

Technical Explanation

The paper provides an extensive empirical analysis of the impact of SGD noise on online learning, using image and language data. Whereas prior work focused on offline learning (multiple-epoch training), the researchers study the single-epoch regime and examine how batch size affects the outcome.

Their results indicate that small batch sizes do not confer any implicit bias advantage in this regime. The authors propose that online SGD can be construed as taking noisy steps along the "golden path" of the noiseless gradient descent algorithm, and they provide supporting evidence for this hypothesis in both loss space and function space.
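The idea of comparing online SGD against noiseless gradient descent can be sketched on a toy problem. The Python example below is a hypothetical one-parameter regression, not the paper's experiments: the target slope `TRUE_W`, the learning rate, the batch sizes, and every function name are assumptions chosen for illustration. Because the toy loss is a convex quadratic, staying close to the noiseless path is expected here and says nothing about deep networks; the sketch only shows what such a comparison looks like.

```python
import random

TRUE_W = 2.0  # hypothetical target slope for the toy problem

def fresh_batch(batch_size, rng):
    # Online learning: every step draws new samples, none are reused.
    batch = []
    for _ in range(batch_size):
        x = rng.gauss(0.0, 1.0)
        batch.append((x, TRUE_W * x + rng.gauss(0.0, 0.1)))
    return batch

def online_sgd(batch_size, steps, lr, seed):
    # Single-pass SGD on the loss 0.5 * (w*x - y)**2.
    rng = random.Random(seed)
    w, path = 0.0, []
    for _ in range(steps):
        batch = fresh_batch(batch_size, rng)
        grad = sum((w * x - y) * x for x, y in batch) / batch_size
        w -= lr * grad
        path.append(w)
    return path

def noiseless_gd(steps, lr):
    # The population gradient of 0.5 * E[(w*x - y)**2] is (w - TRUE_W)
    # when E[x**2] = 1, so this path has no sampling noise at all.
    w, path = 0.0, []
    for _ in range(steps):
        w -= lr * (w - TRUE_W)
        path.append(w)
    return path

def max_gap(path_a, path_b):
    # Largest distance between two trajectories at matching steps.
    return max(abs(a - b) for a, b in zip(path_a, path_b))

steps, lr = 200, 0.05
reference = noiseless_gd(steps, lr)
small = online_sgd(batch_size=4, steps=steps, lr=lr, seed=0)
large = online_sgd(batch_size=256, steps=steps, lr=lr, seed=0)

print("max gap to noiseless path, batch size 4:  ", max_gap(small, reference))
print("max gap to noiseless path, batch size 256:", max_gap(large, reference))
print("final w (batch 4, batch 256, noiseless):", small[-1], large[-1], reference[-1])
```

Both runs take the same number of steps, so the small-batch run uses far fewer samples per step, which is the sense in which smaller batches are cheaper per gradient step.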

The topic connects to other work on the importance of batch size in online learning and on the marginal value of momentum, which address neighboring questions about how SGD's hyperparameters shape training.

Critical Analysis

The paper provides a compelling analysis that challenges the common narrative around the importance of SGD noise. The breadth of the experiments, which span image and language data, lends credibility to the findings.

However, it is important to note that the evidence is empirical, so the conclusions are tied to the models, datasets, and hyperparameter ranges that were tested. Further validation on a wider range of models and tasks would be valuable.

Additionally, it would be interesting to see how the role of SGD noise interacts with other factors, such as the choice of optimization algorithm or the structure of the neural network, and whether the picture changes in different settings.

Overall, this paper provides an important counterpoint to the prevailing view on the role of SGD noise. By separating the online regime from the offline one, it opens up new avenues for research on how batch size and noise should be treated when training machine learning models.

Conclusion

This paper challenges the belief that stochastic gradient descent (SGD) noise provides a beneficial implicit bias, at least in online learning. Through an extensive empirical analysis, the researchers show that small batch sizes do not confer implicit bias advantages in the single-epoch regime.

Instead, the authors argue that the benefits of SGD noise in online learning are strictly computational, and that online SGD behaves like a noisy version of noiseless gradient descent. This finding has important implications for the field of machine learning, as it suggests that explanations of SGD's success built on offline, multiple-epoch training do not carry over directly to online training.

Overall, this paper provides a valuable new perspective on the dynamics of online learning, and encourages the community to think more critically about the factors that influence the performance of machine learning models.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Beyond Implicit Bias: The Insignificance of SGD Noise in Online Learning

Nikhil Vyas, Depen Morwani, Rosie Zhao, Gal Kaplun, Sham Kakade, Boaz Barak

The success of SGD in deep learning has been ascribed by prior works to the implicit bias induced by finite batch sizes (SGD noise). While prior works focused on offline learning (i.e., multiple-epoch training), we study the impact of SGD noise on online (i.e., single epoch) learning. Through an extensive empirical analysis of image and language data, we demonstrate that small batch sizes do not confer any implicit bias advantages in online learning. In contrast to offline learning, the benefits of SGD noise in online learning are strictly computational, facilitating more cost-effective gradient steps. This suggests that SGD in the online regime can be construed as taking noisy steps along the golden path of the noiseless gradient descent algorithm. We study this hypothesis and provide supporting evidence in loss and function space. Our findings challenge the prevailing understanding of SGD and offer novel insights into its role in online learning.

6/10/2024

Bias in Motion: Theoretical Insights into the Dynamics of Bias in SGD Training

Anchit Jain, Rozhin Nobahari, Aristide Baratin, Stefano Sarao Mannelli

Machine learning systems often acquire biases by leveraging undesired features in the data, impacting accuracy variably across different sub-populations. Current understanding of bias formation mostly focuses on the initial and final stages of learning, leaving a gap in knowledge regarding the transient dynamics. To address this gap, this paper explores the evolution of bias in a teacher-student setup modeling different data sub-populations with a Gaussian-mixture model. We provide an analytical description of the stochastic gradient descent dynamics of a linear classifier in this setting, which we prove to be exact in high dimension. Notably, our analysis reveals how different properties of sub-populations influence bias at different timescales, showing a shifting preference of the classifier during training. Applying our findings to fairness and robustness, we delineate how and when heterogeneous data and spurious features can generate and amplify bias. We empirically validate our results in more complex scenarios by training deeper networks on synthetic and real datasets, including CIFAR10, MNIST, and CelebA.

5/29/2024

Stochastic Collapse: How Gradient Noise Attracts SGD Dynamics Towards Simpler Subnetworks

Feng Chen, Daniel Kunin, Atsushi Yamamura, Surya Ganguli

In this work, we reveal a strong implicit bias of stochastic gradient descent (SGD) that drives overly expressive networks to much simpler subnetworks, thereby dramatically reducing the number of independent parameters, and improving generalization. To reveal this bias, we identify invariant sets, or subsets of parameter space that remain unmodified by SGD. We focus on two classes of invariant sets that correspond to simpler (sparse or low-rank) subnetworks and commonly appear in modern architectures. Our analysis uncovers that SGD exhibits a property of stochastic attractivity towards these simpler invariant sets. We establish a sufficient condition for stochastic attractivity based on a competition between the loss landscape's curvature around the invariant set and the noise introduced by stochastic gradients. Remarkably, we find that an increased level of noise strengthens attractivity, leading to the emergence of attractive invariant sets associated with saddle-points or local maxima of the train loss. We observe empirically the existence of attractive invariant sets in trained deep neural networks, implying that SGD dynamics often collapses to simple subnetworks with either vanishing or redundant neurons. We further demonstrate how this simplifying process of stochastic collapse benefits generalization in a linear teacher-student framework. Finally, through this analysis, we mechanistically explain why early training with large learning rates for extended periods benefits subsequent generalization.

5/30/2024

Online Learning and Information Exponents: On The Importance of Batch size, and Time/Complexity Tradeoffs

Luca Arnaboldi, Yatin Dandi, Florent Krzakala, Bruno Loureiro, Luca Pesce, Ludovic Stephan

We study the impact of the batch size $n_b$ on the iteration time $T$ of training two-layer neural networks with one-pass stochastic gradient descent (SGD) on multi-index target functions of isotropic covariates. We characterize the optimal batch size minimizing the iteration time as a function of the hardness of the target, as characterized by the information exponents. We show that performing gradient updates with large batches $n_b \lesssim d^{\frac{\ell}{2}}$ minimizes the training time without changing the total sample complexity, where $\ell$ is the information exponent of the target to be learned \citep{arous2021online} and $d$ is the input dimension. However, larger batch sizes than $n_b \gg d^{\frac{\ell}{2}}$ are detrimental for improving the time complexity of SGD. We provably overcome this fundamental limitation via a different training protocol, \textit{Correlation loss SGD}, which suppresses the auto-correlation terms in the loss function. We show that one can track the training progress by a system of low-dimensional ordinary differential equations (ODEs). Finally, we validate our theoretical results with numerical experiments.

6/5/2024