High dimensional analysis reveals conservative sharpening and a stochastic edge of stability

2404.19261

Published 5/1/2024 by Atish Agarwala, Jeffrey Pennington

Abstract

Recent empirical and theoretical work has shown that the dynamics of the large eigenvalues of the training loss Hessian have some remarkably robust features across models and datasets in the full batch regime. There is often an early period of progressive sharpening where the large eigenvalues increase, followed by stabilization at a predictable value known as the edge of stability. Previous work showed that in the stochastic setting, the eigenvalues increase more slowly - a phenomenon we call conservative sharpening. We provide a theoretical analysis of a simple high-dimensional model which shows the origin of this slowdown. We also show that there is an alternative stochastic edge of stability which arises at small batch size that is sensitive to the trace of the Neural Tangent Kernel rather than the large Hessian eigenvalues. We conduct an experimental study which highlights the qualitative differences from the full batch phenomenology, and suggests that controlling the stochastic edge of stability can help optimization.

Overview

Recent research has shown that the large eigenvalues of the training loss Hessian exhibit robust patterns across different models and datasets in the full batch setting.
There is often an initial period of "progressive sharpening" where the large eigenvalues increase, followed by stabilization at a predictable value called the "edge of stability".
In the stochastic setting, the eigenvalues increase more slowly, a phenomenon known as "conservative sharpening".
The paper provides a theoretical analysis of a simple high-dimensional model that explains the origin of this slowdown.
It also shows that there is an alternative "stochastic edge of stability" that arises at small batch sizes, which is sensitive to the trace of the Neural Tangent Kernel rather than the large Hessian eigenvalues.
An experimental study highlights the qualitative differences from the full batch phenomenology and suggests that controlling the stochastic edge of stability can help optimization.

Plain English Explanation

When training machine learning models, the shape of the "loss function" (a measure of how well the model is performing) can provide valuable insights. Specifically, the large eigenvalues of the Hessian matrix (a measure of the curvature of the loss function) have been found to exhibit some interesting patterns.
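To make "large eigenvalues of the Hessian" concrete, here is a minimal sketch that estimates the largest eigenvalue of a small symmetric matrix by power iteration. The 2x2 matrix is a hypothetical stand-in for a Hessian, and the function name `top_eigenvalue` is chosen for illustration; the source does not say how the paper measures these eigenvalues.

```python
import math

def top_eigenvalue(matrix, iters=200):
    """Estimate the largest-magnitude eigenvalue of a symmetric matrix by power iteration."""
    n = len(matrix)
    # Arbitrary unit-norm start; it must not be orthogonal to the top eigenvector.
    v = [1.0 / math.sqrt(n)] * n
    for _ in range(iters):
        # Multiply by the matrix, then renormalize to unit length.
        w = [sum(matrix[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # Rayleigh quotient v^T M v of the converged direction.
    mv = [sum(matrix[i][j] * v[j] for j in range(n)) for i in range(n)]
    return sum(v[i] * mv[i] for i in range(n))

# Hypothetical 2x2 "Hessian" of a toy loss (an assumption, not from the paper).
hessian = [[3.0, 1.0], [1.0, 1.0]]
sharpness = top_eigenvalue(hessian)  # exact top eigenvalue is 2 + sqrt(2)
```

In practice the Hessian of a neural network is far too large to store, so the matrix-vector product would typically be replaced by a Hessian-vector product; that detail is outside this sketch.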

In the full batch setting, where the entire dataset is used for each training update, the large eigenvalues often start by increasing ("progressive sharpening") and then stabilize at a predictable value known as the "edge of stability". For gradient descent with learning rate η, this value is approximately 2/η, so the curvature at which training settles is tied directly to the choice of learning rate.
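A minimal sketch of where a 2/η threshold comes from, assuming a one-dimensional quadratic loss L(x) = 0.5 * curvature * x^2 as a local model. Gradient descent multiplies x by (1 - lr * curvature) at every step, so the iterates shrink only while the curvature stays below 2 / lr. The function name and constants are illustrative assumptions.

```python
def run_gd(curvature, lr, steps=100, x0=1.0):
    """Gradient descent on L(x) = 0.5 * curvature * x**2; returns the final |x|."""
    x = x0
    for _ in range(steps):
        grad = curvature * x  # dL/dx
        x = x - lr * grad     # each step multiplies x by (1 - lr * curvature)
    return abs(x)

lr = 0.1
edge = 2.0 / lr                  # curvature threshold for this learning rate
below = run_gd(0.95 * edge, lr)  # curvature just under the edge: iterates shrink
above = run_gd(1.05 * edge, lr)  # curvature just over the edge: iterates grow
```

In a real network the loss is not quadratic, which is why training can hover near the threshold rather than simply diverging once it is crossed.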

However, in the more common stochastic setting, where only a small subset of the data is used for each update, the eigenvalues increase more slowly. This "conservative sharpening" phenomenon is explained in the paper using a simple high-dimensional model.

The paper also reveals an alternative "stochastic edge of stability" that arises when using small batch sizes. This edge is sensitive to the trace of the Neural Tangent Kernel (a matrix that describes how a small parameter update changes the model's outputs on the training examples) rather than to the large Hessian eigenvalues.

Through experiments, the researchers show that the stochastic edge of stability differs qualitatively from the full batch case, and that controlling this stochastic edge can potentially help improve the optimization of machine learning models.

Technical Explanation

The paper provides a theoretical analysis of a simple high-dimensional model that explains the origin of the "conservative sharpening" phenomenon observed in the stochastic setting, where the large eigenvalues of the training loss Hessian increase more slowly compared to the full batch regime.

The researchers show that there is an alternative "stochastic edge of stability" that arises at small batch sizes, which is sensitive to the trace of the Neural Tangent Kernel rather than the large Hessian eigenvalues. This is in contrast to the full batch case, where the edge of stability is determined by the large Hessian eigenvalues.
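The contrast can be illustrated with a toy problem that is not the paper's model. The sketch below assumes a linear least-squares loss (1/2n) * sum_i (x_i . theta)^2 with orthogonal inputs x_i = sqrt(scale) * e_i. In this toy, the largest Hessian eigenvalue is scale / n, while the trace of the Gram matrix (1/n) X X^T, which plays the role of the Neural Tangent Kernel for a linear model, is scale. Full-batch gradient descent is stable when lr * scale / n < 2, whereas batch-size-1 SGD needs lr * scale < 2. All names and constants are illustrative assumptions.

```python
import math
import random

def final_norm(n, scale, lr, batch_size, steps=50, seed=0):
    """Run (S)GD on the toy loss and return the norm of theta after `steps` updates."""
    rng = random.Random(seed)
    theta = [1.0] * n
    for _ in range(steps):
        # Sample a minibatch of distinct example indices.
        batch = rng.sample(range(n), batch_size)
        for i in batch:
            # Example i has gradient scale * theta[i] along e_i; average over the batch.
            theta[i] -= lr * scale * theta[i] / batch_size
    return math.sqrt(sum(t * t for t in theta))

n, scale, lr = 10, 10.0, 0.5
lam_max = scale / n  # largest Hessian eigenvalue of the toy loss
ntk_trace = scale    # trace of the Gram matrix (1/n) X X^T

full_batch = final_norm(n, scale, lr, batch_size=n)  # lr * lam_max < 2: converges
single = final_norm(n, scale, lr, batch_size=1)      # lr * ntk_trace > 2: diverges
```

The same learning rate is therefore safe for full-batch training but unstable at batch size 1, even though the largest Hessian eigenvalue is identical in both runs. The exact thresholds here are specific to the orthogonal-input assumption.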

The experimental study conducted in the paper highlights the qualitative differences between the stochastic and full batch settings, and suggests that controlling the stochastic edge of stability can help improve optimization. This builds on previous work on sharpness-aware minimization, noise stability and optimization of flat minima, and high-probability convergence bounds for nonlinear stochastic gradient descent.

Critical Analysis

The paper provides a comprehensive theoretical and experimental analysis of the dynamics of the large eigenvalues of the training loss Hessian in both the full batch and stochastic settings. The insights it offers, such as the existence of a stochastic edge of stability that is sensitive to the Neural Tangent Kernel, are valuable for understanding optimization dynamics in deep learning.

However, the paper's analysis is limited to a simple high-dimensional model, and it remains to be seen how well the theoretical predictions hold in more complex, realistic deep learning settings. Additionally, although the experiments suggest that controlling the stochastic edge of stability can help optimization, it is not yet clear how this translates into practical guidance, such as the selection of batch size or other hyperparameters.

Further research could investigate the relationship between the stochastic edge of stability and other concepts like flat minima and the singular limit of gradient descent with noise injection. Exploring these connections could lead to a more holistic understanding of optimization dynamics in deep learning.

Conclusion

This paper provides a valuable theoretical and experimental analysis of the dynamics of the large eigenvalues of the training loss Hessian in both the full batch and stochastic settings. It reveals the existence of a "conservative sharpening" phenomenon in the stochastic case, as well as an alternative "stochastic edge of stability" that is sensitive to the Neural Tangent Kernel rather than the large Hessian eigenvalues.

The insights offered in this paper can help researchers and practitioners gain a deeper understanding of optimization dynamics in deep learning, potentially leading to improved training methods and hyperparameter selection strategies. While the analysis is limited to a simple model, the findings serve as an important step towards a more comprehensive theory of optimization in complex machine learning systems.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Sharpness-Aware Minimization and the Edge of Stability

Philip M. Long, Peter L. Bartlett

Recent experiments have shown that, often, when training a neural network with gradient descent (GD) with a step size $\eta$, the operator norm of the Hessian of the loss grows until it approximately reaches $2/\eta$, after which it fluctuates around this value. The quantity $2/\eta$ has been called the edge of stability based on consideration of a local quadratic approximation of the loss. We perform a similar calculation to arrive at an edge of stability for Sharpness-Aware Minimization (SAM), a variant of GD which has been shown to improve its generalization. Unlike the case for GD, the resulting SAM-edge depends on the norm of the gradient. Using three deep learning training tasks, we see empirically that SAM operates on the edge of stability identified by this analysis.

4/10/2024

cs.LG cs.NE stat.ML

Training on the Edge of Stability Is Caused by Layerwise Jacobian Alignment

Mark Lowell, Catharine Kastner

During neural network training, the sharpness of the Hessian matrix of the training loss rises until training is on the edge of stability. As a result, even nonstochastic gradient descent does not accurately model the underlying dynamical system defined by the gradient flow of the training loss. We use an exponential Euler solver to train the network without entering the edge of stability, so that we accurately approximate the true gradient descent dynamics. We demonstrate experimentally that the increase in the sharpness of the Hessian matrix is caused by the layerwise Jacobian matrices of the network becoming aligned, so that a small change in the network preactivations near the inputs of the network can cause a large change in the outputs of the network. We further demonstrate that the degree of alignment scales with the size of the dataset by a power law with a coefficient of determination between 0.74 and 0.98.

6/4/2024

stat.ML cs.LG

Adaptive Gradient Methods at the Edge of Stability

Jeremy M. Cohen, Behrooz Ghorbani, Shankar Krishnan, Naman Agarwal, Sourabh Medapati, Michal Badura, Daniel Suo, David Cardoze, Zachary Nado, George E. Dahl, Justin Gilmer

Very little is known about the training dynamics of adaptive gradient methods like Adam in deep learning. In this paper, we shed light on the behavior of these algorithms in the full-batch and sufficiently large batch settings. Specifically, we empirically demonstrate that during full-batch training, the maximum eigenvalue of the preconditioned Hessian typically equilibrates at a certain numerical value -- the stability threshold of a gradient descent algorithm. For Adam with step size $\eta$ and $\beta_1 = 0.9$, this stability threshold is $38/\eta$. Similar effects occur during minibatch training, especially as the batch size grows. Yet, even though adaptive methods train at the "Adaptive Edge of Stability" (AEoS), their behavior in this regime differs in a significant way from that of non-adaptive methods at the EoS. Whereas non-adaptive algorithms at the EoS are blocked from entering high-curvature regions of the loss landscape, adaptive gradient methods at the AEoS can keep advancing into high-curvature regions, while adapting the preconditioner to compensate. Our findings can serve as a foundation for the community's future understanding of adaptive gradient methods in deep learning.

4/17/2024

cs.LG

A PDE-based Explanation of Extreme Numerical Sensitivities and Edge of Stability in Training Neural Networks

Yuxin Sun, Dong Lao, Ganesh Sundaramoorthi, Anthony Yezzi

We discover restrained numerical instabilities in current training practices of deep networks with stochastic gradient descent (SGD), and its variants. We show numerical error (on the order of the smallest floating point bit and thus the most extreme or limiting numerical perturbations induced from floating point arithmetic in training deep nets) can be amplified significantly and result in significant test accuracy variance (sensitivities), comparable to the test accuracy variance due to stochasticity in SGD. We show how this is likely traced to instabilities of the optimization dynamics that are restrained, i.e., localized over iterations and regions of the weight tensor space. We do this by presenting a theoretical framework using numerical analysis of partial differential equations (PDE), and analyzing the gradient descent PDE of convolutional neural networks (CNNs). We show that it is stable only under certain conditions on the learning rate and weight decay. We show that rather than blowing up when the conditions are violated, the instability can be restrained. We show this is a consequence of the non-linear PDE associated with the gradient descent of the CNN, whose local linearization changes when over-driving the step size of the discretization, resulting in a stabilizing effect. We link restrained instabilities to the recently discovered Edge of Stability (EoS) phenomena, in which the stable step size predicted by classical theory is exceeded while continuing to optimize the loss and still converging. Because restrained instabilities occur at the EoS, our theory provides new insights and predictions about the EoS, in particular, the role of regularization and the dependence on the network complexity.

6/13/2024

cs.LG cs.NA