Adaptive Gradient Methods at the Edge of Stability

2207.14484

Published 4/17/2024 by Jeremy M. Cohen, Behrooz Ghorbani, Shankar Krishnan, Naman Agarwal, Sourabh Medapati, Michal Badura, Daniel Suo, David Cardoze, Zachary Nado, George E. Dahl and 1 other

cs.LG

Abstract

Very little is known about the training dynamics of adaptive gradient methods like Adam in deep learning. In this paper, we shed light on the behavior of these algorithms in the full-batch and sufficiently large batch settings. Specifically, we empirically demonstrate that during full-batch training, the maximum eigenvalue of the preconditioned Hessian typically equilibrates at a certain numerical value -- the stability threshold of a gradient descent algorithm. For Adam with step size $\eta$ and $\beta_1 = 0.9$, this stability threshold is $38/\eta$. Similar effects occur during minibatch training, especially as the batch size grows. Yet, even though adaptive methods train at the "Adaptive Edge of Stability" (AEoS), their behavior in this regime differs in a significant way from that of non-adaptive methods at the EoS. Whereas non-adaptive algorithms at the EoS are blocked from entering high-curvature regions of the loss landscape, adaptive gradient methods at the AEoS can keep advancing into high-curvature regions, while adapting the preconditioner to compensate. Our findings can serve as a foundation for the community's future understanding of adaptive gradient methods in deep learning.

Overview

This paper sheds light on the behavior of adaptive gradient methods like Adam during deep learning training.
The authors found that the maximum eigenvalue of the preconditioned Hessian typically equilibrates at a certain numerical value - the "stability threshold" - during full-batch training.
Similar effects were observed in mini-batch training, especially as batch size increases.
Adaptive methods operate at the "Adaptive Edge of Stability" (AEoS), which differs from the "Edge of Stability" (EoS) of non-adaptive methods.
Whereas non-adaptive algorithms are blocked from entering high-curvature regions at the EoS, adaptive gradient methods can keep advancing into these regions while adapting the preconditioner to compensate.

Plain English Explanation

The paper examines how adaptive gradient methods, like the popular Adam algorithm, behave during deep learning training. These methods automatically adjust the learning rate for each parameter, which can help speed up training.

The researchers found that during full-batch training (where the entire dataset is used for each update), the "maximum eigenvalue" of the preconditioned Hessian tends to settle at a specific numerical value. The Hessian matrix describes the curvature of the loss function being optimized, and the preconditioned Hessian is that curvature as seen through the optimizer's per-parameter rescaling of the gradient. The authors call this value the "stability threshold": it is the largest curvature the optimizer can tolerate before its updates become unstable.

For the Adam algorithm with momentum parameter $\beta_1 = 0.9$, this stability threshold is 38 divided by the learning rate. Similar effects were seen in mini-batch training (where only a subset of the data is used for each update), especially as the batch size increased.

The key insight is that adaptive methods operate at the "Adaptive Edge of Stability" (AEoS), which is different from the "Edge of Stability" (EoS) that non-adaptive methods like gradient descent reach. At the EoS, non-adaptive algorithms are prevented from entering regions of high curvature in the loss landscape. In contrast, adaptive methods at the AEoS can keep advancing into these high-curvature regions, while adjusting the preconditioner (a way of scaling the gradients) to compensate.

The authors believe their findings can help the research community better understand how adaptive gradient methods behave during deep learning training, which could lead to improvements in training stability and generalization.

Technical Explanation

The paper empirically investigates the training dynamics of adaptive gradient methods, such as Adam, in the full-batch and sufficiently large batch settings. The authors observe that during full-batch training, the maximum eigenvalue of the preconditioned Hessian typically equilibrates at a certain numerical value: the "stability threshold" of the algorithm, i.e. the curvature above which its updates would become unstable.

For Adam with a step size of $\eta$ and $\beta_1 = 0.9$, this stability threshold is found to be $38/\eta$. Similar effects occur during mini-batch training, especially as the batch size grows.
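To make the $38/\eta$ figure concrete, here is a minimal sketch rather than the authors' code. It assumes the preconditioner is frozen, so that Adam reduces to gradient descent with exponential-moving-average (EMA) momentum, and it ignores bias correction. Under those assumptions, the standard stability analysis on a one-dimensional quadratic gives a threshold of $(2 + 2\beta_1) / ((1 - \beta_1)\eta)$, which equals $38/\eta$ when $\beta_1 = 0.9$. The function names `ema_momentum_threshold` and `run_ema_momentum` are hypothetical and chosen only for illustration.

```python
def ema_momentum_threshold(eta, beta1):
    """Largest stable curvature for EMA-style momentum on a quadratic.

    Assumed closed form from the quadratic stability analysis:
    (2 + 2*beta1) / ((1 - beta1) * eta).
    """
    return (2.0 + 2.0 * beta1) / ((1.0 - beta1) * eta)


def run_ema_momentum(curvature, eta, beta1=0.9, steps=500, x0=1.0):
    """Minimize f(x) = 0.5 * curvature * x**2 and return the final |x|."""
    x, m = x0, 0.0
    for _ in range(steps):
        grad = curvature * x                  # gradient of the quadratic
        m = beta1 * m + (1.0 - beta1) * grad  # EMA of gradients (no bias correction)
        x = x - eta * m                       # step with a fixed (identity) preconditioner
    return abs(x)


eta = 0.01
threshold = ema_momentum_threshold(eta, beta1=0.9)  # 38 / eta

# Curvature slightly below the threshold: the iterates shrink toward zero.
below = run_ema_momentum(0.95 * threshold, eta)
# Curvature slightly above the threshold: the iterates blow up.
above = run_ema_momentum(1.05 * threshold, eta)

print(f"threshold = {threshold:.1f}, below -> {below:.3e}, above -> {above:.3e}")
```

In this toy setting the curvature is fixed, so training either converges or diverges. In a neural network the curvature itself changes during training, which is what allows it to hover near the threshold instead.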

The key difference between adaptive methods and non-adaptive methods is that adaptive methods operate at the "Adaptive Edge of Stability" (AEoS), whereas non-adaptive algorithms like gradient descent reach the "Edge of Stability" (EoS). At the EoS, non-adaptive algorithms are blocked from entering high-curvature regions of the loss landscape. In contrast, adaptive gradient methods at the AEoS can keep advancing into high-curvature regions, while adapting the preconditioner to compensate.
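The compensation described above can be illustrated with a small sketch. It assumes a simplified Adam-style diagonal preconditioner $P = \mathrm{diag}(\sqrt{v} + \epsilon)$ without bias correction, where $v$ is the second-moment estimate, and it assumes the Hessian is positive semidefinite so that power iteration finds the largest eigenvalue. The function `preconditioned_sharpness` and the toy 2x2 Hessian are hypothetical, not taken from the paper.

```python
import math


def preconditioned_sharpness(hessian, second_moment, eps=1e-8, iters=500):
    """Top eigenvalue of P^{-1} H for P = diag(sqrt(v) + eps), by power iteration.

    Works on the symmetric matrix P^{-1/2} H P^{-1/2}, which has the same
    eigenvalues as P^{-1} H.
    """
    n = len(hessian)
    scale = [1.0 / math.sqrt(math.sqrt(v) + eps) for v in second_moment]
    sym = [[scale[i] * hessian[i][j] * scale[j] for j in range(n)] for i in range(n)]
    vec = [1.0 / math.sqrt(n)] * n
    for _ in range(iters):
        nxt = [sum(sym[i][j] * vec[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(c * c for c in nxt))
        vec = [c / norm for c in nxt]
    # Rayleigh quotient of the converged unit vector.
    return sum(vec[i] * sum(sym[i][j] * vec[j] for j in range(n)) for i in range(n))


# Toy Hessian with one high-curvature direction (illustrative values only).
hessian = [[400.0, 20.0], [20.0, 4.0]]

# With v = [1, 1] the preconditioner is close to the identity,
# so this is essentially the raw sharpness.
raw = preconditioned_sharpness(hessian, second_moment=[1.0, 1.0])

# A larger second-moment entry on the sharp coordinate shrinks the
# preconditioned sharpness while the raw Hessian stays the same.
adapted = preconditioned_sharpness(hessian, second_moment=[10000.0, 1.0])

print(f"raw sharpness ~ {raw:.3f}, preconditioned sharpness ~ {adapted:.3f}")
```

The point of the sketch is only that the quantity constrained by the stability threshold is the preconditioned sharpness, so the raw curvature can stay high as long as the preconditioner grows along the sharp directions.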

The authors designed experiments to study these phenomena and provide empirical evidence to support their claims. The findings in this paper can serve as a foundation for the research community's future understanding of adaptive gradient methods in deep learning.

Critical Analysis

The paper provides valuable insights into the training dynamics of adaptive gradient methods, like Adam, which are widely used in deep learning. The authors' observations about the stability threshold and the "Adaptive Edge of Stability" offer a new perspective on how these algorithms behave during optimization.

One potential limitation of the study is that it focuses on the full-batch and sufficiently large batch settings, which may not fully capture the behavior of adaptive methods in the more common mini-batch training regime. The authors acknowledge this and suggest that similar effects are observed in mini-batch training, but further investigation may be needed to fully understand the implications.

Additionally, the paper does not delve into the potential implications of adaptive methods' ability to advance into high-curvature regions of the loss landscape. While this behavior may be beneficial in some cases, it could also lead to potential issues, such as poor generalization or adversarial vulnerability. Further research is needed to explore these potential tradeoffs and develop a more comprehensive understanding of the advantages and drawbacks of adaptive gradient methods.

Overall, this paper provides a valuable contribution to the understanding of adaptive gradient methods in deep learning, and the findings can serve as a foundation for future research in this area.

Conclusion

This paper sheds light on the training dynamics of adaptive gradient methods, such as Adam, in the full-batch and sufficiently large batch settings. The authors' key findings include the observation of a "stability threshold" for the maximum eigenvalue of the preconditioned Hessian, and the discovery that adaptive methods operate at the "Adaptive Edge of Stability" (AEoS), which differs from the "Edge of Stability" (EoS) reached by non-adaptive methods.

These insights can help the research community better understand the behavior of adaptive gradient methods, which are widely used in deep learning. The findings could lead to improvements in training stability, generalization, and the development of more advanced optimization algorithms. Further research is needed to fully explore the implications of adaptive methods' ability to advance into high-curvature regions of the loss landscape and the potential tradeoffs involved.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

High dimensional analysis reveals conservative sharpening and a stochastic edge of stability

Atish Agarwala, Jeffrey Pennington

Recent empirical and theoretical work has shown that the dynamics of the large eigenvalues of the training loss Hessian have some remarkably robust features across models and datasets in the full batch regime. There is often an early period of progressive sharpening where the large eigenvalues increase, followed by stabilization at a predictable value known as the edge of stability. Previous work showed that in the stochastic setting, the eigenvalues increase more slowly - a phenomenon we call conservative sharpening. We provide a theoretical analysis of a simple high-dimensional model which shows the origin of this slowdown. We also show that there is an alternative stochastic edge of stability which arises at small batch size that is sensitive to the trace of the Neural Tangent Kernel rather than the large Hessian eigenvalues. We conduct an experimental study which highlights the qualitative differences from the full batch phenomenology, and suggests that controlling the stochastic edge of stability can help optimization.

5/1/2024

cs.LG

Sharpness-Aware Minimization and the Edge of Stability

Philip M. Long, Peter L. Bartlett

Recent experiments have shown that, often, when training a neural network with gradient descent (GD) with a step size $\eta$, the operator norm of the Hessian of the loss grows until it approximately reaches $2/\eta$, after which it fluctuates around this value. The quantity $2/\eta$ has been called the edge of stability based on consideration of a local quadratic approximation of the loss. We perform a similar calculation to arrive at an edge of stability for Sharpness-Aware Minimization (SAM), a variant of GD which has been shown to improve its generalization. Unlike the case for GD, the resulting SAM-edge depends on the norm of the gradient. Using three deep learning training tasks, we see empirically that SAM operates on the edge of stability identified by this analysis.

4/10/2024

cs.LG cs.NE stat.ML

Training on the Edge of Stability Is Caused by Layerwise Jacobian Alignment

Mark Lowell, Catharine Kastner

During neural network training, the sharpness of the Hessian matrix of the training loss rises until training is on the edge of stability. As a result, even nonstochastic gradient descent does not accurately model the underlying dynamical system defined by the gradient flow of the training loss. We use an exponential Euler solver to train the network without entering the edge of stability, so that we accurately approximate the true gradient descent dynamics. We demonstrate experimentally that the increase in the sharpness of the Hessian matrix is caused by the layerwise Jacobian matrices of the network becoming aligned, so that a small change in the network preactivations near the inputs of the network can cause a large change in the outputs of the network. We further demonstrate that the degree of alignment scales with the size of the dataset by a power law with a coefficient of determination between 0.74 and 0.98.

6/4/2024

stat.ML cs.LG

A PDE-based Explanation of Extreme Numerical Sensitivities and Edge of Stability in Training Neural Networks

Yuxin Sun, Dong Lao, Ganesh Sundaramoorthi, Anthony Yezzi

We discover restrained numerical instabilities in current training practices of deep networks with stochastic gradient descent (SGD), and its variants. We show numerical error (on the order of the smallest floating point bit and thus the most extreme or limiting numerical perturbations) induced from floating point arithmetic in training deep nets can be amplified significantly and result in significant test accuracy variance (sensitivities), comparable to the test accuracy variance due to stochasticity in SGD. We show how this is likely traced to instabilities of the optimization dynamics that are restrained, i.e., localized over iterations and regions of the weight tensor space. We do this by presenting a theoretical framework using numerical analysis of partial differential equations (PDE), and analyzing the gradient descent PDE of convolutional neural networks (CNNs). We show that it is stable only under certain conditions on the learning rate and weight decay. We show that rather than blowing up when the conditions are violated, the instability can be restrained. We show this is a consequence of the non-linear PDE associated with the gradient descent of the CNN, whose local linearization changes when over-driving the step size of the discretization, resulting in a stabilizing effect. We link restrained instabilities to the recently discovered Edge of Stability (EoS) phenomena, in which the stable step size predicted by classical theory is exceeded while continuing to optimize the loss and still converging. Because restrained instabilities occur at the EoS, our theory provides new insights and predictions about the EoS, in particular, the role of regularization and the dependence on the network complexity.

6/13/2024

cs.LG cs.NA