A PDE-based Explanation of Extreme Numerical Sensitivities and Edge of Stability in Training Neural Networks

2206.02001

Published 6/13/2024 by Yuxin Sun, Dong Lao, Ganesh Sundaramoorthi, Anthony Yezzi

Abstract

We discover restrained numerical instabilities in current training practices of deep networks with stochastic gradient descent (SGD), and its variants. We show numerical error (on the order of the smallest floating point bit and thus the most extreme or limiting numerical perturbations induced from floating point arithmetic in training deep nets) can be amplified significantly and result in significant test accuracy variance (sensitivities), comparable to the test accuracy variance due to stochasticity in SGD. We show how this is likely traced to instabilities of the optimization dynamics that are restrained, i.e., localized over iterations and regions of the weight tensor space. We do this by presenting a theoretical framework using numerical analysis of partial differential equations (PDE), and analyzing the gradient descent PDE of convolutional neural networks (CNNs). We show that it is stable only under certain conditions on the learning rate and weight decay. We show that rather than blowing up when the conditions are violated, the instability can be restrained. We show this is a consequence of the non-linear PDE associated with the gradient descent of the CNN, whose local linearization changes when over-driving the step size of the discretization, resulting in a stabilizing effect. We link restrained instabilities to the recently discovered Edge of Stability (EoS) phenomena, in which the stable step size predicted by classical theory is exceeded while continuing to optimize the loss and still converging. Because restrained instabilities occur at the EoS, our theory provides new insights and predictions about the EoS, in particular, the role of regularization and the dependence on the network complexity.

Overview

The paper investigates numerical instabilities in the training of deep neural networks using stochastic gradient descent (SGD) and its variants.
It shows that numerical errors from floating-point arithmetic can be significantly amplified during training, leading to large variations in test accuracy.
The paper proposes a theoretical framework using partial differential equations (PDEs) to analyze the gradient descent dynamics of convolutional neural networks (CNNs).
It demonstrates that the CNN gradient descent PDE is only stable under certain conditions on the learning rate and weight decay, and that when these conditions are violated, the instability can be "restrained" rather than blowing up.
The paper links these "restrained instabilities" to the recently discovered "Edge of Stability" (EoS) phenomenon, where the stable step size predicted by classical theory is exceeded while the network continues to optimize and converge.

Plain English Explanation

When training deep neural networks using stochastic gradient descent (SGD) and similar optimization methods, numerical instabilities can lead to significant variations in the final test accuracy. The source is floating-point arithmetic: computers store real numbers with finite precision, so every operation introduces a tiny rounding error, and training can amplify those errors.
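To make the scale of these rounding errors concrete, here is a minimal Python sketch. It is an illustration only, not an experiment from the paper: it shows that regrouping the same three additions changes the result by about one unit in the last bit of a double-precision number. The variable names are hypothetical.

```python
import sys

# Floating-point addition is not associative: regrouping the same
# three numbers can change the result in the last bit.
left_to_right = (0.1 + 0.2) + 0.3
right_to_left = 0.1 + (0.2 + 0.3)
difference = abs(left_to_right - right_to_left)

# Machine epsilon for double precision: the gap between 1.0 and the
# next representable number.
eps = sys.float_info.epsilon

print("same result:", left_to_right == right_to_left)
print("difference:", difference, "machine epsilon:", eps)
```

Perturbations of this size are the "smallest floating point bit" errors the abstract refers to. In practice they can arise, for example, when a parallel reduction sums the same values in a different order from one run to the next.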

The researchers behind this paper have developed a theoretical framework to better understand these instabilities. They used a mathematical tool called partial differential equations (PDEs) to analyze the gradient descent dynamics of convolutional neural networks (CNNs). This analysis showed that the CNN gradient descent PDE is only stable under certain conditions on the learning rate (the step size taken during optimization) and weight decay (a technique to prevent overfitting).

Importantly, the researchers found that when these stability conditions are violated, the instability doesn't necessarily cause the training to "blow up" and fail completely. Instead, the instability is "restrained" or localized, which can still allow the network to continue optimizing and converging to a solution. This restrained instability phenomenon is linked to the "Edge of Stability" (EoS), where the network can still learn effectively even when the step size exceeds the classical stability limit.

The paper's insights into these restrained instabilities provide new understanding of the EoS and the role of regularization (such as weight decay) in deep network optimization. This knowledge could lead to better methods for designing and training stable neural networks that are more robust to numerical issues.

Technical Explanation

The researchers use a theoretical framework based on the analysis of partial differential equations (PDEs) to study the gradient descent dynamics of convolutional neural networks (CNNs). They show that the CNN gradient descent PDE is only stable under certain conditions on the learning rate and weight decay.
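For intuition about how a stability condition can couple the learning rate and weight decay, the following sketch uses the textbook one-dimensional quadratic, not the CNN gradient descent PDE analyzed in the paper. It assumes a loss of 0.5 * (curvature + weight_decay) * w**2, for which gradient descent converges only when lr < 2 / (curvature + weight_decay). The function names and numbers are hypothetical.

```python
def run_gd(curvature, weight_decay, lr, steps=100, w0=1.0):
    """Gradient descent on the toy loss 0.5 * (curvature + weight_decay) * w**2."""
    w = w0
    for _ in range(steps):
        grad = (curvature + weight_decay) * w
        w = w - lr * grad
    return w


def classical_max_lr(curvature, weight_decay):
    """Largest stable step size for the toy quadratic above (an assumption
    of this sketch, not the paper's condition)."""
    return 2.0 / (curvature + weight_decay)


curvature, weight_decay = 10.0, 0.5
limit = classical_max_lr(curvature, weight_decay)

# Just below the limit the iterate shrinks toward 0; just above, it grows.
w_stable = run_gd(curvature, weight_decay, lr=0.9 * limit)
w_unstable = run_gd(curvature, weight_decay, lr=1.1 * limit)

print("limit:", limit)
print("below limit:", w_stable, "above limit:", w_unstable)
```

In this toy model, weight decay adds to the effective curvature and so lowers the largest stable learning rate. The paper's conditions for CNNs are derived differently and may not take this form.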

Specifically, they demonstrate that when these stability conditions are violated, the instability does not necessarily cause the training to diverge or "blow up." Instead, the instability is "restrained" or localized, meaning it is confined to certain regions of the weight tensor space and iterations of the optimization process.

The paper links these restrained instabilities to the recently discovered "Edge of Stability" (EoS) phenomenon, where the stable step size predicted by classical optimization theory is exceeded while the network continues to optimize and converge. The researchers provide new insights into the EoS, including the role of regularization and the dependence on network complexity.
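In the EoS literature, "exceeding the classical stable step size" is usually stated as the sharpness (the largest eigenvalue of the loss Hessian) rising above 2 / lr for plain gradient descent. The sketch below shows one way such a check could look, using power iteration on a small, made-up symmetric matrix standing in for a Hessian. The matrix, learning rate, and function name are assumptions for illustration.

```python
import math


def top_eigenvalue(matrix, iters=200):
    """Estimate the largest-magnitude eigenvalue of a symmetric matrix
    by power iteration, returning the final Rayleigh quotient."""
    n = len(matrix)
    v = [1.0] * n
    for _ in range(iters):
        mv = [sum(matrix[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in mv))
        v = [x / norm for x in mv]
    mv = [sum(matrix[i][j] * v[j] for j in range(n)) for i in range(n)]
    return sum(v[i] * mv[i] for i in range(n))


# Hypothetical 2x2 "Hessian" and learning rate.
hessian = [[4.0, 1.0], [1.0, 3.0]]
lr = 0.5

sharpness = top_eigenvalue(hessian)
threshold = 2.0 / lr
exceeds = sharpness > threshold

print("sharpness:", sharpness, "threshold 2/lr:", threshold, "exceeds:", exceeds)
```

For a real network the Hessian is never formed explicitly. The same iteration would be run with Hessian-vector products, which this sketch does not attempt.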

The authors suggest that these restrained instabilities are a consequence of the non-linear PDE associated with the gradient descent of the CNN, where the local linearization changes when the step size of the discretization is over-driven, resulting in a stabilizing effect.
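A one-dimensional toy problem can illustrate how a non-linearity might restrain an instability. This is a sketch under stated assumptions, not the paper's PDE analysis. It compares gradient descent on a quadratic loss, whose curvature is 1 everywhere, with gradient descent on sqrt(1 + w**2) - 1, whose curvature is 1 at the minimum and decays away from it. Both use a step size of 3, above the classical limit of 2 at the minimum. The losses and function names are hypothetical choices.

```python
import math


def gd_quadratic(lr, steps, w0):
    """GD on 0.5 * w**2, whose curvature is 1 everywhere."""
    w, path = w0, [w0]
    for _ in range(steps):
        w = w - lr * w
        path.append(w)
    return path


def gd_saturating(lr, steps, w0):
    """GD on sqrt(1 + w**2) - 1, whose curvature is 1 at w = 0 and
    decays as |w| grows."""
    w, path = w0, [w0]
    for _ in range(steps):
        w = w - lr * w / math.sqrt(1.0 + w * w)
        path.append(w)
    return path


lr = 3.0  # above the classical limit 2 / curvature = 2 at the minimum
quad_path = gd_quadratic(lr, 200, 1e-6)
sat_path = gd_saturating(lr, 200, 1e-6)

print("quadratic |w| after 200 steps:", abs(quad_path[-1]))
print("saturating max |w| over run:", max(abs(w) for w in sat_path))
print("saturating |w| after 200 steps:", abs(sat_path[-1]))
```

On the quadratic, the tiny initial perturbation doubles in magnitude every step. On the saturating loss it grows only until the local curvature has dropped enough, then settles into a bounded oscillation around the minimum. This toy does not reproduce the continued loss decrease reported at the EoS; it only shows an instability that stays bounded instead of blowing up.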

The theoretical framework presented in this paper could inform the development of more robust and stable neural network training methods, particularly in the context of the EoS and the role of regularization.

Critical Analysis

The paper provides a novel and insightful theoretical analysis of numerical instabilities in deep neural network training, which is an important and understudied issue in the field. The researchers' use of PDE analysis to study the gradient descent dynamics of CNNs is a unique and potentially valuable approach.

One limitation of the work is that it is primarily focused on the theoretical analysis and does not include extensive experimental validation of the proposed framework. While the theoretical insights are compelling, it would be helpful to see how well the predictions from the PDE analysis align with empirical observations of training dynamics in practice.

Additionally, the paper does not delve into the practical implications of the restrained instability phenomenon or provide clear guidance on how to leverage this understanding to improve neural network training. Further research could explore more concrete design principles or optimization techniques that take advantage of the insights from this work.

Overall, this paper makes a valuable contribution to the understanding of numerical issues in deep learning and opens up new avenues for research into more stable and robust training methods. However, additional empirical validation and translation of the theoretical findings into practical solutions would strengthen the impact of this work.

Conclusion

This paper presents a novel theoretical framework based on partial differential equation (PDE) analysis to study the numerical instabilities that can arise in the training of deep neural networks using stochastic gradient descent (SGD) and its variants.

The key insights from this work are:

Numerical errors from floating-point arithmetic can be significantly amplified during training, leading to large variations in test accuracy.
The gradient descent PDE of convolutional neural networks (CNNs) is only stable under certain conditions on the learning rate and weight decay.
When these stability conditions are violated, the instability can be "restrained" rather than causing the training to diverge completely.
These restrained instabilities are linked to the "Edge of Stability" (EoS) phenomenon, where the network can continue to optimize and converge even when exceeding the classical stability limit.

The theoretical framework developed in this paper provides new understanding of the EoS and the role of regularization in deep network optimization. This knowledge could inform the development of more robust and stable neural network training methods, potentially leading to more reliable and consistent performance in real-world applications.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

High dimensional analysis reveals conservative sharpening and a stochastic edge of stability

Atish Agarwala, Jeffrey Pennington

Recent empirical and theoretical work has shown that the dynamics of the large eigenvalues of the training loss Hessian have some remarkably robust features across models and datasets in the full batch regime. There is often an early period of progressive sharpening where the large eigenvalues increase, followed by stabilization at a predictable value known as the edge of stability. Previous work showed that in the stochastic setting, the eigenvalues increase more slowly - a phenomenon we call conservative sharpening. We provide a theoretical analysis of a simple high-dimensional model which shows the origin of this slowdown. We also show that there is an alternative stochastic edge of stability which arises at small batch size that is sensitive to the trace of the Neural Tangent Kernel rather than the large Hessian eigenvalues. We conduct an experimental study which highlights the qualitative differences from the full batch phenomenology, and suggests that controlling the stochastic edge of stability can help optimization.

5/1/2024

cs.LG

Training on the Edge of Stability Is Caused by Layerwise Jacobian Alignment

Mark Lowell, Catharine Kastner

During neural network training, the sharpness of the Hessian matrix of the training loss rises until training is on the edge of stability. As a result, even nonstochastic gradient descent does not accurately model the underlying dynamical system defined by the gradient flow of the training loss. We use an exponential Euler solver to train the network without entering the edge of stability, so that we accurately approximate the true gradient descent dynamics. We demonstrate experimentally that the increase in the sharpness of the Hessian matrix is caused by the layerwise Jacobian matrices of the network becoming aligned, so that a small change in the network preactivations near the inputs of the network can cause a large change in the outputs of the network. We further demonstrate that the degree of alignment scales with the size of the dataset by a power law with a coefficient of determination between 0.74 and 0.98.

6/4/2024

stat.ML cs.LG

Stabilizing Policy Gradients for Stochastic Differential Equations via Consistency with Perturbation Process

Xiangxin Zhou, Liang Wang, Yichi Zhou

Considering generating samples with high rewards, we focus on optimizing deep neural networks parameterized stochastic differential equations (SDEs), the advanced generative models with high expressiveness, with policy gradient, the leading algorithm in reinforcement learning. Nevertheless, when applying policy gradients to SDEs, since the policy gradient is estimated on a finite set of trajectories, it can be ill-defined, and the policy behavior in data-scarce regions may be uncontrolled. This challenge compromises the stability of policy gradients and negatively impacts sample complexity. To address these issues, we propose constraining the SDE to be consistent with its associated perturbation process. Since the perturbation process covers the entire space and is easy to sample, we can mitigate the aforementioned problems. Our framework offers a general approach allowing for a versatile selection of policy gradient methods to effectively and efficiently train SDEs. We evaluate our algorithm on the task of structure-based drug design and optimize the binding affinity of generated ligand molecules. Our method achieves the best Vina score -9.07 on the CrossDocked2020 dataset.

6/27/2024

cs.LG

Adaptive Gradient Methods at the Edge of Stability

Jeremy M. Cohen, Behrooz Ghorbani, Shankar Krishnan, Naman Agarwal, Sourabh Medapati, Michal Badura, Daniel Suo, David Cardoze, Zachary Nado, George E. Dahl, Justin Gilmer

Very little is known about the training dynamics of adaptive gradient methods like Adam in deep learning. In this paper, we shed light on the behavior of these algorithms in the full-batch and sufficiently large batch settings. Specifically, we empirically demonstrate that during full-batch training, the maximum eigenvalue of the preconditioned Hessian typically equilibrates at a certain numerical value -- the stability threshold of a gradient descent algorithm. For Adam with step size $\eta$ and $\beta_1 = 0.9$, this stability threshold is $38/\eta$. Similar effects occur during minibatch training, especially as the batch size grows. Yet, even though adaptive methods train at the "Adaptive Edge of Stability" (AEoS), their behavior in this regime differs in a significant way from that of non-adaptive methods at the EoS. Whereas non-adaptive algorithms at the EoS are blocked from entering high-curvature regions of the loss landscape, adaptive gradient methods at the AEoS can keep advancing into high-curvature regions, while adapting the preconditioner to compensate. Our findings can serve as a foundation for the community's future understanding of adaptive gradient methods in deep learning.

4/17/2024

cs.LG