Sharpness-Aware Minimization and the Edge of Stability

2309.12488

Published 4/10/2024 by Philip M. Long, Peter L. Bartlett

Sharpness-Aware Minimization and the Edge of Stability

Abstract

Recent experiments have shown that, often, when training a neural network with gradient descent (GD) with a step size $eta$, the operator norm of the Hessian of the loss grows until it approximately reaches $2/eta$, after which it fluctuates around this value. The quantity $2/eta$ has been called the edge of stability based on consideration of a local quadratic approximation of the loss. We perform a similar calculation to arrive at an edge of stability for Sharpness-Aware Minimization (SAM), a variant of GD which has been shown to improve its generalization. Unlike the case for GD, the resulting SAM-edge depends on the norm of the gradient. Using three deep learning training tasks, we see empirically that SAM operates on the edge of stability identified by this analysis.

Create account to get full access

Overview

This paper explores a new optimization technique called "Sharpness-Aware Minimization" (SAM) and its implications for machine learning model stability.
The authors derive the SAM algorithm and demonstrate its ability to find flatter, more stable minima in the loss landscape compared to standard gradient descent.
Experiments show that SAM can improve the robustness and generalization of machine learning models, especially in the face of distribution shift or adversarial attacks.
The paper provides insights into the "edge of stability" for neural networks, exploring the delicate balance between model expressivity and stability.

Plain English Explanation

The paper examines a new optimization technique called "Sharpness-Aware Minimization" (SAM) and how it can help make machine learning models more stable and robust. Gradient descent is a common algorithm used to train machine learning models, but it can sometimes find solutions that are too "sharp" or sensitive to small changes in the input. This can make the models less stable and prone to errors, especially when faced with new or adversarial data.

The authors of this paper derived the SAM algorithm, which tries to find "flatter" minima in the loss landscape during training. These flatter minima correspond to more stable and robust models that are less sensitive to small perturbations. Experiments showed that models trained with SAM were able to generalize better and were more resilient to distribution shift or adversarial attacks compared to models trained with standard gradient descent.

The paper also explores the idea of the "edge of stability" for neural networks. This refers to the delicate balance between a model's expressivity (its ability to capture complex patterns in the data) and its stability (its ability to generalize well and be robust to changes). The authors suggest that SAM can help push models closer to this edge, allowing them to be more expressive without becoming unstable.

Technical Explanation

The key contribution of this paper is the derivation and analysis of the "Sharpness-Aware Minimization" (SAM) algorithm for training machine learning models. SAM is an optimization technique that aims to find flatter, more stable minima in the loss landscape compared to standard gradient descent.

The authors start by formulating the SAM optimization problem, which involves minimizing the maximum loss within a small neighborhood around each training example. This encourages the model to find solutions that are robust to small perturbations in the input, rather than solutions that are overly sensitive to specific data points.

The paper then provides a detailed derivation of the SAM algorithm, which involves alternating between a "inner maximization" step to find the worst-case loss within the neighborhood, and an "outer minimization" step to update the model parameters. The authors show that SAM can be efficiently implemented using only a few additional gradient computations compared to standard gradient descent.

Experiments on various machine learning tasks, including image classification and language modeling, demonstrate the benefits of SAM. Models trained with SAM exhibit better generalization performance, improved robustness to distribution shift, and increased resilience to adversarial attacks compared to models trained with standard gradient descent.

The authors also provide an in-depth analysis of the "edge of stability" for neural networks, exploring the tradeoff between model expressivity and stability. They show that SAM can help push models closer to this edge, allowing them to capture more complex patterns in the data without becoming overly sensitive or unstable.

Critical Analysis

The paper presents a compelling optimization technique in SAM and provides strong experimental evidence of its benefits for improving model robustness and generalization. However, there are a few potential limitations and areas for further research:

Computational Overhead: The SAM algorithm requires additional gradient computations compared to standard gradient descent, which could increase the training time for large-scale models. The authors discuss ways to mitigate this, but the practical implications for real-world deployments should be further investigated.
Hyperparameter Sensitivity: The performance of SAM appears to be sensitive to the choice of hyperparameters, such as the size of the neighborhood around each training example. Developing more robust or automated hyperparameter tuning methods could enhance the usability of SAM.
Theoretical Understanding: While the paper provides a detailed derivation of the SAM algorithm, the underlying theoretical guarantees and the precise relationship between sharpness, stability, and generalization could be further explored. Strengthening the theoretical foundations could lead to even more principled and effective variants of SAM.
Applicability to Other Domains: The experiments in the paper focus primarily on computer vision and natural language processing tasks. Investigating the performance of SAM on a wider range of machine learning problems, such as graph-based tasks or reinforcement learning, could broaden the impact of this work.

Despite these potential areas for improvement, the SAM algorithm represents a significant contribution to the field of machine learning optimization. By explicitly considering the sharpness of the loss landscape, the authors have developed a technique that can lead to more stable and robust models, which is an important step towards building reliable and trustworthy AI systems.

Conclusion

This paper introduces "Sharpness-Aware Minimization" (SAM), a novel optimization technique for training machine learning models. The key insight of SAM is to minimize the maximum loss within a small neighborhood around each training example, rather than just the average loss. This encourages the model to find flatter, more stable minima in the loss landscape, leading to improved generalization and robustness.

Experimental results demonstrate the benefits of SAM across a range of tasks, including image classification, language modeling, and adversarial robustness. The paper also provides insights into the "edge of stability" for neural networks, suggesting that SAM can help push models closer to this delicate balance between expressivity and stability.

While the paper presents a compelling optimization method, there are still some areas for further exploration, such as reducing the computational overhead, improving hyperparameter tuning, and investigating the broader applicability of SAM. Nevertheless, the work represents an important contribution to the field of machine learning, with the potential to enable more reliable and trustworthy AI systems.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

🤯

High dimensional analysis reveals conservative sharpening and a stochastic edge of stability

Atish Agarwala, Jeffrey Pennington

Recent empirical and theoretical work has shown that the dynamics of the large eigenvalues of the training loss Hessian have some remarkably robust features across models and datasets in the full batch regime. There is often an early period of progressive sharpening where the large eigenvalues increase, followed by stabilization at a predictable value known as the edge of stability. Previous work showed that in the stochastic setting, the eigenvalues increase more slowly - a phenomenon we call conservative sharpening. We provide a theoretical analysis of a simple high-dimensional model which shows the origin of this slowdown. We also show that there is an alternative stochastic edge of stability which arises at small batch size that is sensitive to the trace of the Neural Tangent Kernel rather than the large Hessian eigenvalues. We conduct an experimental study which highlights the qualitative differences from the full batch phenomenology, and suggests that controlling the stochastic edge of stability can help optimization.

5/1/2024

cs.LG

Sharpness-Aware Minimization Enhances Feature Quality via Balanced Learning

Jacob Mitchell Springer, Vaishnavh Nagarajan, Aditi Raghunathan

Sharpness-Aware Minimization (SAM) has emerged as a promising alternative optimizer to stochastic gradient descent (SGD). The originally-proposed motivation behind SAM was to bias neural networks towards flatter minima that are believed to generalize better. However, recent studies have shown conflicting evidence on the relationship between flatness and generalization, suggesting that flatness does fully explain SAM's success. Sidestepping this debate, we identify an orthogonal effect of SAM that is beneficial out-of-distribution: we argue that SAM implicitly balances the quality of diverse features. SAM achieves this effect by adaptively suppressing well-learned features which gives remaining features opportunity to be learned. We show that this mechanism is beneficial in datasets that contain redundant or spurious features where SGD falls for the simplicity bias and would not otherwise learn all available features. Our insights are supported by experiments on real data: we demonstrate that SAM improves the quality of features in datasets containing redundant or spurious features, including CelebA, Waterbirds, CIFAR-MNIST, and DomainBed.

6/3/2024

cs.LG

Efficient Sharpness-Aware Minimization for Molecular Graph Transformer Models

Yili Wang, Kaixiong Zhou, Ninghao Liu, Ying Wang, Xin Wang

Sharpness-aware minimization (SAM) has received increasing attention in computer vision since it can effectively eliminate the sharp local minima from the training trajectory and mitigate generalization degradation. However, SAM requires two sequential gradient computations during the optimization of each step: one to obtain the perturbation gradient and the other to obtain the updating gradient. Compared with the base optimizer (e.g., Adam), SAM doubles the time overhead due to the additional perturbation gradient. By dissecting the theory of SAM and observing the training gradient of the molecular graph transformer, we propose a new algorithm named GraphSAM, which reduces the training cost of SAM and improves the generalization performance of graph transformer models. There are two key factors that contribute to this result: (i) textit{gradient approximation}: we use the updating gradient of the previous step to approximate the perturbation gradient at the intermediate steps smoothly (textbf{increases efficiency}); (ii) textit{loss landscape approximation}: we theoretically prove that the loss landscape of GraphSAM is limited to a small range centered on the expected loss of SAM (textbf{guarantees generalization performance}). The extensive experiments on six datasets with different tasks demonstrate the superiority of GraphSAM, especially in optimizing the model update process. The code is in:https://github.com/YL-wang/GraphSAM/tree/graphsam

6/21/2024

cs.LG

A Universal Class of Sharpness-Aware Minimization Algorithms

Behrooz Tahmasebi, Ashkan Soleymani, Dara Bahri, Stefanie Jegelka, Patrick Jaillet

Recently, there has been a surge in interest in developing optimization algorithms for overparameterized models as achieving generalization is believed to require algorithms with suitable biases. This interest centers on minimizing sharpness of the original loss function; the Sharpness-Aware Minimization (SAM) algorithm has proven effective. However, most literature only considers a few sharpness measures, such as the maximum eigenvalue or trace of the training loss Hessian, which may not yield meaningful insights for non-convex optimization scenarios like neural networks. Additionally, many sharpness measures are sensitive to parameter invariances in neural networks, magnifying significantly under rescaling parameters. Motivated by these challenges, we introduce a new class of sharpness measures in this paper, leading to new sharpness-aware objective functions. We prove that these measures are textit{universally expressive}, allowing any function of the training loss Hessian matrix to be represented by appropriate hyperparameters. Furthermore, we show that the proposed objective functions explicitly bias towards minimizing their corresponding sharpness measures, and how they allow meaningful applications to models with parameter invariances (such as scale-invariances). Finally, as instances of our proposed general framework, we present textit{Frob-SAM} and textit{Det-SAM}, which are specifically designed to minimize the Frobenius norm and the determinant of the Hessian of the training loss, respectively. We also demonstrate the advantages of our general framework through extensive experiments.

6/11/2024

cs.LG