A Universal Class of Sharpness-Aware Minimization Algorithms

2406.03682

Published 6/11/2024 by Behrooz Tahmasebi, Ashkan Soleymani, Dara Bahri, Stefanie Jegelka, Patrick Jaillet

A Universal Class of Sharpness-Aware Minimization Algorithms

Abstract

Recently, there has been a surge in interest in developing optimization algorithms for overparameterized models as achieving generalization is believed to require algorithms with suitable biases. This interest centers on minimizing sharpness of the original loss function; the Sharpness-Aware Minimization (SAM) algorithm has proven effective. However, most literature only considers a few sharpness measures, such as the maximum eigenvalue or trace of the training loss Hessian, which may not yield meaningful insights for non-convex optimization scenarios like neural networks. Additionally, many sharpness measures are sensitive to parameter invariances in neural networks, magnifying significantly under rescaling parameters. Motivated by these challenges, we introduce a new class of sharpness measures in this paper, leading to new sharpness-aware objective functions. We prove that these measures are textit{universally expressive}, allowing any function of the training loss Hessian matrix to be represented by appropriate hyperparameters. Furthermore, we show that the proposed objective functions explicitly bias towards minimizing their corresponding sharpness measures, and how they allow meaningful applications to models with parameter invariances (such as scale-invariances). Finally, as instances of our proposed general framework, we present textit{Frob-SAM} and textit{Det-SAM}, which are specifically designed to minimize the Frobenius norm and the determinant of the Hessian of the training loss, respectively. We also demonstrate the advantages of our general framework through extensive experiments.

Create account to get full access

Overview

This paper introduces a novel class of optimization algorithms called Sharpness-Aware Minimization (SAM), which aims to find "flat" minima in the loss landscape of machine learning models.
Flat minima are thought to generalize better than sharp minima, as they are less sensitive to small perturbations in the inputs.
The authors demonstrate the effectiveness of SAM on a variety of tasks, including image classification, language modeling, and reinforcement learning.

Plain English Explanation

The goal of this research is to develop optimization algorithms that can train machine learning models to perform well not just on the data they were trained on, but also on new, unseen data. The key insight is that models that reach "flat" minima in the loss landscape - where the loss function is relatively constant over a wide region - tend to generalize better than models that reach "sharp" minima - where the loss function changes rapidly.

To achieve this, the authors propose a new class of optimization algorithms called Sharpness-Aware Minimization (SAM). These algorithms don't just try to minimize the loss function, but also try to find minima that are as "flat" as possible. This helps the model learn features that are more robust and less sensitive to small changes in the input data.

The authors demonstrate the effectiveness of SAM on a variety of tasks, such as image classification, language modeling, and reinforcement learning. They show that models trained using SAM outperform models trained using standard optimization techniques, especially when the training data is limited or noisy.

Technical Explanation

The core idea behind Sharpness-Aware Minimization (SAM) is to incorporate a measure of the "sharpness" of the loss function into the optimization process. Specifically, the authors define the "sharpness" of a point in the loss landscape as the maximum change in the loss function when the model parameters are perturbed by a small amount.

The SAM optimization algorithm then tries to find model parameters that minimize both the loss function and the sharpness measure. This encourages the model to converge to flat minima, which are less sensitive to small changes in the inputs and are therefore more likely to generalize well to new data.

The authors provide a general framework for implementing SAM, which can be combined with a variety of existing optimization algorithms, such as gradient descent, genetic programming, and edge stability. They also show that SAM is related to adversarial training and random perturbations, which are other techniques for improving model generalization.

Critical Analysis

The authors provide a thorough analysis of the SAM algorithm and its performance on a variety of tasks. However, there are a few potential limitations and areas for further research:

The authors only consider a specific form of the sharpness measure, based on the maximum change in the loss function. It's possible that other definitions of sharpness could lead to even better results.
The computational overhead of SAM is higher than standard optimization algorithms, as it requires additional gradient computations. This may limit its practical applicability, especially for large-scale models.
The paper does not provide a deep theoretical understanding of why flat minima lead to better generalization. More work is needed to fully explain this phenomenon.
The authors only evaluate SAM on a limited set of tasks and datasets. It would be valuable to see how it performs on a wider range of machine learning problems.

Overall, the SAM algorithm represents an interesting and promising approach to improving model generalization, and the authors have made a valuable contribution to the field of machine learning optimization.

Conclusion

The Sharpness-Aware Minimization (SAM) algorithm introduced in this paper offers a novel way to train machine learning models that are more robust and generalizable. By incorporating a measure of the "sharpness" of the loss function into the optimization process, SAM encourages the model to converge to flat minima, which are less sensitive to small perturbations in the input data.

The authors demonstrate the effectiveness of SAM on a variety of tasks, including image classification, language modeling, and reinforcement learning. Their results show that models trained using SAM outperform those trained using standard optimization techniques, particularly when the training data is limited or noisy.

While the paper identifies some potential limitations and areas for further research, the SAM algorithm represents an important step forward in the quest to build machine learning models that can reliably perform well on real-world data. As the field of AI continues to advance, techniques like SAM will play a crucial role in developing models that are not only powerful, but also robust and trustworthy.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Sharpness-Aware Minimization Enhances Feature Quality via Balanced Learning

Jacob Mitchell Springer, Vaishnavh Nagarajan, Aditi Raghunathan

Sharpness-Aware Minimization (SAM) has emerged as a promising alternative optimizer to stochastic gradient descent (SGD). The originally-proposed motivation behind SAM was to bias neural networks towards flatter minima that are believed to generalize better. However, recent studies have shown conflicting evidence on the relationship between flatness and generalization, suggesting that flatness does fully explain SAM's success. Sidestepping this debate, we identify an orthogonal effect of SAM that is beneficial out-of-distribution: we argue that SAM implicitly balances the quality of diverse features. SAM achieves this effect by adaptively suppressing well-learned features which gives remaining features opportunity to be learned. We show that this mechanism is beneficial in datasets that contain redundant or spurious features where SGD falls for the simplicity bias and would not otherwise learn all available features. Our insights are supported by experiments on real data: we demonstrate that SAM improves the quality of features in datasets containing redundant or spurious features, including CelebA, Waterbirds, CIFAR-MNIST, and DomainBed.

6/3/2024

cs.LG

Sharpness-Aware Minimization in Genetic Programming

Illya Bakurov, Nathan Haut, Wolfgang Banzhaf

Sharpness-Aware Minimization (SAM) was recently introduced as a regularization procedure for training deep neural networks. It simultaneously minimizes the fitness (or loss) function and the so-called fitness sharpness. The latter serves as a measure of the nonlinear behavior of a solution and does so by finding solutions that lie in neighborhoods having uniformly similar loss values across all fitness cases. In this contribution, we adapt SAM for tree Genetic Programming (TGP) by exploring the semantic neighborhoods of solutions using two simple approaches. By capitalizing upon perturbing input and output of program trees, sharpness can be estimated and used as a second optimization criterion during the evolution. To better understand the impact of this variant of SAM on TGP, we collect numerous indicators of the evolutionary process, including generalization ability, complexity, diversity, and a recently proposed genotype-phenotype mapping to study the amount of redundancy in trees. The experimental results demonstrate that using any of the two proposed SAM adaptations in TGP allows (i) a significant reduction of tree sizes in the population and (ii) a decrease in redundancy of the trees. When assessed on real-world benchmarks, the generalization ability of the elite solutions does not deteriorate.

5/20/2024

cs.NE cs.LG

Efficient Sharpness-Aware Minimization for Molecular Graph Transformer Models

Yili Wang, Kaixiong Zhou, Ninghao Liu, Ying Wang, Xin Wang

Sharpness-aware minimization (SAM) has received increasing attention in computer vision since it can effectively eliminate the sharp local minima from the training trajectory and mitigate generalization degradation. However, SAM requires two sequential gradient computations during the optimization of each step: one to obtain the perturbation gradient and the other to obtain the updating gradient. Compared with the base optimizer (e.g., Adam), SAM doubles the time overhead due to the additional perturbation gradient. By dissecting the theory of SAM and observing the training gradient of the molecular graph transformer, we propose a new algorithm named GraphSAM, which reduces the training cost of SAM and improves the generalization performance of graph transformer models. There are two key factors that contribute to this result: (i) textit{gradient approximation}: we use the updating gradient of the previous step to approximate the perturbation gradient at the intermediate steps smoothly (textbf{increases efficiency}); (ii) textit{loss landscape approximation}: we theoretically prove that the loss landscape of GraphSAM is limited to a small range centered on the expected loss of SAM (textbf{guarantees generalization performance}). The extensive experiments on six datasets with different tasks demonstrate the superiority of GraphSAM, especially in optimizing the model update process. The code is in:https://github.com/YL-wang/GraphSAM/tree/graphsam

6/21/2024

cs.LG

Critical Influence of Overparameterization on Sharpness-aware Minimization

Sungbin Shin, Dongyeop Lee, Maksym Andriushchenko, Namhoon Lee

Training an overparameterized neural network can yield minimizers of different generalization capabilities despite the same level of training loss. Meanwhile, with evidence that suggests a strong correlation between the sharpness of minima and their generalization errors, increasing efforts have been made to develop optimization methods to explicitly find flat minima as more generalizable solutions. Despite its contemporary relevance to overparameterization, however, this sharpness-aware minimization (SAM) strategy has not been studied much yet as to exactly how it is affected by overparameterization. Hence, in this work, we analyze SAM under overparameterization of varying degrees and present both empirical and theoretical results that indicate a critical influence of overparameterization on SAM. At first, we conduct extensive numerical experiments across vision, language, graph, and reinforcement learning domains and show that SAM consistently improves with overparameterization. Next, we attribute this phenomenon to the interplay between the enlarged solution space and increased implicit bias from overparameterization. Further, we prove multiple theoretical benefits of overparameterization for SAM to attain (i) minima with more uniform Hessian moments compared to SGD, (ii) much faster convergence at a linear rate, and (iii) lower test error for two-layer networks. Last but not least, we discover that the effect of overparameterization is more significantly pronounced in practical settings of label noise and sparsity, and yet, sufficient regularization is necessary.

6/21/2024

cs.LG stat.ML