Gradient Descent with Polyak's Momentum Finds Flatter Minima via Large Catapults

Read original: arXiv:2311.15051 - Published 5/30/2024 by Prin Phunyaphibarn, Junghyun Lee, Bohan Wang, Huishuai Zhang, Chulhee Yun

Overview

The paper studies gradient descent with Polyak's momentum and reports that, with a large learning rate, it exhibits "large catapults" that drive the iterates toward flatter minima than those found by plain gradient descent.
The authors use a motivating example of linear diagonal networks to demonstrate how momentum can lead to these large, oscillatory updates.
They conduct an empirical study to investigate the impact of various techniques, such as warmup and gradient normalization, on mitigating the catapult effect.

Plain English Explanation

In machine learning, researchers often use optimization algorithms like gradient descent to train neural networks. These algorithms work by gradually adjusting the network's parameters to minimize a specific error or loss function. One popular technique is to add "momentum" to the updates, which helps the algorithm navigate the optimization landscape more effectively.
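To make the idea of "momentum" concrete, here is a minimal sketch of the standard Polyak (heavy-ball) update, v_{t+1} = beta * v_t - lr * grad(w_t), followed by w_{t+1} = w_t + v_{t+1}. The function name `heavy_ball`, the quadratic test loss, and the hyperparameter values are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def heavy_ball(grad, w0, lr=0.1, beta=0.9, steps=500):
    """Gradient descent with Polyak (heavy-ball) momentum.

    grad: function mapping parameters to the gradient of the loss.
    Returns the full trajectory, including the starting point.
    """
    w = np.asarray(w0, dtype=float).copy()
    v = np.zeros_like(w)           # velocity starts at zero
    history = [w.copy()]
    for _ in range(steps):
        v = beta * v - lr * grad(w)  # accumulate a decaying sum of past gradients
        w = w + v                    # move along the velocity
        history.append(w.copy())
    return np.array(history)

def quad_grad(w):
    # Gradient of the illustrative loss 0.5 * ||w||^2 (an assumption for this demo).
    return w

traj = heavy_ball(quad_grad, w0=[1.0, -2.0])
print("distance from minimum after training:", np.abs(traj[-1]).max())
```

Setting `beta=0.0` recovers plain gradient descent, which makes it easy to compare the two methods on the same loss.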

However, the paper's authors found that momentum can sometimes lead to large, oscillatory updates that they call "large catapults." These catapults can cause the optimization process to become unstable and even diverge, making it difficult to train the network effectively.

To address this issue, the authors explore the use of techniques like "warmup" and "gradient normalization." Warmup involves gradually increasing the momentum over the course of the training process, while gradient normalization adjusts the scale of the updates to prevent them from becoming too large.

Through a series of experiments on linear diagonal networks and nonlinear neural networks, the authors demonstrate that these techniques can help mitigate the catapult effect and lead to more stable optimization. The work clarifies how momentum-based gradient descent behaves at large learning rates, which can help practitioners choose optimizer settings more deliberately.

Technical Explanation

The paper focuses on the phenomenon of "large catapults" that can occur when using momentum-based gradient descent methods, such as Momentum Gradient Descent (MGD) and Nesterov Accelerated Gradient (NAG), to train neural networks.

The authors start by introducing a motivating example of linear diagonal networks, which they use to illustrate how momentum can lead to these large, oscillatory updates during the optimization process. They show that the catapult effect is more pronounced in higher-dimensional networks and is influenced by the choice of hyperparameters, such as the learning rate and momentum coefficient.
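The title's notion of a "flatter" minimum can be illustrated on a very small diagonal-style model. The sketch below assumes a two-parameter toy loss L(u, v) = 0.5 * (u * v - y)^2, which is a common stand-in for a diagonal linear network but is not necessarily the exact model used in the paper. Sharpness is measured here as the largest eigenvalue of the Hessian, and the helper names are hypothetical.

```python
import numpy as np

def loss(u, v, y=1.0):
    # Toy loss (assumed for illustration): fit the product u * v to the target y.
    return 0.5 * (u * v - y) ** 2

def hessian(u, v, y=1.0):
    # Analytic second derivatives of the toy loss with respect to (u, v).
    off_diag = 2.0 * u * v - y
    return np.array([[v * v, off_diag],
                     [off_diag, u * u]])

def sharpness(u, v, y=1.0):
    # Largest Hessian eigenvalue; eigvalsh returns eigenvalues in ascending order.
    return float(np.linalg.eigvalsh(hessian(u, v, y))[-1])

# Two global minima of the same loss (both satisfy u * v = 1):
balanced = (1.0, 1.0)
unbalanced = (2.0, 0.5)
print("loss:", loss(*balanced), loss(*unbalanced))
print("sharpness:", sharpness(*balanced), sharpness(*unbalanced))
```

Both points fit the target exactly, but at any minimum of this toy loss the sharpness equals u^2 + v^2, so the balanced solution is the flatter of the two. An optimizer that ends at the balanced point has found a "flatter minimum" in this sense.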

To better understand and mitigate the catapult effect, the authors conduct an empirical study on a variety of techniques, including:

Warmup: Gradually increasing the momentum coefficient over the course of the training process, rather than using a fixed value.
Gradient Normalization: Normalizing the gradients to prevent them from becoming too large and causing instability.
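The two techniques listed above can be sketched as small helper functions. This is a minimal illustration that follows the descriptions in this summary; the linear ramp, the function names `warmup_beta` and `normalize_gradient`, and the default values are assumptions rather than the authors' implementation.

```python
import numpy as np

def warmup_beta(step, warmup_steps, beta_max=0.9):
    """Momentum coefficient that ramps linearly from 0 to beta_max.

    The linear shape of the ramp is an assumption made for this sketch.
    """
    if warmup_steps <= 0:
        return beta_max
    return beta_max * min(1.0, step / warmup_steps)

def normalize_gradient(g, eps=1e-12):
    """Rescale a gradient to (approximately) unit Euclidean norm.

    eps avoids division by zero when the gradient vanishes.
    """
    g = np.asarray(g, dtype=float)
    return g / (np.linalg.norm(g) + eps)

# Example: coefficient at the start, midway, and after warmup has finished.
print([warmup_beta(s, warmup_steps=100) for s in (0, 50, 200)])
# Example: a gradient of norm 5 is rescaled to norm 1 with the same direction.
print(normalize_gradient([3.0, 4.0]))
```

In a training loop, `warmup_beta(step, warmup_steps)` would replace a fixed momentum coefficient, and `normalize_gradient` would be applied to the gradient before the momentum update, so that step size is controlled by the learning rate alone.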

The authors' experiments demonstrate that these techniques can effectively reduce the magnitude and frequency of the large catapults, leading to more stable and efficient optimization. They also discuss the relationship between the catapult effect and the "marginal value of momentum" concept, as well as the connection to other related research, such as Momentum-based Gradient Descent Methods on Lie Groups, Marginal Value of Momentum in Small Learning Rate SGD, and Quadratic Models for Understanding Catapult Dynamics in Neural Networks.

Critical Analysis

The paper provides a thorough and well-designed empirical study on the behavior of momentum-based gradient descent methods, specifically focusing on the catapult effect. The authors' use of the linear diagonal network as a motivating example is a clever way to illustrate the underlying dynamics and make the phenomenon more accessible.

One potential limitation is that the evidence for the proposed mechanism is concentrated on simple settings. According to the abstract, large catapults are observed empirically for both linear diagonal networks and nonlinear neural networks, but the hypothesis explaining them is supported only in a toy example and for linear diagonal networks. While insights from these simpler models provide useful guidance, it would be interesting to see the analysis extended to more complex architectures.

Additionally, the theoretical treatment is narrow in scope. The authors hypothesize that the large catapult is caused by momentum prolonging the self-stabilization effect (Damian et al., 2023), but they provide theoretical support only for a simple toy example. A more general mathematical analysis would strengthen the theoretical foundation and deepen understanding of the phenomenon.

Another area for further exploration would be the impact of the catapult effect on the generalization performance of trained models. The authors mention that the catapults can lead to instability and divergence, but it would be informative to understand how these large, oscillatory updates might affect the model's ability to generalize to unseen data.

Overall, the paper presents a compelling and well-executed study that contributes to our understanding of the behavior of momentum-based optimization methods. The insights gained from this research can inform the design of more robust and efficient machine learning algorithms, as evidenced by related work such as Random Scaling Momentum for Non-smooth Non-convex Optimization and Stochastic Normalized Gradient Descent with Momentum for Large Batch Optimization.

Conclusion

The paper's exploration of the "large catapult" phenomenon in momentum-based gradient descent methods provides valuable insights for the machine learning community. By demonstrating the impact of techniques like warmup and gradient normalization, the authors offer practical solutions to mitigate the instability and divergence caused by these large, oscillatory updates.

The findings from this research can help researchers and engineers design more robust and efficient optimization algorithms, leading to improved performance and reliability in a wide range of machine learning applications. As the field continues to advance, understanding the nuanced behavior of optimization methods like this will be increasingly important for developing cutting-edge models and systems.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Gradient Descent with Polyak's Momentum Finds Flatter Minima via Large Catapults

Prin Phunyaphibarn, Junghyun Lee, Bohan Wang, Huishuai Zhang, Chulhee Yun

Although gradient descent with Polyak's momentum is widely used in modern machine and deep learning, a concrete understanding of its effects on the training trajectory remains elusive. In this work, we empirically show that for linear diagonal networks and nonlinear neural networks, momentum gradient descent with a large learning rate displays large catapults, driving the iterates towards much flatter minima than those found by gradient descent. We hypothesize that the large catapult is caused by momentum prolonging the self-stabilization effect (Damian et al., 2023). We provide theoretical and empirical support for our hypothesis in a simple toy example and empirical evidence supporting our hypothesis for linear diagonal networks.

5/30/2024

Momentum-based gradient descent methods for Lie groups

Cédric M. Campos, David Martín de Diego, José Torrente

Polyak's Heavy Ball (PHB; Polyak, 1964), a.k.a. Classical Momentum, and Nesterov's Accelerated Gradient (NAG; Nesterov, 1983) are well-known examples of momentum-descent methods for optimization. While the latter outperforms the former, only generalizations of PHB-like methods to nonlinear spaces have been described in the literature. We propose here a generalization of NAG-like methods for Lie group optimization based on the variational one-to-one correspondence between classical and accelerated momentum methods (Campos et al., 2023). Numerical experiments are shown.

4/16/2024

Catapults in SGD: spikes in the training loss and their impact on generalization through feature learning

Libin Zhu, Chaoyue Liu, Adityanarayanan Radhakrishnan, Mikhail Belkin

In this paper, we first present an explanation regarding the common occurrence of spikes in the training loss when neural networks are trained with stochastic gradient descent (SGD). We provide evidence that the spikes in the training loss of SGD are catapults, an optimization phenomenon originally observed in GD with large learning rates in [Lewkowycz et al. 2020]. We empirically show that these catapults occur in a low-dimensional subspace spanned by the top eigenvectors of the tangent kernel, for both GD and SGD. Second, we posit an explanation for how catapults lead to better generalization by demonstrating that catapults promote feature learning by increasing alignment with the Average Gradient Outer Product (AGOP) of the true predictor. Furthermore, we demonstrate that a smaller batch size in SGD induces a larger number of catapults, thereby improving AGOP alignment and test performance.

6/7/2024

The Marginal Value of Momentum for Small Learning Rate SGD

Runzhe Wang, Sadhika Malladi, Tianhao Wang, Kaifeng Lyu, Zhiyuan Li

Momentum is known to accelerate the convergence of gradient descent in strongly convex settings without stochastic gradient noise. In stochastic optimization, such as training neural networks, folklore suggests that momentum may help deep learning optimization by reducing the variance of the stochastic gradient update, but previous theoretical analyses do not find momentum to offer any provable acceleration. Theoretical results in this paper clarify the role of momentum in stochastic settings where the learning rate is small and gradient noise is the dominant source of instability, suggesting that SGD with and without momentum behave similarly in the short and long time horizons. Experiments show that momentum indeed has limited benefits for both optimization and generalization in practical training regimes where the optimal learning rate is not very large, including small- to medium-batch training from scratch on ImageNet and fine-tuning language models on downstream tasks.

4/17/2024