Catapults in SGD: spikes in the training loss and their impact on generalization through feature learning

Read original: arXiv:2306.04815 - Published 6/7/2024 by Libin Zhu, Chaoyue Liu, Adityanarayanan Radhakrishnan, Mikhail Belkin

Overview

The paper presents an explanation for the common occurrence of spikes in the training loss when neural networks are trained using stochastic gradient descent (SGD).
It provides evidence that these spikes are "catapults," an optimization phenomenon originally observed in gradient descent (GD) with large learning rates.
The paper demonstrates that these catapults occur in a low-dimensional subspace spanned by the top eigenvectors of the tangent kernel, for both GD and SGD.
It also proposes an explanation for how catapults lead to better generalization by showing that they promote feature learning by increasing alignment with the Average Gradient Outer Product (AGOP) of the true predictor.
Additionally, the paper suggests that a smaller batch size in SGD induces a larger number of catapults, thereby improving AGOP alignment and test performance.

Plain English Explanation

When neural networks are trained with stochastic gradient descent (SGD), the training loss (a measure of how poorly the model fits the training data) often exhibits spikes, or sudden increases, during training. This paper aims to explain why these spikes, or "catapults," occur.

The researchers provide evidence that these spikes are the same phenomenon previously observed in plain gradient descent (GD) when the learning rate (a parameter that controls how much the model adjusts itself at each step) is large. They show that the catapults occur in a low-dimensional subspace spanned by the top eigenvectors of the "tangent kernel" (a matrix that records how similarly the network's outputs on different training examples respond to small changes in its parameters).

Importantly, the paper suggests that these catapults help the neural network learn better features, which can lead to improved performance on test data. The key insight is that catapults increase the alignment between the trained network's "average gradient outer product" (AGOP) and the AGOP of the true predictor. The AGOP averages, over the data, the outer product of a predictor's gradient with respect to its input, so it records which input directions the output depends on most.

Furthermore, the researchers find that using a smaller batch size (a parameter that determines how much data is used to compute each update during training) in SGD leads to more catapults, which in turn improves the AGOP alignment and ultimately results in better test performance.

Technical Explanation

The paper first presents an analysis of the spikes in the training loss that commonly occur when training neural networks using stochastic gradient descent (SGD). The researchers provide evidence that these spikes, referred to as "catapults," are an optimization phenomenon that was originally observed in gradient descent (GD) with large learning rates.
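To make the catapult concrete, here is a minimal sketch in Python with NumPy. It is not the paper's experimental setup: it assumes the two-layer linear toy model f = u·v/√m with a single training example (x = 1, y = 0), in the spirit of the analysis in [Lewkowycz et al. 2020]. The function name `run_gd`, the width `m`, the deterministic initialisation, and the two learning rates are illustrative choices. For this model the tangent kernel is a scalar λ, and gradient descent is locally stable only when η < 2/λ.

```python
import numpy as np

def run_gd(eta, m=512, steps=100):
    """Full-batch GD on f = u.v / sqrt(m) with one example (x = 1, y = 0).

    Hypothetical toy model, not the networks used in the paper.
    """
    # Deterministic initialisation (an assumption made for reproducibility):
    # u.v = 16, so the initial output is small but non-zero.
    u = np.ones(m)
    v = np.ones(m)
    v[: m // 2 - 8] = -1.0

    losses, kernels = [], []
    for _ in range(steps):
        f = u @ v / np.sqrt(m)
        losses.append(0.5 * f ** 2)          # squared loss with target y = 0
        kernels.append((u @ u + v @ v) / m)  # tangent kernel, a scalar here
        grad_u = f * v / np.sqrt(m)
        grad_v = f * u / np.sqrt(m)
        u, v = u - eta * grad_u, v - eta * grad_v
    return np.array(losses), np.array(kernels)

# Learning rate above the stability threshold 2 / lambda_0 (lambda_0 = 2 here).
eta = 1.5
losses, kernels = run_gd(eta)

# Learning rate below the threshold, for comparison.
small_eta = 0.5
small_losses, _ = run_gd(small_eta)

print("eta * lambda at start:", eta * kernels[0])
print("eta * lambda at end:  ", eta * kernels[-1])
print("initial / peak / final loss:", losses[0], losses.max(), losses[-1])
print("peak loss with small learning rate:", small_losses.max())
```

With the larger learning rate the loss first rises well above its initial value, the kernel shrinks until η·λ drops below 2, and the loss then converges. With the smaller learning rate the loss never exceeds its initial value.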

Through empirical analysis, the authors demonstrate that these catapults occur in a low-dimensional subspace spanned by the top eigenvectors of the tangent kernel, for both GD and SGD. The tangent kernel is the Gram matrix of the gradients of the network's outputs with respect to its parameters, evaluated on the training inputs. For squared loss, its largest eigenvalues are closely tied to the sharpest curvature directions of the loss landscape.
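The following sketch shows one way to compute a tangent kernel and split the training loss into a part inside the top eigenspace and a remainder. It assumes a small two-layer tanh network with a hand-written Jacobian; the sizes `n`, `d`, `m`, the subspace dimension `s`, and the toy target are hypothetical, and the code illustrates the decomposition only, not the paper's experiments.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, m, s = 20, 5, 64, 3  # samples, input dim, width, top-subspace size

X = rng.standard_normal((n, d))
y = X[:, 0] * X[:, 1]            # toy regression target (an assumption)
W = rng.standard_normal((m, d))  # first-layer weights
a = rng.standard_normal(m)       # second-layer weights

def forward(X, W, a):
    """Two-layer network f(x) = a . tanh(W x) / sqrt(m)."""
    return np.tanh(X @ W.T) @ a / np.sqrt(m)

def jacobian(X, W, a):
    """Gradient of each output with respect to all parameters, shape (n, m*d + m)."""
    H = np.tanh(X @ W.T)                                            # (n, m)
    dW = ((1 - H ** 2) * a)[:, :, None] * X[:, None, :] / np.sqrt(m)  # (n, m, d)
    da = H / np.sqrt(m)                                             # (n, m)
    return np.concatenate([dW.reshape(len(X), -1), da], axis=1)

J = jacobian(X, W, a)
K = J @ J.T                            # tangent kernel on the training set, (n, n)
eigvals, eigvecs = np.linalg.eigh(K)   # eigenvalues in ascending order
top, rest = eigvecs[:, -s:], eigvecs[:, :-s]

residual = forward(X, W, a) - y
loss_total = 0.5 * np.mean(residual ** 2)
loss_top = 0.5 * np.sum((top.T @ residual) ** 2) / n    # inside the top-s eigenspace
loss_rest = 0.5 * np.sum((rest.T @ residual) ** 2) / n  # everything else

print("largest eigenvalues:", eigvals[-s:])
print("loss in top subspace:", loss_top, "remaining loss:", loss_rest)
```

Because the eigenvectors form an orthonormal basis, the two components always sum to the total loss, so tracking them separately over training shows where a spike occurs.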

The paper then proposes an explanation for how these catapults lead to better generalization. The researchers show that catapults promote feature learning by increasing the alignment between the trained network's "Average Gradient Outer Product" (AGOP) and the AGOP of the true predictor. The AGOP of a predictor is the average, over the data, of the outer product of its gradient with respect to the input, so it captures the input directions to which the predictor is most sensitive.
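As a rough illustration, the AGOP of a predictor f can be written as G(f) = (1/n) Σ ∇f(x_i) ∇f(x_i)ᵀ, where the gradients are taken with respect to the input. The sketch below assumes that alignment is measured as the cosine similarity between two AGOP matrices under the Frobenius inner product. The toy target x₀·x₁ and the two hand-written "models" are hypothetical and only serve to show what high and low alignment look like.

```python
import numpy as np

def agop(input_grads):
    """Average gradient outer product from an (n, d) array of input gradients."""
    return input_grads.T @ input_grads / len(input_grads)

def agop_alignment(A, B):
    """Cosine similarity between two AGOP matrices (Frobenius inner product)."""
    return np.sum(A * B) / (np.linalg.norm(A) * np.linalg.norm(B))

rng = np.random.default_rng(0)
n, d = 200, 6
X = rng.standard_normal((n, d))

# True predictor f*(x) = x0 * x1, so its input gradient is (x1, x0, 0, ..., 0).
grad_true = np.zeros((n, d))
grad_true[:, 0] = X[:, 1]
grad_true[:, 1] = X[:, 0]

# Hypothetical model that mostly uses the right coordinates:
# f(x) = 0.5 * x0 * x1 + 0.1 * x2.
grad_good = np.zeros((n, d))
grad_good[:, 0] = 0.5 * X[:, 1]
grad_good[:, 1] = 0.5 * X[:, 0]
grad_good[:, 2] = 0.1

# Hypothetical model that uses the wrong coordinates: f(x) = x2 * x3.
grad_bad = np.zeros((n, d))
grad_bad[:, 2] = X[:, 3]
grad_bad[:, 3] = X[:, 2]

G_true, G_good, G_bad = agop(grad_true), agop(grad_good), agop(grad_bad)
align_good = agop_alignment(G_good, G_true)
align_bad = agop_alignment(G_bad, G_true)
print("alignment of good model:", align_good)
print("alignment of bad model: ", align_bad)
```

A model whose output depends on the same input directions as the target scores close to 1, while a model that relies on unrelated coordinates scores 0.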

Furthermore, the paper suggests that using a smaller batch size in SGD induces a larger number of catapults, which in turn improves the AGOP alignment and leads to better test performance. This finding may be related to grokking, in which a network starts to generalize well only long after it has fit the training data.
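Comparing batch sizes by their number of catapults requires some way to count spikes in a recorded loss curve. The helper below is a hypothetical heuristic, not the paper's criterion: it counts a spike whenever the loss exceeds `factor` times the lowest loss seen so far, and it treats the spike as finished once the loss falls back to that level. The name `count_spikes` and the threshold are assumptions.

```python
def count_spikes(losses, factor=2.0):
    """Count loss spikes with a simple running-minimum heuristic (illustrative only)."""
    count = 0
    in_spike = False
    baseline = losses[0]  # lowest loss seen outside a spike
    for loss in losses[1:]:
        if not in_spike:
            if loss > factor * baseline:
                in_spike = True
                count += 1
            else:
                baseline = min(baseline, loss)
        elif loss <= baseline:
            # The loss has recovered, so the spike is over.
            in_spike = False
            baseline = loss
    return count

# Two separate spikes in an otherwise decreasing curve.
curve = [1.0, 0.9, 0.8, 3.0, 5.0, 0.7, 0.6, 2.0, 0.5, 0.4]
print(count_spikes(curve))
```

Applying the same counter to loss curves recorded at different batch sizes, with the learning rate held fixed, would give a simple way to check the claimed trend on one's own runs.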

Critical Analysis

The paper provides a compelling explanation for the common occurrence of training loss spikes in neural networks trained with SGD. The researchers' analysis of the catapult phenomenon and its connection to the tangent kernel subspace is insightful and helps to deepen our understanding of the inner workings of neural network optimization.

One potential limitation of the research is that it primarily focuses on the training dynamics and does not delve deeply into the implications for model generalization beyond the observed AGOP alignment improvements. It would be interesting to see further exploration of how these catapult dynamics might relate to other aspects of model performance, such as flat minima or stochastic collapse.

Additionally, the paper does not address the potential practical challenges of leveraging these catapult dynamics in real-world applications, such as the difficulty of tuning hyperparameters like batch size to control the frequency of catapults. Further research may be needed to explore more robust and scalable techniques for harnessing the benefits of catapults in neural network training.

Conclusion

This paper presents a novel explanation for the common occurrence of spikes in the training loss of neural networks trained with stochastic gradient descent (SGD). The researchers provide evidence that these "catapults" occur in the low-dimensional subspace spanned by the top eigenvectors of the tangent kernel, and they demonstrate how catapults can promote better feature learning and generalization by increasing alignment with the Average Gradient Outer Product (AGOP) of the true predictor.

The findings in this paper contribute to our understanding of the complex optimization dynamics in neural networks and suggest that carefully controlling factors like batch size may be a promising avenue for improving model performance. As the field of deep learning continues to evolve, research like this helps to shed light on the intricate mechanisms underlying successful training of these powerful AI models.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Catapults in SGD: spikes in the training loss and their impact on generalization through feature learning

Libin Zhu, Chaoyue Liu, Adityanarayanan Radhakrishnan, Mikhail Belkin

In this paper, we first present an explanation regarding the common occurrence of spikes in the training loss when neural networks are trained with stochastic gradient descent (SGD). We provide evidence that the spikes in the training loss of SGD are catapults, an optimization phenomenon originally observed in GD with large learning rates in [Lewkowycz et al. 2020]. We empirically show that these catapults occur in a low-dimensional subspace spanned by the top eigenvectors of the tangent kernel, for both GD and SGD. Second, we posit an explanation for how catapults lead to better generalization by demonstrating that catapults promote feature learning by increasing alignment with the Average Gradient Outer Product (AGOP) of the true predictor. Furthermore, we demonstrate that a smaller batch size in SGD induces a larger number of catapults, thereby improving AGOP alignment and test performance.

6/7/2024

Quadratic models for understanding catapult dynamics of neural networks

Libin Zhu, Chaoyue Liu, Adityanarayanan Radhakrishnan, Mikhail Belkin

While neural networks can be approximated by linear models as their width increases, certain properties of wide neural networks cannot be captured by linear models. In this work we show that recently proposed Neural Quadratic Models can exhibit the catapult phase [Lewkowycz et al. 2020] that arises when training such models with large learning rates. We then empirically show that the behaviour of neural quadratic models parallels that of neural networks in generalization, especially in the catapult phase regime. Our analysis further demonstrates that quadratic models can be an effective tool for analysis of neural networks.

5/3/2024

Gradient Descent with Polyak's Momentum Finds Flatter Minima via Large Catapults

Prin Phunyaphibarn, Junghyun Lee, Bohan Wang, Huishuai Zhang, Chulhee Yun

Although gradient descent with Polyak's momentum is widely used in modern machine and deep learning, a concrete understanding of its effects on the training trajectory remains elusive. In this work, we empirically show that for linear diagonal networks and nonlinear neural networks, momentum gradient descent with a large learning rate displays large catapults, driving the iterates towards much flatter minima than those found by gradient descent. We hypothesize that the large catapult is caused by momentum prolonging the self-stabilization effect (Damian et al., 2023). We provide theoretical and empirical support for our hypothesis in a simple toy example and empirical evidence supporting our hypothesis for linear diagonal networks.

5/30/2024

Stochastic Collapse: How Gradient Noise Attracts SGD Dynamics Towards Simpler Subnetworks

Feng Chen, Daniel Kunin, Atsushi Yamamura, Surya Ganguli

In this work, we reveal a strong implicit bias of stochastic gradient descent (SGD) that drives overly expressive networks to much simpler subnetworks, thereby dramatically reducing the number of independent parameters, and improving generalization. To reveal this bias, we identify invariant sets, or subsets of parameter space that remain unmodified by SGD. We focus on two classes of invariant sets that correspond to simpler (sparse or low-rank) subnetworks and commonly appear in modern architectures. Our analysis uncovers that SGD exhibits a property of stochastic attractivity towards these simpler invariant sets. We establish a sufficient condition for stochastic attractivity based on a competition between the loss landscape's curvature around the invariant set and the noise introduced by stochastic gradients. Remarkably, we find that an increased level of noise strengthens attractivity, leading to the emergence of attractive invariant sets associated with saddle-points or local maxima of the train loss. We observe empirically the existence of attractive invariant sets in trained deep neural networks, implying that SGD dynamics often collapses to simple subnetworks with either vanishing or redundant neurons. We further demonstrate how this simplifying process of stochastic collapse benefits generalization in a linear teacher-student framework. Finally, through this analysis, we mechanistically explain why early training with large learning rates for extended periods benefits subsequent generalization.

5/30/2024