Why Do You Grok? A Theoretical Analysis of Grokking Modular Addition

Read original: arXiv:2407.12332 - Published 7/18/2024 by Mohamad Amin Mohamadi, Zhiyuan Li, Lei Wu, Danica J. Sutherland

🌀

Overview

This paper presents a theoretical explanation for the "grokking" phenomenon, where deep learning models can generalize well after initially overfitting on a problem.
The researchers focus on the specific problem of modular addition and show that early in training, no permutation-equivariant model can achieve small error without seeing a large fraction of the data.
However, they demonstrate that two-layer quadratic networks can generalize well with fewer training points, and these networks can be found using gradient descent with L-infinity regularization.
The paper provides empirical evidence that these networks, as well as simple Transformers, only leave the "kernel regime" and begin to generalize after initially overfitting.

Plain English Explanation

In machine learning, there is a phenomenon called "grokking" where a model starts out struggling to learn a task, but then suddenly begins to generalize very well, long after it has "overfit" or memorized the training data. This paper tries to explain why this happens, focusing on the specific problem of modular addition.

The researchers first show that in the early stages of training, when the model is in the "kernel regime," no model that is equivariant to permutations of the input (meaning it treats the input symmetrically) can do well on modular addition without seeing a large fraction of all possible training examples. This makes sense - the model needs to see a lot of examples to learn the underlying pattern.

However, the researchers then demonstrate that more complex "two-layer quadratic networks" can actually generalize well with far fewer training examples, after they've initially overfit the training data. They show that these networks can be found using gradient descent with a technique called "L-infinity regularization" that encourages the network weights to have small absolute values.

The paper also provides experimental evidence that these two-layer quadratic networks, as well as simple Transformer models, only start to generalize well after they've initially overfit the training data and left the "kernel regime." This supports the idea that "grokking" is a consequence of the model's behavior transitioning from the initial "kernel-like" phase to a more complex, "limiting" phase as training progresses.

Technical Explanation

The key insight of this paper is that the transition from the "kernel regime" to a more complex "limiting regime" underpins the phenomenon of "grokking" in deep learning models.

First, the researchers prove a negative result: they show that in the kernel regime, no permutation-equivariant model can achieve small population error on modular addition unless it sees a constant fraction of all possible training examples. This suggests that models must escape the kernel regime to generalize well with fewer training points.

The researchers then demonstrate that two-layer quadratic networks can in fact generalize well with substantially fewer training points, after initially overfitting. They prove that such networks exist and can be found using gradient descent with small L-infinity regularization, which encourages the network weights to have small absolute values.

Empirically, the researchers show that these two-layer quadratic networks, as well as simple Transformer models, only leave the kernel regime and begin to generalize well after initially overfitting the training data. This supports the idea that the transition from kernel-like behavior to a more complex limiting regime underpins the "grokking" phenomenon.

Critical Analysis

The researchers provide a compelling theoretical and empirical account of the "grokking" phenomenon, particularly for the problem of modular addition. However, there are a few caveats to consider:

The theoretical results rely on the model being permutation-equivariant, which may not hold for all architectures used in practice. It would be valuable to extend the analysis to a broader class of models.
The construction of the two-layer quadratic networks that can generalize well with fewer training points assumes the ability to find them using gradient descent with L-infinity regularization. In practice, this may be challenging, and the existence of such networks in more complex settings is an open question.
The empirical evidence focuses on relatively simple problems and model architectures. It remains to be seen whether the insights apply equally well to more sophisticated deep learning tasks and models used in real-world applications.

Nevertheless, this paper provides an important step towards understanding the fundamental principles underlying the "grokking" phenomenon in deep learning. Further research in this direction could yield valuable insights for improving the sample efficiency and generalization capabilities of deep neural networks.

Conclusion

This paper presents a theoretical framework for understanding the "grokking" phenomenon in deep learning, where models can generalize well long after initially overfitting on a problem. The researchers show that this transition from "kernel-like" behavior to a more complex "limiting regime" is key to the grokking effect, at least for the specific problem of modular addition.

The findings have important implications for improving the sample efficiency and generalization performance of deep neural networks, which are crucial for many real-world applications. By better understanding the underlying mechanisms behind grokking, researchers may be able to develop new training techniques and architectural designs that can more reliably and efficiently learn complex tasks from limited data.

Overall, this paper represents an important contribution to the ongoing efforts to unravel the mysteries of deep learning and paves the way for further advancements in the field.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Follow @aimodelsfyi on 𝕏 →

Related Papers

🌀

Why Do You Grok? A Theoretical Analysis of Grokking Modular Addition

Mohamad Amin Mohamadi, Zhiyuan Li, Lei Wu, Danica J. Sutherland

We present a theoretical explanation of the ``grokking'' phenomenon, where a model generalizes long after overfitting,for the originally-studied problem of modular addition. First, we show that early in gradient descent, when the ``kernel regime'' approximately holds, no permutation-equivariant model can achieve small population error on modular addition unless it sees at least a constant fraction of all possible data points. Eventually, however, models escape the kernel regime. We show that two-layer quadratic networks that achieve zero training loss with bounded $ell_{infty}$ norm generalize well with substantially fewer training points, and further show such networks exist and can be found by gradient descent with small $ell_{infty}$ regularization. We further provide empirical evidence that these networks as well as simple Transformers, leave the kernel regime only after initially overfitting. Taken together, our results strongly support the case for grokking as a consequence of the transition from kernel-like behavior to limiting behavior of gradient descent on deep networks.

7/18/2024

Grokking as the Transition from Lazy to Rich Training Dynamics

Tanishq Kumar, Blake Bordelon, Samuel J. Gershman, Cengiz Pehlevan

We propose that the grokking phenomenon, where the train loss of a neural network decreases much earlier than its test loss, can arise due to a neural network transitioning from lazy training dynamics to a rich, feature learning regime. To illustrate this mechanism, we study the simple setting of vanilla gradient descent on a polynomial regression problem with a two layer neural network which exhibits grokking without regularization in a way that cannot be explained by existing theories. We identify sufficient statistics for the test loss of such a network, and tracking these over training reveals that grokking arises in this setting when the network first attempts to fit a kernel regression solution with its initial features, followed by late-time feature learning where a generalizing solution is identified after train loss is already low. We find that the key determinants of grokking are the rate of feature learning -- which can be controlled precisely by parameters that scale the network output -- and the alignment of the initial features with the target function $y(x)$. We argue this delayed generalization arises when (1) the top eigenvectors of the initial neural tangent kernel and the task labels $y(x)$ are misaligned, but (2) the dataset size is large enough so that it is possible for the network to generalize eventually, but not so large that train loss perfectly tracks test loss at all epochs, and (3) the network begins training in the lazy regime so does not learn features immediately. We conclude with evidence that this transition from lazy (linear model) to rich training (feature learning) can control grokking in more general settings, like on MNIST, one-layer Transformers, and student-teacher networks.

4/12/2024

Deep Grokking: Would Deep Neural Networks Generalize Better?

Simin Fan, Razvan Pascanu, Martin Jaggi

Recent research on the grokking phenomenon has illuminated the intricacies of neural networks' training dynamics and their generalization behaviors. Grokking refers to a sharp rise of the network's generalization accuracy on the test set, which occurs long after an extended overfitting phase, during which the network perfectly fits the training set. While the existing research primarily focus on shallow networks such as 2-layer MLP and 1-layer Transformer, we explore grokking on deep networks (e.g. 12-layer MLP). We empirically replicate the phenomenon and find that deep neural networks can be more susceptible to grokking than its shallower counterparts. Meanwhile, we observe an intriguing multi-stage generalization phenomenon when increase the depth of the MLP model where the test accuracy exhibits a secondary surge, which is scarcely seen on shallow models. We further uncover compelling correspondences between the decreasing of feature ranks and the phase transition from overfitting to the generalization stage during grokking. Additionally, we find that the multi-stage generalization phenomenon often aligns with a double-descent pattern in feature ranks. These observations suggest that internal feature rank could serve as a more promising indicator of the model's generalization behavior compared to the weight-norm. We believe our work is the first one to dive into grokking in deep neural networks, and investigate the relationship of feature rank and generalization performance.

5/31/2024

Deep Networks Always Grok and Here is Why

Ahmed Imtiaz Humayun, Randall Balestriero, Richard Baraniuk

Grokking, or delayed generalization, is a phenomenon where generalization in a deep neural network (DNN) occurs long after achieving near zero training error. Previous studies have reported the occurrence of grokking in specific controlled settings, such as DNNs initialized with large-norm parameters or transformers trained on algorithmic datasets. We demonstrate that grokking is actually much more widespread and materializes in a wide range of practical settings, such as training of a convolutional neural network (CNN) on CIFAR10 or a Resnet on Imagenette. We introduce the new concept of delayed robustness, whereby a DNN groks adversarial examples and becomes robust, long after interpolation and/or generalization. We develop an analytical explanation for the emergence of both delayed generalization and delayed robustness based on the local complexity of a DNN's input-output mapping. Our local complexity measures the density of so-called linear regions (aka, spline partition regions) that tile the DNN input space and serves as a utile progress measure for training. We provide the first evidence that, for classification problems, the linear regions undergo a phase transition during training whereafter they migrate away from the training samples (making the DNN mapping smoother there) and towards the decision boundary (making the DNN mapping less smooth there). Grokking occurs post phase transition as a robust partition of the input space thanks to the linearization of the DNN mapping around the training points. Website: https://bit.ly/grok-adversarial

6/10/2024