Bridging Lottery ticket and Grokking: Is Weight Norm Sufficient to Explain Delayed Generalization?

Read original: arXiv:2310.19470 - Published 5/10/2024 by Gouki Minegishi, Yusuke Iwasawa, Yutaka Matsuo


Overview

The paper explores the "grokking" phenomenon in neural networks, where a network first reaches a memorization solution with perfect training accuracy but poor generalization, and then later reaches a perfectly generalized solution with further training.
The researchers aim to analyze the mechanism of grokking, identifying the process of finding "lottery tickets" (good sparse subnetworks) as the key to describing the transitional phase between memorization and generalization.
They refer to these subnetworks as "Grokking tickets" and show that they can drastically accelerate grokking compared to dense networks on various configurations.

Plain English Explanation

Neural networks are a type of machine learning model that can learn to perform complex tasks by analyzing large amounts of data. One of the surprising behaviors of neural networks is something called "grokking," where the network first learns to perfectly memorize the training data, but then later on, it actually learns to generalize and perform well on new, unseen data.

The researchers in this paper wanted to understand how this grokking process works. They hypothesized that the key is finding "good" sparse subnetworks within the larger neural network, which they call "Grokking tickets." These subnetworks are identified through a technique called magnitude pruning, which removes the connections with the smallest weight magnitudes once the network has reached perfect generalization.

The researchers found that by using these Grokking tickets, they could drastically speed up the grokking process compared to using the full, dense neural network. They also compared the tickets against a dense network constrained to have the same weight norms (the overall magnitudes of the connection strengths), and the tickets still generalized faster. This suggests that weight norms are not enough to explain grokking on their own, and that finding the right sparse subnetworks is a critical factor in the transition from memorization to generalization.

Technical Explanation

The paper analyzes the mechanism behind the grokking phenomenon, where neural networks first reach a memorization solution with perfect training accuracy but poor generalization, and then later reach a perfectly generalized solution with further training.
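One way to make "delayed generalization" concrete is to measure the gap between the epoch at which training accuracy saturates and the epoch at which test accuracy does. The sketch below is illustrative only: the function names, the 0.99 threshold, and the accuracy curves are assumptions made here, not values from the paper.

```python
from typing import Optional, Sequence


def first_epoch_at_or_above(acc: Sequence[float], threshold: float) -> Optional[int]:
    """Return the first epoch index whose accuracy reaches the threshold, or None."""
    for epoch, value in enumerate(acc):
        if value >= threshold:
            return epoch
    return None


def grokking_delay(
    train_acc: Sequence[float],
    test_acc: Sequence[float],
    threshold: float = 0.99,  # assumed cutoff for "perfect" accuracy
) -> Optional[int]:
    """Epochs between fitting the training set and generalizing to the test set."""
    t_fit = first_epoch_at_or_above(train_acc, threshold)
    t_gen = first_epoch_at_or_above(test_acc, threshold)
    if t_fit is None or t_gen is None:
        return None  # one of the curves never reached the threshold
    return t_gen - t_fit


# Hypothetical curves, made up for illustration only.
train_curve = [0.20, 0.70, 1.00, 1.00, 1.00, 1.00, 1.00]
test_curve = [0.10, 0.15, 0.20, 0.25, 0.60, 0.95, 1.00]

delay = grokking_delay(train_curve, test_curve)
print(f"generalization lags memorization by {delay} epochs")
```

Under this measure, "accelerating grokking" means shrinking the delay; a model that never generalizes yields `None`.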

The researchers hypothesize that the process of finding "lottery tickets" (good sparse subnetworks) is the key to describing the transitional phase between memorization and generalization. They refer to these subnetworks as "Grokking tickets," which are identified via magnitude pruning after the network has reached perfect generalization.
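Magnitude pruning can be sketched in a few lines. The example below assumes global pruning over a flattened weight list, and it assumes the standard lottery-ticket recipe of applying the resulting mask to the initial weights before retraining; the summary does not spell out either detail, and all names and values are hypothetical. A real implementation would use a tensor library rather than Python lists.

```python
from typing import List


def magnitude_mask(weights: List[float], pruning_rate: float) -> List[int]:
    """Return a 0/1 mask that removes the `pruning_rate` fraction of
    weights with the smallest absolute value."""
    if not 0.0 <= pruning_rate < 1.0:
        raise ValueError("pruning_rate must be in [0, 1)")
    n_prune = int(len(weights) * pruning_rate)
    # Indices sorted by ascending magnitude; the first n_prune are pruned.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    mask = [1] * len(weights)
    for i in order[:n_prune]:
        mask[i] = 0
    return mask


def apply_mask(weights: List[float], mask: List[int]) -> List[float]:
    """Zero out pruned weights and keep the rest unchanged."""
    return [w if m else 0.0 for w, m in zip(weights, mask)]


# Hypothetical flattened weights (illustrative values only).
generalized_weights = [0.9, -0.05, 0.4, 0.01, -1.2, 0.03]  # after generalization
initial_weights = [0.1, -0.2, 0.3, -0.4, 0.5, -0.6]        # at initialization

# The mask comes from the generalized network...
mask = magnitude_mask(generalized_weights, pruning_rate=0.5)
# ...and is applied to the initial weights (assumed lottery-ticket recipe).
ticket_init = apply_mask(initial_weights, mask)
```

The sparse network defined by `ticket_init` and `mask` would then be trained from scratch, with the mask held fixed, and its time to generalization compared with that of the dense network.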

Using these Grokking tickets, the researchers show that the sparse subnetworks drastically accelerate grokking compared to dense networks across architectures (multilayer perceptrons and Transformers) and tasks (arithmetic and image classification). To verify that the Grokking tickets matter more than weight norms alone, they compare the good subnetworks to a dense network with the same L1 and L2 norms. The results demonstrate that the subnetworks generalize faster than the norm-controlled dense model.
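The idea of a norm-controlled dense baseline can be illustrated by rescaling dense weights to a target norm. This is only a sketch under stated assumptions: a single scale factor can match one norm at a time (the L2 norm here), and the summary does not describe how the paper matches both L1 and L2 norms. Function names and values are hypothetical.

```python
import math
from typing import List


def l1_norm(weights: List[float]) -> float:
    """Sum of absolute values."""
    return sum(abs(w) for w in weights)


def l2_norm(weights: List[float]) -> float:
    """Euclidean norm."""
    return math.sqrt(sum(w * w for w in weights))


def rescale_to_l2(weights: List[float], target_l2: float) -> List[float]:
    """Uniformly scale weights so that their L2 norm equals target_l2."""
    current = l2_norm(weights)
    if current == 0.0:
        raise ValueError("cannot rescale an all-zero weight vector")
    scale = target_l2 / current
    return [w * scale for w in weights]


# Hypothetical weights (illustrative values only).
dense_weights = [3.0, -4.0, 1.0, 2.0]
ticket_weights = [0.6, 0.0, 0.0, 0.8]  # sparse subnetwork, zeros are pruned

# Dense baseline with the same L2 norm as the sparse ticket.
controlled_dense = rescale_to_l2(dense_weights, l2_norm(ticket_weights))

print(f"ticket L2 = {l2_norm(ticket_weights):.4f}, "
      f"controlled dense L2 = {l2_norm(controlled_dense):.4f}")
```

The controlled model keeps every connection but has the same overall weight scale as the ticket, so any remaining difference in how quickly the two generalize cannot be attributed to the norm alone.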

Further investigations reveal that grokking can be achieved even without weight decay, as long as the pruning rate is appropriate. However, the researchers found that the speedup does not occur when using tickets identified at the memorization solution, the transition between memorization and generalization, or when pruning networks at initialization (using techniques like Random pruning, Grasp, SNIP, and Synflow).
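The simplest of the pruning-at-initialization baselines, random pruning, can be sketched as a mask with the same sparsity as a Grokking ticket but with no information from training. The helper names and the seed handling below are assumptions made for this illustration; Grasp, SNIP, and Synflow score weights using gradient information and are not shown.

```python
import random
from typing import List


def random_mask(n_weights: int, pruning_rate: float, seed: int = 0) -> List[int]:
    """Return a 0/1 mask that prunes a uniformly random subset of weights."""
    if not 0.0 <= pruning_rate < 1.0:
        raise ValueError("pruning_rate must be in [0, 1)")
    n_prune = int(n_weights * pruning_rate)
    rng = random.Random(seed)  # local generator keeps the mask reproducible
    pruned = set(rng.sample(range(n_weights), n_prune))
    return [0 if i in pruned else 1 for i in range(n_weights)]


def sparsity(mask: List[int]) -> float:
    """Fraction of weights removed by the mask."""
    return 1.0 - sum(mask) / len(mask)


# Hypothetical setting: 10 weights, 60% pruned.
baseline_mask = random_mask(n_weights=10, pruning_rate=0.6)
print(f"baseline sparsity: {sparsity(baseline_mask):.2f}")
```

Because the baseline matches the ticket's sparsity, a failure to accelerate grokking points to which connections are kept, not how many.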

The results suggest that the weight norm of network parameters is not enough to explain the process of grokking, and that the importance of finding good subnetworks is crucial in describing the transition from memorization to generalization.

Critical Analysis

The paper provides a valuable contribution to the understanding of the grokking phenomenon in neural networks. By introducing the concept of "Grokking tickets" and demonstrating their importance in the transition from memorization to generalization, the researchers offer a promising approach to accelerating the grokking process.

However, the study does not fully address the underlying mechanisms that lead to the emergence of these Grokking tickets. While the researchers show that the weight norms alone are not sufficient to explain grokking, further investigation is needed to uncover the deeper factors that govern the formation and significance of these subnetworks.

Additionally, the paper focuses on a limited set of tasks and architectures, such as MLPs, Transformers, and simple arithmetic and image classification problems. It would be valuable to explore the generalization of these findings to a wider range of neural network models and real-world applications, which may uncover additional insights or limitations of the Grokking tickets approach.

Finally, the paper does not provide a clear explanation of the biological or cognitive implications of the grokking phenomenon. While the technical insights are valuable for the machine learning community, understanding the potential connections to human learning and cognition could broaden the impact and relevance of this research.

Conclusion

This paper offers an important contribution to the understanding of generalization in neural networks. By identifying "Grokking tickets" as a key factor in the transition from memorization to perfect generalization, the researchers provide a promising approach to accelerating this process.

The findings challenge the notion that weight norms alone can explain grokking, and instead highlight the importance of finding the right sparse subnetworks within the larger neural network. On this view, the transition from memorization to generalization corresponds to the network discovering a good subnetwork during training.

Further research is needed to fully elucidate the mechanisms underlying the emergence of these Grokking tickets, and to explore their broader applicability across different neural network architectures and real-world tasks. Ultimately, this work contributes to the broader effort to understand delayed generalization in neural networks.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!


Related Papers


Bridging Lottery ticket and Grokking: Is Weight Norm Sufficient to Explain Delayed Generalization?

Gouki Minegishi, Yusuke Iwasawa, Yutaka Matsuo

Grokking is one of the most surprising puzzles in neural network generalization: a network first reaches a memorization solution with perfect training accuracy and poor generalization, but with further training, it reaches a perfectly generalized solution. We aim to analyze the mechanism of grokking from the lottery ticket hypothesis, identifying the process to find the lottery tickets (good sparse subnetworks) as the key to describing the transitional phase between memorization and generalization. We refer to these subnetworks as "Grokking tickets", which are identified via magnitude pruning after perfect generalization. First, using "Grokking tickets", we show that the lottery tickets drastically accelerate grokking compared to the dense networks on various configurations (MLP and Transformer, and arithmetic and image classification tasks). Additionally, to verify that "Grokking tickets" are a more critical factor than weight norms, we compared the "good" subnetworks with a dense network having the same L1 and L2 norms. Results show that the subnetworks generalize faster than the controlled dense model. In further investigations, we discovered that at an appropriate pruning rate, grokking can be achieved even without weight decay. We also show that speedup does not happen when using tickets identified at the memorization solution or transition between memorization and generalization or when pruning networks at the initialization (Random pruning, Grasp, SNIP, and Synflow). The results indicate that the weight norm of network parameters is not enough to explain the process of grokking, and they highlight the importance of finding good subnetworks in describing the transition from memorization to generalization. The implementation code can be accessed via this link: https://github.com/gouki510/Grokking-Tickets.

5/10/2024

Deep Grokking: Would Deep Neural Networks Generalize Better?

Simin Fan, Razvan Pascanu, Martin Jaggi

Recent research on the grokking phenomenon has illuminated the intricacies of neural networks' training dynamics and their generalization behaviors. Grokking refers to a sharp rise of the network's generalization accuracy on the test set, which occurs long after an extended overfitting phase, during which the network perfectly fits the training set. While the existing research primarily focuses on shallow networks such as 2-layer MLP and 1-layer Transformer, we explore grokking on deep networks (e.g. 12-layer MLP). We empirically replicate the phenomenon and find that deep neural networks can be more susceptible to grokking than their shallower counterparts. Meanwhile, we observe an intriguing multi-stage generalization phenomenon when increasing the depth of the MLP model, where the test accuracy exhibits a secondary surge, which is scarcely seen on shallow models. We further uncover compelling correspondences between the decreasing of feature ranks and the phase transition from overfitting to the generalization stage during grokking. Additionally, we find that the multi-stage generalization phenomenon often aligns with a double-descent pattern in feature ranks. These observations suggest that internal feature rank could serve as a more promising indicator of the model's generalization behavior compared to the weight-norm. We believe our work is the first one to dive into grokking in deep neural networks, and investigate the relationship of feature rank and generalization performance.

5/31/2024

Grokking as the Transition from Lazy to Rich Training Dynamics

Tanishq Kumar, Blake Bordelon, Samuel J. Gershman, Cengiz Pehlevan

We propose that the grokking phenomenon, where the train loss of a neural network decreases much earlier than its test loss, can arise due to a neural network transitioning from lazy training dynamics to a rich, feature learning regime. To illustrate this mechanism, we study the simple setting of vanilla gradient descent on a polynomial regression problem with a two layer neural network which exhibits grokking without regularization in a way that cannot be explained by existing theories. We identify sufficient statistics for the test loss of such a network, and tracking these over training reveals that grokking arises in this setting when the network first attempts to fit a kernel regression solution with its initial features, followed by late-time feature learning where a generalizing solution is identified after train loss is already low. We find that the key determinants of grokking are the rate of feature learning -- which can be controlled precisely by parameters that scale the network output -- and the alignment of the initial features with the target function $y(x)$. We argue this delayed generalization arises when (1) the top eigenvectors of the initial neural tangent kernel and the task labels $y(x)$ are misaligned, but (2) the dataset size is large enough so that it is possible for the network to generalize eventually, but not so large that train loss perfectly tracks test loss at all epochs, and (3) the network begins training in the lazy regime so does not learn features immediately. We conclude with evidence that this transition from lazy (linear model) to rich training (feature learning) can control grokking in more general settings, like on MNIST, one-layer Transformers, and student-teacher networks.

4/12/2024

Deep Networks Always Grok and Here is Why

Ahmed Imtiaz Humayun, Randall Balestriero, Richard Baraniuk

Grokking, or delayed generalization, is a phenomenon where generalization in a deep neural network (DNN) occurs long after achieving near zero training error. Previous studies have reported the occurrence of grokking in specific controlled settings, such as DNNs initialized with large-norm parameters or transformers trained on algorithmic datasets. We demonstrate that grokking is actually much more widespread and materializes in a wide range of practical settings, such as training of a convolutional neural network (CNN) on CIFAR10 or a Resnet on Imagenette. We introduce the new concept of delayed robustness, whereby a DNN groks adversarial examples and becomes robust, long after interpolation and/or generalization. We develop an analytical explanation for the emergence of both delayed generalization and delayed robustness based on the local complexity of a DNN's input-output mapping. Our local complexity measures the density of so-called linear regions (aka, spline partition regions) that tile the DNN input space and serves as a utile progress measure for training. We provide the first evidence that, for classification problems, the linear regions undergo a phase transition during training whereafter they migrate away from the training samples (making the DNN mapping smoother there) and towards the decision boundary (making the DNN mapping less smooth there). Grokking occurs post phase transition as a robust partition of the input space thanks to the linearization of the DNN mapping around the training points. Website: https://bit.ly/grok-adversarial

6/10/2024