Learning time-scales in two-layers neural networks

2303.00055

Published 4/19/2024 by Raphael Berthier, Andrea Montanari, Kangjie Zhou

Abstract

Gradient-based learning in multi-layer neural networks displays a number of striking features. In particular, the decrease rate of empirical risk is non-monotone even after averaging over large batches. Long plateaus in which one observes barely any progress alternate with intervals of rapid decrease. These successive phases of learning often take place on very different time scales. Finally, models learnt in an early phase are typically "simpler" or "easier to learn" although in a way that is difficult to formalize. Although theoretical explanations of these phenomena have been put forward, each of them captures at best certain specific regimes. In this paper, we study the gradient flow dynamics of a wide two-layer neural network in high-dimension, when data are distributed according to a single-index model (i.e., the target function depends on a one-dimensional projection of the covariates). Based on a mixture of new rigorous results, non-rigorous mathematical derivations, and numerical simulations, we propose a scenario for the learning dynamics in this setting. In particular, the proposed evolution exhibits separation of timescales and intermittency. These behaviors arise naturally because the population gradient flow can be recast as a singularly perturbed dynamical system.

Overview

Gradient-based learning in multi-layer neural networks exhibits several striking features: the empirical risk decreases at a non-monotone rate even after averaging over large batches, long plateaus with little progress alternate with intervals of rapid decrease, and models learned in early phases tend to be "simpler" or "easier to learn".
Theoretical explanations have been proposed, but each captures only certain specific regimes.
This paper studies the gradient flow dynamics of a wide two-layer neural network in high dimension, where data are distributed according to a single-index model (the target function depends on a one-dimensional projection of the covariates).

Plain English Explanation

When training large neural networks using gradient-based methods, the learning process often displays some peculiar behaviors. Even after averaging gradients over large batches of data, the error on the training data does not decrease at a steady rate. Instead, there are alternating phases: long plateaus where the error barely changes at all, followed by intervals where the error drops rapidly. These phases often unfold on very different time scales. The models learned in the early phases also tend to be simpler, or easier to learn, than the final, high-performing models, although this notion is hard to make precise.

Researchers have tried to explain these phenomena theoretically, but each explanation only captures certain specific situations. In this paper, the authors focus on a particular setting: a wide, two-layer neural network trained on high-dimensional data that follows a "single-index" model (where the target function depends only on a one-dimensional projection of the input). Using a combination of rigorous analysis, non-rigorous mathematical derivations, and simulations, the authors propose a scenario for the learning dynamics in this case. In the proposed scenario, the gradient flow (the continuous-time dynamics of the network's parameters during training) exhibits a separation of timescales and intermittency, so that the slow and fast phases of learning emerge naturally from the mathematical structure of the problem.

Technical Explanation

The paper studies the gradient flow dynamics of a wide, two-layer neural network in a high-dimensional setting where the data is distributed according to a single-index model. The authors leverage a mix of new rigorous results, non-rigorous mathematical derivations, and numerical simulations to propose a scenario that can explain the striking features of the learning dynamics observed in this setting.
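The setting described above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions, not the authors' experimental setup: the Gaussian covariates, the link function phi, the tanh activation, the 1/m output scaling, the initialization, and all sizes and step sizes are hypothetical choices. The paper analyzes gradient flow on the population risk, whereas this sketch runs full-batch gradient descent on an empirical risk as a crude discretization.

```python
import numpy as np


def make_single_index_data(n, d, rng):
    """Draw x ~ N(0, I_d) and y = phi(<u_star, x>) for a random unit vector u_star."""
    u_star = rng.standard_normal(d)
    u_star /= np.linalg.norm(u_star)
    X = rng.standard_normal((n, d))
    z = X @ u_star                      # one-dimensional projection of the covariates
    y = z + 0.5 * (z**2 - 1.0)          # hypothetical link function phi (an assumption)
    return X, y, u_star


def train_two_layer(X, y, m=50, lr=0.02, n_steps=300, rng=None):
    """Full-batch gradient descent on f(x) = (1/m) * sum_i a_i * tanh(<w_i, x>)."""
    rng = np.random.default_rng(0) if rng is None else rng
    n, d = X.shape
    W = rng.standard_normal((m, d)) / np.sqrt(d)   # first-layer weights (assumed init)
    a = rng.standard_normal(m)                     # second-layer weights (assumed init)
    risks = np.empty(n_steps + 1)
    for step in range(n_steps + 1):
        h = np.tanh(X @ W.T)                       # hidden activations, shape (n, m)
        r = h @ a / m - y                          # residuals, shape (n,)
        risks[step] = 0.5 * np.mean(r**2)          # empirical squared-error risk
        if step == n_steps:
            break
        # Gradients of the empirical risk with respect to a and W.
        grad_a = h.T @ r / (n * m)
        grad_W = ((r[:, None] * (1.0 - h**2)) * a[None, :]).T @ X / (n * m)
        # Step size is multiplied by m so each neuron moves at a rate independent of m.
        a -= lr * m * grad_a
        W -= lr * m * grad_W
    return W, a, risks


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X, y, u_star = make_single_index_data(n=2000, d=20, rng=rng)
    W, a, risks = train_two_layer(X, y, rng=rng)
    print("initial risk:", risks[0], "final risk:", risks[-1])
```

Plotting `risks` against the step index on a logarithmic time axis is one way to look for plateaus, although whether they appear depends on the assumed link function, dimension, and initialization.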

Specifically, the authors argue that the population gradient flow can be recast as a singularly perturbed dynamical system, that is, a system whose variables evolve on widely separated timescales. This structure naturally produces separation of timescales and intermittency in the learning process: long plateaus with little progress alternate with rapid decreases in error, and the models learned in early phases are "simpler" or "easier to learn" than those reached later.
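For readers unfamiliar with the term, a singularly perturbed system in its textbook form couples a slow variable x and a fast variable y through dx/dt = f(x, y) and eps * dy/dt = g(x, y), with eps small. The sketch below simulates a generic linear toy example of this form. It is not the system derived in the paper; the functions f and g, the value of eps, and the Euler discretization are all assumptions chosen for illustration.

```python
import numpy as np


def simulate_slow_fast(eps=0.01, t_end=2.0, steps_per_eps=10):
    """Explicit Euler for dx/dt = -x, eps * dy/dt = x - y, with x(0) = 1, y(0) = 0."""
    dt = eps / steps_per_eps                 # resolve the fast timescale eps
    n_steps = int(round(t_end / dt))
    t = np.arange(n_steps + 1) * dt
    x = np.empty(n_steps + 1)
    y = np.empty(n_steps + 1)
    x[0], y[0] = 1.0, 0.0
    for k in range(n_steps):
        x[k + 1] = x[k] + dt * (-x[k])                    # slow variable, O(1) timescale
        y[k + 1] = y[k] + (dt / eps) * (x[k] - y[k])      # fast variable, O(eps) timescale
    return t, x, y


if __name__ == "__main__":
    t, x, y = simulate_slow_fast()
    k = 100  # time t = 10 * eps with the default arguments
    print("t =", t[k], "x =", x[k], "y =", y[k])
    print("t =", t[-1], "x =", x[-1], "y =", y[-1])
```

In this toy example y snaps onto the curve y = x within a time of order eps, after which both variables drift together on the slow timescale. The learning dynamics studied in the paper are nonlinear and considerably richer, so this only illustrates what separation of timescales means.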

The authors also draw connections to prior work on the dynamical stability and chaos of neural network training trajectories, as well as to the phenomenon of grokking, which some prior work interprets as a transition from a "lazy" to a "rich" regime during training.

Critical Analysis

The paper provides a compelling theoretical framework for understanding the complex learning dynamics observed in wide, two-layer neural networks trained on high-dimensional, single-index data. By recasting the gradient flow as a singularly perturbed dynamical system, the authors are able to derive a scenario that explains the separation of timescales and intermittency in the learning process.

However, it's important to note that the analysis is limited to this specific setting, and the results may not directly translate to other neural network architectures or data distributions. The authors acknowledge that extending the analysis to deeper networks or more general data models remains an open challenge.

Additionally, while the mathematical derivations provide valuable insights, they rely on a number of simplifying assumptions and approximations. The extent to which these assumptions hold in real-world deep learning scenarios is an area that warrants further investigation.

Overall, this paper makes an important contribution to our theoretical understanding of neural network training dynamics. By shedding light on the underlying mathematical structure of the problem, it opens up avenues for future research aimed at developing more robust and stable training algorithms.

Conclusion

This paper proposes a theoretical framework for understanding the complex learning dynamics observed in wide, two-layer neural networks trained on high-dimensional, single-index data. By reframing the gradient flow as a singularly perturbed dynamical system, the authors are able to explain the separation of timescales and intermittency that characterize the training process.

While the analysis is limited to this specific setting, the insights gained from this work have broader implications for the field of deep learning. By uncovering the mathematical structure underlying these phenomena, the paper paves the way for further research into more robust and stable training algorithms. As the field continues to grapple with the complexities of large-scale neural network optimization, studies like this one will be invaluable in guiding both practical and theoretical advancements.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

A Theory of Non-Linear Feature Learning with One Gradient Step in Two-Layer Neural Networks

Behrad Moniri, Donghwan Lee, Hamed Hassani, Edgar Dobriban

Feature learning is thought to be one of the fundamental reasons for the success of deep neural networks. It is rigorously known that in two-layer fully-connected neural networks under certain conditions, one step of gradient descent on the first layer can lead to feature learning; characterized by the appearance of a separated rank-one component -- spike -- in the spectrum of the feature matrix. However, with a constant gradient descent step size, this spike only carries information from the linear component of the target function and therefore learning non-linear components is impossible. We show that with a learning rate that grows with the sample size, such training in fact introduces multiple rank-one components, each corresponding to a specific polynomial feature. We further prove that the limiting large-dimensional and large sample training and test errors of the updated neural networks are fully characterized by these spikes. By precisely analyzing the improvement in the training and test errors, we demonstrate that these non-linear features can enhance learning.

6/18/2024

stat.ML cs.LG

A Dynamical Model of Neural Scaling Laws

Blake Bordelon, Alexander Atanasov, Cengiz Pehlevan

On a variety of tasks, the performance of neural networks predictably improves with training time, dataset size and model size across many orders of magnitude. This phenomenon is known as a neural scaling law. Of fundamental importance is the compute-optimal scaling law, which reports the performance as a function of units of compute when choosing model sizes optimally. We analyze a random feature model trained with gradient descent as a solvable model of network training and generalization. This reproduces many observations about neural scaling laws. First, our model makes a prediction about why the scaling of performance with training time and with model size have different power law exponents. Consequently, the theory predicts an asymmetric compute-optimal scaling rule where the number of training steps are increased faster than model parameters, consistent with recent empirical observations. Second, it has been observed that early in training, networks converge to their infinite-width dynamics at a rate $1/\textit{width}$ but at late time exhibit a rate $\textit{width}^{-c}$, where $c$ depends on the structure of the architecture and task. We show that our model exhibits this behavior. Lastly, our theory shows how the gap between training and test loss can gradually build up over time due to repeated reuse of data.

6/26/2024

stat.ML cs.LG

Mean-field Analysis on Two-layer Neural Networks from a Kernel Perspective

Shokichi Takakura, Taiji Suzuki

In this paper, we study the feature learning ability of two-layer neural networks in the mean-field regime through the lens of kernel methods. To focus on the dynamics of the kernel induced by the first layer, we utilize a two-timescale limit, where the second layer moves much faster than the first layer. In this limit, the learning problem is reduced to the minimization problem over the intrinsic kernel. Then, we show the global convergence of the mean-field Langevin dynamics and derive time and particle discretization error. We also demonstrate that two-layer neural networks can learn a union of multiple reproducing kernel Hilbert spaces more efficiently than any kernel methods, and neural networks acquire data-dependent kernel which aligns with the target function. In addition, we develop a label noise procedure, which converges to the global optimum and show that the degrees of freedom appears as an implicit regularization.

4/9/2024

cs.LG stat.ML

Large Stepsize Gradient Descent for Non-Homogeneous Two-Layer Networks: Margin Improvement and Fast Optimization

Yuhang Cai, Jingfeng Wu, Song Mei, Michael Lindsey, Peter L. Bartlett

The typical training of neural networks using large stepsize gradient descent (GD) under the logistic loss often involves two distinct phases, where the empirical risk oscillates in the first phase but decreases monotonically in the second phase. We investigate this phenomenon in two-layer networks that satisfy a near-homogeneity condition. We show that the second phase begins once the empirical risk falls below a certain threshold, dependent on the stepsize. Additionally, we show that the normalized margin grows nearly monotonically in the second phase, demonstrating an implicit bias of GD in training non-homogeneous predictors. If the dataset is linearly separable and the derivative of the activation function is bounded away from zero, we show that the average empirical risk decreases, implying that the first phase must stop in finite steps. Finally, we demonstrate that by choosing a suitably large stepsize, GD that undergoes this phase transition is more efficient than GD that monotonically decreases the risk. Our analysis applies to networks of any width, beyond the well-known neural tangent kernel and mean-field regimes.

6/28/2024

stat.ML cs.LG