Asymptotics of feature learning in two-layer networks after one gradient-step

Read original: arXiv:2402.04980 - Published 6/5/2024 by Hugo Cui, Luca Pesce, Yatin Dandi, Florent Krzakala, Yue M. Lu, Lenka Zdeborová, Bruno Loureiro

Overview

  • The paper investigates how two-layer neural networks learn features from data and improve over the kernel regime after a single gradient descent step.
  • The authors use a spiked Random Features (sRF) model to describe the trained network, building on recent work on Gaussian universality.
  • The paper provides an asymptotic characterization of the generalization error of the sRF in the high-dimensional limit, which closely matches the learning curves of the original network model.
  • This enables the authors to understand how adapting to the data is crucial for the network to efficiently learn non-linear functions in the direction of the gradient, which it cannot do at initialization.

Plain English Explanation

The paper looks at how two-layer neural networks learn features from data and improve over the "kernel regime" after being trained with a single gradient descent step. In the kernel regime, the network behaves like a fixed set of random features with only the output layer fitted to the data. To describe the trained network, the authors use a mathematical model called a "spiked Random Features (sRF) model", building on previous research.

Using this model, the paper provides an exact mathematical description of the network's test performance (its "generalization error") in the limit where the number of training examples, the network width, and the input dimension all grow large at a proportional rate. Importantly, this description closely matches the learning curves of the original neural network model.

This leads to the key insight: to efficiently learn non-linear functions, the network needs to adapt to the data it is trained on. At initialization, the network can only express linear functions in this regime; after the gradient step, it can learn non-linear functions along the direction of the gradient.

Technical Explanation

The paper builds on the insight of Ba et al. (2022) to model the network trained with one gradient step as a spiked Random Features (sRF) model. For this model, the authors provide an exact asymptotic characterization of the generalization error in the high-dimensional limit where the number of samples, the width, and the input dimension grow at a proportional rate.
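For orientation, the setting described above can be written out roughly as follows. The notation is an assumption made here for illustration: the summary does not fix symbols, the ratio names \alpha and \beta are hypothetical, and the scalings shown are one common convention that may differ from the paper's.

```latex
% Two-layer network with first-layer weights W and readout a (illustrative notation)
f(x) = \frac{1}{\sqrt{p}}\, a^\top \sigma(W x),
\qquad x \in \mathbb{R}^d,\; W \in \mathbb{R}^{p \times d},\; a \in \mathbb{R}^p

% One gradient step on the first layer, starting from random weights W^{(0)}
W^{(1)} = W^{(0)} - \eta\, \nabla_W \hat{\mathcal{L}}\big(W^{(0)}\big)

% Spiked Random Features model: random bulk plus a rank-one spike
W^{(1)} \approx W^{(0)} + u v^\top,
\qquad u \in \mathbb{R}^p,\; v \in \mathbb{R}^d

% Proportional high-dimensional limit (n samples, width p, input dimension d)
n, p, d \to \infty \quad \text{with} \quad n/d \to \alpha, \quad p/d \to \beta
```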

This high-dimensional analysis, which relies on recent progress on Gaussian universality (Dandi et al., 2023), closely captures the learning curves of the original two-layer network model. It lets the authors show that adapting to the data is crucial for the network to efficiently learn non-linear functions in the direction of the gradient, which it cannot do at initialization.
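A small numerical sketch can make this contrast concrete. It is not the paper's experiment: the sizes, the ReLU activation, the purely quadratic single-index target, the ridge penalty, and the idealized spike (taken exactly along the hidden direction `w_star` rather than along an actual gradient) are all assumptions chosen for illustration. It fits only the readout by ridge regression, once on plain random features and once on spiked random features.

```python
import numpy as np

rng = np.random.default_rng(0)
d, p, n_train, n_test = 30, 60, 2000, 2000  # illustrative sizes, not from the paper

# Hidden direction of a single-index target (assumption of this sketch).
w_star = rng.standard_normal(d)
w_star /= np.linalg.norm(w_star)

def target(X):
    """Purely non-linear target: second Hermite polynomial of the projection."""
    z = X @ w_star
    return z**2 - 1.0

X_train = rng.standard_normal((n_train, d))
X_test = rng.standard_normal((n_test, d))
y_train, y_test = target(X_train), target(X_test)

# Random first-layer weights, rows of roughly unit norm.
W0 = rng.standard_normal((p, d)) / np.sqrt(d)

# Idealized rank-one spike u v^T with v set to w_star (hypothetical choice).
u = rng.standard_normal(p)
W_spiked = W0 + np.outer(u, w_star)

def ridge_test_error(W, lam=1e-3):
    """Fit a ridge readout on ReLU features of W and return the test MSE."""
    F_train = np.maximum(X_train @ W.T, 0.0)
    F_test = np.maximum(X_test @ W.T, 0.0)
    # Center features and labels with training means (acts as an intercept).
    mean_f, mean_y = F_train.mean(axis=0), y_train.mean()
    A = F_train - mean_f
    gram = A.T @ A / n_train + lam * np.eye(W.shape[0])
    coef = np.linalg.solve(gram, A.T @ (y_train - mean_y) / n_train)
    pred = (F_test - mean_f) @ coef + mean_y
    return float(np.mean((pred - y_test) ** 2))

err_rf = ridge_test_error(W0)
err_spiked = ridge_test_error(W_spiked)
print(f"random features: {err_rf:.3f}, spiked random features: {err_spiked:.3f}")
```

In this idealized setup the spiked features depend non-linearly on the projection along `w_star`, so the spiked model would be expected to reach a lower test error than the unspiked one. The paper's analysis makes this kind of comparison exact in the proportional limit, with the spike coming from the actual gradient.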

In the proportional regime studied here, the network at initialization behaves as in the kernel regime and can only express linear functions. After a single gradient descent step, the first-layer weights have adapted to the data, and the network can learn non-linear functions along the direction of the gradient.
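The role of the single gradient step can be illustrated with a short NumPy sketch. It computes the first-layer gradient of a squared loss at random initialization and inspects its singular values. The sizes, the tanh activation, the target function, and the small readout scale are assumptions of this sketch, not values taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, p = 4000, 100, 200  # illustrative sizes, not from the paper

# Hidden direction of a single-index target (assumption of this sketch).
w_star = rng.standard_normal(d)
w_star /= np.linalg.norm(w_star)

X = rng.standard_normal((n, d))
z = X @ w_star
y = z + 0.5 * (z**2 - 1.0)  # assumed target: linear plus quadratic component

# Random initialization; the readout is small so the output is near zero.
W0 = rng.standard_normal((p, d)) / np.sqrt(d)
a = rng.choice([-1.0, 1.0], size=p) / np.sqrt(p)

# Forward pass of f(x) = a^T tanh(W0 x) / sqrt(p).
hidden = np.tanh(X @ W0.T)
f = hidden @ a / np.sqrt(p)
resid = y - f

# Gradient of (1 / 2n) * sum_i (y_i - f(x_i))^2 with respect to W0.
back = resid[:, None] * (1.0 - hidden**2) * a[None, :]  # shape (n, p)
grad = -(back.T @ X) / (n * np.sqrt(p))                 # shape (p, d)

# A dominant top singular value indicates an approximately rank-one update.
_, svals, Vt = np.linalg.svd(grad, full_matrices=False)
spike_ratio = svals[0] / svals[1]
alignment = abs(Vt[0] @ w_star)
print(f"top/second singular value: {spike_ratio:.1f}, alignment: {alignment:.2f}")
```

Here `spike_ratio` measures how strongly the leading singular value stands out, and `alignment` is the overlap between the leading right singular vector and the target direction. A large ratio is what motivates replacing the updated weights by random weights plus a rank-one spike.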

Critical Analysis

The paper provides a rigorous mathematical analysis of the learning dynamics of two-layer neural networks, which is an important step in understanding the inner workings of these widely used models. The authors' use of the sRF model and the connection to Gaussian universality allows for a precise characterization of the generalization error in the high-dimensional limit.

One potential limitation of the work is that it focuses solely on the initial, single-step gradient descent regime. While this provides valuable insights, it would be interesting to see how the analysis extends to the case of multiple training iterations or different optimization methods.

Additionally, the paper does not address the impact of other architectural choices, such as the choice of activation function or the presence of skip connections, which can also play a crucial role in a network's ability to learn complex functions. Exploring these factors could further enhance our understanding of neural network learning.

Overall, the paper makes a significant contribution to the theoretical understanding of two-layer neural networks and their adaptive capabilities. The insights provided can inform the design of more effective neural network architectures and training procedures for a wide range of applications.

Conclusion

This paper offers a detailed, mathematical analysis of how two-layer neural networks learn features from data and improve their performance after a single gradient descent step. By modeling the trained network using a spiked Random Features (sRF) model and leveraging recent advances in Gaussian universality, the authors provide an exact characterization of the generalization error in the high-dimensional limit.

The key insight is that the network's ability to adapt to the data is crucial for learning complex, non-linear functions, which it cannot do at initialization. This work advances our theoretical understanding of neural network learning and could inform the development of more effective machine learning models and techniques.



This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!


Related Papers

Asymptotics of feature learning in two-layer networks after one gradient-step

Hugo Cui, Luca Pesce, Yatin Dandi, Florent Krzakala, Yue M. Lu, Lenka Zdeborová, Bruno Loureiro

In this manuscript, we investigate the problem of how two-layer neural networks learn features from data, and improve over the kernel regime, after being trained with a single gradient descent step. Leveraging the insight from (Ba et al., 2022), we model the trained network by a spiked Random Features (sRF) model. Further building on recent progress on Gaussian universality (Dandi et al., 2023), we provide an exact asymptotic description of the generalization error of the sRF in the high-dimensional limit where the number of samples, the width, and the input dimension grow at a proportional rate. The resulting characterization for sRFs also captures closely the learning curves of the original network model. This enables us to understand how adapting to the data is crucial for the network to efficiently learn non-linear functions in the direction of the gradient -- where at initialization it can only express linear functions in this regime.

6/5/2024

A Theory of Non-Linear Feature Learning with One Gradient Step in Two-Layer Neural Networks

Behrad Moniri, Donghwan Lee, Hamed Hassani, Edgar Dobriban

Feature learning is thought to be one of the fundamental reasons for the success of deep neural networks. It is rigorously known that in two-layer fully-connected neural networks under certain conditions, one step of gradient descent on the first layer can lead to feature learning; characterized by the appearance of a separated rank-one component -- spike -- in the spectrum of the feature matrix. However, with a constant gradient descent step size, this spike only carries information from the linear component of the target function and therefore learning non-linear components is impossible. We show that with a learning rate that grows with the sample size, such training in fact introduces multiple rank-one components, each corresponding to a specific polynomial feature. We further prove that the limiting large-dimensional and large sample training and test errors of the updated neural networks are fully characterized by these spikes. By precisely analyzing the improvement in the training and test errors, we demonstrate that these non-linear features can enhance learning.

6/18/2024

Asymptotics of Learning with Deep Structured (Random) Features

Dominik Schröder, Daniil Dmitriev, Hugo Cui, Bruno Loureiro

For a large class of feature maps we provide a tight asymptotic characterisation of the test error associated with learning the readout layer, in the high-dimensional limit where the input dimension, hidden layer widths, and number of training samples are proportionally large. This characterization is formulated in terms of the population covariance of the features. Our work is partially motivated by the problem of learning with Gaussian rainbow neural networks, namely deep non-linear fully-connected networks with random but structured weights, whose row-wise covariances are further allowed to depend on the weights of previous layers. For such networks we also derive a closed-form formula for the feature covariance in terms of the weight matrices. We further find that in some cases our results can capture feature maps learned by deep, finite-width neural networks trained under gradient descent.

6/11/2024

Learning time-scales in two-layers neural networks

Raphael Berthier, Andrea Montanari, Kangjie Zhou

Gradient-based learning in multi-layer neural networks displays a number of striking features. In particular, the decrease rate of empirical risk is non-monotone even after averaging over large batches. Long plateaus in which one observes barely any progress alternate with intervals of rapid decrease. These successive phases of learning often take place on very different time scales. Finally, models learnt in an early phase are typically 'simpler' or 'easier to learn' although in a way that is difficult to formalize. Although theoretical explanations of these phenomena have been put forward, each of them captures at best certain specific regimes. In this paper, we study the gradient flow dynamics of a wide two-layer neural network in high-dimension, when data are distributed according to a single-index model (i.e., the target function depends on a one-dimensional projection of the covariates). Based on a mixture of new rigorous results, non-rigorous mathematical derivations, and numerical simulations, we propose a scenario for the learning dynamics in this setting. In particular, the proposed evolution exhibits separation of timescales and intermittency. These behaviors arise naturally because the population gradient flow can be recast as a singularly perturbed dynamical system.

4/19/2024