Infinite Width Models That Work: Why Feature Learning Doesn't Matter as Much as You Think

Read original: arXiv:2406.18800 - Published 6/28/2024 by Luke Sernau

Overview

The paper examines "infinite width" neural network models such as Neural Tangent Kernels (NTKs), which describe networks in the limit of infinitely many hidden units.
It challenges the common explanation that these models underperform finite networks because they lack feature learning.
The author argues that the gap instead comes from existing constructions depending on weak optimizers like SGD, and shows empirically that an infinite width limit based on ADAM-like learning dynamics erases it.

Plain English Explanation

In the world of machine learning, there is a common belief that a model's ability to learn good features from data is crucial for its performance on new, unseen data. This is known as "generalization" - the model's ability to apply what it has learned to new situations.

However, this paper challenges this assumption. The author investigates "infinite width" neural network models, the mathematical limit of a network as its number of hidden units grows without bound. Common versions of these models, such as Neural Tangent Kernels (NTKs), have historically performed worse than ordinary finite networks, and the usual explanation is that they cannot learn features.

The paper argues that this explanation is wrong. According to its abstract, infinite width NTK models can actually access richer features than finite models by selecting relevant subfeatures from their infinite feature vector. In experiments, NTKs still under-perform traditional finite models even when feature learning is artificially disabled, so feature learning cannot be what accounts for the gap.

This suggests that feature learning may not be as important as we thought. The author attributes the weak performance to the optimizer instead: existing infinite width constructions depend on weak optimizers like SGD, and an infinite width limit based on ADAM-like learning dynamics closes the gap in the paper's experiments.

Technical Explanation

The paper studies neural networks in the limit of infinitely many hidden units, known as "infinite width" models, with Neural Tangent Kernels (NTKs) as the main example. These models are a natural test bed for the role of feature learning, because their historically weak performance relative to finite networks has been attributed to its absence.
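For readers who want the standard definitions behind the NTK named in the paper's abstract, the block below states the usual textbook form. The symbols ($f$, $\theta$, $\Theta$, $X$, $y$) are notation chosen here for illustration and are not taken from the paper, which may use a different setup.

```latex
% Empirical NTK of a network f(x; \theta) with parameters \theta
\Theta_\theta(x, x') = \nabla_\theta f(x; \theta)^\top \, \nabla_\theta f(x'; \theta)

% Standard infinite width result for gradient flow on squared loss:
% the kernel stays at its initial value \Theta, and the converged predictor
% on training inputs X with labels y is kernel regression on the residual
f_\infty(x) = f_0(x) + \Theta(x, X) \, \Theta(X, X)^{-1} \bigl( y - f_0(X) \bigr)
```

Because the kernel never changes during training in this limit, the model is a linear predictor on fixed features $\nabla_\theta f(x; \theta_0)$, which is why it is usually described as lacking feature learning.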

The abstract highlights two main findings:

Feature Learning Disabled: Infinite width NTK models can access richer features than finite models by selecting relevant subfeatures from their (infinite) feature vector, and NTKs still under-perform traditional finite models in experiments where feature learning is artificially disabled.
ADAM-like Learning Dynamics: Existing infinite width constructions depend on weak optimizers like SGD. The paper provides an infinite width limit based on ADAM-like learning dynamics and shows empirically that the resulting models erase the performance gap.

Taken together, these findings challenge the assumption that feature learning is what separates finite networks from infinite width models, and by extension that it is crucial for a model's ability to generalize.
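To make "no feature learning" concrete, one standard construction is to linearize a small network around its initialization and train only the linear correction. The sketch below does this in plain Python for a toy two-layer tanh network. The network, the data, and every function name are assumptions made for illustration; this summary does not say how the paper disables feature learning, so this should not be read as the author's procedure.

```python
import math
import random

def init_params(width, seed=0):
    """Random two-layer network f(x) = sum_j a_j * tanh(w_j * x) / sqrt(width)."""
    rng = random.Random(seed)
    w = [rng.gauss(0.0, 1.0) for _ in range(width)]
    a = [rng.gauss(0.0, 1.0) for _ in range(width)]
    return w, a

def forward(x, w, a):
    return sum(aj * math.tanh(wj * x) for wj, aj in zip(w, a)) / math.sqrt(len(w))

def tangent_features(x, w, a):
    """Gradient of f(x) with respect to all parameters (w first, then a)."""
    scale = 1.0 / math.sqrt(len(w))
    grad_w = [aj * (1.0 - math.tanh(wj * x) ** 2) * x * scale for wj, aj in zip(w, a)]
    grad_a = [math.tanh(wj * x) * scale for wj in w]
    return grad_w + grad_a

def empirical_ntk(xs, w, a):
    """Kernel matrix of inner products between tangent features."""
    feats = [tangent_features(x, w, a) for x in xs]
    return [[sum(p * q for p, q in zip(fi, fj)) for fj in feats] for fi in feats]

def train_linearized(xs, ys, w, a, lr=0.1, steps=200):
    """Gradient descent on the linearized model; the features never change."""
    feats = [tangent_features(x, w, a) for x in xs]  # frozen at initialization
    f0 = [forward(x, w, a) for x in xs]
    delta = [0.0] * len(feats[0])                    # theta - theta_0
    losses = []
    for _ in range(steps):
        preds = [f + sum(p * d for p, d in zip(feat, delta)) for f, feat in zip(f0, feats)]
        resid = [p - y for p, y in zip(preds, ys)]
        losses.append(0.5 * sum(r * r for r in resid) / len(xs))
        for k in range(len(delta)):
            grad_k = sum(r * feat[k] for r, feat in zip(resid, feats)) / len(xs)
            delta[k] -= lr * grad_k
    return delta, losses

xs = [-1.0, -0.3, 0.4, 0.9]                # toy inputs (illustrative)
ys = [math.sin(2.0 * x) for x in xs]       # toy targets (illustrative)
w0, a0 = init_params(width=64)
kernel = empirical_ntk(xs, w0, a0)
_, losses = train_linearized(xs, ys, w0, a0)
print(f"loss: {losses[0]:.4f} -> {losses[-1]:.4f}")
```

Training only updates `delta`, so the features returned by `tangent_features` stay exactly as they were at initialization, and `kernel` is the corresponding empirical tangent kernel on the toy inputs.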

The author supports these claims with both a theoretical construction (the infinite width limit based on ADAM-like learning dynamics) and experiments comparing infinite width models against finite ones.
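The paper's abstract contrasts SGD with ADAM-like learning dynamics. As a rough illustration of how the two update rules differ at the level of a single step, the sketch below applies plain gradient descent and the standard Adam update to a badly scaled toy quadratic. The toy problem and all names are assumptions for illustration; the paper's infinite width construction is not reproduced here.

```python
import math

def sgd_step(theta, grad, lr):
    """Plain gradient step: each coordinate moves in proportion to its gradient."""
    return [t - lr * g for t, g in zip(theta, grad)]

def adam_step(theta, grad, state, lr, beta1=0.9, beta2=0.999, eps=1e-8):
    """One standard Adam update; `state` holds the step count and moment estimates."""
    state["t"] += 1
    t = state["t"]
    state["m"] = [beta1 * m + (1 - beta1) * g for m, g in zip(state["m"], grad)]
    state["v"] = [beta2 * v + (1 - beta2) * g * g for v, g in zip(state["v"], grad)]
    m_hat = [m / (1 - beta1 ** t) for m in state["m"]]   # bias-corrected first moment
    v_hat = [v / (1 - beta2 ** t) for v in state["v"]]   # bias-corrected second moment
    return [th - lr * mh / (math.sqrt(vh) + eps)
            for th, mh, vh in zip(theta, m_hat, v_hat)]

# Toy loss 0.5 * sum(h_i * theta_i^2) with very different curvatures (illustrative).
curvature = [100.0, 0.01]
theta0 = [1.0, 1.0]
grad0 = [h * t for h, t in zip(curvature, theta0)]

lr = 0.01
sgd_move = [abs(new - old) for new, old in zip(sgd_step(theta0, grad0, lr), theta0)]
state = {"t": 0, "m": [0.0, 0.0], "v": [0.0, 0.0]}
adam_move = [abs(new - old) for new, old in zip(adam_step(theta0, grad0, state, lr), theta0)]

print("SGD step sizes: ", sgd_move)
print("Adam step sizes:", adam_move)
```

On the first step the gradient update moves the two coordinates by amounts that differ by a factor of ten thousand, while Adam moves both by roughly `lr`, because it rescales each coordinate by a running estimate of its gradient magnitude.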

Critical Analysis

The paper makes a compelling case that feature learning may not be as important as commonly believed for a model's generalization performance. However, there are a few important caveats to consider:

Infinite Width Assumption: The paper focuses on the behavior of models with an infinite number of hidden units, which is a theoretical idealization. In practice, real-world models have finite width, and the importance of feature learning may be more pronounced in these cases.
Specific Optimization Regimes: The paper contrasts existing SGD-based infinite width constructions with a limit based on ADAM-like learning dynamics. While this provides interesting insights, the picture may differ for other optimization algorithms and hyperparameters in more general settings.
Task Dependency: The paper's findings may be more applicable to certain types of tasks and datasets than others. The importance of feature learning could vary depending on the complexity and structure of the problem being solved.
Practical Implications: While the paper challenges the conventional wisdom about feature learning, it remains to be seen how these insights translate to real-world machine learning applications. Further research is needed to understand the practical implications of these findings.

Conclusion

This paper presents a thought-provoking challenge to the common assumption that feature learning is crucial for a model's generalization performance. By studying infinite width neural networks, the author argues that their historical performance gap relative to finite models is not caused by missing feature learning, and that other factors, notably the optimizer, may matter more.

These findings have the potential to reshape our understanding of how deep learning models work and what really matters for their success. While there are some important caveats to consider, the paper encourages us to think more critically about the role of feature learning and to explore alternative perspectives on what drives a model's ability to generalize.

As the field of machine learning continues to evolve, research like this that challenges our assumptions and pushes the boundaries of our understanding is invaluable. It paves the way for new and potentially more effective approaches to building intelligent systems that can learn and generalize in powerful ways.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Infinite Width Models That Work: Why Feature Learning Doesn't Matter as Much as You Think

Luke Sernau

Common infinite-width architectures such as Neural Tangent Kernels (NTKs) have historically shown weak performance compared to finite models. This has been attributed to the absence of feature learning. We show that this is not the case. In fact, we show that infinite width NTK models are able to access richer features than finite models by selecting relevant subfeatures from their (infinite) feature vector. In fact, we show experimentally that NTKs under-perform traditional finite models even when feature learning is artificially disabled. Instead, weak performance is due to the fact that existing constructions depend on weak optimizers like SGD. We provide an infinite width limit based on ADAM-like learning dynamics and demonstrate empirically that the resulting models erase this performance gap.

6/28/2024

Feature Learning and Generalization in Deep Networks with Orthogonal Weights

Hannah Day, Yonatan Kahn, Daniel A. Roberts

Fully-connected deep neural networks with weights initialized from independent Gaussian distributions can be tuned to criticality, which prevents the exponential growth or decay of signals propagating through the network. However, such networks still exhibit fluctuations that grow linearly with the depth of the network, which may impair the training of networks with width comparable to depth. We show analytically that rectangular networks with tanh activations and weights initialized from the ensemble of orthogonal matrices have corresponding preactivation fluctuations which are independent of depth, to leading order in inverse width. Moreover, we demonstrate numerically that, at initialization, all correlators involving the neural tangent kernel (NTK) and its descendants at leading order in inverse width -- which govern the evolution of observables during training -- saturate at a depth of $\sim 20$, rather than growing without bound as in the case of Gaussian initializations. We speculate that this structure preserves finite-width feature learning while reducing overall noise, thus improving both generalization and training speed in deep networks with depth comparable to width. We provide some experimental justification by relating empirical measurements of the NTK to the superior performance of deep nonlinear orthogonal networks trained under full-batch gradient descent on the MNIST and CIFAR-10 classification tasks.

6/13/2024

The lazy (NTK) and rich ($\mu$P) regimes: a gentle tutorial

Dhruva Karkada

A central theme of the modern machine learning paradigm is that larger neural networks achieve better performance on a variety of metrics. Theoretical analyses of these overparameterized models have recently centered around studying very wide neural networks. In this tutorial, we provide a nonrigorous but illustrative derivation of the following fact: in order to train wide networks effectively, there is only one degree of freedom in choosing hyperparameters such as the learning rate and the size of the initial weights. This degree of freedom controls the richness of training behavior: at minimum, the wide network trains lazily like a kernel machine, and at maximum, it exhibits feature learning in the so-called $\mu$P regime. In this paper, we explain this richness scale, synthesize recent research results into a coherent whole, offer new perspectives and intuitions, and provide empirical evidence supporting our claims. In doing so, we hope to encourage further study of the richness scale, as it may be key to developing a scientific theory of feature learning in practical deep neural networks.

5/1/2024

Feature learning in finite-width Bayesian deep linear networks with multiple outputs and convolutional layers

Federico Bassetti, Marco Gherardi, Alessandro Ingrosso, Mauro Pastore, Pietro Rotondo

Deep linear networks have been extensively studied, as they provide simplified models of deep learning. However, little is known in the case of finite-width architectures with multiple outputs and convolutional layers. In this manuscript, we provide rigorous results for the statistics of functions implemented by the aforementioned class of networks, thus moving closer to a complete characterization of feature learning in the Bayesian setting. Our results include: (i) an exact and elementary non-asymptotic integral representation for the joint prior distribution over the outputs, given in terms of a mixture of Gaussians; (ii) an analytical formula for the posterior distribution in the case of squared error loss function (Gaussian likelihood); (iii) a quantitative description of the feature learning infinite-width regime, using large deviation theory. From a physical perspective, deep architectures with multiple outputs or convolutional layers represent different manifestations of kernel shape renormalization, and our work provides a dictionary that translates this physics intuition and terminology into rigorous Bayesian statistics.

6/6/2024