Depth Degeneracy in Neural Networks: Vanishing Angles in Fully Connected ReLU Networks on Initialization

Read original: arXiv:2302.09712 - Published 8/16/2024 by Cameron Jakub, Mihai Nica

Overview

Deep neural networks have shown remarkable performance on various tasks, but many of their properties are not yet fully understood.
One such mystery is the "depth degeneracy" phenomenon, where the deeper the network, the closer it becomes to a constant function at initialization.
This paper examines the evolution of the angle between two inputs to a ReLU neural network as a function of the number of layers.

Plain English Explanation

In this paper, the researchers investigate a puzzling property of deep neural networks called the "depth degeneracy" phenomenon. This refers to the fact that as you make a neural network deeper, it tends to become closer and closer to a constant function when you first initialize the network's parameters.

The researchers look at how the angle between two different input vectors changes as the network gets deeper. Using combinatorial expansions, they derive precise formulas that describe how quickly this angle goes to zero as the depth increases.

Interestingly, these formulas capture some subtle fluctuations that are not visible when you analyze neural networks in the popular "infinite width" framework. This leads to some qualitatively different predictions compared to that approach.

The researchers validate their theoretical results through computer simulations, showing that their formulas accurately describe the behavior of real, finite-sized neural networks. They also explore how this depth degeneracy phenomenon can negatively impact the training process for real-world neural networks.

Technical Explanation

The paper tracks the angle between two inputs to a fully connected ReLU neural network as a function of depth, and derives precise mathematical formulas that quantify the "depth degeneracy" phenomenon at initialization.

Using combinatorial expansions, they found formulas that describe how quickly this angle goes to zero as the network gets deeper. These formulas take into account fine-grained, "microscopic" fluctuations that are missed in the popular infinite width framework, leading to different predictions.
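For comparison, the infinite-width baseline has a well-known closed form. The sketch below iterates the standard infinite-width angle update for bias-free ReLU layers with Gaussian weights, cos(theta_next) = (sin(theta) + (pi - theta) * cos(theta)) / pi, which follows from the degree-1 arc-cosine kernel. It shows only the baseline, not the paper's finite-width formulas, which this summary does not reproduce; the function names are illustrative.

```python
import math


def infinite_width_angle_step(theta: float) -> float:
    """Apply one ReLU layer of the infinite-width angle map (baseline only)."""
    cos_next = (math.sin(theta) + (math.pi - theta) * math.cos(theta)) / math.pi
    # Clamp to [-1, 1] to guard against floating-point rounding before acos.
    return math.acos(max(-1.0, min(1.0, cos_next)))


def infinite_width_angles(theta0: float, depth: int) -> list[float]:
    """Return the angles [theta_0, theta_1, ..., theta_depth]."""
    angles = [theta0]
    for _ in range(depth):
        angles.append(infinite_width_angle_step(angles[-1]))
    return angles


if __name__ == "__main__":
    # Start from orthogonal inputs and watch the angle shrink layer by layer.
    for layer, theta in enumerate(infinite_width_angles(math.pi / 2, 10)):
        print(f"layer {layer:2d}: angle = {theta:.4f} rad")
```

In this baseline the angle shrinks deterministically at every layer. According to the paper, finite-width networks deviate from this curve because of fluctuations that the infinite-width limit averages out.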

The researchers validated their theoretical results through Monte Carlo experiments, showing that their formulas accurately model the behavior of finite-sized neural networks. They also empirically investigated how the depth degeneracy phenomenon can negatively impact the training of real-world neural networks.
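A Monte Carlo experiment of this kind might look like the following sketch. It assumes a bias-free fully connected ReLU network with i.i.d. Gaussian weights of variance 2/width (He initialization). The width, depth, seed, and function names are illustrative choices, not the paper's experimental setup.

```python
import numpy as np


def angle_between(x: np.ndarray, y: np.ndarray) -> float:
    """Angle in radians between two nonzero vectors."""
    cos = np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))
    return float(np.arccos(np.clip(cos, -1.0, 1.0)))


def simulate_angles(width=256, depth=30, theta0=np.pi / 2, seed=0):
    """Push two inputs through one random ReLU network; record the angle per layer."""
    rng = np.random.default_rng(seed)
    # Two unit vectors separated by angle theta0.
    x = np.zeros(width)
    x[0] = 1.0
    y = np.zeros(width)
    y[0], y[1] = np.cos(theta0), np.sin(theta0)

    angles = [angle_between(x, y)]
    for _ in range(depth):
        # Fresh Gaussian weights per layer, variance 2 / width (assumed init).
        W = rng.normal(0.0, np.sqrt(2.0 / width), size=(width, width))
        x = np.maximum(W @ x, 0.0)  # ReLU
        y = np.maximum(W @ y, 0.0)
        angles.append(angle_between(x, y))
    return angles


if __name__ == "__main__":
    angles = simulate_angles()
    for layer in (0, 1, 5, 10, 20, 30):
        print(f"layer {layer:2d}: angle = {angles[layer]:.4f} rad")
```

Averaging such runs over many seeds and widths is one way to compare finite-network behavior against a theoretical prediction.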

Interestingly, the formulas derived by the researchers are expressed in terms of the mixed moments of correlated Gaussian random variables passed through the ReLU activation function. The researchers also discovered a surprising combinatorial connection between these mixed moments and the Bessel numbers, which allowed them to explicitly evaluate these moments.
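To make "mixed moments" concrete, the sketch below estimates E[ReLU(G)^a ReLU(W)^b] by sampling, where G and W are standard Gaussians with correlation cos(theta). For a = b = 1 it compares the estimate against the standard closed form (sin(theta) + (pi - theta) * cos(theta)) / (2 * pi). The function names and sample size are illustrative, positive integer exponents are assumed, and the paper's Bessel-number formulas for general moments are not reproduced here.

```python
import numpy as np


def relu_mixed_moment_mc(theta, a, b, n_samples=400_000, seed=0):
    """Monte Carlo estimate of E[ReLU(G)^a * ReLU(W)^b], corr(G, W) = cos(theta)."""
    rng = np.random.default_rng(seed)
    g = rng.standard_normal(n_samples)
    z = rng.standard_normal(n_samples)
    # W is standard Gaussian with correlation cos(theta) to G.
    w = np.cos(theta) * g + np.sin(theta) * z
    return float(np.mean(np.maximum(g, 0.0) ** a * np.maximum(w, 0.0) ** b))


def relu_moment_11_exact(theta):
    """Closed form for the a = b = 1 case (degree-1 arc-cosine kernel)."""
    return (np.sin(theta) + (np.pi - theta) * np.cos(theta)) / (2.0 * np.pi)


if __name__ == "__main__":
    for theta in (0.3, 1.0, 2.0):
        estimate = relu_mixed_moment_mc(theta, 1, 1)
        exact = relu_moment_11_exact(theta)
        print(f"theta = {theta:.1f}: MC = {estimate:.4f}, exact = {exact:.4f}")
```

Sampling works for any exponents but converges slowly. An explicit evaluation, such as the one the paper obtains through its combinatorial connection, avoids that cost.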

Critical Analysis

The key strength of this research is the rigorous mathematical analysis that allowed the researchers to derive precise formulas capturing the "depth degeneracy" phenomenon in neural networks. By going beyond the infinite width approximation, they were able to uncover subtle effects that lead to qualitatively different predictions.

However, a potential limitation is that the analysis is focused on the initialization phase of training, and does not directly address how the depth degeneracy phenomenon may evolve during the full training process. The researchers do provide some empirical observations on the impact on training, but a more comprehensive theoretical treatment of this would be valuable.

Additionally, while the researchers validate their results on finite-sized networks, it would be helpful to see more extensive empirical evaluation on a wider range of network architectures and tasks. This could help strengthen the generalizability of their findings.

Overall, this is a technically impressive piece of work that provides important mathematical insights into the inner workings of deep neural networks. Further research building on these results could lead to a deeper understanding of neural network training and potentially inform the design of more robust architectures.

Conclusion

This paper makes significant progress in understanding the "depth degeneracy" phenomenon, in which a deep neural network at initialization becomes closer to a constant function as its depth increases. By analyzing how the angle between two inputs evolves from layer to layer, the researchers derived precise mathematical formulas that capture subtle fluctuations missed by the infinite-width framework.

These results not only advance the theoretical understanding of neural networks, but also have implications for practical applications. The depth degeneracy phenomenon can negatively impact the training of real-world neural networks, so these insights could inform the development of more effective initialization methods and training techniques.

Overall, this work represents an important step forward in the quest to unravel the inner workings of deep learning systems, which is crucial for building more robust and reliable AI technologies in the future.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Depth Degeneracy in Neural Networks: Vanishing Angles in Fully Connected ReLU Networks on Initialization

Cameron Jakub, Mihai Nica

Despite remarkable performance on a variety of tasks, many properties of deep neural networks are not yet theoretically understood. One such mystery is the depth degeneracy phenomenon: the deeper you make your network, the closer your network is to a constant function on initialization. In this paper, we examine the evolution of the angle between two inputs to a ReLU neural network as a function of the number of layers. By using combinatorial expansions, we find precise formulas for how fast this angle goes to zero as depth increases. These formulas capture microscopic fluctuations that are not visible in the popular framework of infinite width limits, and lead to qualitatively different predictions. We validate our theoretical results with Monte Carlo experiments and show that our results accurately approximate finite network behaviour. We also empirically investigate how the depth degeneracy phenomenon can negatively impact training of real networks. The formulas are given in terms of the mixed moments of correlated Gaussians passed through the ReLU function. We also find a surprising combinatorial connection between these mixed moments and the Bessel numbers that allows us to explicitly evaluate these moments.

8/16/2024

Towards Lower Bounds on the Depth of ReLU Neural Networks

Christoph Hertrich, Amitabh Basu, Marco Di Summa, Martin Skutella

We contribute to a better understanding of the class of functions that can be represented by a neural network with ReLU activations and a given architecture. Using techniques from mixed-integer optimization, polyhedral theory, and tropical geometry, we provide a mathematical counterbalance to the universal approximation theorems which suggest that a single hidden layer is sufficient for learning any function. In particular, we investigate whether the class of exactly representable functions strictly increases by adding more layers (with no restrictions on size). As a by-product of our investigations, we settle an old conjecture about piecewise linear functions by Wang and Sun (2005) in the affirmative. We also present upper bounds on the sizes of neural networks required to represent functions with logarithmic depth.

7/18/2024

Implicit Hypersurface Approximation Capacity in Deep ReLU Networks

Jonatan Vallin, Karl Larsson, Mats G. Larson

We develop a geometric approximation theory for deep feed-forward neural networks with ReLU activations. Given a $d$-dimensional hypersurface in $\mathbb{R}^{d+1}$ represented as the graph of a $C^2$-function $\phi$, we show that a deep fully-connected ReLU network of width $d+1$ can implicitly construct an approximation as its zero contour with a precision bound depending on the number of layers. This result is directly applicable to the binary classification setting where the sign of the network is trained as a classifier, with the network's zero contour as a decision boundary. Our proof is constructive and relies on the geometrical structure of ReLU layers provided in [doi:10.48550/arXiv.2310.03482]. Inspired by this geometrical description, we define a new equivalent network architecture that is easier to interpret geometrically, where the action of each hidden layer is a projection onto a polyhedral cone derived from the layer's parameters. By repeatedly adding such layers, with parameters chosen such that we project small parts of the graph of $\phi$ from the outside in, we, in a controlled way, construct a network that implicitly approximates the graph over a ball of radius $R$. The accuracy of this construction is controlled by a discretization parameter $\delta$ and we show that the tolerance in the resulting error bound scales as $(d-1)R^{3/2}\delta^{1/2}$ and the required number of layers is of order $d\big(\frac{32R}{\delta}\big)^{\frac{d+1}{2}}$.

7/8/2024

Feature Learning and Generalization in Deep Networks with Orthogonal Weights

Hannah Day, Yonatan Kahn, Daniel A. Roberts

Fully-connected deep neural networks with weights initialized from independent Gaussian distributions can be tuned to criticality, which prevents the exponential growth or decay of signals propagating through the network. However, such networks still exhibit fluctuations that grow linearly with the depth of the network, which may impair the training of networks with width comparable to depth. We show analytically that rectangular networks with tanh activations and weights initialized from the ensemble of orthogonal matrices have corresponding preactivation fluctuations which are independent of depth, to leading order in inverse width. Moreover, we demonstrate numerically that, at initialization, all correlators involving the neural tangent kernel (NTK) and its descendants at leading order in inverse width -- which govern the evolution of observables during training -- saturate at a depth of $\sim 20$, rather than growing without bound as in the case of Gaussian initializations. We speculate that this structure preserves finite-width feature learning while reducing overall noise, thus improving both generalization and training speed in deep networks with depth comparable to width. We provide some experimental justification by relating empirical measurements of the NTK to the superior performance of deep nonlinear orthogonal networks trained under full-batch gradient descent on the MNIST and CIFAR-10 classification tasks.

6/13/2024