Random ReLU Neural Networks as Non-Gaussian Processes

2405.10229

Published 5/17/2024 by Rahul Parhi, Pakshal Bohra, Ayoub El Biari, Mehrsa Pourya, Michael Unser

Abstract

We consider a large class of shallow neural networks with randomly initialized parameters and rectified linear unit activation functions. We prove that these random neural networks are well-defined non-Gaussian processes. As a by-product, we demonstrate that these networks are solutions to stochastic differential equations driven by impulsive white noise (combinations of random Dirac measures). These processes are parameterized by the law of the weights and biases as well as the density of activation thresholds in each bounded region of the input domain. We prove that these processes are isotropic and wide-sense self-similar with Hurst exponent $3/2$. We also derive a remarkably simple closed-form expression for their autocovariance function. Our results are fundamentally different from prior work in that we consider a non-asymptotic viewpoint: The number of neurons in each bounded region of the input domain (i.e., the width) is itself a random variable with a Poisson law with mean proportional to the density parameter. Finally, we show that, under suitable hypotheses, as the expected width tends to infinity, these processes can converge in law not only to Gaussian processes, but also to non-Gaussian processes depending on the law of the weights. Our asymptotic results provide a new take on several classical results (wide networks converge to Gaussian processes) as well as some new ones (wide networks can converge to non-Gaussian processes).
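One claim in the abstract can be unpacked with a short formula. Under the usual definition of wide-sense (second-order) self-similarity, assumed here rather than taken from the paper, a zero-mean process $f$ with Hurst exponent $H$ has a covariance that scales as follows under dilation of the input. The symbols $f$, $a$, $x$, $y$ are introduced only for illustration.

$$
\mathbb{E}\big[f(ax)\,f(ay)\big] = a^{2H}\,\mathbb{E}\big[f(x)\,f(y)\big], \qquad a > 0 .
$$

With $H = 3/2$ the factor is $a^{3}$, so stretching the input domain by a factor of 2 multiplies the variance $\mathbb{E}[f(x)^2]$ by 8. For comparison, Brownian motion has $H = 1/2$ and its variance only doubles under the same stretch.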

Overview

This paper investigates the properties of random ReLU neural networks, which are neural networks with randomly initialized weights and ReLU activation functions.
The authors show that random ReLU neural networks can be modeled as non-Gaussian stochastic processes, in contrast to the commonly assumed Gaussian processes.
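The paper characterizes these processes: they are isotropic, wide-sense self-similar with Hurst exponent $3/2$, and have a closed-form autocovariance function. As the expected width tends to infinity, they can converge to Gaussian or to non-Gaussian processes, depending on the law of the weights.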

Plain English Explanation

Random ReLU neural networks are a type of artificial neural network where the weights (the numbers that determine how the inputs are transformed) are randomly chosen, and the activation function used is a ReLU (Rectified Linear Unit). This means that the output of the network is a non-linear function of the inputs, but the specific shape of this function is determined randomly rather than being learned from data.
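As a minimal sketch of the object described above, the following draws one random shallow ReLU network on the real line and evaluates it at a few inputs. The function names, the one-dimensional input, and the specific distributions (random signs for input weights, uniform biases, Gaussian output weights) are illustrative assumptions, not the paper's construction.

```python
import numpy as np

def sample_relu_network(width, rng):
    """Draw the parameters of one random shallow ReLU network on the real line."""
    w = rng.choice([-1.0, 1.0], size=width)   # input weights (assumed: random signs)
    b = rng.uniform(-1.0, 1.0, size=width)    # biases (assumed: uniform on [-1, 1])
    v = rng.standard_normal(width)            # output weights (assumed: Gaussian)
    return w, b, v

def evaluate(params, x):
    """Evaluate f(x) = sum_k v_k * max(0, w_k * x - b_k) at each point of x."""
    w, b, v = params
    x = np.atleast_1d(np.asarray(x, dtype=float))
    pre = np.outer(x, w) - b           # pre-activations, shape (len(x), width)
    return np.maximum(pre, 0.0) @ v    # network outputs, shape (len(x),)

rng = np.random.default_rng(0)
params = sample_relu_network(width=50, rng=rng)
xs = np.linspace(-1.0, 1.0, 5)
ys = evaluate(params, xs)
print(ys.shape)  # (5,)
```

Each draw is a continuous piecewise-linear function of the input with at most `width` kinks; nothing is learned from data, and a new seed gives a different function.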

The authors of this paper show that these random ReLU networks can be modeled as a special type of stochastic process, known as a non-Gaussian process. This is in contrast to the more commonly assumed Gaussian process model, which has been used to study the properties of other types of random neural networks.

The significance of this finding is that random ReLU networks may have different approximation capabilities and training dynamics than networks described by Gaussian processes. For example, the paper on multi-layer random features shows that networks in the Gaussian (infinite-width) regime can approximate a wide range of functions, while the paper on over-parameterized shallow ReLU networks explores the training dynamics of random ReLU networks. By modeling random ReLU networks as non-Gaussian processes, this paper provides a new theoretical perspective on these topics.

Technical Explanation

The key technical contribution of this paper is to show that random ReLU neural networks can be modeled as non-Gaussian stochastic processes. This is in contrast to the commonly assumed Gaussian process model, which has been used to study the properties of other types of random neural networks, such as those analyzed in the paper on the spectral complexity of deep neural networks.

To derive this result, the authors take a non-asymptotic viewpoint: the number of neurons in each bounded region of the input domain (the width) is itself a random variable with a Poisson law whose mean is proportional to a density parameter. As a by-product, they show that these networks are solutions to stochastic differential equations driven by impulsive white noise (combinations of random Dirac measures). The resulting processes are isotropic and wide-sense self-similar with Hurst exponent $3/2$, and their autocovariance function has a simple closed form.

By the abstract's account, the non-Gaussian behavior is tied to the finite, random width and to the law of the weights: as the expected width tends to infinity, the processes can converge in law to Gaussian processes or, depending on the law of the weights, to non-Gaussian ones. This complements related work such as the Wilsonian renormalization of neural network Gaussian processes, and it suggests that the training dynamics and approximation power of random ReLU networks may differ from those of networks described by Gaussian processes.
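One simple way to see non-Gaussian behavior numerically is to look at the excess kurtosis of the network output at a fixed input, which is zero for a Gaussian random variable. The sketch below is a toy diagnostic under assumed distributions (random-sign input weights, uniform biases, Gaussian output weights, one-dimensional input) and fixed widths; it is not the paper's method and does not reproduce its construction.

```python
import numpy as np

def outputs_at_point(x, width, n_draws, rng):
    """Sample f(x) for n_draws independent random networks of the given width."""
    w = rng.choice([-1.0, 1.0], size=(n_draws, width))   # assumed sign weights
    b = rng.uniform(-1.0, 1.0, size=(n_draws, width))    # assumed uniform biases
    v = rng.standard_normal((n_draws, width))            # assumed Gaussian output weights
    return np.sum(v * np.maximum(w * x - b, 0.0), axis=1)

def excess_kurtosis(samples):
    """Sample excess kurtosis (fourth standardized moment minus 3)."""
    z = samples - samples.mean()
    return np.mean(z**4) / np.mean(z**2) ** 2 - 3.0

rng = np.random.default_rng(0)
narrow = excess_kurtosis(outputs_at_point(0.5, width=1, n_draws=10000, rng=rng))
wide = excess_kurtosis(outputs_at_point(0.5, width=200, n_draws=10000, rng=rng))
print(f"width 1: {narrow:.2f}, width 200: {wide:.2f}")
```

In this toy setup the single-neuron output is clearly heavy-tailed, while the sum over many independent neurons with finite-variance weights moves toward Gaussian statistics, in line with the classical central-limit picture. Heavy-tailed output weights would be needed to probe the non-Gaussian limits mentioned in the abstract.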

Critical Analysis

The paper provides a strong theoretical foundation for understanding the properties of random ReLU neural networks, but there are a few potential limitations and areas for further research:

The analysis focuses on random ReLU networks, but many practical neural networks use other activation functions, such as sigmoid or tanh. It would be interesting to see if similar non-Gaussian behavior arises in these other types of random networks.
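The paper's main results are non-asymptotic: the width in each bounded region is a Poisson random variable, and the infinite-width behavior is treated separately as the expected width tends to infinity. However, the analysis is restricted to shallow networks, whereas most practical networks are deep. Extending the analysis to deeper architectures could provide additional insights.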
The implications of the non-Gaussian behavior for the practical performance of random ReLU networks are not fully explored. Further empirical studies or connections to existing work on the approximation power and training dynamics of these networks would be valuable.
The paper relies heavily on technical tools from stochastic process theory, including stochastic differential equations driven by impulsive white noise. While the authors provide clear explanations, the mathematical complexity may limit the accessibility of the results to a broader audience.
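On the question of width raised in the points above, the abstract describes the number of neurons in a bounded region as a Poisson random variable with mean proportional to a density parameter. A one-dimensional toy version of that idea can be sketched as follows; the names `density` and `radius`, the uniform placement of thresholds, and the Gaussian output weights are assumptions made for illustration only.

```python
import numpy as np

def sample_poisson_relu_network(density, radius, rng):
    """Draw a 1D ReLU network with a Poisson number of thresholds on [-radius, radius]."""
    # The width is random: Poisson with mean density * (length of the interval).
    width = rng.poisson(density * 2.0 * radius)
    t = rng.uniform(-radius, radius, size=width)   # activation thresholds (kink locations)
    w = rng.choice([-1.0, 1.0], size=width)        # orientation of each ramp
    v = rng.standard_normal(width)                 # output weights (assumed Gaussian)
    return t, w, v

def evaluate(params, x):
    """Evaluate f(x) = sum_k v_k * max(0, w_k * (x - t_k)); zero if the width is 0."""
    t, w, v = params
    x = np.atleast_1d(np.asarray(x, dtype=float))
    return np.maximum(w * (x[:, None] - t), 0.0) @ v

rng = np.random.default_rng(1)
widths = [sample_poisson_relu_network(5.0, 1.0, rng)[0].size for _ in range(2000)]
print(np.mean(widths))  # the expected width is density * 2 * radius
```

Every draw has a finite width that varies from sample to sample, and increasing `density` raises the expected width, which is the knob the asymptotic results in the abstract turn.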

Conclusion

This paper offers a novel perspective on the theoretical properties of random ReLU neural networks by modeling them as non-Gaussian stochastic processes. This finding suggests that the approximation capabilities and training dynamics of these networks may differ from those with the more commonly assumed Gaussian process structure.

By providing a deeper understanding of the underlying mathematical structure of random ReLU networks, this work opens up new avenues for research into the design and optimization of these models. The insights could also have implications for the development of more robust and interpretable neural network architectures, which is an active area of research in machine learning.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Large Deviations of Gaussian Neural Networks with ReLU activation

Quirin Vogel

We prove a large deviation principle for deep neural networks with Gaussian weights and (at most linearly growing) activation functions. This generalises earlier work, in which bounded and continuous activation functions were considered. In practice, linearly growing activation functions such as ReLU are most commonly used. We furthermore simplify previous expressions for the rate function and give power-series expansions for the ReLU case.

5/28/2024

stat.ML cs.LG

Quantitative CLTs in Deep Neural Networks

Stefano Favaro, Boris Hanin, Domenico Marinucci, Ivan Nourdin, Giovanni Peccati

We study the distribution of a fully connected neural network with random Gaussian weights and biases in which the hidden layer widths are proportional to a large constant $n$. Under mild assumptions on the non-linearity, we obtain quantitative bounds on normal approximations valid at large but finite $n$ and any fixed network depth. Our theorems show both for the finite-dimensional distributions and the entire process, that the distance between a random fully connected network (and its derivatives) to the corresponding infinite width Gaussian process scales like $n^{-\gamma}$ for $\gamma>0$, with the exponent depending on the metric used to measure discrepancy. Our bounds are strictly stronger in terms of their dependence on network width than any previously available in the literature; in the one-dimensional case, we also prove that they are optimal, i.e., we establish matching lower bounds.

6/18/2024

cs.LG cs.AI stat.ML

Multi-layer random features and the approximation power of neural networks

Rustem Takhanov

A neural architecture with randomly initialized weights, in the infinite width limit, is equivalent to a Gaussian Random Field whose covariance function is the so-called Neural Network Gaussian Process kernel (NNGP). We prove that a reproducing kernel Hilbert space (RKHS) defined by the NNGP contains only functions that can be approximated by the architecture. To achieve a certain approximation error the required number of neurons in each layer is defined by the RKHS norm of the target function. Moreover, the approximation can be constructed from a supervised dataset by a random multi-layer representation of an input vector, together with training of the last layer's weights. For a 2-layer NN and a domain equal to an $(n-1)$-dimensional sphere in ${\mathbb R}^n$, we compare the number of neurons required by Barron's theorem and by the multi-layer features construction. We show that if eigenvalues of the integral operator of the NNGP decay slower than $k^{-n-\frac{2}{3}}$ where $k$ is an order of an eigenvalue, then our theorem guarantees a more succinct neural network approximation than Barron's theorem. We also make some computational experiments to verify our theoretical findings. Our experiments show that realistic neural networks easily learn target functions even when both theorems do not give any guarantees.

4/29/2024

cs.LG cs.AI

BrowNNe: Brownian Nonlocal Neurons & Activation Functions

Sriram Nagaraj, Truman Hickok

It is generally thought that the use of stochastic activation functions in deep learning architectures yields models with superior generalization abilities. However, a sufficiently rigorous statement and theoretical proof of this heuristic is lacking in the literature. In this paper, we provide several novel contributions to the literature in this regard. Defining a new notion of nonlocal directional derivative, we analyze its theoretical properties (existence and convergence). Second, using a probabilistic reformulation, we show that nonlocal derivatives are epsilon-subgradients, and derive sample complexity results for convergence of stochastic gradient descent-like methods using nonlocal derivatives. Finally, using our analysis of the nonlocal gradient of Hölder continuous functions, we observe that sample paths of Brownian motion admit nonlocal directional derivatives, and the nonlocal derivatives of Brownian motion are seen to be Gaussian processes with computable mean and standard deviation. Using the theory of nonlocal directional derivatives, we solve a highly nondifferentiable and nonconvex model problem of parameter estimation on image articulation manifolds. Using Brownian motion infused ReLU activation functions with the nonlocal gradient in place of the usual gradient during backpropagation, we also perform experiments on multiple well-studied deep learning architectures. Our experiments indicate the superior generalization capabilities of Brownian neural activation functions in low-training data regimes, where the use of stochastic neurons beats the deterministic ReLU counterpart.

6/26/2024

cs.LG cs.NA