Posterior Inference on Shallow Infinitely Wide Bayesian Neural Networks under Weights with Unbounded Variance

arXiv:2305.10664

Published 6/6/2024 by Jorge Loría, Anindya Bhadra

Abstract

From the classical and influential works of Neal (1996), it is known that the infinite width scaling limit of a Bayesian neural network with one hidden layer is a Gaussian process, when the network weights have bounded prior variance. Neal's result has been extended to networks with multiple hidden layers and to convolutional neural networks, also with Gaussian process scaling limits. The tractable properties of Gaussian processes then allow straightforward posterior inference and uncertainty quantification, considerably simplifying the study of the limit process compared to a network of finite width. Neural network weights with unbounded variance, however, pose unique challenges. In this case, the classical central limit theorem breaks down and it is well known that the scaling limit is an $\alpha$-stable process under suitable conditions. However, current literature is primarily limited to forward simulations under these processes and the problem of posterior inference under such a scaling limit remains largely unaddressed, unlike in the Gaussian process case. To this end, our contribution is an interpretable and computationally efficient procedure for posterior inference, using a conditionally Gaussian representation, that then allows full use of the Gaussian process machinery for tractable posterior inference and uncertainty quantification in the non-Gaussian regime.

Overview

The paper investigates the scaling limits of Bayesian neural networks, which describe the behavior of these networks as the number of hidden units approaches infinity.
When the network weights have bounded prior variance, the scaling limit is a Gaussian process, which simplifies posterior inference and uncertainty quantification.
However, when the weights have unbounded variance, the scaling limit is an α-stable process, posing unique challenges for posterior inference.
The authors propose an efficient procedure for posterior inference under this non-Gaussian scaling limit, using a conditionally Gaussian representation.

Plain English Explanation

Bayesian neural networks are a type of machine learning model that can quantify the uncertainty in their predictions. As the number of hidden units in these networks gets very large, their behavior can be described by a mathematical process.

When the weights (the parameters that determine the network's behavior) have a bounded amount of variation, this limiting process is a Gaussian process. Gaussian processes have handy mathematical properties that make it easy to analyze the network's behavior and uncertainties.

But if the weights have an unbounded amount of variation, the limiting process is not a Gaussian process. Instead, it's a different type of mathematical process called an α-stable process. These α-stable processes are more challenging to work with.

The authors of this paper developed a new method that can still perform efficient and interpretable analysis of Bayesian neural networks, even when the weights have unbounded variation and the limiting process is non-Gaussian. Their key insight was to represent the non-Gaussian process in a special way that allows them to leverage the tools of Gaussian processes.

Technical Explanation

The paper builds on the classical work of Neal (1996), which showed that the infinite width scaling limit of a Bayesian neural network with one hidden layer is a Gaussian process, when the network weights have bounded prior variance.
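The Gaussian limit can be illustrated with a small simulation. The following is a minimal sketch, not the authors' code: it assumes a tanh activation, standard normal hidden-layer weights and biases, and output weights with variance 1/width, all of which are illustrative choices. The function name `sample_wide_network` is hypothetical.

```python
import numpy as np

def sample_wide_network(x, width, n_draws, rng):
    """Draw prior outputs f(x) = sum_j v_j * tanh(a_j + u_j * x) of a one-hidden-layer network."""
    x = np.asarray(x, dtype=float)
    a = rng.normal(size=(n_draws, width, 1))   # hidden biases (assumed N(0, 1))
    u = rng.normal(size=(n_draws, width, 1))   # input-to-hidden weights (assumed N(0, 1))
    # Output weights have variance 1/width, so Var f(x) stays bounded as width grows.
    v = rng.normal(scale=1.0 / np.sqrt(width), size=(n_draws, width, 1))
    hidden = np.tanh(a + u * x[None, None, :])
    return (v * hidden).sum(axis=1)            # shape (n_draws, len(x))

rng = np.random.default_rng(0)
draws = sample_wide_network([-1.0, 0.0, 1.0], width=200, n_draws=1000, rng=rng)
print(draws.shape)
print(draws.var(axis=0))   # empirical prior variance at each input
```

Because tanh is bounded by 1 in absolute value, the prior variance at any input is at most 1 under these assumptions, whatever the width. That bounded variance is what lets the central limit theorem apply to the sum over hidden units.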

This result has been extended to networks with multiple hidden layers and convolutional neural networks, all of which have Gaussian process scaling limits. The tractable properties of Gaussian processes then simplify posterior inference and uncertainty quantification, compared to analyzing a network of finite width.

However, when the network weights have unbounded variance, the classical central limit theorem breaks down. In this case, the scaling limit is known to be an α-stable process under suitable conditions.
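To see what changes with unbounded variance, the sketch below draws output weights from a symmetric α-stable law using the Chambers-Mallows-Stuck sampler and rescales them by width^(-1/α) instead of width^(-1/2). This is an illustration under assumed settings (tanh activation, standard normal hidden-layer weights), not the paper's exact prior. The function names are hypothetical.

```python
import numpy as np

def sample_symmetric_stable(alpha, size, rng):
    """Chambers-Mallows-Stuck sampler for standard symmetric alpha-stable draws, 0 < alpha <= 2."""
    u = rng.uniform(-np.pi / 2, np.pi / 2, size=size)
    w = rng.exponential(size=size)
    return (np.sin(alpha * u) / np.cos(u) ** (1.0 / alpha)
            * (np.cos((1.0 - alpha) * u) / w) ** ((1.0 - alpha) / alpha))

def sample_heavy_tailed_network(x, width, n_draws, alpha, rng):
    """Prior draws of f(x) = sum_j v_j * tanh(a_j + u_j * x) with alpha-stable output weights."""
    x = np.asarray(x, dtype=float)
    a = rng.normal(size=(n_draws, width, 1))   # hidden biases (assumed N(0, 1))
    u = rng.normal(size=(n_draws, width, 1))   # input-to-hidden weights (assumed N(0, 1))
    # The width**(-1/alpha) scaling replaces the usual width**(-1/2) scaling.
    v = width ** (-1.0 / alpha) * sample_symmetric_stable(alpha, (n_draws, width, 1), rng)
    return (v * np.tanh(a + u * x[None, None, :])).sum(axis=1)

rng = np.random.default_rng(1)
gaussian_like = sample_symmetric_stable(2.0, 200_000, rng)   # alpha = 2 is the Gaussian case
cauchy_like = sample_symmetric_stable(1.0, 200_000, rng)     # alpha = 1 is the Cauchy case

light = sample_heavy_tailed_network([0.0], width=200, n_draws=1000, alpha=2.0, rng=rng)
heavy = sample_heavy_tailed_network([0.0], width=200, n_draws=1000, alpha=1.0, rng=rng)
print(np.abs(light).max(), np.abs(heavy).max())   # the alpha = 1 draws show far larger extremes
```

At α = 2 the sampler reduces to a normal distribution with variance 2, and at α = 1 it reduces to a standard Cauchy distribution, which gives two quick sanity checks on the sampler.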

While forward simulations under these α-stable processes have been studied, the problem of posterior inference remains largely unaddressed, unlike in the Gaussian process case. To address this, the authors propose an interpretable and computationally efficient procedure for posterior inference, using a conditionally Gaussian representation.

This allows them to leverage the machinery of Gaussian processes to perform tractable posterior inference and uncertainty quantification, even in the non-Gaussian regime.
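The idea of a conditionally Gaussian representation can be sketched with a toy model. This is an assumption-laden illustration, not the authors' construction: suppose f(x) = sum_j sqrt(s_j) z_j phi_j(x) with z_j standard normal and positive latent scales s_j. Once the scales are fixed, f is a Gaussian process with kernel K_s(x, x') = sum_j s_j phi_j(x) phi_j(x'), and the standard GP regression formulas apply. The basis phi_j(x) = tanh(x - c_j), the centers, the fixed scale values, and the function names are all hypothetical. A full procedure would also need to sample or integrate over the scales, which is omitted here.

```python
import numpy as np

def scale_mixture_kernel(xa, xb, centers, scales):
    """K_s(x, x') = sum_j s_j * phi_j(x) * phi_j(x') with phi_j(x) = tanh(x - c_j)."""
    phi_a = np.tanh(xa[:, None] - centers[None, :])
    phi_b = np.tanh(xb[:, None] - centers[None, :])
    return (phi_a * scales) @ phi_b.T

def conditional_gp_posterior(K_train, K_cross, K_test, y, noise_var):
    """Standard GP regression formulas, valid once the latent scales are held fixed."""
    A = K_train + noise_var * np.eye(len(y))
    L = np.linalg.cholesky(A)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    mean = K_cross.T @ alpha                  # posterior mean at the test inputs
    V = np.linalg.solve(L, K_cross)
    cov = K_test - V.T @ V                    # posterior covariance at the test inputs
    return mean, cov

x_train = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
y_train = np.array([-0.9, -0.6, 0.1, 0.7, 1.0])   # made-up observations for illustration
x_test = np.array([-1.5, 0.5, 1.5])
centers = np.array([-1.0, 0.0, 1.0, 2.0])
scales = np.array([0.5, 2.0, 0.1, 1.0])           # one hypothetical draw of the latent scales

K_train = scale_mixture_kernel(x_train, x_train, centers, scales)
K_cross = scale_mixture_kernel(x_train, x_test, centers, scales)
K_test = scale_mixture_kernel(x_test, x_test, centers, scales)
mean, cov = conditional_gp_posterior(K_train, K_cross, K_test, y_train, noise_var=0.1)
print(mean)
print(np.diag(cov))
```

Heavy tails enter only through the distribution of the scales. Everything downstream of a given set of scales is ordinary Gaussian process algebra, which is why the Gaussian machinery carries over.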

Critical Analysis

The paper provides a valuable contribution by addressing the challenge of posterior inference under the non-Gaussian scaling limit of Bayesian neural networks with unbounded weight variances.

While forward simulations under α-stable processes have been explored, the authors rightly note that the problem of posterior inference in this setting has remained largely unaddressed. Their proposed method of using a conditionally Gaussian representation is an ingenious solution that allows them to harness the well-understood tools of Gaussian processes.

One potential limitation is that the method may be sensitive to the specific choice of the conditional Gaussian representation. The authors acknowledge this and mention that further research is needed to fully understand the properties and limitations of their approach.

Additionally, the paper focuses on the theoretical properties of the scaling limit and the proposed inference method, but does not provide extensive empirical evaluations. It would be valuable to see how the method performs on realistic benchmark tasks, especially in comparison to other approaches for Bayesian inference in deep neural networks.

Overall, this paper makes an important contribution by addressing a significant gap in the literature and providing a promising solution for posterior inference in Bayesian neural networks with non-Gaussian scaling limits. Further empirical validation and exploration of the method's properties would help solidify its value and potential impact.

Conclusion

This paper tackles the challenge of posterior inference in Bayesian neural networks with unbounded weight variances, where the scaling limit is a non-Gaussian α-stable process. By proposing an efficient procedure that leverages a conditionally Gaussian representation, the authors enable the use of well-understood Gaussian process tools for tractable posterior inference and uncertainty quantification in this setting.

This work represents an important step forward in understanding and analyzing the behavior of Bayesian neural networks in the limit of infinitely many hidden units. The authors' approach has the potential to significantly simplify the study of these powerful machine learning models, with implications for a wide range of applications where accurate uncertainty quantification is crucial.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Posterior and variational inference for deep neural networks with heavy-tailed weights

Ismael Castillo, Paul Egels

We consider deep neural networks in a Bayesian framework with a prior distribution sampling the network weights at random. Following a recent idea of Agapiou and Castillo (2023), who show that heavy-tailed prior distributions achieve automatic adaptation to smoothness, we introduce a simple Bayesian deep learning prior based on heavy-tailed weights and ReLU activation. We show that the corresponding posterior distribution achieves near-optimal minimax contraction rates, simultaneously adaptive to both intrinsic dimension and smoothness of the underlying function, in a variety of contexts including nonparametric regression, geometric data and Besov spaces. While most works so far need a form of model selection built-in within the prior distribution, a key aspect of our approach is that it does not require to sample hyperparameters to learn the architecture of the network. We also provide variational Bayes counterparts of the results, that show that mean-field variational approximations still benefit from near-optimal theoretical support.

6/6/2024

stat.ML cs.LG

Central Limit Theorem for Bayesian Neural Network trained with Variational Inference

Arnaud Descours (MAGNET), Tom Huix (X), Arnaud Guillin (LMBP), Manon Michel (LMBP), Éric Moulines (X), Boris Nectoux (LMBP)

In this paper, we rigorously derive Central Limit Theorems (CLT) for Bayesian two-layer neural networks in the infinite-width limit and trained by variational inference on a regression task. The different networks are trained via different maximization schemes of the regularized evidence lower bound: (i) the idealized case with exact estimation of a multiple Gaussian integral from the reparametrization trick, (ii) a minibatch scheme using Monte Carlo sampling, commonly known as Bayes-by-Backprop, and (iii) a computationally cheaper algorithm named Minimal VI. The latter was recently introduced by leveraging the information obtained at the level of the mean-field limit. Laws of large numbers are already rigorously proven for the three schemes, which admit the same asymptotic limit. By deriving CLT, this work shows that the idealized and Bayes-by-Backprop schemes have similar fluctuation behavior, that is different from the Minimal VI one. Numerical experiments then illustrate that the Minimal VI scheme is still more efficient, in spite of bigger variances, thanks to its important gain in computational complexity.

6/14/2024

stat.ML cs.LG

Scalable Bayesian Inference in the Era of Deep Learning: From Gaussian Processes to Deep Neural Networks

Javier Antoran

Large neural networks trained on large datasets have become the dominant paradigm in machine learning. These systems rely on maximum likelihood point estimates of their parameters, precluding them from expressing model uncertainty. This may result in overconfident predictions and it prevents the use of deep learning models for sequential decision making. This thesis develops scalable methods to equip neural networks with model uncertainty. In particular, we leverage the linearised Laplace approximation to equip pre-trained neural networks with the uncertainty estimates provided by their tangent linear models. This turns the problem of Bayesian inference in neural networks into one of Bayesian inference in conjugate Gaussian-linear models. Alas, the cost of this remains cubic in either the number of network parameters or in the number of observations times output dimensions. By assumption, neither are tractable. We address this intractability by using stochastic gradient descent (SGD) -- the workhorse algorithm of deep learning -- to perform posterior sampling in linear models and their convex duals: Gaussian processes. With this, we turn back to linearised neural networks, finding the linearised Laplace approximation to present a number of incompatibilities with modern deep learning practices -- namely, stochastic optimisation, early stopping and normalisation layers -- when used for hyperparameter learning. We resolve these and construct a sample-based EM algorithm for scalable hyperparameter learning with linearised neural networks. We apply the above methods to perform linearised neural network inference with ResNet-50 (25M parameters) trained on Imagenet (1.2M observations and 1000 output dimensions). Additionally, we apply our methods to estimate uncertainty for 3d tomographic reconstructions obtained with the deep image prior network.

5/1/2024

stat.ML cs.LG

Bayesian Inference with Deep Weakly Nonlinear Networks

Boris Hanin, Alexander Zlokapa

We show at a physics level of rigor that Bayesian inference with a fully connected neural network and a shaped nonlinearity of the form $\phi(t) = t + \psi t^3/L$ is (perturbatively) solvable in the regime where the number of training datapoints $P$, the input dimension $N_0$, the network layer widths $N$, and the network depth $L$ are simultaneously large. Our results hold with weak assumptions on the data; the main constraint is that $P < N_0$. We provide techniques to compute the model evidence and posterior to arbitrary order in $1/N$ and at arbitrary temperature. We report the following results from the first-order computation: 1. When the width $N$ is much larger than the depth $L$ and training set size $P$, neural network Bayesian inference coincides with Bayesian inference using a kernel. The value of $\psi$ determines the curvature of a sphere, hyperbola, or plane into which the training data is implicitly embedded under the feature map. 2. When $LP/N$ is a small constant, neural network Bayesian inference departs from the kernel regime. At zero temperature, neural network Bayesian inference is equivalent to Bayesian inference using a data-dependent kernel, and $LP/N$ serves as an effective depth that controls the extent of feature learning. 3. In the restricted case of deep linear networks ($\psi=0$) and noisy data, we show a simple data model for which evidence and generalization error are optimal at zero temperature. As $LP/N$ increases, both evidence and generalization further improve, demonstrating the benefit of depth in benign overfitting.

5/28/2024

stat.ML cs.AI cs.LG