Central Limit Theorem for Bayesian Neural Network trained with Variational Inference

arXiv:2406.09048

Published 6/14/2024 by Arnaud Descours (MAGNET), Tom Huix (X), Arnaud Guillin (LMBP), Manon Michel (LMBP), Éric Moulines (X), Boris Nectoux (LMBP)

stat.ML cs.LG

Abstract

In this paper, we rigorously derive Central Limit Theorems (CLT) for Bayesian two-layer neural networks in the infinite-width limit and trained by variational inference on a regression task. The different networks are trained via different maximization schemes of the regularized evidence lower bound: (i) the idealized case with exact estimation of a multiple Gaussian integral from the reparametrization trick, (ii) a minibatch scheme using Monte Carlo sampling, commonly known as Bayes-by-Backprop, and (iii) a computationally cheaper algorithm named Minimal VI. The latter was recently introduced by leveraging the information obtained at the level of the mean-field limit. Laws of large numbers are already rigorously proven for the three schemes, which admit the same asymptotic limit. By deriving CLT, this work shows that the idealized and Bayes-by-Backprop schemes have similar fluctuation behavior, which is different from the Minimal VI one. Numerical experiments then illustrate that the Minimal VI scheme is still more efficient, in spite of bigger variances, thanks to its important gain in computational complexity.

Overview

The paper rigorously derives Central Limit Theorems (CLT) for Bayesian two-layer neural networks in the infinite-width limit, trained by variational inference on a regression task.
The networks are trained using different maximization schemes of the regularized evidence lower bound: (i) the idealized case with exact estimation, (ii) a minibatch scheme using Monte Carlo sampling (Bayes-by-Backprop), and (iii) a computationally cheaper algorithm named Minimal VI.
Laws of large numbers have already been proven for the three schemes, but this work shows that the idealized and Bayes-by-Backprop schemes have similar fluctuation behavior, which is different from the Minimal VI scheme.
Numerical experiments indicate that the Minimal VI scheme is more efficient, despite having larger variances, due to its lower computational complexity.

Plain English Explanation

In this paper, the researchers looked at a type of machine learning model called a Bayesian two-layer neural network. These models are trained using a technique called variational inference, which helps the model learn from data in a Bayesian way.

The researchers wanted to understand how the behavior of these models changes as the network gets very wide (has many hidden units). They looked at three different ways of training the models:

(i) The "idealized" case, where the training process can compute a complex mathematical integral exactly.
(ii) A more realistic "minibatch" scheme, known as "Bayes-by-Backprop," which approximates the integral by Monte Carlo sampling.
(iii) A computationally cheaper algorithm called "Minimal VI," which leverages information from the "mean-field limit" (the behavior of the network as the number of hidden units goes to infinity).

The researchers showed that, even though the three training schemes all converge to the same behavior in the limit of infinitely many hidden units, the idealized and Bayes-by-Backprop schemes have similar fluctuations (variations) in their behavior, while the Minimal VI scheme behaves differently.

However, the numerical experiments suggest that the Minimal VI scheme is still more efficient in practice, despite its larger fluctuations, because each training step is cheaper to compute. Here "variance" refers to how much the trained network fluctuates around its infinite-width limit, so Minimal VI trades somewhat larger fluctuations for a substantial reduction in computational cost.
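
To make the notion of fluctuations concrete, here is a minimal sketch in Python. It is not the paper's experiment: it assumes a toy, untrained network whose output is the average of N independent random neurons, and the function names, the tanh activation and the standard normal weights are all illustrative assumptions. It only illustrates the general scaling that a CLT describes: the spread of the output across random networks shrinks like 1/sqrt(N), so sqrt(N) times the spread stays roughly constant.

```python
import math
import random
import statistics


def network_output(x, n_neurons, rng):
    """Average of n_neurons i.i.d. random neurons a * tanh(w * x)."""
    total = 0.0
    for _ in range(n_neurons):
        a = rng.gauss(0.0, 1.0)  # output weight (assumed standard normal)
        w = rng.gauss(0.0, 1.0)  # input weight (assumed standard normal)
        total += a * math.tanh(w * x)
    return total / n_neurons


def fluctuation_std(x, n_neurons, n_repeats, seed=0):
    """Standard deviation of the output across independent random networks."""
    rng = random.Random(seed)
    outputs = [network_output(x, n_neurons, rng) for _ in range(n_repeats)]
    return statistics.pstdev(outputs)


if __name__ == "__main__":
    for n in (50, 200, 800):
        std = fluctuation_std(x=1.0, n_neurons=n, n_repeats=500)
        # The raw spread shrinks with n, while sqrt(n) * std stays of the same order.
        print(n, round(std, 4), round(math.sqrt(n) * std, 4))
```

The paper's results concern networks during training, where the three schemes produce different limiting fluctuations; this toy only shows why the sqrt(N) rescaling is the natural one to study.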

Technical Explanation

The paper derives Central Limit Theorems for Bayesian two-layer neural networks in the infinite-width limit, trained by variational inference on a regression task.

The networks are trained using different maximization schemes of the regularized evidence lower bound:

(i) The idealized case with exact estimation of a multiple Gaussian integral from the reparametrization trick.
(ii) A minibatch scheme using Monte Carlo sampling, commonly known as Bayes-by-Backprop.
(iii) A computationally cheaper algorithm named Minimal VI, recently introduced by leveraging the information obtained at the level of the mean-field limit.
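
The difference between exact estimation of a Gaussian integral and Monte Carlo sampling through the reparametrization trick can be sketched on a toy problem. The example below is an assumption-laden illustration, not the paper's model: it uses a single linear weight w ~ N(m, s^2) and a squared loss, for which the Gaussian integral has a closed form, and all function names are hypothetical. The reparametrization w = m + s * eps, with eps standard normal, is what lets the Monte Carlo estimate be differentiated with respect to (m, s).

```python
import random


def exact_expected_loss(m, s, x, y):
    """Closed form of E[(y - w*x)^2] for w ~ N(m, s^2)."""
    return (y - m * x) ** 2 + (s * x) ** 2


def mc_expected_loss(m, s, x, y, n_samples, rng):
    """Monte Carlo estimate using the reparametrization w = m + s * eps."""
    total = 0.0
    for _ in range(n_samples):
        eps = rng.gauss(0.0, 1.0)  # noise that does not depend on (m, s)
        w = m + s * eps            # reparametrized sample of the weight
        total += (y - w * x) ** 2
    return total / n_samples


def mc_gradient(m, s, x, y, n_samples, rng):
    """Monte Carlo gradient of the expected loss with respect to (m, s)."""
    grad_m, grad_s = 0.0, 0.0
    for _ in range(n_samples):
        eps = rng.gauss(0.0, 1.0)
        residual = y - (m + s * eps) * x
        grad_m += -2.0 * x * residual        # chain rule with dw/dm = 1
        grad_s += -2.0 * x * eps * residual  # chain rule with dw/ds = eps
    return grad_m / n_samples, grad_s / n_samples


if __name__ == "__main__":
    rng = random.Random(0)
    m, s, x, y = 0.5, 0.3, 2.0, 1.5
    print("exact loss:", exact_expected_loss(m, s, x, y))
    print("MC loss   :", mc_expected_loss(m, s, x, y, 20000, rng))
    # Exact gradients for comparison: d/dm = -2x(y - m x), d/ds = 2 s x^2
    print("MC grad   :", mc_gradient(m, s, x, y, 20000, rng))
```

In this toy setting the Monte Carlo estimates are unbiased but noisy, which is the kind of extra randomness a minibatch sampling scheme introduces relative to the idealized one.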

Laws of large numbers have already been rigorously proven for the three schemes, which admit the same asymptotic limit. By deriving CLT, this work shows that the idealized and Bayes-by-Backprop schemes have similar fluctuation behavior, which is different from the Minimal VI one.
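
The relationship between the two kinds of results can be written schematically as follows. The notation is an assumption made for illustration and is not taken from the summary: \theta^i_t denotes the parameters of hidden unit i at (rescaled) training time t, and \bar{\mu}_t denotes the common infinite-width limit.

```latex
% Empirical measure of the N hidden units (assumed notation).
\mu^N_t = \frac{1}{N} \sum_{i=1}^{N} \delta_{\theta^i_t}

% Law of large numbers: a deterministic limit, the same for the three schemes.
\mu^N_t \xrightarrow[N \to \infty]{} \bar{\mu}_t

% Central limit theorem: fluctuations around that limit, rescaled by \sqrt{N}.
\eta^N_t = \sqrt{N} \, \bigl( \mu^N_t - \bar{\mu}_t \bigr)
\xrightarrow[N \to \infty]{} \eta_t
```

In this reading, the law of large numbers cannot distinguish the schemes because they share \bar{\mu}_t, whereas the limiting fluctuation process \eta_t is where Minimal VI differs from the idealized and Bayes-by-Backprop schemes.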

Numerical experiments then illustrate that the Minimal VI scheme is still more efficient, in spite of bigger variances, thanks to its important gain in computational complexity.

Critical Analysis

The paper provides a rigorous theoretical analysis of the behavior of Bayesian neural networks in the infinite-width limit, which is an important problem in the field. The derivation of Central Limit Theorems for the different training schemes offers valuable insights into the fluctuations and convergence properties of these models.

One potential limitation of the research is that it focuses solely on the infinite-width case, which may not fully capture the behavior of realistic neural networks with finite widths. Additionally, the paper does not explore the implications of these theoretical results for the practical performance of Bayesian neural networks on real-world tasks.

Further research could investigate the finite-width case, as well as the impact of the different training schemes on the generalization and robustness of the learned models. It would also be interesting to see how these findings compare to other approaches for training Bayesian neural networks, such as few-sample variational inference.

Overall, this paper provides a valuable contribution to the theoretical understanding of Bayesian neural networks and highlights the importance of carefully considering the impact of training algorithms on the behavior of these models.

Conclusion

This paper presents a rigorous theoretical analysis of Bayesian two-layer neural networks in the infinite-width limit, trained by variational inference on a regression task. The researchers derived Central Limit Theorems for three different training schemes: the idealized case, a minibatch scheme using Bayes-by-Backprop, and a computationally cheaper algorithm called Minimal VI.

The results show that the idealized and Bayes-by-Backprop schemes have similar fluctuation behavior, which is different from the Minimal VI scheme. However, numerical experiments indicate that the Minimal VI scheme is more efficient in practice, despite its larger variances, due to its lower computational complexity.

These findings offer important insights into the theoretical properties of Bayesian neural networks and could have implications for the design of more efficient training algorithms for these models, which are widely used in machine learning and artificial intelligence applications.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Posterior Inference on Shallow Infinitely Wide Bayesian Neural Networks under Weights with Unbounded Variance

Jorge Loría, Anindya Bhadra

From the classical and influential works of Neal (1996), it is known that the infinite width scaling limit of a Bayesian neural network with one hidden layer is a Gaussian process, when the network weights have bounded prior variance. Neal's result has been extended to networks with multiple hidden layers and to convolutional neural networks, also with Gaussian process scaling limits. The tractable properties of Gaussian processes then allow straightforward posterior inference and uncertainty quantification, considerably simplifying the study of the limit process compared to a network of finite width. Neural network weights with unbounded variance, however, pose unique challenges. In this case, the classical central limit theorem breaks down and it is well known that the scaling limit is an $\alpha$-stable process under suitable conditions. However, current literature is primarily limited to forward simulations under these processes and the problem of posterior inference under such a scaling limit remains largely unaddressed, unlike in the Gaussian process case. To this end, our contribution is an interpretable and computationally efficient procedure for posterior inference, using a conditionally Gaussian representation, that then allows full use of the Gaussian process machinery for tractable posterior inference and uncertainty quantification in the non-Gaussian regime.

6/6/2024

stat.ML cs.LG

Quantitative CLTs in Deep Neural Networks

Stefano Favaro, Boris Hanin, Domenico Marinucci, Ivan Nourdin, Giovanni Peccati

We study the distribution of a fully connected neural network with random Gaussian weights and biases in which the hidden layer widths are proportional to a large constant $n$. Under mild assumptions on the non-linearity, we obtain quantitative bounds on normal approximations valid at large but finite $n$ and any fixed network depth. Our theorems show both for the finite-dimensional distributions and the entire process, that the distance between a random fully connected network (and its derivatives) to the corresponding infinite width Gaussian process scales like $n^{-\gamma}$ for $\gamma>0$, with the exponent depending on the metric used to measure discrepancy. Our bounds are strictly stronger in terms of their dependence on network width than any previously available in the literature; in the one-dimensional case, we also prove that they are optimal, i.e., we establish matching lower bounds.

6/18/2024

cs.LG cs.AI stat.ML

Posterior and variational inference for deep neural networks with heavy-tailed weights

Ismael Castillo, Paul Egels

We consider deep neural networks in a Bayesian framework with a prior distribution sampling the network weights at random. Following a recent idea of Agapiou and Castillo (2023), who show that heavy-tailed prior distributions achieve automatic adaptation to smoothness, we introduce a simple Bayesian deep learning prior based on heavy-tailed weights and ReLU activation. We show that the corresponding posterior distribution achieves near-optimal minimax contraction rates, simultaneously adaptive to both intrinsic dimension and smoothness of the underlying function, in a variety of contexts including nonparametric regression, geometric data and Besov spaces. While most works so far need a form of model selection built-in within the prior distribution, a key aspect of our approach is that it does not require to sample hyperparameters to learn the architecture of the network. We also provide variational Bayes counterparts of the results, that show that mean-field variational approximations still benefit from near-optimal theoretical support.

6/6/2024

stat.ML cs.LG

Bayesian Inference with Deep Weakly Nonlinear Networks

Boris Hanin, Alexander Zlokapa

We show at a physics level of rigor that Bayesian inference with a fully connected neural network and a shaped nonlinearity of the form $\phi(t) = t + \psi t^3/L$ is (perturbatively) solvable in the regime where the number of training datapoints $P$, the input dimension $N_0$, the network layer widths $N$, and the network depth $L$ are simultaneously large. Our results hold with weak assumptions on the data; the main constraint is that $P < N_0$. We provide techniques to compute the model evidence and posterior to arbitrary order in $1/N$ and at arbitrary temperature. We report the following results from the first-order computation: 1. When the width $N$ is much larger than the depth $L$ and training set size $P$, neural network Bayesian inference coincides with Bayesian inference using a kernel. The value of $\psi$ determines the curvature of a sphere, hyperbola, or plane into which the training data is implicitly embedded under the feature map. 2. When $LP/N$ is a small constant, neural network Bayesian inference departs from the kernel regime. At zero temperature, neural network Bayesian inference is equivalent to Bayesian inference using a data-dependent kernel, and $LP/N$ serves as an effective depth that controls the extent of feature learning. 3. In the restricted case of deep linear networks ($\psi=0$) and noisy data, we show a simple data model for which evidence and generalization error are optimal at zero temperature. As $LP/N$ increases, both evidence and generalization further improve, demonstrating the benefit of depth in benign overfitting.

5/28/2024

stat.ML cs.AI cs.LG