Bayesian Inference with Deep Weakly Nonlinear Networks

2405.16630

Published 5/28/2024 by Boris Hanin, Alexander Zlokapa

Abstract

We show at a physics level of rigor that Bayesian inference with a fully connected neural network and a shaped nonlinearity of the form $\phi(t) = t + \psi t^3/L$ is (perturbatively) solvable in the regime where the number of training datapoints $P$, the input dimension $N_0$, the network layer widths $N$, and the network depth $L$ are simultaneously large. Our results hold with weak assumptions on the data; the main constraint is that $P < N_0$. We provide techniques to compute the model evidence and posterior to arbitrary order in $1/N$ and at arbitrary temperature. We report the following results from the first-order computation:

When the width $N$ is much larger than the depth $L$ and training set size $P$, neural network Bayesian inference coincides with Bayesian inference using a kernel. The value of $\psi$ determines the curvature of a sphere, hyperbola, or plane into which the training data is implicitly embedded under the feature map.
When $LP/N$ is a small constant, neural network Bayesian inference departs from the kernel regime. At zero temperature, neural network Bayesian inference is equivalent to Bayesian inference using a data-dependent kernel, and $LP/N$ serves as an effective depth that controls the extent of feature learning.
In the restricted case of deep linear networks ($\psi=0$) and noisy data, we show a simple data model for which evidence and generalization error are optimal at zero temperature. As $LP/N$ increases, both evidence and generalization further improve, demonstrating the benefit of depth in benign overfitting.
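
To make "Bayesian inference using a kernel" concrete, here is a minimal sketch of the posterior mean of kernel (Gaussian process) regression. It is a generic illustration, not the kernel derived in the paper: the linear kernel, the function names, and the choice to treat the temperature as an observation-noise variance are all assumptions of this sketch.

```python
import numpy as np


def linear_kernel(A, B):
    """Placeholder kernel (an assumption): inner products scaled by input dimension."""
    return A @ B.T / A.shape[1]


def kernel_posterior_mean(X_train, y_train, X_test, temperature=1e-6,
                          kernel=linear_kernel):
    """Posterior mean of kernel regression.

    X_train: (P, N_0) training inputs, y_train: (P,) targets,
    X_test: (M, N_0) test inputs. `temperature` is used here as the
    noise variance added to the kernel matrix diagonal (an assumption).
    """
    K = kernel(X_train, X_train)        # (P, P) train/train kernel matrix
    K_star = kernel(X_test, X_train)    # (M, P) test/train kernel matrix
    # Solve (K + T * I) alpha = y instead of forming an explicit inverse.
    alpha = np.linalg.solve(K + temperature * np.eye(len(X_train)), y_train)
    return K_star @ alpha
```

With $P < N_0$ and generic data, the kernel matrix is invertible, so as the temperature approaches zero the posterior mean interpolates the training targets. A data-dependent kernel, as in the second result above, would replace `linear_kernel` with a kernel that itself depends on the training set.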

Overview

This paper studies Bayesian inference with deep, fully connected neural networks whose activation functions are weakly nonlinear.
According to the abstract, inference in this setting is perturbatively solvable when the training set size, input dimension, layer width, and depth are all large.
The authors report theoretical, first-order results covering the kernel regime, feature learning, and deep linear networks.

Plain English Explanation

The paper presents a new technique for performing Bayesian inference, which is the process of updating our beliefs about unknown quantities based on observed data. Traditionally, Bayesian inference can be computationally intensive, especially when dealing with complex models like deep neural networks.
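
In symbols, this updating step is Bayes' rule. The statement below is the standard textbook form rather than anything specific to this paper; $\theta$ (the unknown quantities, here the network weights) and $D$ (the observed data) are notation assumed for this sketch.

$$p(\theta \mid D) = \frac{p(D \mid \theta)\, p(\theta)}{p(D)}, \qquad p(D) = \int p(D \mid \theta)\, p(\theta)\, d\theta$$

The denominator $p(D)$ is the model evidence mentioned in the abstract. The integral runs over all weights, which is what makes it expensive to compute for deep networks.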

The key idea behind this research is to use a specific type of deep neural network architecture - one with "weakly nonlinear" activation functions. This means the neural network has a relatively simple, smooth nonlinear behavior, rather than highly complex or discontinuous nonlinearities.

By using this type of network, the authors show that Bayesian inference can be made more scalable and effective. The neural network can capture the underlying patterns in the data, while the Bayesian framework allows for principled uncertainty quantification and robust decision-making. This approach could be particularly useful in applications like Bayesian reasoning in physics-informed neural networks or Bayesian inference for consistent predictions in overparameterized nonlinear regression, where handling uncertainty is crucial.

Technical Explanation

The authors propose a novel framework for Bayesian inference using deep neural networks with weakly nonlinear activation functions. Traditionally, applying Bayesian inference to deep learning models has been challenging due to the high complexity and nonlinearity of neural networks, which can make the inference process computationally intensive.

To address this, the authors consider fully connected networks with a shaped, weakly nonlinear activation of the form $\phi(t) = t + \psi t^3/L$, where $L$ is the network depth. The cubic correction shrinks as the depth grows, so each layer stays close to linear, which is what makes Bayesian inference perturbatively solvable in this setting.
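
The shaped nonlinearity stated in the abstract, $\phi(t) = t + \psi t^3/L$, can be sketched in a few lines. Only the formula for $\phi$ comes from the source; the function names, the Gaussian weight initialization, and the $1/\sqrt{\text{fan-in}}$ scaling are assumptions made for illustration.

```python
import numpy as np


def shaped_activation(t, psi, depth):
    """phi(t) = t + psi * t**3 / depth, the shaped nonlinearity from the abstract."""
    return t + psi * t**3 / depth


def init_weights(n_in, width, depth, seed=0):
    """Standard Gaussian weights for `depth` hidden layers plus a scalar readout
    (the initialization scheme is an assumption of this sketch)."""
    rng = np.random.default_rng(seed)
    dims = [n_in] + [width] * depth + [1]
    return [rng.standard_normal((dims[i + 1], dims[i])) for i in range(depth + 1)]


def forward(x, weights, psi):
    """Forward pass of a fully connected network with the shaped activation.

    Pre-activations are scaled by 1/sqrt(fan_in), an assumed convention.
    """
    depth = len(weights) - 1  # number of hidden layers, L
    h = x
    for W in weights[:-1]:
        h = shaped_activation(W @ h / np.sqrt(W.shape[1]), psi, depth)
    W_out = weights[-1]
    return W_out @ h / np.sqrt(W_out.shape[1])  # linear readout
```

Setting `psi=0` reduces the model to a deep linear network, the restricted case discussed in the abstract. For nonzero `psi`, each layer deviates from the identity by a term of order $1/L$.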

The authors develop a theoretical framework for this setting and provide techniques to compute the model evidence and posterior to arbitrary order in $1/N$ and at arbitrary temperature. Related lines of work include scalable Bayesian inference in the era of deep learning and few-sample variational inference for Bayesian neural networks.

Critical Analysis

The paper presents a promising approach for improving the scalability and effectiveness of Bayesian inference in deep learning. The use of weakly nonlinear activation functions is an interesting and well-motivated choice, as it allows for a balance between expressive power and computational tractability.

One potential limitation, as mentioned in the paper, is that the method may not be as effective for modeling highly complex or discontinuous nonlinearities. Additionally, the authors note that the theoretical analysis is focused on the function approximation properties of the networks, and further research may be needed to fully understand the statistical properties and convergence guarantees of the Bayesian inference process.

It would also be valuable to see the method applied to a broad range of real-world applications to further validate its practical utility. For example, the approach could be evaluated in the context of probabilistic survival analysis by approximate Bayesian inference, where accurate uncertainty quantification is crucial.

Overall, this paper represents an important contribution to the field of Bayesian deep learning and provides a solid foundation for further research and development in this area.

Conclusion

This paper introduces a novel framework for Bayesian inference using deep neural networks with weakly nonlinear activation functions. The authors demonstrate that by leveraging this specific type of neural network architecture, Bayesian inference can be made more scalable and effective, without sacrificing the expressive power of deep learning models.

The theoretical analysis presented in the paper suggests that this approach could advance the state of the art in Bayesian deep learning, with implications for applications that require robust uncertainty quantification and decision-making. Further research and real-world deployments will be needed to realize the benefits of this technique.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Scalable Bayesian Inference in the Era of Deep Learning: From Gaussian Processes to Deep Neural Networks

Javier Antoran

Large neural networks trained on large datasets have become the dominant paradigm in machine learning. These systems rely on maximum likelihood point estimates of their parameters, precluding them from expressing model uncertainty. This may result in overconfident predictions and it prevents the use of deep learning models for sequential decision making. This thesis develops scalable methods to equip neural networks with model uncertainty. In particular, we leverage the linearised Laplace approximation to equip pre-trained neural networks with the uncertainty estimates provided by their tangent linear models. This turns the problem of Bayesian inference in neural networks into one of Bayesian inference in conjugate Gaussian-linear models. Alas, the cost of this remains cubic in either the number of network parameters or in the number of observations times output dimensions. By assumption, neither are tractable. We address this intractability by using stochastic gradient descent (SGD) -- the workhorse algorithm of deep learning -- to perform posterior sampling in linear models and their convex duals: Gaussian processes. With this, we turn back to linearised neural networks, finding the linearised Laplace approximation to present a number of incompatibilities with modern deep learning practices -- namely, stochastic optimisation, early stopping and normalisation layers -- when used for hyperparameter learning. We resolve these and construct a sample-based EM algorithm for scalable hyperparameter learning with linearised neural networks. We apply the above methods to perform linearised neural network inference with ResNet-50 (25M parameters) trained on Imagenet (1.2M observations and 1000 output dimensions). Additionally, we apply our methods to estimate uncertainty for 3d tomographic reconstructions obtained with the deep image prior network.

5/1/2024

stat.ML cs.LG

Posterior Inference on Shallow Infinitely Wide Bayesian Neural Networks under Weights with Unbounded Variance

Jorge Loría, Anindya Bhadra

From the classical and influential works of Neal (1996), it is known that the infinite width scaling limit of a Bayesian neural network with one hidden layer is a Gaussian process, when the network weights have bounded prior variance. Neal's result has been extended to networks with multiple hidden layers and to convolutional neural networks, also with Gaussian process scaling limits. The tractable properties of Gaussian processes then allow straightforward posterior inference and uncertainty quantification, considerably simplifying the study of the limit process compared to a network of finite width. Neural network weights with unbounded variance, however, pose unique challenges. In this case, the classical central limit theorem breaks down and it is well known that the scaling limit is an $\alpha$-stable process under suitable conditions. However, current literature is primarily limited to forward simulations under these processes and the problem of posterior inference under such a scaling limit remains largely unaddressed, unlike in the Gaussian process case. To this end, our contribution is an interpretable and computationally efficient procedure for posterior inference, using a conditionally Gaussian representation, that then allows full use of the Gaussian process machinery for tractable posterior inference and uncertainty quantification in the non-Gaussian regime.

6/6/2024

stat.ML cs.LG

Bayesian Reasoning for Physics Informed Neural Networks

Krzysztof M. Graczyk, Kornel Witkowski

We present the application of the physics-informed neural network (PINN) approach in Bayesian formulation. We have adopted the Bayesian neural network framework to obtain posterior densities from Laplace approximation. For each model or fit, the evidence is computed, which is a measure that classifies the hypothesis. The optimal solution is the one with the highest value of evidence. We have proposed a modification of the Bayesian algorithm to obtain hyperparameters of the model. We have shown that within the Bayesian framework, one can obtain the relative weights between the boundary and equation contributions to the total loss. The presented method leads to predictions comparable to those obtained by sampling from the posterior distribution within the Hybrid Monte Carlo algorithm (HMC). We have solved heat, wave, and Burgers' equations, and the results obtained are in agreement with the exact solutions, demonstrating the effectiveness of our approach. In the Burgers' equation problem, we have demonstrated that the framework can combine information from differential equations and potential measurements. All solutions are provided with uncertainties (induced by the model's parameter dependence) computed within the Bayesian framework.

4/30/2024

cs.LG stat.ML

Posterior and variational inference for deep neural networks with heavy-tailed weights

Ismael Castillo, Paul Egels

We consider deep neural networks in a Bayesian framework with a prior distribution sampling the network weights at random. Following a recent idea of Agapiou and Castillo (2023), who show that heavy-tailed prior distributions achieve automatic adaptation to smoothness, we introduce a simple Bayesian deep learning prior based on heavy-tailed weights and ReLU activation. We show that the corresponding posterior distribution achieves near-optimal minimax contraction rates, simultaneously adaptive to both intrinsic dimension and smoothness of the underlying function, in a variety of contexts including nonparametric regression, geometric data and Besov spaces. While most works so far need a form of model selection built-in within the prior distribution, a key aspect of our approach is that it does not require to sample hyperparameters to learn the architecture of the network. We also provide variational Bayes counterparts of the results, that show that mean-field variational approximations still benefit from near-optimal theoretical support.

6/6/2024

stat.ML cs.LG