Posterior and variational inference for deep neural networks with heavy-tailed weights

2406.03369

Published 6/6/2024 by Ismael Castillo, Paul Egels

Abstract

We consider deep neural networks in a Bayesian framework with a prior distribution sampling the network weights at random. Following a recent idea of Agapiou and Castillo (2023), who show that heavy-tailed prior distributions achieve automatic adaptation to smoothness, we introduce a simple Bayesian deep learning prior based on heavy-tailed weights and ReLU activation. We show that the corresponding posterior distribution achieves near-optimal minimax contraction rates, simultaneously adaptive to both intrinsic dimension and smoothness of the underlying function, in a variety of contexts including nonparametric regression, geometric data and Besov spaces. While most works so far need a form of model selection built-in within the prior distribution, a key aspect of our approach is that it does not require to sample hyperparameters to learn the architecture of the network. We also provide variational Bayes counterparts of the results, that show that mean-field variational approximations still benefit from near-optimal theoretical support.

Overview

This paper studies Bayesian deep neural networks with ReLU activation whose weights are drawn from a heavy-tailed prior distribution.
The authors show that the resulting posterior distribution achieves near-optimal minimax contraction rates, adapting simultaneously to the smoothness and the intrinsic dimension of the underlying function.
Unlike most earlier work, the prior does not need to sample hyperparameters to learn the network architecture, and mean-field variational approximations are shown to retain near-optimal theoretical support.

Plain English Explanation

Deep neural networks are powerful machine learning models that can excel at a wide range of tasks, from image recognition to language processing. In a Bayesian treatment of these models, the network weights are given a prior distribution, and a common default choice is a Gaussian (normal) distribution.

A heavy-tailed distribution assigns far more probability to very large values than a Gaussian distribution does. This matters because, following Agapiou and Castillo (2023), heavy-tailed priors adapt automatically to the smoothness of the function being learned.
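To make "heavy-tailed" concrete, the sketch below compares how much probability a standard Gaussian and a standard Cauchy distribution place beyond a threshold t. The Cauchy distribution is chosen here only as a familiar heavy-tailed example; it is an assumption for illustration and not necessarily the prior used in the paper.

```python
import math


def gaussian_tail(t: float) -> float:
    """P(|X| > t) for X ~ N(0, 1), via the complementary error function."""
    return math.erfc(t / math.sqrt(2.0))


def cauchy_tail(t: float) -> float:
    """P(|X| > t) for a standard Cauchy X (illustrative heavy-tailed law)."""
    return 1.0 - (2.0 / math.pi) * math.atan(t)


# The Gaussian tail decays exponentially fast, the Cauchy tail only like 1/t.
for t in (1.0, 3.0, 10.0):
    print(f"t={t:>4}: Gaussian {gaussian_tail(t):.2e}, Cauchy {cauchy_tail(t):.2e}")
```

As t grows, the gap between the two columns widens by many orders of magnitude, which is the sense in which large values remain plausible under a heavy-tailed law.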

This paper explores Bayesian inference techniques that can learn neural network weights with non-Gaussian, heavy-tailed prior distributions. Bayesian inference is a powerful framework for reasoning about uncertainty in machine learning models.

The authors investigate two objects: the posterior distribution and variational approximations to it. The posterior is the exact distribution of the weights given the data, while variational inference approximates the posterior using a simpler distribution, here a mean-field one.
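In symbols, the two objects can be written as follows. This is the standard textbook formulation; the notation ($\theta$ for the weights, $D$ for the data, $\pi$ for the prior, $\mathcal{Q}$ for the variational family) is introduced here for illustration and is not taken from the paper.

```latex
% Posterior: prior reweighted by the likelihood (Bayes' formula).
\Pi(\theta \mid D) \;\propto\; p(D \mid \theta)\,\pi(\theta)

% Variational approximation: the member of a simpler family Q
% that is closest to the posterior in Kullback-Leibler divergence.
\hat{q} \;=\; \operatorname*{arg\,min}_{q \in \mathcal{Q}}
  \mathrm{KL}\bigl(q \,\big\|\, \Pi(\cdot \mid D)\bigr)
```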

The goal is to show that a simple prior can come with strong theoretical guarantees: the network adapts to the unknown smoothness and intrinsic dimension of the underlying function without any model selection step built into the prior.

Technical Explanation

The paper focuses on Bayesian deep neural networks with ReLU activation whose weights are sampled at random from a heavy-tailed prior distribution. The authors study two objects: the posterior distribution and its variational Bayes approximation.
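As a rough illustration of what a heavy-tailed prior on a ReLU network looks like, the sketch below draws every weight and bias independently from a Student-t distribution and evaluates the sampled network at one input. The Student-t law, the degrees of freedom, the layer widths and the absence of any weight scaling are all assumptions made for this example; the paper's exact prior is not specified in the abstract.

```python
import math
import random


def sample_student_t(df: float, rng: random.Random) -> float:
    """Draw one Student-t variate as Z / sqrt(V / df), with V ~ chi-squared(df)."""
    z = rng.gauss(0.0, 1.0)
    v = rng.gammavariate(df / 2.0, 2.0)  # chi-squared(df) = Gamma(shape=df/2, scale=2)
    return z / math.sqrt(v / df)


def sample_network(widths, df, rng):
    """Draw all weights and biases i.i.d. from the (assumed) heavy-tailed prior."""
    layers = []
    for n_in, n_out in zip(widths[:-1], widths[1:]):
        weights = [[sample_student_t(df, rng) for _ in range(n_in)] for _ in range(n_out)]
        biases = [sample_student_t(df, rng) for _ in range(n_out)]
        layers.append((weights, biases))
    return layers


def forward(layers, x):
    """Evaluate the network at x: ReLU on hidden layers, linear output layer."""
    h = list(x)
    for i, (weights, biases) in enumerate(layers):
        h = [sum(w * v for w, v in zip(row, h)) + b for row, b in zip(weights, biases)]
        if i < len(layers) - 1:
            h = [max(0.0, v) for v in h]
    return h


rng = random.Random(0)
net = sample_network([2, 8, 8, 1], df=3.0, rng=rng)  # hypothetical architecture
print(forward(net, [0.5, -1.0]))
```

Each call to `sample_network` yields one random function drawn from the prior; the posterior reweights such functions according to how well they explain the data.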

For posterior inference, the authors show that the posterior distribution achieves near-optimal minimax contraction rates, simultaneously adaptive to both the intrinsic dimension and the smoothness of the underlying function. These results cover a variety of contexts, including nonparametric regression, geometric data and Besov spaces.

For variational inference, the authors provide variational Bayes counterparts of these results. They show that mean-field variational approximations, which are simpler than the full posterior, still benefit from near-optimal theoretical support.
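For reference, a mean-field variational family in its usual form restricts the approximating distribution to a product over the individual weights. The notation below ($\theta_1, \dots, \theta_T$ for the $T$ network weights) is assumed for illustration; the specific family used in the paper may differ in its details.

```latex
% Mean-field family: the approximation factorizes over coordinates,
% so the weights are independent under q.
\mathcal{Q}_{\mathrm{MF}} \;=\;
  \Bigl\{\, q : \; q(\theta) = \prod_{j=1}^{T} q_j(\theta_j) \,\Bigr\}
```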

The contribution is theoretical. While most works so far need a form of model selection built into the prior distribution, a key aspect of this approach is that it does not require sampling hyperparameters to learn the architecture of the network. This highlights the potential benefits of moving beyond the Gaussian assumption for neural network weights.
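The phrase "near-optimal minimax contraction rates" from the abstract can be made concrete with the classical benchmark for nonparametric regression. The display below is a sketch under standard assumptions ($n$ observations, a true function $f_0$ of $d$ variables with smoothness $\beta$, a large constant $M$); the paper's exact rates, norms and logarithmic factors are not given in the abstract.

```latex
% Classical minimax rate for estimating a beta-smooth function of d variables.
\varepsilon_n \;=\; n^{-\beta/(2\beta + d)}

% Posterior contraction at rate eps_n: the posterior mass outside a ball
% of radius M * eps_n around the truth vanishes as n grows.
\Pi\bigl( f : \| f - f_0 \| > M \varepsilon_n \,\big|\, \text{data} \bigr)
  \;\longrightarrow\; 0
  \quad \text{in probability as } n \to \infty
```

"Near-optimal" usually means this rate is attained up to a logarithmic factor, and "adaptive" means the prior does not need to know $\beta$. Adaptation to intrinsic dimension means that $d$ is replaced by a smaller effective dimension when the function has low-dimensional structure.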

Deep learning meets nonparametric regression: are weight space Gaussian processes really necessary for deep learning? discusses related work on relaxing the Gaussian assumption for neural network weights.

Critical Analysis

The paper presents a theoretical investigation of Bayesian inference for deep neural networks with heavy-tailed weight distributions. Its main strength is that adaptation to both smoothness and intrinsic dimension comes from a simple prior, without sampling hyperparameters to learn the architecture.

One limitation is that the results summarized here are theoretical guarantees rather than empirical evaluations. It would be valuable to see how the proposed prior performs on larger, more complex neural network architectures and real-world applications.

Additionally, the computational cost of working with the exact posterior of a deep network may be a concern for practical deployment. Mean-field variational approximations are typically more scalable, and the paper shows that they retain near-optimal guarantees, but further work may be needed to assess their accuracy in practice.

Structured partial stochasticity in Bayesian neural networks explores alternative ways of introducing structured stochasticity in Bayesian neural networks, which could be a complementary direction to the heavy-tailed priors studied in this paper.

Overall, this paper makes an important contribution to the growing body of research on Bayesian deep learning, highlighting the potential benefits of moving beyond the Gaussian assumption for neural network weights.

Conclusion

This paper investigates Bayesian inference for deep neural networks with heavy-tailed weight distributions. The authors show that a simple heavy-tailed prior with ReLU activation yields a posterior with near-optimal minimax contraction rates, adaptive to both the intrinsic dimension and the smoothness of the underlying function.

The posterior and mean-field variational results suggest that heavy-tailed priors may be a valuable addition to the Bayesian deep learning toolkit, since they avoid built-in model selection over the architecture. While further work is needed to assess these approaches on larger networks and real-world applications, this paper represents an important step forward in the theory of uncertainty-aware neural network models.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Posterior Inference on Shallow Infinitely Wide Bayesian Neural Networks under Weights with Unbounded Variance

Jorge Loría, Anindya Bhadra

From the classical and influential works of Neal (1996), it is known that the infinite width scaling limit of a Bayesian neural network with one hidden layer is a Gaussian process, when the network weights have bounded prior variance. Neal's result has been extended to networks with multiple hidden layers and to convolutional neural networks, also with Gaussian process scaling limits. The tractable properties of Gaussian processes then allow straightforward posterior inference and uncertainty quantification, considerably simplifying the study of the limit process compared to a network of finite width. Neural network weights with unbounded variance, however, pose unique challenges. In this case, the classical central limit theorem breaks down and it is well known that the scaling limit is an $\alpha$-stable process under suitable conditions. However, current literature is primarily limited to forward simulations under these processes and the problem of posterior inference under such a scaling limit remains largely unaddressed, unlike in the Gaussian process case. To this end, our contribution is an interpretable and computationally efficient procedure for posterior inference, using a conditionally Gaussian representation, that then allows full use of the Gaussian process machinery for tractable posterior inference and uncertainty quantification in the non-Gaussian regime.

6/6/2024

stat.ML cs.LG

Regularized KL-Divergence for Well-Defined Function-Space Variational Inference in Bayesian neural networks

Tristan Cinquin, Robert Bamler

Bayesian neural networks (BNN) promise to combine the predictive performance of neural networks with principled uncertainty modeling important for safety-critical systems and decision making. However, posterior uncertainty estimates depend on the choice of prior, and finding informative priors in weight-space has proven difficult. This has motivated variational inference (VI) methods that pose priors directly on the function generated by the BNN rather than on weights. In this paper, we address a fundamental issue with such function-space VI approaches pointed out by Burt et al. (2020), who showed that the objective function (ELBO) is negative infinite for most priors of interest. Our solution builds on generalized VI (Knoblauch et al., 2019) with the regularized KL divergence (Quang, 2019) and is, to the best of our knowledge, the first well-defined variational objective for function-space inference in BNNs with Gaussian process (GP) priors. Experiments show that our method incorporates the properties specified by the GP prior on synthetic and small real-world data sets, and provides competitive uncertainty estimates for regression, classification and out-of-distribution detection compared to BNN baselines with both function and weight-space priors.

6/7/2024

cs.LG stat.ML

Few-sample Variational Inference of Bayesian Neural Networks with Arbitrary Nonlinearities

David J. Schodt

Bayesian Neural Networks (BNNs) extend traditional neural networks to provide uncertainties associated with their outputs. On the forward pass through a BNN, predictions (and their uncertainties) are made either by Monte Carlo sampling network weights from the learned posterior or by analytically propagating statistical moments through the network. Though flexible, Monte Carlo sampling is computationally expensive and can be infeasible or impractical under resource constraints or for large networks. While moment propagation can ameliorate the computational costs of BNN inference, it can be difficult or impossible for networks with arbitrary nonlinearities, thereby restricting the possible set of network layers permitted with such a scheme. In this work, we demonstrate a simple yet effective approach for propagating statistical moments through arbitrary nonlinearities with only 3 deterministic samples, enabling few-sample variational inference of BNNs without restricting the set of network layers used. Furthermore, we leverage this approach to demonstrate a novel nonlinear activation function that we use to inject physics-informed prior information into output nodes of a BNN.

5/22/2024

cs.LG

Variational inference, Mixture of Gaussians, Bayesian Machine Learning

Tom Huix, Anna Korba, Alain Durmus, Eric Moulines

Variational inference (VI) is a popular approach in Bayesian inference, that looks for the best approximation of the posterior distribution within a parametric family, minimizing a loss that is typically the (reverse) Kullback-Leibler (KL) divergence. Despite its empirical success, the theoretical properties of VI have only received attention recently, and mostly when the parametric family is the one of Gaussians. This work aims to contribute to the theoretical study of VI in the non-Gaussian case by investigating the setting of Mixture of Gaussians with fixed covariance and constant weights. In this view, VI over this specific family can be cast as the minimization of a Mollified relative entropy, i.e. the KL between the convolution (with respect to a Gaussian kernel) of an atomic measure supported on Diracs, and the target distribution. The support of the atomic measure corresponds to the localization of the Gaussian components. Hence, solving variational inference becomes equivalent to optimizing the positions of the Diracs (the particles), which can be done through gradient descent and takes the form of an interacting particle system. We study two sources of error of variational inference in this context when optimizing the mollified relative entropy. The first one is an optimization result, that is a descent lemma establishing that the algorithm decreases the objective at each iteration. The second one is an approximation error, that upper bounds the objective between an optimal finite mixture and the target distribution.

6/11/2024

stat.ML cs.LG