Scalable Bayesian Inference in the Era of Deep Learning: From Gaussian Processes to Deep Neural Networks

arXiv:2404.19157

Published 5/1/2024 by Javier Antorán

Abstract

Large neural networks trained on large datasets have become the dominant paradigm in machine learning. These systems rely on maximum likelihood point estimates of their parameters, precluding them from expressing model uncertainty. This may result in overconfident predictions and it prevents the use of deep learning models for sequential decision making. This thesis develops scalable methods to equip neural networks with model uncertainty. In particular, we leverage the linearised Laplace approximation to equip pre-trained neural networks with the uncertainty estimates provided by their tangent linear models. This turns the problem of Bayesian inference in neural networks into one of Bayesian inference in conjugate Gaussian-linear models. Alas, the cost of this remains cubic in either the number of network parameters or in the number of observations times output dimensions. By assumption, neither is tractable. We address this intractability by using stochastic gradient descent (SGD) -- the workhorse algorithm of deep learning -- to perform posterior sampling in linear models and their convex duals: Gaussian processes. With this, we turn back to linearised neural networks, finding the linearised Laplace approximation to present a number of incompatibilities with modern deep learning practices -- namely, stochastic optimisation, early stopping and normalisation layers -- when used for hyperparameter learning. We resolve these and construct a sample-based EM algorithm for scalable hyperparameter learning with linearised neural networks. We apply the above methods to perform linearised neural network inference with ResNet-50 (25M parameters) trained on ImageNet (1.2M observations and 1000 output dimensions). Additionally, we apply our methods to estimate uncertainty for 3D tomographic reconstructions obtained with the deep image prior network.

Overview

Large neural networks trained on big datasets have become the dominant approach in machine learning.
These systems rely on point estimates of their parameters, which means they cannot express model uncertainty.
This can lead to overconfident predictions and prevents the use of deep learning models for sequential decision-making.
This research develops scalable methods to equip neural networks with model uncertainty estimates.

Plain English Explanation

Neural networks have become the go-to tool for many machine learning tasks, as they can learn incredibly complex patterns from large datasets. However, these neural networks have a significant limitation: they only provide a single "best guess" for their predictions, without any sense of how confident they are in that guess.

This lack of uncertainty quantification can lead to problems. For example, if a neural network is very confident in a prediction that turns out to be wrong, a system acting on that prediction could make poor decisions, especially in applications like sequential decision-making, where each choice depends on earlier predictions. Ideally, neural networks should be able to express their level of confidence in their outputs.

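This research tackles this challenge by developing new techniques to equip neural networks with model uncertainty estimates. The key idea is to leverage the linearized Laplace approximation, which replaces an already-trained neural network with a model that is linear in its parameters (the network's first-order expansion around the trained weights). Uncertainty in this simpler model can be computed with standard Bayesian tools and then used as the uncertainty estimate for the network's predictions.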

The researchers show how to apply this approach to large neural networks such as ResNet-50 (25M parameters) trained on ImageNet, and demonstrate its use for estimating uncertainty in 3D tomographic reconstructions obtained with the deep image prior network.

Technical Explanation

The core of this research is the development of scalable methods to estimate model uncertainty in large neural networks. Traditionally, neural networks are trained using maximum likelihood, which results in a single "point estimate" of the model parameters, without any sense of the uncertainty in those estimates.

The researchers address this by leveraging the linearized Laplace approximation, which equips a pre-trained neural network with the uncertainty estimates of its tangent linear model, that is, the first-order expansion of the network around its trained parameters. This turns Bayesian inference in the network into Bayesian inference in a conjugate Gaussian-linear model.
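To make the tangent linear model concrete, here is a minimal sketch on a toy two-parameter network. Everything in it is an assumption made for illustration: the network `w2 * tanh(w1 * x)`, the "pre-trained" weights `THETA_HAT`, the Gaussian likelihood with variance `noise_var`, and the isotropic Gaussian prior with precision `prior_precision`. It is not the thesis's implementation, only an instance of the general idea: linearise around the trained weights, then do conjugate Gaussian inference in the resulting linear model.

```python
import math

# Hypothetical "pre-trained" weights for a toy network f(x) = w2 * tanh(w1 * x).
THETA_HAT = (1.5, 0.8)


def network(theta, x):
    w1, w2 = theta
    return w2 * math.tanh(w1 * x)


def jacobian(theta, x):
    """Gradient of the network output with respect to (w1, w2) at input x."""
    w1, w2 = theta
    t = math.tanh(w1 * x)
    return (w2 * x * (1.0 - t * t), t)


def linearised(theta, x, theta_hat=THETA_HAT):
    """Tangent linear model: first-order expansion of the network around theta_hat."""
    j = jacobian(theta_hat, x)
    shift = sum(ji * (ti - hi) for ji, ti, hi in zip(j, theta, theta_hat))
    return network(theta_hat, x) + shift


def posterior_covariance(xs, noise_var, prior_precision, theta_hat=THETA_HAT):
    """Invert A = prior_precision * I + sum_i J_i J_i^T / noise_var (2x2 case)."""
    a11 = a22 = prior_precision
    a12 = 0.0
    for x in xs:
        j1, j2 = jacobian(theta_hat, x)
        a11 += j1 * j1 / noise_var
        a12 += j1 * j2 / noise_var
        a22 += j2 * j2 / noise_var
    det = a11 * a22 - a12 * a12
    return ((a22 / det, -a12 / det), (-a12 / det, a11 / det))


def predictive_variance(x, cov, theta_hat=THETA_HAT):
    """Model variance of the linearised prediction at x: J^T S J."""
    j1, j2 = jacobian(theta_hat, x)
    return (j1 * (cov[0][0] * j1 + cov[0][1] * j2)
            + j2 * (cov[1][0] * j1 + cov[1][1] * j2))


train_xs = [-1.0, -0.5, 0.0, 0.5, 1.0]
cov = posterior_covariance(train_xs, noise_var=0.1, prior_precision=1.0)
for x in (0.0, 0.5, 3.0):
    print(x, network(THETA_HAT, x), predictive_variance(x, cov))
```

With only two parameters the matrix inverse is trivial. For a network with millions of parameters the same inverse is what makes exact inference impractical.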

However, performing Bayesian inference in these Gaussian-linear models is still computationally expensive: the cost scales cubically with either the number of model parameters or the number of observations times the number of output dimensions. To address this intractability, the researchers use stochastic gradient descent (SGD) to perform posterior sampling in the linear models and their convex duals, Gaussian processes.
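One way to see how an optimiser can produce posterior samples is the standard "sample-then-optimise" construction for Bayesian linear regression: perturb the targets with noise, draw a point from the prior, and minimise the resulting regularised least-squares objective. The minimiser is then distributed as a draw from the posterior. The sketch below applies this to a one-weight model with minibatch SGD. The objective, step size, batch size, and tail averaging are illustrative assumptions and are not taken from the thesis, which may use a different objective.

```python
import random


def make_data(n, true_w, noise_std, rng):
    xs = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    ys = [true_w * x + rng.gauss(0.0, noise_std) for x in xs]
    return xs, ys


def exact_posterior(xs, ys, noise_var, prior_precision):
    """Closed-form posterior mean and variance for y = w * x + noise."""
    precision = prior_precision + sum(x * x for x in xs) / noise_var
    mean = sum(x * y for x, y in zip(xs, ys)) / noise_var / precision
    return mean, 1.0 / precision


def perturb(ys, noise_var, prior_precision, rng):
    """Draw the prior sample and noisy targets that define one perturbed objective."""
    w0 = rng.gauss(0.0, prior_precision ** -0.5)
    targets = [y + rng.gauss(0.0, noise_var ** 0.5) for y in ys]
    return w0, targets


def perturbed_minimiser(xs, targets, w0, noise_var, prior_precision):
    """Exact minimiser of the perturbed objective; it is one posterior sample."""
    precision = prior_precision + sum(x * x for x in xs) / noise_var
    data_term = sum(x * t for x, t in zip(xs, targets)) / noise_var
    return (data_term + prior_precision * w0) / precision


def sgd_minimiser(xs, targets, w0, noise_var, prior_precision, rng,
                  steps=6000, batch=20, lr_scale=0.005):
    """Minibatch SGD on
    sum_i (t_i - w x_i)^2 / (2 noise_var) + prior_precision * (w - w0)^2 / 2.
    """
    n = len(xs)
    curvature = prior_precision + sum(x * x for x in xs) / noise_var
    lr = lr_scale / curvature  # small step relative to the curvature
    w, tail_sum, tail_count = w0, 0.0, 0
    for step in range(steps):
        idx = [rng.randrange(n) for _ in range(batch)]
        # Unbiased minibatch estimate of the data-term gradient.
        data_grad = -(n / batch) * sum(
            xs[i] * (targets[i] - w * xs[i]) for i in idx) / noise_var
        w -= lr * (data_grad + prior_precision * (w - w0))
        if step >= steps // 2:  # average the second half to damp SGD noise
            tail_sum += w
            tail_count += 1
    return tail_sum / tail_count


rng = random.Random(0)
xs, ys = make_data(n=200, true_w=2.0, noise_std=0.5, rng=rng)
w0, targets = perturb(ys, noise_var=0.25, prior_precision=1.0, rng=rng)
print("exact minimiser:", perturbed_minimiser(xs, targets, w0, 0.25, 1.0))
print("SGD estimate:   ", sgd_minimiser(xs, targets, w0, 0.25, 1.0, rng))
print("posterior (mean, variance):", exact_posterior(xs, ys, 0.25, 1.0))
```

Each new perturbation gives a new sample, and no matrix is ever formed or inverted. That is the property that matters when the parameter count or dataset size rules out a cubic-cost solve.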

The researchers also find that the linearized Laplace approximation, when used for hyperparameter learning, is incompatible with several modern deep learning practices: stochastic optimization, early stopping, and normalization layers. They resolve these incompatibilities and construct a sample-based EM algorithm for scalable hyperparameter learning with linearized neural networks.
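As a rough illustration of what a sample-based EM loop looks like, the sketch below learns the prior precision of a one-weight Bayesian linear model. The E-step draws posterior samples under the current hyperparameter. The M-step uses the textbook EM update for a Gaussian prior, precision = d / E[||w||^2], with the expectation replaced by a Monte Carlo average over those samples. This is an assumption-laden toy: the samples here come from the exact posterior, whereas at scale they would have to come from an approximate sampler, and the thesis's actual updates may differ.

```python
import random


def posterior(xs, ys, noise_var, prior_precision):
    """Posterior mean and variance of w for y = w * x + noise, w ~ N(0, 1/prior_precision)."""
    precision = prior_precision + sum(x * x for x in xs) / noise_var
    mean = sum(x * y for x, y in zip(xs, ys)) / noise_var / precision
    return mean, 1.0 / precision


def em_prior_precision(xs, ys, noise_var, rng,
                       iterations=10, num_samples=2000, init=1.0):
    """Sample-based EM for the prior precision (d = 1 weight)."""
    alpha = init
    for _ in range(iterations):
        # E-step: draw posterior samples under the current prior precision.
        mean, var = posterior(xs, ys, noise_var, alpha)
        samples = [rng.gauss(mean, var ** 0.5) for _ in range(num_samples)]
        # M-step: alpha = d / E[w^2], with a Monte Carlo estimate of E[w^2].
        alpha = 1.0 / (sum(w * w for w in samples) / num_samples)
    return alpha


def exact_em_prior_precision(xs, ys, noise_var, iterations=10, init=1.0):
    """Reference EM using the closed-form expectation E[w^2] = mean^2 + variance."""
    alpha = init
    for _ in range(iterations):
        mean, var = posterior(xs, ys, noise_var, alpha)
        alpha = 1.0 / (mean * mean + var)
    return alpha


rng = random.Random(0)
xs = [rng.uniform(-1.0, 1.0) for _ in range(200)]
ys = [2.0 * x + rng.gauss(0.0, 0.5) for x in xs]
print("sample-based EM:", em_prior_precision(xs, ys, 0.25, rng))
print("exact EM:       ", exact_em_prior_precision(xs, ys, 0.25))
```

The sample-based update only needs posterior draws, never the posterior covariance itself, which is why this style of EM pairs naturally with a sampler that avoids matrix inversion.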

Critical Analysis

The researchers present a novel and intriguing approach to quantifying uncertainty in large neural networks, which is an important problem in the field. By leveraging the linearized Laplace approximation, they are able to convert neural networks into simpler Gaussian-linear models that can be subjected to Bayesian inference.

However, the researchers acknowledge that their approach still faces computational challenges, as exact Bayesian inference in the linearized model remains cubic in cost. Their SGD-based posterior sampling addresses this, and the abstract reports results for ResNet-50 (25M parameters) on ImageNet (1.2M observations and 1000 output dimensions), but it's unclear how well the approach scales to networks and datasets beyond that size.

Additionally, the researchers identify several incompatibilities between the linearized Laplace approximation and modern deep learning practices, such as stochastic optimization and normalization layers. While they develop solutions to these issues, it's possible that there are other deep learning techniques that are not well-suited to this approach.

Overall, this research represents an important step forward in equipping neural networks with model uncertainty estimates, but there are still significant challenges to overcome before this approach can be broadly adopted. Researchers and practitioners should continue to explore alternative methods for uncertainty quantification in deep learning, such as other approximate-inference schemes for Bayesian neural networks and ensemble techniques.

Conclusion

This research tackles the important problem of quantifying model uncertainty in large neural networks, which is crucial for applications like sequential decision-making. By leveraging the linearized Laplace approximation, the researchers are able to convert neural networks into simpler Gaussian-linear models that can be subjected to Bayesian inference.

While the approach faces some computational challenges and compatibility issues with modern deep learning practices, the researchers present a novel and promising direction for incorporating uncertainty estimates into powerful neural network models. Continued research in this area could lead to significant advancements in the robustness and reliability of deep learning systems, with far-reaching implications for a wide range of applications.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Deep Learning and genetic algorithms for cosmological Bayesian inference speed-up

Isidro Gómez-Vargas, J. Alberto Vázquez

In this paper, we present a novel approach to accelerate the Bayesian inference process, focusing specifically on the nested sampling algorithms. Bayesian inference plays a crucial role in cosmological parameter estimation, providing a robust framework for extracting theoretical insights from observational data. However, its computational demands can be substantial, primarily due to the need for numerous likelihood function evaluations. Our proposed method utilizes the power of deep learning, employing feedforward neural networks to approximate the likelihood function dynamically during the Bayesian inference process. Unlike traditional approaches, our method trains neural networks on-the-fly using the current set of live points as training data, without the need for pre-training. This flexibility enables adaptation to various theoretical models and datasets. We perform simple hyperparameter optimization using genetic algorithms to suggest initial neural network architectures for learning each likelihood function. Once sufficient accuracy is achieved, the neural network replaces the original likelihood function. The implementation integrates with nested sampling algorithms and has been thoroughly evaluated using both simple cosmological dark energy models and diverse observational datasets. Additionally, we explore the potential of genetic algorithms for generating initial live points within nested sampling inference, opening up new avenues for enhancing the efficiency and effectiveness of Bayesian inference methods.

5/7/2024

cs.LG cs.NE stat.ML

Posterior Inference on Shallow Infinitely Wide Bayesian Neural Networks under Weights with Unbounded Variance

Jorge Loría, Anindya Bhadra

From the classical and influential works of Neal (1996), it is known that the infinite width scaling limit of a Bayesian neural network with one hidden layer is a Gaussian process, when the network weights have bounded prior variance. Neal's result has been extended to networks with multiple hidden layers and to convolutional neural networks, also with Gaussian process scaling limits. The tractable properties of Gaussian processes then allow straightforward posterior inference and uncertainty quantification, considerably simplifying the study of the limit process compared to a network of finite width. Neural network weights with unbounded variance, however, pose unique challenges. In this case, the classical central limit theorem breaks down and it is well known that the scaling limit is an $\alpha$-stable process under suitable conditions. However, current literature is primarily limited to forward simulations under these processes and the problem of posterior inference under such a scaling limit remains largely unaddressed, unlike in the Gaussian process case. To this end, our contribution is an interpretable and computationally efficient procedure for posterior inference, using a conditionally Gaussian representation, that then allows full use of the Gaussian process machinery for tractable posterior inference and uncertainty quantification in the non-Gaussian regime.

6/6/2024

stat.ML cs.LG

Scalable Bayesian Learning with posteriors

Samuel Duffield, Kaelan Donatella, Johnathan Chiu, Phoebe Klett, Daniel Simpson

Although theoretically compelling, Bayesian learning with modern machine learning models is computationally challenging since it requires approximating a high dimensional posterior distribution. In this work, we (i) introduce posteriors, an easily extensible PyTorch library hosting general-purpose implementations making Bayesian learning accessible and scalable to large data and parameter regimes; (ii) present a tempered framing of stochastic gradient Markov chain Monte Carlo, as implemented in posteriors, that transitions seamlessly into optimization and unveils a minor modification to deep ensembles to ensure they are asymptotically unbiased for the Bayesian posterior, and (iii) demonstrate and compare the utility of Bayesian approximations through experiments including an investigation into the cold posterior effect and applications with large language models.

6/4/2024

cs.LG stat.ML

Scalable Subsampling Inference for Deep Neural Networks

Kejin Wu, Dimitris N. Politis

Deep neural networks (DNN) have received increasing attention in machine learning applications in the last several years. Recently, a non-asymptotic error bound has been developed to measure the performance of the fully connected DNN estimator with ReLU activation functions for estimating regression models. The paper at hand gives a small improvement on the current error bound based on the latest results on the approximation ability of DNN. More importantly, however, a non-random subsampling technique, scalable subsampling, is applied to construct a 'subagged' DNN estimator. Under regularity conditions, it is shown that the subagged DNN estimator is computationally efficient without sacrificing accuracy for either estimation or prediction tasks. Beyond point estimation/prediction, we propose different approaches to build confidence and prediction intervals based on the subagged DNN estimator. In addition to being asymptotically valid, the proposed confidence/prediction intervals appear to work well in finite samples. All in all, the scalable subsampling DNN estimator offers the complete package in terms of statistical inference, i.e., (a) computational efficiency; (b) point estimation/prediction accuracy; and (c) allowing for the construction of practically useful confidence and prediction intervals.

5/15/2024

stat.ML cs.LG