Variational Linearized Laplace Approximation for Bayesian Deep Learning

2302.12565

Published 5/24/2024 by Luis A. Ortega, Simón Rodríguez Santana, Daniel Hernández-Lobato

Abstract

The Linearized Laplace Approximation (LLA) has been recently used to perform uncertainty estimation on the predictions of pre-trained deep neural networks (DNNs). However, its widespread application is hindered by significant computational costs, particularly in scenarios with a large number of training points or DNN parameters. Consequently, additional approximations of LLA, such as Kronecker-factored or diagonal approximate GGN matrices, are utilized, potentially compromising the model's performance. To address these challenges, we propose a new method for approximating LLA using a variational sparse Gaussian Process (GP). Our method is based on the dual RKHS formulation of GPs and retains, as the predictive mean, the output of the original DNN. Furthermore, it allows for efficient stochastic optimization, which results in sub-linear training time in the size of the training dataset. Specifically, its training cost is independent of the number of training points. We compare our proposed method against accelerated LLA (ELLA), which relies on the Nyström approximation, as well as other LLA variants employing the sample-then-optimize principle. Experimental results, both on regression and classification datasets, show that our method outperforms these already existing efficient variants of LLA, both in terms of the quality of the predictive distribution and in terms of total computational time.

Overview

The paper proposes a new method for approximating the Linearized Laplace Approximation (LLA), which is used to estimate uncertainty in predictions of deep neural networks (DNNs).
LLA is computationally expensive, especially for large datasets or models, so the authors introduce a variational sparse Gaussian Process (GP) approach that retains the DNN's predictive mean while enabling efficient stochastic optimization.
The new method is compared to other efficient LLA variants and shown to outperform them in terms of predictive distribution quality and computational time.

Plain English Explanation

The paper addresses a problem with a technique called the Linearized Laplace Approximation (LLA). LLA is used to estimate the uncertainty in the predictions made by deep neural networks (DNNs).

The issue is that LLA can be very computationally expensive, especially when you have a lot of training data or a large DNN model. To try to speed things up, researchers have developed some approximations of LLA, like Kronecker-factored or diagonal approximate GGN matrices. However, these approximations can compromise the model's performance.
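To make the trade-off concrete, here is a small sketch of what a diagonal approximation gives up. It is not taken from the paper: the Jacobian `J`, the noise and prior variances, and the test direction `j_test` are all made-up stand-ins, chosen so that the parameters are strongly correlated.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 200, 5

# Hypothetical stand-in for a network Jacobian: every column shares a common
# component, so the parameters are strongly correlated with each other.
base = rng.normal(size=(n, 1))
J = base + 0.3 * rng.normal(size=(n, p))
noise_var, prior_var = 1.0, 1.0  # assumed hyperparameters

# Full GGN-based posterior precision: p x p entries to store and invert.
full_precision = J.T @ J / noise_var + np.eye(p) / prior_var
full_cov = np.linalg.inv(full_precision)

# Diagonal approximation: keep only p numbers and invert them one by one.
diag_cov = np.diag(1.0 / np.diag(full_precision))

# Predictive variance along a direction that depends on the correlations
# the diagonal approximation throws away.
j_test = np.array([1.0, -1.0, 0.0, 0.0, 0.0])
var_full = j_test @ full_cov @ j_test
var_diag = j_test @ diag_cov @ j_test

print(f"full GGN variance:     {var_full:.4f}")
print(f"diagonal GGN variance: {var_diag:.4f}")
```

In this toy setting the diagonal version is much cheaper but reports a noticeably smaller variance, because it ignores how the parameters co-vary. That is the kind of quality loss the paragraph above refers to.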

The authors of this paper propose a new method to approximate LLA using something called a variational sparse Gaussian Process (GP). Their approach keeps the DNN's original prediction as the mean, but uses the GP to efficiently estimate the uncertainty around that prediction. Because the method can be trained on small random batches of data (stochastic optimization), the cost of each training step does not depend on the size of the dataset.

The authors compare their new method to other efficient LLA variants, like the one that uses the Nyström approximation (ELLA). Their experiments show that their method outperforms these other approaches in terms of the quality of the predicted uncertainty and the overall computation time.

Technical Explanation

The paper introduces a new method for approximating the Linearized Laplace Approximation (LLA), which is used to estimate uncertainty in the predictions of pre-trained deep neural networks (DNNs). LLA is computationally expensive, especially in scenarios with large datasets or many DNN parameters.
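As a rough illustration of what plain LLA computes, and of where its cost comes from, the sketch below linearizes a tiny network around fixed weights and builds the GGN-based Gaussian posterior for a regression problem. Everything in it is an assumption made for illustration: the network `mlp`, the finite-difference `jacobian` helper, the random `theta_star` standing in for pre-trained weights, and the values of `noise_var` and `prior_var`. It is not the paper's code.

```python
import numpy as np

def mlp(theta, x):
    """Tiny one-hidden-layer network with scalar output; theta is a flat vector of 10 weights."""
    w1, b1, w2, b2 = theta[0:3], theta[3:6], theta[6:9], theta[9]
    return np.tanh(np.outer(x, w1) + b1) @ w2 + b2

def jacobian(theta, x, eps=1e-6):
    """Central finite-difference Jacobian of the outputs w.r.t. theta, shape (n, p)."""
    J = np.zeros((x.shape[0], theta.shape[0]))
    for j in range(theta.shape[0]):
        step = np.zeros_like(theta)
        step[j] = eps
        J[:, j] = (mlp(theta + step, x) - mlp(theta - step, x)) / (2 * eps)
    return J

rng = np.random.default_rng(0)
theta_star = rng.normal(size=10)      # stand-in for pre-trained (MAP) weights
x_train = np.linspace(-2.0, 2.0, 50)
noise_var, prior_var = 0.1, 1.0       # assumed hyperparameters

# Gaussian posterior over weights under the linearized model:
# precision = GGN + prior precision, with GGN = J^T J / noise_var for regression.
J_train = jacobian(theta_star, x_train)
precision = J_train.T @ J_train / noise_var + np.eye(theta_star.shape[0]) / prior_var
cov = np.linalg.inv(precision)        # cubic in the number of parameters

x_test = np.array([0.0, 5.0])
J_test = jacobian(theta_star, x_test)
pred_mean = mlp(theta_star, x_test)   # the predictive mean is the network output itself
pred_var = np.einsum("np,pq,nq->n", J_test, cov, J_test)  # diag of J cov J^T

print(pred_mean, pred_var)
```

Building `precision` touches every training point, and inverting it scales cubically with the number of parameters. Both are harmless for 10 weights and 50 points, and both are what becomes prohibitive at the scales the paragraph above mentions.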

To address this, the authors propose a variational sparse Gaussian Process (GP) approach. Their method retains the DNN's original predictive mean, but uses the dual RKHS formulation of GPs to efficiently estimate the uncertainty around that prediction. This allows for stochastic optimization, resulting in training time that is independent of the number of training points.
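The independence from the number of training points comes from a standard property of variational sparse GP objectives: the data-fit term is a sum over training points, so it can be estimated without bias from a random minibatch. The sketch below shows only that generic mechanism for a Gaussian likelihood, not the paper's actual objective. The KL term is omitted, and `mean`, `var`, `noise_var` and the batch size are made-up stand-ins.

```python
import numpy as np

def expected_log_lik(y, mean, var, noise_var):
    """E_{f ~ N(mean, var)}[log N(y | f, noise_var)], evaluated per data point."""
    return -0.5 * np.log(2 * np.pi * noise_var) - ((y - mean) ** 2 + var) / (2 * noise_var)

def minibatch_estimate(idx, y, mean, var, noise_var):
    """Unbiased estimate of the full-data sum, using only the points in idx."""
    n = y.shape[0]
    batch_sum = expected_log_lik(y[idx], mean[idx], var[idx], noise_var).sum()
    return n / idx.shape[0] * batch_sum

rng = np.random.default_rng(0)
n, batch_size, noise_var = 1000, 50, 0.1
mean = rng.normal(size=n)                # stand-in for the fixed DNN outputs
var = rng.uniform(0.01, 0.2, size=n)     # stand-in for GP marginal variances
y = mean + rng.normal(scale=np.sqrt(noise_var), size=n)

# Full data-fit term: touches all n points.
full_sum = expected_log_lik(y, mean, var, noise_var).sum()

# Each minibatch estimate touches only batch_size points.
batches = rng.permutation(n).reshape(-1, batch_size)   # 20 disjoint minibatches
estimates = np.array([minibatch_estimate(b, y, mean, var, noise_var) for b in batches])

print(full_sum, estimates.mean(), estimates.std())
```

Averaged over one pass through the data, the scaled minibatch estimates recover the full sum, while each individual step costs the same whether the dataset has a thousand points or a million.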

The authors compare their proposed method, referred to in this summary as the Variational Sparse Gaussian Process (VSGP) approximation of LLA (the paper's own name for it is VaLLA), to other efficient LLA variants. These include the accelerated LLA (ELLA) approach, which relies on the Nyström approximation, as well as other LLA methods that employ the sample-then-optimize principle.
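For readers unfamiliar with the Nyström approximation behind the ELLA baseline, the following is a minimal generic sketch: a large kernel matrix is replaced by a low-rank one built from a small set of landmark points. The RBF kernel, the evenly spaced `landmarks`, and the jitter value are assumptions for illustration only and say nothing about ELLA's actual kernel or implementation.

```python
import numpy as np

def rbf_kernel(a, b, lengthscale=1.0):
    """Squared-exponential kernel between two sets of 1-D inputs."""
    sq_dist = (a[:, None] - b[None, :]) ** 2
    return np.exp(-0.5 * sq_dist / lengthscale ** 2)

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(-3.0, 3.0, size=300))   # n = 300 inputs
landmarks = np.linspace(-3.0, 3.0, 20)          # m = 20 landmark points

K = rbf_kernel(x, x)                            # exact n x n kernel matrix
K_nm = rbf_kernel(x, landmarks)                 # n x m cross-kernel
K_mm = rbf_kernel(landmarks, landmarks) + 1e-8 * np.eye(landmarks.shape[0])  # jitter for stability

# Nystrom approximation: K is approximated by K_nm K_mm^{-1} K_nm^T (rank at most m).
K_nystrom = K_nm @ np.linalg.solve(K_mm, K_nm.T)

rel_error = np.linalg.norm(K - K_nystrom) / np.linalg.norm(K)
print(f"relative Frobenius error: {rel_error:.2e}")
```

Only the n x m and m x m blocks have to be computed, which is where the savings come from. How good the approximation is depends on how well the landmarks cover the inputs.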

Experimental results on both regression and classification datasets show that the VSGP method outperforms these existing efficient LLA approximations in terms of the quality of the predictive distribution and the total computational time.

Critical Analysis

The paper presents a novel and efficient approach for approximating the Linearized Laplace Approximation (LLA) to estimate uncertainty in deep neural network (DNN) predictions. The key innovation is the use of a variational sparse Gaussian Process (GP) model, which retains the DNN's predictive mean while enabling efficient stochastic optimization.

One potential limitation is that the authors only evaluate their method on standard regression and classification tasks. It would be interesting to see how the VSGP approach performs on more complex DNN architectures or in domains with structured outputs.

Additionally, the authors do not provide much insight into the failure modes or limitations of their VSGP approach. It would be helpful to understand the types of DNN models or datasets where the method may struggle, as well as any potential trade-offs between computational efficiency and the quality of the uncertainty estimates.

Overall, the paper makes a valuable contribution by introducing an efficient LLA approximation technique that outperforms existing efficient variants in the reported experiments. How the VSGP approach could be extended or applied to other problems remains an open question.

Conclusion

The paper proposes a new method for approximating the Linearized Laplace Approximation (LLA), which is used to estimate uncertainty in the predictions of deep neural networks (DNNs). The authors introduce a variational sparse Gaussian Process (GP) approach that retains the DNN's original predictive mean while enabling efficient stochastic optimization.

Experiments show that the proposed VSGP method outperforms other efficient LLA variants in terms of predictive distribution quality and computational time. This is a significant advancement, as the high computational cost of LLA has hindered its widespread adoption, particularly for large-scale DNN models or datasets.

The VSGP approach represents an important step towards making uncertainty estimation more accessible and practical for deep learning practitioners. By providing an efficient yet accurate approximation of LLA, this research could have far-reaching implications for improving the robustness and trustworthiness of DNN models in a variety of real-world applications.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Generalized Laplace Approximation

Yinsong Chen, Samson S. Yu, Zhong Li, Chee Peng Lim

In recent years, the inconsistency in Bayesian deep learning has garnered increasing attention. Tempered or generalized posterior distributions often offer a direct and effective solution to this issue. However, understanding the underlying causes and evaluating the effectiveness of generalized posteriors remain active areas of research. In this study, we introduce a unified theoretical framework to attribute Bayesian inconsistency to model misspecification and inadequate priors. We interpret the generalization of the posterior with a temperature factor as a correction for misspecified models through adjustments to the joint probability model, and the recalibration of priors by redistributing probability mass on models within the hypothesis space using data samples. Additionally, we highlight a distinctive feature of Laplace approximation, which ensures that the generalized normalizing constant can be treated as invariant, unlike the typical scenario in general Bayesian learning where this constant varies with model parameters post-generalization. Building on this insight, we propose the generalized Laplace approximation, which involves a simple adjustment to the computation of the Hessian matrix of the regularized loss function. This method offers a flexible and scalable framework for obtaining high-quality posterior distributions. We assess the performance and properties of the generalized Laplace approximation on state-of-the-art neural networks and real-world datasets.

5/27/2024

cs.LG stat.ML

Scalable Bayesian Inference in the Era of Deep Learning: From Gaussian Processes to Deep Neural Networks

Javier Antorán

Large neural networks trained on large datasets have become the dominant paradigm in machine learning. These systems rely on maximum likelihood point estimates of their parameters, precluding them from expressing model uncertainty. This may result in overconfident predictions and it prevents the use of deep learning models for sequential decision making. This thesis develops scalable methods to equip neural networks with model uncertainty. In particular, we leverage the linearised Laplace approximation to equip pre-trained neural networks with the uncertainty estimates provided by their tangent linear models. This turns the problem of Bayesian inference in neural networks into one of Bayesian inference in conjugate Gaussian-linear models. Alas, the cost of this remains cubic in either the number of network parameters or in the number of observations times output dimensions. By assumption, neither are tractable. We address this intractability by using stochastic gradient descent (SGD) -- the workhorse algorithm of deep learning -- to perform posterior sampling in linear models and their convex duals: Gaussian processes. With this, we turn back to linearised neural networks, finding the linearised Laplace approximation to present a number of incompatibilities with modern deep learning practices -- namely, stochastic optimisation, early stopping and normalisation layers -- when used for hyperparameter learning. We resolve these and construct a sample-based EM algorithm for scalable hyperparameter learning with linearised neural networks. We apply the above methods to perform linearised neural network inference with ResNet-50 (25M parameters) trained on Imagenet (1.2M observations and 1000 output dimensions). Additionally, we apply our methods to estimate uncertainty for 3d tomographic reconstructions obtained with the deep image prior network.

5/1/2024

stat.ML cs.LG

Contraction rates for conjugate gradient and Lanczos approximate posteriors in Gaussian process regression

Bernhard Stankewitz, Botond Szabo

Due to their flexibility and theoretical tractability, Gaussian process (GP) regression models have become a central topic in modern statistics and machine learning. While the true posterior in these models is given explicitly, numerical evaluations depend on the inversion of the augmented kernel matrix $K + \sigma^2 I$, which requires up to $O(n^3)$ operations. For large sample sizes n, which are typically given in modern applications, this is computationally infeasible and necessitates the use of an approximate version of the posterior. Although such methods are widely used in practice, they typically have very limited theoretical underpinning. In this context, we analyze a class of recently proposed approximation algorithms from the field of probabilistic numerics. They can be interpreted in terms of Lanczos approximate eigenvectors of the kernel matrix or a conjugate gradient approximation of the posterior mean, which are particularly advantageous in truly large scale applications, as they are fundamentally only based on matrix vector multiplications amenable to the GPU acceleration of modern software frameworks. We combine results from the numerical analysis literature with state of the art concentration results for spectra of kernel matrices to obtain minimax contraction rates. Our theoretical findings are illustrated by numerical experiments.

6/19/2024

stat.ML cs.LG

Linearization Turns Neural Operators into Function-Valued Gaussian Processes

Emilia Magnani, Marvin Pförtner, Tobias Weber, Philipp Hennig

Modeling dynamical systems, e.g. in climate and engineering sciences, often necessitates solving partial differential equations. Neural operators are deep neural networks designed to learn nontrivial solution operators of such differential equations from data. As for all statistical models, the predictions of these models are imperfect and exhibit errors. Such errors are particularly difficult to spot in the complex nonlinear behaviour of dynamical systems. We introduce a new framework for approximate Bayesian uncertainty quantification in neural operators using function-valued Gaussian processes. Our approach can be interpreted as a probabilistic analogue of the concept of currying from functional programming and provides a practical yet theoretically sound way to apply the linearized Laplace approximation to neural operators. In a case study on Fourier neural operators, we show that, even for a discretized input, our method yields a Gaussian closure--a structured Gaussian process posterior capturing the uncertainty in the output function of the neural operator, which can be evaluated at an arbitrary set of points. The method adds minimal prediction overhead, can be applied post-hoc without retraining the neural operator, and scales to large models and datasets. We showcase the efficacy of our approach through applications to different types of partial differential equations.

6/10/2024

cs.LG stat.ML