Variational Inference for Uncertainty Quantification: an Analysis of Trade-offs

2403.13748

Published 6/10/2024 by Charles C. Margossian, Loucas Pillaud-Vivien, Lawrence K. Saul

Variational Inference for Uncertainty Quantification: an Analysis of Trade-offs

Abstract

Given an intractable distribution $p$, the problem of variational inference (VI) is to find the best approximation from some more tractable family $Q$. Commonly, one chooses $Q$ to be a family of factorized distributions (i.e., the mean-field assumption), even though~$p$ itself does not factorize. We show that this mismatch leads to an impossibility theorem: if $p$ does not factorize, then any factorized approximation $qin Q$ can correctly estimate at most one of the following three measures of uncertainty: (i) the marginal variances, (ii) the marginal precisions, or (iii) the generalized variance (which can be related to the entropy). In practice, the best variational approximation in $Q$ is found by minimizing some divergence $D(q,p)$ between distributions, and so we ask: how does the choice of divergence determine which measure of uncertainty, if any, is correctly estimated by VI? We consider the classic Kullback-Leibler divergences, the more general R'enyi divergences, and a score-based divergence which compares $nabla log p$ and $nabla log q$. We provide a thorough theoretical analysis in the setting where $p$ is a Gaussian and $q$ is a (factorized) Gaussian. We show that all the considered divergences can be textit{ordered} based on the estimates of uncertainty they yield as objective functions for~VI. Finally, we empirically evaluate the validity of this ordering when the target distribution $p$ is not Gaussian.

Create account to get full access

Overview

• This paper explores an ordering of divergences for variational inference with factorized Gaussian approximations. • It presents an impossibility theorem showing that there is no universal best divergence for this setting. • The paper analyzes the tradeoff between variance and entropy in variational inference, and proposes a family of divergences that provide a spectrum of variance-entropy behaviors.

Plain English Explanation

This research paper is about a mathematical problem in the field of machine learning called variational inference. Variational inference is a technique used to approximate complex probability distributions, which is important for building accurate statistical models.

The key idea is that the researchers looked at different ways to measure the "distance" between the true probability distribution and the approximate distribution used in variational inference. They call these different ways of measuring distance "divergences." The researchers showed that there is no single "best" divergence that works well in all situations - there is an inherent tradeoff between the variance (or uncertainty) of the approximation and the entropy (or complexity) of the approximation.

To illustrate this tradeoff, the researchers proposed a family of divergences that allow you to control this variance-entropy balance. Depending on your specific modeling needs, you can choose a divergence from this family that strikes the right balance for your problem.

The significance of this work is that it provides a deeper theoretical understanding of the fundamental limitations of variational inference, which is a widely used technique in machine learning and statistics. By characterizing this variance-entropy tradeoff, the researchers give practitioners a principled way to select the right divergence for their particular application.

Technical Explanation

This paper examines the problem of variational inference with factorized Gaussian (FG) approximations. The authors present an impossibility theorem showing that there is no universally optimal divergence measure for this setting.

The core of the paper is an analysis of the variance-entropy tradeoff in FG-VI. The authors show that different divergence measures lead to different compromises between the variance of the approximate posterior and its entropy (complexity).

To explore this tradeoff, the authors propose a family of divergences that interpolate between extremes like the KL-divergence and squared Hellinger distance. This family allows practitioners to choose a divergence that best fits the requirements of their specific modeling problem, whether that prioritizes low variance or high entropy.

The theoretical insights provided in this work contribute to a deeper understanding of the fundamental limitations of variational inference. By characterizing the variance-entropy tradeoff, the authors give guidance to researchers and practitioners on how to select an appropriate divergence measure for their application.

Critical Analysis

The key contribution of this paper is the impossibility theorem showing that there is no universally optimal divergence for FG-VI. This result highlights the inherent tension between variance and entropy in variational approximations, which is an important consideration for practitioners.

One caveat is that the analysis is limited to factorized Gaussian approximations, whereas many real-world applications may require more flexible variational families. Extending this framework to more complex variational distributions could be an area for future research.

Additionally, the proposed family of divergences, while useful, still represents a restricted set of options. Exploring a wider range of divergences, or developing new divergence measures tailored to specific modeling needs, may lead to further insights.

Overall, this work makes a valuable theoretical contribution by characterizing the fundamental limitations of variational inference. The findings encourage practitioners to think carefully about the tradeoffs involved in selecting a divergence measure, rather than defaulting to standard choices like KL-divergence. Maintaining a critical perspective on the strengths and weaknesses of variational methods is crucial as they continue to be widely adopted in machine learning and statistics.

Conclusion

This paper presents an in-depth analysis of the tradeoffs involved in variational inference with factorized Gaussian approximations. The key insight is that there is no universally optimal divergence measure, as different choices lead to different compromises between the variance and entropy of the approximate posterior.

By proposing a family of divergences that span this variance-entropy spectrum, the authors give practitioners a principled way to select the most appropriate divergence for their specific modeling requirements. This work contributes to a deeper theoretical understanding of the limitations of variational inference, which is an important step towards improving the reliability and robustness of these powerful machine learning techniques.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

🤔

Variational inference, Mixture of Gaussians, Bayesian Machine Learning

Tom Huix, Anna Korba, Alain Durmus, Eric Moulines

Variational inference (VI) is a popular approach in Bayesian inference, that looks for the best approximation of the posterior distribution within a parametric family, minimizing a loss that is typically the (reverse) Kullback-Leibler (KL) divergence. Despite its empirical success, the theoretical properties of VI have only received attention recently, and mostly when the parametric family is the one of Gaussians. This work aims to contribute to the theoretical study of VI in the non-Gaussian case by investigating the setting of Mixture of Gaussians with fixed covariance and constant weights. In this view, VI over this specific family can be casted as the minimization of a Mollified relative entropy, i.e. the KL between the convolution (with respect to a Gaussian kernel) of an atomic measure supported on Diracs, and the target distribution. The support of the atomic measure corresponds to the localization of the Gaussian components. Hence, solving variational inference becomes equivalent to optimizing the positions of the Diracs (the particles), which can be done through gradient descent and takes the form of an interacting particle system. We study two sources of error of variational inference in this context when optimizing the mollified relative entropy. The first one is an optimization result, that is a descent lemma establishing that the algorithm decreases the objective at each iteration. The second one is an approximation error, that upper bounds the objective between an optimal finite mixture and the target distribution.

6/11/2024

stat.ML cs.LG

New!Particle Semi-Implicit Variational Inference

Jen Ning Lim, Adam M. Johansen

Semi-implicit variational inference (SIVI) enriches the expressiveness of variational families by utilizing a kernel and a mixing distribution to hierarchically define the variational distribution. Existing SIVI methods parameterize the mixing distribution using implicit distributions, leading to intractable variational densities. As a result, directly maximizing the evidence lower bound (ELBO) is not possible and so, they resort to either: optimizing bounds on the ELBO, employing costly inner-loop Markov chain Monte Carlo runs, or solving minimax objectives. In this paper, we propose a novel method for SIVI called Particle Variational Inference (PVI) which employs empirical measures to approximate the optimal mixing distributions characterized as the minimizer of a natural free energy functional via a particle approximation of an Euclidean--Wasserstein gradient flow. This approach means that, unlike prior works, PVI can directly optimize the ELBO; furthermore, it makes no parametric assumption about the mixing distribution. Our empirical results demonstrate that PVI performs favourably against other SIVI methods across various tasks. Moreover, we provide a theoretical analysis of the behaviour of the gradient flow of a related free energy functional: establishing the existence and uniqueness of solutions as well as propagation of chaos results.

7/2/2024

stat.ML cs.LG

✨

Amortized Variational Inference: When and Why?

Charles C. Margossian, David M. Blei

In a probabilistic latent variable model, factorized (or mean-field) variational inference (F-VI) fits a separate parametric distribution for each latent variable. Amortized variational inference (A-VI) instead learns a common inference function, which maps each observation to its corresponding latent variable's approximate posterior. Typically, A-VI is used as a step in the training of variational autoencoders, however it stands to reason that A-VI could also be used as a general alternative to F-VI. In this paper we study when and why A-VI can be used for approximate Bayesian inference. We derive conditions on a latent variable model which are necessary, sufficient, and verifiable under which A-VI can attain F-VI's optimal solution, thereby closing the amortization gap. We prove these conditions are uniquely verified by simple hierarchical models, a broad class that encompasses many models in machine learning. We then show, on a broader class of models, how to expand the domain of AVI's inference function to improve its solution, and we provide examples, e.g. hidden Markov models, where the amortization gap cannot be closed.

5/27/2024

stat.ML cs.LG

👀

Predictive Uncertainty Quantification via Risk Decompositions for Strictly Proper Scoring Rules

Nikita Kotelevskii, Maxim Panov

Uncertainty quantification in predictive modeling often relies on ad hoc methods as there is no universally accepted formal framework for that. This paper introduces a theoretical approach to understanding uncertainty through statistical risks, distinguishing between aleatoric (data-related) and epistemic (model-related) uncertainties. We explain how to split pointwise risk into Bayes risk and excess risk. In particular, we show that excess risk, related to epistemic uncertainty, aligns with Bregman divergences. To turn considered risk measures into actual uncertainty estimates, we suggest using the Bayesian approach by approximating the risks with the help of posterior distributions. We tested our method on image datasets, evaluating its performance in detecting out-of-distribution and misclassified data using the AUROC metric. Our results confirm the effectiveness of the considered approach and offer practical guidance for estimating uncertainty in real-world applications.

6/7/2024

stat.ML cs.LG