Variational inference, Mixture of Gaussians, Bayesian Machine Learning

2406.04012

Published 6/11/2024 by Tom Huix, Anna Korba, Alain Durmus, Eric Moulines

Abstract

Variational inference (VI) is a popular approach in Bayesian inference that looks for the best approximation of the posterior distribution within a parametric family, minimizing a loss that is typically the (reverse) Kullback-Leibler (KL) divergence. Despite its empirical success, the theoretical properties of VI have only received attention recently, and mostly when the parametric family is the one of Gaussians. This work aims to contribute to the theoretical study of VI in the non-Gaussian case by investigating the setting of Mixture of Gaussians with fixed covariance and constant weights. In this view, VI over this specific family can be cast as the minimization of a mollified relative entropy, i.e. the KL between the convolution (with respect to a Gaussian kernel) of an atomic measure supported on Diracs, and the target distribution. The support of the atomic measure corresponds to the localization of the Gaussian components. Hence, solving variational inference becomes equivalent to optimizing the positions of the Diracs (the particles), which can be done through gradient descent and takes the form of an interacting particle system. We study two sources of error of variational inference in this context when optimizing the mollified relative entropy. The first one is an optimization result, namely a descent lemma establishing that the algorithm decreases the objective at each iteration. The second one is an approximation error, which upper bounds the objective between an optimal finite mixture and the target distribution.

Overview

Variational inference (VI) is a popular Bayesian inference technique that finds the best approximation of the posterior distribution within a parametric family.
This paper explores the theoretical properties of VI in the non-Gaussian case, specifically for Mixture of Gaussians with fixed covariance and constant weights.
The authors cast VI over this family as minimizing a Mollified relative entropy, which is the Kullback-Leibler (KL) divergence between the convolution of an atomic measure supported on Diracs and the target distribution.
Solving the variational inference problem is equivalent to optimizing the positions of the Diracs (the particles) through gradient descent, forming an interacting particle system.
The paper examines two sources of error in this context: an optimization result that shows the algorithm decreases the objective at each iteration, and an approximation error that bounds the objective between an optimal finite mixture and the target distribution.

Plain English Explanation

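Variational inference is a way of doing approximate Bayesian inference. In Bayesian inference, what we know about a model's unknown quantities after seeing the data is described by the "posterior" distribution. This distribution is usually too complicated to compute exactly, so variational inference searches for the closest match to it within a simpler family of distributions.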

In this paper, the authors look at a specific type of variational inference, where the approximation is a mixture of Gaussian distributions with fixed covariance and constant weights. They show that this type of variational inference can be seen as minimizing a mollified relative entropy, which is a measure of how different the approximation is from the true distribution.
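To make the approximating family concrete, here is a minimal sketch of the log-density of an equal-weight Gaussian mixture whose components all share the same covariance. It assumes an isotropic covariance `sigma**2 * I`; the function name `mixture_logpdf` and its argument names are illustrative choices, not taken from the paper.

```python
import numpy as np

def mixture_logpdf(x, means, sigma):
    """Log-density of (1/N) * sum_i N(x; means[i], sigma^2 I).

    x:     array of shape (M, d), points where the density is evaluated
    means: array of shape (N, d), component centers (the only free parameters)
    sigma: shared standard deviation (assumed isotropic, fixed)
    """
    x = np.asarray(x, dtype=float)
    means = np.asarray(means, dtype=float)
    n, d = means.shape
    # Squared distance from every point to every component center: shape (M, N).
    sq_dist = ((x[:, None, :] - means[None, :, :]) ** 2).sum(axis=-1)
    log_comp = -0.5 * sq_dist / sigma**2 - 0.5 * d * np.log(2 * np.pi * sigma**2)
    # Numerically stable log-sum-exp over components, then equal weights 1/N.
    m = log_comp.max(axis=1, keepdims=True)
    log_sum = m[:, 0] + np.log(np.exp(log_comp - m).sum(axis=1))
    return log_sum - np.log(n)
```

Because the weights and covariance are fixed, the only thing left to tune in this family is the array `means`, which is why the inference problem reduces to moving the component centers.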

The authors then show that solving this variational inference problem is equivalent to optimizing the positions of the Gaussian components, which can be done using gradient descent. This forms an "interacting particle system", where the particles (the Gaussian components) interact with each other to find the best approximation.

The paper analyzes two sources of error in this approach. First, the authors show that the gradient descent algorithm decreases the objective (the mollified relative entropy) at each iteration, so each update is guaranteed not to make the approximation worse. Second, they provide an upper bound on how far the best mixture with a finite number of components remains from the true distribution, which indicates how good the approximation can be at best.

Technical Explanation

The authors cast the variational inference problem over the family of Mixture of Gaussians with fixed covariance and constant weights as the minimization of a Mollified relative entropy. This is the Kullback-Leibler (KL) divergence between the convolution (with respect to a Gaussian kernel) of an atomic measure supported on Diracs, and the target distribution. The support of the atomic measure corresponds to the localization of the Gaussian components.
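The objective described in this paragraph can be written out as follows. The symbols are notational assumptions made for this sketch and may differ from the paper's: $N$ particles $x_1, \dots, x_N$, a Gaussian kernel $k_\epsilon$ with assumed isotropic covariance $\epsilon^2 I_d$, and a target density $\pi$.

```latex
\mu_N = \frac{1}{N}\sum_{i=1}^{N}\delta_{x_i},
\qquad
k_\epsilon \star \mu_N = \frac{1}{N}\sum_{i=1}^{N}\mathcal{N}\!\left(x_i,\ \epsilon^{2} I_d\right),
\qquad
\mathcal{F}_\epsilon(\mu_N)
  = \mathrm{KL}\!\left(k_\epsilon \star \mu_N \,\middle\|\, \pi\right)
  = \int \log\!\left(\frac{(k_\epsilon \star \mu_N)(y)}{\pi(y)}\right)(k_\epsilon \star \mu_N)(y)\,\mathrm{d}y .
```

Convolving the atomic measure $\mu_N$ with the kernel replaces each Dirac by a Gaussian bump centered at the same location, which is exactly the equal-weight, fixed-covariance mixture.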

Solving the variational inference problem is then equivalent to optimizing the positions of the Diracs (the particles), which can be done through gradient descent. This takes the form of an interacting particle system, where the particles interact with each other to find the best approximation.
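A minimal sketch of such a particle update is shown below. It is not the authors' implementation: it assumes an isotropic kernel `eps**2 * I`, estimates the expectation over each Gaussian component by Monte Carlo with the reparameterization `y = x_i + eps * z`, and absorbs the `1/N` factor of the gradient into the step size. All function and parameter names are hypothetical.

```python
import numpy as np

def grad_log_mixture(y, particles, eps):
    """Gradient in y of log[(1/N) sum_j N(y; particles[j], eps^2 I)]; y has shape (M, d)."""
    diff = particles[None, :, :] - y[:, None, :]            # (M, N, d)
    logits = -0.5 * (diff ** 2).sum(axis=-1) / eps**2       # (M, N)
    logits -= logits.max(axis=1, keepdims=True)             # numerical stability
    w = np.exp(logits)
    w /= w.sum(axis=1, keepdims=True)                       # component responsibilities
    return (w[:, :, None] * diff).sum(axis=1) / eps**2      # (M, d)

def vi_particle_descent(grad_log_target, particles, eps, step, n_iters, n_mc=64, seed=0):
    """Gradient descent on particle positions for KL(mixture || target).

    grad_log_target: callable mapping (M, d) points to (M, d) gradients of log pi
                     (assumed to be available; only needed up to the normalizing constant).
    """
    rng = np.random.default_rng(seed)
    x = np.array(particles, dtype=float)
    n, d = x.shape
    for _ in range(n_iters):
        z = rng.standard_normal((n_mc, n, d))
        y = (x[None, :, :] + eps * z).reshape(-1, d)        # samples from every component
        # Integrand of the gradient: grad log(mixture / target) at the samples.
        g = grad_log_mixture(y, x, eps) - grad_log_target(y)
        g = g.reshape(n_mc, n, d).mean(axis=0)              # one estimate per particle
        x = x - step * g                                    # the 1/N factor is folded into step
    return x
```

The interaction appears in `grad_log_mixture`: the force on a particle depends on where all the other particles currently sit, which pushes overlapping components apart, while `grad_log_target` pulls every particle toward regions of high target density.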

The paper analyzes two sources of error in this context:

Optimization result: The authors establish a descent lemma, showing that the gradient descent algorithm decreases the objective (the mollified relative entropy) at each iteration. On its own, a descent lemma guarantees monotone decrease of the objective; it does not establish convergence to a local or global minimum.
Approximation error: The authors provide an upper bound on the objective evaluated at an optimal finite mixture, i.e. on the mollified relative entropy between the best mixture with a finite number of components and the target distribution. This quantifies how well the family can approximate the target at best.
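For orientation, the textbook form of a descent lemma is sketched below for a generic objective $F$ whose gradient is $M$-Lipschitz, with step size $\gamma$. This is the standard statement for smooth functions, given here as an assumption about the general shape of such a result; the paper's precise conditions and constants are not reproduced.

```latex
x^{k+1} = x^{k} - \gamma \nabla F(x^{k}),
\qquad
F(x^{k+1}) \le F(x^{k}) - \gamma\left(1 - \frac{\gamma M}{2}\right)\left\|\nabla F(x^{k})\right\|^{2},
\qquad
\text{so for } \gamma \le \frac{1}{M}:\quad
\min_{0 \le k < K}\left\|\nabla F(x^{k})\right\|^{2}
  \le \frac{2\left(F(x^{0}) - \inf F\right)}{\gamma K}.
```

The last inequality follows by summing the per-step decrease over $K$ iterations. It shows that the gradient norm becomes small along the iterates, which is a statement about approaching stationary points rather than about reaching a minimum.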

Critical Analysis

The paper provides a thorough theoretical analysis of variational inference in the non-Gaussian case, specifically for Mixture of Gaussians with fixed covariance and constant weights. This is an important contribution, as the theoretical properties of VI have mostly been studied in the Gaussian case.

One potential limitation of the research is that it focuses on a relatively narrow family of distributions (Mixture of Gaussians with fixed covariance and constant weights). It would be interesting to see if the authors' analysis can be extended to more general families of distributions, or to cases where the covariance and weights are not fixed.

Additionally, the paper does not provide any experimental results or comparisons to other variational inference methods. While the theoretical analysis is valuable, it would be helpful to see how the proposed approach performs in practice, especially in comparison to other state-of-the-art VI techniques such as variational Bayesian surrogate modelling or coordinate ascent variational inference.

Overall, this paper makes an important contribution to the theoretical understanding of variational inference in non-Gaussian settings. The authors' analysis of the optimization and approximation errors provides valuable insights into the strengths and limitations of this approach, and could inform the development of more robust and efficient VI methods in the future.

Conclusion

This paper explores the theoretical properties of variational inference (VI) in the non-Gaussian case, specifically for Mixture of Gaussians with fixed covariance and constant weights. The authors cast the VI problem as the minimization of a Mollified relative entropy, which allows them to solve the problem by optimizing the positions of the Gaussian components through gradient descent.

The paper provides two key theoretical results: an optimization result showing that the gradient descent algorithm decreases the objective at each iteration, and an approximation error bound between the optimal finite mixture and the target distribution. These insights contribute to a deeper understanding of the strengths and limitations of VI in non-Gaussian settings, and could inform the development of more robust and efficient VI methods in the future.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Variational Inference for Uncertainty Quantification: an Analysis of Trade-offs

Charles C. Margossian, Loucas Pillaud-Vivien, Lawrence K. Saul

Given an intractable distribution $p$, the problem of variational inference (VI) is to find the best approximation from some more tractable family $Q$. Commonly, one chooses $Q$ to be a family of factorized distributions (i.e., the mean-field assumption), even though $p$ itself does not factorize. We show that this mismatch leads to an impossibility theorem: if $p$ does not factorize, then any factorized approximation $q \in Q$ can correctly estimate at most one of the following three measures of uncertainty: (i) the marginal variances, (ii) the marginal precisions, or (iii) the generalized variance (which can be related to the entropy). In practice, the best variational approximation in $Q$ is found by minimizing some divergence $D(q,p)$ between distributions, and so we ask: how does the choice of divergence determine which measure of uncertainty, if any, is correctly estimated by VI? We consider the classic Kullback-Leibler divergences, the more general Rényi divergences, and a score-based divergence which compares $\nabla \log p$ and $\nabla \log q$. We provide a thorough theoretical analysis in the setting where $p$ is a Gaussian and $q$ is a (factorized) Gaussian. We show that all the considered divergences can be ordered based on the estimates of uncertainty they yield as objective functions for VI. Finally, we empirically evaluate the validity of this ordering when the target distribution $p$ is not Gaussian.

6/10/2024

stat.ML cs.LG

Extending Mean-Field Variational Inference via Entropic Regularization: Theory and Computation

Bohan Wu, David Blei

Variational inference (VI) has emerged as a popular method for approximate inference for high-dimensional Bayesian models. In this paper, we propose a novel VI method that extends the naive mean field via entropic regularization, referred to as $\Xi$-variational inference ($\Xi$-VI). $\Xi$-VI has a close connection to the entropic optimal transport problem and benefits from the computationally efficient Sinkhorn algorithm. We show that $\Xi$-variational posteriors effectively recover the true posterior dependency, where the dependence is downweighted by the regularization parameter. We analyze the role of dimensionality of the parameter space on the accuracy of $\Xi$-variational approximation and how it affects computational considerations, providing a rough characterization of the statistical-computational trade-off in $\Xi$-VI. We also investigate the frequentist properties of $\Xi$-VI and establish results on consistency, asymptotic normality, high-dimensional asymptotics, and algorithmic stability. We provide sufficient criteria for achieving polynomial-time approximate inference using the method. Finally, we demonstrate the practical advantage of $\Xi$-VI over mean-field variational inference on simulated and real data.

4/16/2024

stat.ML cs.LG

Manifold Gaussian Variational Bayes on the Precision Matrix

Martin Magris, Mostafa Shabani, Alexandros Iosifidis

We propose an optimization algorithm for Variational Inference (VI) in complex models. Our approach relies on natural gradient updates where the variational space is a Riemann manifold. We develop an efficient algorithm for Gaussian Variational Inference whose updates satisfy the positive definite constraint on the variational covariance matrix. Our Manifold Gaussian Variational Bayes on the Precision matrix (MGVBP) solution provides simple update rules, is straightforward to implement, and the use of the precision matrix parametrization has a significant computational advantage. Due to its black-box nature, MGVBP stands as a ready-to-use solution for VI in complex models. Over five datasets, we empirically validate our feasible approach on different statistical and econometric models, discussing its performance with respect to baseline methods.

4/17/2024

stat.ML cs.LG

Regularized KL-Divergence for Well-Defined Function-Space Variational Inference in Bayesian neural networks

Tristan Cinquin, Robert Bamler

Bayesian neural networks (BNN) promise to combine the predictive performance of neural networks with principled uncertainty modeling important for safety-critical systems and decision making. However, posterior uncertainty estimates depend on the choice of prior, and finding informative priors in weight-space has proven difficult. This has motivated variational inference (VI) methods that pose priors directly on the function generated by the BNN rather than on weights. In this paper, we address a fundamental issue with such function-space VI approaches pointed out by Burt et al. (2020), who showed that the objective function (ELBO) is negative infinite for most priors of interest. Our solution builds on generalized VI (Knoblauch et al., 2019) with the regularized KL divergence (Quang, 2019) and is, to the best of our knowledge, the first well-defined variational objective for function-space inference in BNNs with Gaussian process (GP) priors. Experiments show that our method incorporates the properties specified by the GP prior on synthetic and small real-world data sets, and provides competitive uncertainty estimates for regression, classification and out-of-distribution detection compared to BNN baselines with both function and weight-space priors.

6/7/2024

cs.LG stat.ML