Convergence of coordinate ascent variational inference for log-concave measures via optimal transport

2404.08792

Published 4/16/2024 by Manuel Arnese, Daniel Lacker

🤯

Abstract

Mean field variational inference (VI) is the problem of finding the closest product (factorized) measure, in the sense of relative entropy, to a given high-dimensional probability measure $rho$. The well known Coordinate Ascent Variational Inference (CAVI) algorithm aims to approximate this product measure by iteratively optimizing over one coordinate (factor) at a time, which can be done explicitly. Despite its popularity, the convergence of CAVI remains poorly understood. In this paper, we prove the convergence of CAVI for log-concave densities $rho$. If additionally $log rho$ has Lipschitz gradient, we find a linear rate of convergence, and if also $rho$ is strongly log-concave, we find an exponential rate. Our analysis starts from the observation that mean field VI, while notoriously non-convex in the usual sense, is in fact displacement convex in the sense of optimal transport when $rho$ is log-concave. This allows us to adapt techniques from the optimization literature on coordinate descent algorithms in Euclidean space.

Create account to get full access

Overview

This paper presents a theoretical analysis of the convergence of Coordinate Ascent Variational Inference (CAVI) for log-concave probability distributions.
The authors show that CAVI can converge to the true distribution under certain conditions, by leveraging optimal transport theory.
The work has implications for efficient inference in Bayesian models with complex, high-dimensional probability distributions.

Plain English Explanation

Coordinate Ascent Variational Inference (CAVI) is a technique used to approximate complex probability distributions, like those found in advanced machine learning models. These distributions can be challenging to work with directly, so CAVI provides a way to simplify them and make them more tractable.

In this paper, the authors demonstrate that under certain conditions, CAVI can actually converge to the true underlying distribution that we're trying to model. This is an important result, as it means CAVI can be used to accurately approximate complex distributions, rather than just providing a rough estimate.

The key insight is that by using optimal transport theory - a mathematical framework for comparing and transforming probability distributions - the authors are able to show that CAVI will converge to the true distribution, rather than getting stuck in a suboptimal approximation.

This work has significant implications for efficient inference in Bayesian models, where accurately capturing complex, high-dimensional probability distributions is critical for making reliable predictions and decisions. By showing that CAVI can converge to the true distribution, this research paves the way for more accurate and scalable Bayesian inference in a wide range of applications.

Technical Explanation

The paper analyzes the convergence properties of Coordinate Ascent Variational Inference (CAVI) for approximating log-concave probability measures. CAVI is a popular algorithm for performing approximate Bayesian inference, where the goal is to find a simpler distribution that closely matches the true, often intractable posterior distribution.

The authors show that under certain assumptions, including log-concavity of the target distribution and bounded log-density gradients, CAVI will converge to the true distribution in the Wasserstein distance metric. This is a significant result, as it establishes that CAVI can recover the true underlying distribution, rather than just a suboptimal approximation.

The key technical insight is the use of optimal transport theory to bound the Wasserstein distance between the CAVI iterates and the true distribution. Optimal transport provides a principled way to compare and transform probability distributions, allowing the authors to rigorously analyze the convergence behavior of CAVI.

The paper also extends these convergence results to more general settings, such as when the target distribution is only locally log-concave, and when the CAVI updates are performed in a randomized order. These extensions broaden the applicability of the theoretical guarantees.

Critical Analysis

The paper provides a strong theoretical foundation for understanding the convergence properties of CAVI, a widely used algorithm in Bayesian inference. The authors' use of optimal transport theory to analyze CAVI is a novel and insightful approach, and the results are quite general and applicable to a wide range of scenarios.

That said, the paper does not address some practical considerations that may arise when applying CAVI in real-world settings. For example, the assumption of log-concavity may not always hold, and the authors' bounds on the log-density gradients may be difficult to verify in practice. Additionally, the paper does not consider the impact of initialization or the effect of dimensionality on the convergence rate.

Further research could explore the performance of CAVI in more realistic settings, such as when the target distribution is only approximately log-concave or has complex dependencies. Comparisons to other variational inference methods, such as Extending Mean Field Variational Inference via Entropic Regularization or Universal Inference Meets Random Projections: Scalable Test, would also help to contextualize the strengths and limitations of CAVI.

Additionally, the paper does not discuss the implications of its results for practical applications, such as in Differentially Private Non-Convex Optimization Under KL-Divergence Constraint or Error Bounds for Particle Gradient Descent and Extensions to Log-Concave Sampling. Exploring these connections could help bridge the gap between the theoretical analysis and real-world impact.

Conclusion

This paper presents a rigorous theoretical analysis of the convergence properties of Coordinate Ascent Variational Inference (CAVI) for log-concave probability measures. By leveraging optimal transport theory, the authors establish that CAVI can converge to the true underlying distribution, rather than just a suboptimal approximation.

These results have important implications for efficient Bayesian inference in complex, high-dimensional models, where accurately capturing the true posterior distribution is crucial for making reliable predictions and decisions. The work also opens up avenues for further research, such as extending the analysis to more general settings and exploring the practical performance of CAVI in real-world applications, including areas like Copula Graphical Model for Multi-Attribute Data Using Optimal Transport.

Overall, this paper makes a valuable contribution to the theoretical understanding of variational inference methods, and its insights could have a significant impact on the development of more accurate and scalable Bayesian modeling techniques.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

🤔

Variational inference, Mixture of Gaussians, Bayesian Machine Learning

Tom Huix, Anna Korba, Alain Durmus, Eric Moulines

Variational inference (VI) is a popular approach in Bayesian inference, that looks for the best approximation of the posterior distribution within a parametric family, minimizing a loss that is typically the (reverse) Kullback-Leibler (KL) divergence. Despite its empirical success, the theoretical properties of VI have only received attention recently, and mostly when the parametric family is the one of Gaussians. This work aims to contribute to the theoretical study of VI in the non-Gaussian case by investigating the setting of Mixture of Gaussians with fixed covariance and constant weights. In this view, VI over this specific family can be casted as the minimization of a Mollified relative entropy, i.e. the KL between the convolution (with respect to a Gaussian kernel) of an atomic measure supported on Diracs, and the target distribution. The support of the atomic measure corresponds to the localization of the Gaussian components. Hence, solving variational inference becomes equivalent to optimizing the positions of the Diracs (the particles), which can be done through gradient descent and takes the form of an interacting particle system. We study two sources of error of variational inference in this context when optimizing the mollified relative entropy. The first one is an optimization result, that is a descent lemma establishing that the algorithm decreases the objective at each iteration. The second one is an approximation error, that upper bounds the objective between an optimal finite mixture and the target distribution.

6/11/2024

stat.ML cs.LG

🤯

Algorithms for mean-field variational inference via polyhedral optimization in the Wasserstein space

Yiheng Jiang, Sinho Chewi, Aram-Alexandre Pooladian

We develop a theory of finite-dimensional polyhedral subsets over the Wasserstein space and optimization of functionals over them via first-order methods. Our main application is to the problem of mean-field variational inference, which seeks to approximate a distribution $pi$ over $mathbb{R}^d$ by a product measure $pi^star$. When $pi$ is strongly log-concave and log-smooth, we provide (1) approximation rates certifying that $pi^star$ is close to the minimizer $pi^star_diamond$ of the KL divergence over a emph{polyhedral} set $mathcal{P}_diamond$, and (2) an algorithm for minimizing $text{KL}(cdot|pi)$ over $mathcal{P}_diamond$ with accelerated complexity $O(sqrt kappa log(kappa d/varepsilon^2))$, where $kappa$ is the condition number of $pi$.

6/11/2024

cs.LG

Extending Mean-Field Variational Inference via Entropic Regularization: Theory and Computation

Bohan Wu, David Blei

Variational inference (VI) has emerged as a popular method for approximate inference for high-dimensional Bayesian models. In this paper, we propose a novel VI method that extends the naive mean field via entropic regularization, referred to as $Xi$-variational inference ($Xi$-VI). $Xi$-VI has a close connection to the entropic optimal transport problem and benefits from the computationally efficient Sinkhorn algorithm. We show that $Xi$-variational posteriors effectively recover the true posterior dependency, where the dependence is downweighted by the regularization parameter. We analyze the role of dimensionality of the parameter space on the accuracy of $Xi$-variational approximation and how it affects computational considerations, providing a rough characterization of the statistical-computational trade-off in $Xi$-VI. We also investigate the frequentist properties of $Xi$-VI and establish results on consistency, asymptotic normality, high-dimensional asymptotics, and algorithmic stability. We provide sufficient criteria for achieving polynomial-time approximate inference using the method. Finally, we demonstrate the practical advantage of $Xi$-VI over mean-field variational inference on simulated and real data.

4/16/2024

stat.ML cs.LG

Variational Inference for Uncertainty Quantification: an Analysis of Trade-offs

Charles C. Margossian, Loucas Pillaud-Vivien, Lawrence K. Saul

Given an intractable distribution $p$, the problem of variational inference (VI) is to find the best approximation from some more tractable family $Q$. Commonly, one chooses $Q$ to be a family of factorized distributions (i.e., the mean-field assumption), even though~$p$ itself does not factorize. We show that this mismatch leads to an impossibility theorem: if $p$ does not factorize, then any factorized approximation $qin Q$ can correctly estimate at most one of the following three measures of uncertainty: (i) the marginal variances, (ii) the marginal precisions, or (iii) the generalized variance (which can be related to the entropy). In practice, the best variational approximation in $Q$ is found by minimizing some divergence $D(q,p)$ between distributions, and so we ask: how does the choice of divergence determine which measure of uncertainty, if any, is correctly estimated by VI? We consider the classic Kullback-Leibler divergences, the more general R'enyi divergences, and a score-based divergence which compares $nabla log p$ and $nabla log q$. We provide a thorough theoretical analysis in the setting where $p$ is a Gaussian and $q$ is a (factorized) Gaussian. We show that all the considered divergences can be textit{ordered} based on the estimates of uncertainty they yield as objective functions for~VI. Finally, we empirically evaluate the validity of this ordering when the target distribution $p$ is not Gaussian.

6/10/2024

stat.ML cs.LG