Scalable Wasserstein Gradient Flow for Generative Modeling through Unbalanced Optimal Transport

2402.05443

Published 6/4/2024 by Jaemoo Choi, Jaewoong Choi, Myungjoo Kang

📊

Abstract

Wasserstein Gradient Flow (WGF) describes the gradient dynamics of probability density within the Wasserstein space. WGF provides a promising approach for conducting optimization over the probability distributions. Numerically approximating the continuous WGF requires the time discretization method. The most well-known method for this is the JKO scheme. In this regard, previous WGF models employ the JKO scheme and parametrize transport map for each JKO step. However, this approach results in quadratic training complexity $O(K^2)$ with the number of JKO step $K$. This severely limits the scalability of WGF models. In this paper, we introduce a scalable WGF-based generative model, called Semi-dual JKO (S-JKO). Our model is based on the semi-dual form of the JKO step, derived from the equivalence between the JKO step and the Unbalanced Optimal Transport. Our approach reduces the training complexity to $O(K)$. We demonstrate that our model significantly outperforms existing WGF-based generative models, achieving FID scores of 2.62 on CIFAR-10 and 5.46 on CelebA-HQ-256, which are comparable to state-of-the-art image generative models.

Create account to get full access

Overview

This paper introduces a new approach called Semi-dual JKO (S-JKO) for conducting optimization over probability distributions using Wasserstein Gradient Flow (WGF).
WGF describes the gradient dynamics of probability density within the Wasserstein space, providing a promising approach for optimizing over probability distributions.
The most well-known method for numerically approximating continuous WGF is the JKO scheme, which previous WGF models have employed.
However, the existing approach of parametrizing transport maps for each JKO step results in quadratic training complexity, severely limiting the scalability of WGF models.
The S-JKO model introduced in this paper reduces the training complexity to linear, making WGF-based generative models more scalable.

Plain English Explanation

The paper discusses a technique called Wasserstein Gradient Flow (WGF) for optimizing probability distributions. Probability distributions are mathematical representations of how likely different outcomes are. WGF provides a way to efficiently update these distributions to find the best ones for a particular task.

To use WGF in practice, researchers need to approximate the continuous WGF process with a discrete "time-stepping" approach. The most common method for this is called the JKO scheme. Previous WGF models have used the JKO scheme, but they had a major limitation - the training complexity grew quadratically with the number of time steps, making the models very slow and hard to scale up.

The new S-JKO model introduced in this paper solves this problem by using a different mathematical technique called the semi-dual form of the JKO step. This reduces the training complexity to grow linearly with the number of time steps, a much more manageable rate. The authors show that this allows their S-JKO model to significantly outperform previous WGF-based generative models on standard benchmark tasks, achieving results comparable to state-of-the-art image generation techniques.

Technical Explanation

The paper introduces a scalable Wasserstein Gradient Flow (WGF)-based generative model called Semi-dual JKO (S-JKO). WGF provides a framework for optimizing over probability distributions by describing the gradient dynamics within the Wasserstein space.

Numerically approximating the continuous WGF requires discretization using the JKO scheme. Previous WGF models have employed the JKO scheme and parameterized transport maps for each JKO step. However, this approach leads to a quadratic training complexity O(K^2) with the number of JKO steps K, severely limiting the scalability of these models.

The key innovation in the S-JKO model is the use of the semi-dual form of the JKO step, derived from the equivalence between the JKO step and Unbalanced Optimal Transport. This formulation reduces the training complexity to O(K), making WGF-based generative models more scalable.

The authors demonstrate that their S-JKO model significantly outperforms existing WGF-based generative models, achieving Fréchet Inception Distance (FID) scores of 2.62 on CIFAR-10 and 5.46 on CelebA-HQ-256, comparable to state-of-the-art image generative models.

Critical Analysis

The paper provides a promising solution to the scalability issues of previous WGF-based generative models by introducing the S-JKO approach. The authors' experiments demonstrate that the S-JKO model can achieve competitive performance on standard benchmarks while being more computationally efficient.

However, the paper does not explore the potential drawbacks or limitations of the S-JKO model. For example, it would be valuable to understand how the model's performance and training stability compare to other state-of-the-art generative models, such as Riemannian Stochastic Gradient Descent (RSGD) and RSGD-SVRG flows, which also aim to optimize over probability distributions.

Additionally, the paper does not discuss the potential challenges or considerations in applying the S-JKO model to more complex datasets or tasks beyond image generation. Further research could explore the model's performance and robustness in diverse domains.

Overall, the S-JKO model represents a valuable contribution to the field of WGF-based generative modeling, but additional research and analysis would be helpful to fully understand its strengths, limitations, and potential applications.

Conclusion

This paper introduces a scalable Wasserstein Gradient Flow (WGF)-based generative model called Semi-dual JKO (S-JKO). The key innovation is the use of the semi-dual form of the JKO step, which reduces the training complexity from quadratic to linear in the number of time steps.

The authors demonstrate that the S-JKO model significantly outperforms existing WGF-based generative models, achieving state-of-the-art performance on standard image generation benchmarks. This breakthrough in scalability could enable more widespread adoption and application of WGF-based techniques in generative modeling and beyond.

While the paper provides a promising solution, further research is needed to fully understand the model's capabilities, limitations, and potential areas for improvement. Nonetheless, the S-JKO model represents an important step forward in the development of efficient and scalable optimization methods for probability distributions.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

🤔

Convergence of flow-based generative models via proximal gradient descent in Wasserstein space

Xiuyuan Cheng, Jianfeng Lu, Yixin Tan, Yao Xie

Flow-based generative models enjoy certain advantages in computing the data generation and the likelihood, and have recently shown competitive empirical performance. Compared to the accumulating theoretical studies on related score-based diffusion models, analysis of flow-based models, which are deterministic in both forward (data-to-noise) and reverse (noise-to-data) directions, remain sparse. In this paper, we provide a theoretical guarantee of generating data distribution by a progressive flow model, the so-called JKO flow model, which implements the Jordan-Kinderleherer-Otto (JKO) scheme in a normalizing flow network. Leveraging the exponential convergence of the proximal gradient descent (GD) in Wasserstein space, we prove the Kullback-Leibler (KL) guarantee of data generation by a JKO flow model to be $O(varepsilon^2)$ when using $N lesssim log (1/varepsilon)$ many JKO steps ($N$ Residual Blocks in the flow) where $varepsilon $ is the error in the per-step first-order condition. The assumption on data density is merely a finite second moment, and the theory extends to data distributions without density and when there are inversion errors in the reverse process where we obtain KL-$W_2$ mixed error guarantees. The non-asymptotic convergence rate of the JKO-type $W_2$-proximal GD is proved for a general class of convex objective functionals that includes the KL divergence as a special case, which can be of independent interest. The analysis framework can extend to other first-order Wasserstein optimization schemes applied to flow-based generative models.

5/20/2024

stat.ML cs.LG

Generative Modeling by Minimizing the Wasserstein-2 Loss

Yu-Jui Huang, Zachariah Malik

This paper approaches the unsupervised learning problem by minimizing the second-order Wasserstein loss (the $W_2$ loss). The minimization is characterized by a distribution-dependent ordinary differential equation (ODE), whose dynamics involves the Kantorovich potential between a current estimated distribution and the true data distribution. A main result shows that the time-marginal law of the ODE converges exponentially to the true data distribution. To prove that the ODE has a unique solution, we first construct explicitly a solution to the associated nonlinear Fokker-Planck equation and show that it coincides with the unique gradient flow for the $W_2$ loss. Based on this, a unique solution to the ODE is built from Trevisan's superposition principle and the exponential convergence results. An Euler scheme is proposed for the distribution-dependent ODE and it is shown to correctly recover the gradient flow for the $W_2$ loss in the limit. An algorithm is designed by following the scheme and applying persistent training, which is natural in our gradient-flow framework. In both low- and high-dimensional experiments, our algorithm converges much faster than and outperforms Wasserstein generative adversarial networks, by increasing the level of persistent training appropriately.

6/21/2024

stat.ML cs.LG

Fast Gradient Computation for Gromov-Wasserstein Distance

Wei Zhang, Zihao Wang, Jie Fan, Hao Wu, Yong Zhang

The Gromov-Wasserstein distance is a notable extension of optimal transport. In contrast to the classic Wasserstein distance, it solves a quadratic assignment problem that minimizes the pair-wise distance distortion under the transportation of distributions and thus could apply to distributions in different spaces. These properties make Gromov-Wasserstein widely applicable to many fields, such as computer graphics and machine learning. However, the computation of the Gromov-Wasserstein distance and transport plan is expensive. The well-known Entropic Gromov-Wasserstein approach has a cubic complexity since the matrix multiplication operations need to be repeated in computing the gradient of Gromov-Wasserstein loss. This becomes a key bottleneck of the method. Currently, existing methods accelerate the computation focus on sampling and approximation, which leads to low accuracy or incomplete transport plan. In this work, we propose a novel method to accelerate accurate gradient computation by dynamic programming techniques, reducing the complexity from cubic to quadratic. In this way, the original computational bottleneck is broken and the new entropic solution can be obtained with total quadratic time, which is almost optimal complexity. Furthermore, it can be extended to some variants easily. Extensive experiments validate the efficiency and effectiveness of our method.

4/16/2024

cs.LG

🤯

Wasserstein Gradient Flow over Variational Parameter Space for Variational Inference

Dai Hai Nguyen, Tetsuya Sakurai, Hiroshi Mamitsuka

Variational inference (VI) can be cast as an optimization problem in which the variational parameters are tuned to closely align a variational distribution with the true posterior. The optimization task can be approached through vanilla gradient descent in black-box VI or natural-gradient descent in natural-gradient VI. In this work, we reframe VI as the optimization of an objective that concerns probability distributions defined over a textit{variational parameter space}. Subsequently, we propose Wasserstein gradient descent for tackling this optimization problem. Notably, the optimization techniques, namely black-box VI and natural-gradient VI, can be reinterpreted as specific instances of the proposed Wasserstein gradient descent. To enhance the efficiency of optimization, we develop practical methods for numerically solving the discrete gradient flows. We validate the effectiveness of the proposed methods through empirical experiments on a synthetic dataset, supplemented by theoretical analyses.

5/29/2024

cs.LG stat.ML