Double Variance Reduction: A Smoothing Trick for Composite Optimization Problems without First-Order Gradient

Read original: arXiv:2405.17761 - Published 5/29/2024 by Hao Di, Haishan Ye, Yueling Zhang, Xiangyu Chang, Guang Dai, Ivor W. Tsang

🛠️

Overview

Variance reduction techniques are used to improve the performance of first-order (FO) and zeroth-order (ZO) optimization methods by reducing sampling variance.
In composite optimization problems, ZO methods face an additional challenge called coordinate-wise variance, which stems from random gradient estimation.
Prior methods for reducing coordinate-wise variance require estimating all partial derivatives, which is computationally expensive, especially in high-dimensional scenarios.

Plain English Explanation

Optimization is a common task in machine learning and other fields, where we try to find the best set of parameters or values to minimize or maximize some objective function. Variance reduction techniques are used to make this optimization process more efficient by reducing the amount of randomness or "noise" in the calculations.

There are two main types of optimization methods: first-order (FO) and zeroth-order (ZO). FO methods use information about the gradient of the objective function, while ZO methods only use the function values themselves, without any gradient information.

In some problems, called "composite optimization" problems, ZO methods face an additional challenge. The random way they estimate the gradient can introduce an extra source of variance, called "coordinate-wise variance." This coordinate-wise variance makes the optimization process slower and less accurate.

Previous methods for dealing with this coordinate-wise variance required estimating all the partial derivatives of the objective function, which is very computationally expensive, especially in problems with a lot of dimensions (i.e., a lot of parameters to optimize).

Technical Explanation

This paper proposes a new method called Zeroth-order Proximal Double Variance Reduction (ZPDVR) that can effectively reduce both the sampling variance and the coordinate-wise variance in composite optimization problems, without needing to estimate all the partial derivatives.

ZPDVR uses a technique called the "averaging trick" to reduce these two sources of variance. Importantly, ZPDVR only needs to call the stochastic zeroth-order oracle (SZO) - the function that provides the function value and gradient estimates - a constant number of times per iteration, on average. This is a significant improvement over previous methods, which required calling the SZO O(d) times per iteration, where d is the problem dimensionality.

The paper shows that ZPDVR achieves the optimal O(d(n + κ)log(1/ε)) SZO query complexity in the strongly convex and smooth setting, where n is the number of iterations, κ is the condition number, and ε is the desired accuracy. This means ZPDVR can converge to the optimal solution very efficiently, even in high-dimensional problems.

The empirical results presented in the paper validate ZPDVR's linear convergence and demonstrate its superior performance compared to other related methods, such as those discussed in this paper and this one.

Critical Analysis

The paper provides a thorough theoretical analysis of the ZPDVR method and its convergence guarantees. However, the authors do not discuss any potential limitations or drawbacks of their approach.

For example, the paper assumes the objective function is strongly convex and smooth, which may not always be the case in real-world optimization problems. It would be valuable to understand how ZPDVR performs under more general, non-convex or non-smooth settings.

Additionally, the paper focuses on the theoretical analysis and does not provide a comprehensive empirical evaluation. While the results show ZPDVR outperforming other methods, it would be helpful to see a more diverse set of experiments, including comparisons to other variance reduction techniques and gradient estimation methods, as well as an analysis of ZPDVR's practical computational overhead.

Conclusion

The ZPDVR method proposed in this paper represents an important advancement in the field of zeroth-order optimization for composite problems. By effectively reducing both sampling and coordinate-wise variance, ZPDVR can converge to the optimal solution much more efficiently than previous approaches, especially in high-dimensional settings.

While the theoretical guarantees are strong, further empirical validation and analysis of ZPDVR's limitations and applicability to a broader range of optimization problems would help solidify its practical impact and guide future research in this area.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Follow @aimodelsfyi on 𝕏 →

Related Papers

🛠️

Double Variance Reduction: A Smoothing Trick for Composite Optimization Problems without First-Order Gradient

Hao Di, Haishan Ye, Yueling Zhang, Xiangyu Chang, Guang Dai, Ivor W. Tsang

Variance reduction techniques are designed to decrease the sampling variance, thereby accelerating convergence rates of first-order (FO) and zeroth-order (ZO) optimization methods. However, in composite optimization problems, ZO methods encounter an additional variance called the coordinate-wise variance, which stems from the random gradient estimation. To reduce this variance, prior works require estimating all partial derivatives, essentially approximating FO information. This approach demands O(d) function evaluations (d is the dimension size), which incurs substantial computational costs and is prohibitive in high-dimensional scenarios. This paper proposes the Zeroth-order Proximal Double Variance Reduction (ZPDVR) method, which utilizes the averaging trick to reduce both sampling and coordinate-wise variances. Compared to prior methods, ZPDVR relies solely on random gradient estimates, calls the stochastic zeroth-order oracle (SZO) in expectation $mathcal{O}(1)$ times per iteration, and achieves the optimal $mathcal{O}(d(n + kappa)log (frac{1}{epsilon}))$ SZO query complexity in the strongly convex and smooth setting, where $kappa$ represents the condition number and $epsilon$ is the desired accuracy. Empirical results validate ZPDVR's linear convergence and demonstrate its superior performance over other related methods.

5/29/2024

➖

Variance reduction techniques for stochastic proximal point algorithms

Cheik Traor'e, Vassilis Apidopoulos, Saverio Salzo, Silvia Villa

In the context of finite sums minimization, variance reduction techniques are widely used to improve the performance of state-of-the-art stochastic gradient methods. Their practical impact is clear, as well as their theoretical properties. Stochastic proximal point algorithms have been studied as an alternative to stochastic gradient algorithms since they are more stable with respect to the choice of the step size. However, their variance-reduced versions are not as well studied as the gradient ones. In this work, we propose the first unified study of variance reduction techniques for stochastic proximal point algorithms. We introduce a generic stochastic proximal-based algorithm that can be specified to give the proximal version of SVRG, SAGA, and some of their variants. For this algorithm, in the smooth setting, we provide several convergence rates for the iterates and the objective function values, which are faster than those of the vanilla stochastic proximal point algorithm. More specifically, for convex functions, we prove a sublinear convergence rate of $O(1/k)$. In addition, under the Polyak-{L}ojasiewicz (PL) condition, we obtain linear convergence rates. Finally, our numerical experiments demonstrate the advantages of the proximal variance reduction methods over their gradient counterparts in terms of the stability with respect to the choice of the step size in most cases, especially for difficult problems.

8/7/2024

Projection-Free Variance Reduction Methods for Stochastic Constrained Multi-Level Compositional Optimization

Wei Jiang, Sifan Yang, Wenhao Yang, Yibo Wang, Yuanyu Wan, Lijun Zhang

This paper investigates projection-free algorithms for stochastic constrained multi-level optimization. In this context, the objective function is a nested composition of several smooth functions, and the decision set is closed and convex. Existing projection-free algorithms for solving this problem suffer from two limitations: 1) they solely focus on the gradient mapping criterion and fail to match the optimal sample complexities in unconstrained settings; 2) their analysis is exclusively applicable to non-convex functions, without considering convex and strongly convex objectives. To address these issues, we introduce novel projection-free variance reduction algorithms and analyze their complexities under different criteria. For gradient mapping, our complexities improve existing results and match the optimal rates for unconstrained problems. For the widely-used Frank-Wolfe gap criterion, we provide theoretical guarantees that align with those for single-level problems. Additionally, by using a stage-wise adaptation, we further obtain complexities for convex and strongly convex functions. Finally, numerical experiments on different tasks demonstrate the effectiveness of our methods.

6/7/2024

➖

Policy-based Primal-Dual Methods for Concave CMDP with Variance Reduction

Donghao Ying, Mengzi Amy Guo, Hyunin Lee, Yuhao Ding, Javad Lavaei, Zuo-Jun Max Shen

We study Concave Constrained Markov Decision Processes (Concave CMDPs) where both the objective and constraints are defined as concave functions of the state-action occupancy measure. We propose the Variance-Reduced Primal-Dual Policy Gradient Algorithm (VR-PDPG), which updates the primal variable via policy gradient ascent and the dual variable via projected sub-gradient descent. Despite the challenges posed by the loss of additivity structure and the nonconcave nature of the problem, we establish the global convergence of VR-PDPG by exploiting a form of hidden concavity. In the exact setting, we prove an $O(T^{-1/3})$ convergence rate for both the average optimality gap and constraint violation, which further improves to $O(T^{-1/2})$ under strong concavity of the objective in the occupancy measure. In the sample-based setting, we demonstrate that VR-PDPG achieves an $widetilde{O}(epsilon^{-4})$ sample complexity for $epsilon$-global optimality. Moreover, by incorporating a diminishing pessimistic term into the constraint, we show that VR-PDPG can attain a zero constraint violation without compromising the convergence rate of the optimality gap. Finally, we validate the effectiveness of our methods through numerical experiments.

5/28/2024