A New Random Reshuffling Method for Nonsmooth Nonconvex Finite-sum Optimization

Read original: arXiv:2312.01047 - Published 5/1/2024 by Junwen Qiu, Xiao Li, Andre Milzarek

🛠️

Overview

This paper proposes a new optimization method called "normal map-based proximal random reshuffling" (norm-PRR) for solving nonsmooth, nonconvex optimization problems.
The method is designed to work well on large-scale problems, such as training neural networks, where random reshuffling techniques are commonly used.
The paper analyzes the theoretical convergence properties of the norm-PRR method, showing it can achieve better iteration complexity compared to prior approaches.
The paper also demonstrates the efficiency of the norm-PRR method on nonconvex classification tasks through numerical experiments.

Plain English Explanation

Optimization problems are important in many fields, including machine learning, where they are used to train neural networks. One common technique for solving these problems is called "random reshuffling," where the order of the data is shuffled randomly during the optimization process.

The norm-PRR method proposed in this paper is a new way to do random reshuffling that works well for optimization problems that are "nonsmooth" and "nonconvex." Nonsmooth means the objective function may not be differentiable everywhere, and nonconvex means it has multiple local minima.

The key insight behind norm-PRR is to use a "normal map" to guide the random reshuffling in a way that helps the optimization converge faster. The authors show that norm-PRR can achieve better iteration complexity than previous methods for this class of problems, meaning it can find a good solution in fewer steps.

The paper also proves that norm-PRR has strong convergence guarantees - it is guaranteed to converge linearly (at a geometric rate) under certain conditions, and the entire sequence of iterates converges to a stationary point. These theoretical results are supported by numerical experiments on nonconvex classification tasks, which demonstrate the practical effectiveness of the norm-PRR method.

Technical Explanation

The norm-PRR method is designed to solve nonsmooth, nonconvex finite-sum optimization problems of the form:

minimize ∑_{i=1}^n f(x, i)

where x is the optimization variable, n is the number of component functions f(·, i), and the overall objective function is the sum of these component functions.

The key idea behind norm-PRR is to use a "normal map" to guide the random reshuffling of the component functions during the optimization process. The normal map is a function that provides information about the local geometry of the objective function, which the algorithm then uses to update the variable x in a way that promotes faster convergence.

Theoretically, the paper shows that norm-PRR achieves an iteration complexity of O(n^{-1/3}T^{-2/3}), where n is the number of component functions and T is the total number of iterations. This represents an improvement over the previously best-known complexity bounds for this class of problems.

Furthermore, the paper proves that norm-PRR converges linearly under the Polyak-Lojasiewicz condition and in the interpolation setting. It also shows that under the Kurdyka-Lojasiewicz inequality, the entire sequence of iterates generated by norm-PRR converges to a single stationary point, with last iterate convergence rates that can match those in the smooth, strongly convex setting.

The numerical experiments in the paper demonstrate the effectiveness of the norm-PRR method on nonconvex classification tasks, supporting the theoretical insights.

Critical Analysis

The paper provides a thorough theoretical analysis of the norm-PRR method, including its convergence properties and complexity bounds. However, there are a few potential limitations and areas for further research that could be considered:

The paper focuses on nonsmooth, nonconvex finite-sum problems, which is an important class of optimization problems, but there may be other problem structures or applications where the norm-PRR method may not be as effective.
The paper assumes the availability of a proximal oracle for the component functions, which may not be trivial to obtain in all practical scenarios.
The numerical experiments are limited to nonconvex classification tasks, and it would be valuable to see the performance of norm-PRR on a wider range of optimization problems, including larger-scale neural network training tasks.
The paper does not provide a direct comparison to other state-of-the-art methods for nonsmooth, nonconvex optimization, which would help better contextualize the improvements achieved by the norm-PRR approach.

Overall, the norm-PRR method appears to be a promising contribution to the field of nonsmooth, nonconvex optimization, but further research and validation on a broader set of problems and applications would be beneficial.

Conclusion

This paper introduces a new optimization method called "normal map-based proximal random reshuffling" (norm-PRR) for solving nonsmooth, nonconvex finite-sum optimization problems. The key innovation is the use of a normal map to guide the random reshuffling of the component functions, which helps the method achieve better iteration complexity and stronger convergence guarantees compared to previous approaches.

The theoretical analysis and numerical experiments presented in the paper suggest that the norm-PRR method can be an effective tool for optimizing large-scale, nonconvex machine learning models, such as neural networks. As the field of nonsmooth, nonconvex optimization continues to evolve, innovations like the norm-PRR method will be crucial for developing more efficient and reliable optimization algorithms to power the next generation of AI systems.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Follow @aimodelsfyi on 𝕏 →

Related Papers

🛠️

A New Random Reshuffling Method for Nonsmooth Nonconvex Finite-sum Optimization

Junwen Qiu, Xiao Li, Andre Milzarek

Random reshuffling techniques are prevalent in large-scale applications, such as training neural networks. While the convergence and acceleration effects of random reshuffling-type methods are fairly well understood in the smooth setting, much less studies seem available in the nonsmooth case. In this work, we design a new normal map-based proximal random reshuffling (norm-PRR) method for nonsmooth nonconvex finite-sum problems. We show that norm-PRR achieves the iteration complexity $O(n^{-1/3}T^{-2/3})$ where $n$ denotes the number of component functions $f(cdot,i)$ and $T$ counts the total number of iterations. This improves the currently known complexity bounds for this class of problems by a factor of $n^{-1/3}$. In addition, we prove that norm-PRR converges linearly under the (global) Polyak-Lojasiewicz condition and in the interpolation setting. We further complement these non-asymptotic results and provide an in-depth analysis of the asymptotic properties of norm-PRR. Specifically, under the (local) Kurdyka-Lojasiewicz inequality, the whole sequence of iterates generated by norm-PRR is shown to converge to a single stationary point. Moreover, we derive last iterate convergence rates that can match those in the smooth, strongly convex setting. Finally, numerical experiments are performed on nonconvex classification tasks to illustrate the efficiency of the proposed approach.

5/1/2024

🏋️

Convergence of ease-controlled Random Reshuffling gradient Algorithms under Lipschitz smoothness

Ruggiero Seccia, Corrado Coppola, Giampaolo Liuzzi, Laura Palagi

In this work, we consider minimizing the average of a very large number of smooth and possibly non-convex functions, and we focus on two widely used minibatch frameworks to tackle this optimization problem: Incremental Gradient (IG) and Random Reshuffling (RR). We define ease-controlled modifications of the IG/RR schemes, which require a light additional computational effort {but} can be proved to converge under {weak} and standard assumptions. In particular, we define two algorithmic schemes in which the IG/RR iteration is controlled by using a watchdog rule and a derivative-free linesearch that activates only sporadically to guarantee convergence. The two schemes differ in the watchdog and the linesearch, which are performed using either a monotonic or a non-monotonic rule. The two schemes also allow controlling the updating of the stepsize used in the main IG/RR iteration, avoiding the use of pre-set rules that may drive the stepsize to zero too fast, reducing the effort in designing effective updating rules of the stepsize. We prove convergence under the mild assumption of Lipschitz continuity of the gradients of the component functions and perform extensive computational analysis using different deep neural architectures and a benchmark of varying-size datasets. We compare our implementation with both a full batch gradient method (i.e. L-BFGS) and an implementation of IG/RR methods, proving that our algorithms require a similar computational effort compared to the other online algorithms and that the control on the learning rate may allow a faster decrease of the objective function.

5/22/2024

🎲

High Probability Complexity Bounds for Non-Smooth Stochastic Optimization with Heavy-Tailed Noise

Eduard Gorbunov, Marina Danilova, Innokentiy Shibaev, Pavel Dvurechensky, Alexander Gasnikov

Stochastic first-order methods are standard for training large-scale machine learning models. Random behavior may cause a particular run of an algorithm to result in a highly suboptimal objective value, whereas theoretical guarantees are usually proved for the expectation of the objective value. Thus, it is essential to theoretically guarantee that algorithms provide small objective residual with high probability. Existing methods for non-smooth stochastic convex optimization have complexity bounds with the dependence on the confidence level that is either negative-power or logarithmic but under an additional assumption of sub-Gaussian (light-tailed) noise distribution that may not hold in practice. In our paper, we resolve this issue and derive the first high-probability convergence results with logarithmic dependence on the confidence level for non-smooth convex stochastic optimization problems with non-sub-Gaussian (heavy-tailed) noise. To derive our results, we propose novel stepsize rules for two stochastic methods with gradient clipping. Moreover, our analysis works for generalized smooth objectives with Holder-continuous gradients, and for both methods, we provide an extension for strongly convex problems. Finally, our results imply that the first (accelerated) method we consider also has optimal iteration and oracle complexity in all the regimes, and the second one is optimal in the non-smooth setting.

9/2/2024

An Adaptive Second-order Method for a Class of Nonconvex Nonsmooth Composite Optimization

Hao Wang, Xiangyu Yang, Yichen Zhu

This paper explores a specific type of nonconvex sparsity-promoting regularization problems, namely those involving $ell_p$-norm regularization, in conjunction with a twice continuously differentiable loss function. We propose a novel second-order algorithm designed to effectively address this class of challenging nonconvex and nonsmooth problems, showcasing several innovative features: (i) The use of an alternating strategy to solve a reweighted $ell_1$ regularized subproblem and the subspace approximate Newton step. (ii) The reweighted $ell_1$ regularized subproblem relies on a convex approximation to the nonconvex regularization term, enabling a closed-form solution characterized by the soft-thresholding operator. This feature allows our method to be applied to various nonconvex regularization problems. (iii) Our algorithm ensures that the iterates maintain their sign values and that nonzero components are kept away from 0 for a sufficient number of iterations, eventually transitioning to a perturbed Newton method. (iv) We provide theoretical guarantees of global convergence, local superlinear convergence in the presence of the Kurdyka-L ojasiewicz (KL) property, and local quadratic convergence when employing the exact Newton step in our algorithm. We also showcase the effectiveness of our approach through experiments on a diverse set of model prediction problems.

8/20/2024