High Probability Complexity Bounds for Non-Smooth Stochastic Optimization with Heavy-Tailed Noise

Read original: arXiv:2106.05958 - Published 9/2/2024 by Eduard Gorbunov, Marina Danilova, Innokentiy Shibaev, Pavel Dvurechensky, Alexander Gasnikov

🎲

Overview

Stochastic first-order methods are common for training large-scale machine learning models
Random behavior can lead to highly suboptimal results, even though theoretical guarantees are usually for the expected objective value
It's important to ensure algorithms provide small objective residual with high probability
Existing methods for non-smooth stochastic convex optimization have complexity bounds with negative-power or logarithmic dependence on confidence level, under an assumption of sub-Gaussian noise that may not hold in practice

Plain English Explanation

In machine learning, we often use algorithms that make random choices during the training process. While these algorithms have theoretical guarantees on the expected performance, a particular run of the algorithm may still result in a very poor outcome. This paper aims to address this issue by developing new methods that can ensure the algorithm performs well with high probability, even if the noise in the data doesn't follow a specific distribution.

The key challenge is that existing methods for non-smooth optimization problems (where the objective function is not smooth) have complexity bounds that depend on the confidence level in a way that is either negative-power or logarithmic. This means that to get a high-probability guarantee, the algorithm has to run for a very long time. The authors of this paper resolve this issue by proposing new step-size rules for two stochastic optimization methods, along with an analysis that works for more general objective functions and can also handle strongly convex problems.

Technical Explanation

The paper introduces two new stochastic optimization methods with gradient clipping, and provides high-probability convergence guarantees for these methods in the setting of non-smooth, non-strongly convex optimization problems with non-sub-Gaussian (heavy-tailed) noise.

The first method is an accelerated stochastic gradient descent (SGD) algorithm, and the second is a proximal stochastic gradient method. For both methods, the authors derive step-size rules that lead to logarithmic dependence on the confidence level in the high-probability convergence bounds. This is in contrast to previous results, which had either negative-power or logarithmic dependence under the additional assumption of sub-Gaussian noise.

The analysis in the paper also covers generalized smooth objectives with Hölder-continuous gradients, and provides extensions for strongly convex problems. Additionally, the authors show that the first (accelerated) method has optimal iteration and oracle complexity in all regimes, while the second method is optimal in the non-smooth setting.

Critical Analysis

The paper addresses an important practical concern in machine learning, where the theoretical guarantees on algorithm performance may not always translate to reliable performance in practice. By relaxing the sub-Gaussian noise assumption, the authors have expanded the applicability of their methods to a wider range of real-world scenarios.

However, the analysis in the paper relies on several assumptions, such as Hölder-continuity of the gradients, which may not always hold in practice. Additionally, the paper does not provide experimental validation of the proposed methods, which would be helpful to assess their practical performance and compare them to existing approaches.

Overall, this research makes an important theoretical contribution, but further empirical investigation and analysis of the method's limitations would be valuable to fully understand its practical implications.

Conclusion

This paper presents new stochastic optimization methods with high-probability convergence guarantees for non-smooth, non-strongly convex problems with heavy-tailed noise. By relaxing the common sub-Gaussian assumption, the authors have developed algorithms that are more widely applicable to real-world machine learning scenarios.

The theoretical analysis establishes optimal complexity bounds for the proposed methods, which is a significant advancement over previous results. While the paper does not include empirical validation, the theoretical contributions are an important step forward in ensuring the reliable performance of stochastic optimization algorithms in practice.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Follow @aimodelsfyi on 𝕏 →

Related Papers

🎲

High Probability Complexity Bounds for Non-Smooth Stochastic Optimization with Heavy-Tailed Noise

Eduard Gorbunov, Marina Danilova, Innokentiy Shibaev, Pavel Dvurechensky, Alexander Gasnikov

Stochastic first-order methods are standard for training large-scale machine learning models. Random behavior may cause a particular run of an algorithm to result in a highly suboptimal objective value, whereas theoretical guarantees are usually proved for the expectation of the objective value. Thus, it is essential to theoretically guarantee that algorithms provide small objective residual with high probability. Existing methods for non-smooth stochastic convex optimization have complexity bounds with the dependence on the confidence level that is either negative-power or logarithmic but under an additional assumption of sub-Gaussian (light-tailed) noise distribution that may not hold in practice. In our paper, we resolve this issue and derive the first high-probability convergence results with logarithmic dependence on the confidence level for non-smooth convex stochastic optimization problems with non-sub-Gaussian (heavy-tailed) noise. To derive our results, we propose novel stepsize rules for two stochastic methods with gradient clipping. Moreover, our analysis works for generalized smooth objectives with Holder-continuous gradients, and for both methods, we provide an extension for strongly convex problems. Finally, our results imply that the first (accelerated) method we consider also has optimal iteration and oracle complexity in all the regimes, and the second one is optimal in the non-smooth setting.

9/2/2024

➖

High-probability Convergence Bounds for Nonlinear Stochastic Gradient Descent Under Heavy-tailed Noise

Aleksandar Armacki, Pranay Sharma, Gauri Joshi, Dragana Bajovic, Dusan Jakovetic, Soummya Kar

We study high-probability convergence guarantees of learning on streaming data in the presence of heavy-tailed noise. In the proposed scenario, the model is updated in an online fashion, as new information is observed, without storing any additional data. To combat the heavy-tailed noise, we consider a general framework of nonlinear stochastic gradient descent (SGD), providing several strong results. First, for non-convex costs and component-wise nonlinearities, we establish a convergence rate arbitrarily close to $mathcal{O}left(t^{-frac{1}{4}}right)$, whose exponent is independent of noise and problem parameters. Second, for strongly convex costs and a broader class of nonlinearities, we establish convergence of the last iterate to the optimum, with a rate $mathcal{O}left(t^{-zeta} right)$, where $zeta in (0,1)$ depends on problem parameters, noise and nonlinearity. As we show analytically and numerically, $zeta$ can be used to inform the preferred choice of nonlinearity for given problem settings. Compared to state-of-the-art, who only consider clipping, require bounded noise moments of order $eta in (1,2]$, and establish convergence rates whose exponents go to zero as $eta rightarrow 1$, we provide high-probability guarantees for a much broader class of nonlinearities and symmetric density noise, with convergence rates whose exponents are bounded away from zero, even when the noise has finite first moment only. Moreover, in the case of strongly convex functions, we demonstrate analytically and numerically that clipping is not always the optimal nonlinearity, further underlining the value of our general framework.

4/22/2024

🎯

High-Probability Convergence for Composite and Distributed Stochastic Minimization and Variational Inequalities with Heavy-Tailed Noise

Eduard Gorbunov, Abdurakhmon Sadiev, Marina Danilova, Samuel Horv'ath, Gauthier Gidel, Pavel Dvurechensky, Alexander Gasnikov, Peter Richt'arik

High-probability analysis of stochastic first-order optimization methods under mild assumptions on the noise has been gaining a lot of attention in recent years. Typically, gradient clipping is one of the key algorithmic ingredients to derive good high-probability guarantees when the noise is heavy-tailed. However, if implemented naively, clipping can spoil the convergence of the popular methods for composite and distributed optimization (Prox-SGD/Parallel SGD) even in the absence of any noise. Due to this reason, many works on high-probability analysis consider only unconstrained non-distributed problems, and the existing results for composite/distributed problems do not include some important special cases (like strongly convex problems) and are not optimal. To address this issue, we propose new stochastic methods for composite and distributed optimization based on the clipping of stochastic gradient differences and prove tight high-probability convergence results (including nearly optimal ones) for the new methods. Using similar ideas, we also develop new methods for composite and distributed variational inequalities and analyze the high-probability convergence of these methods.

7/25/2024

A Functional Model Method for Nonconvex Nonsmooth Conditional Stochastic Optimization

Andrzej Ruszczy'nski, Shangzhe Yang

We consider stochastic optimization problems involving an expected value of a nonlinear function of a base random vector and a conditional expectation of another function depending on the base random vector, a dependent random vector, and the decision variables. We call such problems conditional stochastic optimization problems. They arise in many applications, such as uplift modeling, reinforcement learning, and contextual optimization. We propose a specialized single time-scale stochastic method for nonconvex constrained conditional stochastic optimization problems with a Lipschitz smooth outer function and a generalized differentiable inner function. In the method, we approximate the inner conditional expectation with a rich parametric model whose mean squared error satisfies a stochastic version of a {L}ojasiewicz condition. The model is used by an inner learning algorithm. The main feature of our approach is that unbiased stochastic estimates of the directions used by the method can be generated with one observation from the joint distribution per iteration, which makes it applicable to real-time learning. The directions, however, are not gradients or subgradients of any overall objective function. We prove the convergence of the method with probability one, using the method of differential inclusions and a specially designed Lyapunov function, involving a stochastic generalization of the Bregman distance. Finally, a numerical illustration demonstrates the viability of our approach.

5/20/2024