Convergence Rates for Stochastic Approximation: Biased Noise with Unbounded Variance, and Applications

2312.02828

Published 5/14/2024 by Rajeeva L. Karandikar, M. Vidyasagar

🎲

Abstract

In this paper, we study the convergence properties of the Stochastic Gradient Descent (SGD) method for finding a stationary point of a given objective function $J(cdot)$. The objective function is not required to be convex. Rather, our results apply to a class of invex'' functions, which have the property that every stationary point is also a global minimizer. First, it is assumed that $J(cdot)$ satisfies a property that is slightly weaker than the Kurdyka-Lojasiewicz (KL) condition, denoted here as (KL'). It is shown that the iterations $J({boldsymbol theta}_t)$ converge almost surely to the global minimum of $J(cdot)$. Next, the hypothesis on $J(cdot)$ is strengthened from (KL') to the Polyak-Lojasiewicz (PL) condition. With this stronger hypothesis, we derive estimates on the rate of convergence of $J({boldsymbol theta}_t)$ to its limit. Using these results, we show that for functions satisfying the PL property, the convergence rate of SGD is the same as the best-possible rate for convex functions. While some results along these lines have been published in the past, our contributions contain two distinct improvements. First, the assumptions on the stochastic gradient are more general than elsewhere, and second, our convergence is almost sure, and not in expectation. We also study SGD when only function evaluations are permitted. In this setting, we determine the optimal'' increments or the size of the perturbations. Using the same set of ideas, we establish the global convergence of the Stochastic Approximation (SA) algorithm under more general assumptions on the measurement error, compared to the existing literature. We also derive bounds on the rate of convergence of the SA algorithm under appropriate assumptions.

Create account to get full access

Overview

This paper studies the convergence properties of Stochastic Gradient Descent (SGD), a widely used optimization algorithm, when applied to a class of "invex" functions.
The authors show that under certain conditions, SGD can converge almost surely to the global minimum of the objective function, even when the function is not convex.
The paper also derives bounds on the rate of convergence for functions satisfying a stronger condition called the Polyak-Lojasiewicz (PL) property.
Additionally, the authors study SGD when only function evaluations are permitted, and they establish the global convergence of the Stochastic Approximation (SA) algorithm under more general assumptions.

Plain English Explanation

The paper explores the behavior of the Stochastic Gradient Descent (SGD) algorithm, which is a popular method for optimizing complex functions. Unlike traditional optimization techniques, SGD doesn't require the function to be convex (a specific mathematical property). Instead, the authors show that SGD can still converge to the global minimum of the function, as long as the function satisfies a slightly weaker condition called the "invex" property.

This means that even if the function has multiple local minima, SGD can still find the overall best solution.

The paper also looks at a stronger condition called the Polyak-Lojasiewicz (PL) property, which allows the authors to derive more precise estimates on how quickly SGD converges to the minimum. Interestingly, they show that for functions with the PL property, the convergence rate of SGD is as fast as the best possible rate for convex functions.

The authors also study a variant of SGD where only function evaluations are allowed, rather than the full gradient information. In this case, they determine the optimal size of the random perturbations to make the algorithm converge globally.

Finally, the paper establishes the global convergence of the Stochastic Approximation (SA) algorithm, which is related to SGD, under more general assumptions about the measurement errors in the function evaluations.

This work builds on and improves upon previous research in this area, with key advances in the assumptions made and the types of convergence guarantees provided.

Technical Explanation

The paper starts by considering the problem of finding a stationary point of a given objective function $J(\cdot)$, which is not required to be convex. Instead, the authors assume that $J(\cdot)$ satisfies a condition slightly weaker than the Kurdyka-Lojasiewicz (KL) condition, denoted as (KL'). Under this assumption, they show that the iterates $J(\boldsymbol{\theta}_t)$ generated by the SGD method converge almost surely to the global minimum of $J(\cdot)$.

Next, the authors strengthen the hypothesis on $J(\cdot)$ from (KL') to the Polyak-Lojasiewicz (PL) condition. With this stronger assumption, they derive estimates on the rate of convergence of $J(\boldsymbol{\theta}_t)$ to its limit. Interestingly, they show that for functions satisfying the PL property, the convergence rate of SGD is the same as the best-possible rate for convex functions.

The authors also study SGD when only function evaluations are permitted, rather than the full gradient information. In this setting, they determine the "optimal" increments or the size of the random perturbations to ensure global convergence.

Finally, the paper establishes the global convergence of the Stochastic Approximation (SA) algorithm under more general assumptions on the measurement error, compared to the existing literature. They also derive bounds on the rate of convergence of the SA algorithm under appropriate assumptions.

The key contributions of this work are the more general assumptions on the stochastic gradient, and the almost sure convergence results, as opposed to just convergence in expectation.

Critical Analysis

The paper makes several important advancements in the theoretical understanding of SGD and related algorithms. By relaxing the assumptions on the objective function and considering more general noise models, the authors have expanded the applicability of these optimization methods.

However, the paper does not address the practical implementation details or the sensitivity of the algorithms to hyperparameter choices. Additionally, the analysis is limited to the asymptotic behavior and does not provide finite-time guarantees, which may be more relevant in many real-world applications.

Further research could explore the robustness of these algorithms to model misspecification, the impact of various sampling strategies, and the extension to more complex problem settings, such as constrained optimization or composite objectives.

Conclusion

This paper makes significant contributions to the convergence analysis of Stochastic Gradient Descent and Stochastic Approximation algorithms. By considering a broader class of objective functions and more general noise models, the authors have expanded the theoretical foundations of these widely used optimization techniques.

The results show that under appropriate conditions, SGD and SA can converge to the global minimum, even when the objective function is not convex. Moreover, the authors derive tight bounds on the convergence rates, which can inform the practical deployment of these algorithms. These insights have the potential to drive further advancements in optimization methods and their applications in machine learning and beyond.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

➖

High-probability Convergence Bounds for Nonlinear Stochastic Gradient Descent Under Heavy-tailed Noise

Aleksandar Armacki, Pranay Sharma, Gauri Joshi, Dragana Bajovic, Dusan Jakovetic, Soummya Kar

We study high-probability convergence guarantees of learning on streaming data in the presence of heavy-tailed noise. In the proposed scenario, the model is updated in an online fashion, as new information is observed, without storing any additional data. To combat the heavy-tailed noise, we consider a general framework of nonlinear stochastic gradient descent (SGD), providing several strong results. First, for non-convex costs and component-wise nonlinearities, we establish a convergence rate arbitrarily close to $mathcal{O}left(t^{-frac{1}{4}}right)$, whose exponent is independent of noise and problem parameters. Second, for strongly convex costs and a broader class of nonlinearities, we establish convergence of the last iterate to the optimum, with a rate $mathcal{O}left(t^{-zeta} right)$, where $zeta in (0,1)$ depends on problem parameters, noise and nonlinearity. As we show analytically and numerically, $zeta$ can be used to inform the preferred choice of nonlinearity for given problem settings. Compared to state-of-the-art, who only consider clipping, require bounded noise moments of order $eta in (1,2]$, and establish convergence rates whose exponents go to zero as $eta rightarrow 1$, we provide high-probability guarantees for a much broader class of nonlinearities and symmetric density noise, with convergence rates whose exponents are bounded away from zero, even when the noise has finite first moment only. Moreover, in the case of strongly convex functions, we demonstrate analytically and numerically that clipping is not always the optimal nonlinearity, further underlining the value of our general framework.

4/22/2024

cs.LG stat.ML

Derivatives of Stochastic Gradient Descent

Franck Iutzeler, Edouard Pauwels, Samuel Vaiter

We consider stochastic optimization problems where the objective depends on some parameter, as commonly found in hyperparameter optimization for instance. We investigate the behavior of the derivatives of the iterates of Stochastic Gradient Descent (SGD) with respect to that parameter and show that they are driven by an inexact SGD recursion on a different objective function, perturbed by the convergence of the original SGD. This enables us to establish that the derivatives of SGD converge to the derivative of the solution mapping in terms of mean squared error whenever the objective is strongly convex. Specifically, we demonstrate that with constant step-sizes, these derivatives stabilize within a noise ball centered at the solution derivative, and that with vanishing step-sizes they exhibit $O(log(k)^2 / k)$ convergence rates. Additionally, we prove exponential convergence in the interpolation regime. Our theoretical findings are illustrated by numerical experiments on synthetic tasks.

5/28/2024

cs.LG

🔗

Convergence Analysis of Adaptive Gradient Methods under Refined Smoothness and Noise Assumptions

Devyani Maladkar, Ruichen Jiang, Aryan Mokhtari

Adaptive gradient methods are arguably the most successful optimization algorithms for neural network training. While it is well-known that adaptive gradient methods can achieve better dimensional dependence than stochastic gradient descent (SGD) under favorable geometry for stochastic convex optimization, the theoretical justification for their success in stochastic non-convex optimization remains elusive. In this paper, we aim to close this gap by analyzing the convergence rates of AdaGrad measured by the $ell_1$-norm of the gradient. Specifically, when the objective has $L$-Lipschitz gradient and the stochastic gradient variance is bounded by $sigma^2$, we prove a worst-case convergence rate of $tilde{mathcal{O}}(frac{sqrt{d}L}{sqrt{T}} + frac{sqrt{d} sigma}{T^{1/4}})$, where $d$ is the dimension of the problem.We also present a lower bound of ${Omega}(frac{sqrt{d}}{sqrt{T}})$ for minimizing the gradient $ell_1$-norm in the deterministic setting, showing the tightness of our upper bound in the noiseless case. Moreover, under more fine-grained assumptions on the smoothness structure of the objective and the gradient noise and under favorable gradient $ell_1/ell_2$ geometry, we show that AdaGrad can potentially shave a factor of $sqrt{d}$ compared to SGD. To the best of our knowledge, this is the first result for adaptive gradient methods that demonstrates a provable gain over SGD in the non-convex setting.

6/10/2024

cs.LG stat.ML

Effect of Random Learning Rate: Theoretical Analysis of SGD Dynamics in Non-Convex Optimization via Stationary Distribution

Naoki Yoshida, Shogo Nakakita, Masaaki Imaizumi

We consider a variant of the stochastic gradient descent (SGD) with a random learning rate and reveal its convergence properties. SGD is a widely used stochastic optimization algorithm in machine learning, especially deep learning. Numerous studies reveal the convergence properties of SGD and its simplified variants. Among these, the analysis of convergence using a stationary distribution of updated parameters provides generalizable results. However, to obtain a stationary distribution, the update direction of the parameters must not degenerate, which limits the applicable variants of SGD. In this study, we consider a novel SGD variant, Poisson SGD, which has degenerated parameter update directions and instead utilizes a random learning rate. Consequently, we demonstrate that a distribution of a parameter updated by Poisson SGD converges to a stationary distribution under weak assumptions on a loss function. Based on this, we further show that Poisson SGD finds global minima in non-convex optimization problems and also evaluate the generalization error using this method. As a proof technique, we approximate the distribution by Poisson SGD with that of the bouncy particle sampler (BPS) and derive its stationary distribution, using the theoretical advance of the piece-wise deterministic Markov process (PDMP).

6/26/2024

stat.ML cs.LG