Open Problem: Anytime Convergence Rate of Gradient Descent

Guy Kornowski, Ohad Shamir

Recent results show that vanilla gradient descent can be accelerated for smooth convex objectives, merely by changing the stepsize sequence. We show that this can lead to surprisingly large errors indefinitely, and therefore ask: Is there any stepsize schedule for gradient descent that accelerates the classic $mathcal{O}(1/T)$ convergence rate, at emph{any} stopping time $T$?

6/21/2024

⛏️

On the Convergence of Gradient Descent for Large Learning Rates

Alexandru Cru{a}ciun, Debarghya Ghoshdastidar

A vast literature on convergence guarantees for gradient descent and derived methods exists at the moment. However, a simple practical situation remains unexplored: when a fixed step size is used, can we expect gradient descent to converge starting from any initialization? We provide fundamental impossibility results showing that convergence becomes impossible no matter the initialization if the step size gets too big. Looking at the asymptotic value of the gradient norm along the optimization trajectory, we see that there is a phase transition as the step size crosses a critical value. This has been observed by practitioners, yet the true mechanisms through which this happens remain unclear beyond heuristics. Using results from dynamical systems theory, we provide a proof of this in the case of linear neural networks with a squared loss. We also prove the impossibility of convergence for more general losses without requiring strong assumptions such as Lipschitz continuity for the gradient. We validate our findings through experiments with non-linear networks.

9/4/2024

⛏️

Convergence rates of stochastic gradient method with independent sequences of step-size and momentum weight

Wen-Liang Hwang

In large-scale learning algorithms, the momentum term is usually included in the stochastic sub-gradient method to improve the learning speed because it can navigate ravines efficiently to reach a local minimum. However, step-size and momentum weight hyper-parameters must be appropriately tuned to optimize convergence. We thus analyze the convergence rate using stochastic programming with Polyak's acceleration of two commonly used step-size learning rates: ``diminishing-to-zero and ``constant-and-drop (where the sequence is divided into stages and a constant step-size is applied at each stage) under strongly convex functions over a compact convex set with bounded sub-gradients. For the former, we show that the convergence rate can be written as a product of exponential in step-size and polynomial in momentum weight. Our analysis justifies the convergence of using the default momentum weight setting and the diminishing-to-zero step-size sequence in large-scale machine learning software. For the latter, we present the condition for the momentum weight sequence to converge at each stage.

8/7/2024

👁️

Convex SGD: Generalization Without Early Stopping

Julien Hendrickx, Alex Olshevsky

We consider the generalization error associated with stochastic gradient descent on a smooth convex function over a compact set. We show the first bound on the generalization error that vanishes when the number of iterations $T$ and the dataset size $n$ go to zero at arbitrary rates; our bound scales as $tilde{O}(1/sqrt{T} + 1/sqrt{n})$ with step-size $alpha_t = 1/sqrt{t}$. In particular, strong convexity is not needed for stochastic gradient descent to generalize well.

4/16/2024

Open Problem: Anytime Convergence Rate of Gradient Descent

Overview

Plain English Explanation

Technical Explanation

Critical Analysis

Conclusion

Related Papers