Can We Remove the Square-Root in Adaptive Gradient Methods? A Second-Order Perspective

2402.03496

Published 6/18/2024 by Wu Lin, Felix Dangel, Runa Eschenhagen, Juhan Bae, Richard E. Turner, Alireza Makhzani

Can We Remove the Square-Root in Adaptive Gradient Methods? A Second-Order Perspective

Abstract

Adaptive gradient optimizers like Adam(W) are the default training algorithms for many deep learning architectures, such as transformers. Their diagonal preconditioner is based on the gradient outer product which is incorporated into the parameter update via a square root. While these methods are often motivated as approximate second-order methods, the square root represents a fundamental difference. In this work, we investigate how the behavior of adaptive methods changes when we remove the root, i.e. strengthen their second-order motivation. Surprisingly, we find that such square-root-free adaptive methods close the generalization gap to SGD on convolutional architectures, while maintaining their root-based counterpart's performance on transformers. The second-order perspective also has practical benefits for developing non-diagonal adaptive methods through the concept of preconditioner invariance. In contrast to root-based methods like Shampoo, root-free counterparts work well and fast with half-precision since they do not require numerically unstable matrix root decompositions and inversions. This is useful to bridge the computation gap between diagonal and non-diagonal methods. Our findings provide new insights into the development of adaptive methods and raise important questions regarding the currently overlooked role of adaptivity for their success. (experiment code: https://github.com/yorkerlin/remove-the-square-root optimizer code: https://github.com/f-dangel/sirfshampoo)

Create account to get full access

Overview

The paper explores whether the square root term in adaptive gradient methods like Adam can be removed to improve optimization performance.
It takes a second-order perspective, analyzing the behavior of these methods from the lens of natural gradient descent.
The researchers propose a new method called AdaFisher that aims to combine the benefits of adaptive gradient methods and natural gradient descent.

Plain English Explanation

The paper looks at a common technique in machine learning called "adaptive gradient methods". These methods, like the popular Adam algorithm, are used to optimize neural networks and other machine learning models.

One key part of these adaptive methods is the square root term. The researchers here ask an interesting question - can we remove this square root term and still get good performance? They argue that the square root plays an important role in balancing the updates, but may not be strictly necessary.

To explore this, the researchers take a "second-order" view of these methods. This means they analyze them not just in terms of the gradients, but also considering the curvature of the optimization landscape. This allows them to connect adaptive methods to a technique called "natural gradient descent" [<a href="https://aimodels.fyi/papers/arxiv/inverse-free-fast-natural-gradient-descent-method">1</a>], which is known to be effective but can be computationally expensive.

Building on this insight, the researchers propose a new method called "AdaFisher" [<a href="https://aimodels.fyi/papers/arxiv/adafisher-adaptive-second-order-optimization-via-fisher">2</a>] that tries to combine the benefits of adaptive gradients and natural gradients. The goal is to get the speed and simplicity of adaptive methods, while maintaining the strong theoretical guarantees of natural gradients.

Technical Explanation

The paper starts by providing a "first-order" view of adaptive gradient methods like Adam [<a href="https://aimodels.fyi/papers/arxiv/studying-k-fac-heuristics-by-viewing-adam">3</a>]. These methods maintain running estimates of the first and second moments of the gradients, and use these to adaptively scale the updates. The square root term is used to balance these two moments.

The researchers then take a "second-order" perspective, analyzing these methods through the lens of natural gradient descent [<a href="https://aimodels.fyi/papers/arxiv/inverse-free-fast-natural-gradient-descent-method">1</a>]. They show that adaptive methods implicitly perform a kind of approximate natural gradient update. This suggests the square root term may not be strictly necessary.

Building on this insight, the researchers propose a new method called AdaFisher [<a href="https://aimodels.fyi/papers/arxiv/adafisher-adaptive-second-order-optimization-via-fisher">2</a>]. AdaFisher uses an approximation of the Fisher information matrix to capture curvature, while maintaining the simplicity and efficiency of adaptive gradients. Experiments show AdaFisher can match or outperform Adam and natural gradient descent on a variety of benchmarks.

Critical Analysis

The paper provides a thoughtful analysis of adaptive gradient methods from a second-order perspective. The connection to natural gradients is an interesting and insightful result. However, the researchers acknowledge that their proposed AdaFisher method still relies on some approximations and heuristics.

Additionally, the paper focuses primarily on convex optimization problems. While these insights may also apply to non-convex deep learning settings, further research would be needed to fully validate the approach in those more complex scenarios [<a href="https://aimodels.fyi/papers/arxiv/adaptive-gradient-methods-at-edge-stability">4</a>].

It would also be valuable to see more in-depth analysis of the tradeoffs between the simplicity of adaptive gradients and the stronger theoretical guarantees of natural gradients. The paper touches on this, but a more comprehensive treatment could help guide practitioners in choosing the most appropriate optimization method for their needs.

Conclusion

This paper takes an innovative second-order view of adaptive gradient methods, connecting them to natural gradient descent. The proposed AdaFisher method aims to combine the benefits of both approaches, offering efficient optimization without sacrificing strong theoretical underpinnings.

While there are still some open questions and limitations, this research represents an important step forward in understanding and improving upon the widely used family of adaptive gradient techniques. As machine learning models become more complex, advances like this that enhance optimization performance could have significant impacts across a variety of applications.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

AdaFisher: Adaptive Second Order Optimization via Fisher Information

Damien Martins Gomes, Yanlei Zhang, Eugene Belilovsky, Guy Wolf, Mahdi S. Hosseini

First-order optimization methods are currently the mainstream in training deep neural networks (DNNs). Optimizers like Adam incorporate limited curvature information by employing the diagonal matrix preconditioning of the stochastic gradient during the training. Despite their widespread, second-order optimization algorithms exhibit superior convergence properties compared to their first-order counterparts e.g. Adam and SGD. However, their practicality in training DNNs are still limited due to increased per-iteration computations and suboptimal accuracy compared to the first order methods. We present AdaFisher--an adaptive second-order optimizer that leverages a block-diagonal approximation to the Fisher information matrix for adaptive gradient preconditioning. AdaFisher aims to bridge the gap between enhanced convergence capabilities and computational efficiency in second-order optimization framework for training DNNs. Despite the slow pace of second-order optimizers, we showcase that AdaFisher can be reliably adopted for image classification, language modelling and stand out for its stability and robustness in hyperparameter tuning. We demonstrate that AdaFisher outperforms the SOTA optimizers in terms of both accuracy and convergence speed. Code available from href{https://github.com/AtlasAnalyticsLab/AdaFisher}{https://github.com/AtlasAnalyticsLab/AdaFisher}

5/28/2024

cs.LG

Remove that Square Root: A New Efficient Scale-Invariant Version of AdaGrad

Sayantan Choudhury, Nazarii Tupitsa, Nicolas Loizou, Samuel Horvath, Martin Takac, Eduard Gorbunov

Adaptive methods are extremely popular in machine learning as they make learning rate tuning less expensive. This paper introduces a novel optimization algorithm named KATE, which presents a scale-invariant adaptation of the well-known AdaGrad algorithm. We prove the scale-invariance of KATE for the case of Generalized Linear Models. Moreover, for general smooth non-convex problems, we establish a convergence rate of $O left(frac{log T}{sqrt{T}} right)$ for KATE, matching the best-known ones for AdaGrad and Adam. We also compare KATE to other state-of-the-art adaptive algorithms Adam and AdaGrad in numerical experiments with different problems, including complex machine learning tasks like image classification and text classification on real data. The results indicate that KATE consistently outperforms AdaGrad and matches/surpasses the performance of Adam in all considered scenarios.

6/6/2024

cs.LG cs.AI

🔍

ROOT-SGD: Sharp Nonasymptotics and Near-Optimal Asymptotics in a Single Algorithm

Chris Junchi Li

We study the problem of solving strongly convex and smooth unconstrained optimization problems using stochastic first-order algorithms. We devise a novel algorithm, referred to as emph{Recursive One-Over-T SGD} (textsf{ROOT-SGD}), based on an easily implementable, recursive averaging of past stochastic gradients. We prove that it simultaneously achieves state-of-the-art performance in both a finite-sample, nonasymptotic sense and an asymptotic sense. On the nonasymptotic side, we prove risk bounds on the last iterate of textsf{ROOT-SGD} with leading-order terms that match the optimal statistical risk with a unity pre-factor, along with a higher-order term that scales at the sharp rate of $O(n^{-3/2})$ under the Lipschitz condition on the Hessian matrix. On the asymptotic side, we show that when a mild, one-point Hessian continuity condition is imposed, the rescaled last iterate of (multi-epoch) textsf{ROOT-SGD} converges asymptotically to a Gaussian limit with the Cram'{e}r-Rao optimal asymptotic covariance, for a broad range of step-size choices.

6/19/2024

cs.LG stat.ML

🗣️

Studying K-FAC Heuristics by Viewing Adam through a Second-Order Lens

Ross M. Clarke, Jos'e Miguel Hern'andez-Lobato

Research into optimisation for deep learning is characterised by a tension between the computational efficiency of first-order, gradient-based methods (such as SGD and Adam) and the theoretical efficiency of second-order, curvature-based methods (such as quasi-Newton methods and K-FAC). Noting that second-order methods often only function effectively with the addition of stabilising heuristics (such as Levenberg-Marquardt damping), we ask how much these (as opposed to the second-order curvature model) contribute to second-order algorithms' performance. We thus study AdamQLR: an optimiser combining damping and learning rate selection techniques from K-FAC (Martens & Grosse, 2015) with the update directions proposed by Adam, inspired by considering Adam through a second-order lens. We evaluate AdamQLR on a range of regression and classification tasks at various scales and hyperparameter tuning methodologies, concluding K-FAC's adaptive heuristics are of variable standalone general effectiveness, and finding an untuned AdamQLR setting can achieve comparable performance vs runtime to tuned benchmarks.

6/17/2024

cs.LG stat.ML