MoMo: Momentum Models for Adaptive Learning Rates

2305.07583

Published 6/6/2024 by Fabian Schaipp, Ruben Ohana, Michael Eickenberg, Aaron Defazio, Robert M. Gower

Abstract

Training a modern machine learning architecture on a new task requires extensive learning-rate tuning, which comes at a high computational cost. Here we develop new Polyak-type adaptive learning rates that can be used on top of any momentum method, and require less tuning to perform well. We first develop MoMo, a Momentum Model based adaptive learning rate for SGD-M (stochastic gradient descent with momentum). MoMo uses momentum estimates of the losses and gradients sampled at each iteration to build a model of the loss function. Our model makes use of any known lower bound of the loss function by using truncation, e.g. most losses are lower-bounded by zero. The model is then approximately minimized at each iteration to compute the next step. We show how MoMo can be used in combination with any momentum-based method, and showcase this by developing MoMo-Adam, which is Adam with our new model-based adaptive learning rate. We show that MoMo attains a $\mathcal{O}(1/\sqrt{K})$ convergence rate for convex problems with interpolation, needing knowledge of no problem-specific quantities other than the optimal value. Additionally, for losses with unknown lower bounds, we develop on-the-fly estimates of a lower bound, that are incorporated in our model. We show that MoMo and MoMo-Adam improve over SGD-M and Adam in terms of robustness to hyperparameter tuning for training image classifiers on MNIST, CIFAR, and Imagenet, for recommender systems on Criteo, for a transformer model on the translation task IWSLT14, and for a diffusion model.

Overview

Introduces a new adaptive learning rate method called MoMo that can be used with any momentum-based optimization algorithm
MoMo uses momentum estimates of the losses and gradients to build a model of the loss function, which is then approximately minimized to compute the next step
Shows that MoMo can attain a convergence rate of O(1/√K) for convex problems with interpolation, without needing to know any problem-specific quantities other than the optimal value
Demonstrates that MoMo and a variant called MoMo-Adam outperform traditional methods like SGD-M and Adam in terms of robustness to hyperparameter tuning on a variety of tasks

Plain English Explanation

Training modern machine learning models on new tasks often requires extensive tuning of the learning rate, which can be computationally expensive. To address this, the researchers developed a new type of adaptive learning rate called MoMo (Momentum Model) that can be used on top of any momentum-based optimization method.

MoMo works by using the momentum estimates of the losses and gradients at each iteration to build a model of the loss function. This model incorporates any known lower bound of the loss function, such as zero for most losses. MoMo then approximately minimizes this model to determine the next step, rather than relying on a fixed learning rate.

The key benefit of MoMo is that it requires less tuning to perform well, as it can automatically adjust the learning rate based on the structure of the loss function. The researchers show that MoMo can achieve a convergence rate of O(1/√K) for convex problems with interpolation, without needing to know any problem-specific quantities other than the optimal value.

Additionally, for losses with unknown lower bounds, the researchers developed on-the-fly estimates of a lower bound that are incorporated into the MoMo model. They then demonstrate that MoMo and a variant called MoMo-Adam (which combines MoMo with the Adam optimizer) outperform traditional methods like SGD-M and Adam in terms of robustness to hyperparameter tuning on a variety of tasks, including image classification, recommender systems, machine translation, and diffusion models.

Technical Explanation

The key technical contribution of this work is the development of the MoMo (Momentum Model) adaptive learning rate method, which can be used in combination with any momentum-based optimization algorithm, such as SGD with Momentum (SGD-M) or Adam.

MoMo works by using the momentum estimates of the losses and gradients sampled at each iteration to build a model of the loss function. This model makes use of any known lower bound of the loss function, such as zero for most losses, by using truncation. The model is then approximately minimized at each iteration to compute the next step, rather than relying on a fixed learning rate.
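The description above can be made concrete with a small sketch. This summary does not give the update formulas, so the code below is an illustration under stated assumptions rather than the authors' implementation: it assumes exponentially weighted averages with coefficient beta for the losses, the gradients, and the gradient-iterate inner products, and a step size that is the smaller of a user-set cap and a Polyak-type ratio truncated at the lower bound. The names momo_sketch, lr_cap, and lower_bound are hypothetical.

```python
from typing import Callable, List, Sequence, Tuple

Vector = List[float]


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean inner product of two equally sized vectors."""
    return sum(ai * bi for ai, bi in zip(a, b))


def momo_sketch(
    loss_and_grad: Callable[[Vector], Tuple[float, Vector]],
    x0: Sequence[float],
    steps: int,
    lr_cap: float = 1.0,       # assumed user-set upper bound on the step size
    beta: float = 0.9,         # assumed momentum (averaging) coefficient
    lower_bound: float = 0.0,  # known lower bound of the loss
) -> Tuple[Vector, List[float]]:
    """Illustrative momentum-model step for SGD with momentum (an assumption-laden sketch)."""
    x = list(x0)
    loss, grad = loss_and_grad(x)
    f_bar = loss          # averaged loss values
    d = list(grad)        # averaged gradients (the momentum direction)
    gamma = dot(grad, x)  # averaged inner products <gradient, iterate>
    step_sizes: List[float] = []

    for _ in range(steps):
        loss, grad = loss_and_grad(x)
        f_bar = (1 - beta) * loss + beta * f_bar
        gamma = (1 - beta) * dot(grad, x) + beta * gamma
        d = [(1 - beta) * g + beta * di for g, di in zip(grad, d)]

        # Value at the current iterate of the averaged linear model of the loss.
        h = f_bar + dot(d, x) - gamma
        d_sq = dot(d, d)
        if d_sq == 0.0:
            break  # zero direction: nothing to update

        # Truncation: no step is taken once the model is at or below the lower bound.
        tau = min(lr_cap, max(h - lower_bound, 0.0) / d_sq)
        step_sizes.append(tau)
        x = [xi - tau * di for xi, di in zip(x, d)]

    return x, step_sizes


def quadratic(x: Vector) -> Tuple[float, Vector]:
    """Toy loss 0.5 * ||x||^2 with gradient x; its minimum value is 0."""
    return 0.5 * dot(x, x), list(x)


x_final, taus = momo_sketch(quadratic, [3.0, -4.0], steps=50)
```

On this toy quadratic the first step size comes out at roughly 0.5, well below the cap of 1.0, which shows the model choosing the step instead of the user-set learning rate. The cap plays the role of the usual learning-rate hyperparameter and only becomes active when the model-based ratio exceeds it.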

The researchers show that MoMo can attain a convergence rate of O(1/√K) for convex problems with interpolation, without needing to know any problem-specific quantities other than the optimal value. This is a stronger guarantee than traditional adaptive methods like Adam, which require more problem-specific information.

Additionally, for losses with unknown lower bounds, the researchers develop on-the-fly estimates of a lower bound that are incorporated into the MoMo model. They then demonstrate the effectiveness of MoMo and MoMo-Adam (a variant that combines MoMo with the Adam optimizer) on a variety of tasks, including image classification, recommender systems, machine translation, and diffusion models.

Critical Analysis

The researchers have presented a novel and promising approach to adaptive learning rates with their MoMo method. By incorporating momentum estimates and known lower bounds of the loss function, MoMo is able to achieve strong convergence guarantees and improved robustness to hyperparameter tuning compared to traditional methods.

However, the paper does not address the potential computational overhead of maintaining and updating the loss function model at each iteration. This could be a concern, especially for large-scale or real-time applications. Additionally, the researchers only provide theoretical guarantees for convex problems with interpolation, and it would be helpful to see more analysis on the performance of MoMo in non-convex settings, which are more common in modern machine learning.

Furthermore, while the empirical results on a variety of tasks are encouraging, it would be valuable to see a more in-depth discussion of the underlying reasons for the improvements, as well as potential failure cases or limitations of the method. Exploring the behavior of MoMo in the context of other adaptive methods, such as Omni-Smola or Random Scaling Momentum, could also provide additional insights.

Overall, the MoMo method represents a compelling approach to adaptive learning rates, and the researchers have demonstrated its potential through a range of experiments. Further investigation into the practical implementation details and performance in more challenging settings would help to fully evaluate the method's impact and guide future research in this area.

Conclusion

The paper presents a novel adaptive learning rate method called MoMo that can be used with any momentum-based optimization algorithm. MoMo builds a model of the loss function using momentum estimates of the losses and gradients, and then approximately minimizes this model to compute the next step.

The key benefits of MoMo are its strong theoretical guarantees, achieving a convergence rate of O(1/√K) for convex problems with interpolation, and its improved robustness to hyperparameter tuning compared to traditional methods like SGD-M and Adam. The researchers also demonstrate the effectiveness of MoMo and a variant called MoMo-Adam on a variety of tasks, including image classification, recommender systems, machine translation, and diffusion models.

Overall, the MoMo method represents a promising step forward in adaptive learning rate optimization, with the potential to significantly reduce the computational burden of hyperparameter tuning for a wide range of machine learning applications.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

AdaLomo: Low-memory Optimization with Adaptive Learning Rate

Kai Lv, Hang Yan, Qipeng Guo, Haijun Lv, Xipeng Qiu

Large language models have achieved remarkable success, but their extensive parameter size necessitates substantial memory for training, thereby setting a high threshold. While the recently proposed low-memory optimization (LOMO) reduces memory footprint, its optimization technique, akin to stochastic gradient descent, is sensitive to hyper-parameters and exhibits suboptimal convergence, failing to match the performance of the prevailing optimizer for large language models, AdamW. Through empirical analysis of the Adam optimizer, we found that, compared to momentum, the adaptive learning rate is more critical for bridging the gap. Building on this insight, we introduce the low-memory optimization with adaptive learning rate (AdaLomo), which offers an adaptive learning rate for each parameter. To maintain memory efficiency, we employ non-negative matrix factorization for the second-order moment estimation in the optimizer state. Additionally, we suggest the use of a grouped update normalization to stabilize convergence. Our experiments with instruction-tuning and further pre-training demonstrate that AdaLomo achieves results on par with AdamW, while significantly reducing memory requirements, thereby lowering the hardware barrier to training large language models. The code is accessible at https://github.com/OpenLMLab/LOMO.

6/7/2024

cs.LG cs.CL

Score-based Generative Models with Adaptive Momentum

Ziqing Wen, Xiaoge Deng, Ping Luo, Tao Sun, Dongsheng Li

Score-based generative models have demonstrated significant practical success in data-generating tasks. The models establish a diffusion process that perturbs the ground truth data to Gaussian noise and then learn the reverse process to transform noise into data. However, existing denoising methods such as Langevin dynamic and numerical stochastic differential equation solvers enjoy randomness but generate data slowly with a large number of score function evaluations, and the ordinary differential equation solvers enjoy faster sampling speed but no randomness may influence the sample quality. To this end, motivated by the Stochastic Gradient Descent (SGD) optimization methods and the high connection between the model sampling process with the SGD, we propose adaptive momentum sampling to accelerate the transforming process without introducing additional hyperparameters. Theoretically, we proved our method promises convergence under given conditions. In addition, we empirically show that our sampler can produce more faithful images/graphs in small sampling steps with 2 to 5 times speed up and obtain competitive scores compared to the baselines on image and graph generation tasks.

5/24/2024

cs.LG

The Marginal Value of Momentum for Small Learning Rate SGD

Runzhe Wang, Sadhika Malladi, Tianhao Wang, Kaifeng Lyu, Zhiyuan Li

Momentum is known to accelerate the convergence of gradient descent in strongly convex settings without stochastic gradient noise. In stochastic optimization, such as training neural networks, folklore suggests that momentum may help deep learning optimization by reducing the variance of the stochastic gradient update, but previous theoretical analyses do not find momentum to offer any provable acceleration. Theoretical results in this paper clarify the role of momentum in stochastic settings where the learning rate is small and gradient noise is the dominant source of instability, suggesting that SGD with and without momentum behave similarly in the short and long time horizons. Experiments show that momentum indeed has limited benefits for both optimization and generalization in practical training regimes where the optimal learning rate is not very large, including small- to medium-batch training from scratch on ImageNet and fine-tuning language models on downstream tasks.

4/17/2024

cs.LG

Random Scaling and Momentum for Non-smooth Non-convex Optimization

Qinzi Zhang, Ashok Cutkosky

Training neural networks requires optimizing a loss function that may be highly irregular, and in particular neither convex nor smooth. Popular training algorithms are based on stochastic gradient descent with momentum (SGDM), for which classical analysis applies only if the loss is either convex or smooth. We show that a very small modification to SGDM closes this gap: simply scale the update at each time point by an exponentially distributed random scalar. The resulting algorithm achieves optimal convergence guarantees. Intriguingly, this result is not derived by a specific analysis of SGDM: instead, it falls naturally out of a more general framework for converting online convex optimization algorithms to non-convex optimization algorithms.

5/17/2024

cs.LG