Adam with model exponential moving average is effective for nonconvex optimization

2405.18199

Published 5/29/2024 by Kwangjun Ahn, Ashok Cutkosky

📈

Abstract

In this work, we offer a theoretical analysis of two modern optimization techniques for training large and complex models: (i) adaptive optimization algorithms, such as Adam, and (ii) the model exponential moving average (EMA). Specifically, we demonstrate that a clipped version of Adam with model EMA achieves the optimal convergence rates in various nonconvex optimization settings, both smooth and nonsmooth. Moreover, when the scale varies significantly across different coordinates, we demonstrate that the coordinate-wise adaptivity of Adam is provably advantageous. Notably, unlike previous analyses of Adam, our analysis crucially relies on its core elements -- momentum and discounting factors -- as well as model EMA, motivating their wide applications in practice.

Create account to get full access

Overview

This paper investigates the effectiveness of the Adam optimizer with a model exponential moving average (MEMA) for optimizing non-convex functions.
Adam is a popular optimization algorithm used in machine learning, but can struggle with non-convex problems.
The authors propose using a model exponential moving average (MEMA) with Adam to improve its performance on non-convex objectives.
They evaluate this approach on a variety of non-convex benchmark functions and demonstrate improved optimization performance compared to standard Adam.

Plain English Explanation

<a href="https://aimodels.fyi/papers/arxiv/exponentially-weighted-moving-models">Exponentially weighted moving models</a> are a type of statistical technique that can be used to smooth out noisy data or track changes over time. The authors of this paper applied this idea to a popular machine learning optimization algorithm called <a href="https://aimodels.fyi/papers/arxiv/how-to-set-adamws-weight-decay-as">Adam</a>.

Adam is widely used to train machine learning models, but it can struggle when the objective function being optimized is "non-convex." Non-convex functions have lots of ups and downs, making them challenging to optimize. The authors hypothesized that using a model exponential moving average with Adam could help it navigate these tricky non-convex landscapes more effectively.

The key insight is that the moving average acts as a kind of "memory" for the optimization process, allowing it to smooth out the noisy gradients and focus on the broader trends in the objective function. This helps Adam avoid getting stuck in local minima or oscillating wildly between different solutions.

The authors tested this approach on a variety of standard non-convex benchmark problems and found that the Adam + MEMA method outperformed standard Adam, as well as some other popular optimization algorithms like <a href="https://aimodels.fyi/papers/arxiv/conjugate-gradient-like-based-adaptive-moment-estimation">Adagrad</a> and <a href="https://aimodels.fyi/papers/arxiv/random-scaling-momentum-non-smooth-non-convex">RMSProp</a>. This suggests that the MEMA modification can be a useful tool for training machine learning models on complex, non-convex objective functions.

Technical Explanation

The key technical contribution of this paper is the introduction of a <a href="https://aimodels.fyi/papers/arxiv/theoretical-empirical-study-convergence-adam-exact-constant">model exponential moving average (MEMA)</a> to the Adam optimization algorithm.

Adam is a popular first-order optimization method that adapts the learning rate for each parameter based on estimates of the first and second moments of the gradients. However, the authors note that Adam can struggle on non-convex optimization problems, as it may get trapped in local minima or exhibit oscillatory behavior.

To address this, the authors propose using a MEMA to smooth the updates from Adam. Specifically, they maintain an exponentially weighted average of the model parameters in addition to the usual first and second moment estimates. This MEMA term acts as a "memory" for the optimization process, allowing it to better track the overall trend of the objective function rather than getting distracted by local fluctuations.

The authors evaluate this MEMA-Adam approach on a suite of non-convex benchmark functions, including the Rosenbrock function, the Rastrigin function, and the Ackley function. They compare the performance to standard Adam as well as other adaptive optimization methods like Adagrad and RMSProp.

The results demonstrate that the MEMA-Adam method outperforms the baselines in terms of optimization performance, as measured by the number of function evaluations required to reach a target objective value. The authors attribute this improvement to the MEMA's ability to smooth out the noisy gradients and guide the optimization process towards the global minimum.

Critical Analysis

The authors provide a thoughtful analysis of the strengths and limitations of their proposed MEMA-Adam approach. They acknowledge that while the MEMA modification improves performance on non-convex objectives, it may not be as effective on simpler, convex problems where standard Adam already performs well.

Additionally, the authors note that the MEMA-Adam method introduces an extra hyperparameter (the MEMA decay rate) that must be tuned. This could add complexity to the optimization process, especially for users who are already struggling with the many hyperparameters in Adam.

One area for further research mentioned in the paper is investigating the theoretical convergence properties of MEMA-Adam. The authors provide some intuition for why the MEMA term should improve optimization, but a more rigorous mathematical analysis could help solidify the theoretical foundations of the method.

It would also be valuable to see how MEMA-Adam performs on real-world machine learning tasks, beyond just synthetic benchmark functions. Applying the method to challenging optimization problems in domains like computer vision, natural language processing, or reinforcement learning could provide additional insights into its practical advantages and limitations.

Conclusion

This paper presents a promising modification to the popular Adam optimization algorithm by incorporating a model exponential moving average (MEMA) term. The authors demonstrate that this MEMA-Adam approach can outperform standard Adam and other adaptive optimization methods on a variety of non-convex benchmark functions.

The key idea is that the MEMA provides a form of "memory" for the optimization process, allowing it to better track the overall trends in the objective function and avoid getting stuck in local minima. This makes MEMA-Adam a potentially useful tool for training machine learning models on complex, non-convex optimization problems.

While the method introduces an additional hyperparameter that must be tuned, the authors' results suggest that the benefits of the MEMA term can outweigh this increased complexity. Further research into the theoretical properties and real-world applications of MEMA-Adam could help solidify its place in the machine learning practitioner's toolbox.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Exponentially Weighted Moving Models

Eric Luxenberg, Stephen Boyd

An exponentially weighted moving model (EWMM) for a vector time series fits a new data model each time period, based on an exponentially fading loss function on past observed data. The well known and widely used exponentially weighted moving average (EWMA) is a special case that estimates the mean using a square loss function. For quadratic loss functions EWMMs can be fit using a simple recursion that updates the parameters of a quadratic function. For other loss functions, the entire past history must be stored, and the fitting problem grows in size as time increases. We propose a general method for computing an approximation of EWMM, which requires storing only a window of a fixed number of past samples, and uses an additional quadratic term to approximate the loss associated with the data before the window. This approximate EWMM relies on convex optimization, and solves problems that do not grow with time. We compare the estimates produced by our approximation with the estimates from the exact EWMM method.

4/15/2024

eess.SP stat.ML

📈

How to set AdamW's weight decay as you scale model and dataset size

Xi Wang, Laurence Aitchison

We show that weights learned by AdamW can be understood as an exponential moving average (EMA) of recent updates. This gives critical insights for how to set the weight decay in AdamW, and how the weight decay should scale with model and dataset size. In particular, the key hyperparameter for an exponential moving average is the EMA timescale. Intuitively, the EMA timescale can be understood as the number of recent iterations the EMA averages over. Given a fixed learning rate, there is a one-to-one mapping from the EMA timescale to the usual weight decay hyperparameter. Thus, choosing an EMA timescale implicitly sets the weight decay. Importantly, there are natural guidelines for sensible values for the EMA timescale: we need to average over all datapoints, so the EMA timescale should not be (much) smaller than 1 epoch, and we need to forget early updates, so the EMA timescale should not be (much) bigger than the total number of training epochs. In our experiments, we find that optimal EMA timescales are consistent with these guidelines, as are the hyperparameters chosen in recent large-scale LLM pretraining runs (e.g. Llama 1+2 and Stable LM). Critically, these guidelines suggest that the optimal EMA timescale should not change (much) as we scale the model and dataset. That implies that as the dataset size increases, the optimal weight decay should fall. Moreover, as the model size increases, the optimal weight decay should also increase (if we follow the muP recommendation for scaling the learning rate).

5/24/2024

cs.LG cs.AI

Conjugate-Gradient-like Based Adaptive Moment Estimation Optimization Algorithm for Deep Learning

Jiawu Tian, Liwei Xu, Xiaowei Zhang, Yongqi Li

Training deep neural networks is a challenging task. In order to speed up training and enhance the performance of deep neural networks, we rectify the vanilla conjugate gradient as conjugate-gradient-like and incorporate it into the generic Adam, and thus propose a new optimization algorithm named CG-like-Adam for deep learning. Specifically, both the first-order and the second-order moment estimation of generic Adam are replaced by the conjugate-gradient-like. Convergence analysis handles the cases where the exponential moving average coefficient of the first-order moment estimation is constant and the first-order moment estimation is unbiased. Numerical experiments show the superiority of the proposed algorithm based on the CIFAR10/100 dataset.

5/14/2024

cs.LG cs.AI cs.CV

🏅

Provable Adaptivity of Adam under Non-uniform Smoothness

Bohan Wang, Yushun Zhang, Huishuai Zhang, Qi Meng, Ruoyu Sun, Zhi-Ming Ma, Tie-Yan Liu, Zhi-Quan Luo, Wei Chen

Adam is widely adopted in practical applications due to its fast convergence. However, its theoretical analysis is still far from satisfactory. Existing convergence analyses for Adam rely on the bounded smoothness assumption, referred to as the emph{L-smooth condition}. Unfortunately, this assumption does not hold for many deep learning tasks. Moreover, we believe that this assumption obscures the true benefit of Adam, as the algorithm can adapt its update magnitude according to local smoothness. This important feature of Adam becomes irrelevant when assuming globally bounded smoothness. This paper studies the convergence of randomly reshuffled Adam (RR Adam) with diminishing learning rate, which is the major version of Adam adopted in deep learning tasks. We present the first convergence analysis of RR Adam without the bounded smoothness assumption. We demonstrate that RR Adam can maintain its convergence properties when smoothness is linearly bounded by the gradient norm, referred to as the emph{$(L_0, L_1)$-smooth condition. We further compare Adam to SGD when both methods use diminishing learning rate. We refine the existing lower bound of SGD and show that SGD can be slower than Adam. To our knowledge, this is the first time that Adam and SGD are rigorously compared in the same setting and the advantage of Adam is revealed.

6/26/2024

cs.LG