Adam-mini: Use Fewer Learning Rates To Gain More

2406.16793

Published 6/27/2024 by Yushun Zhang, Congliang Chen, Ziniu Li, Tian Ding, Chenwei Wu, Yinyu Ye, Zhi-Quan Luo, Ruoyu Sun

Adam-mini: Use Fewer Learning Rates To Gain More

Abstract

We propose Adam-mini, an optimizer that achieves on-par or better performance than AdamW with 45% to 50% less memory footprint. Adam-mini reduces memory by cutting down the learning rate resources in Adam (i.e., $1/sqrt{v}$). We find that $geq$ 90% of these learning rates in $v$ could be harmlessly removed if we (1) carefully partition the parameters into blocks following our proposed principle on Hessian structure; (2) assign a single but good learning rate to each parameter block. We further find that, for each of these parameter blocks, there exists a single high-quality learning rate that can outperform Adam, provided that sufficient resources are available to search it out. We then provide one cost-effective way to find good learning rates and propose Adam-mini. Empirically, we verify that Adam-mini performs on par or better than AdamW on various language models sized from 125M to 7B for pre-training, supervised fine-tuning, and RLHF. The reduced memory footprint of Adam-mini also alleviates communication overheads among GPUs and CPUs, thereby increasing throughput. For instance, Adam-mini achieves 49.6% higher throughput than AdamW when pre-training Llama2-7B on $2times$ A800-80GB GPUs, which saves 33% wall-clock time for pre-training.

Create account to get full access

Overview

The paper introduces a new optimization method called Adam-mini, which aims to improve the efficiency of the popular Adam optimizer by using fewer learning rates.
Adam-mini modifies the standard Adam algorithm to use a single global learning rate instead of separate learning rates for each parameter.
The authors claim that Adam-mini can achieve comparable or better performance than standard Adam while using significantly less memory.

Plain English Explanation

The Adam optimizer is a widely used technique in machine learning for updating the parameters of a model during training. Adam works by adjusting the learning rate for each parameter individually, which can help the model converge more quickly.

However, the authors of this paper argue that the per-parameter learning rates used by Adam can also be inefficient, as they require storing and updating a large number of additional variables. To address this, they propose a new method called Adam-mini, which uses a single global learning rate instead of separate rates for each parameter.

The key idea behind Adam-mini is that a single global learning rate can often be just as effective as the individual rates used in standard Adam, while requiring much less memory to store and update. This can be particularly beneficial for training large models or running on resource-constrained devices, where memory usage is a concern.

The authors demonstrate that Adam-mini can achieve comparable or even better performance than standard Adam on a variety of machine learning tasks, while using significantly less memory. This suggests that Adam-mini could be a useful alternative to the standard Adam optimizer in many practical applications.

Technical Explanation

The Adam optimizer is a popular algorithm for training machine learning models, as it can often converge more quickly than traditional stochastic gradient descent. Adam works by maintaining separate adaptive learning rates for each parameter in the model, which are updated based on the first and second moments of the gradients.

While the adaptive learning rates used by Adam can be beneficial, they also come with a significant memory overhead, as the algorithm needs to store and update a large number of additional variables. This can be problematic for training large models or running on resource-constrained devices.

To address this issue, the authors of the paper propose a new optimization method called Adam-mini. In Adam-mini, the authors modify the standard Adam algorithm to use a single global learning rate instead of separate rates for each parameter. This reduces the memory footprint of the optimizer, as the algorithm only needs to maintain a small number of additional variables.

The authors show that Adam-mini can achieve comparable or even better performance than standard Adam on a variety of machine learning tasks, including image classification, language modeling, and reinforcement learning. They attribute this to the fact that a single global learning rate can often be just as effective as the individual rates used in standard Adam, especially for well-conditioned problems.

The authors also provide theoretical analysis to support their claims, showing that Adam-mini can achieve similar convergence guarantees to standard Adam under certain conditions. Additionally, they demonstrate that Adam-mini can be easily combined with other memory-efficient techniques, such as BADAM and ADALOMO, to further reduce the memory requirements of the optimization process.

Critical Analysis

One potential limitation of the Adam-mini approach is that the single global learning rate may not be as effective as the individual rates used in standard Adam for more complex or ill-conditioned optimization problems. In such cases, the additional flexibility provided by the per-parameter learning rates in standard Adam may be necessary to achieve optimal performance.

Additionally, the authors note that the performance of Adam-mini can be sensitive to the choice of hyperparameters, such as the initial learning rate and the momentum decay rates. Careful tuning of these hyperparameters may be required to achieve the best results, which could limit the practical applicability of the method in some scenarios.

Finally, while the authors demonstrate that Adam-mini can be combined with other memory-efficient techniques, it would be interesting to see how the method performs in comparison to other memory-efficient optimization algorithms, such as MicroAdam or HIFT. A more comprehensive comparison of these approaches could provide further insights into the relative strengths and weaknesses of the Adam-mini method.

Conclusion

The Adam-mini optimization method introduced in this paper offers a promising approach to improving the efficiency of the popular Adam optimizer. By using a single global learning rate instead of separate rates for each parameter, Adam-mini can achieve comparable or better performance while using significantly less memory.

This could be particularly beneficial for training large models or running on resource-constrained devices, where memory usage is a concern. While the method may have some limitations in certain optimization scenarios, the authors' theoretical and empirical results suggest that Adam-mini could be a useful addition to the suite of optimization techniques available to machine learning practitioners.

Overall, this paper provides an interesting contribution to the ongoing efforts to develop more efficient and memory-friendly optimization algorithms for machine learning applications.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

BAdam: A Memory Efficient Full Parameter Training Method for Large Language Models

Qijun Luo, Hengxu Yu, Xiao Li

This work presents BAdam, an optimization method that leverages the block coordinate descent framework with Adam as the inner solver. BAdam offers a memory efficient approach to the full parameter finetuning of large language models. We conduct theoretical convergence analysis for BAdam in the deterministic case. Experimentally, we apply BAdam to instruction-tune the Llama 2-7B and Llama 3-8B models using a single RTX3090-24GB GPU. The results confirm BAdam's efficiency in terms of memory and running time. Additionally, the convergence verification indicates that BAdam exhibits superior convergence behavior compared to LoRA. Furthermore, the downstream performance evaluation using the MT-bench shows that BAdam modestly surpasses LoRA and more substantially outperforms LOMO. Finally, we compare BAdam with Adam on a medium-sized task, i.e., finetuning RoBERTa-large on the SuperGLUE benchmark. The results demonstrate that BAdam is capable of narrowing the performance gap with Adam more effectively than LoRA. Our code is available at https://github.com/Ledzy/BAdam.

5/24/2024

cs.LG

AdaLomo: Low-memory Optimization with Adaptive Learning Rate

Kai Lv, Hang Yan, Qipeng Guo, Haijun Lv, Xipeng Qiu

Large language models have achieved remarkable success, but their extensive parameter size necessitates substantial memory for training, thereby setting a high threshold. While the recently proposed low-memory optimization (LOMO) reduces memory footprint, its optimization technique, akin to stochastic gradient descent, is sensitive to hyper-parameters and exhibits suboptimal convergence, failing to match the performance of the prevailing optimizer for large language models, AdamW. Through empirical analysis of the Adam optimizer, we found that, compared to momentum, the adaptive learning rate is more critical for bridging the gap. Building on this insight, we introduce the low-memory optimization with adaptive learning rate (AdaLomo), which offers an adaptive learning rate for each parameter. To maintain memory efficiency, we employ non-negative matrix factorization for the second-order moment estimation in the optimizer state. Additionally, we suggest the use of a grouped update normalization to stabilize convergence. Our experiments with instruction-tuning and further pre-training demonstrate that AdaLomo achieves results on par with AdamW, while significantly reducing memory requirements, thereby lowering the hardware barrier to training large language models. The code is accessible at https://github.com/OpenLMLab/LOMO.

6/7/2024

cs.LG cs.CL

MicroAdam: Accurate Adaptive Optimization with Low Space Overhead and Provable Convergence

Ionut-Vlad Modoranu, Mher Safaryan, Grigory Malinovsky, Eldar Kurtic, Thomas Robert, Peter Richtarik, Dan Alistarh

We propose a new variant of the Adam optimizer [Kingma and Ba, 2014] called MICROADAM that specifically minimizes memory overheads, while maintaining theoretical convergence guarantees. We achieve this by compressing the gradient information before it is fed into the optimizer state, thereby reducing its memory footprint significantly. We control the resulting compression error via a novel instance of the classical error feedback mechanism from distributed optimization [Seide et al., 2014, Alistarh et al., 2018, Karimireddy et al., 2019] in which the error correction information is itself compressed to allow for practical memory gains. We prove that the resulting approach maintains theoretical convergence guarantees competitive to those of AMSGrad, while providing good practical performance. Specifically, we show that MICROADAM can be implemented efficiently on GPUs: on both million-scale (BERT) and billion-scale (LLaMA) models, MicroAdam provides practical convergence competitive to that of the uncompressed Adam baseline, with lower memory usage and similar running time. Our code is available at https://github.com/IST-DASLab/MicroAdam.

5/27/2024

cs.LG cs.NA

⚙️

Promoting Exploration in Memory-Augmented Adam using Critical Momenta

Pranshu Malviya, Gonc{c}alo Mordido, Aristide Baratin, Reza Babanezhad Harikandeh, Jerry Huang, Simon Lacoste-Julien, Razvan Pascanu, Sarath Chandar

Adaptive gradient-based optimizers, notably Adam, have left their mark in training large-scale deep learning models, offering fast convergence and robustness to hyperparameter settings. However, they often struggle with generalization, attributed to their tendency to converge to sharp minima in the loss landscape. To address this, we propose a new memory-augmented version of Adam that encourages exploration towards flatter minima by incorporating a buffer of critical momentum terms during training. This buffer prompts the optimizer to overshoot beyond narrow minima, promoting exploration. Through comprehensive analysis in simple settings, we illustrate the efficacy of our approach in increasing exploration and bias towards flatter minima. We empirically demonstrate that it can improve model performance for image classification on ImageNet and CIFAR10/100, language modelling on Penn Treebank, and online learning tasks on TinyImageNet and 5-dataset. Our code is available at url{https://github.com/chandar-lab/CMOptimizer}.

6/19/2024

cs.LG cs.AI