AdaLomo: Low-memory Optimization with Adaptive Learning Rate

2310.10195

Published 6/7/2024 by Kai Lv, Hang Yan, Qipeng Guo, Haijun Lv, Xipeng Qiu

Abstract

Large language models have achieved remarkable success, but their extensive parameter size necessitates substantial memory for training, thereby setting a high threshold. While the recently proposed low-memory optimization (LOMO) reduces memory footprint, its optimization technique, akin to stochastic gradient descent, is sensitive to hyper-parameters and exhibits suboptimal convergence, failing to match the performance of the prevailing optimizer for large language models, AdamW. Through empirical analysis of the Adam optimizer, we found that, compared to momentum, the adaptive learning rate is more critical for bridging the gap. Building on this insight, we introduce the low-memory optimization with adaptive learning rate (AdaLomo), which offers an adaptive learning rate for each parameter. To maintain memory efficiency, we employ non-negative matrix factorization for the second-order moment estimation in the optimizer state. Additionally, we suggest the use of a grouped update normalization to stabilize convergence. Our experiments with instruction-tuning and further pre-training demonstrate that AdaLomo achieves results on par with AdamW, while significantly reducing memory requirements, thereby lowering the hardware barrier to training large language models. The code is accessible at https://github.com/OpenLMLab/LOMO.

Overview

This paper introduces AdaLomo, an optimizer that adds a per-parameter adaptive learning rate to low-memory optimization (LOMO) so that large language models can be trained with less memory.
AdaLomo builds on LOMO, whose update rule resembles stochastic gradient descent, is sensitive to hyper-parameters, and converges less well than AdamW.
The key additions are a second-order moment estimate stored in factored form via non-negative matrix factorization, and a grouped update normalization that stabilizes convergence.

Plain English Explanation

AdaLomo is an optimization algorithm designed to train machine learning models with less memory while still adapting its step sizes. Large models, especially language models, can require a lot of memory to train. AdaLomo inherits its memory savings from LOMO, which fuses the gradient computation and the parameter update into one step, so gradients do not have to be kept around for the whole model at once.

Additionally, AdaLomo uses an adaptive learning rate for each parameter, which means the optimizer adjusts the step size per parameter instead of relying on one global value. According to the abstract, this adaptive learning rate matters more than momentum for closing the gap between LOMO and AdamW.

The researchers built AdaLomo on top of LOMO, their earlier low-memory optimizer. By adding an adaptive learning rate while keeping the optimizer state small, they obtain a method that is both memory-efficient and adaptive, which could be especially useful for training large language models on limited hardware.

Technical Explanation

AdaLomo keeps the fused update introduced by LOMO, which fuses the gradient computation and the parameter update in one step. In a conventional training step, the backward pass computes gradients for every parameter and holds them in memory until the optimizer applies them all at once. With the fused approach, each parameter is updated as soon as its gradient is available during the backward pass, so the gradient can be discarded immediately and gradients for the full model never need to coexist in memory.
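The fused update can be illustrated with a toy example. The following is a minimal sketch, not the authors' implementation: it assumes a chain of scalar linear layers, a squared-error loss, and a plain SGD update, and the function names (`forward`, `standard_step`, `fused_step`) are hypothetical. It shows that updating each layer as soon as its gradient is known gives the same weights as the conventional compute-then-update order, while holding only one gradient at a time.

```python
def forward(weights, x):
    """Run a chain of scalar linear layers, keeping each layer's input."""
    inputs = []
    h = x
    for w in weights:
        inputs.append(h)
        h = w * h
    return h, inputs


def standard_step(weights, x, target, lr):
    """Compute every gradient first, then update: all gradients are live at once."""
    y, inputs = forward(weights, x)
    upstream = y - target  # dL/dy for L = 0.5 * (y - target) ** 2
    grads = [0.0] * len(weights)
    for i in reversed(range(len(weights))):
        grads[i] = upstream * inputs[i]
        upstream = upstream * weights[i]
    new_weights = [w - lr * g for w, g in zip(weights, grads)]
    return new_weights, len(grads)


def fused_step(weights, x, target, lr):
    """Update each layer as soon as its gradient is known: one gradient is live."""
    y, inputs = forward(weights, x)
    upstream = y - target
    new_weights = list(weights)
    for i in reversed(range(len(weights))):
        grad = upstream * inputs[i]               # gradient of layer i only
        upstream = upstream * weights[i]          # propagate with the pre-update weight
        new_weights[i] = weights[i] - lr * grad   # apply immediately, then drop grad
    return new_weights, 1


if __name__ == "__main__":
    w = [0.5, -1.5, 2.0]
    print(standard_step(w, 1.0, 0.3, 0.1))
    print(fused_step(w, 1.0, 0.3, 0.1))
```

In a real framework the same idea would be attached to the backward pass itself; this sketch only walks the layers by hand to make the ordering visible.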

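On top of the fused update, AdaLomo provides an adaptive learning rate for each parameter, in the spirit of Adam. Because storing a full second-order moment estimate would cost as much memory as the weights themselves, the optimizer state keeps that estimate in factored form using non-negative matrix factorization. The authors also suggest a grouped update normalization to stabilize convergence. The abstract reports that, compared to momentum, the adaptive learning rate is the more important ingredient for matching AdamW.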
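The abstract mentions a factorized second-order moment and a grouped update normalization without giving formulas, so the following is only a sketch under stated assumptions. It assumes an Adafactor-style factorization (running row sums and column sums of the squared gradient, recombined as a rank-1 estimate) and treats "grouped" as "per weight matrix", rescaling the update when its root mean square exceeds 1. All names (`factored_second_moment`, `adaptive_step`, `EPS`) are hypothetical.

```python
import math

EPS = 1e-30  # small constant to avoid division by zero (assumed value)


def factored_second_moment(grad, row_state, col_state, beta):
    """Update row/column statistics of grad**2 in place and rebuild a rank-1 estimate."""
    rows, cols = len(grad), len(grad[0])
    for i in range(rows):
        row_sum = sum(grad[i][j] ** 2 for j in range(cols))
        row_state[i] = beta * row_state[i] + (1.0 - beta) * row_sum
    for j in range(cols):
        col_sum = sum(grad[i][j] ** 2 for i in range(rows))
        col_state[j] = beta * col_state[j] + (1.0 - beta) * col_sum
    total = sum(row_state) + EPS
    return [[row_state[i] * col_state[j] / total for j in range(cols)]
            for i in range(rows)]


def rms(matrix):
    """Root mean square over all entries of a matrix given as nested lists."""
    values = [v for row in matrix for v in row]
    return math.sqrt(sum(v * v for v in values) / len(values))


def adaptive_step(param, grad, row_state, col_state, lr=0.01, beta=0.9):
    """One per-matrix step: adaptive scaling, then normalization of the whole update."""
    v_hat = factored_second_moment(grad, row_state, col_state, beta)
    update = [[g / math.sqrt(v + EPS) for g, v in zip(g_row, v_row)]
              for g_row, v_row in zip(grad, v_hat)]
    scale = max(1.0, rms(update))  # assumed form of the grouped normalization
    return [[p - lr * u / scale for p, u in zip(p_row, u_row)]
            for p_row, u_row in zip(param, update)]


if __name__ == "__main__":
    param = [[0.0, 0.0], [0.0, 0.0]]
    grad = [[1.0, 2.0], [2.0, 4.0]]
    rows, cols = [0.0, 0.0], [0.0, 0.0]
    print(adaptive_step(param, grad, rows, cols))
    print("state values kept:", len(rows) + len(cols))
```

The design point is that the state for an m-by-n weight matrix is m + n numbers instead of m * n, and the full estimate is rebuilt on the fly only for the matrix currently being updated.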

The researchers evaluate AdaLomo on instruction-tuning and further pre-training of large language models. They report that AdaLomo achieves results on par with AdamW while significantly reducing memory requirements.
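To give a sense of scale for the optimizer state alone, the arithmetic below compares the number of values AdamW keeps for one weight matrix (a first and a second moment, each shaped like the weight) with a factored second moment that keeps one row vector and one column vector. The 4096 x 4096 layer shape is a hypothetical example, and the count ignores weights, activations, and gradients, so it is not a measurement of total training memory.

```python
def adam_state_size(rows, cols):
    """AdamW keeps first and second moments, each shaped like the weight."""
    return 2 * rows * cols


def factored_state_size(rows, cols):
    """A factored second moment keeps one row vector and one column vector."""
    return rows + cols


if __name__ == "__main__":
    rows = cols = 4096  # hypothetical layer shape
    adam_values = adam_state_size(rows, cols)
    factored_values = factored_state_size(rows, cols)
    # 33554432 values versus 8192 values for this one matrix
    print(adam_values, factored_values, adam_values // factored_values)
```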

Critical Analysis

The evaluation described in the abstract covers instruction-tuning and further pre-training, with AdamW as the main point of comparison. AdaLomo may not be the best choice for all scenarios, and its performance may depend on the specific problem and model architecture.

Additionally, the motivation for AdaLomo is empirical: the authors analyze the Adam optimizer and find that the adaptive learning rate matters more than momentum. While the experimental results are promising, a more rigorous theoretical analysis of convergence could help to better understand the strengths and limitations of the algorithm.

Overall, AdaLomo appears to be a valuable contribution to the field of memory-efficient optimization, and the ideas behind it could be further developed and extended in future work.

Conclusion

The AdaLomo algorithm introduced in this paper is a step forward in memory-efficient optimization for large language models. By combining LOMO's fused update with a per-parameter adaptive learning rate, a factored second-order moment estimate, and grouped update normalization, AdaLomo achieves results on par with AdamW while using significantly less memory.

This work builds upon and extends LOMO. By lowering the hardware barrier to training, the techniques introduced in AdaLomo could be broadly useful wherever memory is the limiting constraint in training large language models.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Full Parameter Fine-tuning for Large Language Models with Limited Resources

Kai Lv, Yuqing Yang, Tengxiao Liu, Qinghui Gao, Qipeng Guo, Xipeng Qiu

Large Language Models (LLMs) have revolutionized Natural Language Processing (NLP) but demand massive GPU resources for training. Lowering the threshold for LLMs training would encourage greater participation from researchers, benefiting both academia and society. While existing approaches have focused on parameter-efficient fine-tuning, which tunes or adds a small number of parameters, few have addressed the challenge of tuning the full parameters of LLMs with limited resources. In this work, we propose a new optimizer, LOw-Memory Optimization (LOMO), which fuses the gradient computation and the parameter update in one step to reduce memory usage. By integrating LOMO with existing memory saving techniques, we reduce memory usage to 10.8% compared to the standard approach (DeepSpeed solution). Consequently, our approach enables the full parameter fine-tuning of a 65B model on a single machine with 8 RTX 3090, each with 24GB memory. Code and data are available at https://github.com/OpenLMLab/LOMO.

6/7/2024

cs.CL

MoMo: Momentum Models for Adaptive Learning Rates

Fabian Schaipp, Ruben Ohana, Michael Eickenberg, Aaron Defazio, Robert M. Gower

Training a modern machine learning architecture on a new task requires extensive learning-rate tuning, which comes at a high computational cost. Here we develop new Polyak-type adaptive learning rates that can be used on top of any momentum method, and require less tuning to perform well. We first develop MoMo, a Momentum Model based adaptive learning rate for SGD-M (stochastic gradient descent with momentum). MoMo uses momentum estimates of the losses and gradients sampled at each iteration to build a model of the loss function. Our model makes use of any known lower bound of the loss function by using truncation, e.g. most losses are lower-bounded by zero. The model is then approximately minimized at each iteration to compute the next step. We show how MoMo can be used in combination with any momentum-based method, and showcase this by developing MoMo-Adam, which is Adam with our new model-based adaptive learning rate. We show that MoMo attains a $\mathcal{O}(1/\sqrt{K})$ convergence rate for convex problems with interpolation, needing knowledge of no problem-specific quantities other than the optimal value. Additionally, for losses with unknown lower bounds, we develop on-the-fly estimates of a lower bound, that are incorporated in our model. We show that MoMo and MoMo-Adam improve over SGD-M and Adam in terms of robustness to hyperparameter tuning for training image classifiers on MNIST, CIFAR, and Imagenet, for recommender systems on Criteo, for a transformer model on the translation task IWSLT14, and for a diffusion model.

6/6/2024

cs.LG

BAdam: A Memory Efficient Full Parameter Training Method for Large Language Models

Qijun Luo, Hengxu Yu, Xiao Li

This work presents BAdam, an optimization method that leverages the block coordinate descent framework with Adam as the inner solver. BAdam offers a memory efficient approach to the full parameter finetuning of large language models. We conduct theoretical convergence analysis for BAdam in the deterministic case. Experimentally, we apply BAdam to instruction-tune the Llama 2-7B and Llama 3-8B models using a single RTX3090-24GB GPU. The results confirm BAdam's efficiency in terms of memory and running time. Additionally, the convergence verification indicates that BAdam exhibits superior convergence behavior compared to LoRA. Furthermore, the downstream performance evaluation using the MT-bench shows that BAdam modestly surpasses LoRA and more substantially outperforms LOMO. Finally, we compare BAdam with Adam on a medium-sized task, i.e., finetuning RoBERTa-large on the SuperGLUE benchmark. The results demonstrate that BAdam is capable of narrowing the performance gap with Adam more effectively than LoRA. Our code is available at https://github.com/Ledzy/BAdam.

5/24/2024

cs.LG

AdaMoLE: Fine-Tuning Large Language Models with Adaptive Mixture of Low-Rank Adaptation Experts

Zefang Liu, Jiahua Luo

We introduce AdaMoLE, a novel method for fine-tuning large language models (LLMs) through an Adaptive Mixture of Low-Rank Adaptation (LoRA) Experts. Moving beyond conventional methods that employ a static top-k strategy for activating experts, AdaMoLE dynamically adjusts the activation threshold using a dedicated threshold network, adaptively responding to the varying complexities of different tasks. By replacing a single LoRA in a layer with multiple LoRA experts and integrating a gating function with the threshold mechanism, AdaMoLE effectively selects and activates the most appropriate experts based on the input context. Our extensive evaluations across a variety of commonsense reasoning and natural language processing tasks show that AdaMoLE exceeds baseline performance. This enhancement highlights the advantages of AdaMoLE's adaptive selection of LoRA experts, improving model effectiveness without a corresponding increase in the expert count. The experimental validation not only confirms AdaMoLE as a robust approach for enhancing LLMs but also suggests valuable directions for future research in adaptive expert selection mechanisms, potentially broadening the scope for optimizing model performance across diverse language processing tasks.

5/2/2024

cs.CL