Scaling Exponents Across Parameterizations and Optimizers

Read original: arXiv:2407.05872 - Published 7/17/2024 by Katie Everett, Lechao Xiao, Mitchell Wortsman, Alexander A. Alemi, Roman Novak, Peter J. Liu, Izzeddin Gur, Jascha Sohl-Dickstein, Leslie Pack Kaelbling, Jaehoon Lee, and Jeffrey Pennington

Overview

Investigates a key assumption in prior work about the alignment between model parameters and data, and how it affects scaling models to larger sizes
Proposes a new perspective on parameterization and derives new theoretical results under weaker assumptions
Conducts extensive empirical evaluations across various optimizers, parameterizations, learning rates, and model sizes up to 26.8B parameters
Finds that the best learning rate scaling prescription would often have been excluded by assumptions in prior work
Demonstrates that an overlooked aspect of parameterization, the epsilon parameter in Adam, must be scaled correctly to avoid gradient underflow
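A learning rate scaling prescription of the kind listed above can be pictured as a power law in model width, with one exponent per layer type. The sketch below is only an illustration under that assumption: the function name, the layer names, and the exponent values are hypothetical placeholders and are not taken from the paper.

```python
def per_layer_learning_rates(base_lr, width, base_width, exponents):
    """Scale a base learning rate per layer as base_lr * (width / base_width) ** -c.

    `exponents` maps a layer name to its scaling exponent c. A larger c means
    the learning rate shrinks faster as the model gets wider.
    """
    ratio = width / base_width
    return {layer: base_lr * ratio ** (-c) for layer, c in exponents.items()}


# Placeholder exponents for illustration only; these are NOT the paper's values.
EXAMPLE_EXPONENTS = {"embedding": 0.0, "hidden": 1.0, "readout": 0.5}

if __name__ == "__main__":
    for width in (1024, 4096, 16384):
        rates = per_layer_learning_rates(1e-2, width, 1024, EXAMPLE_EXPONENTS)
        print(width, rates)
```

With this form, a learning rate tuned at a small base width is extrapolated to larger widths by changing only the width ratio, which is the sense in which an exponent acts as a scaling prescription.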

Plain English Explanation

The paper explores how to effectively scale up machine learning models from small to large sizes. Typically, this requires carefully adjusting many technical details, such as the parameterization (how the model's weights are initialized and how their learning rates change as the model grows) and the optimization algorithm used to train the model.

The researchers propose a new way of thinking about how the model parameters are connected to the training data. They derive some new mathematical results that make fewer assumptions than previous work. They then conduct a large number of experiments, training tens of thousands of models with different parameter representations, optimization algorithms, learning rates, and model sizes up to 26.8 billion parameters.

The key findings are:

The best way to scale the learning rate as the model size changes would often have been excluded by assumptions made in prior research (see "Resolving Discrepancies in Compute-Optimal Scaling of Language Models").
All of the parameterizations studied, not just the "maximal update parameterization" used in some previous work, can transfer hyperparameters from small models to large ones. Moreover, a new per-layer approach to setting the learning rate for the standard parameterization outperforms the maximal update parameterization (see "A Large-Scale Exploration of $\mu$-Transfer").
An important detail in the Adam optimization algorithm, the epsilon parameter, must be scaled correctly as the model size changes to avoid numerical issues. The researchers propose a new version of Adam, called Adam-atan2, that eliminates this parameter entirely (see "On the Parameterization of Second-Order Optimization Effective Towards the Infinite Width").
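To see why a fixed epsilon can matter, consider a toy version of the Adam step direction, m / (sqrt(v) + eps), evaluated for a constant gradient so that m = g and v = g^2. This is a simplified illustration, not the paper's experiment: the function name and the specific gradient and epsilon values are assumptions chosen only to make the effect visible.

```python
import math


def adam_direction(grad, eps):
    """Adam step direction for a constant gradient, where m = grad and v = grad**2."""
    m = grad
    v = grad * grad
    return m / (math.sqrt(v) + eps)


if __name__ == "__main__":
    # Illustrative values only: as gradients shrink, a fixed eps starts to
    # dominate the denominator and the step direction collapses toward zero.
    for grad in (1e-2, 1e-6, 1e-10):
        fixed = adam_direction(grad, eps=1e-8)
        scaled = adam_direction(grad, eps=grad * 1e-2)
        print(f"grad={grad:.0e} fixed_eps={fixed:.4f} scaled_eps={scaled:.4f}")
```

When the gradient is much larger than epsilon the direction stays close to 1, but once the gradient falls below epsilon the update is suppressed unless epsilon is reduced along with it.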

Overall, the paper provides new insights into how to effectively scale up large machine learning models, which is an important challenge as models continue to grow in size.

Technical Explanation

The paper investigates the relationship between model parameterization and the ability to scale models to larger sizes. Prior work has made specific assumptions about the alignment between model parameters and training data, but the researchers relax these assumptions and derive new theoretical results.

They conduct extensive empirical evaluations, training tens of thousands of models with all combinations of three optimizers (SGD, Adam, and Adafactor), four parameterizations, several alignment assumptions, more than a dozen learning rates, and fourteen model sizes up to 26.8 billion parameters (see "Navigating Scaling Laws: Compute-Optimality and Adaptive Model Design").
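One common way to turn a sweep like this into a scaling exponent is to record the best learning rate at each model width and fit a straight line in log-log space. The sketch below assumes that approach and uses synthetic data; it is not the authors' analysis code, and the function name and data values are hypothetical.

```python
import math


def fit_scaling_exponent(widths, best_lrs):
    """Least-squares fit of log(lr) = slope * log(width) + log(prefactor).

    Returns (slope, prefactor). A slope of -1 means the best learning rate
    halves every time the width doubles.
    """
    xs = [math.log(w) for w in widths]
    ys = [math.log(lr) for lr in best_lrs]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    slope = cov / var
    intercept = mean_y - slope * mean_x
    return slope, math.exp(intercept)


if __name__ == "__main__":
    # Synthetic data constructed to follow lr = 0.5 / width exactly.
    widths = [256, 512, 1024, 2048]
    best_lrs = [0.5 / w for w in widths]
    print(fit_scaling_exponent(widths, best_lrs))
```

In a real sweep the measured optima are noisy, so the fitted slope is an estimate whose quality depends on how finely the learning rate grid was sampled.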

The key findings are:

The best learning rate scaling prescription would often have been excluded by the assumptions in prior work (see "Resolving Discrepancies in Compute-Optimal Scaling of Language Models").
All parameterizations, not just the "maximal update parameterization" (muP), can achieve good hyperparameter transfer when scaling up models. Moreover, a novel per-layer learning rate prescription for the standard parameterization outperforms muP (see "A Large-Scale Exploration of $\mu$-Transfer").
The epsilon parameter in the Adam optimizer must be scaled correctly to avoid gradient underflow, and the researchers propose a new, numerically stable, scale-invariant version called Adam-atan2 that eliminates the epsilon parameter entirely (see "On the Parameterization of Second-Order Optimization Effective Towards the Infinite Width").
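The idea behind Adam-atan2 can be sketched for a single scalar parameter: the usual ratio m / (sqrt(v) + eps) is replaced by atan2(m, sqrt(v)), which is bounded, defined even when both arguments are zero, and unchanged when the gradient is rescaled by a positive constant. This is a minimal sketch under that reading of the summary. The function name, default hyperparameters, and use of standard Adam bias correction are assumptions, and any scaling constants the authors apply are omitted.

```python
import math


def adam_atan2_step(param, grad, m, v, step, lr=1e-3, beta1=0.9, beta2=0.999):
    """One Adam-style update for a scalar, with atan2 in place of the epsilon ratio.

    `step` counts from 1. Returns the new (param, m, v).
    """
    # Exponential moving averages of the gradient and the squared gradient.
    m = beta1 * m + (1.0 - beta1) * grad
    v = beta2 * v + (1.0 - beta2) * grad * grad
    # Standard Adam bias correction (an assumption of this sketch).
    m_hat = m / (1.0 - beta1 ** step)
    v_hat = v / (1.0 - beta2 ** step)
    # atan2 needs no epsilon: math.atan2(0.0, 0.0) is 0.0, so a zero gradient
    # history yields a zero update instead of a division by zero.
    update = math.atan2(m_hat, math.sqrt(v_hat))
    return param - lr * update, m, v


if __name__ == "__main__":
    # Gradients differing by 40 orders of magnitude give the same first step.
    for grad in (1e-20, 1.0, 1e20):
        new_param, _, _ = adam_atan2_step(1.0, grad, 0.0, 0.0, step=1, lr=0.1)
        print(grad, new_param)
```

Because the update depends only on the ratio between m and sqrt(v), there is no epsilon to retune as gradient magnitudes change with model size.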

Critical Analysis

The paper provides a comprehensive and rigorous evaluation of the impact of parameterization and optimization choices on the ability to scale up large machine learning models. The extensive experiments and careful analysis yield valuable insights that can inform the design of future large-scale models.

One potential limitation is that the experiments are focused on language models, and the generalization to other model types and domains is not explicitly tested. Additionally, the paper does not explore the computational and memory costs associated with the different parameterization and optimization approaches, which could be an important consideration in practical applications.

The researchers also acknowledge that their work does not fully resolve the discrepancies between theoretical and empirical scaling laws for large language models, a topic discussed in "Navigating Scaling Laws: Compute-Optimality and Adaptive Model Design". Further research would be needed to fully understand the underlying principles driving the scaling behavior of these models.

Overall, the paper makes a significant contribution to the understanding of model scaling, and the insights and techniques presented could be valuable for researchers and practitioners working on the development of large-scale machine learning models.

Conclusion

This paper provides a new perspective on the relationship between model parameterization and the ability to effectively scale up large machine learning models. Through extensive empirical investigations, the researchers demonstrate that the best practices for scaling models, such as learning rate scaling, can be different from what was previously assumed.

The findings have important implications for the design and optimization of future large-scale models, as they suggest that the details of parameterization and optimization can have a substantial impact on the model's scaling behavior. The proposed Adam-atan2 optimizer, which eliminates the need for the epsilon parameter, also represents a practical contribution that could benefit the broader machine learning community.

Overall, this work advances our understanding of the fundamental principles governing the scalability of large machine learning models, which is a crucial challenge as models continue to grow in size and complexity.

Related Papers

Scaling Exponents Across Parameterizations and Optimizers

Katie Everett, Lechao Xiao, Mitchell Wortsman, Alexander A. Alemi, Roman Novak, Peter J. Liu, Izzeddin Gur, Jascha Sohl-Dickstein, Leslie Pack Kaelbling, Jaehoon Lee, Jeffrey Pennington

Robust and effective scaling of models from small to large width typically requires the precise adjustment of many algorithmic and architectural details, such as parameterization and optimizer choices. In this work, we propose a new perspective on parameterization by investigating a key assumption in prior work about the alignment between parameters and data and derive new theoretical results under weaker assumptions and a broader set of optimizers. Our extensive empirical investigation includes tens of thousands of models trained with all combinations of three optimizers, four parameterizations, several alignment assumptions, more than a dozen learning rates, and fourteen model sizes up to 26.8B parameters. We find that the best learning rate scaling prescription would often have been excluded by the assumptions in prior work. Our results show that all parameterizations, not just maximal update parameterization (muP), can achieve hyperparameter transfer; moreover, our novel per-layer learning rate prescription for standard parameterization outperforms muP. Finally, we demonstrate that an overlooked aspect of parameterization, the epsilon parameter in Adam, must be scaled correctly to avoid gradient underflow and propose Adam-atan2, a new numerically stable, scale-invariant version of Adam that eliminates the epsilon hyperparameter entirely.

7/17/2024

Resolving Discrepancies in Compute-Optimal Scaling of Language Models

Tomer Porian, Mitchell Wortsman, Jenia Jitsev, Ludwig Schmidt, Yair Carmon

Kaplan et al. and Hoffmann et al. developed influential scaling laws for the optimal model size as a function of the compute budget, but these laws yield substantially different predictions. We explain the discrepancy by reproducing the Kaplan scaling law on two datasets (OpenWebText2 and RefinedWeb) and identifying three factors causing the difference: last layer computational cost, warmup duration, and scale-dependent optimizer tuning. With these factors corrected, we obtain excellent agreement with the Hoffmann et al. (i.e., Chinchilla) scaling law. Counter to a hypothesis of Hoffmann et al., we find that careful learning rate decay is not essential for the validity of their scaling law. As a secondary result, we derive scaling laws for the optimal learning rate and batch size, finding that tuning the AdamW $\beta_2$ parameter is essential at lower batch sizes.

7/26/2024

A Large-Scale Exploration of $\mu$-Transfer

Lucas Lingle

Large artificial neural networks have become a mainstay of language, vision, and audio processing and synthesis, yet their initializations and learning rates are often set in an unsophisticated fashion, due to the high cost of hyperparameter sweeps at scale. The $\mu$-Parameterization ($\mu$P) offers a potential solution to this challenge, yielding scaling rules for model initialization and learning rates while reportedly enabling zero-shot hyperparameter transfer from small to large models. Despite its evident promise, the $\mu$P method is not yet widely adopted, perhaps due to higher implementation complexity, many variations, or complex theoretical background. This work investigates $\mu$P empirically, focusing on the ubiquitous transformer architecture, and aims to answer a simple question: does $\mu$-Transfer yield optimal learning rates in practice? Studying models of up to 10B parameters and training budgets of up to 190B tokens, we find $\mu$-Transfer works as intended for the majority of important cases, yet also identify a few cases where it may not.

6/27/2024

On the Parameterization of Second-Order Optimization Effective Towards the Infinite Width

Satoki Ishikawa, Ryo Karakida

Second-order optimization has been developed to accelerate the training of deep neural networks and it is being applied to increasingly larger-scale models. In this study, towards training on further larger scales, we identify a specific parameterization for second-order optimization that promotes feature learning in a stable manner even if the network width increases significantly. Inspired by a maximal update parameterization, we consider a one-step update of the gradient and reveal the appropriate scales of hyperparameters including random initialization, learning rates, and damping terms. Our approach covers two major second-order optimization algorithms, K-FAC and Shampoo, and we demonstrate that our parameterization achieves higher generalization performance in feature learning. In particular, it enables us to transfer the hyperparameters across models with different widths.

6/11/2024