An Empirical Study of Scaling Laws for Transfer

Read original: arXiv:2408.16947 - Published 9/2/2024 by Matthew Barnett

An Empirical Study of Scaling Laws for Transfer

Overview

This paper presents an empirical study on scaling laws for transfer learning.
Scaling laws describe how model performance scales with factors like model size, dataset size, and compute.
The researchers investigate how transfer learning performance scales compared to training from scratch.

Plain English Explanation

The paper examines how the performance of machine learning models changes as you scale up key factors like the size of the model and the training dataset. This is known as studying "scaling laws."

Specifically, the researchers looked at how transfer learning - where a model is first trained on a large general dataset and then fine-tuned on a specific task - scales compared to training a model from scratch on the specific task.

The findings can help guide decisions around how to effectively scale up machine learning systems for different applications.

Technical Explanation

The paper investigates how transfer learning performance scales compared to training from scratch, as a function of model size, dataset size, and other factors.

The researchers trained models of varying sizes on a source dataset, then fine-tuned them on a target task. They measured performance on the target task and compared it to models trained solely on the target task.

The results show that transfer learning can significantly outperform training from scratch, especially for small target datasets. The scaling laws governing transfer learning were found to differ from the laws for training from scratch.

The researchers also explored the role of hyperparameters and other factors in determining the optimal transfer learning strategy.

Critical Analysis

The paper provides valuable empirical insights into the scaling behavior of transfer learning. However, the analysis is limited to a specific experimental setup and dataset, so the generalizability of the findings remains to be seen.

Additionally, the paper does not delve into the mechanisms underlying the observed scaling laws. Further research is needed to develop a more fundamental understanding of why transfer learning scales differently than training from scratch.

Conclusion

This study offers important empirical guidance on how to effectively scale up machine learning systems using transfer learning. The findings can inform decisions around model and dataset size, as well as the optimal transfer learning strategy for a given application. However, more research is needed to fully unravel the scaling laws underlying transfer learning.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Follow @aimodelsfyi on 𝕏 →

Related Papers

An Empirical Study of Scaling Laws for Transfer

Matthew Barnett

We present a limited empirical study of scaling laws for transfer learning in transformer models. More specifically, we examine a scaling law that incorporates a transfer gap term, indicating the effectiveness of pre-training on one distribution when optimizing for downstream performance on another distribution. When the transfer gap is low, pre-training is a cost-effective strategy for improving downstream performance. Conversely, when the gap is high, collecting high-quality fine-tuning data becomes relatively more cost effective. Fitting the scaling law to experiments from diverse datasets reveals significant variations in the transfer gap across distributions. In theory, the scaling law can inform optimal data allocation strategies and highlights how the scarcity of downstream data can bottleneck performance. Our findings contribute to a principled way to measure transfer learning efficiency and understand how data availability affects capabilities.

9/2/2024

Unraveling the Mystery of Scaling Laws: Part I

Hui Su, Zhi Tian, Xiaoyu Shen, Xunliang Cai

Scaling law principles indicate a power-law correlation between loss and variables such as model size, dataset size, and computational resources utilized during training. These principles play a vital role in optimizing various aspects of model pre-training, ultimately contributing to the success of large language models such as GPT-4, Llama and Gemini. However, the original scaling law paper by OpenAI did not disclose the complete details necessary to derive the precise scaling law formulas, and their conclusions are only based on models containing up to 1.5 billion parameters. Though some subsequent works attempt to unveil these details and scale to larger models, they often neglect the training dependency of important factors such as the learning rate, context length and batch size, leading to their failure to establish a reliable formula for predicting the test loss trajectory. In this technical report, we confirm that the scaling law formulations proposed in the original OpenAI paper remain valid when scaling the model size up to 33 billion, but the constant coefficients in these formulas vary significantly with the experiment setup. We meticulously identify influential factors and provide transparent, step-by-step instructions to estimate all constant terms in scaling-law formulas by training on models with only 1M~60M parameters. Using these estimated formulas, we showcase the capability to accurately predict various attributes for models with up to 33B parameters before their training, including (1) the minimum possible test loss; (2) the minimum required training steps and processed tokens to achieve a specific loss; (3) the critical batch size with an optimal time/computation trade-off at any loss value; and (4) the complete test loss trajectory with arbitrary batch size.

4/8/2024

Language models scale reliably with over-training and on downstream tasks

Samir Yitzhak Gadre, Georgios Smyrnis, Vaishaal Shankar, Suchin Gururangan, Mitchell Wortsman, Rulin Shao, Jean Mercat, Alex Fang, Jeffrey Li, Sedrick Keh, Rui Xin, Marianna Nezhurina, Igor Vasiljevic, Jenia Jitsev, Luca Soldaini, Alexandros G. Dimakis, Gabriel Ilharco, Pang Wei Koh, Shuran Song, Thomas Kollar, Yair Carmon, Achal Dave, Reinhard Heckel, Niklas Muennighoff, Ludwig Schmidt

Scaling laws are useful guides for derisking expensive training runs, as they predict performance of large models using cheaper, small-scale experiments. However, there remain gaps between current scaling studies and how language models are ultimately trained and evaluated. For instance, scaling is usually studied in the compute-optimal training regime (i.e., Chinchilla optimal regime). In contrast, models are often over-trained to reduce inference costs. Moreover, scaling laws mostly predict loss on next-token prediction, but models are usually compared on downstream task performance. To address both shortcomings, we create a testbed of 104 models with 0.011B to 6.9B parameters trained with various numbers of tokens on three data distributions. First, we fit scaling laws that extrapolate in both the amount of over-training and the number of model parameters. This enables us to predict the validation loss of a 1.4B parameter, 900B token run (i.e., 32$times$ over-trained) and a 6.9B parameter, 138B token run (i.e., a compute-optimal run)$unicode{x2014}$each from experiments that take 300$times$ less compute. Second, we relate the perplexity of a language model to its downstream task performance by proposing a power law. We use this law to predict top-1 error averaged over downstream tasks for the two aforementioned models, using experiments that take 20$times$ less compute. Our experiments are available at https://github.com/mlfoundations/scaling.

6/18/2024

💬

Selecting Large Language Model to Fine-tune via Rectified Scaling Law

Haowei Lin, Baizhou Huang, Haotian Ye, Qinyu Chen, Zihao Wang, Sujian Li, Jianzhu Ma, Xiaojun Wan, James Zou, Yitao Liang

The ever-growing ecosystem of LLMs has posed a challenge in selecting the most appropriate pre-trained model to fine-tune amidst a sea of options. Given constrained resources, fine-tuning all models and making selections afterward is unrealistic. In this work, we formulate this resource-constrained selection task into predicting fine-tuning performance and illustrate its natural connection with Scaling Law. Unlike pre-training, we find that the fine-tuning scaling curve includes not just the well-known power phase but also the previously unobserved pre-power phase. We also explain why existing Scaling Law fails to capture this phase transition phenomenon both theoretically and empirically. To address this, we introduce the concept of pre-learned data size into our Rectified Scaling Law, which overcomes theoretical limitations and fits experimental results much better. By leveraging our law, we propose a novel LLM selection algorithm that selects the near-optimal model with hundreds of times less resource consumption, while other methods may provide negatively correlated selection. The project page is available at rectified-scaling-law.github.io.

5/29/2024