Why Has Predicting Downstream Capabilities of Frontier AI Models with Scale Remained Elusive?

Read original: arXiv:2406.04391 - Published 6/10/2024 by Rylan Schaeffer, Hailey Schoelkopf, Brando Miranda, Gabriel Mukobi, Varun Madan, Adam Ibrahim, Herbie Bradley, Stella Biderman, Sanmi Koyejo

Overview

Predicting how the performance of advanced AI systems changes as they scale is a highly desirable capability, but has proven challenging.
While scaling laws for pretraining performance are well established, predicting how specific downstream capabilities scale remains elusive.
This paper investigates why modeling the scaling of downstream performance on common benchmarks has been so difficult.

Plain English Explanation

Imagine you have a very advanced AI system, and as you scale it up with more parameters, data, and compute, you want to predict how well it will perform on different tasks. Researchers have worked out how to predict the system's pretraining performance (how well it predicts the next token) as it scales, but predicting how specific capabilities will scale has been much harder.

This paper looks at why it's been so tough to model the scaling of performance on common benchmarks, which are tests designed to measure an AI's abilities in areas like question-answering. The key insight is that these benchmarks score the correct answer against a small number of specific wrong answers. Accurately predicting scaling for these benchmarks requires modeling not just how confident the AI gets in the right answer, but also how its confidence in the wrong answers changes as the system scales.

The researchers show that this weakens the statistical relationship between benchmark scores and scale, which explains why pretraining scaling laws tend to be more predictable than the scaling of downstream capabilities. Their work points the way towards developing evaluations of advanced AI systems that are more amenable to predictive scaling laws.

Technical Explanation

The researchers used five different model families and twelve well-established multiple-choice question-answering benchmarks to investigate why predicting scaling of downstream performance has been so challenging. They found that the process of computing downstream performance from negative log likelihoods involves a sequence of transformations that progressively degrade the statistical relationship between performance and scale.
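To make the "sequence of transformations" concrete, here is a minimal Python sketch of one way a multiple-choice score could be computed from negative log likelihoods. The specific steps (exponentiating, renormalizing over the listed choices, then thresholding into accuracy), the function name `multiple_choice_metrics`, and the input numbers are illustrative assumptions, not details taken from the paper.

```python
import math

def multiple_choice_metrics(nll_per_choice, correct_index):
    """Turn per-choice negative log likelihoods into downstream metrics.

    The intermediate steps are an assumed, typical scoring pipeline,
    not necessarily the exact one used in the paper.
    """
    # Step 1: negative log likelihood -> probability mass on each choice.
    p_vocab = [math.exp(-nll) for nll in nll_per_choice]

    # Step 2: renormalize so the mass over the listed choices sums to 1.
    total = sum(p_vocab)
    p_choices = [p / total for p in p_vocab]

    # Step 3: accuracy is a hard threshold on which choice has the most mass.
    predicted_index = max(range(len(p_choices)), key=lambda i: p_choices[i])
    accuracy = 1.0 if predicted_index == correct_index else 0.0

    return {
        "p_vocab_correct": p_vocab[correct_index],
        "p_choices_correct": p_choices[correct_index],
        "accuracy": accuracy,
    }

# Hypothetical negative log likelihoods for four answer choices.
metrics = multiple_choice_metrics([1.0, 2.0, 3.0, 4.0], correct_index=0)
```

Each step discards or mixes in information: step 2 makes the score depend on the incorrect choices, and step 3 collapses a continuous quantity into a 0/1 outcome.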

The key mechanism causing this degradation is that downstream metrics require comparing the correct choice against a small number of specific incorrect choices. This means accurately predicting downstream capabilities requires predicting not just how probability mass concentrates on the correct choice with scale, but also how probability mass fluctuates on specific incorrect choices with scale.
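A small sketch of this mechanism, using made-up numbers: the probability mass on the correct choice rises steadily across three hypothetical scales, yet the question flips from right to wrong and back because the mass on one incorrect choice fluctuates. The helper `is_answered_correctly` and all values are assumptions for illustration only.

```python
def is_answered_correctly(p_correct, p_incorrect):
    """Return True when the correct choice outscores every incorrect choice."""
    return p_correct > max(p_incorrect)

# Hypothetical probability masses at three increasing scales.
# Each entry: (mass on the correct choice, masses on the incorrect choices).
checkpoints = [
    (0.20, [0.10, 0.05, 0.05]),
    (0.30, [0.35, 0.05, 0.05]),
    (0.40, [0.25, 0.05, 0.05]),
]

# Mass on the correct choice grows monotonically with scale...
correct_mass = [p_correct for p_correct, _ in checkpoints]

# ...but whether the question is answered correctly does not.
outcomes = [is_answered_correctly(p, wrong) for p, wrong in checkpoints]
```

In this toy setup, knowing only `correct_mass` would not be enough to predict `outcomes`; the incorrect-choice masses are needed as well.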

The researchers empirically study how probability mass on the correct choice co-varies with probability mass on incorrect choices as compute increases, suggesting that scaling laws for incorrect choices might be achievable. Their work also explains why pretraining scaling laws are commonly regarded as more predictable than downstream capabilities, and points towards establishing scaling-predictable evaluations of frontier AI models.
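One simple way to quantify this kind of co-variation is a Pearson correlation between the two masses across checkpoints. The sketch below is only an illustration of that idea: the statistic chosen, the function `pearson_correlation`, and the data are assumptions, and the numbers do not come from the paper.

```python
import math

def pearson_correlation(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical masses measured at four checkpoints of increasing compute.
p_correct = [0.10, 0.20, 0.30, 0.40]          # mass on the correct choice
p_incorrect_total = [0.05, 0.08, 0.12, 0.14]  # summed mass on incorrect choices

r = pearson_correlation(p_correct, p_incorrect_total)
```

A value of `r` near 1 or -1 would indicate that the incorrect-choice mass moves in a regular way with the correct-choice mass, which is the kind of regularity a scaling law for incorrect choices would need.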

Critical Analysis

The paper provides a valuable contribution by identifying a key factor that makes modeling the scaling of downstream performance so challenging: the need to predict how probability mass fluctuates on specific incorrect choices, not just the correct choice. This insight helps explain why pretraining scaling laws are more straightforward to derive.

However, the paper does not explore in depth the potential causes of the observed fluctuations in probability mass on incorrect choices. Further research would be needed to fully understand the mechanisms underlying this phenomenon and whether there are ways to make it more amenable to predictive scaling laws.

Additionally, the paper focuses solely on multiple-choice question-answering benchmarks. While these are widely used, it would be important to investigate whether the same issues arise for other types of downstream tasks and evaluation metrics. Expanding the analysis to a broader set of benchmarks could provide a more comprehensive understanding of the challenges in modeling scaling of specific AI capabilities.

Conclusion

This paper makes an important contribution by identifying a key factor that has made it challenging to model the scaling of downstream performance for advanced AI systems. By showing that accurately predicting downstream capabilities requires modeling not just the correct choice, but also the fluctuations in probability mass on specific incorrect choices, the researchers have shed light on why pretraining scaling laws tend to be more predictable.

Their work points the way towards developing evaluation frameworks for frontier AI models that are more amenable to predictive scaling laws. As AI systems continue to grow in scale, the ability to reliably forecast their performance on specific tasks will be crucial for guiding research and development. This paper is a step in that direction.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Why Has Predicting Downstream Capabilities of Frontier AI Models with Scale Remained Elusive?

Rylan Schaeffer, Hailey Schoelkopf, Brando Miranda, Gabriel Mukobi, Varun Madan, Adam Ibrahim, Herbie Bradley, Stella Biderman, Sanmi Koyejo

Predictable behavior from scaling advanced AI systems is an extremely desirable property. Although a well-established literature exists on how pretraining performance scales, the literature on how particular downstream capabilities scale is significantly muddier. In this work, we take a step back and ask: why has predicting specific downstream capabilities with scale remained elusive? While many factors are certainly responsible, we identify a new factor that makes modeling scaling behavior on widely used multiple-choice question-answering benchmarks challenging. Using five model families and twelve well-established multiple-choice benchmarks, we show that downstream performance is computed from negative log likelihoods via a sequence of transformations that progressively degrade the statistical relationship between performance and scale. We then reveal the mechanism causing this degradation: downstream metrics require comparing the correct choice against a small number of specific incorrect choices, meaning accurately predicting downstream capabilities requires predicting not just how probability mass concentrates on the correct choice with scale, but also how probability mass fluctuates on specific incorrect choices with scale. We empirically study how probability mass on the correct choice co-varies with probability mass on incorrect choices with increasing compute, suggesting that scaling laws for incorrect choices might be achievable. Our work also explains why pretraining scaling laws are commonly regarded as more predictable than downstream capabilities and contributes towards establishing scaling-predictable evaluations of frontier AI models.

6/10/2024

Language models scale reliably with over-training and on downstream tasks

Samir Yitzhak Gadre, Georgios Smyrnis, Vaishaal Shankar, Suchin Gururangan, Mitchell Wortsman, Rulin Shao, Jean Mercat, Alex Fang, Jeffrey Li, Sedrick Keh, Rui Xin, Marianna Nezhurina, Igor Vasiljevic, Jenia Jitsev, Luca Soldaini, Alexandros G. Dimakis, Gabriel Ilharco, Pang Wei Koh, Shuran Song, Thomas Kollar, Yair Carmon, Achal Dave, Reinhard Heckel, Niklas Muennighoff, Ludwig Schmidt

Scaling laws are useful guides for derisking expensive training runs, as they predict performance of large models using cheaper, small-scale experiments. However, there remain gaps between current scaling studies and how language models are ultimately trained and evaluated. For instance, scaling is usually studied in the compute-optimal training regime (i.e., Chinchilla optimal regime). In contrast, models are often over-trained to reduce inference costs. Moreover, scaling laws mostly predict loss on next-token prediction, but models are usually compared on downstream task performance. To address both shortcomings, we create a testbed of 104 models with 0.011B to 6.9B parameters trained with various numbers of tokens on three data distributions. First, we fit scaling laws that extrapolate in both the amount of over-training and the number of model parameters. This enables us to predict the validation loss of a 1.4B parameter, 900B token run (i.e., 32× over-trained) and a 6.9B parameter, 138B token run (i.e., a compute-optimal run), each from experiments that take 300× less compute. Second, we relate the perplexity of a language model to its downstream task performance by proposing a power law. We use this law to predict top-1 error averaged over downstream tasks for the two aforementioned models, using experiments that take 20× less compute. Our experiments are available at https://github.com/mlfoundations/scaling.

6/18/2024

Scaling Laws Do Not Scale

Fernando Diaz, Michael Madaio

Recent work has advocated for training AI models on ever-larger datasets, arguing that as the size of a dataset increases, the performance of a model trained on that dataset will correspondingly increase (referred to as scaling laws). In this paper, we draw on literature from the social sciences and machine learning to critically interrogate these claims. We argue that this scaling law relationship depends on metrics used to measure performance that may not correspond with how different groups of people perceive the quality of models' output. As the size of datasets used to train large AI models grows and AI systems impact ever larger groups of people, the number of distinct communities represented in training or evaluation datasets grows. It is thus even more likely that communities represented in datasets may have values or preferences not reflected in (or at odds with) the metrics used to evaluate model performance in scaling laws. Different communities may also have values in tension with each other, leading to difficult, potentially irreconcilable choices about metrics used for model evaluations -- threatening the validity of claims that model performance is improving at scale. We end the paper with implications for AI development: that the motivation for scraping ever-larger datasets may be based on fundamentally flawed assumptions about model performance. That is, models may not, in fact, continue to improve as the datasets get larger -- at least not for all people or communities impacted by those models. We suggest opportunities for the field to rethink norms and values in AI development, resisting claims for universality of large models, fostering more local, small-scale designs, and other ways to resist the impetus towards scale in AI.

7/30/2024

Collaborative Performance Prediction for Large Language Models

Qiyuan Zhang, Fuyuan Lyu, Xue Liu, Chen Ma

Comprehensively understanding and accurately predicting the performance of large language models across diverse downstream tasks has emerged as a pivotal challenge in NLP research. The pioneering scaling law on downstream works demonstrated intrinsic similarities within model families and utilized such similarities for performance prediction. However, they tend to overlook the similarities between model families and only consider design factors listed in the original scaling law. To overcome these limitations, we introduce a novel framework, Collaborative Performance Prediction (CPP), which significantly enhances prediction accuracy by leveraging the historical performance of various models on downstream tasks and other design factors for both model and task. We also collect a collaborative data sourced from online platforms containing both historical performance and additional design factors. With the support of the collaborative data, CPP not only surpasses traditional scaling laws in predicting the performance of scaled LLMs but also facilitates a detailed analysis of factor importance, an area previously overlooked.

7/2/2024