Will we run out of data? Limits of LLM scaling based on human-generated data

Read original: arXiv:2211.04325 - Published 6/6/2024 by Pablo Villalobos, Anson Ho, Jaime Sevilla, Tamay Besiroglu, Lennart Heim, Marius Hobbhahn

Overview

This paper investigates the potential constraints on the scaling of large language models (LLMs) due to the availability of public human-generated text data.
The researchers forecast the growing demand for training data based on current trends and estimate the total stock of public human text data.
They explore how progress in language modeling can continue when human-generated text datasets cannot be scaled any further.

Plain English Explanation

As large language models (LLMs) like GPT-3 and BERT have become increasingly powerful, there is a growing demand for the vast amounts of text data needed to train them. The authors of this paper examine whether the supply of publicly available human-generated text data will be able to keep up with the growing appetite for training data.

The researchers project that if current trends in LLM development continue, the models will be trained on datasets roughly equal in size to the total available stock of public human text data between 2026 and 2032, or even slightly earlier if the models are overtrained. This suggests that we may be approaching the limits of what can be achieved by simply scaling up the training data.

To overcome this potential bottleneck, the authors propose several alternative strategies. These include generating synthetic data, leveraging transfer learning from data-rich domains, and improving the data efficiency of language models. By exploring these approaches, the researchers aim to identify ways for progress in language modeling to continue even when human-generated text datasets reach their limits.

Technical Explanation

The researchers analyzed current trends in LLM development and the available stock of public human text data to assess potential constraints on model scaling. They forecast the growing demand for training data based on those trends and on scaling laws, under which the compute-optimal dataset size grows roughly with the square root of the training compute.
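To make the scaling-law step concrete, here is a minimal sketch of one widely used compute-optimal rule of thumb. It assumes training compute is roughly 6 FLOPs per parameter per token and that a compute-optimal model sees about 20 tokens per parameter. Both constants and the function names are assumptions chosen for illustration and are not taken from this summary.

```python
import math

# Assumed rule-of-thumb constants (not taken from the summary above).
FLOPS_PER_PARAM_TOKEN = 6.0  # training compute C is roughly 6 * N * D
TOKENS_PER_PARAM = 20.0      # compute-optimal D is roughly 20 * N


def compute_optimal_tokens(compute_flops: float) -> float:
    """Return the dataset size D (tokens) implied by a compute budget C.

    Substituting N = D / 20 into C = 6 * N * D gives D = sqrt(20 * C / 6),
    so D grows with the square root of C.
    """
    if compute_flops <= 0:
        raise ValueError("compute_flops must be positive")
    return math.sqrt(TOKENS_PER_PARAM * compute_flops / FLOPS_PER_PARAM_TOKEN)


def compute_optimal_params(compute_flops: float) -> float:
    """Return the matching parameter count N under the same assumptions."""
    return compute_optimal_tokens(compute_flops) / TOKENS_PER_PARAM


if __name__ == "__main__":
    for budget in (1e22, 1e24, 1e26):  # hypothetical compute budgets in FLOPs
        tokens = compute_optimal_tokens(budget)
        print(f"C = {budget:.0e} FLOPs -> D = {tokens:.2e} tokens")
```

Under these assumptions, a 100-fold increase in compute calls for only a 10-fold increase in data, which is why data demand grows more slowly than compute but still grows exponentially when compute does.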

The authors then estimated the total stock of public human text data by aggregating various web crawl datasets, Wikipedia, and other openly available sources. Their analysis indicates that if current LLM development trends continue, models will be trained on datasets roughly equal in size to the available stock of public human text data between 2026 and 2032, or even slightly earlier if the models are overtrained.
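The comparison of demand and supply can be sketched as a simple crossover calculation. The snippet below assumes that dataset sizes grow by a constant factor each year and that the stock of public text is fixed. The function name, the growth model, and all example numbers are hypothetical placeholders, not the paper's estimates.

```python
import math


def exhaustion_year(
    start_year: float,
    start_tokens: float,
    growth_per_year: float,
    stock_tokens: float,
    overtraining_factor: float = 1.0,
) -> float:
    """Estimate when training-set size reaches the available stock.

    Demand is modeled as
        start_tokens * overtraining_factor * growth_per_year ** (year - start_year)
    and the function solves for the year at which demand equals stock_tokens.
    An overtraining_factor above 1 models training on more tokens than the
    compute-optimal amount, which pulls the crossover earlier.
    """
    if growth_per_year <= 1.0:
        raise ValueError("growth_per_year must be greater than 1")
    if min(start_tokens, stock_tokens, overtraining_factor) <= 0:
        raise ValueError("token counts and overtraining_factor must be positive")

    demand_now = start_tokens * overtraining_factor
    if demand_now >= stock_tokens:
        return start_year  # demand already meets or exceeds the stock
    return start_year + math.log(stock_tokens / demand_now) / math.log(growth_per_year)


if __name__ == "__main__":
    # All values below are illustrative placeholders.
    baseline = exhaustion_year(2024, 1.5e13, 2.5, 3e14)
    overtrained = exhaustion_year(2024, 1.5e13, 2.5, 3e14, overtraining_factor=5.0)
    print(f"baseline crossover:    {baseline:.1f}")
    print(f"overtrained crossover: {overtrained:.1f}")
```

Because demand grows exponentially, even large uncertainty in the stock estimate shifts the crossover by only a few years, which is consistent with the fairly narrow 2026 to 2032 window reported above.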

To address this potential bottleneck, the researchers explore several strategies. These include generating synthetic data using large language models, leveraging transfer learning from data-rich domains, and improving the data efficiency of language models. By pursuing these approaches, the authors aim to identify ways for progress in language modeling to continue even when human-generated text datasets reach their limits.

Critical Analysis

The paper provides a thoughtful analysis of the potential constraints on LLM scaling posed by the availability of public human-generated text data. The researchers make a compelling case that we may be approaching the limits of what can be achieved by simply scaling up the training data.

However, the paper does not address the potential impact of alternative data sources, such as private or proprietary datasets held by large technology companies. It also does not consider the possibility of further advancements in data augmentation techniques or the emergence of new, more efficient model architectures.

Additionally, the paper focuses primarily on the technical challenges and does not delve into the broader societal implications of the growing reliance on synthetic data or the potential risks of over-reliance on language models trained on limited data sources. Further research in these areas would be valuable.

Conclusion

This paper highlights a critical challenge facing the continued progress of large language models: the potential constraints posed by the availability of public human-generated text data. The researchers provide a thoughtful analysis of this issue and propose several strategies to overcome this bottleneck, such as synthetic data generation, transfer learning, and improved data efficiency.

This work has important implications for the future development of large language models and their potential impact on various domains, from natural language processing to artificial intelligence more broadly.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Will we run out of data? Limits of LLM scaling based on human-generated data

Pablo Villalobos, Anson Ho, Jaime Sevilla, Tamay Besiroglu, Lennart Heim, Marius Hobbhahn

We investigate the potential constraints on LLM scaling posed by the availability of public human-generated text data. We forecast the growing demand for training data based on current trends and estimate the total stock of public human text data. Our findings indicate that if current LLM development trends continue, models will be trained on datasets roughly equal in size to the available stock of public human text data between 2026 and 2032, or slightly earlier if models are overtrained. We explore how progress in language modeling can continue when human-generated text datasets cannot be scaled any further. We argue that synthetic data generation, transfer learning from data-rich domains, and data efficiency improvements might support further progress.

6/6/2024

The Curse of Recursion: Training on Generated Data Makes Models Forget

Ilia Shumailov, Zakhar Shumaylov, Yiren Zhao, Yarin Gal, Nicolas Papernot, Ross Anderson

Stable Diffusion revolutionised image creation from descriptive text. GPT-2, GPT-3(.5) and GPT-4 demonstrated astonishing performance across a variety of language tasks. ChatGPT introduced such language models to the general public. It is now clear that large language models (LLMs) are here to stay, and will bring about drastic change in the whole ecosystem of online text and images. In this paper we consider what the future might hold. What will happen to GPT-{n} once LLMs contribute much of the language found online? We find that use of model-generated content in training causes irreversible defects in the resulting models, where tails of the original content distribution disappear. We refer to this effect as Model Collapse and show that it can occur in Variational Autoencoders, Gaussian Mixture Models and LLMs. We build theoretical intuition behind the phenomenon and portray its ubiquity amongst all learned generative models. We demonstrate that it has to be taken seriously if we are to sustain the benefits of training from large-scale data scraped from the web. Indeed, the value of data collected about genuine human interactions with systems will be increasingly valuable in the presence of content generated by LLMs in data crawled from the Internet.

4/16/2024

Global Data Constraints: Ethical and Effectiveness Challenges in Large Language Model

Jin Yang, Zhiqiang Wang, Yanbin Lin, Zunduo Zhao

Recent advancements in large language models (LLMs), such as GPT-4 and GPT-4o, have shown exceptional performance, especially in languages with abundant resources like English, thanks to extensive datasets that ensure robust training. Conversely, these models exhibit limitations when processing under-resourced languages such as Chinese and Korean, where issues including hallucinatory responses remain prevalent. This paper traces the roots of these disparities to the tokenization process inherent to these models. Specifically, it explores how the tokenizer vocabulary, often used to speed up the tokenization process and reduce tokens but constructed independently of the actual model training data, inadequately represents non-English languages. This misrepresentation results in the propagation of 'under-trained' or 'untrained' tokens, which perpetuate biases and pose serious concerns related to data security and ethical standards. We aim to dissect the tokenization mechanics of GPT-4o, illustrating how its simplified token-handling methods amplify these risks and offer strategic solutions to mitigate associated security and ethical issues. Through this study, we emphasize the critical need to rethink tokenization frameworks to foster more equitable and secure AI technologies.

8/13/2024

Data Generation using Large Language Models for Text Classification: An Empirical Case Study

Yinheng Li, Rogerio Bonatti, Sara Abdali, Justin Wagle, Kazuhito Koishida

Using Large Language Models (LLMs) to generate synthetic data for model training has become increasingly popular in recent years. While LLMs are capable of producing realistic training data, the effectiveness of data generation is influenced by various factors, including the choice of prompt, task complexity, and the quality, quantity, and diversity of the generated data. In this work, we focus exclusively on using synthetic data for text classification tasks. Specifically, we use natural language understanding (NLU) models trained on synthetic data to assess the quality of synthetic data from different generation approaches. This work provides an empirical analysis of the impact of these factors and offers recommendations for better data generation practices.

7/23/2024