Is Model Collapse Inevitable? Breaking the Curse of Recursion by Accumulating Real and Synthetic Data

2404.01413

Published 5/1/2024 by Matthias Gerstgrasser, Rylan Schaeffer, Apratim Dey, Rafael Rafailov, Henry Sleight, John Hughes, Tomasz Korbak, Rajashree Agrawal, Dhruv Pai, Andrey Gromov and 4 others

cs.LG cs.AI cs.CL cs.ET stat.ML

Abstract

The proliferation of generative models, combined with pretraining on web-scale data, raises a timely question: what happens when these models are trained on their own generated outputs? Recent investigations into model-data feedback loops proposed that such loops would lead to a phenomenon termed model collapse, under which performance progressively degrades with each model-data feedback iteration until fitted models become useless. However, those studies largely assumed that new data replace old data over time, where an arguably more realistic assumption is that data accumulate over time. In this paper, we ask: what effect does accumulating data have on model collapse? We empirically study this question by pretraining sequences of language models on text corpora. We confirm that replacing the original real data by each generation's synthetic data does indeed tend towards model collapse, then demonstrate that accumulating the successive generations of synthetic data alongside the original real data avoids model collapse; these results hold across a range of model sizes, architectures, and hyperparameters. We obtain similar results for deep generative models on other types of real data: diffusion models for molecule conformation generation and variational autoencoders for image generation. To understand why accumulating data can avoid model collapse, we use an analytically tractable framework introduced by prior work in which a sequence of linear models are fit to the previous models' outputs. Previous work used this framework to show that if data are replaced, the test error increases with the number of model-fitting iterations; we extend this argument to prove that if data instead accumulate, the test error has a finite upper bound independent of the number of iterations, meaning model collapse no longer occurs.

Overview

The paper investigates "model collapse": the progressive degradation in performance that occurs when generative models are repeatedly trained on their own generated outputs, until the fitted models become useless.
The authors observe that earlier studies largely assumed that each generation's synthetic data replace the previous data, and argue that a more realistic assumption is that data accumulate over time.
Through experiments on language models, diffusion models, and variational autoencoders, together with a theoretical analysis of linear models, the paper shows that accumulating synthetic data alongside the original real data avoids the collapse (the "curse of recursion") that appears when data are replaced.

Plain English Explanation

The problem the researchers tackle is model collapse, which can happen when a generative model is trained on data produced by earlier versions of itself. With each round of this feedback loop, performance gets a little worse, until the model is no longer useful.

Earlier studies of this problem largely assumed that each new batch of generated data replaces the data that came before it. The researchers argue that a more realistic assumption is that data accumulate: old data stay around while new data pile up on top. They therefore test what happens when each generation's synthetic data are added to the original real data and to all earlier synthetic data, instead of replacing them.

Through experiments and mathematical analysis, the paper shows that accumulating data breaks the "curse of recursion", the cycle in which each generation of models inherits and compounds the errors of the one before. When data are replaced, models get steadily worse; when synthetic data accumulate alongside the original real data, collapse is avoided.
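The difference between the two regimes can be illustrated with a toy model that is not taken from the paper: repeatedly fitting a one-dimensional Gaussian and sampling from the fit. The function name `run_generations`, the sample size, and the number of generations below are illustrative assumptions, chosen only to make the effect visible.

```python
import numpy as np

def run_generations(real_data, n_generations, accumulate, rng):
    """Repeatedly fit a 1-D Gaussian to the training set, then sample from it.

    Toy illustration only (an assumption of this sketch, not the paper's setup).
    """
    n = len(real_data)
    train = real_data.copy()
    for _ in range(n_generations):
        # "Train" a model: estimate mean and standard deviation.
        mu, sigma = train.mean(), train.std()
        # "Generate" synthetic data from the fitted model.
        synthetic = rng.normal(mu, sigma, size=n)
        if accumulate:
            # Keep everything seen so far and add the new synthetic data.
            train = np.concatenate([train, synthetic])
        else:
            # Discard old data; train only on the newest synthetic data.
            train = synthetic
    return train

rng = np.random.default_rng(0)
real = rng.normal(0.0, 1.0, size=10)
replaced = run_generations(real, 300, accumulate=False, rng=rng)
accumulated = run_generations(real, 300, accumulate=True, rng=rng)

print("variance after replacing:   ", replaced.var())
print("variance after accumulating:", accumulated.var())
```

In this sketch the variance under replacement shrinks towards zero, a simple form of the tails of the distribution disappearing, while the variance under accumulation stays on the order of the original data's variance.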

Technical Explanation

The theoretical analysis builds on an analytically tractable framework introduced by prior work, in which a sequence of linear models is fit to the previous models' outputs. Prior work used this framework to show that if each generation's data replace the previous data, the test error increases with the number of model-fitting iterations.
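To make the linear setting concrete, the following is a sketch of how the recursion can be written down, with the accumulating variant included for comparison. The notation (true weights $w^*$, design matrices $X_t$ with $T$ rows and $d$ columns, noise level $\sigma$, pseudoinverse $^{+}$) and the specific constants are our reading of the paper and of the prior work it builds on, stated for isotropic Gaussian features with $T > d + 1$; they should be checked against the original sources before being relied on.

```latex
\begin{align*}
\tilde{y}_t &= X_t \hat{w}_{t-1} + E_t,
  \qquad E_t \sim \mathcal{N}(0, \sigma^2 I_T),
  \qquad \hat{w}_0 = w^* \\
\text{replace:}\quad \hat{w}_t &= X_t^{+}\, \tilde{y}_t \\
\text{accumulate:}\quad \hat{w}_t &=
  \begin{bmatrix} X_1 \\ \vdots \\ X_t \end{bmatrix}^{+}
  \begin{bmatrix} \tilde{y}_1 \\ \vdots \\ \tilde{y}_t \end{bmatrix} \\
E_{\text{test}}^{\text{replace}}(\hat{w}_n) &= \frac{\sigma^2 d}{T - d - 1}\, n \\
E_{\text{test}}^{\text{accumulate}}(\hat{w}_n) &\le \frac{\sigma^2 d}{T - d - 1} \sum_{i=1}^{n} \frac{1}{i^2}
  \le \frac{\sigma^2 d}{T - d - 1} \cdot \frac{\pi^2}{6}
\end{align*}
```

The contrast matches the abstract's statement: under replacement the error grows with the number of iterations $n$, while under accumulation it is bounded by a constant that does not depend on $n$.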

The alternative the paper studies, referred to in this summary as "Accumulating Real and Synthetic Data" (ARSD), is to keep the original real data and add each generation's synthetic data to the training set instead of discarding what came before. Each new model is then trained on the real data plus all synthetic data generated so far.

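For the accumulating case, the researchers prove, for sequences of linear models, that the test error has a finite upper bound independent of the number of iterations, meaning model collapse no longer occurs. Empirically, they pretrain sequences of language models on text corpora and confirm that replacing data tends towards collapse while accumulating data avoids it, across a range of model sizes, architectures, and hyperparameters. They report similar results for diffusion models for molecule conformation generation and variational autoencoders for image generation.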
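The linear-model result can be checked numerically with a small simulation. This is a minimal sketch under assumptions of our own (isotropic Gaussian features, ordinary least squares, and the hypothetical function name `simulate` with its default sizes), not the authors' code. Each iteration labels fresh inputs with the previous model plus noise, then refits either on that batch alone or on all batches so far.

```python
import numpy as np

def simulate(n_iters, accumulate, d=10, T=100, sigma=1.0, n_trials=50, seed=0):
    """Average squared error ||w_hat - w_true||^2 per model-fitting iteration.

    For isotropic test features this equals the excess test error.
    All sizes and defaults are illustrative assumptions.
    """
    rng = np.random.default_rng(seed)
    errors = np.zeros(n_iters)
    for _ in range(n_trials):
        w_true = rng.normal(size=d)
        w_prev = w_true  # the first batch is labelled by the true model
        X_all, y_all = [], []
        for t in range(n_iters):
            X = rng.normal(size=(T, d))
            # Labels come from the previous fitted model plus fresh noise.
            y = X @ w_prev + sigma * rng.normal(size=T)
            if accumulate:
                X_all.append(X)
                y_all.append(y)
            else:
                X_all, y_all = [X], [y]
            # Ordinary least squares on the current training set.
            w_prev, *_ = np.linalg.lstsq(
                np.vstack(X_all), np.concatenate(y_all), rcond=None
            )
            errors[t] += np.sum((w_prev - w_true) ** 2)
    return errors / n_trials

replace_err = simulate(20, accumulate=False)
accum_err = simulate(20, accumulate=True)
print("replace:   ", np.round(replace_err[[0, 9, 19]], 3))
print("accumulate:", np.round(accum_err[[0, 9, 19]], 3))
```

With these settings the error under replacement keeps growing with the iteration count, while the error under accumulation levels off, in line with the bounded test error described above.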

Critical Analysis

The paper pairs a clear theoretical account of model collapse with experiments across several model families. The comparison between replacing and accumulating data is well motivated, and the reported results hold across a range of model sizes, architectures, and hyperparameters.

One potential limitation is that the theoretical guarantee is proved only for a sequence of linear models. For non-linear deep generative models (language models, diffusion models, and variational autoencoders) the evidence is empirical, and further investigation is needed to establish how far the bound extends.

Additionally, the paper does not delve into the practical challenges of efficiently generating high-quality synthetic data in real-world scenarios. The success of ARSD likely depends on the ability to produce synthetic samples that are sufficiently diverse and representative of the true data distribution, which can be a non-trivial task.

Overall, the paper presents an innovative and promising solution to the critical problem of model collapse. Further research exploring the extension to more complex models and the practical implementation of the data accumulation strategy would be valuable contributions to the field.

Conclusion

The paper demonstrates that model collapse is not an inevitable outcome of training models on their own generated outputs. When each generation's synthetic data accumulate alongside the original real data instead of replacing them, the curse of recursion is broken: in the experiments collapse is avoided, and in the linear-model analysis the test error stays bounded regardless of the number of iterations.

This work has important implications for generative models pretrained on web-scale data that increasingly include model-generated content, such as the language models, diffusion models, and variational autoencoders studied in the paper. If data accumulate over time in the way the paper assumes, the collapse predicted by earlier studies that assumed replacement may be less of a threat than those studies suggested.

As the field of machine learning continues to advance, research like this that tackles core challenges and offers innovative solutions will be crucial for driving progress and unlocking the full potential of these powerful technologies.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

How Bad is Training on Synthetic Data? A Statistical Analysis of Language Model Collapse

Mohamed El Amine Seddik, Suei-Wen Chen, Soufiane Hayou, Pierre Youssef, Merouane Debbah

The phenomenon of model collapse, introduced in (Shumailov et al., 2023), refers to the deterioration in performance that occurs when new models are trained on synthetic data generated from previously trained models. This recursive training loop makes the tails of the original distribution disappear, thereby making future-generation models forget about the initial (real) distribution. With the aim of rigorously understanding model collapse in language models, we consider in this paper a statistical model that allows us to characterize the impact of various recursive training scenarios. Specifically, we demonstrate that model collapse cannot be avoided when training solely on synthetic data. However, when mixing both real and synthetic data, we provide an estimate of a maximal amount of synthetic data below which model collapse can eventually be avoided. Our theoretical conclusions are further supported by empirical validations.

4/9/2024

cs.LG cs.AI cs.CL

Model Collapse Demystified: The Case of Regression

Elvis Dohmatob, Yunzhen Feng, Julia Kempe

In the era of proliferation of large language and image generation models, the phenomenon of model collapse refers to the situation whereby as a model is trained recursively on data generated from previous generations of itself over time, its performance degrades until the model eventually becomes completely useless, i.e the model collapses. In this work, we study this phenomenon in the setting of high-dimensional regression and obtain analytic formulae which quantitatively outline this phenomenon in a broad range of regimes. In the special case of polynomial decaying spectral and source conditions, we obtain modified scaling laws which exhibit new crossover phenomena from fast to slow rates. We also propose a simple strategy based on adaptive regularization to mitigate model collapse. Our theoretical results are validated with experiments.

5/2/2024

cs.LG cs.AI stat.ML

The Curse of Recursion: Training on Generated Data Makes Models Forget

Ilia Shumailov, Zakhar Shumaylov, Yiren Zhao, Yarin Gal, Nicolas Papernot, Ross Anderson

Stable Diffusion revolutionised image creation from descriptive text. GPT-2, GPT-3(.5) and GPT-4 demonstrated astonishing performance across a variety of language tasks. ChatGPT introduced such language models to the general public. It is now clear that large language models (LLMs) are here to stay, and will bring about drastic change in the whole ecosystem of online text and images. In this paper we consider what the future might hold. What will happen to GPT-{n} once LLMs contribute much of the language found online? We find that use of model-generated content in training causes irreversible defects in the resulting models, where tails of the original content distribution disappear. We refer to this effect as Model Collapse and show that it can occur in Variational Autoencoders, Gaussian Mixture Models and LLMs. We build theoretical intuition behind the phenomenon and portray its ubiquity amongst all learned generative models. We demonstrate that it has to be taken seriously if we are to sustain the benefits of training from large-scale data scraped from the web. Indeed, the value of data collected about genuine human interactions with systems will be increasingly valuable in the presence of content generated by LLMs in data crawled from the Internet.

4/16/2024

cs.LG cs.AI cs.CL cs.CR cs.CV

On the Stability of Iterative Retraining of Generative Models on their own Data

Quentin Bertrand, Avishek Joey Bose, Alexandre Duplessis, Marco Jiralerspong, Gauthier Gidel

Deep generative models have made tremendous progress in modeling complex data, often exhibiting generation quality that surpasses a typical human's ability to discern the authenticity of samples. Undeniably, a key driver of this success is enabled by the massive amounts of web-scale data consumed by these models. Due to these models' striking performance and ease of availability, the web will inevitably be increasingly populated with synthetic content. Such a fact directly implies that future iterations of generative models will be trained on both clean and artificially generated data from past models. In this paper, we develop a framework to rigorously study the impact of training generative models on mixed datasets -- from classical training on real data to self-consuming generative models trained on purely synthetic data. We first prove the stability of iterative training under the condition that the initial generative models approximate the data distribution well enough and the proportion of clean training data (w.r.t. synthetic data) is large enough. We empirically validate our theory on both synthetic and natural images by iteratively training normalizing flows and state-of-the-art diffusion models on CIFAR10 and FFHQ.

4/3/2024

cs.LG