The Curious Decline of Linguistic Diversity: Training Language Models on Synthetic Text

arXiv: 2311.09807

Published 4/17/2024 by Yanzhu Guo, Guokan Shang, Michalis Vazirgiannis, Chloé Clavel

Abstract

This study investigates the consequences of training language models on synthetic data generated by their predecessors, an increasingly prevalent practice given the prominence of powerful generative models. Diverging from the usual emphasis on performance metrics, we focus on the impact of this training methodology on linguistic diversity, especially when conducted recursively over time. To assess this, we adapt and develop a set of novel metrics targeting lexical, syntactic, and semantic diversity, applying them in recursive finetuning experiments across various natural language generation tasks in English. Our findings reveal a consistent decrease in the diversity of the model outputs through successive iterations, especially remarkable for tasks demanding high levels of creativity. This trend underscores the potential risks of training language models on synthetic text, particularly concerning the preservation of linguistic richness. Our study highlights the need for careful consideration of the long-term effects of such training approaches on the linguistic capabilities of language models.

Overview

This study investigates the consequences of training language models on synthetic data generated by their predecessors, a common practice as powerful generative models become more prominent.
Rather than looking at performance metrics alone, the study focuses on how this training method affects linguistic diversity, especially when it is applied recursively over time.
The researchers adapted and developed a set of novel metrics to assess lexical, syntactic, and semantic diversity, and applied them in recursive fine-tuning experiments across various natural language generation tasks in English.
The findings reveal a consistent decrease in the diversity of the model outputs through successive iterations, particularly for tasks demanding high levels of creativity.

Plain English Explanation

Language models are AI systems that can generate human-like text. An increasingly common practice is to train these models on synthetic data generated by earlier models. This study looks at how that training approach affects the diversity of the language the models produce, rather than only how well they perform on specific tasks.

The researchers created new ways to measure different aspects of linguistic diversity, like the variety of words, sentence structures, and meanings used. They applied these metrics to experiments where language models were repeatedly fine-tuned (or retrained) on the text they had generated themselves.

The results show that as the models went through more and more cycles of self-training, the language they produced became less diverse, especially for tasks that require a lot of creativity. This suggests there are potential risks to repeatedly training models on their own synthetic output, as it could lead to a loss of richness and variety in the language they can generate.

The study highlights the need to carefully consider the long-term effects of these training approaches on the linguistic capabilities of language models.

Technical Explanation

The researchers adapted and developed a set of novel metrics to assess lexical, syntactic, and semantic diversity in language model outputs. These metrics were applied in recursive fine-tuning experiments across various natural language generation tasks in English, including open-ended story generation, dialogue response generation, and abstractive summarization.
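To make the idea of a lexical diversity metric concrete, here is a minimal sketch of distinct-n, a commonly used measure that divides the number of unique n-grams by the total number of n-grams in a set of outputs. This summary does not say which specific metrics the authors used, so the choice of distinct-n, the function name `distinct_n`, and the whitespace tokenizer are all assumptions made for illustration.

```python
def distinct_n(texts, n=2):
    """Ratio of unique n-grams to total n-grams across a list of texts.

    Illustrative only: uses lowercased whitespace tokenization, which is an
    assumption and not necessarily what the paper does.
    """
    total = 0
    unique = set()
    for text in texts:
        tokens = text.lower().split()
        # Build n-grams by zipping n shifted views of the token list.
        ngrams = list(zip(*(tokens[i:] for i in range(n))))
        total += len(ngrams)
        unique.update(ngrams)
    # Return 0.0 when no n-grams exist (for example, all texts shorter than n).
    return len(unique) / total if total else 0.0


if __name__ == "__main__":
    varied = ["the cat sat on the mat", "a dog ran across the field"]
    repetitive = ["the cat sat on the mat", "the cat sat on the mat"]
    print("varied:", distinct_n(varied, n=2))
    print("repetitive:", distinct_n(repetitive, n=2))
```

A score near 1 means almost every n-gram is new, while a lower score means the outputs repeat the same phrases. Tracking such a score across fine-tuning iterations is one way to observe the kind of decline the study reports.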

The experiments involved iteratively fine-tuning a base language model on the synthetic text it had generated in previous iterations, simulating the recursive training on self-generated data that is becoming more common. The diversity metrics were used to track changes in the linguistic properties of the model outputs over these successive fine-tuning steps.
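The recursive setup can be illustrated with a toy simulation. The sketch below is not the authors' pipeline: it swaps the fine-tuned language model for a maximum-likelihood unigram model, and all function names (`fit_unigram`, `sample_corpus`, `recursive_vocab_sizes`) are hypothetical. Each generation fits the model on the current corpus, samples a new corpus of the same size, and records how many distinct words survive.

```python
import random
from collections import Counter


def fit_unigram(tokens):
    """Fit a maximum-likelihood unigram model: words and their counts."""
    counts = Counter(tokens)
    words = sorted(counts)
    weights = [counts[w] for w in words]
    return words, weights


def sample_corpus(model, size, rng):
    """Draw `size` tokens from the unigram model."""
    words, weights = model
    return rng.choices(words, weights=weights, k=size)


def recursive_vocab_sizes(real_tokens, generations=10, seed=0):
    """Train on data, generate, retrain on the generated data, and repeat.

    Returns the vocabulary size of the corpus at each generation,
    starting with the real corpus.
    """
    rng = random.Random(seed)
    corpus = list(real_tokens)
    sizes = [len(set(corpus))]
    for _ in range(generations):
        model = fit_unigram(corpus)                      # "train" on current data
        corpus = sample_corpus(model, len(corpus), rng)  # generate synthetic data
        sizes.append(len(set(corpus)))
    return sizes


if __name__ == "__main__":
    # Toy "real" corpus: 50 word types, each appearing 4 times.
    real = [f"w{i % 50}" for i in range(200)]
    print(recursive_vocab_sizes(real, generations=20))
```

In this toy model a word that fails to appear in one generation's sample has zero probability in every later generation, so the vocabulary can only stay the same or shrink. Real fine-tuned models are far more complex, but the simulation shows one simple mechanism by which diversity can erode under recursive training.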

The results consistently showed a decrease in lexical, syntactic, and semantic diversity as the models were fine-tuned on their own generated text, particularly for tasks that demand high levels of creativity. This trend underscores the potential risks of training language models on synthetic data, as it may lead to a narrowing of their linguistic capabilities over time.

Critical Analysis

The paper acknowledges several caveats and limitations to the research. The experiments were conducted only on English language tasks, so the generalizability to other languages is unclear. The specific architectures and hyperparameters of the language models used may also have influenced the observed trends.

Additionally, the paper does not explore potential mitigation strategies or the extent to which the diversity loss could be offset by other training techniques, such as incorporating more diverse external data sources. Further research would be needed to fully understand the long-term implications and develop best practices for training language models on synthetic data.

That said, the study raises important considerations about the potential risks of over-reliance on self-generated training data, which is an increasingly common practice in the field of natural language processing. The findings encourage the AI research community to think critically about the stability and long-term effects of these training approaches and explore ways to preserve linguistic richness in language models.

Conclusion

This study provides empirical evidence that training language models on their own synthetic outputs can lead to a consistent decrease in the diversity of the language they generate, especially for creative tasks. The findings underscore the importance of carefully considering the long-term consequences of this prevalent training methodology on the linguistic capabilities of AI systems.

As the use of powerful generative models becomes more widespread, the research highlights the need for the AI community to take a closer look at the potential risks and develop strategies to mitigate the loss of linguistic diversity. Maintaining rich and varied language is crucial for the development of AI systems that can engage in natural, human-like communication and creative expression.

Related Papers

How Bad is Training on Synthetic Data? A Statistical Analysis of Language Model Collapse

Mohamed El Amine Seddik, Suei-Wen Chen, Soufiane Hayou, Pierre Youssef, Merouane Debbah

The phenomenon of model collapse, introduced in (Shumailov et al., 2023), refers to the deterioration in performance that occurs when new models are trained on synthetic data generated from previously trained models. This recursive training loop makes the tails of the original distribution disappear, thereby making future-generation models forget about the initial (real) distribution. With the aim of rigorously understanding model collapse in language models, we consider in this paper a statistical model that allows us to characterize the impact of various recursive training scenarios. Specifically, we demonstrate that model collapse cannot be avoided when training solely on synthetic data. However, when mixing both real and synthetic data, we provide an estimate of a maximal amount of synthetic data below which model collapse can eventually be avoided. Our theoretical conclusions are further supported by empirical validations.

4/9/2024

cs.LG cs.AI cs.CL

Best Practices and Lessons Learned on Synthetic Data for Language Models

Ruibo Liu, Jerry Wei, Fangyu Liu, Chenglei Si, Yanzhe Zhang, Jinmeng Rao, Steven Zheng, Daiyi Peng, Diyi Yang, Denny Zhou, Andrew M. Dai

The success of AI models relies on the availability of large, diverse, and high-quality datasets, which can be challenging to obtain due to data scarcity, privacy concerns, and high costs. Synthetic data has emerged as a promising solution by generating artificial data that mimics real-world patterns. This paper provides an overview of synthetic data research, discussing its applications, challenges, and future directions. We present empirical evidence from prior art to demonstrate its effectiveness and highlight the importance of ensuring its factuality, fidelity, and unbiasedness. We emphasize the need for responsible use of synthetic data to build more powerful, inclusive, and trustworthy language models.

4/12/2024

cs.CL

The Curse of Recursion: Training on Generated Data Makes Models Forget

Ilia Shumailov, Zakhar Shumaylov, Yiren Zhao, Yarin Gal, Nicolas Papernot, Ross Anderson

Stable Diffusion revolutionised image creation from descriptive text. GPT-2, GPT-3(.5) and GPT-4 demonstrated astonishing performance across a variety of language tasks. ChatGPT introduced such language models to the general public. It is now clear that large language models (LLMs) are here to stay, and will bring about drastic change in the whole ecosystem of online text and images. In this paper we consider what the future might hold. What will happen to GPT-{n} once LLMs contribute much of the language found online? We find that use of model-generated content in training causes irreversible defects in the resulting models, where tails of the original content distribution disappear. We refer to this effect as Model Collapse and show that it can occur in Variational Autoencoders, Gaussian Mixture Models and LLMs. We build theoretical intuition behind the phenomenon and portray its ubiquity amongst all learned generative models. We demonstrate that it has to be taken seriously if we are to sustain the benefits of training from large-scale data scraped from the web. Indeed, the value of data collected about genuine human interactions with systems will be increasingly valuable in the presence of content generated by LLMs in data crawled from the Internet.

4/16/2024

cs.LG cs.AI cs.CL cs.CR cs.CV

On the Stability of Iterative Retraining of Generative Models on their own Data

Quentin Bertrand, Avishek Joey Bose, Alexandre Duplessis, Marco Jiralerspong, Gauthier Gidel

Deep generative models have made tremendous progress in modeling complex data, often exhibiting generation quality that surpasses a typical human's ability to discern the authenticity of samples. Undeniably, a key driver of this success is enabled by the massive amounts of web-scale data consumed by these models. Due to these models' striking performance and ease of availability, the web will inevitably be increasingly populated with synthetic content. Such a fact directly implies that future iterations of generative models will be trained on both clean and artificially generated data from past models. In this paper, we develop a framework to rigorously study the impact of training generative models on mixed datasets -- from classical training on real data to self-consuming generative models trained on purely synthetic data. We first prove the stability of iterative training under the condition that the initial generative models approximate the data distribution well enough and the proportion of clean training data (w.r.t. synthetic data) is large enough. We empirically validate our theory on both synthetic and natural images by iteratively training normalizing flows and state-of-the-art diffusion models on CIFAR10 and FFHQ.

4/3/2024

cs.LG