QuRating: Selecting High-Quality Data for Training Language Models

Read original: arXiv:2402.09739 - Published 7/19/2024 by Alexander Wettig, Aatmik Gupta, Saumya Malik, Danqi Chen

Overview

This paper introduces a method called QuRating for selecting high-quality pre-training data for large language models (LLMs).
Existing methods for data selection rely on simple heuristics, but QuRating aims to capture more nuanced human intuitions about data quality.
The researchers investigate four key qualities - writing style, required expertise, facts & trivia, and educational value - and find that LLMs can discern these qualities, particularly when making pairwise judgments.
They train a QuRater model to learn scalar ratings from pairwise judgments and use it to annotate a 260B-token training corpus with quality ratings for each of the four criteria.
By sampling data based on these quality ratings, the researchers are able to train smaller language models that outperform baselines in terms of perplexity and in-context learning performance.

Plain English Explanation

When training large language models, the quality of the data used for pretraining is very important. Existing methods for selecting this data often rely on simple rules, like only choosing data from reputable websites. However, the researchers behind this paper argue that we can capture more nuanced human intuitions about what makes high-quality data.

They identify four key qualities that they think are important: writing style, required expertise, facts & trivia, and educational value. The researchers find that language models are actually quite good at recognizing these qualities, especially when comparing two pieces of text side-by-side.

Using this insight, the researchers train a model called a "QuRater" that can learn to rate text on these four dimensions. They then use the QuRater to assess the quality of a huge 260 billion token training corpus, giving each piece of text a score for each of the four qualities.

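Next, the researchers use the QuRater scores to select a 30 billion token subset of the corpus and train smaller language models on it. They find that it is important to balance quality with diversity, so they sample texts according to their scores instead of keeping only the top-scoring ones. The resulting models perform better than models trained on a uniformly sampled subset. The best-performing model uses the educational value ratings and performs similarly to a uniformly sampled baseline trained for 50% more steps.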

Beyond just selecting the data, the researchers also experiment with using the quality ratings to guide the training curriculum - that is, the order in which the model sees the data. They find this can also improve performance without changing the underlying dataset.

Overall, this research suggests that carefully considering the quality and characteristics of training data, rather than just the quantity, can lead to more capable language models.

Technical Explanation

The key innovation in this paper is the QuRating method for selecting high-quality pretraining data for large language models. Existing approaches often use simple heuristics like only choosing data from reputable websites, but the researchers argue this misses more nuanced aspects of data quality.

To address this, the researchers first identify four key qualities they believe are important: writing style, required expertise, facts & trivia, and educational value. They then conduct experiments showing that large language models can in fact discern these qualities, particularly when making pairwise comparisons between texts.
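The pairwise setup can be illustrated with a small sketch. Everything here is an assumption made for illustration: `judge` is a hypothetical callable standing in for an LLM query, and asking in both presentation orders is one simple way to dampen position bias, not necessarily the authors' protocol.

```python
from typing import Callable


def preference_for_b(
    text_a: str,
    text_b: str,
    judge: Callable[[str, str], str],
    repeats: int = 1,
) -> float:
    """Return the fraction of judgments that prefer text_b over text_a.

    `judge(first, second)` is a hypothetical stand-in for an LLM call. It must
    return "first" or "second" to say which of the two shown texts better
    matches a criterion such as educational value.
    """
    if repeats < 1:
        raise ValueError("repeats must be at least 1")
    votes_for_b = 0
    total = 0
    for _ in range(repeats):
        # Order 1: text_a is shown first, so "second" is a vote for text_b.
        if judge(text_a, text_b) == "second":
            votes_for_b += 1
        # Order 2: text_b is shown first, so "first" is a vote for text_b.
        if judge(text_b, text_a) == "first":
            votes_for_b += 1
        total += 2
    return votes_for_b / total


# Toy judge for demonstration only: prefers the longer text.
def longer_text_judge(first: str, second: str) -> str:
    return "first" if len(first) >= len(second) else "second"


print(preference_for_b("hi", "a longer passage", longer_text_judge))
```

A judge that always answers "first" regardless of content would score 0.5 under this scheme, which is why both orders are queried.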

Building on this, the researchers train a "QuRater" model that learns scalar ratings for each of the four quality dimensions from these pairwise LLM judgments. They then use the QuRater to annotate a 260 billion token training corpus with quality scores for each piece of text.
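One common way to turn pairwise preferences into scalar scores is a Bradley-Terry-style objective, where the probability that text B is preferred over text A is modeled as a sigmoid of the rating difference. The sketch below assumes that objective and, for brevity, fits one free scalar per text by gradient descent instead of training a model that generalizes to unseen text. The function name `fit_ratings` and its parameters are hypothetical.

```python
import math
from typing import List, Tuple


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def fit_ratings(
    num_texts: int,
    judgments: List[Tuple[int, int, float]],
    lr: float = 0.5,
    epochs: int = 200,
) -> List[float]:
    """Fit one scalar rating per text from pairwise preferences.

    Each judgment is (a, b, p), where p is the observed fraction of
    comparisons in which text b was preferred over text a.
    The loss is binary cross-entropy between p and sigmoid(s_b - s_a).
    """
    ratings = [0.0] * num_texts
    for _ in range(epochs):
        grads = [0.0] * num_texts
        for a, b, p in judgments:
            predicted = sigmoid(ratings[b] - ratings[a])
            # Derivative of the cross-entropy loss w.r.t. (s_b - s_a).
            g = predicted - p
            grads[b] += g
            grads[a] -= g
        for i in range(num_texts):
            ratings[i] -= lr * grads[i]
    return ratings


# Toy data: text 2 is usually preferred over text 1, and text 1 over text 0.
toy_judgments = [(0, 1, 0.9), (1, 2, 0.8), (0, 2, 0.95)]
toy_ratings = fit_ratings(3, toy_judgments)
print(toy_ratings)
```

Only differences between ratings matter under this objective, so the fitted values are meaningful relative to one another, not as absolute scores.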

The researchers then select 30 billion tokens according to each quality rating and train 1.3 billion parameter language models on the selected data. They find that it is important to balance quality and diversity: when the quality ratings are used as logits for sampling over documents, the resulting models achieve lower perplexity and stronger in-context learning performance than baselines. The best model, based on educational value, performs similarly to a model trained with uniform sampling for 50% more steps.
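Using ratings "as logits over documents" can be read as sampling documents with probability proportional to exp(rating / temperature). The sketch below is one way to do that without replacement, using the Gumbel top-k trick. The temperature parameter, the function name `sample_by_quality`, and selecting a fixed number of documents (not a token budget) are all assumptions made for illustration.

```python
import math
import random
from typing import List


def sample_by_quality(
    ratings: List[float],
    k: int,
    temperature: float = 1.0,
    seed: int = 0,
) -> List[int]:
    """Sample k distinct document indices, treating rating/temperature as logits.

    Adding independent Gumbel(0, 1) noise to each logit and keeping the k
    largest values is equivalent to sampling without replacement from the
    softmax distribution over documents.
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    if not 0 <= k <= len(ratings):
        raise ValueError("k must be between 0 and the number of documents")
    rng = random.Random(seed)
    keyed = []
    for idx, rating in enumerate(ratings):
        u = max(rng.random(), 1e-12)  # avoid log(0)
        gumbel = -math.log(-math.log(u))
        keyed.append((rating / temperature + gumbel, idx))
    keyed.sort(reverse=True)
    return [idx for _, idx in keyed[:k]]


toy_ratings = [0.1, 2.0, -1.0, 3.0, 0.5]
print(sample_by_quality(toy_ratings, k=2, temperature=1.0))
```

A very small temperature approaches plain top-k selection, while a very large one approaches uniform sampling, so the temperature is the knob that trades quality against diversity in this sketch.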

Additionally, the researchers experiment with using the quality ratings to construct a training curriculum, sequencing the data to improve performance without changing the underlying dataset. They provide extensive analysis of the quality ratings, discussing their characteristics, biases, and broader implications.
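A curriculum of this kind keeps the set of training documents fixed and changes only their order. As a minimal sketch, the documents can be sorted by their quality rating. The helper name `curriculum_order` is hypothetical, and the sort direction is left as a parameter because the summary does not say which ordering the authors use.

```python
from typing import List, Sequence


def curriculum_order(
    documents: Sequence[str],
    ratings: Sequence[float],
    descending: bool = False,
) -> List[str]:
    """Return the same documents, reordered by quality rating.

    The dataset itself is unchanged; only the order in which a model would
    see the documents differs. The direction is an assumption left to the caller.
    """
    if len(documents) != len(ratings):
        raise ValueError("documents and ratings must have the same length")
    order = sorted(range(len(documents)), key=lambda i: ratings[i], reverse=descending)
    return [documents[i] for i in order]


toy_docs = ["doc_a", "doc_b", "doc_c"]
toy_ratings = [0.5, -1.0, 2.0]
print(curriculum_order(toy_docs, toy_ratings))
```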

Critical Analysis

One limitation of this work is that the researchers only evaluate the performance of the language models on standard benchmarks, rather than real-world tasks. It would be valuable to see how the models perform in more applied settings, to better understand the practical benefits of their data selection approach.

Additionally, the researchers acknowledge that their quality ratings may contain biases. Although they analyze the characteristics and biases of the ratings, a more rigorous examination of where these biases come from, and of the potential pitfalls and ethical considerations of selecting data with such ratings, would strengthen the paper.

Another area for further research would be to explore how the QuRating method could be expanded beyond the four qualities examined here. There may be other important dimensions of data quality that could be incorporated to produce even more robust and capable language models.

Overall, however, this paper represents an important step forward in moving beyond simplistic heuristics for data selection and towards more principled approaches to building high-quality training corpora. The researchers' emphasis on capturing nuanced human intuitions about text quality is a valuable contribution to the field of language model development.

Conclusion

This paper introduces QuRating, a novel method for selecting high-quality pretraining data for large language models. By identifying four key qualities - writing style, required expertise, facts & trivia, and educational value - and training a model to assess these qualities, the researchers are able to produce smaller language models that outperform baselines.

Beyond just data selection, the researchers also demonstrate the value of using these quality ratings to guide the training curriculum, further improving model performance. While the work has some limitations, it represents an important advance in moving towards more principled approaches to building capable and reliable language models.

As the use of large language models becomes increasingly widespread, ensuring the quality and trustworthiness of the data they are trained on will only grow in importance. The insights and techniques presented in this paper offer a promising path forward in this critical area of research.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

QuRating: Selecting High-Quality Data for Training Language Models

Alexander Wettig, Aatmik Gupta, Saumya Malik, Danqi Chen

Selecting high-quality pre-training data is important for creating capable language models, but existing methods rely on simple heuristics. We introduce QuRating, a method for selecting pre-training data that can capture human intuitions about data quality. In this paper, we investigate four qualities - writing style, required expertise, facts & trivia, and educational value - and find that LLMs are able to discern these qualities, especially when making pairwise judgments of texts. We train a QuRater model to learn scalar ratings from pairwise judgments, and use it to annotate a 260B training corpus with quality ratings for each of the four criteria. In our experiments, we select 30B tokens according to the different quality ratings and train 1.3B-parameter language models on the selected data. We find that it is important to balance quality and diversity. When we sample using quality ratings as logits over documents, our models obtain lower perplexity and stronger in-context learning performance than baselines. Our best model is based on educational value and performs similarly to a model trained with uniform sampling for 50% more steps. Beyond data selection, we use the quality ratings to construct a training curriculum which improves performance without changing the training dataset. We extensively analyze the quality ratings and discuss their characteristics, biases, and wider implications.

7/19/2024

Text Quality-Based Pruning for Efficient Training of Language Models

Vasu Sharma, Karthik Padthe, Newsha Ardalani, Kushal Tirumala, Russell Howes, Hu Xu, Po-Yao Huang, Shang-Wen Li, Armen Aghajanyan, Gargi Ghosh, Luke Zettlemoyer

In recent times, training Language Models (LMs) has relied on computationally heavy training over massive datasets, which makes the training process extremely laborious. In this paper we propose a novel method for numerically evaluating text quality in large unlabelled NLP datasets in a model-agnostic manner to assign the text instances a quality score. By proposing the text quality metric, the paper establishes a framework to identify and eliminate low-quality text instances, leading to improved training efficiency for LMs. Experimental results over multiple models and datasets demonstrate the efficacy of this approach, showcasing substantial gains in training effectiveness and highlighting the potential for resource-efficient LM training. For example, we observe an absolute accuracy improvement of 0.9% averaged over 14 downstream evaluation tasks for multiple LMs while using 40% less data and training 42% faster when training on the OpenWebText dataset, and a 0.8% average absolute accuracy improvement while using 20% less data and training 21% faster on the Wikipedia dataset.

5/14/2024

Curriculum Learning with Quality-Driven Data Selection

Biao Wu, Fang Meng, Ling Chen

The impressive multimodal capabilities demonstrated by OpenAI's GPT-4 have generated significant interest in the development of Multimodal Large Language Models (MLLMs). Visual instruction tuning of MLLMs with machine-generated instruction-following data has been shown to enhance zero-shot capabilities across various tasks. However, there has been limited exploration into controlling the quality of the instruction data. Current methodologies for data selection in MLLMs often rely on single, unreliable scores or use downstream tasks for selection, which is time-consuming and can lead to potential overfitting on the chosen evaluation datasets. To mitigate these limitations, we propose a novel data selection methodology that utilizes image-text correlation and model perplexity to evaluate and select data of varying quality. This approach leverages the distinct distribution of these two attributes, mapping data quality into a two-dimensional space that allows for the selection of data based on their location within this distribution. By utilizing this space, we can analyze the impact of task type settings, used as prompts, on data quality. Additionally, this space can be used to construct multi-stage subsets of varying quality to facilitate curriculum learning. Our research includes comprehensive experiments conducted on various datasets. The results emphasize substantial enhancements in five commonly assessed capabilities compared to using the complete dataset. Our codes, data, and models are publicly available at: https://anonymous.4open.science/r/EHIT-31B4

7/2/2024

Large language models can accurately predict searcher preferences

Paul Thomas, Seth Spielman, Nick Craswell, Bhaskar Mitra

Relevance labels, which indicate whether a search result is valuable to a searcher, are key to evaluating and optimising search systems. The best way to capture the true preferences of users is to ask them for their careful feedback on which results would be useful, but this approach does not scale to produce a large number of labels. Getting relevance labels at scale is usually done with third-party labellers, who judge on behalf of the user, but there is a risk of low-quality data if the labeller doesn't understand user needs. To improve quality, one standard approach is to study real users through interviews, user studies and direct feedback, find areas where labels are systematically disagreeing with users, then educate labellers about user needs through judging guidelines, training and monitoring. This paper introduces an alternate approach for improving label quality. It takes careful feedback from real users, which by definition is the highest-quality first-party gold data that can be derived, and develops a large language model prompt that agrees with that data. We present ideas and observations from deploying language models for large-scale relevance labelling at Bing, and illustrate with data from TREC. We have found large language models can be effective, with accuracy as good as human labellers and similar capability to pick the hardest queries, best runs, and best groups. Systematic changes to the prompts make a difference in accuracy, but so too do simple paraphrases. Measuring agreement with real searchers requires high-quality gold labels, but with these we find that models produce better labels than third-party workers, for a fraction of the cost, and these labels let us train notably better rankers.

5/20/2024