Does your data spark joy? Performance gains from domain upsampling at the end of training

2406.03476

Published 6/6/2024 by Cody Blakeney, Mansheej Paul, Brett W. Larsen, Sean Owen, Jonathan Frankle

Does your data spark joy? Performance gains from domain upsampling at the end of training

Abstract

Pretraining datasets for large language models (LLMs) have grown to trillions of tokens composed of large amounts of CommonCrawl (CC) web scrape along with smaller, domain-specific datasets. It is expensive to understand the impact of these domain-specific datasets on model capabilities as training at large FLOP scales is required to reveal significant changes to difficult and emergent benchmarks. Given the increasing cost of experimenting with pretraining data, how does one determine the optimal balance between the diversity in general web scrapes and the information density of domain specific data? In this work, we show how to leverage the smaller domain specific datasets by upsampling them relative to CC at the end of training to drive performance improvements on difficult benchmarks. This simple technique allows us to improve up to 6.90 pp on MMLU, 8.26 pp on GSM8K, and 6.17 pp on HumanEval relative to the base data mix for a 7B model trained for 1 trillion (T) tokens, thus rivaling Llama-2 (7B)$unicode{x2014}$a model trained for twice as long. We experiment with ablating the duration of domain upsampling from 5% to 30% of training and find that 10% to 20% percent is optimal for navigating the tradeoff between general language modeling capabilities and targeted benchmarks. We also use domain upsampling to characterize at scale the utility of individual datasets for improving various benchmarks by removing them during this final phase of training. This tool opens up the ability to experiment with the impact of different pretraining datasets at scale, but at an order of magnitude lower cost compared to full pretraining runs.

Create account to get full access

Overview

This paper explores the potential performance gains from domain upsampling - increasing the proportion of data from a specific domain - towards the end of training machine learning models.
The authors investigate whether this technique can lead to improved model performance on the target domain without sacrificing overall performance.
The paper provides empirical results across several different datasets and tasks, demonstrating the effectiveness of domain upsampling in certain scenarios.

Plain English Explanation

The researchers in this study looked at a technique called "domain upsampling" to see if it could help improve the performance of machine learning models on specific types of data, without hurting the overall model performance.

Domain upsampling means increasing the proportion of data from a particular domain (e.g., a specific type of image, text, or other data) towards the end of the model training process. The idea is that by exposing the model to more of the target domain data at the end, it can learn to better recognize and handle that type of information.

The authors tested this approach across multiple datasets and tasks, and found that in some cases, domain upsampling did lead to performance gains on the target domain. However, they also note that this technique may not be beneficial in all situations, and that there are tradeoffs to consider.

Overall, the paper provides empirical evidence that domain upsampling can be a useful technique for improving model performance in certain contexts, but that the specific effects will depend on the dataset, task, and other factors.

Technical Explanation

The paper examines the impact of "domain upsampling" - increasing the proportion of data from a specific domain towards the end of training - on model performance. The authors hypothesize that this technique can lead to improved performance on the target domain without sacrificing overall model performance.

To test this, the researchers conducted experiments across multiple datasets and tasks, including image classification, text classification, and language modeling. They compared models trained with standard data sampling to those trained with domain upsampling applied in the final training epochs.

The results show that domain upsampling can indeed provide performance gains on the target domain in some cases. For example, on the ImageNet dataset, the authors found that upsampling the "dog" domain in the final training epochs led to a 1.2% increase in top-1 accuracy on the dog class, without a significant drop in overall performance.

However, the authors also note that the benefits of domain upsampling are not universal. In certain scenarios, such as language modeling on the WIKI-40B dataset, domain upsampling did not provide meaningful performance improvements. The authors hypothesize that the effectiveness of this technique depends on factors like the initial model performance, the degree of domain shift, and the complexity of the target domain.

Overall, the paper provides empirical evidence that domain upsampling can be a useful tool for improving model performance in some contexts, but that the specific effects will depend on the problem and dataset at hand. The authors encourage further research to better understand the conditions under which domain upsampling is most beneficial.

Critical Analysis

The paper presents a well-designed empirical study on the effects of domain upsampling, with experiments across a range of datasets and tasks. The authors are careful to note the limitations of their findings and acknowledge that the effectiveness of this technique will depend on the specific problem and dataset.

One potential area for further research mentioned in the paper is the interaction between domain upsampling and other techniques like domain generalization or continual pretraining. It would be interesting to see how domain upsampling might complement or interact with these other approaches to improving model performance and robustness.

Additionally, the paper does not deeply explore the mechanisms underlying the performance gains from domain upsampling. Further research could investigate the cognitive and representational changes in the model that enable it to better handle the target domain after this technique is applied.

Overall, this paper provides a valuable contribution to the understanding of data sampling strategies for machine learning models. The authors' careful analysis and acknowledgement of the nuanced effects of domain upsampling set a good example for future work in this area.

Conclusion

This paper investigates the potential performance gains from applying domain upsampling - increasing the proportion of data from a specific domain - towards the end of training machine learning models. The empirical results show that this technique can lead to improved performance on the target domain in some cases, without significantly impacting overall model performance.

While the benefits of domain upsampling are not universal, the paper provides a valuable contribution to the understanding of data sampling strategies in machine learning. The authors' careful analysis and acknowledgement of the nuanced effects of this technique set the stage for further research to better understand the conditions under which domain upsampling is most effective.

Overall, this work highlights the importance of considering domain-specific considerations when training machine learning models, and suggests that targeted data sampling techniques like domain upsampling can be a useful tool for improving model performance in certain scenarios.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

TextGram: Towards a better domain-adaptive pretraining

Sharayu Hiwarkhedkar, Saloni Mittal, Vidula Magdum, Omkar Dhekane, Raviraj Joshi, Geetanjali Kale, Arnav Ladkat

For green AI, it is crucial to measure and reduce the carbon footprint emitted during the training of large language models. In NLP, performing pre-training on Transformer models requires significant computational resources. This pre-training involves using a large amount of text data to gain prior knowledge for performing downstream tasks. Thus, it is important that we select the correct data in the form of domain-specific data from this vast corpus to achieve optimum results aligned with our domain-specific tasks. While training on large unsupervised data is expensive, it can be optimized by performing a data selection step before pretraining. Selecting important data reduces the space overhead and the substantial amount of time required to pre-train the model while maintaining constant accuracy. We investigate the existing selection strategies and propose our own domain-adaptive data selection method - TextGram - that effectively selects essential data from large corpora. We compare and evaluate the results of finetuned models for text classification task with and without data selection. We show that the proposed strategy works better compared to other selection methods.

4/30/2024

cs.CL cs.LG

Large Language Model-guided Document Selection

Xiang Kong, Tom Gunter, Ruoming Pang

Large Language Model (LLM) pre-training exhausts an ever growing compute budget, yet recent research has demonstrated that careful document selection enables comparable model quality with only a fraction of the FLOPs. Inspired by efforts suggesting that domain-specific training document selection is in fact an interpretable process [Gunasekar et al., 2023], as well as research showing that instruction-finetuned LLMs are adept zero-shot data labelers [Gilardi et al.,2023], we explore a promising direction for scalable general-domain document selection; employing a prompted LLM as a document grader, we distill quality labels into a classifier model, which is applied at scale to a large, and already heavily-filtered, web-crawl-derived corpus autonomously. Following the guidance of this classifier, we drop 75% of the corpus and train LLMs on the remaining data. Results across multiple benchmarks show that: 1. Filtering allows us to quality-match a model trained on the full corpus across diverse benchmarks with at most 70% of the FLOPs, 2. More capable LLM labelers and classifier models lead to better results that are less sensitive to the labeler's prompt, 3. In-context learning helps to boost the performance of less-capable labeling models. In all cases we use open-source datasets, models, recipes, and evaluation frameworks, so that results can be reproduced by the community.

6/10/2024

cs.CL

💬

Pretraining and Updating Language- and Domain-specific Large Language Model: A Case Study in Japanese Business Domain

Kosuke Takahashi, Takahiro Omi, Kosuke Arima, Tatsuya Ishigaki

Several previous studies have considered language- and domain-specific large language models (LLMs) as separate topics. This study explores the combination of a non-English language and a high-demand industry domain, focusing on a Japanese business-specific LLM. This type of a model requires expertise in the business domain, strong language skills, and regular updates of its knowledge. We trained a 13-billion-parameter LLM from scratch using a new dataset of business texts and patents, and continually pretrained it with the latest business documents. Further we propose a new benchmark for Japanese business domain question answering (QA) and evaluate our models on it. The results show that our pretrained model improves QA accuracy without losing general knowledge, and that continual pretraining enhances adaptation to new information. Our pretrained model and business domain benchmark are publicly available.

4/17/2024

cs.CL cs.AI

UniGen: Universal Domain Generalization for Sentiment Classification via Zero-shot Dataset Generation

Juhwan Choi, Yeonghwa Kim, Seunguk Yu, JungMin Yun, YoungBin Kim

Although pre-trained language models have exhibited great flexibility and versatility with prompt-based few-shot learning, they suffer from the extensive parameter size and limited applicability for inference. Recent studies have suggested that PLMs be used as dataset generators and a tiny task-specific model be trained to achieve efficient inference. However, their applicability to various domains is limited because they tend to generate domain-specific datasets. In this work, we propose a novel approach to universal domain generalization that generates a dataset regardless of the target domain. This allows for generalization of the tiny task model to any domain that shares the label space, thus enhancing the real-world applicability of the dataset generation paradigm. Our experiments indicate that the proposed method accomplishes generalizability across various domains while using a parameter set that is orders of magnitude smaller than PLMs.

5/6/2024

cs.CL cs.AI