TextGram: Towards a better domain-adaptive pretraining

2404.18228

Published 4/30/2024 by Sharayu Hiwarkhedkar, Saloni Mittal, Vidula Magdum, Omkar Dhekane, Raviraj Joshi, Geetanjali Kale, Arnav Ladkat

cs.CL cs.LG

TextGram: Towards a better domain-adaptive pretraining

Abstract

For green AI, it is crucial to measure and reduce the carbon footprint emitted during the training of large language models. In NLP, performing pre-training on Transformer models requires significant computational resources. This pre-training involves using a large amount of text data to gain prior knowledge for performing downstream tasks. Thus, it is important that we select the correct data in the form of domain-specific data from this vast corpus to achieve optimum results aligned with our domain-specific tasks. While training on large unsupervised data is expensive, it can be optimized by performing a data selection step before pretraining. Selecting important data reduces the space overhead and the substantial amount of time required to pre-train the model while maintaining constant accuracy. We investigate the existing selection strategies and propose our own domain-adaptive data selection method - TextGram - that effectively selects essential data from large corpora. We compare and evaluate the results of finetuned models for text classification task with and without data selection. We show that the proposed strategy works better compared to other selection methods.

Create account to get full access

Overview

Proposes a novel pretraining approach called "TextGram" to improve the performance of language models on domain-specific tasks
Explores the impact of data selection and domain-adaptive pretraining on downstream task performance
Compares the effectiveness of TextGram against traditional pretraining approaches across a variety of benchmarks

Plain English Explanation

The paper introduces a new method called "TextGram" that aims to help language models perform better on specialized tasks within particular domains, such as clinical or legal applications. Traditional language models are often trained on a broad corpus of general text, which can limit their effectiveness when applied to more specialized domains.

TextGram explores ways to "adapt" the language model to a specific domain by carefully selecting the pretraining data and using specialized techniques during the pretraining process. The key idea is to find the right balance between general language understanding and domain-specific knowledge, in order to create models that are more effective for domain-specific tasks while still maintaining strong performance on general language understanding.

The researchers compare the performance of TextGram against other pretraining approaches across a range of benchmarks, evaluating factors like text classification, named entity recognition, and question answering. Their results suggest that the TextGram approach can lead to significant performance improvements on domain-specific tasks, while retaining strong capabilities on general language understanding.

Technical Explanation

The key elements of the TextGram approach are:

Data Selection: The researchers explore different strategies for selecting the most relevant pretraining data for a given domain, such as filtering general corpora or combining multiple domain-specific datasets.
Domain-Adaptive Pretraining: In addition to data selection, the researchers experiment with specialized pretraining techniques that aim to "adapt" the language model to the target domain, such as adversarial training or [self-supervised fine-tuning].
Downstream Evaluation: The performance of the TextGram models is evaluated on a range of domain-specific benchmarks, including text classification, named entity recognition, and question answering tasks. The results are compared to traditional pretraining approaches to assess the effectiveness of the TextGram method.

Critical Analysis

The paper provides a thorough exploration of the impact of data selection and domain-adaptive pretraining on language model performance. However, it's important to note that the proposed TextGram approach may have some limitations:

The effectiveness of the method may be highly dependent on the specific domain and the availability of high-quality, domain-relevant pretraining data. Applying TextGram to domains with limited data could be challenging.
The paper does not deeply explore the trade-offs between general language understanding and domain-specific performance. It's possible that optimizing too heavily for domain-specific tasks could lead to a decline in broader language capabilities.
The experimental setup focuses on a relatively narrow set of benchmarks, and it would be valuable to see the TextGram approach evaluated on a wider range of downstream tasks and applications.

Overall, the TextGram approach represents a promising direction for improving the performance of language models on domain-specific tasks, but further research is needed to fully understand its limitations and potential impact on the field.

Conclusion

The TextGram paper presents a novel pretraining approach that aims to enhance the performance of language models on domain-specific tasks, while still maintaining strong general language understanding capabilities. By carefully selecting pretraining data and using specialized domain-adaptive techniques, the researchers demonstrate significant improvements across a range of benchmarks.

The findings of this work could have important implications for the development of more specialized and effective language models for a variety of real-world applications, from clinical decision support to legal text analysis. As language models become increasingly ubiquitous, techniques like TextGram will likely play a crucial role in ensuring these models are well-suited for the unique challenges and requirements of different domains.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Get more for less: Principled Data Selection for Warming Up Fine-Tuning in LLMs

Feiyang Kang, Hoang Anh Just, Yifan Sun, Himanshu Jahagirdar, Yuanzhi Zhang, Rongxing Du, Anit Kumar Sahu, Ruoxi Jia

This work focuses on leveraging and selecting from vast, unlabeled, open data to pre-fine-tune a pre-trained language model. The goal is to minimize the need for costly domain-specific data for subsequent fine-tuning while achieving desired performance levels. While many data selection algorithms have been designed for small-scale applications, rendering them unsuitable for our context, some emerging methods do cater to language data scales. However, they often prioritize data that aligns with the target distribution. While this strategy may be effective when training a model from scratch, it can yield limited results when the model has already been pre-trained on a different distribution. Differing from prior work, our key idea is to select data that nudges the pre-training distribution closer to the target distribution. We show the optimality of this approach for fine-tuning tasks under certain conditions. We demonstrate the efficacy of our methodology across a diverse array of tasks (NLU, NLG, zero-shot) with models up to 2.7B, showing that it consistently surpasses other selection methods. Moreover, our proposed method is significantly faster than existing techniques, scaling to millions of samples within a single GPU hour. Our code is open-sourced (Code repository: https://anonymous.4open.science/r/DV4LLM-D761/ ). While fine-tuning offers significant potential for enhancing performance across diverse tasks, its associated costs often limit its widespread adoption; with this work, we hope to lay the groundwork for cost-effective fine-tuning, making its benefits more accessible.

5/7/2024

cs.LG cs.AI cs.CL

Does your data spark joy? Performance gains from domain upsampling at the end of training

Cody Blakeney, Mansheej Paul, Brett W. Larsen, Sean Owen, Jonathan Frankle

Pretraining datasets for large language models (LLMs) have grown to trillions of tokens composed of large amounts of CommonCrawl (CC) web scrape along with smaller, domain-specific datasets. It is expensive to understand the impact of these domain-specific datasets on model capabilities as training at large FLOP scales is required to reveal significant changes to difficult and emergent benchmarks. Given the increasing cost of experimenting with pretraining data, how does one determine the optimal balance between the diversity in general web scrapes and the information density of domain specific data? In this work, we show how to leverage the smaller domain specific datasets by upsampling them relative to CC at the end of training to drive performance improvements on difficult benchmarks. This simple technique allows us to improve up to 6.90 pp on MMLU, 8.26 pp on GSM8K, and 6.17 pp on HumanEval relative to the base data mix for a 7B model trained for 1 trillion (T) tokens, thus rivaling Llama-2 (7B)$unicode{x2014}$a model trained for twice as long. We experiment with ablating the duration of domain upsampling from 5% to 30% of training and find that 10% to 20% percent is optimal for navigating the tradeoff between general language modeling capabilities and targeted benchmarks. We also use domain upsampling to characterize at scale the utility of individual datasets for improving various benchmarks by removing them during this final phase of training. This tool opens up the ability to experiment with the impact of different pretraining datasets at scale, but at an order of magnitude lower cost compared to full pretraining runs.

6/6/2024

cs.LG cs.CL

Large Language Model-guided Document Selection

Xiang Kong, Tom Gunter, Ruoming Pang

Large Language Model (LLM) pre-training exhausts an ever growing compute budget, yet recent research has demonstrated that careful document selection enables comparable model quality with only a fraction of the FLOPs. Inspired by efforts suggesting that domain-specific training document selection is in fact an interpretable process [Gunasekar et al., 2023], as well as research showing that instruction-finetuned LLMs are adept zero-shot data labelers [Gilardi et al.,2023], we explore a promising direction for scalable general-domain document selection; employing a prompted LLM as a document grader, we distill quality labels into a classifier model, which is applied at scale to a large, and already heavily-filtered, web-crawl-derived corpus autonomously. Following the guidance of this classifier, we drop 75% of the corpus and train LLMs on the remaining data. Results across multiple benchmarks show that: 1. Filtering allows us to quality-match a model trained on the full corpus across diverse benchmarks with at most 70% of the FLOPs, 2. More capable LLM labelers and classifier models lead to better results that are less sensitive to the labeler's prompt, 3. In-context learning helps to boost the performance of less-capable labeling models. In all cases we use open-source datasets, models, recipes, and evaluation frameworks, so that results can be reproduced by the community.

6/10/2024

cs.CL

👨‍🏫

Text-Free Multi-domain Graph Pre-training:Toward Graph Foundation Models

Xingtong Yu, Chang Zhou, Yuan Fang, Xinming Zhang

Given the ubiquity of graph data, it is intriguing to ask: Is it possible to train a graph foundation model on a broad range of graph data across diverse domains? A major hurdle toward this goal lies in the fact that graphs from different domains often exhibit profoundly divergent characteristics. Although there have been some initial efforts in integrating multi-domain graphs for pre-training, they primarily rely on textual descriptions to align the graphs, limiting their application to text-attributed graphs. Moreover, different source domains may conflict or interfere with each other, and their relevance to the target domain can vary significantly. To address these issues, we propose MDGPT, a text free Multi-Domain Graph Pre-Training and adaptation framework designed to exploit multi-domain knowledge for graph learning. First, we propose a set of domain tokens to to align features across source domains for synergistic pre-training. Second, we propose a dual prompts, consisting of a unifying prompt and a mixing prompt, to further adapt the target domain with unified multi-domain knowledge and a tailored mixture of domain-specific knowledge. Finally, we conduct extensive experiments involving six public datasets to evaluate and analyze MDGPT, which outperforms prior art by up to 37.9%.

5/29/2024

cs.LG