Automatic Pruning of Fine-tuning Datasets for Transformer-based Language Models

Read original: arXiv:2407.08887 - Published 7/15/2024 by Mohammadreza Tayaranian, Seyyed Hasan Mozafari, Brett H. Meyer, James J. Clark, Warren J. Gross


Overview

Transformer-based language models have achieved state-of-the-art performance on various natural language understanding tasks.
These models are first pre-trained on a general corpus and then fine-tuned on specific downstream tasks.
Previous work has studied the effect of pruning the training set of downstream tasks on model performance.
This paper proposes an automatic dataset pruning method for the training set of fine-tuning tasks.

Plain English Explanation

Transformer-based language models, such as BERT and GPT-3, have become incredibly powerful at understanding and generating human language. To achieve this, these models are first trained on a large, general corpus of text data, which is called pre-training. Then, they are further trained on smaller, more specific datasets for particular tasks, such as answering questions or classifying text, in a process called fine-tuning.

Previous research has shown that pruning, or reducing, the size of the fine-tuning dataset can sometimes improve the model's performance on that task. The idea is that by removing irrelevant or noisy data, the model can focus on the most important information and perform better. However, these previous methods relied on human feedback to determine how much data to remove.

In this paper, the researchers propose an automatic way to prune the fine-tuning dataset. Their method looks at how well the model is able to correctly classify each data point in the training set. The data points that the model struggles with are kept, while the ones it can easily classify are removed. This creates a smaller, more focused training set that is tailored to the specific model and task at hand.

The researchers found that the largest subset produced by this automated pruning method is, on average, about 3 times smaller than the original fine-tuning dataset, while fine-tuning on it actually improves the model's performance on the task by 0.1% on average. The accuracy gain is small; the main benefit is that a much smaller training set performs at least as well, which could make fine-tuning these language models less costly in real-world applications.

Technical Explanation

The paper proposes a method, referred to in this summary as APT, for automatically pruning the training set of fine-tuning tasks for transformer-based language models. Unlike previous work that relied on user feedback to determine the subset size, APT automatically extracts training subsets that are adapted for each pair of model and fine-tuning task.

APT operates by measuring the model's success rate in correctly classifying each training data point. Data points that the model struggles with (i.e., low success rate) are kept in the training subset, while those that the model can easily classify (i.e., high success rate) are pruned away. This creates multiple subsets of the training data that navigate the trade-off between subset size and evaluation accuracy.
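The pruning rule described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation: it assumes the success rate is the fraction of several fine-tuning runs in which an example was classified correctly, and it assumes subsets are formed by thresholding that rate. The function names, the threshold scheme, and the toy data are all hypothetical.

```python
import numpy as np


def success_rates(correct: np.ndarray) -> np.ndarray:
    """Per-example success rate.

    `correct` is a boolean array of shape (n_runs, n_examples) where
    correct[r, i] is True if run r classified training example i correctly.
    (Assumption: the summary does not say how the rate is measured.)
    """
    return correct.mean(axis=0)


def prune_by_success_rate(rates: np.ndarray, max_rate: float) -> np.ndarray:
    """Return indices of examples kept: those with success rate <= max_rate."""
    return np.flatnonzero(rates <= max_rate)


def candidate_subsets(rates: np.ndarray) -> dict:
    """Build several subsets, largest first, one per distinct threshold.

    The highest observed rate is never used as a threshold, so the
    easiest examples are pruned from every subset (hypothetical scheme).
    """
    thresholds = np.unique(rates)[:-1][::-1]
    return {float(t): prune_by_success_rate(rates, t) for t in thresholds}


# Toy data: 4 runs x 6 training examples (made up for illustration only).
correct = np.array(
    [
        [1, 1, 0, 1, 0, 1],
        [1, 0, 0, 1, 1, 1],
        [1, 1, 0, 1, 0, 1],
        [1, 1, 1, 1, 0, 1],
    ],
    dtype=bool,
)

rates = success_rates(correct)
subsets = candidate_subsets(rates)

for threshold, kept in subsets.items():
    print(f"threshold={threshold:.2f} keeps examples {kept.tolist()}")
```

In this toy run, examples 0, 3, and 5 are classified correctly in every run and are dropped from all subsets. Lowering the threshold then yields progressively smaller subsets, which mirrors the size-versus-accuracy trade-off mentioned in the text.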

The researchers experiment with APT on 5 downstream tasks and 2 language models. They find that the largest subset, which they call the "winning ticket" subset, is on average 3 times smaller than the original training set, yet fine-tuning on this subset results in a 0.1% increase in evaluation performance on average.

This work builds on previous research on dataset pruning and efficient pruning of large language models. By automating the pruning process and tailoring the training subset to each model-task pair, the researchers demonstrate a practical way to improve the efficiency and performance of fine-tuning transformer-based language models.

Critical Analysis

The paper provides a compelling approach to automatically pruning the training datasets for fine-tuning transformer-based language models. By focusing the training on the most informative data points, the method can reduce the dataset size while improving performance, which is a valuable contribution.

However, the paper does not explore the limitations of the proposed APT method in depth. For example, it is unclear how well the method would generalize to a wider range of tasks and models, or how sensitive the performance improvements are to the specific hyperparameters and implementation details.

Additionally, the paper does not delve into potential biases or fairness issues that could arise from pruning the training data in this way. There may be valid concerns about inadvertently removing data points that are important for representing underrepresented groups or ensuring the model's fairness.

Further research could investigate the robustness and generalizability of the APT method, as well as its implications for model fairness and accountability. Exploring these areas would strengthen the understanding and practical applicability of this line of research.

Conclusion

This paper presents an automatic dataset pruning method, called APT, that can improve the performance of transformer-based language models on fine-tuning tasks. By selectively removing training data points that the model can already classify well, APT creates smaller, more focused training subsets that, in the paper's experiments, gave slightly better evaluation results on average (about 0.1%).

The key innovation of this work is the ability to automatically tailor the training data to each specific model-task pair, without relying on human feedback. This makes the pruning process scalable and practical for real-world applications of these powerful language models.

While the paper demonstrates promising results, further research is needed to fully understand the limitations and potential biases of the APT method. Nonetheless, this work represents an important step towards more efficient and effective fine-tuning of transformer-based language models, with implications for a wide range of natural language processing tasks.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!


Related Papers


Automatic Pruning of Fine-tuning Datasets for Transformer-based Language Models

Mohammadreza Tayaranian, Seyyed Hasan Mozafari, Brett H. Meyer, James J. Clark, Warren J. Gross

Transformer-based language models have shown state-of-the-art performance on a variety of natural language understanding tasks. To achieve this performance, these models are first pre-trained on general corpus and then fine-tuned on downstream tasks. Previous work studied the effect of pruning the training set of the downstream tasks on the performance of the model on its evaluation set. In this work, we propose an automatic dataset pruning method for the training set of fine-tuning tasks. Our method is based on the model's success rate in correctly classifying each training data point. Unlike previous work which relies on user feedback to determine subset size, our method automatically extracts training subsets that are adapted for each pair of model and fine-tuning task. Our method provides multiple subsets for use in dataset pruning that navigate the trade-off between subset size and evaluation accuracy. Our largest subset, which we also refer to as the winning ticket subset, is on average $3\times$ smaller than the original training set of the fine-tuning task. Our experiments on 5 downstream tasks and 2 language models show that, on average, fine-tuning on the winning ticket subsets results in a $0.1\%$ increase in the evaluation performance of the model.

7/15/2024


Large-scale Dataset Pruning with Dynamic Uncertainty

Muyang He, Shuo Yang, Tiejun Huang, Bo Zhao

The state of the art of many learning tasks, e.g., image classification, is advanced by collecting larger datasets and then training larger models on them. As the outcome, the increasing computational cost is becoming unaffordable. In this paper, we investigate how to prune the large-scale datasets, and thus produce an informative subset for training sophisticated deep models with negligible performance drop. We propose a simple yet effective dataset pruning method by exploring both the prediction uncertainty and training dynamics. We study dataset pruning by measuring the variation of predictions during the whole training process on large-scale datasets, i.e., ImageNet-1K and ImageNet-21K, and advanced models, i.e., Swin Transformer and ConvNeXt. Extensive experimental results indicate that our method outperforms the state of the art and achieves 25% lossless pruning ratio on both ImageNet-1K and ImageNet-21K. The code and pruned datasets are available at https://github.com/BAAI-DCAI/Dataset-Pruning.

6/17/2024


APT: Adaptive Pruning and Tuning Pretrained Language Models for Efficient Training and Inference

Bowen Zhao, Hannaneh Hajishirzi, Qingqing Cao

Fine-tuning and inference with large Language Models (LM) are generally known to be expensive. Parameter-efficient fine-tuning over pretrained LMs reduces training memory by updating a small number of LM parameters but does not improve inference efficiency. Structured pruning improves LM inference efficiency by removing consistent parameter blocks, yet often increases training memory and time. To improve both training and inference efficiency, we introduce APT that adaptively prunes and tunes parameters for the LMs. At the early stage of fine-tuning, APT dynamically adds salient tuning parameters for fast and accurate convergence while discarding unimportant parameters for efficiency. Compared to baselines, our experiments show that APT maintains up to 98% task performance when pruning RoBERTa and T5 models with 40% parameters left while keeping 86.4% LLaMA models' performance with 70% parameters remained. Furthermore, APT speeds up LMs fine-tuning by up to 8x and reduces large LMs memory training footprint by up to 70%.

6/5/2024

Efficient Pruning of Large Language Model with Adaptive Estimation Fusion

Jun Liu, Chao Wu, Changdi Yang, Hao Tang, Zhenglun Kong, Geng Yuan, Wei Niu, Dong Huang, Yanzhi Wang

Large language models (LLMs) have become crucial for many generative downstream tasks, leading to an inevitable trend and significant challenge to deploy them efficiently on resource-constrained devices. Structured pruning is a widely used method to address this challenge. However, when dealing with the complex structure of the multiple decoder layers, general methods often employ common estimation approaches for pruning. These approaches lead to a decline in accuracy for specific downstream tasks. In this paper, we introduce a simple yet efficient method that adaptively models the importance of each substructure. Meanwhile, it can adaptively fuse coarse-grained and fine-grained estimations based on the results from complex and multilayer structures. All aspects of our design seamlessly integrate into the end-to-end pruning framework. Our experimental results, compared with state-of-the-art methods on mainstream datasets, demonstrate average accuracy improvements of 1.1%, 1.02%, 2.0%, and 1.2% for LLaMa-7B, Vicuna-7B, Baichuan-7B, and Bloom-7b1, respectively.

5/16/2024