Random Masking Finds Winning Tickets for Parameter Efficient Fine-tuning

Read original: arXiv:2405.02596 - Published 5/7/2024 by Jing Xu, Jingzhao Zhang

Overview

• This paper introduces a technique called "Random Masking" that can efficiently fine-tune large language models for specific tasks with far fewer parameters than traditional fine-tuning methods.

• The key idea is to pick a random subset of the model's parameters and update only that subset during fine-tuning; the masked-out parameters stay frozen at their pretrained values.

• The authors show that, with a larger-than-expected learning rate, this "Random Masking" approach can match the performance of standard parameter-efficient methods such as LoRA on various tasks, while using fewer trainable parameters.

Plain English Explanation

Large language models like GPT-3 are incredibly powerful, but fine-tuning them for specific tasks can be computationally expensive. This research paper explores a more efficient approach called "Random Masking" that fine-tunes these models while training far fewer parameters.

The key insight is that a large language model has many more parameters than need to change for a specific task. Random Masking picks a small random subset of the parameters to update and leaves the rest frozen at their pretrained values, so the task has to be learned through that subset alone. This is a form of "parameter-efficient fine-tuning": according to the paper's abstract, the approach can match standard methods of this kind, such as LoRA, while training fewer parameters, provided the learning rate is larger than one would normally expect.

This is significant because it means large language models can be adapted to new tasks much more efficiently, using far fewer computational resources. The authors provide a comprehensive analysis of how this Random Masking approach compares to other parameter-efficient fine-tuning methods across a variety of benchmarks.

Overall, this work demonstrates an effective technique for making large language models more practical and accessible for real-world applications, by dramatically reducing the amount of compute and data required for fine-tuning.

Technical Explanation

The paper studies a method called "Random Masking" for parameter-efficient fine-tuning (PEFT) of large language models. A random binary mask selects which parameters are trainable; every other parameter is frozen at its pretrained value, so the task must be learned through the selected entries alone.

Specifically, the authors start with a pre-trained language model. For fine-tuning, they randomly select a subset of the model's parameters (e.g. 1-10%) to train and mask out the rest, which receive no updates. Fine-tuning then amounts to finding a sparse set of weight changes, confined to the selected entries, that solves the target task.
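The selection step described above can be sketched in a few lines of plain Python. This is only an illustrative toy under stated assumptions, not the authors' implementation: the helper names `make_random_mask` and `masked_sgd_step`, the flat list of 1,000 "weights", the quadratic loss, and the keep ratio of 1% are all hypothetical choices made for the example.

```python
import random


def make_random_mask(num_params, keep_ratio, seed=0):
    """Return the sorted indices of the parameters chosen to be trainable."""
    rng = random.Random(seed)
    num_trainable = max(1, round(num_params * keep_ratio))
    return sorted(rng.sample(range(num_params), num_trainable))


def masked_sgd_step(weights, grads, trainable_idx, lr):
    """Apply one SGD step to the selected entries only; all others are untouched."""
    new_weights = list(weights)
    for i in trainable_idx:
        new_weights[i] -= lr * grads[i]
    return new_weights


# Toy setup (hypothetical): "pretrained" weights and a task optimum.
pretrained = [0.5] * 1000
target = [1.0] * 1000

# Choose 1% of the entries at random; this choice stays fixed for all steps.
trainable = make_random_mask(len(pretrained), keep_ratio=0.01)

weights = pretrained
for _ in range(50):
    # Gradient of the toy loss sum((w - t)^2) with respect to each weight.
    grads = [2.0 * (w - t) for w, t in zip(weights, target)]
    weights = masked_sgd_step(weights, grads, trainable, lr=0.1)

# Only the selected entries have moved away from their pretrained values.
changed = [i for i, (w, p) in enumerate(zip(weights, pretrained)) if w != p]
print(len(trainable), changed == trainable)
```

In this toy, the gradient is still computed for every weight and simply ignored for the frozen ones; a practical implementation would more likely avoid storing gradients and optimizer state for entries that are never updated.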

The authors evaluate this Random Masking approach on various tasks and report that, given a larger-than-expected learning rate, it can match the performance of standard PEFT algorithms such as LoRA while using fewer trainable parameters.

One key insight is that the randomly masked parameters don't need to be fine-tuned at all: they keep their pretrained values, and the model learns the task-specific solution using only the retained parameters. Unlike LoRA or adapter layers, which add new trainable modules to the network, Random Masking introduces no additional parameters and leaves the architecture unchanged.
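One way to write this down, using notation that is assumed here rather than taken from the paper: let $\theta^{(0)}$ be the pretrained weights, $m \in \{0,1\}^d$ a random mask fixed before training (with $m_i = 1$ marking a trainable entry), $\eta$ the learning rate, and $\mathcal{L}$ the task loss. With plain gradient descent as a stand-in for whatever optimizer is actually used, the update and its consequence for the frozen entries are:

```latex
\theta^{(t+1)} = \theta^{(t)} - \eta \,\bigl( m \odot \nabla_{\theta} \mathcal{L}(\theta^{(t)}) \bigr),
\qquad
m_i = 0 \;\Longrightarrow\; \theta^{(t)}_i = \theta^{(0)}_i \quad \text{for all } t .
```

The number of trainable parameters is simply $\sum_i m_i$, which is the quantity the method tries to make as small as possible.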

The authors also provide empirical and theoretical analysis of why Random Masking works. They show that masking induces a flatter loss landscape and more distant solutions, which both allows for and necessitates large learning rates. In their comparisons, Random Masking matches standard PEFT baselines such as LoRA while training fewer parameters.
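To make the notion of parameter efficiency concrete, the sketch below counts trainable parameters for a single weight matrix under LoRA (the baseline named in the paper's abstract) and under a random mask. The LoRA count `rank * (d_in + d_out)` follows from its two low-rank factors; the 4096 x 4096 matrix size, the rank of 8, and the 0.1% keep ratio are illustrative assumptions, not settings reported in the paper.

```python
def lora_trainable(d_out, d_in, rank):
    """LoRA trains two factors of shapes (d_out, rank) and (rank, d_in)."""
    return rank * (d_in + d_out)


def masked_trainable(d_out, d_in, keep_ratio):
    """A random mask trains a fixed fraction of the matrix entries."""
    return round(d_out * d_in * keep_ratio)


# Hypothetical layer size and budgets, chosen only for illustration.
d_out = d_in = 4096
full = d_out * d_in
lora = lora_trainable(d_out, d_in, rank=8)
masked = masked_trainable(d_out, d_in, keep_ratio=0.001)

for name, count in [("full", full), ("lora r=8", lora), ("mask 0.1%", masked)]:
    print(f"{name:>10}: {count:>10,d} trainable ({100 * count / full:.3f}%)")
```

Under these assumed settings the random mask trains roughly a quarter as many entries as LoRA for this one matrix; whether such a budget is enough for a given task is an empirical question.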

Critical Analysis

The Random Masking approach presented in this paper is a promising technique for making large language models more practical and accessible. By reducing the number of parameters that must be trained for each task, it could make adapting these powerful models more feasible in low-resource settings.

That said, the authors acknowledge several limitations and areas for future work. First, the optimal masking ratio (i.e. the percentage of parameters to retain) is task-dependent and may require some trial-and-error to find. Second, the method has only been evaluated on relatively narrow language tasks, and its effectiveness for more open-ended generation tasks remains to be seen.
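The trial-and-error over masking ratios mentioned above would typically be organized as a small sweep, and since the abstract ties sparse masks to unusually large learning rates, it seems natural to sweep the learning rate jointly. The sketch below is a generic grid search, not a procedure from the paper: `sweep` is a hypothetical helper, and `toy_score` is an arbitrary stand-in for "fine-tune with these settings and measure validation performance".

```python
from itertools import product


def sweep(evaluate, keep_ratios, learning_rates):
    """Score every (keep_ratio, learning_rate) pair and return the best pair."""
    results = {
        (ratio, lr): evaluate(ratio, lr)
        for ratio, lr in product(keep_ratios, learning_rates)
    }
    best = max(results, key=results.get)
    return best, results


def toy_score(keep_ratio, lr):
    # Arbitrary placeholder with a single peak at (0.01, 0.01).
    # A real run would fine-tune the model and return a validation metric.
    return -(abs(keep_ratio - 0.01) + abs(lr - 0.01))


best, results = sweep(
    toy_score,
    keep_ratios=[0.001, 0.01, 0.1],
    learning_rates=[0.001, 0.01, 0.1],
)
print("best (keep_ratio, lr):", best)
```

Each grid point corresponds to one full fine-tuning run, so in practice the grid would be kept small.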

Additionally, while Random Masking can reduce the number of trainable parameters, it doesn't address the memory and compute requirements of the full language model during inference. Further research is needed on techniques for reducing the overall model size and inference cost, such as model distillation or quantization.

Overall, this paper presents a compelling approach to parameter-efficient fine-tuning that is worthy of further exploration and refinement. As large language models become more ubiquitous, techniques like Random Masking will be crucial for making these powerful AI systems more practical and accessible in real-world applications.

Conclusion

This paper studies "Random Masking", a simple technique that fine-tunes large language models for specific tasks by training only a randomly chosen subset of their parameters. With a suitably large learning rate, the authors show that this sparse set of updates can match the performance of standard parameter-efficient methods such as LoRA while using fewer trainable parameters.

This work represents an important step towards making large language models more practical and accessible, by reducing the number of parameters that must be trained to adapt these models to new applications. While the approach has some limitations that require further exploration, it shows that even very simple parameter-efficient fine-tuning techniques can be effective across a wide range of settings.

Related Papers

Random Masking Finds Winning Tickets for Parameter Efficient Fine-tuning

Jing Xu, Jingzhao Zhang

Fine-tuning large language models (LLM) can be costly. Parameter-efficient fine-tuning (PEFT) addresses the problems by training a fraction of the parameters, whose success reveals the expressiveness and flexibility of pretrained models. This paper studies the limit of PEFT, by further simplifying its design and reducing the number of trainable parameters beyond standard setups. To this end, we use Random Masking to fine-tune the pretrained model. Despite its simplicity, we show that Random Masking is surprisingly effective: with a larger-than-expected learning rate, Random Masking can match the performance of standard PEFT algorithms such as LoRA on various tasks, using fewer trainable parameters. We provide both empirical and theoretical explorations into the success of Random Masking. We show that masking induces a flatter loss landscape and more distant solutions, which allows for and necessitates large learning rates.

5/7/2024

Step-by-Step Unmasking for Parameter-Efficient Fine-tuning of Large Language Models

Aradhye Agarwal, Suhas K Ramesh, Ayan Sengupta, Tanmoy Chakraborty

Fine-tuning large language models (LLMs) on downstream tasks requires substantial computational resources. A class of parameter-efficient fine-tuning (PEFT) aims to mitigate these computational challenges by selectively fine-tuning only a small fraction of the model parameters. Although computationally efficient, these techniques often fail to match the performance of fully fine-tuned models, primarily due to inherent biases introduced during parameter selection. Traditional selective PEFT techniques use a fixed set of parameters based on a predefined budget (a process also known as unmasking), failing to capture parameter importance dynamically and often ending up exceeding the budget. We introduce $\text{ID}^3$, a novel selective PEFT method that calculates parameter importance continually and dynamically unmasks parameters by balancing exploration and exploitation in parameter selection. Our empirical study on 15 tasks spanning natural language understanding and generative tasks demonstrates the effectiveness of our method compared to fixed-masking-based PEFT techniques. We analytically show that $\text{ID}^3$ reduces the number of gradient updates by a factor of two, enhancing computational efficiency. $\text{ID}^3$ is robust to random initialization of neurons and, therefore, can be seamlessly integrated into existing additive and reparametrization-based PEFT modules such as adapters and LoRA for dynamic sparsification.

8/28/2024

Light-PEFT: Lightening Parameter-Efficient Fine-Tuning via Early Pruning

Naibin Gu, Peng Fu, Xiyu Liu, Bowen Shen, Zheng Lin, Weiping Wang

Parameter-efficient fine-tuning (PEFT) has emerged as the predominant technique for fine-tuning in the era of large language models. However, existing PEFT methods still have inadequate training efficiency. Firstly, the utilization of large-scale foundation models during the training process is excessively redundant for certain fine-tuning tasks. Secondly, as the model size increases, the growth in trainable parameters of empirically added PEFT modules becomes non-negligible and redundant, leading to inefficiency. To achieve task-specific efficient fine-tuning, we propose the Light-PEFT framework, which includes two methods: Masked Early Pruning of the Foundation Model and Multi-Granularity Early Pruning of PEFT. The Light-PEFT framework allows for the simultaneous estimation of redundant parameters in both the foundation model and PEFT modules during the early stage of training. These parameters can then be pruned for more efficient fine-tuning. We validate our approach on GLUE, SuperGLUE, QA tasks, and various models. With Light-PEFT, parameters of the foundation model can be pruned by up to over 40%, while still controlling trainable parameters to be only 25% of the original PEFT method. Compared to utilizing the PEFT method directly, Light-PEFT achieves training and inference speedup, reduces memory usage, and maintains comparable performance and the plug-and-play feature of PEFT.

6/7/2024

MLAE: Masked LoRA Experts for Parameter-Efficient Fine-Tuning

Junjie Wang, Guangjing Yang, Wentao Chen, Huahui Yi, Xiaohu Wu, Qicheng Lao

In response to the challenges posed by the extensive parameter updates required for full fine-tuning of large-scale pre-trained models, parameter-efficient fine-tuning (PEFT) methods, exemplified by Low-Rank Adaptation (LoRA), have emerged. LoRA simplifies the fine-tuning process but may still struggle with a certain level of redundancy in low-rank matrices and limited effectiveness from merely increasing their rank. To address these issues, a natural idea is to enhance the independence and diversity of the learning process for the low-rank matrices. Therefore, we propose Masked LoRA Experts (MLAE), an innovative approach that applies the concept of masking to PEFT. Our method incorporates a cellular decomposition strategy that transforms a low-rank matrix into independent rank-1 submatrices, or "experts", thus enhancing independence. Additionally, we introduce a binary mask matrix that selectively activates these experts during training to promote more diverse and anisotropic learning, based on expert-level dropout strategies. Our investigations reveal that this selective activation not only enhances performance but also fosters a more diverse acquisition of knowledge with a marked decrease in parameter similarity among MLAE, significantly boosting the quality of the model while barely increasing the parameter count. Remarkably, MLAE achieves new SOTA performance with an average accuracy score of 78.8% on the VTAB-1k benchmark and 90.9% on the FGVC benchmark, demonstrating superior performance. Our code is available at https://github.com/jie040109/MLAE.

5/30/2024