LayerNorm: A key component in parameter-efficient fine-tuning

Read original: arXiv:2403.20284 - Published 4/1/2024 by Taha ValizadehAslani, Hualou Liang


Introduction

The paper proposes a parameter-efficient fine-tuning approach for transformer-based language models like BERT. Despite their excellent performance, these models are computationally expensive to fine-tune because of their large number of parameters. Existing methods reduce this cost by training only the bias parameters or only a small portion of the model.

The authors hypothesize that focusing on the optimal component of the model can achieve similar or better performance with fewer parameters. They examine different components of BERT during full fine-tuning and discover that LayerNorm is a key component. LayerNorm possesses the maximum Fisher information among all BERT components.

The authors demonstrate that training only LayerNorm reaches performance similar to training only the bias parameters, but with one-fifth the number of parameters. They further show that comparable performance can be obtained by training only a portion of LayerNorm, and that this portion can be selected using either the downstream task itself or other tasks.

The paper is organized as follows: Section 2 shows LayerNorm is key for fine-tuning BERT. Section 3 presents the method and results of training only LayerNorm. Section 4 discusses the findings, while Section 5 reviews related work. Section 6 provides conclusions and future work. An appendix describes LayerNorm in detail.

A key component of BERT

The paper aims to identify the most important components of the BERT model for fine-tuning. The authors fine-tuned BERT-large-cased on different tasks from the GLUE benchmark. After fine-tuning, they calculated how far each component's parameters had moved from their pre-trained values. This change was measured as the L1 distance between the pre-trained and fine-tuned parameters, normalized by the component size.
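The measurement described above can be sketched in a few lines. This is a minimal illustration, not the authors' code: the function name `normalized_l1_change`, the component names, and all numeric values are invented for the example, and each component is assumed to be available as a flat list of parameter values.

```python
def normalized_l1_change(pretrained, finetuned):
    """Return sum(|after - before|) / n for each component.

    Both arguments map a component name to a flat list of parameter values.
    (Hypothetical helper; names and data layout are assumptions.)
    """
    changes = {}
    for name, before in pretrained.items():
        after = finetuned[name]
        if len(after) != len(before):
            raise ValueError(f"size mismatch for {name}")
        total = sum(abs(a - b) for a, b in zip(after, before))
        # Dividing by the size keeps large components from dominating.
        changes[name] = total / len(before)
    return changes


# Toy values, invented for illustration only.
pretrained = {
    "attention.self.query": [0.10, -0.20, 0.30, 0.40],
    "output.LayerNorm": [1.00, 1.00],
}
finetuned = {
    "attention.self.query": [0.11, -0.20, 0.29, 0.40],
    "output.LayerNorm": [1.10, 0.80],
}

changes = normalized_l1_change(pretrained, finetuned)
most_changed = max(changes, key=changes.get)
```

In this toy data the small LayerNorm component moves more per parameter than the larger query matrix, which is the kind of pattern the heat maps in the paper are reported to show.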

The heat maps showed that the LayerNorm component underwent the most significant change across most GLUE tasks. Previous studies have also shown that disabling LayerNorm severely degrades BERT's performance.

The authors then used Fisher information to quantify each component's importance. They calculated the Fisher information of each parameter, averaged it for each component, and normalized across tasks. The LayerNorm and attention.output.LayerNorm components had the highest Fisher information, confirming their importance.
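A common way to estimate per-parameter Fisher information is the diagonal empirical approximation, which averages the squared gradient of the log-likelihood over examples. The sketch below assumes that estimator; the summary does not state which estimator the paper uses, and the function names, component names, and gradient values are all hypothetical.

```python
def empirical_fisher(per_example_grads):
    """Mean squared gradient per parameter.

    `per_example_grads` is a list with one flattened gradient vector per
    example, all for the same component. (Assumed estimator, see lead-in.)
    """
    n = len(per_example_grads)
    size = len(per_example_grads[0])
    return [sum(g[j] ** 2 for g in per_example_grads) / n for j in range(size)]


def component_scores(grads_by_component):
    """Average the per-parameter Fisher values within each component."""
    scores = {}
    for name, grads in grads_by_component.items():
        fisher = empirical_fisher(grads)
        scores[name] = sum(fisher) / len(fisher)
    return scores


# Toy gradients for two examples, invented for illustration only.
grads_by_component = {
    "output.LayerNorm": [[0.4, -0.2], [0.2, 0.4]],
    "intermediate.dense": [[0.1, 0.0, -0.1], [0.0, 0.1, 0.1]],
}
scores = component_scores(grads_by_component)
```

Averaging within a component makes components of different sizes comparable. The normalization across tasks mentioned above would be a further step applied to these scores.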

In summary, the LayerNorm component appears to be the most crucial for fine-tuning BERT, based on the analysis of parameter changes and Fisher information across GLUE tasks.

Proposed method: Only training LayerNorm

The paper explores fine-tuning only the LayerNorm parameters of a pre-trained BERT model for various GLUE tasks. Experiments show that fine-tuning just the LayerNorm parameters achieves comparable performance to fine-tuning all parameters, but with only a fraction (0.015%) of trainable parameters.
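In practice, training only LayerNorm amounts to freezing every parameter whose name does not belong to a LayerNorm module. The sketch below assumes parameters are exposed as (name, parameter) pairs with a `requires_grad` attribute, as PyTorch's `model.named_parameters()` does, and that LayerNorm parameters can be recognized by the substring "LayerNorm" in their names. The helper `train_only_layernorm` and the `FakeParam` stand-in are hypothetical and exist only so the example runs without a deep learning framework.

```python
def train_only_layernorm(named_parameters, keyword="LayerNorm"):
    """Leave trainable only the parameters whose name contains `keyword`.

    `named_parameters` is any iterable of (name, parameter) pairs where each
    parameter has a writable `requires_grad` attribute.
    Returns the names that remain trainable.
    """
    trainable = []
    for name, param in named_parameters:
        param.requires_grad = keyword in name
        if param.requires_grad:
            trainable.append(name)
    return trainable


class FakeParam:
    """Stand-in for a framework tensor (assumption for this sketch)."""

    def __init__(self):
        self.requires_grad = True


# Parameter names are illustrative examples only.
params = {
    "encoder.layer.0.attention.self.query.weight": FakeParam(),
    "encoder.layer.0.attention.output.LayerNorm.weight": FakeParam(),
    "encoder.layer.0.output.LayerNorm.bias": FakeParam(),
    "encoder.layer.0.output.dense.bias": FakeParam(),
}
trainable = train_only_layernorm(params.items())
```

A newly added task-specific classification head would normally also be left trainable. How the paper counts that head is not stated in this summary, so the sketch leaves it out.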

Further analysis is done by training only a subset of the LayerNorm parameters, selected based on their Fisher information score. Results indicate that even training 20% of the LayerNorm parameters can maintain good performance on some tasks.
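One simple way to train only a subset of parameters is to build a 0/1 mask from the Fisher scores and multiply the gradient by it before each update. The sketch below illustrates that idea under stated assumptions: the helper names, the plain gradient-descent update, and all numbers are invented, and the paper's actual selection and optimization details may differ.

```python
def top_fraction_mask(fisher_scores, fraction):
    """Return a 0/1 mask keeping the `fraction` of entries with the highest scores."""
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    k = max(1, round(fraction * len(fisher_scores)))
    ranked = sorted(range(len(fisher_scores)),
                    key=lambda i: fisher_scores[i], reverse=True)
    keep = set(ranked[:k])
    return [1 if i in keep else 0 for i in range(len(fisher_scores))]


def masked_update(params, grads, mask, lr):
    """One gradient-descent step that only moves the selected parameters."""
    return [p - lr * g * m for p, g, m in zip(params, grads, mask)]


# Toy Fisher scores for ten LayerNorm parameters, invented for illustration.
scores = [0.02, 0.90, 0.10, 0.50, 0.01, 0.03, 0.70, 0.04, 0.05, 0.06]
mask = top_fraction_mask(scores, fraction=0.2)
updated = masked_update([1.0] * 10, [0.5] * 10, mask, lr=0.1)
```

With a fraction of 0.2, only the two highest-scoring parameters change, and the remaining eight keep their pre-trained values.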

Visualizations reveal that LayerNorm parameters in higher layers carry more task-relevant information than those in lower layers, and that the bias terms have higher Fisher scores than the weights.

The paper also explores finding a global subset of LayerNorm parameters that performs well across all tasks, instead of a task-specific subset. In cross-validation experiments, this global subset performs similarly to the task-specific subsets.
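A leave-one-task-out scheme is one plausible reading of these cross-validation experiments: the subset used for a held-out task is chosen from the Fisher scores of the other tasks only. The sketch below assumes that reading. The aggregation rule (averaging per-task scores after normalizing each to sum to one), the helper names, and the scores are all assumptions made for illustration, not details taken from the paper.

```python
def normalize(scores):
    """Scale scores to sum to 1 so no single task dominates the average."""
    total = sum(scores)
    return [s / total for s in scores]


def global_scores(fisher_by_task, held_out):
    """Average normalized Fisher scores over every task except `held_out`."""
    others = [normalize(s) for task, s in fisher_by_task.items() if task != held_out]
    size = len(others[0])
    return [sum(s[j] for s in others) / len(others) for j in range(size)]


def top_k_indices(scores, k):
    """Indices of the k highest scores, returned in ascending index order."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


# Toy Fisher scores for four LayerNorm parameters, invented for illustration.
fisher_by_task = {
    "CoLA": [0.10, 0.60, 0.10, 0.20],
    "MRPC": [0.20, 0.50, 0.00, 0.30],
    "RTE": [0.05, 0.15, 0.50, 0.30],
}
global_subset = top_k_indices(global_scores(fisher_by_task, held_out="RTE"), k=2)
task_subset = top_k_indices(fisher_by_task["RTE"], k=2)
```

Comparing `global_subset` with `task_subset` shows how far a subset chosen without access to the target task can differ from a task-specific one. The paper's result is that, on real GLUE tasks, this difference costs little performance.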

Overall, the key finding is that fine-tuning a small subset of LayerNorm parameters in pre-trained models can be an extremely parameter-efficient way to adapt to new tasks while retaining performance.

Discussion

The section discusses the importance of the LayerNorm component in Transformer-based models like BERT. Disabling LayerNorm significantly degrades the performance of these models. During fine-tuning, LayerNorm changes more than other components. Training only LayerNorm, or a small portion of it, achieves performance comparable to fine-tuning the full model, even though LayerNorm holds only a tiny share of the model's parameters.

The analysis shows that the bias terms in LayerNorm contain more information than the weight terms across various GLUE tasks. This aligns with previous findings that training only the bias terms of BERT can be effective.

The study observes that the final layers of BERT's LayerNorm have higher Fisher information (indicating larger gradients) compared to the initial layers. This trend exists across GLUE tasks, suggesting that the final layers undergo more changes during fine-tuning. This phenomenon has been observed in other studies as well.

Related work

The section provides an overview of recent work on parameter-efficient fine-tuning techniques, categorizing them into five groups: adding adapters, adding prompts, model pruning, partial training, and low-rank decomposition.

Adding adapters involves introducing trainable modules, called adapters, into the original frozen model and training only the adapters. Examples include injecting adapters between layers or adding sparse, task-specific difference vectors.

Adding prompts involves prepending new tokens to the input text and training only the embeddings of these prompt tokens, forcing changes to apply to the corresponding vectors while keeping the core model frozen.

Model pruning removes certain weights from the network based on criteria like low magnitude or input activation norms.

Partial training involves training only a subset of the model, such as the final layers, bias parameters, or the most important parameters determined by Fisher information. The proposed method falls into this category.

Low-rank decomposition methods approximate the weight update made during fine-tuning with a product of low-rank matrices and train only those matrices instead of the whole model. Low-Rank Adaptation (LoRA) decomposes the weight change into low-rank matrices, and related variants further separate the weights into magnitude and direction components.
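The low-rank idea can be written compactly. The notation below follows the usual LoRA formulation, and the symbols $W_0$, $A$, $B$, $d$, $k$, and $r$ are introduced here for illustration rather than taken from the summary.

```latex
% Frozen pre-trained weight W_0 plus a trainable low-rank update BA
W = W_0 + \Delta W = W_0 + BA,
\qquad B \in \mathbb{R}^{d \times r},\quad
A \in \mathbb{R}^{r \times k},\quad
r \ll \min(d, k)

% Trainable parameters: r(d + k) instead of dk.
% Example: d = k = 1024, r = 8 gives 8 \cdot 2048 = 16{,}384
% trainable values instead of 1024^2 = 1{,}048{,}576.
```

Only $A$ and $B$ are updated during fine-tuning, so the saving grows as the rank $r$ shrinks relative to the layer dimensions.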

Conclusions and future work

The paper examines the components of BERT when fine-tuned for various GLUE tasks. It demonstrates that LayerNorm undergoes more changes after fine-tuning compared to other components. This finding aligns with previous research by Kovaleva et al. (2021), which showed that disabling LayerNorm significantly harms BERT's performance.

The study shows that fine-tuning only LayerNorm achieves performance comparable to BitFit, a method proposed by Zaken et al. (2021), while training fewer parameters. Using Fisher information, the researchers identify important subsets of LayerNorm parameters. They demonstrate that fine-tuning as little as 10% of the LayerNorm parameters, a tiny fraction of the BERT model, yields similar results with only slight performance degradation.

The paper focuses on layer normalization, a popular normalization method in NLP. As future work, it suggests extending the same parameter-efficient approach to batch normalization, which is widely used in computer vision, to reduce the cost of training models that rely on it.

Appendix A Normalization in neural networks

This section discusses batch normalization, its shortcomings, and layer normalization as an alternative solution.

Batch normalization normalizes inputs across a mini-batch to reduce internal covariate shift during training. However, it has several drawbacks:

Dependency on mini-batch size, which can introduce variability.
Less effective in recurrent neural networks due to sequential data.
Complications during inference due to needing running averages.
Introduces batch dependency, affecting generalization.
Inability to maintain criticality, leading to gradient explosion.

Layer normalization normalizes across the features of a layer instead of across the batch, which addresses batch normalization's shortcomings. It performs well in RNNs, avoids batch dependency, operates the same way during training and inference, and maintains criticality. Layer normalization computes the mean and variance across the features in a layer, normalizes the inputs with these statistics, and then applies an affine transformation with learnable parameters. Variations have been proposed to improve gradient propagation. The appendix also gives the technical details of this computation.
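The computation described above fits in a few lines of plain Python. This is a minimal sketch for a single feature vector: `gamma` and `beta` stand for the learnable weight and bias of the affine transformation, and the default `eps` value is an assumption chosen for the example, not a value taken from the paper.

```python
import math


def layer_norm(x, gamma, beta, eps=1e-5):
    """Normalize one feature vector, then apply a learnable scale and shift."""
    n = len(x)
    mean = sum(x) / n
    # Variance over the features of this one vector; no batch statistics.
    var = sum((v - mean) ** 2 for v in x) / n
    inv_std = 1.0 / math.sqrt(var + eps)
    return [g * (v - mean) * inv_std + b for v, g, b in zip(x, gamma, beta)]


x = [1.0, 2.0, 3.0, 4.0]
y = layer_norm(x, gamma=[1.0] * 4, beta=[0.0] * 4)

y_mean = sum(y) / len(y)
y_var = sum((v - y_mean) ** 2 for v in y) / len(y)
```

With `gamma` set to ones and `beta` to zeros, the output has a mean of approximately zero and a variance of approximately one. Because the statistics come from the input vector alone, the function behaves identically during training and inference. The `gamma` and `beta` vectors are the LayerNorm parameters that the paper fine-tunes.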

Appendix B Distance definitions

The L0 distance between two vectors V1 and V2 of size n is the number of positions at which the two vectors differ. It is similar to the Hamming distance. The L1 distance, also known as the Manhattan distance, between two such vectors is the sum of the absolute differences between their corresponding elements. The appendix gives the formula for the L1 distance.
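Written out, the two definitions take the following form. The element notation $V_{1,i}$ and the indicator function $\mathbb{1}[\cdot]$ are conventions assumed here and may differ from the paper's own notation.

```latex
d_{L_0}(V_1, V_2) = \sum_{i=1}^{n} \mathbb{1}\left[ V_{1,i} \neq V_{2,i} \right]
\qquad
d_{L_1}(V_1, V_2) = \sum_{i=1}^{n} \left| V_{1,i} - V_{2,i} \right|

% Size-normalized L1 distance, as used when comparing components of different sizes:
\bar{d}_{L_1}(V_1, V_2) = \frac{1}{n} \sum_{i=1}^{n} \left| V_{1,i} - V_{2,i} \right|
```

For example, with $V_1 = (1, 2, 3)$ and $V_2 = (1, 0, 5)$, the L0 distance is 2 and the L1 distance is $0 + 2 + 2 = 4$.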

Appendix C Metrics for GLUE results

Table 4 of the paper shows the metric used for each GLUE task.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

