Parameter-Efficient Fine-Tuning with Discrete Fourier Transform

Read original: arXiv:2405.03003 - Published 5/7/2024 by Ziqi Gao, Qichao Wang, Aochuan Chen, Zijing Liu, Bingzhe Wu, Liang Chen, Jia Li


Overview

The paper introduces FourierFT, a method to further compress the trainable parameters in fine-tuning foundation models compared to LoRA.
FourierFT treats the weight change matrix as a matrix in the spatial domain and learns only a small fraction of its spectral coefficients, which can then be used to recover the full weight change matrix.
Empirically, FourierFT shows comparable or better performance with fewer trainable parameters than LoRA on various tasks, including natural language understanding, natural language generation, instruction tuning, and image classification.

Plain English Explanation

LoRA is a popular technique for fine-tuning large foundation models, like language models, by only updating a small number of parameters. This is helpful because it reduces the memory and storage requirements for the fine-tuned model.

However, even LoRA can run into storage limits when many customized adaptations must be kept or when the base model is large. FourierFT, the method introduced in this paper, aims to compress the trainable parameters further by working in the frequency domain with the Fourier transform.

The key idea behind FourierFT is to treat the weight change matrix as a matrix in the spatial domain and learn only a small fraction of its spectral coefficients. These learned spectral coefficients can then be used to reconstruct the full weight change matrix using the inverse discrete Fourier transform.

This approach allows FourierFT to achieve comparable or better performance than LoRA with even fewer trainable parameters. For example, when fine-tuning the LLaMA2-7B language model using instruction tuning, FourierFT surpassed LoRA while only requiring 0.064M trainable parameters, compared to LoRA's 33.5M.

Technical Explanation

The paper introduces a new method called FourierFT to further compress the trainable parameters in fine-tuning foundation models compared to LoRA.

In LoRA, the weight change matrix is represented as the product of two low-rank matrices, $A$ and $B$, i.e., $\Delta W = BA$. This reduces the number of trainable parameters, but storage can still become a problem when many customized adaptations must be kept or when the base model is large.
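The low-rank product can be sketched in a few lines of NumPy. The dimension names `d_out` and `d_in` and the rank `r` are illustrative assumptions, not values taken from the paper; the point is only that $\Delta W = BA$ has rank at most `r` and needs `r * (d_out + d_in)` trainable values instead of `d_out * d_in`.

```python
import numpy as np

def lora_delta_w(B, A):
    """Return the LoRA weight change, Delta W = B @ A."""
    return B @ A

def lora_trainable_count(d_out, d_in, r):
    """Trainable values in one LoRA pair: B is (d_out, r), A is (r, d_in)."""
    return r * (d_out + d_in)

# Illustrative sizes only (assumed, not from the paper).
d_out, d_in, r = 768, 768, 8
rng = np.random.default_rng(0)
B = rng.normal(size=(d_out, r))
A = rng.normal(size=(r, d_in))

delta_w = lora_delta_w(B, A)
print(delta_w.shape)                         # (768, 768)
print(np.linalg.matrix_rank(delta_w) <= r)   # True: the rank is bounded by r
print(lora_trainable_count(d_out, d_in, r))  # 12288, versus 589824 for a dense update
```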

FourierFT takes a different approach by treating the weight change matrix $\Delta W$ as a matrix in the spatial domain and learning only a small fraction of its spectral coefficients. These learned spectral coefficients can then be used to recover the full $\Delta W$ matrix via the inverse discrete Fourier transform.
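A minimal NumPy sketch of this reconstruction step follows. It assumes details that this summary does not spell out: the positions of the learned coefficients are chosen at random once and then frozen, the learned coefficients are real-valued, only the real part of the inverse transform is kept, and a scaling factor `alpha` is applied. The function names are hypothetical, and this is not the authors' released implementation.

```python
import numpy as np

def sample_spectral_entries(d_out, d_in, n, seed=0):
    """Pick n distinct (row, col) frequency positions. Assumed fixed, not trained."""
    rng = np.random.default_rng(seed)
    flat = rng.choice(d_out * d_in, size=n, replace=False)
    return np.unravel_index(flat, (d_out, d_in))

def fourierft_delta_w(coeffs, entries, shape, alpha=1.0):
    """Rebuild Delta W from n spectral coefficients via the inverse 2D DFT."""
    spectrum = np.zeros(shape, dtype=np.float64)
    spectrum[entries] = coeffs          # every other frequency stays zero
    # Keep the real part so Delta W can be added to a real weight matrix.
    return alpha * np.fft.ifft2(spectrum).real

# Illustrative sizes only (assumed, not from the paper).
shape = (768, 768)
n = 1000
entries = sample_spectral_entries(shape[0], shape[1], n, seed=0)
coeffs = np.random.default_rng(1).normal(size=n)  # stand-in for trained values

delta_w = fourierft_delta_w(coeffs, entries, shape, alpha=1.0)
print(delta_w.shape)  # (768, 768)
```

In this sketch only `coeffs` would be trained. Because `entries` can be regenerated from the seed, storing one adaptation amounts to storing `n` numbers per adapted matrix.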

The authors demonstrate empirically that FourierFT achieves comparable or better performance than LoRA on various tasks, including natural language understanding, natural language generation, instruction tuning, and image classification, while using fewer trainable parameters. For example, when performing instruction tuning on the LLaMA2-7B model, FourierFT surpasses LoRA with only 0.064M trainable parameters, compared to LoRA's 33.5M.
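The reported parameter counts can be reproduced with simple arithmetic under one plausible configuration. The summary does not state the settings, so the following are assumptions: a LoRA rank of 64, 1000 spectral coefficients per matrix for FourierFT, and adaptation of the query and value projections in each of LLaMA2-7B's 32 layers (hidden size 4096), which gives 64 adapted matrices.

```python
def lora_total_params(d_out, d_in, r, n_matrices):
    """LoRA trains B (d_out x r) and A (r x d_in) for every adapted matrix."""
    return n_matrices * r * (d_out + d_in)

def fourierft_total_params(n, n_matrices):
    """FourierFT trains n spectral coefficients for every adapted matrix."""
    return n_matrices * n

# Assumed configuration (not stated in this summary).
hidden = 4096              # LLaMA2-7B hidden size
n_matrices = 32 * 2        # 32 layers, query and value projections
lora_rank = 64             # assumed LoRA rank
n_coeffs = 1000            # assumed number of spectral coefficients

lora_total = lora_total_params(hidden, hidden, lora_rank, n_matrices)
fourier_total = fourierft_total_params(n_coeffs, n_matrices)

print(lora_total)                         # 33554432, about 33.5M
print(fourier_total)                      # 64000, i.e. 0.064M
print(round(lora_total / fourier_total))  # 524
```

Under these assumptions the two totals match the 33.5M and 0.064M figures quoted above, a reduction of roughly 500 times.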

Critical Analysis

The paper addresses the storage challenges LoRA faces when handling many customized adaptations or larger base models. By learning only a small fraction of the spectral coefficients of the weight change matrix, FourierFT achieves comparable or better performance with significantly fewer trainable parameters.

One potential limitation of the FourierFT approach is the assumption that the weight change matrix can be well-approximated by a small number of spectral coefficients. This may not hold true for all types of fine-tuning tasks or model architectures, and further research may be needed to understand the limitations and applicability of this method.
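The sparsity assumption can be illustrated with a toy experiment. The sketch below is not FourierFT training: it simply keeps the `k` largest-magnitude DFT coefficients of a given matrix and measures the reconstruction error. The two test matrices are made-up examples, one with a very compact spectrum and one of Gaussian noise.

```python
import numpy as np

def keep_top_k_spectrum(matrix, k):
    """Zero all but the k largest-magnitude 2D DFT coefficients, then invert."""
    spectrum = np.fft.fft2(matrix)
    magnitudes = np.abs(spectrum).ravel()
    keep = np.argpartition(magnitudes, -k)[-k:]   # indices of the k largest
    mask = np.zeros(magnitudes.size, dtype=bool)
    mask[keep] = True
    truncated = np.where(mask.reshape(spectrum.shape), spectrum, 0)
    return np.fft.ifft2(truncated).real

def relative_error(matrix, approx):
    """Frobenius-norm error of approx, relative to the norm of matrix."""
    return np.linalg.norm(matrix - approx) / np.linalg.norm(matrix)

size = 64
x = np.arange(size)
# Product of two cosines: only four nonzero DFT coefficients.
smooth = np.outer(np.cos(2 * np.pi * x / size), np.cos(2 * np.pi * 3 * x / size))
# Gaussian noise: energy spread over all frequencies.
noise = np.random.default_rng(0).normal(size=(size, size))

err_smooth = relative_error(smooth, keep_top_k_spectrum(smooth, 4))
err_noise = relative_error(noise, keep_top_k_spectrum(noise, 4))
print(err_smooth)  # close to zero
print(err_noise)   # close to one
```

With the same budget of four coefficients, the structured matrix is recovered almost exactly while the noise matrix is not. Whether real fine-tuning updates behave more like the first case or the second is the open question this paragraph raises.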

Additionally, the paper does not provide a detailed analysis of the computational overhead or inference latency introduced by the Fourier transform operations. This information would be valuable for practitioners to assess the practical implications of adopting FourierFT in their workflows.

Nevertheless, the promising results presented in the paper suggest that FourierFT is a valuable addition to the toolkit for efficient fine-tuning of large foundation models. Researchers and practitioners are encouraged to carefully evaluate the trade-offs and consider FourierFT as a potential alternative to LoRA and other low-rank adaptation methods.

Conclusion

The FourierFT method introduced in this paper offers a novel approach to further compressing the trainable parameters in fine-tuning large foundation models. By leveraging the Fourier transform, FourierFT is able to achieve comparable or better performance than LoRA while using significantly fewer parameters.

This result has implications for deploying and scaling fine-tuned models, particularly in resource-constrained environments or in applications that require many customized adaptations. As large language models continue to grow, techniques like FourierFT could help make model adaptation more efficient and accessible.

The authors have released the code for FourierFT at https://github.com/Chaos96/fourierft, which should make it easier to reproduce and extend the results. Researchers and practitioners are encouraged to investigate how well FourierFT applies across other domains and model architectures.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!


Related Papers


Parameter-Efficient Fine-Tuning with Discrete Fourier Transform

Ziqi Gao, Qichao Wang, Aochuan Chen, Zijing Liu, Bingzhe Wu, Liang Chen, Jia Li

Low-rank adaptation (LoRA) has recently gained much interest in fine-tuning foundation models. It effectively reduces the number of trainable parameters by incorporating low-rank matrices $A$ and $B$ to represent the weight change, i.e., $\Delta W = BA$. Despite LoRA's progress, it faces storage challenges when handling extensive customization adaptations or larger base models. In this work, we aim to further compress trainable parameters by enjoying the powerful expressiveness of the Fourier transform. Specifically, we introduce FourierFT, which treats $\Delta W$ as a matrix in the spatial domain and learns only a small fraction of its spectral coefficients. With the trained spectral coefficients, we implement the inverse discrete Fourier transform to recover $\Delta W$. Empirically, our FourierFT method shows comparable or better performance with fewer parameters than LoRA on various tasks, including natural language understanding, natural language generation, instruction tuning, and image classification. For example, when performing instruction tuning on the LLaMA2-7B model, FourierFT surpasses LoRA with only 0.064M trainable parameters, compared to LoRA's 33.5M. Our code is released at https://github.com/Chaos96/fourierft.

5/7/2024

Bayesian-LoRA: LoRA based Parameter Efficient Fine-Tuning using Optimal Quantization levels and Rank Values trough Differentiable Bayesian Gates

Cristian Meo, Ksenia Sycheva, Anirudh Goyal, Justin Dauwels

It is a common practice in natural language processing to pre-train a single model on a general domain and then fine-tune it for downstream tasks. However, when it comes to Large Language Models, fine-tuning the entire model can be computationally expensive, resulting in very intensive energy consumption. As a result, several Parameter Efficient Fine-Tuning (PEFT) approaches were recently proposed. One of the most popular approaches is low-rank adaptation (LoRA), where the key insight is decomposing the update weights of the pre-trained model into two low-rank matrices. However, the proposed approaches either use the same rank value across all different weight matrices, which has been shown to be a sub-optimal choice, or do not use any quantization technique, one of the most important factors when it comes to a model's energy consumption. In this work, we propose Bayesian-LoRA which approaches low-rank adaptation and quantization from a Bayesian perspective by employing a prior distribution on both quantization levels and rank values. As a result, B-LoRA is able to fine-tune a pre-trained model on a specific downstream task, finding the optimal rank values and quantization levels for every low-rank matrix. We validate the proposed model by fine-tuning a pre-trained DeBERTaV3 on the GLUE benchmark. Moreover, we compare it to relevant baselines and present both qualitative and quantitative results, showing how the proposed approach is able to learn optimal-rank quantized matrices. B-LoRA performs on par with or better than the baselines while reducing the total number of bit operations by roughly 70% compared to the baseline methods.

7/10/2024

Parameter-Efficient Fine-Tuning via Circular Convolution

Aochuan Chen, Jiashun Cheng, Zijing Liu, Ziqi Gao, Fugee Tsung, Yu Li, Jia Li

Low-Rank Adaptation (LoRA) has gained popularity for fine-tuning large foundation models, leveraging low-rank matrices $\mathbf{A}$ and $\mathbf{B}$ to represent weight changes (i.e., $\Delta \mathbf{W} = \mathbf{B} \mathbf{A}$). This method reduces trainable parameters and mitigates heavy memory consumption associated with full delta matrices by sequentially multiplying $\mathbf{A}$ and $\mathbf{B}$ with the activation. Despite its success, the intrinsic low-rank characteristic may limit its performance. Although several variants have been proposed to address this issue, they often overlook the crucial computational and memory efficiency brought by LoRA. In this paper, we propose Circular Convolution Adaptation (C$^3$A), which not only achieves high-rank adaptation with enhanced performance but also excels in both computational power and memory utilization. Extensive experiments demonstrate that C$^3$A consistently outperforms LoRA and its variants across various fine-tuning tasks.

8/22/2024

Flat-LoRA: Low-Rank Adaption over a Flat Loss Landscape

Tao Li, Zhengbao He, Yujun Li, Yasheng Wang, Lifeng Shang, Xiaolin Huang

Fine-tuning large-scale pre-trained models is prohibitively expensive in terms of computational and memory costs. Low-Rank Adaptation (LoRA), a popular Parameter-Efficient Fine-Tuning (PEFT) method, provides an efficient way to fine-tune models by optimizing only a low-rank matrix. Despite recent progress made in improving LoRA's performance, the connection between the LoRA optimization space and the original full parameter space is often overlooked. A solution that appears flat in the LoRA space may have sharp directions in the full parameter space, potentially harming generalization performance. In this paper, we propose Flat-LoRA, an efficient approach that seeks a low-rank adaptation located in a flat region of the full parameter space. Instead of relying on the well-established sharpness-aware minimization approach, which can incur significant computational and memory burdens, we utilize random weight perturbation with a Bayesian expectation loss objective to maintain training efficiency and design a refined perturbation generation strategy for improved performance. Experiments on natural language processing and image classification tasks with various architectures demonstrate the effectiveness of our approach.

9/24/2024