PiSSA: Principal Singular Values and Singular Vectors Adaptation of Large Language Models

Read original: arXiv:2404.02948 - Published 5/29/2024 by Fanxu Meng, Zhaohui Wang, Muhan Zhang


Overview

This paper introduces PiSSA, a parameter-efficient fine-tuning (PEFT) method for large language models. It shares LoRA's architecture but initializes the adapter matrices with the principal singular values and singular vectors of each pre-trained weight matrix.
The key idea is to train these principal components directly and freeze the remaining components in a residual matrix, rather than training a noise-and-zero-initialized adapter on top of a frozen model as LoRA does.
The authors report that PiSSA converges faster than LoRA and consistently outperforms it under identical experimental setups, across 12 models (184M to 70B parameters), 5 NLG tasks, and 8 NLU tasks.

Plain English Explanation

Large language models like GPT-3 are powerful, but they are also very complex, with millions or billions of parameters. This makes them computationally expensive to train and fine-tune for specific tasks.

The PiSSA approach addresses this by working with the principal singular values and singular vectors of each weight matrix. These are the few directions along which the matrix acts most strongly on its input.

PiSSA trains only these principal components, stored as two small adapter matrices, and freezes everything else. Because the adapter has the same shape as a LoRA adapter, PiSSA trains the same small number of parameters as LoRA and needs far less memory than updating every weight in the model.
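To make the saving concrete, the sketch below counts trainable parameters for a single weight matrix. It assumes the adapter shapes given in the paper's abstract (one m x r matrix and one r x n matrix). The 4096 x 4096 layer size and rank 16 are hypothetical values chosen for illustration, not figures from the paper.

```python
def full_params(m, n):
    """Trainable parameters when every entry of an m x n weight matrix is updated."""
    return m * n


def adapter_params(m, n, r):
    """Trainable parameters of a rank-r adapter: A is m x r, B is r x n."""
    return r * (m + n)


# Hypothetical layer size and rank, for illustration only.
m, n, r = 4096, 4096, 16

full = full_params(m, n)            # 16_777_216
adapter = adapter_params(m, n, r)   # 131_072
ratio = adapter / full              # 0.0078125, i.e. 1/128 of the full matrix

print(f"full: {full:,}  adapter: {adapter:,}  ratio: {ratio:.4%}")
```

The count is the same for LoRA and PiSSA, since the two methods differ only in how the adapter is initialized and in what is frozen.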

The authors show that PiSSA consistently outperforms LoRA under identical experimental setups. For example, Mistral-7B fine-tuned with PiSSA reaches 72.86% accuracy on GSM8K, compared with 67.7% for LoRA. Because the initialization takes only a few seconds with a fast SVD technique, switching from LoRA to PiSSA costs little, which makes it a practical option for real-world applications where compute is constrained.

Technical Explanation

The paper introduces the PiSSA (Principal Singular Values and Singular Vectors Adaptation) method for adapting large language models. The key idea is to use the singular value decomposition (SVD) of each pre-trained weight matrix $W \in \mathbb{R}^{m \times n}$ to initialize a LoRA-style low-rank adapter.

The authors first compute the SVD of each weight matrix in the pre-trained model. The top $r$ singular values and their singular vectors form the principal components, which capture the largest part of the matrix; the remaining components form the residual.
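Written out, the split looks as follows. The symbols $\sigma_i$, $u_i$, and $v_i$ are standard SVD notation introduced here for illustration; $W$, $r$, and $W^{res}$ follow the paper's abstract.

$$
W = U \Sigma V^\top
  = \underbrace{\sum_{i=1}^{r} \sigma_i u_i v_i^\top}_{\text{principal part (rank } r\text{)}}
  + \underbrace{\sum_{i=r+1}^{\min(m,n)} \sigma_i u_i v_i^\top}_{W^{res}},
\qquad \sigma_1 \ge \sigma_2 \ge \dots \ge 0 .
$$

Because the singular values are sorted in descending order, the first sum is the best rank-$r$ approximation of $W$ (the Eckart-Young theorem), and the second sum is what is left over.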

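PiSSA then initializes the adapter matrices $A \in \mathbb{R}^{m \times r}$ and $B \in \mathbb{R}^{r \times n}$ with the principal components and puts the remaining components into a residual matrix $W^{res} \in \mathbb{R}^{m \times n}$. During fine-tuning, only $A$ and $B$ are updated while $W^{res}$ stays frozen. LoRA, by contrast, freezes the original $W$ and trains an adapter initialized with Gaussian noise and zeros, which may lead to slow convergence.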
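A minimal NumPy sketch of this initialization, next to LoRA's, may help. It is not the authors' implementation. The function names `pissa_init` and `lora_init` are hypothetical, the even split of each singular value between the two factors (a square root on each side) is an assumption, and the 0.01 noise scale for LoRA is an illustrative choice. The abstract states only that the adapter is initialized with the principal components and that LoRA uses Gaussian noise and zeros.

```python
import numpy as np


def pissa_init(W, r):
    """Split W (m x n) into trainable factors A (m x r), B (r x n) and a frozen residual."""
    # Reduced SVD; NumPy returns the singular values S sorted in descending order.
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    sqrt_S = np.sqrt(S[:r])
    A = U[:, :r] * sqrt_S            # U_r @ diag(sqrt(S_r)): scales each column
    B = sqrt_S[:, None] * Vt[:r, :]  # diag(sqrt(S_r)) @ V_r^T: scales each row
    W_res = W - A @ B                # remaining components, frozen during fine-tuning
    return A, B, W_res


def lora_init(m, n, r, rng):
    """'Noise & Zero' adapter: A is Gaussian noise, B is zeros, so A @ B starts at zero."""
    A = rng.standard_normal((m, r)) * 0.01  # noise scale is an illustrative assumption
    B = np.zeros((r, n))
    return A, B


rng = np.random.default_rng(0)
m, n, r = 64, 48, 8                  # toy sizes, not from the paper
W = rng.standard_normal((m, n))

A, B, W_res = pissa_init(W, r)
A_lora, B_lora = lora_init(m, n, r, rng)

# Both methods start from the same effective weight as the pre-trained model.
pissa_effective = W_res + A @ B
lora_effective = W + A_lora @ B_lora
print(np.allclose(pissa_effective, W), np.allclose(lora_effective, W))
```

In both cases the model's output is unchanged at step zero. The difference is where the trainable parameters sit: LoRA's adapter starts as a zero update built from noise, while PiSSA's adapter starts as the principal part of $W$ itself.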

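The authors evaluate PiSSA on 5 natural language generation (NLG) and 8 natural language understanding (NLU) tasks, using 12 models ranging from 184M to 70B parameters, and compare it to LoRA under identical experimental setups. They find that PiSSA consistently outperforms LoRA; on GSM8K, Mistral-7B reaches 72.86% with PiSSA versus 67.7% with LoRA. PiSSA is also compatible with quantization: QPiSSA (PiSSA with 4-bit quantization) shows smaller initial quantization error than QLoRA, and LLaMA-3-70B fine-tuned on GSM8K reaches 86.05% with QPiSSA versus 81.73% with QLoRA.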
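The paper's abstract also reports that QPiSSA (PiSSA with 4-bit quantization) exhibits smaller quantization errors in the initial stages than QLoRA. The toy sketch below shows one way this can happen. It rests on several assumptions: `quantize_absmax` is a hypothetical, simple symmetric uniform quantizer standing in for the 4-bit scheme actually used, and `W` is a synthetic matrix with a few dominant directions plus small noise. It illustrates the mechanism only and does not reproduce the paper's measurements.

```python
import numpy as np


def quantize_absmax(X, bits=4):
    """Toy symmetric uniform quantizer: scale by the max magnitude, round, dequantize."""
    levels = 2 ** (bits - 1) - 1          # 7 positive levels for 4 bits
    scale = np.abs(X).max() / levels
    return np.round(X / scale) * scale


rng = np.random.default_rng(0)
m, n, r = 64, 48, 4                       # toy sizes, not from the paper

# Synthetic weight: a strong rank-r component plus small noise (an assumption).
W = rng.standard_normal((m, r)) @ rng.standard_normal((r, n))
W = W + 0.05 * rng.standard_normal((m, n))

U, S, Vt = np.linalg.svd(W, full_matrices=False)
principal = (U[:, :r] * S[:r]) @ Vt[:r, :]   # kept in full precision inside the adapter
W_res = W - principal                         # only this part is quantized

# QLoRA-style start: the whole W is quantized and the adapter contributes zero.
err_qlora = np.linalg.norm(W - quantize_absmax(W))

# QPiSSA-style start: quantize the residual, add back the full-precision principal part.
err_qpissa = np.linalg.norm(W - (quantize_absmax(W_res) + principal))

print(f"QLoRA-style error:  {err_qlora:.4f}")
print(f"QPiSSA-style error: {err_qpissa:.4f}")
```

In this toy setting, removing the principal components leaves a residual with a much narrower value range, so the same number of quantization levels covers it more finely.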

Critical Analysis

The PiSSA approach is an interesting and valuable contribution to the field of efficient transfer learning for large language models. By focusing on the most important model components, it provides an effective way to adapt these powerful models to new tasks and domains.

One limitation noted in the paper is that the optimal number of principal components to fine-tune is task-dependent. The authors provide heuristics, but more work may be needed to fully automate this selection process.

The experiments span 12 models from 184M to 70B parameters, so the method has been tested across a wide range of sizes. How the gap over LoRA changes with model size is less clear from the headline results. Larger models may have more redundant parameters, which could make PiSSA even more effective, and a systematic study of this would be a useful avenue for future research.

Overall, PiSSA represents a promising technique for making large language models more practical and accessible for real-world applications. With further refinement and exploration, it could become an important tool in the ongoing effort to develop efficient and capable AI systems.

Conclusion

The PiSSA method introduced in this paper provides an effective way to adapt large language models to new tasks and domains. By training the principal components of each weight matrix and freezing the residual, PiSSA converges faster than LoRA and outperforms it under identical setups, while keeping the same adapter architecture and trainable parameter count.

This makes large language models more practical and accessible for real-world applications, where efficiency and constraints on data and compute are crucial. The authors have demonstrated the potential of PiSSA across a range of benchmarks, and further research into its scaling properties and automatic component selection could lead to even more impactful applications.

Overall, PiSSA represents an important step forward in the ongoing efforts to develop powerful and efficient AI systems that can be deployed in a wide variety of contexts.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!


Related Papers

PiSSA: Principal Singular Values and Singular Vectors Adaptation of Large Language Models

Fanxu Meng, Zhaohui Wang, Muhan Zhang

To parameter-efficiently fine-tune (PEFT) large language models (LLMs), the low-rank adaptation (LoRA) method approximates the model changes $\Delta W \in \mathbb{R}^{m \times n}$ through the product of two matrices $A \in \mathbb{R}^{m \times r}$ and $B \in \mathbb{R}^{r \times n}$, where $r \ll \min(m, n)$, $A$ is initialized with Gaussian noise, and $B$ with zeros. LoRA freezes the original model $W$ and updates the Noise & Zero adapter, which may lead to slow convergence. To overcome this limitation, we introduce Principal Singular values and Singular vectors Adaptation (PiSSA). PiSSA shares the same architecture as LoRA, but initializes the adaptor matrices $A$ and $B$ with the principal components of the original matrix $W$, and put the remaining components into a residual matrix $W^{res} \in \mathbb{R}^{m \times n}$ which is frozen during fine-tuning. Compared to LoRA, PiSSA updates the principal components while freezing the residual parts, allowing faster convergence and enhanced performance. Comparative experiments of PiSSA and LoRA across 12 different models, ranging from 184M to 70B, encompassing 5 NLG and 8 NLU tasks, reveal that PiSSA consistently outperforms LoRA under identical experimental setups. On the GSM8K benchmark, Mistral-7B fine-tuned with PiSSA achieves an accuracy of 72.86%, surpassing LoRA's 67.7% by 5.16%. Due to the same architecture, PiSSA is also compatible with quantization to further reduce the memory requirement of fine-tuning. Compared to QLoRA, QPiSSA (PiSSA with 4-bit quantization) exhibits smaller quantization errors in the initial stages. Fine-tuning LLaMA-3-70B on GSM8K, QPiSSA attains an accuracy of 86.05%, exceeding the performances of QLoRA at 81.73%. Leveraging a fast SVD technique, PiSSA can be initialized in only a few seconds, presenting a negligible cost for transitioning from LoRA to PiSSA.

5/29/2024

SORSA: Singular Values and Orthonormal Regularized Singular Vectors Adaptation of Large Language Models

Yang Cao

The rapid advancement in large language models (LLMs) comes with a significant increase in their parameter size, presenting challenges for adaptation and fine-tuning. Parameter-efficient fine-tuning (PEFT) methods are widely used to adapt LLMs for downstream tasks efficiently. In this paper, we propose Singular Values and Orthonormal Regularized Singular Vectors Adaptation, or SORSA, a novel PEFT method. We introduce a method to analyze the variation of the parameters by performing singular value decomposition (SVD) and discuss and analyze SORSA's superiority in minimizing the alteration in the SVD aspect. Each SORSA adapter consists of two main parts: trainable principal singular weights $W_p = U_p \Sigma_p V^\top_p$, and frozen residual weights $W_r = U_r \Sigma_r V^\top_r$. These parts are initialized by performing SVD on pre-trained weights. Moreover, we implement and analyze an orthonormal regularizer, which could effectively transfer the scaling information into $\Sigma_p$ and ultimately allows the training process to be more efficient. SORSA adapters could be merged during inference, thus eliminating any inference latency. After all, SORSA shows a faster convergence than PiSSA and LoRA in our experiments. On the MATH benchmark, Llama 2 7B adapted using SORSA achieved 10.36% accuracy, outperforming LoRA (5.50%), Full FT (7.22%), and PiSSA (7.44%). On the GSM-8K benchmark, SORSA achieved 56.03% accuracy, surpassing LoRA (42.30%), Full FT (49.05%), and PiSSA (53.07%). We conclude that SORSA offers a new perspective on parameter-efficient fine-tuning, demonstrating remarkable performance. The code is available at https://github.com/Gunale0926/SORSA.

9/11/2024

SVFit: Parameter-Efficient Fine-Tuning of Large Pre-Trained Models Using Singular Values

Chengwei Sun, Jiwei Wei, Yujia Wu, Yiming Shi, Shiyuan He, Zeyu Ma, Ning Xie, Yang Yang

Large pre-trained models (LPMs) have demonstrated exceptional performance in diverse natural language processing and computer vision tasks. However, fully fine-tuning these models poses substantial memory challenges, particularly in resource-constrained environments. Parameter-efficient fine-tuning (PEFT) methods, such as LoRA, mitigate this issue by adjusting only a small subset of parameters. Nevertheless, these methods typically employ random initialization for low-rank matrices, which can lead to inefficiencies in gradient descent and diminished generalizability due to suboptimal starting points. To address these limitations, we propose SVFit, a novel PEFT approach that leverages singular value decomposition (SVD) to initialize low-rank matrices using critical singular values as trainable parameters. Specifically, SVFit performs SVD on the pre-trained weight matrix to obtain the best rank-r approximation matrix, emphasizing the most critical singular values that capture over 99% of the matrix's information. These top-r singular values are then used as trainable parameters to scale the fundamental subspaces of the matrix, facilitating rapid domain adaptation. Extensive experiments across various pre-trained models in natural language understanding, text-to-image generation, and image classification tasks reveal that SVFit outperforms LoRA while requiring 16 times fewer trainable parameters.

9/11/2024

SARA: Singular-Value Based Adaptive Low-Rank Adaption

Jihao Gu, Shuai Chen, Zelin Wang, Yibo Zhang, Ping Gong

With the increasing number of parameters in large pre-trained models, LoRA as a parameter-efficient fine-tuning (PEFT) method is widely used for not adding inference overhead. The LoRA method assumes that weight changes during fine-tuning can be approximated by low-rank matrices. However, the rank values need to be manually verified to match different downstream tasks, and they cannot accommodate the varying importance of different layers in the model. In this work, we first analyze the relationship between the performance of different layers and their ranks using SVD. Based on this, we design the Singular-Value Based Adaptive Low-Rank Adaption (SARA), which adaptively finds the rank during initialization by performing SVD on the pre-trained weights. Additionally, we explore the Mixture-of-SARA (Mo-SARA), which significantly reduces the number of parameters by fine-tuning only multiple parallel sets of singular values controlled by a router. Extensive experiments on various complex tasks demonstrate the simplicity and parameter efficiency of our methods. They can effectively and adaptively find the most suitable rank for each layer of each model.

8/7/2024