NOLA: Compressing LoRA using Linear Combination of Random Basis

arXiv:2310.02556

Published 5/1/2024 by Soroush Abbasi Koohpayegani, KL Navaneet, Parsa Nooralinejad, Soheil Kolouri, Hamed Pirsiavash


Abstract

Fine-tuning Large Language Models (LLMs) and storing them for each downstream task or domain is impractical because of the massive model size (e.g., 350GB in GPT-3). Current literature, such as LoRA, showcases the potential of low-rank modifications to the original weights of an LLM, enabling efficient adaptation and storage for task-specific models. These methods can reduce the number of parameters needed to fine-tune an LLM by several orders of magnitude. Yet, these methods face two primary limitations: (1) the parameter count is lower-bounded by the rank one decomposition, and (2) the extent of reduction is heavily influenced by both the model architecture and the chosen rank. We introduce NOLA, which overcomes the rank one lower bound present in LoRA. It achieves this by re-parameterizing the low-rank matrices in LoRA using linear combinations of randomly generated matrices (basis) and optimizing the linear mixture coefficients only. This approach allows us to decouple the number of trainable parameters from both the choice of rank and the network architecture. We present adaptation results using GPT-2, LLaMA-2, and ViT in natural language and computer vision tasks. NOLA performs as well as LoRA models while using far fewer parameters than LoRA with rank one, the best compression LoRA can achieve. In particular, on LLaMA-2 70B, our method is almost 20 times more compact than the most compressed LoRA without degradation in accuracy. Our code is available here: https://github.com/UCDvision/NOLA


Overview

Fine-tuning large language models (LLMs) like GPT-3 for each downstream task is impractical due to the massive model size (e.g., 350GB).
Existing methods like LoRA can reduce the number of parameters needed to fine-tune an LLM by several orders of magnitude.
However, these methods face two limitations: (1) the parameter count is lower-bounded by the rank one decomposition, and (2) the extent of reduction is heavily influenced by the model architecture and the chosen rank.
The paper introduces NOLA, a new approach that overcomes the rank one lower bound in LoRA.

Plain English Explanation

Large language models like GPT-3 are incredibly powerful, but they are also massive in size, often hundreds of gigabytes. This makes it impractical to fine-tune these models for every specific task or domain you want to use them for.

Researchers have developed methods like LoRA that can significantly reduce the number of parameters needed to fine-tune these models, making them much more practical to use. LoRA works by keeping the original model weights fixed and learning a small low-rank modification on top of them, rather than updating every weight in the model.
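To make the idea of a low-rank modification concrete, here is a minimal NumPy sketch of a LoRA-style layer. It is an illustration only, not the authors' code: the function names, the dimensions, and the `scale` argument are assumptions chosen for the example.

```python
import numpy as np

def lora_forward(x, W, A, B, scale=1.0):
    """Apply a frozen weight W plus a low-rank update B @ A to inputs x."""
    # W: (out_dim, in_dim), kept frozen.
    # A: (rank, in_dim) and B: (out_dim, rank) are the only trainable parts.
    delta_W = B @ A  # same shape as W, rank at most `rank`
    return x @ (W + scale * delta_W).T

def lora_param_count(out_dim, in_dim, rank):
    """Number of trainable values stored in A and B together."""
    return rank * (in_dim + out_dim)

# Hypothetical sizes, chosen only for illustration.
rng = np.random.default_rng(0)
out_dim, in_dim, rank = 64, 48, 4
W = rng.standard_normal((out_dim, in_dim))
A = rng.standard_normal((rank, in_dim)) * 0.01
B = np.zeros((out_dim, rank))  # zero init: the adapted layer starts out equal to W
x = rng.standard_normal((2, in_dim))
y = lora_forward(x, W, A, B)
```

With these sizes the full weight matrix has 64 × 48 = 3,072 entries, while `lora_param_count(64, 48, 4)` gives 448 trainable values and the rank one case gives 112.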

However, LoRA and similar methods have some limitations. They are still bounded by the minimum number of parameters required for the rank one decomposition, and the level of compression they can achieve is heavily influenced by the specific model architecture and the chosen rank (a parameter that controls the complexity of the modifications).
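The rank one lower bound can be written out directly. The symbols below (m, n, r, A, B) are notation assumed for this note, following the usual LoRA formulation, and are not taken from the summary itself.

```latex
% Weight matrix W \in \mathbb{R}^{m \times n}, adapted as W + \Delta W.
\Delta W = B A, \qquad B \in \mathbb{R}^{m \times r}, \quad A \in \mathbb{R}^{r \times n}

% Trainable parameters per adapted matrix:
\#\text{params}_{\text{LoRA}} = r\,(m + n) \;\ge\; m + n \quad \text{for every integer } r \ge 1
```

The count depends on m and n, which are fixed by the architecture, and on r, which cannot go below one. Those are the two dependencies the paper sets out to remove.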

The paper introduces a new approach called NOLA that overcomes these limitations. NOLA uses a different way of re-parameterizing the low-rank modifications, which allows it to achieve even greater compression without sacrificing performance. The key insight is to use linear combinations of randomly generated matrices as the basis for the low-rank modifications, rather than being constrained to a rank one decomposition.

Technical Explanation

The paper presents NOLA, a new method for efficiently adapting large language models (LLMs) to specific tasks or domains. The key innovation is a novel re-parameterization of the low-rank modifications used in previous approaches such as LoRA.

Whereas these prior methods were limited by a rank one lower bound on the number of parameters needed for the low-rank modifications, NOLA decouples the number of trainable parameters from both the choice of rank and the network architecture. This is achieved by representing the low-rank matrices as linear combinations of randomly generated basis matrices, and then optimizing only the linear mixture coefficients.
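The re-parameterization can be sketched in a few lines of NumPy. This is a hedged illustration under stated assumptions, not the authors' implementation (their code is at the repository linked in the abstract): the names `random_basis` and `nola_delta`, the Gaussian basis, the use of seeds to regenerate the basis, and all sizes are choices made for this example.

```python
import numpy as np

def random_basis(seed, count, shape):
    """Regenerate `count` fixed random matrices of `shape` from a seed (assumed scheme)."""
    rng = np.random.default_rng(seed)
    return rng.standard_normal((count, *shape))

def nola_delta(alpha, beta, seed_a, seed_b, out_dim, in_dim, rank):
    """Build the low-rank update from mixture coefficients over frozen random bases."""
    A_basis = random_basis(seed_a, len(alpha), (rank, in_dim))
    B_basis = random_basis(seed_b, len(beta), (out_dim, rank))
    A = np.tensordot(alpha, A_basis, axes=1)  # sum_i alpha[i] * A_basis[i]
    B = np.tensordot(beta, B_basis, axes=1)   # sum_j beta[j] * B_basis[j]
    return B @ A                              # (out_dim, in_dim), rank at most `rank`

# Hypothetical sizes: 8 basis matrices per factor, so 16 trainable coefficients.
coef_rng = np.random.default_rng(1)
alpha = coef_rng.standard_normal(8)
beta = coef_rng.standard_normal(8)
out_dim, in_dim = 64, 48

delta_r4 = nola_delta(alpha, beta, 10, 11, out_dim, in_dim, rank=4)
delta_r2 = nola_delta(alpha, beta, 10, 11, out_dim, in_dim, rank=2)
trainable = alpha.size + beta.size
```

In this sketch only `alpha` and `beta` would be optimized; the basis matrices stay fixed and can be rebuilt from the seeds. Changing `rank`, `out_dim`, or `in_dim` changes the shape of the basis but leaves `trainable` at 16, which is the decoupling described above.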

The authors demonstrate the effectiveness of NOLA on a range of tasks and models, including GPT-2, LLaMA-2, and Vision Transformers (ViT). They show that NOLA can match the performance of LoRA-based models while using dramatically fewer parameters: on LLaMA-2 70B, NOLA is almost 20 times more compact than the most compressed (rank one) LoRA.

Critical Analysis

The NOLA approach presented in this paper is a clever and promising advancement over existing low-rank adaptation methods for large language models. By breaking the rank one constraint, NOLA is able to achieve much greater compression without sacrificing task performance.

That said, the limits of this compression remain an open question. The number of basis matrices sets the number of trainable coefficients, so using fewer of them shrinks the adapter further but may eventually hurt accuracy, while using more adds capacity at the cost of extra training computation. Additionally, although NOLA decouples the parameter count from the architecture, the number of coefficients needed to match LoRA may still vary across models and tasks, so NOLA may not be a one-size-fits-all solution.

Another potential area for future research is understanding the inductive biases introduced by the random basis matrices used in NOLA. While this re-parameterization allows for greater flexibility, it's not clear if there are any downsides or unintended consequences from this approach compared to the more structured low-rank decompositions used in prior work.

Overall, NOLA represents an exciting advance in efficient fine-tuning of large language models, and the authors have provided a strong technical foundation for further exploration and refinement of these techniques.

Conclusion

The NOLA method introduced in this paper offers a significant improvement over existing low-rank adaptation approaches for fine-tuning large language models. By decoupling the number of trainable parameters from the rank and model architecture, NOLA can achieve dramatic compression: on LLaMA-2 70B it is almost 20 times more compact than the most compressed LoRA, without degradation in accuracy.

This advance has important implications for the practical use of massive language models, as it makes it much more feasible to adapt these models to a wide range of downstream applications and domains. The authors have provided a solid technical foundation, along with experimental results demonstrating the effectiveness of NOLA across multiple task types and model architectures.

While there are still some open questions and avenues for further research, this paper represents an important step forward in making large language models more accessible and usable in real-world settings. As the field of AI continues to push the boundaries of what is possible with these powerful models, innovations like NOLA will be crucial for unlocking their full potential.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

OLoRA: Orthonormal Low-Rank Adaptation of Large Language Models

Kerim Buyukakyuz

The advent of large language models (LLMs) has revolutionized natural language processing, enabling unprecedented capabilities in understanding and generating human-like text. However, the computational cost and convergence times associated with fine-tuning these models remain significant challenges. Low-Rank Adaptation (LoRA) has emerged as a promising method to mitigate these issues by introducing efficient fine-tuning techniques with a reduced number of trainable parameters. In this paper, we present OLoRA, an enhancement to the LoRA method that leverages orthonormal matrix initialization through QR decomposition. OLoRA significantly accelerates the convergence of LLM training while preserving the efficiency benefits of LoRA, such as the number of trainable parameters and GPU memory footprint. Our empirical evaluations demonstrate that OLoRA not only converges faster but also exhibits improved performance compared to standard LoRA across a variety of language modeling tasks. This advancement opens new avenues for more efficient and accessible fine-tuning of LLMs, potentially enabling broader adoption and innovation in natural language applications.

6/5/2024

cs.CL

PC-LoRA: Low-Rank Adaptation for Progressive Model Compression with Knowledge Distillation

Injoon Hwang, Haewon Park, Youngwan Lee, Jooyoung Yang, SunJae Maeng

Low-rank adaptation (LoRA) is a prominent method that adds a small number of learnable parameters to the frozen pre-trained weights for parameter-efficient fine-tuning. Prompted by the question, "Can we make its representation enough with LoRA weights solely at the final phase of finetuning without the pre-trained weights?", we introduce Progressive Compression LoRA (PC-LoRA), which utilizes low-rank adaptation (LoRA) to simultaneously perform model compression and fine-tuning. The PC-LoRA method gradually removes the pre-trained weights during the training process, eventually leaving only the low-rank adapters in the end. Thus, these low-rank adapters replace the whole pre-trained weights, achieving the goals of compression and fine-tuning at the same time. Empirical analysis across various models demonstrates that PC-LoRA achieves parameter and FLOPs compression rates of 94.36%/89.1% for vision models, e.g., ViT-B, and 93.42%/84.2% for language models, e.g., BERT.

6/14/2024

cs.CV cs.AI

Compress then Serve: Serving Thousands of LoRA Adapters with Little Overhead

Rickard Bruel-Gabrielsson, Jiacheng Zhu, Onkar Bhardwaj, Leshem Choshen, Kristjan Greenewald, Mikhail Yurochkin, Justin Solomon

Fine-tuning large language models (LLMs) with low-rank adapters (LoRAs) has become common practice, often yielding numerous copies of the same LLM differing only in their LoRA updates. This paradigm presents challenges for systems that serve real-time responses to queries that each involve a different LoRA. Prior works optimize the design of such systems but still require continuous loading and offloading of LoRAs, as it is infeasible to store thousands of LoRAs in GPU memory. To mitigate this issue, we investigate the efficacy of compression when serving LoRA adapters. We consider compressing adapters individually via SVD and propose a method for joint compression of LoRAs into a shared basis paired with LoRA-specific scaling matrices. Our experiments with up to 500 LoRAs demonstrate that compressed LoRAs preserve performance while offering major throughput gains in realistic serving scenarios with over a thousand LoRAs, maintaining 75% of the throughput of serving a single LoRA.

7/2/2024

cs.DC cs.AI cs.CL cs.LG


LoRA-XS: Low-Rank Adaptation with Extremely Small Number of Parameters

Klaudia Bałazy, Mohammadreza Banaei, Karl Aberer, Jacek Tabor

The recent trend in scaling language models has led to a growing demand for parameter-efficient tuning (PEFT) methods such as LoRA (Low-Rank Adaptation). LoRA consistently matches or surpasses the full fine-tuning baseline with fewer parameters. However, handling numerous task-specific or user-specific LoRA modules on top of a base model still presents significant storage challenges. To address this, we introduce LoRA-XS (Low-Rank Adaptation with eXtremely Small number of parameters), a novel approach leveraging Singular Value Decomposition (SVD) for parameter-efficient fine-tuning. LoRA-XS introduces a small r x r weight matrix between frozen LoRA matrices, which are constructed by SVD of the original weight matrix. Training only r x r weight matrices ensures independence from model dimensions, enabling more parameter-efficient fine-tuning, especially for larger models. LoRA-XS achieves a remarkable reduction of trainable parameters by over 100x in 7B models compared to LoRA. Our benchmarking across various scales, including GLUE, GSM8k, and MATH benchmarks, shows that our approach outperforms LoRA and recent state-of-the-art approaches like VeRA in terms of parameter efficiency while maintaining competitive performance.

5/29/2024

cs.LG cs.AI cs.CL