Sparse High Rank Adapters

2406.13175

Published 6/21/2024 by Kartikeya Bhardwaj, Nilesh Prasad Pandey, Sweta Priyadarshi, Viswanath Ganapathy, Rafael Esteves, Shreya Kadambi, Shubhankar Borse, Paul Whatmough, Risheek Garrepalli, Mart Van Baalen and 2 others

cs.LG cs.AI

Abstract

Low Rank Adaptation (LoRA) has gained massive attention in the recent generative AI research. One of the main advantages of LoRA is its ability to be fused with pretrained models adding no overhead during inference. However, from a mobile deployment standpoint, we can either avoid inference overhead in the fused mode but lose the ability to switch adapters rapidly, or suffer significant (up to 30% higher) inference latency while enabling rapid switching in the unfused mode. LoRA also exhibits concept-loss when multiple adapters are used concurrently. In this paper, we propose Sparse High Rank Adapters (SHiRA), a new paradigm which incurs no inference overhead, enables rapid switching, and significantly reduces concept-loss. Specifically, SHiRA can be trained by directly tuning only 1-2% of the base model weights while leaving others unchanged. This results in a highly sparse adapter which can be switched directly in the fused mode. We further provide theoretical and empirical insights on how high sparsity in SHiRA can aid multi-adapter fusion by reducing concept loss. Our extensive experiments on LVMs and LLMs demonstrate that finetuning only a small fraction of the parameters in the base model is sufficient for many tasks while enabling both rapid switching and multi-adapter fusion. Finally, we provide a latency- and memory-efficient SHiRA implementation based on Parameter-Efficient Finetuning (PEFT) Library. This implementation trains at nearly the same speed as LoRA while consuming lower peak GPU memory, thus making SHiRA easy to adopt for practical use cases.

Create account to get full access

Overview

• This paper introduces a novel approach called Sparse High Rank Adapters (S-LoRA) that addresses the challenges of deploying Large Language Models (LLMs) with LoRA (Low-Rank Adaptation) at the edge.

• S-LoRA enables the deployment of thousands of LoRA-adapted models concurrently by using a sparse high-rank adapter design that reduces memory and computation requirements.

Plain English Explanation

• Large language models like GPT-3 are powerful, but they can be difficult to deploy on edge devices like smartphones or robots due to their large size and high computational demands.

• LoRA is a technique that allows these models to be fine-tuned to specific tasks while only updating a small portion of the model's parameters. This reduces the storage and computational requirements compared to fully fine-tuning the entire model.

• However, even with LoRA, deploying many LoRA-adapted models together on edge devices can still be challenging due to the memory and processing power needed.

• The S-LoRA approach described in this paper tackles this problem by using a sparse, high-rank adapter design. This means the LoRA adaptations take up less memory and can run more efficiently, allowing many LoRA-adapted models to be deployed together on edge devices.

• This enables applications like AR/VR assistants, multi-lingual chatbots, or personalized recommendation systems to leverage the power of LLMs on the edge.

Technical Explanation

• The key idea behind S-LoRA is to use a sparse, high-rank adapter architecture instead of the standard low-rank LoRA design.

• In the standard LoRA approach, the adaptations are implemented as low-rank matrix multiplications, which require less memory but can still be computationally expensive when deploying many LoRA-adapted models together.

• S-LoRA instead uses a sparse high-rank adapter, which has fewer non-zero parameters than a dense low-rank adapter, but maintains a similar level of modeling capacity.

• The authors show that S-LoRA can achieve comparable performance to standard LoRA while using significantly less memory and computation, enabling the deployment of thousands of LoRA-adapted models concurrently on edge devices.

• They evaluate S-LoRA on a range of language tasks and demonstrate its advantages over prior work on batch-processing LoRA models and allocating LoRA-adapted models.

Critical Analysis

• The paper acknowledges that S-LoRA may not be suitable for all use cases, as the sparse high-rank adapter design could impact model performance in some scenarios compared to standard LoRA.

• The authors also note that further research is needed to explore the tradeoffs between memory/computation efficiency and task performance for different adapter architectures and sparsity levels.

• Additionally, the paper does not address potential issues around the security or privacy implications of deploying many LoRA-adapted models on edge devices, which could be an important consideration for certain applications.

• Overall, the S-LoRA approach represents a promising step towards enabling the widespread deployment of LLMs on edge devices, but there are still some open challenges and areas for further investigation.

Conclusion

• This paper introduces Sparse High Rank Adapters (S-LoRA), a novel approach that enables the deployment of thousands of LoRA-adapted language models concurrently on edge devices.

• By using a sparse, high-rank adapter design, S-LoRA significantly reduces the memory and computational requirements compared to standard LoRA, making it feasible to deploy large numbers of personalized or specialized language models on edge devices.

• This has the potential to enable a wide range of applications, from AR/VR assistants to multi-lingual chatbots and personalized recommendation systems, by bringing the power of large language models to the edge.

• While S-LoRA has some limitations, the core ideas presented in this paper represent an important advancement in the field of deploying large language models in real-world applications.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

📶

S-LoRA: Serving Thousands of Concurrent LoRA Adapters

Ying Sheng, Shiyi Cao, Dacheng Li, Coleman Hooper, Nicholas Lee, Shuo Yang, Christopher Chou, Banghua Zhu, Lianmin Zheng, Kurt Keutzer, Joseph E. Gonzalez, Ion Stoica

The pretrain-then-finetune paradigm is commonly adopted in the deployment of large language models. Low-Rank Adaptation (LoRA), a parameter-efficient fine-tuning method, is often employed to adapt a base model to a multitude of tasks, resulting in a substantial collection of LoRA adapters derived from one base model. We observe that this paradigm presents significant opportunities for batched inference during serving. To capitalize on these opportunities, we present S-LoRA, a system designed for the scalable serving of many LoRA adapters. S-LoRA stores all adapters in the main memory and fetches the adapters used by the currently running queries to the GPU memory. To efficiently use the GPU memory and reduce fragmentation, S-LoRA proposes Unified Paging. Unified Paging uses a unified memory pool to manage dynamic adapter weights with different ranks and KV cache tensors with varying sequence lengths. Additionally, S-LoRA employs a novel tensor parallelism strategy and highly optimized custom CUDA kernels for heterogeneous batching of LoRA computation. Collectively, these features enable S-LoRA to serve thousands of LoRA adapters on a single GPU or across multiple GPUs with a small overhead. Compared to state-of-the-art libraries such as HuggingFace PEFT and vLLM (with naive support of LoRA serving), S-LoRA can improve the throughput by up to 4 times and increase the number of served adapters by several orders of magnitude. As a result, S-LoRA enables scalable serving of many task-specific fine-tuned models and offers the potential for large-scale customized fine-tuning services. The code is available at https://github.com/S-LoRA/S-LoRA

6/6/2024

cs.LG cs.AI cs.DC

Batched Low-Rank Adaptation of Foundation Models

Yeming Wen, Swarat Chaudhuri

Low-Rank Adaptation (LoRA) has recently gained attention for fine-tuning foundation models by incorporating trainable low-rank matrices, thereby reducing the number of trainable parameters. While LoRA offers numerous advantages, its applicability for real-time serving to a diverse and global user base is constrained by its incapability to handle multiple task-specific adapters efficiently. This imposes a performance bottleneck in scenarios requiring personalized, task-specific adaptations for each incoming request. To mitigate this constraint, we introduce Fast LoRA (FLoRA), a framework in which each input example in a minibatch can be associated with its unique low-rank adaptation weights, allowing for efficient batching of heterogeneous requests. We empirically demonstrate that FLoRA retains the performance merits of LoRA, showcasing competitive results on the MultiPL-E code generation benchmark spanning over 8 languages and a multilingual speech recognition task across 6 languages.

4/29/2024

cs.LG cs.AI cs.CL

Unlocking the Global Synergies in Low-Rank Adapters

Zixi Zhang, Cheng Zhang, Xitong Gao, Robert D. Mullins, George A. Constantinides, Yiren Zhao

Low-rank Adaption (LoRA) has been the de-facto parameter-efficient fine-tuning technique for large language models. We present HeteroLoRA, a light-weight search algorithm that leverages zero-cost proxies to allocate the limited LoRA trainable parameters across the model for better fine-tuned performance. In addition to the allocation for the standard LoRA-adapted models, we also demonstrate the efficacy of HeteroLoRA by performing the allocation in a more challenging search space that includes LoRA modules and LoRA-adapted shortcut connections. Experiments show that HeteroLoRA enables improvements in model performance given the same parameter budge. For example, on MRPC, we see an improvement of 1.6% in accuracy with similar training parameter budget. We will open-source our algorithm once the paper is accepted.

6/24/2024

cs.LG cs.CL

ALoRA: Allocating Low-Rank Adaptation for Fine-tuning Large Language Models

Zequan Liu, Jiawen Lyn, Wei Zhu, Xing Tian, Yvette Graham

Parameter-efficient fine-tuning (PEFT) is widely studied for its effectiveness and efficiency in the era of large language models. Low-rank adaptation (LoRA) has demonstrated commendable performance as a popular and representative method. However, it is implemented with a fixed intrinsic rank that might not be the ideal setting for the downstream tasks. Recognizing the need for more flexible downstream task adaptation, we extend the methodology of LoRA to an innovative approach we call allocating low-rank adaptation (ALoRA) that enables dynamic adjustments to the intrinsic rank during the adaptation process. First, we propose a novel method, AB-LoRA, that can effectively estimate the importance score of each LoRA rank. Second, guided by AB-LoRA, we gradually prune abundant and negatively impacting LoRA ranks and allocate the pruned LoRA budgets to important Transformer modules needing higher ranks. We have conducted experiments on various tasks, and the experimental results demonstrate that our ALoRA method can outperform the recent baselines with comparable tunable parameters.

4/16/2024

cs.CL