SLoPe: Double-Pruned Sparse Plus Lazy Low-Rank Adapter Pretraining of LLMs

Read original: arXiv:2405.16325 - Published 6/17/2024 by Mohammad Mozaffari, Amir Yazdanbakhsh, Zhao Zhang, Maryam Mehri Dehnavi

Overview

SLoPe is a pretraining approach for large language models (LLMs) that combines N:M sparse weights with low-rank adapters to improve the accuracy of sparse models while accelerating pretraining and inference.
The key ideas are: 1) a double-pruned backward pass that also prunes the transposed weight matrix with an N:M sparsity structure, so the backward pass can be accelerated too, and 2) lazy low-rank adapters that are added only in the final 1% of pretraining iterations.
This approach aims to recover accuracy lost to sparse pretraining and to reduce the memory footprint, without adding significant overhead to pretraining or inference.

Plain English Explanation

SLoPe is a new way of training large language models (LLMs) that makes them more efficient and effective. Traditional LLMs can be very large and complex, requiring a lot of computing power and storage space. SLoPe tackles this problem by using two key techniques:

Sparse Weights: SLoPe trains the model with many of its weights set to zero in a regular pattern (called N:M sparsity). The same kind of pattern is also applied to the transposed weights used when computing gradients, which is where the "double-pruned" name comes from.
Low-Rank Adapters: SLoPe also uses "low-rank adapters", small extra matrices with few parameters that help recover the accuracy lost to sparsity. They are "lazy" because they are added only in the last 1% of pretraining iterations, so they add little cost.

By pretraining the LLM with sparse weights and adding low-rank adapters near the end of pretraining, SLoPe produces a model that trains and runs faster and uses less memory than a dense one, while improving on the accuracy of plain sparse pretraining. This could make it easier and cheaper to use LLMs in real-world applications.

Technical Explanation

SLoPe builds on prior work on sparse LLMs and parameter-efficient fine-tuning to create a novel pretraining approach for LLMs.

The key technical components of SLoPe are:

Double-Pruned Sparse Backward Pass: SLoPe trains with sparse weights and uses a "double-pruned" backward pass formulation that prunes the transposed weight matrix using N:M sparsity structures. This enables an accelerated sparse backward pass in addition to the sparse forward pass.
Lazy Low-Rank Adapters: SLoPe adds low-rank adapters only in the final 1% of pretraining iterations. These "lazy" adapters improve the accuracy of the sparsely pretrained model without adding significant overhead to pretraining or inference.
Sparse Throughout: Prior work recovers the accuracy lost to sparse pretraining by using dense models during fine-tuning. SLoPe instead keeps the weights sparse and relies on the low-rank adapters to recover accuracy.
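The double-pruning idea can be illustrated with a small NumPy sketch. It follows only the abstract's description (pruning the transposed weight matrix with an N:M structure) and is not the authors' implementation: the magnitude-based selection rule and the helper names `nm_mask` and `double_prune` are assumptions made for illustration.

```python
import numpy as np

def nm_mask(w, n=2, m=4):
    """Boolean mask keeping the n largest-magnitude entries in every
    group of m consecutive entries along the last axis (N:M sparsity).
    Magnitude-based selection is an assumption of this sketch."""
    rows, cols = w.shape
    assert cols % m == 0, "last dimension must be divisible by m"
    groups = np.abs(w).reshape(rows, cols // m, m)
    order = np.argsort(groups, axis=-1)        # ascending by magnitude
    keep = order[..., m - n:]                  # indices of the n largest
    mask = np.zeros(groups.shape, dtype=bool)
    np.put_along_axis(mask, keep, True, axis=-1)
    return mask.reshape(rows, cols)

def double_prune(w, n=2, m=4):
    """Return (w_fwd, w_bwd).
    w_fwd is N:M sparse along its rows (used as x @ w_fwd.T).
    w_bwd is w_fwd pruned again so that its transpose is also N:M
    sparse (used as grad_y @ w_bwd in the backward pass)."""
    w_fwd = w * nm_mask(w, n, m)
    w_bwd = w_fwd * nm_mask(w_fwd.T, n, m).T
    return w_fwd, w_bwd

rng = np.random.default_rng(0)
w = rng.standard_normal((8, 8))          # (out_features, in_features)
w_fwd, w_bwd = double_prune(w)

x = rng.standard_normal((3, 8))
y = x @ w_fwd.T                          # forward pass with row-wise N:M weights
grad_y = np.ones_like(y)
grad_x = grad_y @ w_bwd                  # input gradient with the double-pruned weights
```

In this sketch the forward pass reduces over the input dimension and the backward pass reduces over the output dimension, which is why the second pruning step is applied to the transpose. The input gradient computed with `w_bwd` is an approximation of the one computed with `w_fwd`, since the second pruning can only remove entries.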

The authors report that SLoPe accelerates the training and inference of models with billions of parameters by up to 1.14× and 1.34× respectively (OPT-33B and OPT-66B), while reducing memory usage by up to 0.77× for training and 0.51× for inference. This builds on related work on dense training for sparse inference and low-rank adapters for quantized training.
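The "lazy" part of the low-rank adapters, which the paper's abstract describes as adding them in the final 1% of pretraining iterations, can be sketched as a layer that attaches its adapter only once training reaches that point. This is a minimal illustration, not the authors' code: the class name `LazyLowRankLinear`, the zero initialization of one factor (borrowed from common LoRA practice), and the 0.01 scale are all assumptions.

```python
import numpy as np

class LazyLowRankLinear:
    """Linear layer with a (sparse) weight and a low-rank adapter that is
    attached only for the last `lazy_fraction` of training steps."""

    def __init__(self, w_sparse, rank, total_steps, lazy_fraction=0.01):
        self.w = w_sparse                       # (out_features, in_features)
        self.rank = rank
        # First step at which the adapter exists, e.g. step 990 of 1000.
        self.start_step = int(round(total_steps * (1.0 - lazy_fraction)))
        self.left = None                        # (out_features, rank)
        self.right = None                       # (rank, in_features)

    def maybe_attach(self, step, rng):
        """Create the adapter the first time step >= start_step."""
        if self.left is None and step >= self.start_step:
            out_dim, in_dim = self.w.shape
            # Assumed init: a zero left factor keeps the layer output
            # unchanged at the moment the adapter is attached.
            self.left = np.zeros((out_dim, self.rank))
            self.right = 0.01 * rng.standard_normal((self.rank, in_dim))

    def forward(self, x):
        y = x @ self.w.T
        if self.left is not None:
            # Two thin matmuls instead of forming the dense product left @ right.
            y = y + (x @ self.right.T) @ self.left.T
        return y
```

Before `start_step` the layer costs exactly one sparse matmul per call. After it, the adapter adds two thin matmuls whose cost scales with `rank` rather than with the full weight size.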

Critical Analysis

The SLoPe approach appears to be a promising step towards sparse LLMs that are cheaper to pretrain and deploy. The authors report faster training and inference and a smaller memory footprint alongside improved accuracy for sparsely pretrained models, which is an important advancement.

However, the paper does not fully address the potential limitations of sparse weights and low-rank adapters. For example, N:M sparse weights may not capture all the nuances of a dense LLM, and the low-rank adapters may struggle with certain types of tasks. Additionally, attaching adapters late in pretraining adds complexity to the training pipeline.

Further research is needed to explore the broader applicability of SLoPe, including its performance on more diverse datasets and tasks. The authors also acknowledge the need to investigate the robustness and reliability of sparse LLMs in real-world scenarios.

Conclusion

SLoPe presents a novel approach to pretraining large language models that combines N:M sparse weights with lazily added low-rank adapters to speed up training and inference and reduce memory use without sacrificing much accuracy. This work builds on and advances the state of the art in efficient LLM architectures and training techniques.

The ability to create high-sparsity, parameter-efficient LLMs that can be fine-tuned effectively on a wide range of tasks has important implications for the deployment of these powerful language models in real-world applications, where computing resources and model size are often at a premium. Further research and refinement of the SLoPe approach could lead to even more efficient and capable LLMs in the future.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

SLoPe: Double-Pruned Sparse Plus Lazy Low-Rank Adapter Pretraining of LLMs

Mohammad Mozaffari, Amir Yazdanbakhsh, Zhao Zhang, Maryam Mehri Dehnavi

We propose SLoPe, a Double-Pruned Sparse Plus Lazy Low-rank Adapter Pretraining method for LLMs that improves the accuracy of sparse LLMs while accelerating their pretraining and inference and reducing their memory footprint. Sparse pretraining of LLMs reduces the accuracy of the model, to overcome this, prior work uses dense models during fine-tuning. SLoPe improves the accuracy of sparsely pretrained models by adding low-rank adapters in the final 1% iterations of pretraining without adding significant overheads to the model pretraining and inference. In addition, SLoPe uses a double-pruned backward pass formulation that prunes the transposed weight matrix using N:M sparsity structures to enable an accelerated sparse backward pass. SLoPe accelerates the training and inference of models with billions of parameters up to $1.14\times$ and $1.34\times$ respectively (OPT-33B and OPT-66B) while reducing their memory usage by up to $0.77\times$ and $0.51\times$ for training and inference respectively.

6/17/2024

SLTrain: a sparse plus low-rank approach for parameter and memory efficient pretraining

Andi Han, Jiaxiang Li, Wei Huang, Mingyi Hong, Akiko Takeda, Pratik Jawanpuria, Bamdev Mishra

Large language models (LLMs) have shown impressive capabilities across various tasks. However, training LLMs from scratch requires significant computational power and extensive memory capacity. Recent studies have explored low-rank structures on weights for efficient fine-tuning in terms of parameters and memory, either through low-rank adaptation or factorization. While effective for fine-tuning, low-rank structures are generally less suitable for pretraining because they restrict parameters to a low-dimensional subspace. In this work, we propose to parameterize the weights as a sum of low-rank and sparse matrices for pretraining, which we call SLTrain. The low-rank component is learned via matrix factorization, while for the sparse component, we employ a simple strategy of uniformly selecting the sparsity support at random and learning only the non-zero entries with the fixed support. While being simple, the random fixed-support sparse learning strategy significantly enhances pretraining when combined with low-rank learning. Our results show that SLTrain adds minimal extra parameters and memory costs compared to pretraining with low-rank parameterization, yet achieves substantially better performance, which is comparable to full-rank training. Remarkably, when combined with quantization and per-layer updates, SLTrain can reduce memory requirements by up to 73% when pretraining the LLaMA 7B model.

6/5/2024
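The SLTrain abstract above describes weights parameterized as a low-rank product plus a sparse matrix whose support is chosen uniformly at random and then fixed. A rough NumPy sketch of that parameterization follows. The function names `init_sltrain` and `compose`, the initialization scales, and the choice to start the sparse values at zero are assumptions for illustration, not details taken from the paper.

```python
import numpy as np

def init_sltrain(out_dim, in_dim, rank, density, rng):
    """Create the trainable pieces: low-rank factors b, a and the values
    of a sparse matrix whose support is sampled once and then fixed."""
    b = rng.standard_normal((out_dim, rank)) / np.sqrt(rank)
    a = rng.standard_normal((rank, in_dim)) / np.sqrt(in_dim)
    num_nonzero = int(density * out_dim * in_dim)
    # Flat indices of the sparse support, uniform at random, no repeats.
    support = rng.choice(out_dim * in_dim, size=num_nonzero, replace=False)
    values = np.zeros(num_nonzero)              # learned during training
    return b, a, support, values

def compose(b, a, support, values):
    """Materialize W = B @ A + S, where S is nonzero only on `support`."""
    w = b @ a
    w.flat[support] += values
    return w

rng = np.random.default_rng(0)
out_dim, in_dim, rank, density = 16, 32, 4, 0.05
b, a, support, values = init_sltrain(out_dim, in_dim, rank, density, rng)
w = compose(b, a, support, values)

# Trainable parameters: low-rank factors plus sparse values, versus a dense matrix.
trainable = b.size + a.size + values.size
dense = out_dim * in_dim
```

Only `b`, `a`, and `values` would be updated by an optimizer here; `support` stays fixed after initialization, matching the "fixed support" strategy the abstract mentions.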

Enabling High-Sparsity Foundational Llama Models with Efficient Pretraining and Deployment

Abhinav Agarwalla, Abhay Gupta, Alexandre Marques, Shubhra Pandit, Michael Goin, Eldar Kurtic, Kevin Leong, Tuan Nguyen, Mahmoud Salem, Dan Alistarh, Sean Lie, Mark Kurtz

Large language models (LLMs) have revolutionized Natural Language Processing (NLP), but their size creates computational bottlenecks. We introduce a novel approach to create accurate, sparse foundational versions of performant LLMs that achieve full accuracy recovery for fine-tuning tasks at up to 70% sparsity. We achieve this for the LLaMA-2 7B model by combining the SparseGPT one-shot pruning method and sparse pretraining of those models on a subset of the SlimPajama dataset mixed with a Python subset of The Stack dataset. We exhibit training acceleration due to sparsity on Cerebras CS-3 chips that closely matches theoretical scaling. In addition, we establish inference acceleration of up to 3x on CPUs by utilizing Neural Magic's DeepSparse engine and 1.7x on GPUs through Neural Magic's nm-vllm engine. The above gains are realized via sparsity alone, thus enabling further gains through additional use of quantization. Specifically, we show a total speedup on CPUs for sparse-quantized LLaMA models of up to 8.6x. We demonstrate these results across diverse, challenging tasks, including chat, instruction following, code generation, arithmetic reasoning, and summarization to prove their generality. This work paves the way for rapidly creating smaller and faster LLMs without sacrificing accuracy.

5/7/2024

SPP: Sparsity-Preserved Parameter-Efficient Fine-Tuning for Large Language Models

Xudong Lu, Aojun Zhou, Yuhui Xu, Renrui Zhang, Peng Gao, Hongsheng Li

Large Language Models (LLMs) have become pivotal in advancing the field of artificial intelligence, yet their immense sizes pose significant challenges for both fine-tuning and deployment. Current post-training pruning methods, while reducing the sizes of LLMs, often fail to maintain their original performance. To address these challenges, this paper introduces SPP, a Sparsity-Preserved Parameter-efficient fine-tuning method. Different from existing post-training pruning approaches that struggle with performance retention, SPP proposes to employ lightweight learnable column and row matrices to optimize sparse LLM weights, keeping the structure and sparsity of pruned pre-trained models intact. By element-wise multiplication and residual addition, SPP ensures the consistency of model sparsity pattern and ratio during both training and weight-merging processes. We demonstrate the effectiveness of SPP by applying it to the LLaMA and LLaMA-2 model families with recent post-training pruning methods. Our results show that SPP significantly enhances the performance of models with different sparsity patterns (i.e. unstructured and N:M sparsity), especially for those with high sparsity ratios (e.g. 75%), making it a promising solution for the efficient fine-tuning of sparse LLMs. Code will be made available at https://github.com/Lucky-Lance/SPP.

5/28/2024
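The SPP abstract above says that element-wise multiplication and residual addition keep the sparsity pattern intact when the learned matrices are merged into the weights. The sketch below shows why an update of that general shape cannot turn a zero weight into a nonzero one. It is a simplified illustration and not SPP's exact parameterization: the function name `sparsity_preserving_merge` and the specific form `W + W * (col @ row)` are assumptions.

```python
import numpy as np

def sparsity_preserving_merge(w_sparse, col, row):
    """Merge lightweight factors into a sparse weight matrix.
    scale = col @ row has the same shape as w_sparse; multiplying it
    element-wise by w_sparse yields zero wherever w_sparse is zero, so
    the residual sum keeps every pruned position at zero."""
    scale = col @ row                      # (out, r) @ (r, in)
    return w_sparse + w_sparse * scale

rng = np.random.default_rng(0)
w = rng.standard_normal((6, 8))
w[rng.random((6, 8)) < 0.75] = 0.0         # roughly 75% unstructured sparsity

col = 0.1 * rng.standard_normal((6, 2))    # learnable column-side factor
row = 0.1 * rng.standard_normal((2, 8))    # learnable row-side factor
merged = sparsity_preserving_merge(w, col, row)

# Every position that was pruned in w is still zero after merging.
pruned_positions_still_zero = bool(np.all(merged[w == 0] == 0))
```

Contrast this with a plain additive low-rank update `W + col @ row`, which would generally fill in the pruned positions and destroy the sparsity pattern.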