ReALLM: A general framework for LLM compression and fine-tuning

Read original: arXiv:2405.13155 - Published 5/24/2024 by Louis Leconte, Lisa Bedin, Van Minh Nguyen, Eric Moulines

Overview

Introduces ReALLM, a novel approach for compressing and adapting pre-trained language models with a budget of less than 4 bits per parameter
Decomposes pre-trained matrices into a high-precision low-rank component and a vector-quantized latent representation using an autoencoder
Updates only the low-rank components during fine-tuning, and adapts the encoder shape to each matrix
Represents each matrix with a small embedding and a neural decoder model, enabling efficient decompression
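
In symbols, the decomposition summarized above can be sketched as follows. The names $L_1$, $L_2$, $r$, $m$, $n$, and $e$ are notation assumed here for illustration; only $b$, $\mathcal{D}_\phi$, and $b_\phi$ appear in the paper's abstract.

```latex
W \;\approx\; \mathcal{D}_\phi(e) \;+\; L_1 L_2,
\qquad L_1 \in \mathbb{R}^{m \times r},\quad L_2 \in \mathbb{R}^{r \times n},\quad r \ll \min(m, n)
```

Here $W$ is an $m \times n$ pre-trained matrix, $e$ is its embedding stored on $b$ bits, and $\mathcal{D}_\phi$ is the decoder whose weights are stored on $b_\phi$ bits. Under this reading, fine-tuning changes only $L_1$ and $L_2$.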

Plain English Explanation

ReALLM is a new way to make pre-trained language models more compact and adaptable, using less than 4 bits per parameter. The key idea is to break down the pre-trained matrices (the building blocks of the model) into two parts: a high-quality low-rank version, and a compressed latent representation using vector quantization.
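
To make the vector quantization idea concrete, here is a minimal sketch using a plain k-means codebook. It is not ReALLM's method: ReALLM quantizes a latent produced by an autoencoder, whereas this toy quantizes the values directly. All function names and parameters (`vq_compress`, `dim`, `codebook_size`) are hypothetical.

```python
import numpy as np

def nearest_codes(vectors, codebook):
    """Index of the closest codeword (squared Euclidean distance) for each vector."""
    dists = ((vectors[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return dists.argmin(axis=1)

def vq_compress(values, dim=4, codebook_size=16, iters=10, seed=0):
    """Toy vector quantizer: k-means over groups of `dim` consecutive values.

    Assumes values.size is divisible by `dim` and that there are at least
    `codebook_size` groups.
    """
    rng = np.random.default_rng(seed)
    vectors = values.reshape(-1, dim)
    # Initialise the codebook from randomly chosen groups.
    start = rng.choice(len(vectors), size=codebook_size, replace=False)
    codebook = vectors[start].copy()
    for _ in range(iters):
        codes = nearest_codes(vectors, codebook)
        # Move each codeword to the mean of the groups assigned to it.
        for k in range(codebook_size):
            members = vectors[codes == k]
            if len(members) > 0:
                codebook[k] = members.mean(axis=0)
    return nearest_codes(vectors, codebook), codebook

def vq_decompress(codes, codebook, shape):
    """Look up each stored index in the codebook and restore the original shape."""
    return codebook[codes].reshape(shape)

def index_bits_per_value(dim, codebook_size):
    """Bits spent on indices per original value (ignores the codebook itself)."""
    return np.log2(codebook_size) / dim
```

With 16 codewords and groups of 4 values, each group costs 4 bits of index, so the index cost is 1 bit per value. Larger groups or smaller codebooks push the rate lower at the price of a higher reconstruction error.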

During fine-tuning, only the low-rank part is updated, which helps the model adapt to new tasks without needing to update the full matrix. ReALLM also customizes the encoder shape (e.g., size of embeddings, number of bits for quantization) for each matrix, to find the most efficient representation.

The final model represents each matrix with a small embedding and a neural decoder network. To reconstruct the matrix, you only need the embedding and a single forward pass through the decoder, which takes far less memory than storing the full matrix. Even without any fine-tuning, this weight-only quantization approach gives the best results on language generation benchmarks (C4 and WikiText-2) at a budget of 3 bits per parameter. After fine-tuning on a small calibration dataset, ReALLM achieves state-of-the-art performance at a budget of 2 bits per parameter.

Technical Explanation

ReALLM builds on prior work on low-rank matrix decomposition and post-training quantization for language models. The key novelty is in adaptively shaping the encoder, using an approach inspired by CompactIFAI and Feature-based Low-Rank Compression.

After pre-training, ReALLM decomposes each pre-trained matrix into a high-precision low-rank component and a vector-quantized latent representation produced by an autoencoder. During fine-tuning, the quantized part stays fixed and only the low-rank components are updated, which lets the model adapt to new tasks with little memory overhead.
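
The split between a frozen compressed part and trainable low-rank factors can be sketched as below. This is a simplified illustration under stated assumptions, not the paper's algorithm: a truncated SVD supplies the low-rank factors, and a crude uniform rounding quantizer stands in for ReALLM's autoencoder and vector quantization. The names `decompose`, `finetune_step`, `Q`, `L1`, and `L2` are hypothetical.

```python
import numpy as np

def uniform_quantize(x, bits):
    """Round x onto 2**bits evenly spaced levels spanning its range (stand-in quantizer)."""
    levels = 2 ** bits - 1
    lo, hi = x.min(), x.max()
    scale = (hi - lo) / levels if hi > lo else 1.0
    return np.round((x - lo) / scale) * scale + lo

def decompose(W, rank, bits):
    """Split W into a frozen quantized part Q and trainable low-rank factors L1, L2."""
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    L1 = U[:, :rank] * S[:rank]   # shape (m, rank)
    L2 = Vt[:rank, :]             # shape (rank, n)
    Q = uniform_quantize(W - L1 @ L2, bits)
    return Q, L1, L2

def loss(Q, L1, L2, X, Y):
    """Mean squared-error objective 0.5 * ||X (Q + L1 L2) - Y||^2 / batch."""
    err = X @ (Q + L1 @ L2) - Y
    return 0.5 * np.sum(err ** 2) / len(X)

def finetune_step(Q, L1, L2, X, Y, lr=1e-3):
    """One gradient step on the loss above; Q is read but never modified."""
    err = X @ (Q + L1 @ L2) - Y
    grad_W = X.T @ err / len(X)   # gradient w.r.t. the full (m, n) matrix
    grad_L1 = grad_W @ L2.T       # chain rule through L1 @ L2
    grad_L2 = L1.T @ grad_W
    return L1 - lr * grad_L1, L2 - lr * grad_L2
```

Because only `L1` and `L2` receive gradients, the optimizer state scales with `rank * (m + n)` rather than `m * n`, which is the memory argument behind updating the low-rank components alone.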

The encoder shape (e.g., size of embeddings, number of bits for quantization) is customized for each matrix based on its characteristics. This allows ReALLM to find the most compact representation for each component.
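
One way to reason about per-matrix encoder shapes is to count the bits each candidate configuration costs. The accounting below is an assumption made for illustration (the summary does not say how the paper counts bits or selects shapes): it charges the embedding at `b` bits per entry, the decoder at `b_phi` bits per weight (optionally amortized over several matrices), and the low-rank factors at 16 bits. All names are hypothetical.

```python
def bits_per_parameter(m, n, embed_size, b, decoder_params, b_phi,
                       rank, lowrank_bits=16, matrices_per_decoder=1):
    """Average storage cost, in bits, per entry of an m x n matrix."""
    embedding_bits = embed_size * b
    # If one decoder serves several matrices, its cost is shared between them.
    decoder_bits = decoder_params * b_phi / matrices_per_decoder
    lowrank_total = rank * (m + n) * lowrank_bits
    return (embedding_bits + decoder_bits + lowrank_total) / (m * n)

def within_budget(configs, m, n, budget):
    """Keep only the candidate encoder shapes that fit the bit budget."""
    return [c for c in configs if bits_per_parameter(m, n, **c) <= budget]
```

For a 1024 x 1024 matrix, an embedding of 262,144 entries at 8 bits costs 2 bits per parameter, a 65,536-weight decoder at 4 bits adds 0.25, and rank-8 factors at 16 bits add another 0.25, for 2.5 bits in total. A selection procedure would then pick, among the configurations that fit the budget, the one with the lowest reconstruction error.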

The final model represents each matrix using a small embedding stored on b bits and a neural decoder network whose weights are stored on b_φ bits. Decompression is cheap: it requires only the embedding and a single forward pass through the decoder.
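
Decompression by a single decoder pass can be sketched as follows. The two-layer ReLU network here is a hypothetical decoder chosen for brevity; the summary does not describe the decoder architecture used in the paper, and the names `init_decoder`, `decompress`, and `effective_weight` are assumptions.

```python
import numpy as np

def init_decoder(embed_dim, hidden_dim, out_dim, seed=0):
    """Randomly initialised two-layer decoder (hypothetical architecture)."""
    rng = np.random.default_rng(seed)
    return {
        "W1": rng.standard_normal((embed_dim, hidden_dim)) / np.sqrt(embed_dim),
        "b1": np.zeros(hidden_dim),
        "W2": rng.standard_normal((hidden_dim, out_dim)) / np.sqrt(hidden_dim),
        "b2": np.zeros(out_dim),
    }

def decompress(embedding, decoder, shape):
    """Single forward pass: each embedding row is decoded into a block of weights."""
    hidden = np.maximum(embedding @ decoder["W1"] + decoder["b1"], 0.0)  # ReLU
    blocks = hidden @ decoder["W2"] + decoder["b2"]
    return blocks.reshape(shape)

def effective_weight(embedding, decoder, L1, L2):
    """Decoded matrix plus the high-precision low-rank correction."""
    decoded = decompress(embedding, decoder, (L1.shape[0], L2.shape[1]))
    return decoded + L1 @ L2
```

In this sketch, 64 embedding rows of width 8 are each decoded into 16 weights, giving the 1,024 entries of a 32 x 32 matrix. Only the embedding, the decoder weights, and the low-rank factors need to be stored.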

Critical Analysis

The paper provides a thorough evaluation of ReALLM's performance on language generation tasks, showing state-of-the-art results with a budget of only 2 bits per parameter. However, the authors acknowledge that the approach may not generalize as well to other types of language models or tasks, and further research is needed to explore its broader applicability.

Additionally, while the adaptive encoder design is a key innovation, the paper does not provide much insight into how the specific encoder shapes are chosen for each matrix. More details on this tuning process and its underlying principles would help readers better understand the method.

Finally, the paper does not discuss potential issues around model interpretability or bias that may arise from the compressed representations. As language models are increasingly used in high-stakes applications, these are important considerations that warrant further investigation.

Conclusion

ReALLM presents a novel approach for compressing and adapting pre-trained language models, achieving state-of-the-art performance with a budget of just 2 bits per parameter. By decomposing pre-trained matrices and selectively updating the low-rank components, the method can efficiently fine-tune models for new tasks while maintaining a compact representation.

This work demonstrates the potential for advanced compression techniques to enable the deployment of large language models on resource-constrained devices, opening up new applications in edge computing and mobile settings. As AI systems become more ubiquitous, tools like ReALLM will be crucial for balancing model capability, efficiency, and cost-effectiveness.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

ReALLM: A general framework for LLM compression and fine-tuning

Louis Leconte, Lisa Bedin, Van Minh Nguyen, Eric Moulines

We introduce ReALLM, a novel approach for compression and memory-efficient adaptation of pre-trained language models that encompasses most of the post-training quantization and fine-tuning methods for a budget of <4 bits. Pre-trained matrices are decomposed into a high-precision low-rank component and a vector-quantized latent representation (using an autoencoder). During the fine-tuning step, only the low-rank components are updated. Our results show that pre-trained matrices exhibit different patterns. ReALLM adapts the shape of the encoder (small/large embedding, high/low bit VQ, etc.) to each matrix. ReALLM proposes to represent each matrix with a small embedding on $b$ bits and a neural decoder model $\mathcal{D}_\phi$ with its weights on $b_\phi$ bits. The decompression of a matrix requires only one embedding and a single forward pass with the decoder. Our weight-only quantization algorithm yields the best results on language generation tasks (C4 and WikiText-2) for a budget of $3$ bits without any training. With a budget of $2$ bits, ReALLM achieves state-of-the-art performance after fine-tuning on a small calibration dataset.

5/24/2024

LQ-LoRA: Low-rank Plus Quantized Matrix Decomposition for Efficient Language Model Finetuning

Han Guo, Philip Greengard, Eric P. Xing, Yoon Kim

We propose a simple approach for memory-efficient adaptation of pretrained language models. Our approach uses an iterative algorithm to decompose each pretrained matrix into a high-precision low-rank component and a memory-efficient quantized component. During finetuning, the quantized component remains fixed and only the low-rank component is updated. We present an integer linear programming formulation of the quantization component which enables dynamic configuration of quantization parameters (e.g., bit-width, block size) for each matrix given an overall target memory budget. We further explore a data-aware version of the algorithm which uses an approximation of the Fisher information matrix to weight the reconstruction objective during matrix decomposition. Experiments on finetuning RoBERTa and LLaMA-2 (7B and 70B) demonstrate that our low-rank plus quantized matrix decomposition approach (LQ-LoRA) outperforms strong QLoRA and GPTQ-LoRA baselines and enables aggressive quantization to sub-3 bits with only minor performance degradations. When finetuned on a language modeling calibration dataset, LQ-LoRA can also be used for model compression; in this setting our 2.75-bit LLaMA-2-70B model (which has 2.85 bits on average when including the low-rank components and requires 27GB of GPU memory) performs respectably compared to the 16-bit baseline.

8/28/2024

Delta-CoMe: Training-Free Delta-Compression with Mixed-Precision for Large Language Models

Bowen Ping, Shuo Wang, Hanqing Wang, Xu Han, Yuzhuang Xu, Yukun Yan, Yun Chen, Baobao Chang, Zhiyuan Liu, Maosong Sun

Fine-tuning is a crucial process for adapting large language models (LLMs) to diverse applications. In certain scenarios, such as multi-tenant serving, deploying multiple LLMs becomes necessary to meet complex demands. Recent studies suggest decomposing a fine-tuned LLM into a base model and corresponding delta weights, which are then compressed using low-rank or low-bit approaches to reduce costs. In this work, we observe that existing low-rank and low-bit compression methods can significantly harm the model performance for task-specific fine-tuned LLMs (e.g., WizardMath for math problems). Motivated by the long-tail distribution of singular values in the delta weights, we propose a delta quantization approach using mixed-precision. This method employs higher-bit representation for singular vectors corresponding to larger singular values. We evaluate our approach on various fine-tuned LLMs, including math LLMs, code LLMs, chat LLMs, and even VLMs. Experimental results demonstrate that our approach performs comparably to full fine-tuned LLMs, surpassing both low-rank and low-bit baselines by a considerable margin. Additionally, we show that our method is compatible with various backbone LLMs, such as Llama-2, Llama-3, and Mistral, highlighting its generalizability.

6/14/2024

QLLM: Accurate and Efficient Low-Bitwidth Quantization for Large Language Models

Jing Liu, Ruihao Gong, Xiuying Wei, Zhiwei Dong, Jianfei Cai, Bohan Zhuang

Large Language Models (LLMs) excel in NLP, but their demands hinder their widespread deployment. While Quantization-Aware Training (QAT) offers a solution, its extensive training costs make Post-Training Quantization (PTQ) a more practical approach for LLMs. In existing studies, activation outliers in particular channels are identified as the bottleneck to PTQ accuracy. They propose to transform the magnitudes from activations to weights, which however offers limited alleviation or suffers from unstable gradients, resulting in a severe performance drop at low-bitwidth. In this paper, we propose QLLM, an accurate and efficient low-bitwidth PTQ method designed for LLMs. QLLM introduces an adaptive channel reassembly technique that reallocates the magnitude of outliers to other channels, thereby mitigating their impact on the quantization range. This is achieved by channel disassembly and channel assembly, which first breaks down the outlier channels into several sub-channels to ensure a more balanced distribution of activation magnitudes. Then similar channels are merged to maintain the original channel number for efficiency. Additionally, an adaptive strategy is designed to autonomously determine the optimal number of sub-channels for channel disassembly. To further compensate for the performance loss caused by quantization, we propose an efficient tuning method that only learns a small number of low-rank weights while freezing the pre-trained quantized model. After training, these low-rank parameters can be fused into the frozen weights without affecting inference. Extensive experiments on LLaMA-1 and LLaMA-2 show that QLLM can obtain accurate quantized models efficiently. For example, QLLM quantizes the 4-bit LLaMA-2-70B within 10 hours on a single A100-80G GPU, outperforming the previous state-of-the-art method by 7.89% on the average accuracy across five zero-shot tasks.

4/9/2024