Low-Rank Quantization-Aware Training for LLMs

2406.06385

Published 6/21/2024 by Yelysei Bondarenko, Riccardo Del Chiaro, Markus Nagel

Abstract

Large language models (LLMs) are omnipresent; however, their practical deployment is challenging due to their ever-increasing computational and memory demands. Quantization is one of the most effective ways to make them more compute and memory efficient. Quantization-aware training (QAT) methods generally produce the best quantized performance, but at the cost of potentially long training time and excessive memory usage, which makes them impractical to apply to LLMs. Inspired by the parameter-efficient fine-tuning (PEFT) and low-rank adaptation (LoRA) literature, we propose LR-QAT, a lightweight and memory-efficient QAT algorithm for LLMs. LR-QAT employs several components to save memory without sacrificing predictive performance: (a) low-rank auxiliary weights that are aware of the quantization grid; (b) a downcasting operator using fixed-point or double-packed integers; and (c) checkpointing. Unlike most related work, our method (i) is inference-efficient, leading to no additional overhead compared to traditional PTQ; (ii) can be seen as a general extended pretraining framework, meaning that the resulting model can still be utilized for any downstream task afterwards; (iii) can be applied across a wide range of quantization settings, such as different choices of quantization granularity and activation quantization, and can be seamlessly combined with many PTQ techniques. We apply LR-QAT to the LLaMA-2/3 and Mistral model families and validate its effectiveness on several downstream tasks. Our method outperforms common post-training quantization (PTQ) approaches and reaches the same model performance as full-model QAT at a fraction of its memory usage. Specifically, we can train a 7B LLM on a single consumer-grade GPU with 24GB of memory.

Overview

Presents a low-rank quantization-aware training (LR-QAT) method for efficient and accurate quantization of large language models (LLMs)
Focuses on reducing the training time and memory cost of quantization-aware training for LLMs without sacrificing predictive performance
Builds on previous work on quantization-aware training and low-rank adapters to enable efficient quantization of LLMs

Plain English Explanation

The paper introduces a new technique called "low-rank quantization-aware training" (LR-QAT) that can make large language models (LLMs) more efficient and accurate when quantized to lower bit-widths. Quantization is a way to compress the size of AI models by reducing the number of bits used to represent the model's parameters, which can make the models faster and more memory-efficient.
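To make the idea of "fewer bits per parameter" concrete, here is a minimal sketch of round-to-nearest quantization in plain Python. It assumes a symmetric, per-tensor integer grid, which is only one of several possible schemes; the function names `quantize` and `dequantize` are illustrative and do not come from the paper.

```python
def quantize(weights, bits):
    """Map floats to a signed integer grid with `bits` bits (symmetric, per-tensor)."""
    qmax = 2 ** (bits - 1) - 1       # largest grid value, e.g. 7 for 4 bits
    qmin = -(2 ** (bits - 1))        # smallest grid value, e.g. -8 for 4 bits
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    # Round each weight to the nearest grid point and clip to the valid range.
    ints = [min(qmax, max(qmin, round(w / scale))) for w in weights]
    return ints, scale


def dequantize(ints, scale):
    """Map grid integers back to approximate float values."""
    return [q * scale for q in ints]


weights = [0.70, -0.33, 0.12, -0.02]
ints, scale = quantize(weights, bits=4)   # ints == [7, -3, 1, 0]
approx = dequantize(ints, scale)
errors = [abs(w - a) for w, a in zip(weights, approx)]
print(ints, approx, errors)
```

Each weight is now stored as a 4-bit integer plus one shared scale, and the rounding error per weight is at most half a grid step. That rounding error is what causes the accuracy drop discussed next.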

However, quantizing LLMs can often lead to a significant drop in performance. The key innovation of LR-QAT is that it combines two existing ideas, quantization-aware training and low-rank adaptation, to quantize LLMs without losing too much accuracy. Quantization-aware training helps the model learn to cope with quantization during training, while training only low-rank auxiliary weights keeps the memory cost of that training low.

By using this combined approach, the researchers applied LR-QAT to the LLaMA-2/3 and Mistral model families and report that it outperforms common post-training quantization (PTQ) approaches while matching full-model QAT at a fraction of its memory usage. In particular, a 7B model can be trained on a single consumer-grade GPU with 24GB of memory. Because the resulting model adds no inference overhead compared to traditional PTQ, this could make it more practical to deploy powerful LLMs on resource-constrained devices like smartphones or edge computing systems.

Technical Explanation

The paper presents a low-rank quantization-aware training (LR-QAT) method for efficiently quantizing large language models (LLMs) to low bit-widths. The key components of LR-QAT are:

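Quantization-aware training: Quantization is simulated during training, so the model learns to compensate for quantization error. Related work in this area includes L4Q, which couples quantization with LoRA-style fine-tuning, and QLLM, a PTQ method that adds low-rank tuning.
Low-rank auxiliary weights: Instead of updating the full weight matrices, LR-QAT trains low-rank auxiliary weights that are aware of the quantization grid, which reduces training memory. According to the abstract, the resulting model incurs no additional inference overhead compared to traditional PTQ. LoQT explores a related combination of low-rank adapters and quantized training.
Memory-saving operators: A downcasting operator that uses fixed-point or double-packed integers, together with checkpointing, further reduces memory use during training. The abstract also notes that the method applies across quantization settings, including different quantization granularities and activation quantization, and can be combined with many PTQ techniques.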
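The interaction between quantization and a low-rank term can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: it assumes the low-rank product `a @ b` is added to the scaled frozen weights before rounding, which is one way to make the auxiliary weights "aware of the quantization grid". The helper names `matmul` and `lowrank_quant_grid` are hypothetical, only the forward computation is shown, and gradient handling (QAT commonly uses a straight-through estimator for the rounding step) is omitted.

```python
def matmul(a, b):
    """Multiply an (m x k) matrix by a (k x n) matrix given as nested lists."""
    return [[sum(a[i][t] * b[t][j] for t in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def lowrank_quant_grid(w0, a, b, scale, bits):
    """Return integer grid values round(w0 / scale + a @ b), clipped to `bits` bits.

    w0 is the frozen pretrained weight matrix; a and b are the small
    trainable factors. Assumption: the low-rank term lives in grid units.
    """
    qmin, qmax = -(2 ** (bits - 1)), 2 ** (bits - 1) - 1
    ab = matmul(a, b)
    return [[max(qmin, min(qmax, round(w / scale + ab[i][j])))
             for j, w in enumerate(row)]
            for i, row in enumerate(w0)]


w0 = [[0.30, -0.52],
      [0.11, 0.70]]
b = [[1.0, -1.0]]                 # rank-1 factor, shape (1 x 2)

# With one factor at zero, the result is plain round-to-nearest.
grid_init = lowrank_quant_grid(w0, [[0.0], [0.0]], b, scale=0.1, bits=4)

# After a (made-up) update to `a`, one weight moves to a neighbouring grid point.
grid_trained = lowrank_quant_grid(w0, [[0.4], [0.0]], b, scale=0.1, bits=4)

print(grid_init)     # [[3, -5], [1, 7]]
print(grid_trained)  # [[3, -6], [1, 7]]
```

In this sketch the output is always a matrix of plain integers on the quantization grid, so the low-rank factors can be discarded after training. That is consistent with the abstract's claim of no additional inference overhead compared to PTQ.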

The authors evaluate LR-QAT on the LLaMA-2/3 and Mistral model families across several downstream tasks. They report that it outperforms common PTQ approaches and reaches the same model performance as full-model QAT at a fraction of its memory usage, which is enough to train a 7B LLM on a single consumer-grade GPU with 24GB of memory.
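For a sense of scale, the storage needed for the weights alone can be estimated with simple arithmetic. The sketch below assumes exactly 7 billion parameters and decimal gigabytes. It ignores quantization scales, activations, gradients, and optimizer state, so it is a lower bound on real memory use rather than a measurement from the paper.

```python
def weight_storage_gb(num_params, bits_per_weight):
    """Storage for the weights alone, in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


NUM_PARAMS = 7_000_000_000  # assumption: a "7B" model has exactly 7e9 weights

sizes = {bits: weight_storage_gb(NUM_PARAMS, bits) for bits in (16, 8, 4)}
for bits, gb in sizes.items():
    print(f"{bits:>2}-bit weights: {gb:.1f} GB")
# 16-bit weights: 14.0 GB
#  8-bit weights: 7.0 GB
#  4-bit weights: 3.5 GB
```

Under these assumptions, 16-bit weights already take 14 GB of a 24 GB GPU before any training state is added, which illustrates why memory-efficient training matters at this model size.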

Critical Analysis

The paper presents a well-designed and extensive evaluation of the LR-QAT method, demonstrating its effectiveness on a range of LLMs and tasks. The authors also acknowledge several limitations and areas for future work:

The proposed method may not be as effective for models with more complex architectures (e.g., large transformer-based LLMs with many layers and attention heads).
The performance of LR-QAT may degrade for extremely low bit-widths (e.g., 2-bit) on more challenging tasks, and further research is needed to improve quantization at such low precisions.
The authors only consider static quantization, and exploring dynamic quantization schemes could potentially lead to even greater efficiency gains.
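For readers unfamiliar with the static versus dynamic distinction, the sketch below shows the general idea for activation quantization. Static quantization fixes the scale ahead of time from calibration data, while dynamic quantization recomputes it for each input at runtime. The helper names and the numbers are made up for illustration and are not taken from the paper.

```python
def scale_for(values, bits=8):
    """Symmetric scale so that the largest magnitude maps to the top grid value."""
    qmax = 2 ** (bits - 1) - 1
    max_abs = max(abs(v) for v in values)
    return max_abs / qmax if max_abs > 0 else 1.0


def quantize_with_scale(values, scale, bits=8):
    """Round to the integer grid defined by `scale`, clipping out-of-range values."""
    qmin, qmax = -(2 ** (bits - 1)), 2 ** (bits - 1) - 1
    return [max(qmin, min(qmax, round(v / scale))) for v in values]


# Static: the scale is computed once from calibration batches and then frozen.
calibration = [[0.5, -1.0, 0.25], [1.27, -0.3, 0.8]]
static_scale = scale_for([v for batch in calibration for v in batch])

# A new input whose range is larger than anything seen during calibration.
new_input = [2.54, -0.04, 0.6]

static_q = quantize_with_scale(new_input, static_scale)           # 2.54 is clipped
dynamic_q = quantize_with_scale(new_input, scale_for(new_input))  # scale adapts

print(static_q)   # [127, -4, 60]
print(dynamic_q)  # [127, -2, 30]
```

The static path clips the out-of-range value but needs no extra computation at inference time. The dynamic path covers the full range of each input at the cost of computing a scale on the fly and using a coarser grid for small values.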

Additionally, while the paper focuses on the technical aspects of LR-QAT, it would be valuable to explore the broader implications of such efficient quantization techniques for LLMs. For example, how might this enable the deployment of powerful language models on edge devices, and what are the potential societal impacts of making LLMs more accessible and energy-efficient?

Conclusion

The low-rank quantization-aware training (LR-QAT) method presented in this paper is a significant advancement in the field of efficient quantization of large language models. By combining quantization-aware training and low-rank adapters, the researchers have developed a technique that can accurately quantize LLMs to low bit-widths, resulting in substantial reductions in model size and inference time.

This work has important implications for the deployment of powerful language models on resource-constrained devices, as it makes it possible to run high-performance LLMs on smartphones, IoT devices, and other edge computing systems. As the field of AI continues to advance, techniques like LR-QAT will play a crucial role in ensuring that the benefits of these technologies are accessible to a wide range of users and applications.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

QLLM: Accurate and Efficient Low-Bitwidth Quantization for Large Language Models

Jing Liu, Ruihao Gong, Xiuying Wei, Zhiwei Dong, Jianfei Cai, Bohan Zhuang

Large Language Models (LLMs) excel in NLP, but their demands hinder their widespread deployment. While Quantization-Aware Training (QAT) offers a solution, its extensive training costs make Post-Training Quantization (PTQ) a more practical approach for LLMs. In existing studies, activation outliers in particular channels are identified as the bottleneck to PTQ accuracy. They propose to transform the magnitudes from activations to weights, which however offers limited alleviation or suffers from unstable gradients, resulting in a severe performance drop at low-bitwidth. In this paper, we propose QLLM, an accurate and efficient low-bitwidth PTQ method designed for LLMs. QLLM introduces an adaptive channel reassembly technique that reallocates the magnitude of outliers to other channels, thereby mitigating their impact on the quantization range. This is achieved by channel disassembly and channel assembly, which first breaks down the outlier channels into several sub-channels to ensure a more balanced distribution of activation magnitudes. Then similar channels are merged to maintain the original channel number for efficiency. Additionally, an adaptive strategy is designed to autonomously determine the optimal number of sub-channels for channel disassembly. To further compensate for the performance loss caused by quantization, we propose an efficient tuning method that only learns a small number of low-rank weights while freezing the pre-trained quantized model. After training, these low-rank parameters can be fused into the frozen weights without affecting inference. Extensive experiments on LLaMA-1 and LLaMA-2 show that QLLM can obtain accurate quantized models efficiently. For example, QLLM quantizes the 4-bit LLaMA-2-70B within 10 hours on a single A100-80G GPU, outperforming the previous state-of-the-art method by 7.89% on the average accuracy across five zero-shot tasks.

4/9/2024

cs.CL cs.AI cs.LG

L4Q: Parameter Efficient Quantization-Aware Fine-Tuning on Large Language Models

Hyesung Jeon, Yulhwa Kim, Jae-joon Kim

Due to the high memory and computational costs associated with Large Language Models, model compression via quantization and parameter-efficient fine-tuning (PEFT) methods, such as low-rank adaptation (LoRA), are gaining popularity. This has led to active research on quantization-aware PEFT techniques, which aim to create models with high accuracy and low memory overhead. Among quantization methods, post-training quantization (PTQ) is more commonly used in previous works than quantization-aware training (QAT), despite QAT's potential for higher accuracy. This preference is due to PTQ's low training overhead. However, PTQ-based PEFT methods often utilize high-precision parameters, making it difficult to fully exploit the efficiency of quantization. Additionally, they have limited adaptation ability due to a reduced and constrained LoRA parameter structure. To overcome these challenges, we propose L4Q, which leverages joint quantization and fine-tuning to reduce QAT's memory overhead and produce models that consist entirely of quantized weights while achieving effective adaptation to downstream tasks. By design, L4Q allows quantization parameters to reflect weight updates, while weight updates reduce quantization errors. Our experiments demonstrate that this coupled quantization and fine-tuning approach yields superior accuracy compared to decoupled fine-tuning schemes in sub-4-bit quantization. Using the LLaMA model families and instructional datasets, we showcase L4Q's capabilities in language tasks and few-shot in-context learning.

5/24/2024

cs.LG cs.CL

LoQT: Low Rank Adapters for Quantized Training

Sebastian Loeschcke, Mads Toftrup, Michael J. Kastoryano, Serge Belongie, Vésteinn Snæbjarnarson

Training of large neural networks requires significant computational resources. Despite advances using low-rank adapters and quantization, pretraining of models such as LLMs on consumer hardware has not been possible without model sharding, offloading during training, or per-layer gradient updates. To address these limitations, we propose LoQT, a method for efficiently training quantized models. LoQT uses gradient-based tensor factorization to initialize low-rank trainable weight matrices that are periodically merged into quantized full-rank weight matrices. Our approach is suitable for both pretraining and fine-tuning of models, which we demonstrate experimentally for language modeling and downstream task adaptation. We find that LoQT enables efficient training of models up to 7B parameters on a consumer-grade 24GB GPU. We also demonstrate the feasibility of training a 13B parameter model using per-layer gradient updates on the same hardware.

5/28/2024

cs.LG cs.CL

ApiQ: Finetuning of 2-Bit Quantized Large Language Model

Baohao Liao, Christian Herold, Shahram Khadivi, Christof Monz

Memory-efficient finetuning of large language models (LLMs) has recently attracted huge attention with the increasing size of LLMs, primarily due to the constraints posed by GPU memory limitations and the effectiveness of these methods compared to full finetuning. Despite the advancements, current strategies for memory-efficient finetuning, such as QLoRA, exhibit inconsistent performance across diverse bit-width quantizations and multifaceted tasks. This inconsistency largely stems from the detrimental impact of the quantization process on preserved knowledge, leading to catastrophic forgetting and undermining the utilization of pretrained models for finetuning purposes. In this work, we introduce a novel quantization framework, ApiQ, designed to restore the lost information from quantization by concurrently initializing the LoRA components and quantizing the weights of LLMs. This approach ensures the maintenance of the original LLM's activation precision while mitigating the error propagation from shallower into deeper layers. Through comprehensive evaluations conducted on a spectrum of language tasks with various LLMs, ApiQ demonstrably minimizes activation error during quantization. Consequently, it consistently achieves superior finetuning results across various bit-widths.

6/24/2024

cs.LG cs.CL