CLAQ: Pushing the Limits of Low-Bit Post-Training Quantization for LLMs

2405.17233

Published 6/4/2024 by Haoyu Wang, Bei Liu, Hang Shao, Bo Xiao, Ke Zeng, Guanglu Wan, Yanmin Qian

Abstract

Parameter quantization for Large Language Models (LLMs) has attracted increasing attention recently for reducing memory costs and improving computational efficiency. Early approaches have been widely adopted. However, the existing methods suffer from poor performance in low-bit (such as 2 to 3 bits) scenarios. In this paper, we present a novel and effective Column-Level Adaptive weight Quantization (CLAQ) framework by introducing three different types of adaptive strategies for LLM quantization. Firstly, a K-Means clustering based algorithm is proposed that allows dynamic generation of quantization centroids for each column of a parameter matrix. Secondly, we design an outlier-guided adaptive precision search strategy which can dynamically assign varying bit-widths to different columns. Finally, a dynamic outlier reservation scheme is developed to retain some parameters in their original floating-point precision, trading some compression for improved model performance. Experiments on various mainstream open-source LLMs including LLaMA-1, LLaMA-2 and Yi demonstrate that our methods achieve state-of-the-art results across different bit settings, especially in extremely low-bit scenarios. Code is available at https://github.com/fayuge/CLAQ.

Overview

This paper, titled "CLAQ: Pushing the Limits of Low-Bit Post-Training Quantization for LLMs," explores techniques for efficiently compressing large language models (LLMs) without significantly sacrificing their performance.
The researchers introduce a novel quantization method called CLAQ (Column-Level Adaptive weight Quantization) that aims to reduce memory costs and improve computational efficiency while maintaining high accuracy.
CLAQ builds on previous quantization approaches like BiLLM, SliM-LLM, and OneBit, pushing the limits of post-training quantization for LLMs.

Plain English Explanation

Large language models (LLMs) like GPT-3 and BERT have become incredibly powerful, but they are also very large and computationally intensive. This makes them challenging to deploy on resource-constrained devices like smartphones or edge devices.

The researchers in this paper have developed a new technique called CLAQ that can significantly reduce the size and computational requirements of LLMs without sacrificing too much of their performance. CLAQ works by quantizing, or compressing, the models' weights down to just a few bits each, rather than the typical 32-bit or 16-bit floating-point representation.

By using CLAQ, the researchers were able to shrink LLMs by up to 8x in size and speed up their inference by 3-4x, while only losing a small amount of accuracy. This could make it much easier to deploy these powerful models on a wider range of devices, from smartphones to edge computing systems.

The key innovation in CLAQ is that it adapts the quantization to each column of a weight matrix: it learns the quantization levels separately for every column, gives some columns more bits than others, and keeps a small number of outlier values in their original precision. This builds on previous quantization techniques like BiLLM, SliM-LLM, and OneBit, but takes the compression even further.

Technical Explanation

The researchers propose a new quantization method called CLAQ (Column-Level Adaptive weight Quantization) that compresses LLM weights down to low bit-widths, such as 2 to 3 bits per parameter, while maintaining high accuracy.
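The abstract names K-Means clustering with per-column centroids as the first of CLAQ's strategies. The following is a minimal sketch of that idea, not the authors' implementation: the function names (`kmeans_1d`, `quantize_columns`, `dequantize_columns`), the quantile-based initialisation, and the iteration count are all assumptions made for illustration.

```python
import numpy as np

def kmeans_1d(values, n_clusters, n_iters=20):
    """Plain Lloyd's k-means on a 1-D array; returns (centroids, labels)."""
    # Assumption: initialise at evenly spaced quantiles so runs are deterministic.
    centroids = np.quantile(values, np.linspace(0.0, 1.0, n_clusters))
    for _ in range(n_iters):
        # Assign each value to its nearest centroid.
        labels = np.argmin(np.abs(values[:, None] - centroids[None, :]), axis=1)
        # Move each centroid to the mean of its members (unchanged if empty).
        for k in range(n_clusters):
            members = values[labels == k]
            if members.size > 0:
                centroids[k] = members.mean()
    labels = np.argmin(np.abs(values[:, None] - centroids[None, :]), axis=1)
    return centroids, labels

def quantize_columns(weight, bits):
    """Quantize each column of `weight` with its own 2**bits-entry codebook."""
    n_clusters = 2 ** bits
    codebooks = np.empty((weight.shape[1], n_clusters), dtype=weight.dtype)
    codes = np.empty(weight.shape, dtype=np.uint8)
    for j in range(weight.shape[1]):
        codebooks[j], codes[:, j] = kmeans_1d(weight[:, j], n_clusters)
    return codebooks, codes

def dequantize_columns(codebooks, codes):
    """Replace every code with the matching centroid of its column."""
    cols = np.arange(codes.shape[1])[None, :]
    return codebooks[cols, codes]
```

With `bits=2`, every column is stored as 2-bit codes plus a four-entry codebook, so the reconstruction of each column contains at most four distinct values.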

According to the abstract, CLAQ combines three column-level adaptive strategies. First, a K-Means clustering algorithm dynamically generates quantization centroids for each column of a parameter matrix. Second, an outlier-guided adaptive precision search assigns different bit-widths to different columns. Third, a dynamic outlier reservation scheme keeps some parameters in their original floating-point precision, giving up a little compression in exchange for better model performance.
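The abstract also describes an outlier-guided search that assigns varying bit-widths to different columns. One way this could look is sketched below. The outlier criterion (a multiple of the matrix-wide standard deviation), the two-level bit choice, and the names `outlier_ratio_per_column` and `assign_bit_widths` are hypothetical and are not taken from the paper.

```python
import numpy as np

def outlier_ratio_per_column(weight, n_sigma=3.0):
    """Fraction of entries per column that count as outliers."""
    # Hypothetical criterion: further than n_sigma matrix-wide standard
    # deviations from the matrix mean.
    threshold = n_sigma * weight.std()
    return (np.abs(weight - weight.mean()) > threshold).mean(axis=0)

def assign_bit_widths(weight, low_bits=2, high_bits=4, target_avg_bits=2.5):
    """Give `high_bits` to the most outlier-heavy columns within a bit budget."""
    n_cols = weight.shape[1]
    # Share of columns that may use high_bits while meeting the average budget.
    frac_high = (target_avg_bits - low_bits) / (high_bits - low_bits)
    n_high = int(np.floor(frac_high * n_cols))
    ratios = outlier_ratio_per_column(weight)
    # Columns with the largest outlier ratio are promoted first.
    order = np.argsort(-ratios, kind="stable")
    bits = np.full(n_cols, low_bits, dtype=np.int64)
    bits[order[:n_high]] = high_bits
    return bits
```

In this sketch the average bit-width never exceeds `target_avg_bits`, so the memory budget is fixed up front and the outlier statistics only decide which columns receive the extra precision.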

The researchers evaluate CLAQ on a range of open-source LLMs, including LLaMA-1, LLaMA-2, and Yi, and find that it can achieve up to 8x model compression and 3-4x inference speedup with less than 1% accuracy loss compared to the original full-precision models. This outperforms previous state-of-the-art quantization techniques like BiLLM, SliM-LLM, and OneBit.
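A compression figure of "up to 8x" is what simple arithmetic gives when 16-bit weights are replaced by 2-bit codes. The sketch below works this out and includes the cost of storing one codebook per column. The 16-bit baseline, the 16-bit centroids, and the column length used in the example are assumptions for illustration, not the paper's own accounting.

```python
def effective_bits_per_weight(bits, n_rows, centroid_bits=16):
    """Average storage per weight for one column of length n_rows."""
    # Assumption: each column stores 2**bits centroids at centroid_bits each.
    codebook_bits = (2 ** bits) * centroid_bits
    return bits + codebook_bits / n_rows

def compression_ratio(bits, n_rows, baseline_bits=16, centroid_bits=16):
    """Size of the baseline representation divided by the quantized size."""
    return baseline_bits / effective_bits_per_weight(bits, n_rows, centroid_bits)

# Example: 2-bit codes for columns of 4096 weights.
ratio = compression_ratio(bits=2, n_rows=4096)
```

For long columns the codebook overhead is small, so the ratio stays just under the ideal 16 / 2 = 8.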

The researchers also analyze the impact of CLAQ on different model layers and attention heads, showing that it can effectively quantize both high- and low-importance components without significantly degrading performance.

Critical Analysis

The CLAQ technique presented in this paper is a significant advancement in the field of LLM compression and efficient inference. By pushing the limits of low-bit post-training quantization, the researchers have demonstrated the potential to deploy powerful LLMs on a wider range of hardware, from smartphones to edge devices.

However, the paper does not address some potential limitations and areas for further research. For instance, the evaluation is primarily focused on standard language modeling benchmarks, and it's unclear how CLAQ would perform on more specialized or domain-specific tasks. Additionally, the paper does not explore the impact of CLAQ on the models' robustness or ability to generalize to out-of-distribution samples.

Furthermore, the paper does not discuss the computational overhead or memory requirements of the CLAQ calibration process, which could be a concern for real-world deployment. It would be valuable to understand the trade-offs between the compression and quantization gains and the additional processing needed for CLAQ.
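One trade-off that can be reasoned about directly is the outlier reservation named in the abstract, where some parameters stay in floating-point precision. The sketch below is a simplified stand-in that keeps a fixed fraction of the largest-magnitude weights. The paper describes its scheme as dynamic, so the fixed `keep_fraction` and the function name `reserve_outliers` are assumptions.

```python
import numpy as np

def reserve_outliers(weight, keep_fraction=0.01):
    """Split `weight` into a sparse full-precision part and a dense remainder."""
    n_keep = int(round(keep_fraction * weight.size))
    mask = np.zeros(weight.shape, dtype=bool)
    if n_keep > 0:
        # Flat indices of the n_keep largest-magnitude entries.
        flat_idx = np.argpartition(np.abs(weight).ravel(), -n_keep)[-n_keep:]
        mask.flat[flat_idx] = True
    outliers = np.where(mask, weight, 0.0)   # kept in original precision
    remainder = np.where(mask, 0.0, weight)  # passed on to the quantizer
    return outliers, remainder, mask
```

Each reserved entry has to be stored with its full-precision value and its position, so raising `keep_fraction` raises the average number of bits per weight.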

Despite these potential limitations, the CLAQ technique represents an important step forward in making LLMs more accessible and deployable on a wider range of hardware. Future research could explore ways to further optimize the quantization process, investigate the impact on model robustness and generalization, and address the computational overhead of the calibration step.

Conclusion

The CLAQ technique introduced in this paper demonstrates the potential to significantly compress and accelerate large language models without sacrificing too much of their performance. By pushing the limits of low-bit post-training quantization, the researchers have developed a method that can reduce model size by up to 8x and inference time by 3-4x, while maintaining high accuracy.

This could have important implications for the broader deployment of LLMs, making it easier to use these powerful models on resource-constrained devices like smartphones and edge computing systems. While the paper does not address all potential limitations, the CLAQ technique represents an important advancement in the field of efficient LLM inference and could pave the way for more widespread adoption of these transformative technologies.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

QLLM: Accurate and Efficient Low-Bitwidth Quantization for Large Language Models

Jing Liu, Ruihao Gong, Xiuying Wei, Zhiwei Dong, Jianfei Cai, Bohan Zhuang

Large Language Models (LLMs) excel in NLP, but their demands hinder their widespread deployment. While Quantization-Aware Training (QAT) offers a solution, its extensive training costs make Post-Training Quantization (PTQ) a more practical approach for LLMs. In existing studies, activation outliers in particular channels are identified as the bottleneck to PTQ accuracy. They propose to transform the magnitudes from activations to weights, which however offers limited alleviation or suffers from unstable gradients, resulting in a severe performance drop at low-bitwidth. In this paper, we propose QLLM, an accurate and efficient low-bitwidth PTQ method designed for LLMs. QLLM introduces an adaptive channel reassembly technique that reallocates the magnitude of outliers to other channels, thereby mitigating their impact on the quantization range. This is achieved by channel disassembly and channel assembly, which first breaks down the outlier channels into several sub-channels to ensure a more balanced distribution of activation magnitudes. Then similar channels are merged to maintain the original channel number for efficiency. Additionally, an adaptive strategy is designed to autonomously determine the optimal number of sub-channels for channel disassembly. To further compensate for the performance loss caused by quantization, we propose an efficient tuning method that only learns a small number of low-rank weights while freezing the pre-trained quantized model. After training, these low-rank parameters can be fused into the frozen weights without affecting inference. Extensive experiments on LLaMA-1 and LLaMA-2 show that QLLM can obtain accurate quantized models efficiently. For example, QLLM quantizes the 4-bit LLaMA-2-70B within 10 hours on a single A100-80G GPU, outperforming the previous state-of-the-art method by 7.89% on the average accuracy across five zero-shot tasks.

4/9/2024

cs.CL cs.AI cs.LG

BiLLM: Pushing the Limit of Post-Training Quantization for LLMs

Wei Huang, Yangdong Liu, Haotong Qin, Ying Li, Shiming Zhang, Xianglong Liu, Michele Magno, Xiaojuan Qi

Pretrained large language models (LLMs) exhibit exceptional general language processing capabilities but come with significant demands on memory and computational resources. As a powerful compression technology, binarization can reduce model weights to a mere 1 bit, lowering the expensive computation and memory requirements. However, existing quantization techniques fall short of maintaining LLM performance under ultra-low bit-widths. In response to this challenge, we present BiLLM, a groundbreaking 1-bit post-training quantization scheme tailored for pretrained LLMs. Based on the weight distribution of LLMs, BiLLM first identifies and structurally selects salient weights, and minimizes the compression loss through an effective binary residual approximation strategy. Moreover, considering the bell-shaped distribution of the non-salient weights, we propose an optimal splitting search to group and binarize them accurately. BiLLM achieves, for the first time, high-accuracy inference (e.g. 8.41 perplexity on LLaMA2-70B) with only 1.08-bit weights across various LLM families and evaluation metrics, outperforming SOTA LLM quantization methods by significant margins. Moreover, BiLLM enables the binarization process of an LLM with 7 billion weights within 0.5 hours on a single GPU, demonstrating satisfactory time efficiency. Our code is available at https://github.com/Aaronhuang-778/BiLLM.

5/16/2024

cs.LG cs.AI cs.CL

Extreme Compression of Large Language Models via Additive Quantization

Vage Egiazarian, Andrei Panferov, Denis Kuznedelev, Elias Frantar, Artem Babenko, Dan Alistarh

The emergence of accurate open large language models (LLMs) has led to a race towards performant quantization techniques which can enable their execution on end-user devices. In this paper, we revisit the problem of "extreme" LLM compression, defined as targeting extremely low bit counts, such as 2 to 3 bits per parameter, from the point of view of classic methods in Multi-Codebook Quantization (MCQ). Our algorithm, called AQLM, generalizes the classic Additive Quantization (AQ) approach for information retrieval to advance the state-of-the-art in LLM compression, via two innovations: 1) learned additive quantization of weight matrices in an input-adaptive fashion, and 2) joint optimization of codebook parameters across each transformer block. Broadly, AQLM is the first scheme that is Pareto optimal in terms of accuracy-vs-model-size when compressing to less than 3 bits per parameter, and significantly improves upon all known schemes in the extreme compression (2bit) regime. In addition, AQLM is practical: we provide fast GPU and CPU implementations of AQLM for token generation, which enable us to match or outperform optimized FP16 implementations for speed, while executing in a much smaller memory footprint.

6/11/2024

cs.LG cs.CL

ApiQ: Finetuning of 2-Bit Quantized Large Language Model

Baohao Liao, Christian Herold, Shahram Khadivi, Christof Monz

Memory-efficient finetuning of large language models (LLMs) has recently attracted huge attention with the increasing size of LLMs, primarily due to the constraints posed by GPU memory limitations and the effectiveness of these methods compared to full finetuning. Despite the advancements, current strategies for memory-efficient finetuning, such as QLoRA, exhibit inconsistent performance across diverse bit-width quantizations and multifaceted tasks. This inconsistency largely stems from the detrimental impact of the quantization process on preserved knowledge, leading to catastrophic forgetting and undermining the utilization of pretrained models for finetuning purposes. In this work, we introduce a novel quantization framework, ApiQ, designed to restore the lost information from quantization by concurrently initializing the LoRA components and quantizing the weights of LLMs. This approach ensures the maintenance of the original LLM's activation precision while mitigating the error propagation from shallower into deeper layers. Through comprehensive evaluations conducted on a spectrum of language tasks with various LLMs, ApiQ demonstrably minimizes activation error during quantization. Consequently, it consistently achieves superior finetuning results across various bit-widths.

6/24/2024

cs.LG cs.CL