SliM-LLM: Salience-Driven Mixed-Precision Quantization for Large Language Models

2405.14917

Published 5/27/2024 by Wei Huang, Haotong Qin, Yangdong Liu, Yawei Li, Xianglong Liu, Luca Benini, Michele Magno, Xiaojuan Qi

cs.LG cs.CL

Abstract

Large language models (LLMs) achieve remarkable performance in natural language understanding but require substantial computation and memory resources. Post-training quantization (PTQ) is a powerful compression technique that has been extensively investigated for LLMs. However, existing PTQ methods are still not ideal in terms of accuracy and efficiency, especially at bit-widths below 4. Standard PTQ methods that use group-wise quantization struggle to quantize LLMs accurately at such low bit-widths, while advanced methods that keep high-precision weights element-wise struggle to realize their theoretical hardware efficiency. This paper presents a Salience-Driven Mixed-Precision Quantization scheme for LLMs, namely SliM-LLM. The scheme exploits the salience distribution of weights to determine optimal bit-widths and quantizers for accurate LLM quantization, while aligning the bit-width partition to groups for compact memory usage and fast integer inference. Specifically, SliM-LLM relies mainly on two novel techniques: (1) Salience-Determined Bit Allocation uses the clustering characteristics of the salience distribution to allocate a bit-width to each group, increasing the accuracy of quantized LLMs while maintaining inference efficiency; (2) Salience-Weighted Quantizer Calibration optimizes the parameters of the quantizer by considering the element-wise salience within the group, balancing the preservation of salient information against the minimization of errors. Comprehensive experiments show that SliM-LLM significantly improves the accuracy of LLMs at ultra-low bits, e.g., 2-bit LLaMA-7B achieves a 5.5-times memory saving over the original model on NVIDIA A800 GPUs and a 48% decrease in perplexity compared to the state-of-the-art gradient-free PTQ method. Moreover, SliM-LLM+, an extension of SliM-LLM that integrates gradient-based quantizers, further reduces perplexity by 35.1%.

Overview

SliM-LLM is a post-training quantization (PTQ) technique for large language models (LLMs) that uses a salience-driven mixed-precision approach.
The key idea is to assign bit-widths per group of weights: groups holding less important weights get fewer bits, and groups holding more salient weights get more.
Because bit-widths are aligned to groups rather than to individual elements, this reduces model size and supports fast integer inference while preserving more accuracy than uniform quantization at the same ultra-low bit-width.
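To make the size reduction concrete, here is a rough back-of-the-envelope sketch of weight storage at different bit-widths. The function name, the 7-billion parameter count, the group size of 128, and the 32 bits of per-group metadata are all assumptions chosen for illustration, not figures taken from the paper.

```python
def weight_memory_gib(n_params, bits_per_weight, group_size=None,
                      overhead_bits_per_group=0):
    """Estimate weight storage in GiB.

    overhead_bits_per_group models per-group metadata such as a scale and
    zero point. Its value here is an assumption, not a number from the paper.
    """
    total_bits = n_params * bits_per_weight
    if group_size:
        n_groups = n_params / group_size
        total_bits += n_groups * overhead_bits_per_group
    return total_bits / 8 / 2**30


n_params = 7_000_000_000  # rough parameter count of a 7B model (assumption)
fp16 = weight_memory_gib(n_params, 16)
ideal_2bit = weight_memory_gib(n_params, 2)
grouped_2bit = weight_memory_gib(n_params, 2, group_size=128,
                                 overhead_bits_per_group=32)

print(f"FP16:           {fp16:.2f} GiB")
print(f"2-bit, ideal:   {ideal_2bit:.2f} GiB ({fp16 / ideal_2bit:.1f}x smaller)")
print(f"2-bit, grouped: {grouped_2bit:.2f} GiB ({fp16 / grouped_2bit:.1f}x smaller)")
```

Under these assumptions the ideal ratio is 8x, and per-group metadata pulls it down to about 7.1x. The 5.5-times saving reported in the abstract is a measured figure on real hardware, so it is expected to sit below such idealized estimates; the exact breakdown is not given in this summary.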

Plain English Explanation

SliM-LLM is a method for making large language models (LLMs) smaller and faster, without losing too much of their performance. LLMs are powerful AI models that can understand and generate human-like text, but they can also be very large and computationally intensive.

The core insight behind SliM-LLM is that not all the numbers (parameters) inside an LLM are equally important. Some parameters are more "salient", or crucial for the model's performance, than others. SliM-LLM leverages this by storing groups of less important parameters with fewer bits, while spending more bits on the groups that matter most.
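As a small illustration of what "fewer bits" costs, the sketch below quantizes one group of weights with a plain min-max uniform quantizer at several bit-widths and reports the rounding error. This is a generic textbook quantizer, not the paper's calibrated one, and the sample weights are made up.

```python
def quantize_group(weights, bits):
    """Asymmetric min-max uniform quantization of one group of floats.

    Returns the dequantized values so the rounding error can be inspected.
    """
    levels = 2**bits - 1
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / levels if hi > lo else 1.0
    codes = [round((w - lo) / scale) for w in weights]  # integers in [0, levels]
    return [lo + c * scale for c in codes]


def mean_squared_error(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


group = [-0.9, -0.31, -0.05, 0.02, 0.11, 0.27, 0.48, 1.2]  # made-up weights
for bits in (1, 2, 3, 4):
    err = mean_squared_error(group, quantize_group(group, bits))
    print(f"{bits}-bit: MSE = {err:.5f}")
```

Each extra bit doubles the number of grid points, so the worst-case rounding error per weight roughly halves. That is why it pays to spend the extra bit where the weights matter most.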

This selective use of low-precision quantization allows SliM-LLM to significantly reduce the overall model size and speed up inference, without sacrificing too much accuracy. It's a bit like packing your suitcase for a trip - you might be able to get away with using smaller, lighter items for some things, while keeping your most essential items in the high-quality, full-size versions.

The paper also describes SliM-LLM+, an extension that integrates gradient-based quantizers and lowers perplexity further. SliM-LLM sits alongside other post-training quantization methods such as QLLM, BiLLM, and APTQ.

Technical Explanation

The key technical aspects of SliM-LLM include:

Salience Estimation: Each weight is assigned a "salience" score reflecting its contribution to the model's overall performance. The method exploits the clustering characteristics of the salience distribution, so salience can be treated at the level of weight groups.
Mixed-Precision Quantization: Using the salience scores, SliM-LLM assigns a bit-width to each group of weights rather than to individual elements (Salience-Determined Bit Allocation). More salient groups receive more bits and less salient groups receive fewer, and aligning bit-widths to groups keeps memory usage compact and allows fast integer inference.
Quantizer Calibration: Within each group, the quantizer parameters are optimized with element-wise salience taken into account (Salience-Weighted Quantizer Calibration), balancing the preservation of salient information against the minimization of overall error.
Experimental Evaluation: According to the abstract, 2-bit LLaMA-7B achieves a 5.5-times memory saving over the original model on NVIDIA A800 GPUs and a 48% decrease in perplexity compared to the state-of-the-art gradient-free PTQ method. SliM-LLM+, which adds gradient-based quantizers, reduces perplexity by a further 35.1%.
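The two techniques named in the abstract can be sketched in simplified form. The code below is an illustration under stated assumptions, not the authors' implementation: salience scores are taken as given inputs, `allocate_bits` promotes a fixed fraction `frac` of groups (a hypothetical knob) while demoting an equal number so the average bit-width is unchanged, and `calibrate_range` uses a simple grid search over a shrunken min-max range as a stand-in for the paper's calibration procedure.

```python
def allocate_bits(group_salience, target_bits, frac=0.25):
    """Give the most salient groups target_bits + 1 and an equal number of the
    least salient groups target_bits - 1, so the average stays at target_bits.

    `frac` (share of groups promoted) is a hypothetical knob for this sketch.
    """
    n = len(group_salience)
    k = min(int(n * frac), n // 2)
    order = sorted(range(n), key=lambda i: group_salience[i])  # ascending
    bits = [target_bits] * n
    for i in order[:k]:
        bits[i] = target_bits - 1
    for i in order[n - k:]:
        bits[i] = target_bits + 1
    return bits


def quantize(weights, bits, lo, hi):
    """Uniform quantization onto a grid spanning [lo, hi], with clamping."""
    levels = 2**bits - 1
    scale = (hi - lo) / levels if hi > lo else 1.0
    out = []
    for w in weights:
        code = min(max(round((w - lo) / scale), 0), levels)
        out.append(lo + code * scale)
    return out


def calibrate_range(weights, salience, bits, steps=20):
    """Grid-search a shrink factor for the min-max range, scoring each
    candidate by salience-weighted squared error. Returns (loss, lo, hi)."""
    center = (min(weights) + max(weights)) / 2
    half = (max(weights) - min(weights)) / 2
    best = None
    for s in range(steps):
        ratio = 1.0 - 0.5 * s / steps  # 1.0 is the plain min-max range
        lo, hi = center - half * ratio, center + half * ratio
        deq = quantize(weights, bits, lo, hi)
        loss = sum(d * (w - q) ** 2 for d, w, q in zip(salience, weights, deq))
        if best is None or loss < best[0]:
            best = (loss, lo, hi)
    return best


# Made-up per-group salience scores for eight groups.
group_salience = [0.1, 0.2, 5.0, 0.1, 0.3, 4.0, 0.2, 0.1]
group_bits = allocate_bits(group_salience, target_bits=2)
print(group_bits, sum(group_bits) / len(group_bits))

# Made-up weights and element-wise salience for a single group.
group = [-0.9, -0.31, -0.05, 0.02, 0.11, 0.27, 0.48, 1.2]
elem_salience = [0.1, 3.0, 0.1, 0.1, 0.1, 3.0, 0.1, 0.1]
loss, lo, hi = calibrate_range(group, elem_salience, bits=2)
print(f"weighted loss = {loss:.5f}, range = [{lo:.3f}, {hi:.3f}]")
```

Promoting and demoting the same number of groups is what keeps the storage budget equal to that of uniform quantization at `target_bits`. Weighting the error by salience lets the search accept clipping of unimportant outliers when that places grid points closer to the weights that matter.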

Critical Analysis

The researchers acknowledge that SliM-LLM, like other post-training quantization techniques, may not be suitable for all types of LLMs or applications. The method relies on the assumption that model parameters can be meaningfully ranked by their salience, which may not hold true in all cases.

Additionally, the salience estimation process itself introduces some computational overhead, which could offset the benefits of reduced model size and faster inference in certain scenarios. The researchers suggest exploring ways to streamline this process in future work.

Furthermore, the paper does not provide a thorough analysis of the edge cases or failure modes of SliM-LLM, which would be valuable for users to understand the limitations and appropriate use cases of the technique.

Conclusion

Overall, SliM-LLM represents a promising approach for efficiently quantizing large language models while preserving their accuracy. By allocating bit-widths according to parameter salience, the technique can achieve significant reductions in model size and inference time with limited loss in accuracy.

This work, along with other advancements in post-training quantization of LLMs, could pave the way for more accessible and deployable large language models, with important implications for a wide range of natural language processing applications.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

QLLM: Accurate and Efficient Low-Bitwidth Quantization for Large Language Models

Jing Liu, Ruihao Gong, Xiuying Wei, Zhiwei Dong, Jianfei Cai, Bohan Zhuang

Large Language Models (LLMs) excel in NLP, but their demands hinder their widespread deployment. While Quantization-Aware Training (QAT) offers a solution, its extensive training costs make Post-Training Quantization (PTQ) a more practical approach for LLMs. In existing studies, activation outliers in particular channels are identified as the bottleneck to PTQ accuracy. They propose to transform the magnitudes from activations to weights, which however offers limited alleviation or suffers from unstable gradients, resulting in a severe performance drop at low-bitwidth. In this paper, we propose QLLM, an accurate and efficient low-bitwidth PTQ method designed for LLMs. QLLM introduces an adaptive channel reassembly technique that reallocates the magnitude of outliers to other channels, thereby mitigating their impact on the quantization range. This is achieved by channel disassembly and channel assembly, which first breaks down the outlier channels into several sub-channels to ensure a more balanced distribution of activation magnitudes. Then similar channels are merged to maintain the original channel number for efficiency. Additionally, an adaptive strategy is designed to autonomously determine the optimal number of sub-channels for channel disassembly. To further compensate for the performance loss caused by quantization, we propose an efficient tuning method that only learns a small number of low-rank weights while freezing the pre-trained quantized model. After training, these low-rank parameters can be fused into the frozen weights without affecting inference. Extensive experiments on LLaMA-1 and LLaMA-2 show that QLLM can obtain accurate quantized models efficiently. For example, QLLM quantizes the 4-bit LLaMA-2-70B within 10 hours on a single A100-80G GPU, outperforming the previous state-of-the-art method by 7.89% on the average accuracy across five zero-shot tasks.

4/9/2024

cs.CL cs.AI cs.LG

BiLLM: Pushing the Limit of Post-Training Quantization for LLMs

Wei Huang, Yangdong Liu, Haotong Qin, Ying Li, Shiming Zhang, Xianglong Liu, Michele Magno, Xiaojuan Qi

Pretrained large language models (LLMs) exhibit exceptional general language processing capabilities but come with significant demands on memory and computational resources. As a powerful compression technology, binarization can reduce model weights to a mere 1 bit, lowering the expensive computation and memory requirements. However, existing quantization techniques fall short of maintaining LLM performance under ultra-low bit-widths. In response to this challenge, we present BiLLM, a groundbreaking 1-bit post-training quantization scheme tailored for pretrained LLMs. Based on the weight distribution of LLMs, BiLLM first identifies and structurally selects salient weights, and minimizes the compression loss through an effective binary residual approximation strategy. Moreover, considering the bell-shaped distribution of the non-salient weights, we propose an optimal splitting search to group and binarize them accurately. BiLLM achieves, for the first time, high-accuracy inference (e.g. 8.41 perplexity on LLaMA2-70B) with only 1.08-bit weights across various LLM families and evaluation metrics, outperforming SOTA LLM quantization methods by significant margins. Moreover, BiLLM enables the binarization of an LLM with 7 billion weights within 0.5 hours on a single GPU, demonstrating satisfactory time efficiency. Our code is available at https://github.com/Aaronhuang-778/BiLLM.

5/16/2024

cs.LG cs.AI cs.CL

Evaluating Quantized Large Language Models

Shiyao Li, Xuefei Ning, Luning Wang, Tengxuan Liu, Xiangsheng Shi, Shengen Yan, Guohao Dai, Huazhong Yang, Yu Wang

Post-training quantization (PTQ) has emerged as a promising technique to reduce the cost of large language models (LLMs). Specifically, PTQ can effectively mitigate memory consumption and reduce computational overhead in LLMs. To meet the requirements of both high efficiency and performance across diverse scenarios, a comprehensive evaluation of quantized LLMs is essential to guide the selection of quantization methods. This paper presents a thorough evaluation of these factors by evaluating the effect of PTQ on Weight, Activation, and KV Cache on 11 model families, including OPT, LLaMA2, Falcon, Bloomz, Mistral, ChatGLM, Vicuna, LongChat, StableLM, Gemma, and Mamba, with parameters ranging from 125M to 180B. The evaluation encompasses five types of tasks: basic NLP, emergent ability, trustworthiness, dialogue, and long-context tasks. Moreover, we also evaluate the state-of-the-art (SOTA) quantization methods to demonstrate their applicability. Based on the extensive experiments, we systematically summarize the effect of quantization, provide recommendations to apply quantization techniques, and point out future directions. The code can be found in https://github.com/thu-nics/qllm-eval.

6/7/2024

cs.CL cs.AI

Combining multiple post-training techniques to achieve most efficient quantized LLMs

Sayeh Sharify, Zifei Xu, Wanzin Yazar, Xin Wang

Large Language Models (LLMs) have distinguished themselves with outstanding performance in complex language modeling tasks, yet they come with significant computational and storage challenges. This paper explores the potential of quantization to mitigate these challenges. We systematically study the combined application of two well-known post-training techniques, SmoothQuant and GPTQ, and provide a comprehensive analysis of their interactions and implications for advancing LLM quantization. We enhance the versatility of both techniques by enabling quantization to microscaling (MX) formats, expanding their applicability beyond their initial fixed-point format targets. We show that by applying GPTQ and SmoothQuant, and employing MX formats for quantizing models, we can achieve a significant reduction in the size of OPT models by up to 4x and LLaMA models by up to 3x with a negligible perplexity increase of 1-3%.

5/14/2024

cs.LG cs.AI