QLLM: Accurate and Efficient Low-Bitwidth Quantization for Large Language Models

arXiv:2310.08041

Published 4/9/2024 by Jing Liu, Ruihao Gong, Xiuying Wei, Zhiwei Dong, Jianfei Cai, Bohan Zhuang

Abstract

Large Language Models (LLMs) excel in NLP, but their demands hinder their widespread deployment. While Quantization-Aware Training (QAT) offers a solution, its extensive training costs make Post-Training Quantization (PTQ) a more practical approach for LLMs. In existing studies, activation outliers in particular channels are identified as the bottleneck to PTQ accuracy. They propose to transform the magnitudes from activations to weights, which however offers limited alleviation or suffers from unstable gradients, resulting in a severe performance drop at low-bitwidth. In this paper, we propose QLLM, an accurate and efficient low-bitwidth PTQ method designed for LLMs. QLLM introduces an adaptive channel reassembly technique that reallocates the magnitude of outliers to other channels, thereby mitigating their impact on the quantization range. This is achieved by channel disassembly and channel assembly, which first breaks down the outlier channels into several sub-channels to ensure a more balanced distribution of activation magnitudes. Then similar channels are merged to maintain the original channel number for efficiency. Additionally, an adaptive strategy is designed to autonomously determine the optimal number of sub-channels for channel disassembly. To further compensate for the performance loss caused by quantization, we propose an efficient tuning method that only learns a small number of low-rank weights while freezing the pre-trained quantized model. After training, these low-rank parameters can be fused into the frozen weights without affecting inference. Extensive experiments on LLaMA-1 and LLaMA-2 show that QLLM can obtain accurate quantized models efficiently. For example, QLLM quantizes the 4-bit LLaMA-2-70B within 10 hours on a single A100-80G GPU, outperforming the previous state-of-the-art method by 7.89% on the average accuracy across five zero-shot tasks.

Overview

Large language models (LLMs) excel at natural language processing (NLP) tasks, but their high computational requirements make widespread deployment challenging.
Quantization-Aware Training (QAT) offers a potential solution, but its extensive training costs make Post-Training Quantization (PTQ) a more practical approach for LLMs.
In existing studies, activation outliers in specific channels have been identified as a bottleneck to PTQ accuracy.
The paper proposes QLLM, an accurate and efficient low-bitwidth PTQ method designed for LLMs.

Plain English Explanation

Large language models (LLMs) are powerful tools that excel at understanding and generating human-like text. However, they require a lot of computing power, which can make it challenging to deploy them widely. One way to address this is through quantization, which involves reducing the precision of the model's internal calculations to use less memory and computation.
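To make "reducing the precision" concrete, here is a minimal sketch of symmetric uniform quantization in plain Python. It is an illustration only, not the scheme QLLM uses: the function names and the sample weights are made up for this example. It shows that fewer bits means a coarser grid and a larger round-trip error.

```python
def quantize(values, bits):
    """Symmetric uniform quantization: map floats to signed integer codes."""
    qmax = 2 ** (bits - 1) - 1                      # 127 for 8 bits, 7 for 4 bits
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0  # width of one integer step
    codes = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return codes, scale


def dequantize(codes, scale):
    """Map integer codes back to approximate float values."""
    return [c * scale for c in codes]


# Hypothetical weights, chosen only for illustration.
weights = [0.82, -0.31, 0.05, -0.77, 0.44, 0.10]

for bits in (8, 4):
    codes, scale = quantize(weights, bits)
    restored = dequantize(codes, scale)
    worst = max(abs(a - b) for a, b in zip(weights, restored))
    print(f"{bits}-bit: codes={codes}, worst round-trip error={worst:.4f}")
```

Each value is stored as a small integer plus one shared scale, which is where the memory saving comes from; the price is the rounding error printed above.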

Quantization-Aware Training (QAT) simulates quantization during training so that the model learns weights that tolerate reduced precision. While this can work well, it requires a lot of additional training time, which is costly for models of this size. An alternative approach is Post-Training Quantization (PTQ), which quantizes the model after it has been trained.

Previous research has found that activation outliers (unusually large values concentrated in a few channels of the model) are a problem for PTQ: they stretch the quantization range and cause a drop in the model's accuracy, especially at low bitwidths. The paper introduces a new method called QLLM to address this issue.
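The effect of a single outlier channel can be seen in a toy example. The sketch below assumes per-tensor symmetric 4-bit quantization (one shared scale for the whole activation matrix) and uses invented activation values; it is not taken from the paper. When one channel carries values around 50, the shared scale grows so large that every ordinary channel rounds to zero.

```python
def quantize_dequantize(matrix, bits):
    """Quantize a whole matrix with one shared scale, then map it back to floats."""
    qmax = 2 ** (bits - 1) - 1
    max_abs = max(abs(v) for row in matrix for v in row)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    return [[max(-qmax, min(qmax, round(v / scale))) * scale for v in row]
            for row in matrix]


def error_on_channels(original, restored, channels):
    """Mean absolute error restricted to the given channel (column) indices."""
    diffs = [abs(o[c] - r[c]) for o, r in zip(original, restored) for c in channels]
    return sum(diffs) / len(diffs)


# Rows are tokens, columns are channels. Values are hypothetical.
clean = [[0.41, -0.63, 0.52, 0.33],
         [-0.18, 0.70, -0.47, -0.39]]
with_outlier = [[0.41, -0.63, 50.0, 0.33],     # channel 2 is now an outlier
                [-0.18, 0.70, -48.0, -0.39]]

ordinary = [0, 1, 3]                            # the channels that are not outliers
err_clean = error_on_channels(clean, quantize_dequantize(clean, 4), ordinary)
err_outlier = error_on_channels(with_outlier,
                                quantize_dequantize(with_outlier, 4), ordinary)
print(f"error on ordinary channels without outlier: {err_clean:.4f}")
print(f"error on ordinary channels with outlier:    {err_outlier:.4f}")
```

The ordinary channels are identical in both matrices; only the presence of the outlier changes how much of their information survives quantization.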

Technical Explanation

QLLM is a PTQ technique designed specifically for LLMs. It introduces an adaptive channel reassembly technique that helps mitigate the impact of activation outliers.

The process works like this:

Channel Disassembly: QLLM first breaks down the outlier channels (the activation feature dimensions that carry unusually large values) into several sub-channels. This gives a more balanced distribution of activation magnitudes.
Channel Assembly: QLLM then merges similar channels to bring the channel count back to the original number, which keeps the layer as efficient as before.
Adaptive Strategy: QLLM automatically determines the optimal number of sub-channels to use for the channel disassembly step.
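The steps above can be sketched for a single linear layer and a single token. This is a simplified illustration under stated assumptions, not the authors' implementation: the rule T = ceil(|x| / theta) with a hand-picked threshold `theta` stands in for the adaptive strategy, the greedy closest-pair merge stands in for whatever similarity matching the paper uses, and all names and numbers are hypothetical.

```python
import math


def linear(x, W):
    """Compute y[j] = sum_i x[i] * W[i][j]; W holds one row per input channel."""
    return [sum(xi * row[j] for xi, row in zip(x, W)) for j in range(len(W[0]))]


def disassemble(x, W, channel, theta):
    """Split one outlier channel into T equal sub-channels (assumed rule for T)."""
    T = max(1, math.ceil(abs(x[channel]) / theta))
    new_x, new_W = [], []
    for i, (xi, row) in enumerate(zip(x, W)):
        copies = T if i == channel else 1
        new_x.extend([xi / copies] * copies)              # each sub-channel carries x / T
        new_W.extend([list(row) for _ in range(copies)])  # its weight row is replicated
    return new_x, new_W


def assemble(x, W, target_channels):
    """Greedily merge the two closest channels until the target count is reached."""
    x, W = list(x), [list(row) for row in W]
    while len(x) > target_channels:
        pairs = ((a, b) for a in range(len(x)) for b in range(a + 1, len(x)))
        i, j = min(pairs, key=lambda p: abs(x[p[0]] - x[p[1]]))
        x[i] = (x[i] + x[j]) / 2                          # merged activation: the average
        W[i] = [wi + wj for wi, wj in zip(W[i], W[j])]    # merged weight row: the sum
        del x[j], W[j]
    return x, W


# Hypothetical activations (channel 1 is the outlier) and weights.
x = [0.5, 8.0, 0.5, -0.3]
W = [[1.0, -1.0], [0.5, 0.25], [2.0, 0.0], [-1.0, 1.0]]

x_split, W_split = disassemble(x, W, channel=1, theta=2.0)
x_merged, W_merged = assemble(x_split, W_split, target_channels=len(x))

print("largest activation before:", max(abs(v) for v in x))
print("largest activation after: ", max(abs(v) for v in x_merged))
print("output before:", linear(x, W))
print("output after: ", linear(x_merged, W_merged))
```

Disassembly leaves the layer output unchanged because T copies of x / T times the same weight row add up to the original term. Merging is exact only when the merged activations are equal, as they are in this toy input; for channels that are merely similar it is an approximation, which is why choosing what to merge matters.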

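To compensate for the accuracy lost to quantization, QLLM also includes an efficient tuning method that learns only a small number of low-rank weights while keeping the pre-trained quantized model frozen, rather than retraining the entire model. After training, these low-rank parameters can be fused into the frozen weights, so they do not affect inference.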
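Why fusing low-rank weights adds no inference cost can be checked with a small identity: x(W + AB) = xW + (xA)B. The sketch below uses made-up matrices and a rank of 1 and does not show the training step or any re-quantization; the names `W_frozen`, `A`, and `B` are illustrative, not taken from the paper.

```python
def matmul(A, B):
    """Plain matrix product of two lists-of-rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def add(A, B):
    """Element-wise sum of two matrices with the same shape."""
    return [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]


# Hypothetical frozen weights (3 inputs, 2 outputs) and a rank-1 update A @ B.
W_frozen = [[0.5, -0.25], [0.0, 0.75], [-0.5, 0.25]]
A = [[0.1], [-0.2], [0.05]]      # 3 x 1, learned during tuning
B = [[0.3, -0.4]]                # 1 x 2, learned during tuning
x = [[1.0, 2.0, -1.0]]           # one input token

# During tuning: frozen path plus a separate low-rank path.
two_path = add(matmul(x, W_frozen), matmul(matmul(x, A), B))

# After tuning: fold the low-rank product into the weights once.
W_fused = add(W_frozen, matmul(A, B))
one_path = matmul(x, W_fused)

print("two-path output:", two_path)
print("fused output:   ", one_path)
```

After fusion the layer is a single matrix multiply of the original shape, so the extra parameters exist only during tuning.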

Experiments on LLaMA-1 and LLaMA-2 show that QLLM can produce accurate quantized models efficiently. For example, QLLM quantizes LLaMA-2-70B to 4 bits within 10 hours on a single A100-80G GPU, outperforming the previous state-of-the-art method by 7.89% in average accuracy across five zero-shot tasks.

Critical Analysis

The paper presents a promising approach to addressing the challenges of quantizing large language models, which is an important step towards making these powerful models more accessible and deployable. The adaptive channel reassembly technique seems like a clever way to mitigate the impact of activation outliers, a known issue with PTQ.

However, the paper does not provide much detail on the specific tradeoffs or limitations of the QLLM method. It would be helpful to know more about the computational and memory overhead of the channel disassembly and assembly process.

Additionally, the evaluation highlighted here covers LLaMA-1 and LLaMA-2 on a handful of zero-shot tasks. Further research would be needed to understand how well the technique generalizes to other LLM architectures, tasks, and applications.

Conclusion

Overall, the QLLM method represents an interesting advancement in the field of quantizing large language models. By adaptively reassembling the model's channels to mitigate the impact of activation outliers, the technique offers a promising path towards more efficient and deployable LLMs. As the field of natural language processing continues to advance, innovations like QLLM will be crucial in bringing these powerful models to a wider range of real-world applications.

Related Papers

APTQ: Attention-aware Post-Training Mixed-Precision Quantization for Large Language Models

Ziyi Guan, Hantao Huang, Yupeng Su, Hong Huang, Ngai Wong, Hao Yu

Large Language Models (LLMs) have greatly advanced the natural language processing paradigm. However, the high computational load and huge model sizes pose a grand challenge for deployment on edge devices. To this end, we propose APTQ (Attention-aware Post-Training Mixed-Precision Quantization) for LLMs, which considers not only the second-order information of each layer's weights, but also, for the first time, the nonlinear effect of attention outputs on the entire model. We leverage the Hessian trace as a sensitivity metric for mixed-precision quantization, ensuring an informed precision reduction that retains model performance. Experiments show APTQ surpasses previous quantization methods, achieving a perplexity of 5.22 on the C4 dataset at an average bitwidth of 4, nearly equivalent to full precision. In addition, APTQ attains state-of-the-art zero-shot accuracy of 68.24% and 70.48% at an average bitwidth of 3.8 in LLaMa-7B and LLaMa-13B, respectively, demonstrating its effectiveness in producing high-quality quantized LLMs.

4/17/2024

cs.LG cs.AI cs.CL

AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration

Ji Lin, Jiaming Tang, Haotian Tang, Shang Yang, Wei-Ming Chen, Wei-Chen Wang, Guangxuan Xiao, Xingyu Dang, Chuang Gan, Song Han

Large language models (LLMs) have fundamentally transformed the capabilities of numerous applications, from natural language processing to more intricate domain-specific tasks in robotics and autonomous driving. Moreover, the importance of on-device LLMs has grown significantly in recent years. Running LLMs on edge devices not only promises reduced latency and improved user experience but also aligns with the increasing need for user privacy, as data processing can occur locally. However, the astronomical model sizes of modern LLMs and constraints of the edge devices, primarily in terms of memory size and bandwidth, pose significant deployment challenges. In this paper, we propose Activation-aware Weight Quantization (AWQ), a hardware-friendly approach for LLM low-bit weight-only quantization. Our method is based on the observation that weights are not equally important: protecting only 1% of salient weights can greatly reduce quantization error. We then propose to search for the optimal per-channel scaling that protects the salient weights by observing the activations, not the weights. AWQ does not rely on any backpropagation or reconstruction, so it can well preserve LLMs' generalization ability on different domains and modalities, without overfitting to the calibration set. AWQ outperforms existing work on various language modeling and domain-specific benchmarks (coding and math). Thanks to better generalization, it achieves excellent quantization performance for instruction-tuned LMs and, for the first time, multi-modal LMs. Alongside AWQ, we implement TinyChat, an efficient and flexible inference framework tailored for on-device LLM/VLMs, offering more than 3x speedup over the Huggingface FP16 implementation on both desktop and mobile GPUs. It also democratizes the deployment of the 70B Llama-2 model on mobile GPUs.

4/23/2024

cs.CL

Mitigating the Impact of Outlier Channels for Language Model Quantization with Activation Regularization

Aniruddha Nrusimha, Mayank Mishra, Naigang Wang, Dan Alistarh, Rameswar Panda, Yoon Kim

We consider the problem of accurate quantization for language models, where both the weights and activations are uniformly quantized to 4 bits per parameter, the lowest bitwidth format natively supported by GPU hardware. In this context, the key challenge is activation quantization: it is known that language models contain outlier channels whose values on average are orders of magnitude higher than other channels, which prevents accurate low-bitwidth quantization with known techniques. We systematically study this phenomenon and find that these outlier channels emerge early in training, and that they occur more frequently in layers with residual streams. We then propose a simple strategy which regularizes a layer's inputs via quantization-aware training (QAT) and its outputs via activation kurtosis regularization. We show that regularizing both the inputs and outputs is crucial for preventing a model from migrating the difficulty in input quantization to the weights, which makes post-training quantization (PTQ) of weights more difficult. When combined with weight PTQ, we show that our approach can obtain a W4A4 model that performs competitively with the standard-precision W16A16 baseline.

4/5/2024

cs.LG cs.CL

CBQ: Cross-Block Quantization for Large Language Models

Xin Ding, Xiaoyu Liu, Zhijun Tu, Yun Zhang, Wei Li, Jie Hu, Hanting Chen, Yehui Tang, Zhiwei Xiong, Baoqun Yin, Yunhe Wang

Post-training quantization (PTQ) has played a key role in compressing large language models (LLMs) with ultra-low costs. However, existing PTQ methods only focus on handling the outliers within one layer or one block, which ignores the dependency of blocks and leads to severe performance degradation in low-bit settings. In this paper, we propose CBQ, a cross-block reconstruction-based PTQ method for LLMs. CBQ employs a cross-block dependency using a homologous reconstruction scheme, establishing long-range dependencies across multiple blocks to minimize error accumulation. Furthermore, CBQ incorporates a coarse-to-fine preprocessing (CFP) strategy for suppressing weight and activation outliers, coupled with an adaptive LoRA-Rounding technique for precise weight quantization. These innovations enable CBQ to not only handle extreme outliers effectively but also improve overall quantization accuracy. Extensive experiments show that CBQ achieves superior low-bit quantization (W4A4, W4A8, W2A16) and outperforms existing state-of-the-art methods across various LLMs and datasets. Notably, CBQ quantizes the 4-bit LLAMA1-65B model within only 4.3 hours on a single GPU, achieving a commendable tradeoff between performance and quantization efficiency.

4/16/2024

cs.LG cs.CL