Enhancing Computation Efficiency in Large Language Models through Weight and Activation Quantization

Read original: arXiv:2311.05161 - Published 7/19/2024 by Janghwan Lee, Minsoo Kim, Seungcheol Baek, Seok Joong Hwang, Wonyong Sung, Jungwook Choi


Overview

Large Language Models (LLMs) are powerful in natural language processing tasks, but their deployment is often limited by large parameter sizes and high computational demands.
This paper focuses on post-training quantization (PTQ) of LLMs, specifically 4-bit weight and 8-bit activation (W4A8) quantization, to improve computational efficiency.
The paper introduces two techniques: activation-quantization-aware scaling (AQAS), which accounts for the combined effect of quantizing weights and activations, and sequence-length-aware calibration (SLAC), which aligns calibration sequence lengths with those of the target tasks.
The paper also introduces a hybrid data format called dINT, which combines integer and denormal representations to address the underflow issue in W4A8 quantization, where small values are rounded to zero.

Plain English Explanation

Large Language Models (LLMs) are computer programs that can understand and generate human-like text. They are very good at tasks like answering questions, translating languages, and summarizing information. However, these models often require a lot of computing power and memory to run, which can make them difficult to use in real-world applications.

The researchers in this paper looked at ways to make LLMs more efficient and easier to use. They focused on a technique called "quantization," which involves compressing the data used by the model to make it smaller and faster to process.

Specifically, the researchers looked at a type of quantization called "4-bit weight and 8-bit activation" (W4A8) quantization. This means that the numbers used to represent the model's "weights" (the internal parameters that determine how the model behaves) are compressed to 4 bits, and the "activations" (the intermediate results calculated by the model) are compressed to 8 bits.
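To make the W4A8 idea concrete, here is a minimal NumPy sketch of plain symmetric quantization applied to one linear layer. It is an illustration under assumptions, not the paper's method: the helper name `quantize_symmetric`, the absmax scaling rule, the per-output-channel weight scales, and the random test tensors are all choices made for this example.

```python
import numpy as np

def quantize_symmetric(x, n_bits, axis=None):
    """Round x onto a signed integer grid; return (integers, scale).

    Assumption for this sketch: the scale is absmax / qmax, taken over
    the whole tensor (axis=None) or along one axis.
    """
    qmax = 2 ** (n_bits - 1) - 1  # 7 for 4 bits, 127 for 8 bits
    absmax = np.max(np.abs(x), axis=axis, keepdims=axis is not None)
    scale = np.maximum(absmax, 1e-12) / qmax
    q = np.clip(np.round(x / scale), -qmax, qmax).astype(np.int32)
    return q, scale

rng = np.random.default_rng(0)
W = rng.normal(size=(16, 32)).astype(np.float32)  # weights: (in_features, out_features)
X = rng.normal(size=(4, 16)).astype(np.float32)   # activations: (tokens, in_features)

# W4: one scale per output channel. A8: one scale for the whole tensor.
Wq, w_scale = quantize_symmetric(W, n_bits=4, axis=0)
Xq, x_scale = quantize_symmetric(X, n_bits=8)

# Integer matrix multiply, then a single rescale back to real values.
Y_int = Xq @ Wq
Y_hat = Y_int * x_scale * w_scale
Y_ref = X @ W

rel_err = np.linalg.norm(Y_hat - Y_ref) / np.linalg.norm(Y_ref)
print(f"relative output error: {rel_err:.3f}")
```

Because both operands are integers, the matrix multiply itself can run on low-precision hardware, and the floating-point scales are applied only once at the output.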

The researchers introduced two new techniques to improve this type of quantization:

Activation-Quantization-Aware Scaling (AQAS): This method accounts for the combined effect of quantizing both the weights and the activations, rather than treating weight quantization in isolation, which can lead to better performance.
Sequence-Length-Aware Calibration (SLAC): Calibration is the step that sets the quantization parameters from sample inputs. This method uses calibration inputs whose sequence lengths match those of the target task, which can also improve performance.
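One way to picture sequence-length-aware calibration is to build the calibration set from windows whose length matches what the target task will feed the model. The sketch below does only that and is not the paper's procedure: the function `make_calibration_set`, the integer stand-in for a tokenized corpus, and the lengths 128 and 2048 are hypothetical.

```python
import random

def make_calibration_set(token_ids, target_len, n_samples, seed=0):
    """Draw n_samples windows of exactly target_len tokens from a token stream."""
    if len(token_ids) < target_len:
        raise ValueError("token stream is shorter than the target length")
    rng = random.Random(seed)
    last_start = len(token_ids) - target_len
    starts = [rng.randint(0, last_start) for _ in range(n_samples)]
    return [token_ids[s:s + target_len] for s in starts]

# Stand-in for a tokenized corpus (assumption: real use would pass tokenizer output).
token_ids = list(range(10_000))

# Calibrate with short windows for a short-input task, long windows for a long-input task.
short_task_calib = make_calibration_set(token_ids, target_len=128, n_samples=8)
long_task_calib = make_calibration_set(token_ids, target_len=2048, n_samples=8)
```

The quantization parameters would then be estimated from whichever set matches the deployment scenario, so the activation statistics seen during calibration resemble those seen at inference time.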

The researchers also introduced a new data format called "dINT," which combines integer and denormal representations to address a problem where small values are rounded to zero during quantization.

Overall, the researchers found that their techniques significantly improved the accuracy of quantized LLMs, bringing task accuracy to levels comparable with the original, full-precision models. They also built arithmetic units that work with dINT and reported a 2x hardware efficiency improvement over 8-bit integer MAC (multiply-accumulate) units.

Technical Explanation

The paper focuses on post-training quantization (PTQ) of large language models (LLMs), specifically 4-bit weight and 8-bit activation (W4A8) quantization, to enhance computational efficiency. The authors present two innovative techniques:

Activation-Quantization-Aware Scaling (AQAS): This method considers the combined effects of quantizing both the weights and the activations, which can lead to better performance compared to weight-only quantization approaches.
Sequence-Length-Aware Calibration (SLAC): This technique aligns the sequence lengths of the calibration data (the samples used to set the quantization parameters) with those of the target task, so that the quantized model performs well on the target application.
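The idea of choosing a scale while watching both quantization errors can be sketched with a generic equivalent transformation, X W = (X / s)(s W), where s holds one factor per input channel. The code below grid-searches s and scores each candidate with the weights quantized to 4 bits and the activations to 8 bits. This is an assumption-laden illustration of the general family of scaling methods, not the AQAS objective from the paper; `fake_quant`, `search_channel_scale`, the power-law form of s, and the synthetic outlier channels are all hypothetical.

```python
import numpy as np

def fake_quant(x, n_bits, axis=None):
    """Quantize then dequantize with a symmetric absmax scale."""
    qmax = 2 ** (n_bits - 1) - 1
    absmax = np.max(np.abs(x), axis=axis, keepdims=axis is not None)
    scale = np.maximum(absmax, 1e-12) / qmax
    return np.clip(np.round(x / scale), -qmax, qmax) * scale

def search_channel_scale(X, W, alphas=None):
    """Grid-search s = absmax(X) ** alpha per input channel.

    Each candidate is scored by the output error with BOTH operands
    quantized (A8 per tensor, W4 per output channel).
    """
    if alphas is None:
        alphas = np.linspace(0.0, 1.0, 11)
    Y_ref = X @ W
    act_absmax = np.maximum(np.max(np.abs(X), axis=0), 1e-5)  # shape (in_features,)
    best = (np.inf, None, None)
    for alpha in alphas:
        s = act_absmax ** alpha
        Xs = fake_quant(X / s, n_bits=8)                   # scaled activations, 8-bit
        Ws = fake_quant(W * s[:, None], n_bits=4, axis=0)  # scaled weights, 4-bit
        err = float(np.mean((Xs @ Ws - Y_ref) ** 2))
        if err < best[0]:
            best = (err, float(alpha), s)
    return best

rng = np.random.default_rng(0)
X = rng.normal(size=(64, 32))
X[:, :2] *= 30.0  # two synthetic outlier activation channels
W = rng.normal(size=(32, 48))

# Baseline: quantize both operands with no scaling at all.
baseline_err = float(np.mean((fake_quant(X, 8) @ fake_quant(W, 4, axis=0) - X @ W) ** 2))
best_err, best_alpha, best_s = search_channel_scale(X, W)
print(f"baseline MSE {baseline_err:.4f}, best MSE {best_err:.4f} at alpha={best_alpha}")
```

Since alpha = 0 gives s = 1, which reproduces the unscaled baseline, the search can never do worse than no scaling on the calibration data. A weight-only criterion would instead score candidates with the activations left in full precision, which is the distinction the paragraph above draws.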

Additionally, the authors introduce a hybrid data format called "dINT," which combines integer and denormal representations to address the underflow issue in W4A8 quantization, where small values are rounded to zero.
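The underflow problem is easy to reproduce: when a few large values set the quantization scale, most small values fall below half a step and round to zero. The sketch below measures that effect on a plain 4-bit integer grid and then on a toy grid that trades the two outermost integer levels for two half-step levels. The toy grid is a hypothetical layout chosen for illustration only; the actual dINT encoding is defined in the paper and may differ. The Laplace-distributed test data and helper names are also assumptions.

```python
import numpy as np

def quantize_to_levels(x, scale, levels):
    """Map each x / scale to the nearest entry of a 1-D level array, then rescale."""
    idx = np.abs(x[:, None] / scale - levels[None, :]).argmin(axis=1)
    return levels[idx] * scale

def underflow_rate(x, x_hat):
    """Fraction of nonzero inputs that were quantized to exactly zero."""
    nonzero = x != 0
    return float(np.mean(x_hat[nonzero] == 0))

rng = np.random.default_rng(0)
w = rng.laplace(scale=0.02, size=4096)  # mostly small values
w[0] = 1.0                              # one large value stretches the scale

scale = np.max(np.abs(w)) / 7

# Plain symmetric INT4 grid: -7 ... 7.
int_levels = np.arange(-7, 8, dtype=np.float64)

# HYPOTHETICAL hybrid grid: -6 ... 6 plus two half-step levels near zero.
# The largest values now clip to 6 steps, which is the price of the extra levels.
toy_levels = np.sort(np.append(np.arange(-6, 7, dtype=np.float64), [-0.5, 0.5]))

int_rate = underflow_rate(w, quantize_to_levels(w, scale, int_levels))
toy_rate = underflow_rate(w, quantize_to_levels(w, scale, toy_levels))
print(f"underflow on integer grid: {int_rate:.1%}, on toy hybrid grid: {toy_rate:.1%}")
```

In this setup the half-step levels catch values that the integer grid would have zeroed, at the cost of clipping the largest magnitudes. A real format has to balance the same trade-off within its 16 available codes.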

The researchers evaluate their techniques on large language models, including OPT and LLaMA, and demonstrate that their methods significantly boost task accuracies to levels comparable with full-precision models. They also develop arithmetic units compatible with dINT and show that their approach yields a 2x hardware efficiency improvement compared to 8-bit integer MAC units.

Critical Analysis

The paper presents a comprehensive study on post-training quantization for LLMs, addressing the important challenge of improving computational efficiency without significantly compromising model performance. The proposed techniques, AQAS and SLAC, offer meaningful advancements in the field of LLM quantization and post-training optimization.

One potential limitation of the study is the evaluation on a relatively limited set of LLMs (OPT and LLaMA). It would be valuable to see the techniques applied to a broader range of state-of-the-art LLMs to assess their generalizability. Additionally, the authors mention the need for developing specialized hardware units compatible with the dINT format, which may pose engineering challenges and require further research.

While the paper provides a strong technical foundation, it would be beneficial to explore the implications of these techniques for real-world deployment scenarios, such as the trade-offs between model performance, energy efficiency, and hardware constraints. Addressing these practical considerations could further strengthen the impact and applicability of the proposed methods.

Conclusion

This paper presents techniques for post-training quantization of large language models, specifically targeting 4-bit weight and 8-bit activation (W4A8) quantization. The researchers introduce Activation-Quantization-Aware Scaling (AQAS) and Sequence-Length-Aware Calibration (SLAC) to enhance the quantization process, and they address the underflow issue in W4A8 quantization with a hybrid data format called dINT.

The researchers' evaluations demonstrate that their techniques can significantly boost the task accuracies of quantized LLMs, reaching levels comparable to full-precision models. Furthermore, they show that arithmetic units built for dINT yield a 2x hardware efficiency improvement compared with 8-bit integer MAC units, making quantized LLMs more viable for real-world applications with limited computing resources.

These advancements in LLM quantization and post-training optimization represent an important step towards making powerful language models more accessible and practical for deployment in a wide range of scenarios, from edge devices to cloud-based services.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!


Related Papers


Enhancing Computation Efficiency in Large Language Models through Weight and Activation Quantization

Janghwan Lee, Minsoo Kim, Seungcheol Baek, Seok Joong Hwang, Wonyong Sung, Jungwook Choi

Large Language Models (LLMs) are proficient in natural language processing tasks, but their deployment is often restricted by extensive parameter sizes and computational demands. This paper focuses on post-training quantization (PTQ) in LLMs, specifically 4-bit weight and 8-bit activation (W4A8) quantization, to enhance computational efficiency -- a topic less explored compared to weight-only quantization. We present two innovative techniques: activation-quantization-aware scaling (AQAS) and sequence-length-aware calibration (SLAC) to enhance PTQ by considering the combined effects on weights and activations and aligning calibration sequence lengths to target tasks. Moreover, we introduce dINT, a hybrid data format combining integer and denormal representations, to address the underflow issue in W4A8 quantization, where small values are rounded to zero. Through rigorous evaluations of LLMs, including OPT and LLaMA, we demonstrate that our techniques significantly boost task accuracies to levels comparable with full-precision models. By developing arithmetic units compatible with dINT, we further confirm that our methods yield a 2$\times$ hardware efficiency improvement compared to 8-bit integer MAC unit.

7/19/2024


AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration

Ji Lin, Jiaming Tang, Haotian Tang, Shang Yang, Wei-Ming Chen, Wei-Chen Wang, Guangxuan Xiao, Xingyu Dang, Chuang Gan, Song Han

Large language models (LLMs) have transformed numerous AI applications. On-device LLM is becoming increasingly important: running LLMs locally on edge devices can reduce the cloud computing cost and protect users' privacy. However, the astronomical model size and the limited hardware resource pose significant deployment challenges. We propose Activation-aware Weight Quantization (AWQ), a hardware-friendly approach for LLM low-bit weight-only quantization. AWQ finds that not all weights in an LLM are equally important. Protecting only 1% salient weights can greatly reduce quantization error. To identify salient weight channels, we should refer to the activation distribution, not weights. To avoid the hardware-inefficient mix-precision quantization, we mathematically derive that scaling up the salient channels can reduce the quantization error. AWQ employs an equivalent transformation to scale the salient weight channels to protect them. The scale is determined by collecting the activation statistics offline. AWQ does not rely on any backpropagation or reconstruction, so it generalizes to different domains and modalities without overfitting the calibration set. AWQ outperforms existing work on various language modeling and domain-specific benchmarks (coding and math). Thanks to better generalization, it achieves excellent quantization performance for instruction-tuned LMs and, for the first time, multi-modal LMs. Alongside AWQ, we implement TinyChat, an efficient and flexible inference framework tailored for 4-bit on-device LLM/VLMs. With kernel fusion and platform-aware weight packing, TinyChat offers more than 3x speedup over the Huggingface FP16 implementation on both desktop and mobile GPUs. It also democratizes the deployment of the 70B Llama-2 model on mobile GPUs.

7/19/2024


QLLM: Accurate and Efficient Low-Bitwidth Quantization for Large Language Models

Jing Liu, Ruihao Gong, Xiuying Wei, Zhiwei Dong, Jianfei Cai, Bohan Zhuang

Large Language Models (LLMs) excel in NLP, but their demands hinder their widespread deployment. While Quantization-Aware Training (QAT) offers a solution, its extensive training costs make Post-Training Quantization (PTQ) a more practical approach for LLMs. In existing studies, activation outliers in particular channels are identified as the bottleneck to PTQ accuracy. They propose to transform the magnitudes from activations to weights, which however offers limited alleviation or suffers from unstable gradients, resulting in a severe performance drop at low-bitwidth. In this paper, we propose QLLM, an accurate and efficient low-bitwidth PTQ method designed for LLMs. QLLM introduces an adaptive channel reassembly technique that reallocates the magnitude of outliers to other channels, thereby mitigating their impact on the quantization range. This is achieved by channel disassembly and channel assembly, which first breaks down the outlier channels into several sub-channels to ensure a more balanced distribution of activation magnitudes. Then similar channels are merged to maintain the original channel number for efficiency. Additionally, an adaptive strategy is designed to autonomously determine the optimal number of sub-channels for channel disassembly. To further compensate for the performance loss caused by quantization, we propose an efficient tuning method that only learns a small number of low-rank weights while freezing the pre-trained quantized model. After training, these low-rank parameters can be fused into the frozen weights without affecting inference. Extensive experiments on LLaMA-1 and LLaMA-2 show that QLLM can obtain accurate quantized models efficiently. For example, QLLM quantizes the 4-bit LLaMA-2-70B within 10 hours on a single A100-80G GPU, outperforming the previous state-of-the-art method by 7.89% on the average accuracy across five zero-shot tasks.

4/9/2024

MobileQuant: Mobile-friendly Quantization for On-device Language Models

Fuwen Tan, Royson Lee, Łukasz Dudziak, Shell Xu Hu, Sourav Bhattacharya, Timothy Hospedales, Georgios Tzimiropoulos, Brais Martinez

Large language models (LLMs) have revolutionized language processing, delivering outstanding results across multiple applications. However, deploying LLMs on edge devices poses several challenges with respect to memory, energy, and compute costs, limiting their widespread use in devices such as mobile phones. A promising solution is to reduce the number of bits used to represent weights and activations. While existing works have found partial success at quantizing LLMs to lower bitwidths, e.g. 4-bit weights, quantizing activations beyond 16 bits often leads to large computational overheads due to poor on-device quantization support, or a considerable accuracy drop. Yet, 8-bit activations are very attractive for on-device deployment as they would enable LLMs to fully exploit mobile-friendly hardware, e.g. Neural Processing Units (NPUs). In this work, we make a first attempt to facilitate the on-device deployment of LLMs using integer-only quantization. We first investigate the limitations of existing quantization methods for on-device deployment, with a special focus on activation quantization. We then address these limitations by introducing a simple post-training quantization method, named MobileQuant, that extends previous weight equivalent transformation works by jointly optimizing the weight transformation and activation range parameters in an end-to-end manner. MobileQuant demonstrates superior capabilities over existing methods by 1) achieving near-lossless quantization on a wide range of LLM benchmarks, 2) reducing latency and energy consumption by 20%-50% compared to current on-device quantization strategies, 3) requiring limited compute budget, 4) being compatible with mobile-friendly compute units, e.g. NPU.

8/27/2024