I-LLM: Efficient Integer-Only Inference for Fully-Quantized Low-Bit Large Language Models

Read original: arXiv:2405.17849 - Published 6/6/2024 by Xing Hu, Yuan Cheng, Dawei Yang, Zhihang Yuan, Jiangyong Yu, Chen Xu, Sifan Zhou


Overview

Post-training quantization (PTQ) is a technique to accelerate the inference of large language models (LLMs).
Existing PTQ methods still require a significant number of floating-point operations during inference, including quantization, de-quantization, and non-linear operations.
This limitation hinders the deployment of LLMs on edge and cloud devices.

Plain English Explanation

Post-training quantization (PTQ) is a method that can make large language models (LLMs) run faster during the final deployment stage. However, current PTQ approaches still need to perform many floating-point mathematical operations, such as converting data between different number formats and executing complex non-linear functions. This is a problem because it makes it difficult to use LLMs on devices with limited computing power, like smartphones or small servers.

To address this issue, the researchers propose a new PTQ framework called I-LLM that can run LLMs using only integer-based calculations. The key ideas are:

Fully-Smooth Block-Reconstruction (FSBR) to smooth out variations in the values being processed, which is important for keeping accuracy high when using only integers.
Dynamic Integer-only MatMul (DI-MatMul) to perform matrix multiplication with integers, rather than the usual floating-point numbers.
DI-ClippedSoftmax, DI-Exp, and DI-Normalization to execute common non-linear operations efficiently with integer arithmetic built on bit shifts.

The researchers report that I-LLM achieves accuracy comparable to the original floating-point model and can run with 4-bit weights and 4-bit activations (W4A4) with negligible accuracy loss. It also outperforms earlier quantization methods that are not integer-only.

Technical Explanation

The key technical challenge the researchers address is the large variation in activation values across different channels and tokens, both for linear operations like matrix multiplication and non-linear operations like normalization and softmax. This variation makes it difficult to quantize the model to use only integers without losing significant accuracy.
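The effect of this variation can be seen in a small sketch. It assumes plain symmetric quantization with a single shared scale, which is a generic textbook scheme and not the paper's method; the function names and the sample values are made up for illustration. One large outlier stretches the scale so far that every ordinary value collapses to the integer code 0.

```python
def quantize_symmetric(values, bits=4):
    """Symmetric quantization with one scale shared by all values."""
    qmax = 2 ** (bits - 1) - 1                      # 7 for signed 4-bit
    scale = max(abs(v) for v in values) / qmax      # set by the largest magnitude
    codes = [max(-qmax - 1, min(qmax, round(v / scale))) for v in values]
    return codes, scale


def mean_abs_error(values, codes, scale):
    """Average reconstruction error after mapping codes back to real values."""
    return sum(abs(v - c * scale) for v, c in zip(values, codes)) / len(values)


# Hypothetical activations: six ordinary values, then the same six plus an outlier.
small = [0.5, -0.3, 0.8, -0.6, 0.2, 0.7]

codes_small, scale_small = quantize_symmetric(small)
err_small = mean_abs_error(small, codes_small, scale_small)

codes_out, scale_out = quantize_symmetric(small + [60.0])
# Error measured on the six ordinary values only, using the outlier-driven scale.
err_with_outlier = mean_abs_error(small, codes_out[:len(small)], scale_out)

print(codes_small, err_small)
print(codes_out, err_with_outlier)
```

With only 16 integer levels available at 4 bits, the shared scale has to cover the outlier, so the ordinary values lose essentially all of their resolution.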

To solve this, the researchers first develop Fully-Smooth Block-Reconstruction (FSBR) to aggressively smooth the variations in activations and weights across channels. This helps preserve the important information when using low-precision integers.
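The general idea behind channel smoothing can be sketched as follows. This is not FSBR itself: it is a simplified, SmoothQuant-style rescaling of a single linear layer, assuming a square-root balancing rule and non-zero channel maxima, and it omits the block-reconstruction step entirely. The names `smooth` and `channel_spread` and the sample matrices are hypothetical. The point it shows is that dividing each activation channel by a factor and multiplying the matching weight row by the same factor leaves the layer output unchanged while flattening the activation ranges.

```python
import math


def matmul(a, b):
    """Plain matrix product of two lists of lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def smooth(x, w):
    """Move per-channel magnitude from activations x into weights w."""
    n_channels = len(w)
    scales = []
    for c in range(n_channels):
        act_max = max(abs(row[c]) for row in x)     # assumed non-zero
        wgt_max = max(abs(v) for v in w[c])         # assumed non-zero
        scales.append(math.sqrt(act_max / wgt_max))
    x_s = [[row[c] / scales[c] for c in range(n_channels)] for row in x]
    w_s = [[v * scales[c] for v in w[c]] for c in range(n_channels)]
    return x_s, w_s, scales


def channel_spread(x):
    """Ratio between the largest and smallest per-channel peak magnitude."""
    peaks = [max(abs(row[c]) for row in x) for c in range(len(x[0]))]
    return max(peaks) / min(peaks)


# Hypothetical activations (2 tokens x 3 channels); channel 1 is an outlier channel.
x = [[0.5, 40.0, -0.2],
     [-0.3, -55.0, 0.4]]
w = [[0.2, -0.1],
     [0.05, 0.02],
     [-0.3, 0.25]]

x_s, w_s, scales = smooth(x, w)
y_ref = matmul(x, w)
y_smooth = matmul(x_s, w_s)

print(channel_spread(x), channel_spread(x_s))
```

Because the rescaling cancels inside the product, the smoothing costs nothing at inference time once the factors are folded into the weights.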

Next, they introduce Dynamic Integer-only MatMul (DI-MatMul) to handle variation between tokens. Instead of fixing quantization scales ahead of time, it quantizes the inputs and outputs of each matrix multiplication dynamically, and it does so with integer-only operations.
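A rough sketch of how a matrix multiplication can stay in integers while still adapting its output scale per token is shown below. It is an illustration under stated assumptions, not the paper's DI-MatMul: every scale is assumed to be a dyadic number `mult / 2**shift` stored as two integers, and the output of each token row is requantized by a power-of-two shift chosen from that row's own peak value. The function names and sample data are hypothetical.

```python
def rounding_shift(v, r):
    """Divide v by 2**r with round-half-up, using only add and shift."""
    return v if r == 0 else (v + (1 << (r - 1))) >> r


def requantize_row(acc, mult, shift, bits):
    """Pick the smallest shift that fits this row into `bits` signed bits."""
    qmax = 2 ** (bits - 1) - 1
    peak = max(abs(v) for v in acc)
    r = 0
    while rounding_shift(peak, r) > qmax:
        r += 1
    codes = [rounding_shift(v, r) for v in acc]
    # Dropping r low bits makes each code worth 2**r more: scale = mult / 2**(shift - r).
    return codes, (mult, shift - r)


def integer_matmul_dynamic(x_q, x_scales, w_q, w_scale, out_bits=4):
    """Integer matmul with a separate, dynamically chosen output scale per token."""
    w_mult, w_shift = w_scale
    out_codes, out_scales = [], []
    for row, (x_mult, x_shift) in zip(x_q, x_scales):
        # Integer accumulation; its scale is the product of the two input scales.
        acc = [sum(row[k] * w_q[k][j] for k in range(len(w_q)))
               for j in range(len(w_q[0]))]
        codes, scale = requantize_row(acc, x_mult * w_mult, x_shift + w_shift, out_bits)
        out_codes.append(codes)
        out_scales.append(scale)
    return out_codes, out_scales


# Hypothetical 4-bit codes: 2 tokens x 3 channels, each token with its own scale.
x_q = [[3, -7, 2], [1, 5, -4]]
x_scales = [(5, 6), (3, 4)]          # 5/64 and 3/16
w_q = [[2, -3], [7, 1], [-5, 6]]
w_scale = (3, 7)                     # 3/128

out_codes, out_scales = integer_matmul_dynamic(x_q, x_scales, w_q, w_scale)
print(out_codes, out_scales)
```

No floating-point value appears anywhere in the computation; a real value can be recovered for inspection as `code * mult / 2**shift`, but the next integer layer would consume the codes and the two scale integers directly.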

Finally, the researchers design DI-ClippedSoftmax, DI-Exp, and DI-Normalization, which use bit shifts to execute non-linear operators efficiently while maintaining accuracy.
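To give a feel for how an exponential can be computed with shifts, here is a minimal sketch of one common shift-based construction. It is not the paper's DI-Exp: the fixed-point format (`FRAC_BITS`), the approximation of log2(e) by 1 + 1/2 - 1/16, and the linear approximation of 2**(-f) are all assumptions chosen for brevity. It handles non-positive inputs only, which is the case that arises in softmax after subtracting the row maximum.

```python
import math

FRAC_BITS = 12                 # assumed fixed-point format: value = integer / 2**12
ONE = 1 << FRAC_BITS


def shift_exp(x_fixed):
    """Approximate exp(x) for x <= 0 using only integer add, shift, and mask."""
    if x_fixed > 0:
        raise ValueError("this sketch only handles non-positive inputs")
    t = -x_fixed
    # exp(x) = 2**(x * log2(e)); log2(e) ~ 1.4427 is approximated by 1 + 1/2 - 1/16.
    t = t + (t >> 1) - (t >> 4)
    q = t >> FRAC_BITS             # integer part of the base-2 exponent
    f = t & (ONE - 1)              # fractional part, in [0, ONE)
    mantissa = ONE - (f >> 1)      # 2**(-f) ~ 1 - f/2 on [0, 1)
    return mantissa >> q           # apply 2**(-q) as a right shift


def to_fixed(x):
    """Helper for the demo only: convert a float to the fixed-point format."""
    return round(x * ONE)


for x in (0.0, -0.5, -1.0, -2.0, -4.0):
    approx = shift_exp(to_fixed(x)) / ONE
    print(x, approx, math.exp(x))
```

The linear piece is the main source of error in this sketch; a higher-order integer polynomial or a small lookup table would tighten it at the cost of a few more operations.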

Experiments show that their I-LLM framework can achieve comparable accuracy to the original floating-point model, while only using 4-bit integers for both weights and activations. This outperforms previous non-integer quantization methods that still required some floating-point computations.

Critical Analysis

The researchers have made impressive progress in developing an integer-only quantization approach for large language models. By addressing the key challenges of activation variation, they have been able to achieve high accuracy with very low-precision 4-bit integers.

However, the paper does not discuss the potential limitations of this approach. For example, it's unclear how the techniques would scale to even larger and more complex LLMs, or whether there are any specific model architectures or tasks where the integer-only quantization would be less effective.

Additionally, the researchers could have provided more insight into the computational efficiency gains of their methods in real-world deployment scenarios. While they show the accuracy is maintained, quantifying the actual inference speedup and memory savings would help readers understand the practical benefits.

Overall, this is an important step forward in making large language models more efficient and deployable on resource-constrained hardware. But there is still room for further research to fully understand the strengths, weaknesses, and broader applicability of integer-only quantization for LLMs.

Conclusion

The researchers have developed a novel integer-only post-training quantization framework called I-LLM that can run large language models with negligible accuracy loss compared to the original floating-point model. By addressing the key challenge of activation variation, they have been able to achieve highly efficient 4-bit quantization.

This work represents a significant advancement in making LLMs more practical for deployment on edge and cloud devices with limited computing power. If these techniques can be further refined and scaled, they could unlock applications and use cases for large language models that were previously infeasible because of their high computational requirements.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!


Related Papers


I-LLM: Efficient Integer-Only Inference for Fully-Quantized Low-Bit Large Language Models

Xing Hu, Yuan Cheng, Dawei Yang, Zhihang Yuan, Jiangyong Yu, Chen Xu, Sifan Zhou

Post-training quantization (PTQ) serves as a potent technique to accelerate the inference of large language models (LLMs). Nonetheless, existing works still necessitate a considerable number of floating-point (FP) operations during inference, including additional quantization and de-quantization, as well as non-linear operators such as RMSNorm and Softmax. This limitation hinders the deployment of LLMs on the edge and cloud devices. In this paper, we identify the primary obstacle to integer-only quantization for LLMs lies in the large fluctuation of activations across channels and tokens in both linear and non-linear operations. To address this issue, we propose I-LLM, a novel integer-only fully-quantized PTQ framework tailored for LLMs. Specifically, (1) we develop Fully-Smooth Block-Reconstruction (FSBR) to aggressively smooth inter-channel variations of all activations and weights. (2) to alleviate degradation caused by inter-token variations, we introduce a novel approach called Dynamic Integer-only MatMul (DI-MatMul). This method enables dynamic quantization in full-integer matrix multiplication by dynamically quantizing the input and outputs with integer-only operations. (3) we design DI-ClippedSoftmax, DI-Exp, and DI-Normalization, which utilize bit shift to execute non-linear operators efficiently while maintaining accuracy. The experiment shows that our I-LLM achieves comparable accuracy to the FP baseline and outperforms non-integer quantization methods. For example, I-LLM can operate at W4A4 with negligible loss of accuracy. To our knowledge, we are the first to bridge the gap between integer-only quantization and LLMs. We've published our code on anonymous.4open.science, aiming to contribute to the advancement of this field.

6/6/2024


QLLM: Accurate and Efficient Low-Bitwidth Quantization for Large Language Models

Jing Liu, Ruihao Gong, Xiuying Wei, Zhiwei Dong, Jianfei Cai, Bohan Zhuang

Large Language Models (LLMs) excel in NLP, but their demands hinder their widespread deployment. While Quantization-Aware Training (QAT) offers a solution, its extensive training costs make Post-Training Quantization (PTQ) a more practical approach for LLMs. In existing studies, activation outliers in particular channels are identified as the bottleneck to PTQ accuracy. They propose to transform the magnitudes from activations to weights, which however offers limited alleviation or suffers from unstable gradients, resulting in a severe performance drop at low-bitwidth. In this paper, we propose QLLM, an accurate and efficient low-bitwidth PTQ method designed for LLMs. QLLM introduces an adaptive channel reassembly technique that reallocates the magnitude of outliers to other channels, thereby mitigating their impact on the quantization range. This is achieved by channel disassembly and channel assembly, which first breaks down the outlier channels into several sub-channels to ensure a more balanced distribution of activation magnitudes. Then similar channels are merged to maintain the original channel number for efficiency. Additionally, an adaptive strategy is designed to autonomously determine the optimal number of sub-channels for channel disassembly. To further compensate for the performance loss caused by quantization, we propose an efficient tuning method that only learns a small number of low-rank weights while freezing the pre-trained quantized model. After training, these low-rank parameters can be fused into the frozen weights without affecting inference. Extensive experiments on LLaMA-1 and LLaMA-2 show that QLLM can obtain accurate quantized models efficiently. For example, QLLM quantizes the 4-bit LLaMA-2-70B within 10 hours on a single A100-80G GPU, outperforming the previous state-of-the-art method by 7.89% on the average accuracy across five zero-shot tasks.

4/9/2024

ABQ-LLM: Arbitrary-Bit Quantized Inference Acceleration for Large Language Models

Chao Zeng, Songwei Liu, Yusheng Xie, Hong Liu, Xiaojian Wang, Miao Wei, Shu Yang, Fangmin Chen, Xing Mei

Large Language Models (LLMs) have revolutionized natural language processing tasks. However, their practical application is constrained by substantial memory and computational demands. Post-training quantization (PTQ) is considered an effective method to accelerate LLM inference. Despite its growing popularity in LLM model compression, PTQ deployment faces two major challenges. First, low-bit quantization leads to performance degradation. Second, restricted by the limited integer computing unit type on GPUs, quantized matrix operations with different precisions cannot be effectively accelerated. To address these issues, we introduce a novel arbitrary-bit quantization algorithm and inference framework, ABQ-LLM. It achieves superior performance across various quantization settings and enables efficient arbitrary-precision quantized inference on the GPU. ABQ-LLM introduces several key innovations: (1) a distribution correction method for transformer blocks to mitigate distribution differences caused by full quantization of weights and activations, improving performance at low bit-widths. (2) the bit balance strategy to counteract performance degradation from asymmetric distribution issues at very low bit-widths (e.g., 2-bit). (3) an innovative quantization acceleration framework that reconstructs the quantization matrix multiplication of arbitrary precision combinations based on BTC (Binary TensorCore) equivalents, gets rid of the limitations of INT4/INT8 computing units. ABQ-LLM can convert each component bit width gain into actual acceleration gain, maximizing performance under mixed precision (e.g., W6A6, W2A8). Based on W2*A8 quantization configuration on LLaMA-7B model, it achieved a WikiText2 perplexity of 7.59 (2.17$\downarrow$ vs 9.76 in AffineQuant). Compared to SmoothQuant, we realized 1.6$\times$ acceleration improvement and 2.7$\times$ memory compression gain.

8/26/2024


Enhancing Computation Efficiency in Large Language Models through Weight and Activation Quantization

Janghwan Lee, Minsoo Kim, Seungcheol Baek, Seok Joong Hwang, Wonyong Sung, Jungwook Choi

Large Language Models (LLMs) are proficient in natural language processing tasks, but their deployment is often restricted by extensive parameter sizes and computational demands. This paper focuses on post-training quantization (PTQ) in LLMs, specifically 4-bit weight and 8-bit activation (W4A8) quantization, to enhance computational efficiency -- a topic less explored compared to weight-only quantization. We present two innovative techniques: activation-quantization-aware scaling (AQAS) and sequence-length-aware calibration (SLAC) to enhance PTQ by considering the combined effects on weights and activations and aligning calibration sequence lengths to target tasks. Moreover, we introduce dINT, a hybrid data format combining integer and denormal representations, to address the underflow issue in W4A8 quantization, where small values are rounded to zero. Through rigorous evaluations of LLMs, including OPT and LLaMA, we demonstrate that our techniques significantly boost task accuracies to levels comparable with full-precision models. By developing arithmetic units compatible with dINT, we further confirm that our methods yield a 2$\times$ hardware efficiency improvement compared to 8-bit integer MAC unit.

7/19/2024