Mitigating Quantization Errors Due to Activation Spikes in GLU-Based LLMs

Read original: arXiv:2405.14428 - Published 5/24/2024 by Jaewoo Yang, Hayun Kim, Younghoon Kim

Overview

Modern large language models (LLMs) have achieved state-of-the-art performance through architectural improvements, but require significant computational resources for inference.
To reduce the inference cost, post-training quantization (PTQ) has become a popular approach, where the model's weights and activations are converted to lower-precision formats like INT8.
This paper reveals challenges with activation quantization in feed-forward network (FFN) modules that use Gated Linear Unit (GLU) variants, which are widely used in modern LLMs like the LLaMA family.

Plain English Explanation

Large language models (LLMs) are powerful AI systems that can understand and generate human-like text. To make these models more efficient and practical to use, researchers have been exploring ways to reduce the computational cost of running them, such as quantization.

Quantization involves converting the model's weights and activations (intermediate calculations) to lower-precision data types, like 8-bit integers instead of 32-bit floating-point numbers. This can significantly reduce the memory and processing power required to run the model, but it also introduces the risk of losing some accuracy.
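To make the idea concrete, here is a minimal sketch of symmetric, per-tensor INT8 quantization in plain Python. The function names and the toy activation values are illustrative assumptions, not code from the paper; real toolkits add calibration, zero points, and finer granularity.

```python
def quantize_int8(values):
    """Symmetric per-tensor quantization: map floats to integer codes in [-127, 127]."""
    max_abs = max(abs(v) for v in values)
    # One scale for the whole tensor, chosen so the largest magnitude maps to 127.
    scale = max_abs / 127 if max_abs > 0 else 1.0
    codes = [max(-127, min(127, round(v / scale))) for v in values]
    return codes, scale


def dequantize(codes, scale):
    """Map integer codes back to approximate float values."""
    return [c * scale for c in codes]


# Toy activations (assumed values, for illustration only).
activations = [0.5, -1.27, 0.02, 1.0]
codes, scale = quantize_int8(activations)
restored = dequantize(codes, scale)
print(codes, scale, restored)
```

Each restored value differs from the original by at most half a quantization step, which is the accuracy cost the paragraph above refers to.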

The paper focuses on a particular challenge with quantizing the activations in a type of module called a Gated Linear Unit (GLU), which is commonly used in the feed-forward networks of modern LLMs. The researchers found that the activations in these GLU modules can sometimes have "spikes" - unusually large values that are difficult to quantize without losing a lot of performance.

Through their analysis, the researchers identified patterns in where these activation spikes occur (often in the early and late layers of the model) and how they are concentrated on specific tokens rather than spread across the entire sequence. Based on these insights, they propose two new techniques, called Quantization-free Module (QFeM) and Quantization-free Prefix (QFeP), to help isolate and preserve these important activation spikes during the quantization process.

Technical Explanation

The paper explores the challenges of activation quantization in Gated Linear Unit (GLU) variants, which are widely used in the feed-forward networks (FFNs) of modern large language models (LLMs) like the LLaMA family.
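For readers unfamiliar with GLU variants, the sketch below shows a SwiGLU-style feed-forward block of the kind used in LLaMA-family models, written in plain Python. The helper names and the tiny weight matrices are assumptions for illustration; production implementations use tensor libraries and learned weights.

```python
import math


def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def silu(z):
    """SiLU (swish) activation: z * sigmoid(z)."""
    return z / (1.0 + math.exp(-z))


def swiglu_ffn(x, W_gate, W_up, W_down):
    """Gated FFN: down( SiLU(gate(x)) * up(x) ). Returns (output, hidden)."""
    gate = [silu(g) for g in matvec(W_gate, x)]
    up = matvec(W_up, x)
    # Element-wise product of the two branches; this is the input to the
    # down projection and is what an activation quantizer would have to handle.
    hidden = [g * u for g, u in zip(gate, up)]
    return matvec(W_down, hidden), hidden


# Toy weights (assumed values, not taken from any real model).
x = [1.0, 2.0]
W_gate = [[1.0, 0.0], [0.0, 1.0]]
W_up = [[2.0, 0.0], [0.0, 3.0]]
W_down = [[1.0, 1.0]]
output, hidden = swiglu_ffn(x, W_gate, W_up, W_down)
print(output, hidden)
```

Because the hidden activation is a product of two projections, its magnitude is not bounded by either branch alone, which is one intuition for why very large values can appear there.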

The researchers found that severe local quantization errors, caused by excessive magnitudes of activation in GLU variants, can significantly degrade the performance of the quantized LLM. They refer to these problematic activations as "activation spikes."
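The effect of a single spike on coarse-grained quantization can be illustrated with a small experiment. This is a sketch under assumed toy values (the spike magnitude of 900.0 is hypothetical), using one shared scale per tensor.

```python
def int8_roundtrip(values):
    """Quantize to INT8 with one shared scale, then dequantize."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    return [max(-127, min(127, round(v / scale))) * scale for v in values]


def mean_abs_error(original, restored):
    """Average absolute difference between original and restored values."""
    return sum(abs(a - b) for a, b in zip(original, restored)) / len(original)


normal = [0.11, -0.42, 0.73, -0.95, 0.28, 0.64]
spiked = normal + [900.0]  # hypothetical activation spike sharing the same scale

# Error on the ordinary values, without and with the spike in the tensor.
err_without = mean_abs_error(normal, int8_roundtrip(normal))
err_with = mean_abs_error(normal, int8_roundtrip(spiked)[: len(normal)])
print(err_without, err_with)
```

With the spike present, the shared scale becomes so large that every ordinary value rounds to zero, so the information carried by those values is lost entirely.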

Through their analysis, the researchers made the following key observations:

The activation spikes occur in the FFN of specific layers, particularly in the early and late layers of the model.
The activation spikes are confined to a small number of specific tokens rather than being spread across the entire sequence.
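Both observations suggest a simple diagnostic: record the largest activation magnitude per token in each layer and flag the entries that dwarf the rest. The sketch below does this on made-up numbers; the data layout, the `find_spikes` name, and the threshold of 100 are assumptions for illustration, not the paper's procedure.

```python
from statistics import median


def find_spikes(token_magnitudes, ratio_threshold=100.0):
    """Return (layer, token_position) pairs whose magnitude far exceeds the layer median.

    token_magnitudes maps a layer index to a list holding the maximum
    absolute activation observed for each token position.
    """
    spikes = []
    for layer, mags in sorted(token_magnitudes.items()):
        med = median(mags)
        for pos, m in enumerate(mags):
            if med > 0 and m / med > ratio_threshold:
                spikes.append((layer, pos))
    return spikes


# Toy measurements for a 32-layer model (assumed values).
toy = {
    0: [1.2, 0.9, 1.1, 1.0],
    1: [850.0, 1.3, 0.8, 1.1],
    15: [1.0, 1.4, 0.7, 1.2],
    31: [620.0, 0.9, 1.1, 1.0],
}
spike_locations = find_spikes(toy)
print(spike_locations)
```

In this toy data the flagged entries sit in an early and a late layer and on a single token position, mirroring the pattern described above.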

Based on these findings, the researchers propose two new techniques to address the activation spike issue:

Quantization-free Module (QFeM): This method isolates the activation spikes by skipping the quantization of certain FFN modules, preserving their full-precision values.
Quantization-free Prefix (QFeP): This technique identifies a prefix that triggers the activation spikes, computes it ahead of time, and preserves its context as a key-value (KV) cache, so that the spikes do not recur in the tokens that follow.
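The module-skipping idea behind QFeM can be sketched as a selection step over calibration statistics. The version below keeps a module in full precision when the ratio of its maximum to its median input magnitude exceeds a threshold. The ratio criterion, the `alpha` value, and the module names are assumptions chosen for illustration; the paper's code repository is the authoritative reference for the actual selection rule.

```python
from statistics import median


def select_unquantized_modules(calib_magnitudes, alpha=50.0):
    """Return the names of modules to leave unquantized.

    calib_magnitudes maps a module name to input magnitudes collected on
    calibration data. A module is kept in full precision when
    max / median exceeds alpha (a hypothetical threshold).
    """
    keep_fp = []
    for name, mags in calib_magnitudes.items():
        med = median(mags)
        ratio = max(mags) / med if med > 0 else float("inf")
        if ratio > alpha:
            keep_fp.append(name)
    return sorted(keep_fp)


# Toy calibration statistics with illustrative module names.
calib = {
    "layers.0.mlp.down_proj": [1.0, 1.2, 0.9, 1.1],
    "layers.1.mlp.down_proj": [1.0, 1.1, 0.9, 780.0],
    "layers.16.mlp.down_proj": [1.3, 0.8, 1.0, 2.5],
    "layers.31.mlp.down_proj": [0.9, 1.0, 640.0, 1.2],
}
skipped = select_unquantized_modules(calib)
print(skipped)
```

Raising the threshold quantizes more modules and saves more compute, at the risk of letting a spiking module through; lowering it does the opposite.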

The researchers validate the effectiveness of these methods through extensive experiments on the latest LLMs with GLU variants, including LLaMA-2/3, Mistral, Mixtral, SOLAR, and Gemma. They show that their techniques can enhance the performance of current state-of-the-art activation quantization methods, such as SmoothQuant, which struggle to control the activation spikes.
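For context, SmoothQuant is generally described as shifting quantization difficulty from activations to weights with a per-channel factor s_j = max|X_j|^alpha / max|W_j|^(1 - alpha), dividing the activations by s and multiplying the weights by s so the product is unchanged. The sketch below applies that formula to a two-channel toy example; the function names and values are illustrative assumptions.

```python
def smoothing_factors(act_absmax, weight_absmax, alpha=0.5):
    """Per-channel factors: activation range^alpha / weight range^(1 - alpha)."""
    return [a ** alpha / w ** (1.0 - alpha) for a, w in zip(act_absmax, weight_absmax)]


def dot(x, w):
    """Inner product of two equal-length vectors."""
    return sum(xi * wi for xi, wi in zip(x, w))


# One token with two input channels, and one output column of weights (toy values).
x = [16.0, 1.0]
w = [1.0, 4.0]

s = smoothing_factors([abs(v) for v in x], [abs(v) for v in w])
x_smooth = [xi / si for xi, si in zip(x, s)]  # activations become easier to quantize
w_smooth = [wi * si for wi, si in zip(w, s)]  # weights absorb the difficulty

print(s, x_smooth, w_smooth, dot(x, w), dot(x_smooth, w_smooth))
```

The factors here are per channel and shared by all tokens, whereas the spikes described above are concentrated on a few tokens, which gives some intuition for why the summary reports that this kind of technique struggles to control them.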

Critical Analysis

The researchers have identified an important challenge in quantizing the activations of large language models that use GLU variants in their feed-forward networks. Their proposed solutions, QFeM and QFeP, appear to be effective in addressing this issue based on the experiments presented in the paper.

However, it's worth noting that the researchers' methods involve selectively skipping the quantization of certain model components, which could limit the overall compression and efficiency gains achieved through quantization. Additionally, the techniques may add some complexity to the quantization process, which could impact ease of use or deployment.

Further research could explore alternative approaches to mitigating the activation spike problem, such as activation-aware weight quantization or spiking language models, to see if they can provide similar performance improvements while maintaining the overall efficiency gains of quantization.

Conclusion

This paper sheds light on a significant challenge in quantizing large language models that use Gated Linear Unit (GLU) variants in their feed-forward networks. The researchers' insights into the patterns of "activation spikes" and their proposed techniques, Quantization-free Module (QFeM) and Quantization-free Prefix (QFeP), offer a promising approach to addressing this issue.

By pairing QFeM and QFeP with existing post-training quantization methods such as SmoothQuant, the researchers show that the accuracy of quantized LLMs can be improved, which is an important step towards making these powerful AI systems more efficient and accessible. Further research in this area could lead to even more effective and practical quantization solutions for large language models.

Related Papers

Mitigating Quantization Errors Due to Activation Spikes in GLU-Based LLMs

Jaewoo Yang, Hayun Kim, Younghoon Kim

Modern large language models (LLMs) have established state-of-the-art performance through architectural improvements, but still require significant computational cost for inference. In an effort to reduce the inference cost, post-training quantization (PTQ) has become a popular approach, quantizing weights and activations to lower precision, such as INT8. In this paper, we reveal the challenges of activation quantization in GLU variants, which are widely used in feed-forward network (FFN) of modern LLMs, such as LLaMA family. The problem is that severe local quantization errors, caused by excessive magnitudes of activation in GLU variants, significantly degrade the performance of the quantized LLM. We denote these activations as activation spikes. Our further observations provide a systematic pattern of activation spikes: 1) The activation spikes occur in the FFN of specific layers, particularly in the early and late layers, 2) The activation spikes are dedicated to a couple of tokens, rather than being shared across a sequence. Based on our observations, we propose two empirical methods, Quantization-free Module (QFeM) and Quantization-free Prefix (QFeP), to isolate the activation spikes during quantization. Our extensive experiments validate the effectiveness of the proposed methods for the activation quantization, especially with coarse-grained scheme, of latest LLMs with GLU variants, including LLaMA-2/3, Mistral, Mixtral, SOLAR, and Gemma. In particular, our methods enhance the current alleviation techniques (e.g., SmoothQuant) that fail to control the activation spikes. Code is available at https://github.com/onnoo/activation-spikes.

5/24/2024

Enhancing Computation Efficiency in Large Language Models through Weight and Activation Quantization

Janghwan Lee, Minsoo Kim, Seungcheol Baek, Seok Joong Hwang, Wonyong Sung, Jungwook Choi

Large Language Models (LLMs) are proficient in natural language processing tasks, but their deployment is often restricted by extensive parameter sizes and computational demands. This paper focuses on post-training quantization (PTQ) in LLMs, specifically 4-bit weight and 8-bit activation (W4A8) quantization, to enhance computational efficiency -- a topic less explored compared to weight-only quantization. We present two innovative techniques: activation-quantization-aware scaling (AQAS) and sequence-length-aware calibration (SLAC) to enhance PTQ by considering the combined effects on weights and activations and aligning calibration sequence lengths to target tasks. Moreover, we introduce dINT, a hybrid data format combining integer and denormal representations, to address the underflow issue in W4A8 quantization, where small values are rounded to zero. Through rigorous evaluations of LLMs, including OPT and LLaMA, we demonstrate that our techniques significantly boost task accuracies to levels comparable with full-precision models. By developing arithmetic units compatible with dINT, we further confirm that our methods yield a 2$\times$ hardware efficiency improvement compared to 8-bit integer MAC unit.

7/19/2024

MobileQuant: Mobile-friendly Quantization for On-device Language Models

Fuwen Tan, Royson Lee, Łukasz Dudziak, Shell Xu Hu, Sourav Bhattacharya, Timothy Hospedales, Georgios Tzimiropoulos, Brais Martinez

Large language models (LLMs) have revolutionized language processing, delivering outstanding results across multiple applications. However, deploying LLMs on edge devices poses several challenges with respect to memory, energy, and compute costs, limiting their widespread use in devices such as mobile phones. A promising solution is to reduce the number of bits used to represent weights and activations. While existing works have found partial success at quantizing LLMs to lower bitwidths, e.g. 4-bit weights, quantizing activations beyond 16 bits often leads to large computational overheads due to poor on-device quantization support, or a considerable accuracy drop. Yet, 8-bit activations are very attractive for on-device deployment as they would enable LLMs to fully exploit mobile-friendly hardware, e.g. Neural Processing Units (NPUs). In this work, we make a first attempt to facilitate the on-device deployment of LLMs using integer-only quantization. We first investigate the limitations of existing quantization methods for on-device deployment, with a special focus on activation quantization. We then address these limitations by introducing a simple post-training quantization method, named MobileQuant, that extends previous weight equivalent transformation works by jointly optimizing the weight transformation and activation range parameters in an end-to-end manner. MobileQuant demonstrates superior capabilities over existing methods by 1) achieving near-lossless quantization on a wide range of LLM benchmarks, 2) reducing latency and energy consumption by 20%-50% compared to current on-device quantization strategies, 3) requiring limited compute budget, 4) being compatible with mobile-friendly compute units, e.g. NPU.

8/27/2024

Evaluating Quantized Large Language Models

Shiyao Li, Xuefei Ning, Luning Wang, Tengxuan Liu, Xiangsheng Shi, Shengen Yan, Guohao Dai, Huazhong Yang, Yu Wang

Post-training quantization (PTQ) has emerged as a promising technique to reduce the cost of large language models (LLMs). Specifically, PTQ can effectively mitigate memory consumption and reduce computational overhead in LLMs. To meet the requirements of both high efficiency and performance across diverse scenarios, a comprehensive evaluation of quantized LLMs is essential to guide the selection of quantization methods. This paper presents a thorough evaluation of these factors by evaluating the effect of PTQ on Weight, Activation, and KV Cache on 11 model families, including OPT, LLaMA2, Falcon, Bloomz, Mistral, ChatGLM, Vicuna, LongChat, StableLM, Gemma, and Mamba, with parameters ranging from 125M to 180B. The evaluation encompasses five types of tasks: basic NLP, emergent ability, trustworthiness, dialogue, and long-context tasks. Moreover, we also evaluate the state-of-the-art (SOTA) quantization methods to demonstrate their applicability. Based on the extensive experiments, we systematically summarize the effect of quantization, provide recommendations to apply quantization techniques, and point out future directions. The code can be found in https://github.com/thu-nics/qllm-eval.

6/7/2024