CBQ: Cross-Block Quantization for Large Language Models

arXiv:2312.07950

Published 4/16/2024 by Xin Ding, Xiaoyu Liu, Zhijun Tu, Yun Zhang, Wei Li, Jie Hu, Hanting Chen, Yehui Tang, Zhiwei Xiong, Baoqun Yin and 1 other

cs.LG cs.CL

Abstract

Post-training quantization (PTQ) has played a key role in compressing large language models (LLMs) with ultra-low costs. However, existing PTQ methods only focus on handling the outliers within one layer or one block, which ignores the dependency of blocks and leads to severe performance degradation in low-bit settings. In this paper, we propose CBQ, a cross-block reconstruction-based PTQ method for LLMs. CBQ employs a cross-block dependency using a homologous reconstruction scheme, establishing long-range dependencies across multiple blocks to minimize error accumulation. Furthermore, CBQ incorporates a coarse-to-fine preprocessing (CFP) strategy for suppressing weight and activation outliers, coupled with an adaptive LoRA-Rounding technique for precise weight quantization. These innovations enable CBQ to not only handle extreme outliers effectively but also improve overall quantization accuracy. Extensive experiments show that CBQ achieves superior low-bit quantization (W4A4, W4A8, W2A16) and outperforms existing state-of-the-art methods across various LLMs and datasets. Notably, CBQ quantizes the 4-bit LLAMA1-65B model within only 4.3 hours on a single GPU, achieving a commendable tradeoff between performance and quantization efficiency.

Overview

This paper presents CBQ (Cross-Block Quantization), a post-training quantization (PTQ) method for compressing large language models (LLMs) to low bitwidths.
The key idea is to reconstruct several blocks of the model jointly rather than one layer or one block at a time, so that the dependencies between blocks are taken into account and quantization error accumulates less across the network.
The authors also introduce a coarse-to-fine preprocessing (CFP) strategy that suppresses weight and activation outliers, and an adaptive LoRA-Rounding technique for more precise weight quantization.

Plain English Explanation

The paper discusses a new way to compress large language models, which are complex AI systems that can understand and generate human-like text. These models can be very large and require a lot of memory and computing power, making them difficult to use on devices with limited resources like smartphones.

The researchers developed a technique called Cross-Block Quantization (CBQ) that can shrink the size of these language models without losing too much of their accuracy. The main insight is that the blocks of a model depend on one another: an error introduced when compressing one block is passed on to every block after it. By tuning several neighboring blocks together instead of one at a time, the method keeps these errors from piling up.

Additionally, the paper introduces ways to address some specific challenges with quantizing language models, such as dealing with outlier values that can degrade the model's performance, and a more careful way of rounding each weight to its low-precision value. The compression procedure itself is also cheap to run: the abstract reports quantizing the 4-bit LLAMA1-65B model in 4.3 hours on a single GPU.

Overall, this research provides a promising approach to making large language models more practical to use in a wider range of applications by compressing them without sacrificing too much of their original capabilities.

Technical Explanation

The paper introduces Cross-Block Quantization (CBQ), a novel post-training quantization technique for efficiently compressing large language models (LLMs) to low bitwidths while maintaining high accuracy.
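To make "low bitwidths" concrete, the following is a minimal sketch of plain round-to-nearest uniform quantization, the baseline operation that PTQ methods refine. It is not CBQ itself: the function name `quantize_dequantize`, the min/max range choice, and the sample weights are all illustrative assumptions.

```python
def quantize_dequantize(values, n_bits):
    """Asymmetric uniform round-to-nearest quantization of a list of floats.

    Returns the de-quantized values, i.e. what the model effectively
    computes with after the weights have been stored in n_bits.
    The min/max range used here is an assumption for illustration.
    """
    levels = 2 ** n_bits - 1              # highest integer code
    lo, hi = min(values), max(values)
    scale = (hi - lo) / levels if hi > lo else 1.0
    out = []
    for v in values:
        q = round((v - lo) / scale)       # nearest integer code
        q = max(0, min(levels, q))        # clamp to the representable range
        out.append(lo + q * scale)        # map the code back to a float
    return out


# Hypothetical weights, chosen only to show the effect of the bitwidth.
weights = [-0.62, -0.11, 0.03, 0.27, 0.90]
for bits in (8, 4, 2):
    approx = quantize_dequantize(weights, bits)
    worst = max(abs(a - b) for a, b in zip(weights, approx))
    print(bits, [round(a, 3) for a in approx], round(worst, 4))
```

With a min/max range the worst-case error per value is half a quantization step, and the step doubles with every bit removed, which is why 2-bit and 4-bit settings need more than naive rounding.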

The key idea behind CBQ is to account for the dependencies between the blocks of the LLM. Existing post-training quantization methods [<a class="ltx_ref" href="https://aimodels.fyi/papers/arxiv/qllm-accurate-efficient-low-bitwidth-quantization-large">QLLM</a>] handle outliers and optimize quantization parameters within one layer or one block, which ignores these dependencies and leads to severe performance degradation in low-bit settings. In contrast, CBQ reconstructs multiple blocks jointly using what the authors call a homologous reconstruction scheme, establishing long-range dependencies across blocks so that quantization error accumulates less from one block to the next.
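The abstract describes reconstruction across multiple blocks rather than one block at a time. One way to picture this is a sliding window over block indices, where consecutive windows overlap and the reconstruction error is measured at the output of the whole window. The sketch below is an assumption-laden toy: the window size, the overlap, the scalar "blocks", and the names `sliding_windows` and `window_reconstruction_error` are hypothetical, and the optimization of quantization parameters inside each window is omitted.

```python
def sliding_windows(num_blocks, window_size, overlap):
    """Group block indices into windows; neighbors share `overlap` blocks."""
    if not 0 <= overlap < window_size:
        raise ValueError("overlap must satisfy 0 <= overlap < window_size")
    stride = window_size - overlap
    windows, start = [], 0
    while True:
        end = min(start + window_size, num_blocks)
        windows.append(list(range(start, end)))
        if end == num_blocks:
            break
        start += stride
    return windows


def window_reconstruction_error(fp_blocks, q_blocks, window, x):
    """Squared error between full-precision and quantized window outputs."""
    y_fp, y_q = x, x
    for i in window:
        y_fp = fp_blocks[i](y_fp)   # full-precision path
        y_q = q_blocks[i](y_q)      # quantized path, errors carry forward
    return (y_fp - y_q) ** 2


# Toy scalar "blocks": each multiplies its input by one weight (hypothetical).
fp_weights = [1.23, 0.87, 1.41, 0.95]
q_weights = [1.2, 0.9, 1.4, 1.0]    # the same weights after coarse rounding
fp_blocks = [lambda x, w=w: w * x for w in fp_weights]
q_blocks = [lambda x, w=w: w * x for w in q_weights]

for window in sliding_windows(len(fp_blocks), window_size=2, overlap=1):
    print(window, window_reconstruction_error(fp_blocks, q_blocks, window, 1.0))
```

Because each window's loss is taken after several blocks, parameters in an early block are tuned with knowledge of how their error propagates, and the shared block between neighboring windows ties consecutive optimizations together.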

To further improve the quantization, the authors introduce a coarse-to-fine preprocessing (CFP) strategy to [<a class="ltx_ref" href="https://aimodels.fyi/papers/arxiv/accurate-block-quantization-llms-outliers">suppress weight and activation outliers</a>] that can degrade the model's performance (see also related work on [<a class="ltx_ref" href="https://aimodels.fyi/papers/arxiv/mitigating-impact-outlier-channels-language-model-quantization">outlier channels in language model quantization</a>]), as well as an adaptive LoRA-Rounding technique for precise weight quantization.

Experiments across various LLMs and datasets show that CBQ outperforms existing state-of-the-art post-training quantization methods in low-bit settings (W4A4, W4A8, W2A16). The abstract also reports that CBQ quantizes the 4-bit LLAMA1-65B model in 4.3 hours on a single GPU.
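For a sense of scale, the settings named in the abstract use the WxAy convention: x bits per weight and y bits per activation. The arithmetic below estimates weight storage only, assuming a nominal 65 billion parameters for LLAMA1-65B and ignoring the small overhead of scales and zero-points; the helper `weight_memory_gb` is illustrative.

```python
def weight_memory_gb(num_params, bits_per_weight):
    """Weight storage in gigabytes (10**9 bytes), excluding scales/zero-points."""
    return num_params * bits_per_weight / 8 / 1e9


params_65b = 65e9   # nominal parameter count, an assumption
for bits in (16, 4, 2):
    print(f"{bits}-bit weights: {weight_memory_gb(params_65b, bits)} GB")
# 16-bit weights: 130.0 GB
# 4-bit weights: 32.5 GB
# 2-bit weights: 16.25 GB
```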

Critical Analysis

The paper presents a comprehensive set of techniques to enable efficient and accurate quantization of large language models. The authors thoroughly evaluate their methods on a range of popular LLMs and demonstrate impressive results.

However, one potential limitation is that the experiments are primarily focused on post-training quantization. It would be interesting to see how the CBQ techniques could be combined with [<a class="ltx_ref" href="https://aimodels.fyi/papers/arxiv/qaq-quality-adaptive-quantization-llm-kv-cache">adaptive quantization approaches</a>] that adjust the bitwidths during model training or inference. This could potentially lead to even greater compression and efficiency gains.

Additionally, while the authors address the issue of outlier channels, there may be other model-specific challenges that could arise when deploying quantized LLMs in real-world applications. Further research on the robustness and generalization of these techniques would be valuable.

Finally, the paper does not explicitly discuss the potential environmental or societal impacts of its findings. As large language models become more widespread, it will be important to consider the energy efficiency and carbon footprint of deploying these systems, particularly in resource-constrained settings. The authors could have provided a more holistic discussion of the implications of their work.

Conclusion

The Cross-Block Quantization (CBQ) technique presented in this paper offers a promising approach to efficiently compressing large language models without sacrificing too much of their performance. By reconstructing multiple blocks jointly and accounting for the dependencies between them, CBQ achieves higher accuracy than previous post-training quantization methods in low-bit settings.

The additional techniques introduced in the paper, namely coarse-to-fine preprocessing of weight and activation outliers and adaptive LoRA-Rounding, further enhance the practicality of deploying quantized LLMs on devices with limited resources. This research represents an important step towards making powerful language models more accessible and energy-efficient, with potential applications in a wide range of domains, from mobile devices to edge computing.

As language models continue to grow in size and complexity, techniques like CBQ will become increasingly crucial for enabling their widespread adoption and real-world impact.

Related Papers

QLLM: Accurate and Efficient Low-Bitwidth Quantization for Large Language Models

Jing Liu, Ruihao Gong, Xiuying Wei, Zhiwei Dong, Jianfei Cai, Bohan Zhuang

Large Language Models (LLMs) excel in NLP, but their demands hinder their widespread deployment. While Quantization-Aware Training (QAT) offers a solution, its extensive training costs make Post-Training Quantization (PTQ) a more practical approach for LLMs. In existing studies, activation outliers in particular channels are identified as the bottleneck to PTQ accuracy. They propose to transform the magnitudes from activations to weights, which however offers limited alleviation or suffers from unstable gradients, resulting in a severe performance drop at low-bitwidth. In this paper, we propose QLLM, an accurate and efficient low-bitwidth PTQ method designed for LLMs. QLLM introduces an adaptive channel reassembly technique that reallocates the magnitude of outliers to other channels, thereby mitigating their impact on the quantization range. This is achieved by channel disassembly and channel assembly, which first breaks down the outlier channels into several sub-channels to ensure a more balanced distribution of activation magnitudes. Then similar channels are merged to maintain the original channel number for efficiency. Additionally, an adaptive strategy is designed to autonomously determine the optimal number of sub-channels for channel disassembly. To further compensate for the performance loss caused by quantization, we propose an efficient tuning method that only learns a small number of low-rank weights while freezing the pre-trained quantized model. After training, these low-rank parameters can be fused into the frozen weights without affecting inference. Extensive experiments on LLaMA-1 and LLaMA-2 show that QLLM can obtain accurate quantized models efficiently. For example, QLLM quantizes the 4-bit LLaMA-2-70B within 10 hours on a single A100-80G GPU, outperforming the previous state-of-the-art method by 7.89% on the average accuracy across five zero-shot tasks.

4/9/2024

cs.CL cs.AI cs.LG

APTQ: Attention-aware Post-Training Mixed-Precision Quantization for Large Language Models

Ziyi Guan, Hantao Huang, Yupeng Su, Hong Huang, Ngai Wong, Hao Yu

Large Language Models (LLMs) have greatly advanced the natural language processing paradigm. However, the high computational load and huge model sizes pose a grand challenge for deployment on edge devices. To this end, we propose APTQ (Attention-aware Post-Training Mixed-Precision Quantization) for LLMs, which considers not only the second-order information of each layer's weights, but also, for the first time, the nonlinear effect of attention outputs on the entire model. We leverage the Hessian trace as a sensitivity metric for mixed-precision quantization, ensuring an informed precision reduction that retains model performance. Experiments show APTQ surpasses previous quantization methods, achieving a perplexity of 5.22 on the C4 dataset at an average bitwidth of 4, nearly equivalent to full precision. In addition, APTQ attains state-of-the-art zero-shot accuracy of 68.24% and 70.48% at an average bitwidth of 3.8 in LLaMa-7B and LLaMa-13B, respectively, demonstrating its effectiveness in producing high-quality quantized LLMs.

4/17/2024

cs.LG cs.AI cs.CL

AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration

Ji Lin, Jiaming Tang, Haotian Tang, Shang Yang, Wei-Ming Chen, Wei-Chen Wang, Guangxuan Xiao, Xingyu Dang, Chuang Gan, Song Han

Large language models (LLMs) have fundamentally transformed the capabilities of numerous applications, from natural language processing to more intricate domain-specific tasks in robotics and autonomous driving. Moreover, the importance of on-device LLMs has grown significantly in recent years. Running LLMs on edge devices not only promises reduced latency and improved user experience but also aligns with the increasing need for user privacy, as data processing can occur locally. However, the astronomical model sizes of modern LLMs and constraints of the edge devices, primarily in terms of memory size and bandwidth, pose significant deployment challenges. In this paper, we propose Activation-aware Weight Quantization (AWQ), a hardware-friendly approach for LLM low-bit weight-only quantization. Our method is based on the observation that weights are not equally important: protecting only 1% of salient weights can greatly reduce quantization error. We then propose to search for the optimal per-channel scaling that protects the salient weights by observing the activation, not weights. AWQ does not rely on any backpropagation or reconstruction, so it can well preserve LLMs' generalization ability on different domains and modalities, without overfitting to the calibration set. AWQ outperforms existing work on various language modeling and domain-specific benchmarks (coding and math). Thanks to better generalization, it achieves excellent quantization performance for instruction-tuned LMs and, for the first time, multi-modal LMs. Alongside AWQ, we implement TinyChat, an efficient and flexible inference framework tailored for on-device LLM/VLMs, offering more than 3x speedup over the Huggingface FP16 implementation on both desktop and mobile GPUs. It also democratizes the deployment of the 70B Llama-2 model on mobile GPUs.

4/23/2024

cs.CL

When Quantization Affects Confidence of Large Language Models?

Irina Proskurina, Luc Brun, Guillaume Metzler, Julien Velcin

Recent studies introduced effective compression techniques for Large Language Models (LLMs) via post-training quantization or low-bit weight representation. Although quantized weights offer storage efficiency and allow for faster inference, existing works have indicated that quantization might compromise performance and exacerbate biases in LLMs. This study investigates the confidence and calibration of quantized models, considering factors such as language model type and scale as contributors to quantization loss. Firstly, we reveal that quantization with GPTQ to 4-bit results in a decrease in confidence regarding true labels, with varying impacts observed among different language models. Secondly, we observe fluctuations in the impact on confidence across different scales. Finally, we propose an explanation for quantization loss based on confidence levels, indicating that quantization disproportionately affects samples where the full model exhibited low confidence levels in the first place.

5/2/2024

cs.CL cs.AI