LLM-QBench: A Benchmark Towards the Best Practice for Post-training Quantization of Large Language Models

2405.06001

Published 5/13/2024 by Ruihao Gong, Yang Yong, Shiqiao Gu, Yushi Huang, Yunchen Zhang, Xianglong Liu, Dacheng Tao

LLM-QBench: A Benchmark Towards the Best Practice for Post-training Quantization of Large Language Models

Abstract

Recent advancements in large language models (LLMs) are propelling us toward artificial general intelligence, thanks to their remarkable emergent abilities and reasoning capabilities. However, the substantial computational and memory requirements of LLMs limit their widespread adoption. Quan- tization, a key compression technique, offers a viable solution to mitigate these demands by compressing and accelerating LLMs, albeit with poten- tial risks to model accuracy. Numerous studies have aimed to minimize the accuracy loss associated with quantization. However, the quantization configurations in these studies vary and may not be optimized for hard- ware compatibility. In this paper, we focus on identifying the most effective practices for quantizing LLMs, with the goal of balancing performance with computational efficiency. For a fair analysis, we develop a quantization toolkit LLMC, and design four crucial principles considering the inference efficiency, quantized accuracy, calibration cost, and modularization. By benchmarking on various models and datasets with over 500 experiments, three takeaways corresponding to calibration data, quantization algorithm, and quantization schemes are derived. Finally, a best practice of LLM PTQ pipeline is constructed. All the benchmark results and the toolkit can be found at https://github.com/ModelTC/llmc.

Create account to get full access

Overview

Presents a new benchmark called LLM-QBench for evaluating post-training quantization of large language models
Aims to establish best practices and guidelines for effectively quantizing large language models
Examines various quantization techniques and their impact on model performance across different tasks and datasets

Plain English Explanation

LLM-QBench is a new benchmark designed to help researchers and developers find the best ways to reduce the size and computational demands of large language models without significantly impacting their performance.

Large language models like GPT-3 have become incredibly powerful, but they also require a lot of memory and processing power to run. Quantization is a technique that can compress these models by reducing the precision of the numerical values used to represent the model parameters. However, finding the right balance between compression and performance is crucial.

The LLM-QBench benchmark evaluates different quantization techniques across a variety of tasks and datasets. This allows researchers to identify the "best practices" - the quantization methods that provide the greatest efficiency gains with the smallest impact on a model's capabilities. Techniques like APTQ and CBQ are examined, and the results can help quantify the trade-offs between model scale, precision, and capabilities.

By establishing a common benchmark, this research aims to accelerate progress in making large language models more compact and efficient, which could unlock their use in a wider range of real-world applications.

Technical Explanation

The LLM-QBench benchmark covers a range of quantization techniques and evaluates their impact on large language model performance across various tasks and datasets. This includes standard post-training quantization methods as well as more advanced techniques like APTQ and CBQ.

The benchmark assesses quantized models on a diverse set of tasks, including natural language understanding, text generation, and task-specific fine-tuning. This enables a comprehensive analysis of how quantization impacts model capabilities across a range of applications.

The results from LLM-QBench are intended to establish best practices and guidelines for effectively quantizing large language models. By identifying the sweet spot between compression and performance, this work aims to help make these powerful models more practical and accessible for real-world deployment.

Critical Analysis

While the LLM-QBench benchmark provides a valuable framework for evaluating quantization techniques, the paper acknowledges that the findings may be specific to the particular models and datasets used in the study. Extending the benchmark to include a broader range of large language models and tasks would help strengthen the generalizability of the conclusions.

Additionally, the paper does not delve deeply into the underlying mechanisms or theory behind the different quantization methods. A more thorough exploration of the trade-offs and design considerations for each technique could provide additional insights for practitioners.

Lastly, the benchmark focuses primarily on post-training quantization, but there may be value in also exploring quantization-aware training approaches, which could potentially yield even greater efficiency gains. Incorporating such methods into future iterations of the benchmark could further enhance its utility.

Conclusion

The LLM-QBench benchmark represents an important step towards establishing best practices for quantizing large language models. By comprehensively evaluating a range of quantization techniques across diverse tasks and datasets, this research aims to help unlock the potential of these powerful models for a wider range of real-world applications.

The findings from LLM-QBench can guide developers and researchers in choosing the most effective quantization strategies, balancing the competing demands of model performance, memory usage, and computational efficiency. As the field of large language models continues to evolve, this benchmark can serve as a valuable tool for driving progress and ensuring that these transformative technologies can be deployed more widely and sustainably.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Evaluating the Generalization Ability of Quantized LLMs: Benchmark, Analysis, and Toolbox

Yijun Liu, Yuan Meng, Fang Wu, Shenhao Peng, Hang Yao, Chaoyu Guan, Chen Tang, Xinzhu Ma, Zhi Wang, Wenwu Zhu

Large language models (LLMs) have exhibited exciting progress in multiple scenarios, while the huge computational demands hinder their deployments in lots of real-world applications. As an effective means to reduce memory footprint and inference cost, quantization also faces challenges in performance degradation at low bit-widths. Understanding the impact of quantization on LLM capabilities, especially the generalization ability, is crucial. However, the community's main focus remains on the algorithms and models of quantization, with insufficient attention given to whether the quantized models can retain the strong generalization abilities of LLMs. In this work, we fill this gap by providing a comprehensive benchmark suite for this research topic, including an evaluation system, detailed analyses, and a general toolbox. Specifically, based on the dominant pipeline in LLM quantization, we primarily explore the impact of calibration data distribution on the generalization of quantized LLMs and conduct the benchmark using more than 40 datasets within two main scenarios. Based on this benchmark, we conduct extensive experiments with two well-known LLMs (English and Chinese) and four quantization algorithms to investigate this topic in-depth, yielding several counter-intuitive and valuable findings, e.g., models quantized using a calibration set with the same distribution as the test data are not necessarily optimal. Besides, to facilitate future research, we also release a modular-designed toolbox, which decouples the overall pipeline into several separate components, e.g., base LLM module, dataset module, quantizer module, etc. and allows subsequent researchers to easily assemble their methods through a simple configuration. Our benchmark suite is publicly available at https://github.com/TsingmaoAI/MI-optimize

6/21/2024

cs.LG cs.AI cs.CL

A Comprehensive Evaluation of Quantization Strategies for Large Language Models

Renren Jin, Jiangcun Du, Wuwei Huang, Wei Liu, Jian Luan, Bin Wang, Deyi Xiong

Increasing the number of parameters in large language models (LLMs) usually improves performance in downstream tasks but raises compute and memory costs, making deployment difficult in resource-limited settings. Quantization techniques, which reduce the bits needed for model weights or activations with minimal performance loss, have become popular due to the rise of LLMs. However, most quantization studies use pre-trained LLMs, and the impact of quantization on instruction-tuned LLMs and the relationship between perplexity and benchmark performance of quantized LLMs are not well understood. Evaluation of quantized LLMs is often limited to language modeling and a few classification tasks, leaving their performance on other benchmarks unclear. To address these gaps, we propose a structured evaluation framework consisting of three critical dimensions: (1) knowledge & capacity, (2) alignment, and (3) efficiency, and conduct extensive experiments across ten diverse benchmarks. Our experimental results indicate that LLMs with 4-bit quantization can retain performance comparable to their non-quantized counterparts, and perplexity can serve as a proxy metric for quantized LLMs on most benchmarks. Furthermore, quantized LLMs with larger parameter scales can outperform smaller LLMs. Despite the memory savings achieved through quantization, it can also slow down the inference speed of LLMs. Consequently, substantial engineering efforts and hardware support are imperative to achieve a balanced optimization of decoding speed and memory consumption in the context of quantized LLMs.

6/7/2024

cs.CL cs.AI

💬

Evaluating Quantized Large Language Models

Shiyao Li, Xuefei Ning, Luning Wang, Tengxuan Liu, Xiangsheng Shi, Shengen Yan, Guohao Dai, Huazhong Yang, Yu Wang

Post-training quantization (PTQ) has emerged as a promising technique to reduce the cost of large language models (LLMs). Specifically, PTQ can effectively mitigate memory consumption and reduce computational overhead in LLMs. To meet the requirements of both high efficiency and performance across diverse scenarios, a comprehensive evaluation of quantized LLMs is essential to guide the selection of quantization methods. This paper presents a thorough evaluation of these factors by evaluating the effect of PTQ on Weight, Activation, and KV Cache on 11 model families, including OPT, LLaMA2, Falcon, Bloomz, Mistral, ChatGLM, Vicuna, LongChat, StableLM, Gemma, and Mamba, with parameters ranging from 125M to 180B. The evaluation encompasses five types of tasks: basic NLP, emergent ability, trustworthiness, dialogue, and long-context tasks. Moreover, we also evaluate the state-of-the-art (SOTA) quantization methods to demonstrate their applicability. Based on the extensive experiments, we systematically summarize the effect of quantization, provide recommendations to apply quantization techniques, and point out future directions. The code can be found in https://github.com/thu-nics/qllm-eval.

6/7/2024

cs.CL cs.AI

💬

QLLM: Accurate and Efficient Low-Bitwidth Quantization for Large Language Models

Jing Liu, Ruihao Gong, Xiuying Wei, Zhiwei Dong, Jianfei Cai, Bohan Zhuang

Large Language Models (LLMs) excel in NLP, but their demands hinder their widespread deployment. While Quantization-Aware Training (QAT) offers a solution, its extensive training costs make Post-Training Quantization (PTQ) a more practical approach for LLMs. In existing studies, activation outliers in particular channels are identified as the bottleneck to PTQ accuracy. They propose to transform the magnitudes from activations to weights, which however offers limited alleviation or suffers from unstable gradients, resulting in a severe performance drop at low-bitwidth. In this paper, we propose QLLM, an accurate and efficient low-bitwidth PTQ method designed for LLMs. QLLM introduces an adaptive channel reassembly technique that reallocates the magnitude of outliers to other channels, thereby mitigating their impact on the quantization range. This is achieved by channel disassembly and channel assembly, which first breaks down the outlier channels into several sub-channels to ensure a more balanced distribution of activation magnitudes. Then similar channels are merged to maintain the original channel number for efficiency. Additionally, an adaptive strategy is designed to autonomously determine the optimal number of sub-channels for channel disassembly. To further compensate for the performance loss caused by quantization, we propose an efficient tuning method that only learns a small number of low-rank weights while freezing the pre-trained quantized model. After training, these low-rank parameters can be fused into the frozen weights without affecting inference. Extensive experiments on LLaMA-1 and LLaMA-2 show that QLLM can obtain accurate quantized models efficiently. For example, QLLM quantizes the 4-bit LLaMA-2-70B within 10 hours on a single A100-80G GPU, outperforming the previous state-of-the-art method by 7.89% on the average accuracy across five zero-shot tasks.

4/9/2024

cs.CL cs.AI cs.LG