Evaluating Quantized Large Language Models

2402.18158

Published 6/7/2024 by Shiyao Li, Xuefei Ning, Luning Wang, Tengxuan Liu, Xiangsheng Shi, Shengen Yan, Guohao Dai, Huazhong Yang, Yu Wang

cs.CL cs.AI

💬

Abstract

Post-training quantization (PTQ) has emerged as a promising technique to reduce the cost of large language models (LLMs). Specifically, PTQ can effectively mitigate memory consumption and reduce computational overhead in LLMs. To meet the requirements of both high efficiency and performance across diverse scenarios, a comprehensive evaluation of quantized LLMs is essential to guide the selection of quantization methods. This paper presents a thorough evaluation of these factors by evaluating the effect of PTQ on Weight, Activation, and KV Cache on 11 model families, including OPT, LLaMA2, Falcon, Bloomz, Mistral, ChatGLM, Vicuna, LongChat, StableLM, Gemma, and Mamba, with parameters ranging from 125M to 180B. The evaluation encompasses five types of tasks: basic NLP, emergent ability, trustworthiness, dialogue, and long-context tasks. Moreover, we also evaluate the state-of-the-art (SOTA) quantization methods to demonstrate their applicability. Based on the extensive experiments, we systematically summarize the effect of quantization, provide recommendations to apply quantization techniques, and point out future directions. The code can be found in https://github.com/thu-nics/qllm-eval.

Create account to get full access

Overview

This paper evaluates the impact of post-training quantization (PTQ) on large language models (LLMs) to reduce their memory usage and computational requirements.
The researchers tested 11 different LLM model families, including OPT, LLaMA2, Falcon, Bloomz, and others, with model sizes ranging from 125 million to 180 billion parameters.
They examined the effects of quantizing different components of the models, including weights, activations, and key-value caches, and evaluated performance across a variety of task types.
The paper also compares the effectiveness of different state-of-the-art quantization techniques and provides recommendations for applying quantization to LLMs.

Plain English Explanation

Large language models (LLMs) like GPT-3 and BERT are incredibly powerful, but they also require a lot of memory and computing power to run. This can make them expensive and difficult to use, especially on smaller devices or in resource-constrained environments.

To address this, the researchers in this paper looked at a technique called post-training quantization (PTQ). PTQ is a way to "compress" the LLMs by reducing the precision of the numbers used to represent the model's parameters and activations. This can significantly reduce the model's memory footprint and computational requirements without drastically reducing its performance.

The researchers tested PTQ on 11 different LLM families, ranging from 125 million parameters all the way up to 180 billion parameters. They looked at how quantizing different parts of the model (the weights, activations, and key-value caches) affected the model's performance on a variety of tasks, including basic language understanding, emergent abilities, trustworthiness, dialogue, and long-context tasks.

Overall, the results showed that PTQ can be an effective way to make LLMs more efficient without sacrificing too much in terms of their capabilities. The researchers provide recommendations on how to best apply quantization techniques to different types of LLMs and highlight areas for future research.

Technical Explanation

The researchers in this paper conducted a comprehensive evaluation of post-training quantization (PTQ) techniques for large language models (LLMs). PTQ is a method of compressing LLMs by reducing the precision of the numbers used to represent the model's parameters and activations, which can significantly reduce the model's memory usage and computational requirements.

The researchers tested PTQ on 11 different LLM model families, including OPT, LLaMA2, Falcon, Bloomz, and others, with model sizes ranging from 125 million to 180 billion parameters. They examined the effects of quantizing different components of the models, including weights, activations, and key-value caches, and evaluated the models' performance across five types of tasks: basic NLP, emergent ability, trustworthiness, dialogue, and long-context tasks.

The researchers also evaluated the effectiveness of state-of-the-art quantization methods, such as QLLM and LLM-QBench, to demonstrate their applicability to LLMs.

Based on the extensive experiments, the researchers systematically summarized the effects of quantization on LLMs and provided recommendations for applying quantization techniques. They also identified future research directions, such as exploring the impact of outliers and calibration sets on quantization performance.

Critical Analysis

The researchers in this paper provide a comprehensive and well-designed evaluation of post-training quantization (PTQ) techniques for large language models (LLMs). The breadth of the model families and task types tested, as well as the comparison of state-of-the-art quantization methods, make this a valuable contribution to the field.

However, the paper does not delve into the potential limitations or caveats of the quantization techniques. For example, it would be helpful to understand how the quantization methods might perform on more specialized or domain-specific LLMs, or how they might handle rare or out-of-distribution inputs.

Additionally, the paper focuses on the technical aspects of quantization and its impact on model performance, but it does not explore the potential implications for real-world deployment and use cases. Further research could investigate the tradeoffs between model efficiency and other factors, such as model interpretability, fairness, and safety, when applying quantization techniques.

Overall, this paper provides a strong foundation for understanding the effects of PTQ on LLMs and offers a solid starting point for future research in this area. By encouraging readers to think critically about the research and its potential limitations, the paper helps to advance the field in a thoughtful and responsible manner.

Conclusion

This paper presents a comprehensive evaluation of post-training quantization (PTQ) techniques for large language models (LLMs), with the goal of reducing the memory consumption and computational overhead of these powerful models. The researchers tested 11 different LLM model families, ranging from 125 million to 180 billion parameters, and examined the effects of quantizing various model components, including weights, activations, and key-value caches.

The results demonstrate that PTQ can be an effective way to make LLMs more efficient without significantly compromising their performance on a variety of tasks, including basic language understanding, emergent abilities, trustworthiness, dialogue, and long-context tasks. The researchers also provide recommendations for applying quantization techniques to different types of LLMs and identify areas for future research, such as exploring the impact of outliers and calibration sets on quantization performance.

Overall, this paper makes a valuable contribution to the field of large language model optimization, providing a comprehensive and well-designed evaluation of quantization strategies that can help guide the development of more efficient and accessible LLMs for a wide range of applications.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

A Comprehensive Evaluation of Quantization Strategies for Large Language Models

Renren Jin, Jiangcun Du, Wuwei Huang, Wei Liu, Jian Luan, Bin Wang, Deyi Xiong

Increasing the number of parameters in large language models (LLMs) usually improves performance in downstream tasks but raises compute and memory costs, making deployment difficult in resource-limited settings. Quantization techniques, which reduce the bits needed for model weights or activations with minimal performance loss, have become popular due to the rise of LLMs. However, most quantization studies use pre-trained LLMs, and the impact of quantization on instruction-tuned LLMs and the relationship between perplexity and benchmark performance of quantized LLMs are not well understood. Evaluation of quantized LLMs is often limited to language modeling and a few classification tasks, leaving their performance on other benchmarks unclear. To address these gaps, we propose a structured evaluation framework consisting of three critical dimensions: (1) knowledge & capacity, (2) alignment, and (3) efficiency, and conduct extensive experiments across ten diverse benchmarks. Our experimental results indicate that LLMs with 4-bit quantization can retain performance comparable to their non-quantized counterparts, and perplexity can serve as a proxy metric for quantized LLMs on most benchmarks. Furthermore, quantized LLMs with larger parameter scales can outperform smaller LLMs. Despite the memory savings achieved through quantization, it can also slow down the inference speed of LLMs. Consequently, substantial engineering efforts and hardware support are imperative to achieve a balanced optimization of decoding speed and memory consumption in the context of quantized LLMs.

6/7/2024

cs.CL cs.AI

LLM-QBench: A Benchmark Towards the Best Practice for Post-training Quantization of Large Language Models

Ruihao Gong, Yang Yong, Shiqiao Gu, Yushi Huang, Yunchen Zhang, Xianglong Liu, Dacheng Tao

Recent advancements in large language models (LLMs) are propelling us toward artificial general intelligence, thanks to their remarkable emergent abilities and reasoning capabilities. However, the substantial computational and memory requirements of LLMs limit their widespread adoption. Quan- tization, a key compression technique, offers a viable solution to mitigate these demands by compressing and accelerating LLMs, albeit with poten- tial risks to model accuracy. Numerous studies have aimed to minimize the accuracy loss associated with quantization. However, the quantization configurations in these studies vary and may not be optimized for hard- ware compatibility. In this paper, we focus on identifying the most effective practices for quantizing LLMs, with the goal of balancing performance with computational efficiency. For a fair analysis, we develop a quantization toolkit LLMC, and design four crucial principles considering the inference efficiency, quantized accuracy, calibration cost, and modularization. By benchmarking on various models and datasets with over 500 experiments, three takeaways corresponding to calibration data, quantization algorithm, and quantization schemes are derived. Finally, a best practice of LLM PTQ pipeline is constructed. All the benchmark results and the toolkit can be found at https://github.com/ModelTC/llmc.

5/13/2024

cs.LG cs.AI cs.CL

🐍

Combining multiple post-training techniques to achieve most efficient quantized LLMs

Sayeh Sharify, Zifei Xu, Wanzin Yazar, Xin Wang

Large Language Models (LLMs) have distinguished themselves with outstanding performance in complex language modeling tasks, yet they come with significant computational and storage challenges. This paper explores the potential of quantization to mitigate these challenges. We systematically study the combined application of two well-known post-training techniques, SmoothQuant and GPTQ, and provide a comprehensive analysis of their interactions and implications for advancing LLM quantization. We enhance the versatility of both techniques by enabling quantization to microscaling (MX) formats, expanding their applicability beyond their initial fixed-point format targets. We show that by applying GPTQ and SmoothQuant, and employing MX formats for quantizing models, we can achieve a significant reduction in the size of OPT models by up to 4x and LLaMA models by up to 3x with a negligible perplexity increase of 1-3%.

5/14/2024

cs.LG cs.AI

💬

QLLM: Accurate and Efficient Low-Bitwidth Quantization for Large Language Models

Jing Liu, Ruihao Gong, Xiuying Wei, Zhiwei Dong, Jianfei Cai, Bohan Zhuang

Large Language Models (LLMs) excel in NLP, but their demands hinder their widespread deployment. While Quantization-Aware Training (QAT) offers a solution, its extensive training costs make Post-Training Quantization (PTQ) a more practical approach for LLMs. In existing studies, activation outliers in particular channels are identified as the bottleneck to PTQ accuracy. They propose to transform the magnitudes from activations to weights, which however offers limited alleviation or suffers from unstable gradients, resulting in a severe performance drop at low-bitwidth. In this paper, we propose QLLM, an accurate and efficient low-bitwidth PTQ method designed for LLMs. QLLM introduces an adaptive channel reassembly technique that reallocates the magnitude of outliers to other channels, thereby mitigating their impact on the quantization range. This is achieved by channel disassembly and channel assembly, which first breaks down the outlier channels into several sub-channels to ensure a more balanced distribution of activation magnitudes. Then similar channels are merged to maintain the original channel number for efficiency. Additionally, an adaptive strategy is designed to autonomously determine the optimal number of sub-channels for channel disassembly. To further compensate for the performance loss caused by quantization, we propose an efficient tuning method that only learns a small number of low-rank weights while freezing the pre-trained quantized model. After training, these low-rank parameters can be fused into the frozen weights without affecting inference. Extensive experiments on LLaMA-1 and LLaMA-2 show that QLLM can obtain accurate quantized models efficiently. For example, QLLM quantizes the 4-bit LLaMA-2-70B within 10 hours on a single A100-80G GPU, outperforming the previous state-of-the-art method by 7.89% on the average accuracy across five zero-shot tasks.

4/9/2024

cs.CL cs.AI cs.LG