Combining multiple post-training techniques to achieve most efficient quantized LLMs

2405.07135

Published 5/14/2024 by Sayeh Sharify, Zifei Xu, Wanzin Yazar, Xin Wang

🐍

Abstract

Large Language Models (LLMs) have distinguished themselves with outstanding performance in complex language modeling tasks, yet they come with significant computational and storage challenges. This paper explores the potential of quantization to mitigate these challenges. We systematically study the combined application of two well-known post-training techniques, SmoothQuant and GPTQ, and provide a comprehensive analysis of their interactions and implications for advancing LLM quantization. We enhance the versatility of both techniques by enabling quantization to microscaling (MX) formats, expanding their applicability beyond their initial fixed-point format targets. We show that by applying GPTQ and SmoothQuant, and employing MX formats for quantizing models, we can achieve a significant reduction in the size of OPT models by up to 4x and LLaMA models by up to 3x with a negligible perplexity increase of 1-3%.

Create account to get full access

Overview

Large language models (LLMs) have demonstrated impressive performance in complex language tasks, but come with significant computational and storage challenges.
This paper explores the potential of quantization techniques to mitigate these challenges.
The researchers systematically study the combined application of two well-known post-training techniques, SmoothQuant and GPTQ, and analyze their interactions and implications for advancing LLM quantization.
The techniques are enhanced to enable quantization to microscaling (MX) formats, expanding their applicability beyond fixed-point formats.

Plain English Explanation

Large language models (LLMs) are powerful AI systems that can perform complex language tasks with outstanding results. However, these models require a lot of computing power and storage space, which can be a challenge. This paper explores a technique called quantization that can help reduce the size and computational demands of LLMs without significantly impacting their performance.

The researchers combined two well-known quantization techniques, SmoothQuant and GPTQ, and tested them on popular LLM models like OPT and LLaMA. They found that by using these techniques and a new quantization format called MX, they could reduce the size of OPT models by up to 4 times and LLaMA models by up to 3 times, with only a small increase in perplexity (a measure of how well the model can predict text).

Technical Explanation

The paper systematically studies the combined application of two well-known post-training quantization techniques, SmoothQuant and GPTQ, and analyzes their interactions and implications for advancing LLM quantization. The researchers enhance the versatility of both techniques by enabling quantization to microscaling (MX) formats, expanding their applicability beyond their initial fixed-point format targets.

The experiments show that by applying GPTQ and SmoothQuant, and employing MX formats for quantizing models, the researchers can achieve a significant reduction in the size of OPT models by up to 4x and LLaMA models by up to 3x, with a negligible perplexity increase of 1-3%.

Critical Analysis

The paper provides a comprehensive analysis of the interactions between GPTQ and SmoothQuant, as well as the implications of using MX formats for LLM quantization. However, the researchers do not address the potential impact of quantization on model confidence, which has been explored in other studies, such as When Quantization Affects Confidence in Large Language Models.

Additionally, the paper does not discuss the potential trade-offs between model size reduction and other performance metrics, such as compressibility or inference latency. Further research may be needed to fully understand the implications of these techniques in real-world deployments.

Conclusion

This paper demonstrates the potential of quantization techniques, specifically the combined application of SmoothQuant and GPTQ with MX formats, to significantly reduce the size of large language models without compromising their performance. The findings have important implications for the deployment of LLMs in resource-constrained environments, such as edge devices or mobile applications, where storage and computational efficiency are crucial. The research contributes to the ongoing efforts to make LLMs more accessible and practical for a wider range of applications.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

💬

Evaluating Quantized Large Language Models

Shiyao Li, Xuefei Ning, Luning Wang, Tengxuan Liu, Xiangsheng Shi, Shengen Yan, Guohao Dai, Huazhong Yang, Yu Wang

Post-training quantization (PTQ) has emerged as a promising technique to reduce the cost of large language models (LLMs). Specifically, PTQ can effectively mitigate memory consumption and reduce computational overhead in LLMs. To meet the requirements of both high efficiency and performance across diverse scenarios, a comprehensive evaluation of quantized LLMs is essential to guide the selection of quantization methods. This paper presents a thorough evaluation of these factors by evaluating the effect of PTQ on Weight, Activation, and KV Cache on 11 model families, including OPT, LLaMA2, Falcon, Bloomz, Mistral, ChatGLM, Vicuna, LongChat, StableLM, Gemma, and Mamba, with parameters ranging from 125M to 180B. The evaluation encompasses five types of tasks: basic NLP, emergent ability, trustworthiness, dialogue, and long-context tasks. Moreover, we also evaluate the state-of-the-art (SOTA) quantization methods to demonstrate their applicability. Based on the extensive experiments, we systematically summarize the effect of quantization, provide recommendations to apply quantization techniques, and point out future directions. The code can be found in https://github.com/thu-nics/qllm-eval.

6/7/2024

cs.CL cs.AI

LLM-QBench: A Benchmark Towards the Best Practice for Post-training Quantization of Large Language Models

Ruihao Gong, Yang Yong, Shiqiao Gu, Yushi Huang, Yunchen Zhang, Xianglong Liu, Dacheng Tao

Recent advancements in large language models (LLMs) are propelling us toward artificial general intelligence, thanks to their remarkable emergent abilities and reasoning capabilities. However, the substantial computational and memory requirements of LLMs limit their widespread adoption. Quan- tization, a key compression technique, offers a viable solution to mitigate these demands by compressing and accelerating LLMs, albeit with poten- tial risks to model accuracy. Numerous studies have aimed to minimize the accuracy loss associated with quantization. However, the quantization configurations in these studies vary and may not be optimized for hard- ware compatibility. In this paper, we focus on identifying the most effective practices for quantizing LLMs, with the goal of balancing performance with computational efficiency. For a fair analysis, we develop a quantization toolkit LLMC, and design four crucial principles considering the inference efficiency, quantized accuracy, calibration cost, and modularization. By benchmarking on various models and datasets with over 500 experiments, three takeaways corresponding to calibration data, quantization algorithm, and quantization schemes are derived. Finally, a best practice of LLM PTQ pipeline is constructed. All the benchmark results and the toolkit can be found at https://github.com/ModelTC/llmc.

5/13/2024

cs.LG cs.AI cs.CL

💬

QLLM: Accurate and Efficient Low-Bitwidth Quantization for Large Language Models

Jing Liu, Ruihao Gong, Xiuying Wei, Zhiwei Dong, Jianfei Cai, Bohan Zhuang

Large Language Models (LLMs) excel in NLP, but their demands hinder their widespread deployment. While Quantization-Aware Training (QAT) offers a solution, its extensive training costs make Post-Training Quantization (PTQ) a more practical approach for LLMs. In existing studies, activation outliers in particular channels are identified as the bottleneck to PTQ accuracy. They propose to transform the magnitudes from activations to weights, which however offers limited alleviation or suffers from unstable gradients, resulting in a severe performance drop at low-bitwidth. In this paper, we propose QLLM, an accurate and efficient low-bitwidth PTQ method designed for LLMs. QLLM introduces an adaptive channel reassembly technique that reallocates the magnitude of outliers to other channels, thereby mitigating their impact on the quantization range. This is achieved by channel disassembly and channel assembly, which first breaks down the outlier channels into several sub-channels to ensure a more balanced distribution of activation magnitudes. Then similar channels are merged to maintain the original channel number for efficiency. Additionally, an adaptive strategy is designed to autonomously determine the optimal number of sub-channels for channel disassembly. To further compensate for the performance loss caused by quantization, we propose an efficient tuning method that only learns a small number of low-rank weights while freezing the pre-trained quantized model. After training, these low-rank parameters can be fused into the frozen weights without affecting inference. Extensive experiments on LLaMA-1 and LLaMA-2 show that QLLM can obtain accurate quantized models efficiently. For example, QLLM quantizes the 4-bit LLaMA-2-70B within 10 hours on a single A100-80G GPU, outperforming the previous state-of-the-art method by 7.89% on the average accuracy across five zero-shot tasks.

4/9/2024

cs.CL cs.AI cs.LG

A Comprehensive Evaluation of Quantization Strategies for Large Language Models

Renren Jin, Jiangcun Du, Wuwei Huang, Wei Liu, Jian Luan, Bin Wang, Deyi Xiong

Increasing the number of parameters in large language models (LLMs) usually improves performance in downstream tasks but raises compute and memory costs, making deployment difficult in resource-limited settings. Quantization techniques, which reduce the bits needed for model weights or activations with minimal performance loss, have become popular due to the rise of LLMs. However, most quantization studies use pre-trained LLMs, and the impact of quantization on instruction-tuned LLMs and the relationship between perplexity and benchmark performance of quantized LLMs are not well understood. Evaluation of quantized LLMs is often limited to language modeling and a few classification tasks, leaving their performance on other benchmarks unclear. To address these gaps, we propose a structured evaluation framework consisting of three critical dimensions: (1) knowledge & capacity, (2) alignment, and (3) efficiency, and conduct extensive experiments across ten diverse benchmarks. Our experimental results indicate that LLMs with 4-bit quantization can retain performance comparable to their non-quantized counterparts, and perplexity can serve as a proxy metric for quantized LLMs on most benchmarks. Furthermore, quantized LLMs with larger parameter scales can outperform smaller LLMs. Despite the memory savings achieved through quantization, it can also slow down the inference speed of LLMs. Consequently, substantial engineering efforts and hardware support are imperative to achieve a balanced optimization of decoding speed and memory consumption in the context of quantized LLMs.

6/7/2024

cs.CL cs.AI