AdpQ: A Zero-shot Calibration Free Adaptive Post Training Quantization Method for LLMs

arXiv:2405.13358

Published 5/24/2024 by Alireza Ghaffari, Sharareh Younesian, Vahid Partovi Nia, Boxing Chen, Masoud Asgharian

Abstract

The ever-growing computational complexity of Large Language Models (LLMs) necessitates efficient deployment strategies. The current state-of-the-art approaches for Post-training Quantization (PTQ) often require calibration to achieve the desired accuracy. This paper presents AdpQ, a novel zero-shot adaptive PTQ method for LLMs that achieves the state-of-the-art performance in low-precision quantization (e.g. 3-bit) without requiring any calibration data. Inspired by Adaptive LASSO regression model, our proposed approach tackles the challenge of outlier activations by separating salient weights using an adaptive soft-thresholding method. Guided by Adaptive LASSO, this method ensures that the quantized weights distribution closely follows the originally trained weights and eliminates the need for calibration data entirely, setting our method apart from popular approaches such as SpQR and AWQ. Furthermore, our method offers an additional benefit in terms of privacy preservation by eliminating any calibration or training data. We also delve deeper into the information-theoretic underpinnings of the proposed method. We demonstrate that it leverages the Adaptive LASSO to minimize the Kullback-Leibler divergence between the quantized weights and the originally trained weights. This minimization ensures the quantized model retains the Shannon information content of the original model to a great extent, guaranteeing efficient deployment without sacrificing accuracy or information. Our results achieve the same accuracy as the existing methods on various LLM benchmarks while the quantization time is reduced by at least 10x, solidifying our contribution to efficient and privacy-preserving LLM deployment.

Overview

This paper presents a novel zero-shot adaptive post-training quantization (PTQ) method called AdpQ for efficient deployment of Large Language Models (LLMs).
AdpQ achieves state-of-the-art performance in low-precision quantization (e.g. 3-bit) without requiring any calibration data, unlike existing approaches.
The method is inspired by Adaptive LASSO regression and tackles the challenge of outlier activations by separating salient weights using an adaptive soft-thresholding technique.
AdpQ eliminates the need for calibration data entirely, offering improved privacy preservation compared to popular methods like SpQR and AWQ.

Plain English Explanation

Large language models (LLMs) have become incredibly powerful, but they also require a lot of computational resources to run. The authors of this paper have developed a new technique called AdpQ that can make these models much more efficient to deploy without sacrificing their accuracy.

The key insight is that when you "compress" an LLM by reducing the precision of its weights (e.g., from 32-bit floating point to just 3 bits), some weights matter more than others. AdpQ uses a technique inspired by Adaptive LASSO regression to identify these salient weights and preserve them, while compressing the less important weights.
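To make the idea of "preserving salient weights" concrete, here is a minimal sketch in plain Python. It is not the AdpQ algorithm: the min-max rounding scheme, the fixed magnitude `threshold`, the toy weight values, and the function names are all assumptions chosen for illustration. It only shows why keeping a few large weights in full precision can lower the 3-bit rounding error of the remaining ones.

```python
def quantize_uniform(weights, bits=3):
    """Round-to-nearest min-max quantization of a list of floats (illustrative)."""
    levels = 2 ** bits - 1
    lo, hi = min(weights), max(weights)
    if hi == lo:
        return list(weights)
    scale = (hi - lo) / levels
    return [lo + round((w - lo) / scale) * scale for w in weights]


def quantize_keeping_large(weights, threshold, bits=3):
    """Keep weights with |w| > threshold untouched; quantize the rest together.

    The fixed threshold is a hypothetical stand-in for a real saliency rule.
    """
    small = [w for w in weights if abs(w) <= threshold]
    small_q = iter(quantize_uniform(small, bits)) if small else iter([])
    return [w if abs(w) > threshold else next(small_q) for w in weights]


def mean_squared_error(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


# Toy weights: eight small values and one large outlier.
weights = [-0.31, -0.12, -0.05, 0.01, 0.04, 0.09, 0.17, 0.28, 4.0]

plain = quantize_uniform(weights)                       # outlier stretches the grid
split = quantize_keeping_large(weights, threshold=1.0)  # outlier kept separately

err_plain = mean_squared_error(weights, plain)
err_split = mean_squared_error(weights, split)
print(err_split < err_plain)
```

In the plain case the single outlier forces the eight quantization levels to span a wide range, so most small weights collapse onto one or two levels. Setting the outlier aside lets the same eight levels cover only the narrow range of the small weights.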

Importantly, AdpQ can do this without requiring any additional calibration data, which is a common limitation of other quantization techniques. This not only makes the quantization process faster, but it also helps preserve the privacy of the original training data, since no extra data needs to be collected.

The authors show that AdpQ can achieve the same accuracy as existing methods on various LLM benchmarks, while reducing the quantization time by at least 10x. This represents a significant improvement in the efficiency and privacy-preserving capabilities of LLM deployment.

Technical Explanation

The core contribution of this paper is the introduction of AdpQ, a novel zero-shot adaptive post-training quantization (PTQ) method for efficient deployment of Large Language Models (LLMs).

Existing state-of-the-art PTQ approaches, such as SpQR and AWQ, often require calibration data to achieve the desired accuracy. In contrast, AdpQ is a zero-shot method that does not rely on any calibration data, making it more privacy-preserving.

The key innovation of AdpQ is its use of Adaptive LASSO regression to tackle the challenge of outlier activations. By separating salient weights using an adaptive soft-thresholding technique, AdpQ ensures that the quantized weights distribution closely follows the originally trained weights, eliminating the need for calibration data.
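As a rough illustration of adaptive soft-thresholding, the sketch below applies the textbook Adaptive LASSO shrinkage rule for an orthonormal design, where each coefficient is penalized with weight 1/|w|^gamma. Treating the weights that survive the threshold as "salient" is an assumption made here for illustration; the summary does not spell out how AdpQ applies the rule, and `lam`, `gamma`, and the function names are hypothetical.

```python
def adaptive_soft_threshold(w, lam, gamma=1.0):
    """Adaptive LASSO shrinkage of one coefficient (orthonormal-design case).

    The threshold lam / |w|**gamma shrinks as |w| grows, so large weights are
    barely changed while small weights are set exactly to zero.
    """
    if w == 0.0:
        return 0.0
    threshold = lam / abs(w) ** gamma
    magnitude = max(abs(w) - threshold, 0.0)
    return magnitude if w > 0 else -magnitude


def split_salient(weights, lam, gamma=1.0):
    """Return (salient_indices, remaining_indices).

    Assumption: a weight counts as salient if it survives the shrinkage.
    """
    salient, rest = [], []
    for i, w in enumerate(weights):
        if adaptive_soft_threshold(w, lam, gamma) != 0.0:
            salient.append(i)
        else:
            rest.append(i)
    return salient, rest


weights = [-0.31, -0.12, -0.05, 0.01, 0.04, 0.09, 0.17, 0.28, 4.0]
salient, rest = split_salient(weights, lam=0.05)
print(salient, rest)
```

With `gamma=1.0`, a weight survives exactly when its square exceeds `lam`, so the split depends only on the weights themselves and needs no calibration inputs.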

The authors also provide an information-theoretic analysis of AdpQ, showing that it minimizes the Kullback-Leibler divergence between the quantized weights and the originally trained weights. This minimization guarantees that the quantized model retains the Shannon information content of the original model to a great extent, ensuring efficient deployment without sacrificing accuracy or information.
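The quantity discussed above can be estimated empirically. The sketch below computes the Kullback-Leibler divergence between histograms of two weight sets; the binning, the smoothing constant `eps`, and the toy values are assumptions for illustration, not the paper's analysis.

```python
import math


def histogram(values, edges):
    """Count values per bin; bins are [edges[k], edges[k+1]), last bin closed."""
    counts = [0] * (len(edges) - 1)
    for v in values:
        for k in range(len(counts)):
            is_last = k == len(counts) - 1
            if edges[k] <= v < edges[k + 1] or (is_last and v == edges[-1]):
                counts[k] += 1
                break
    return counts


def kl_divergence(p_counts, q_counts, eps=1e-9):
    """D_KL(P || Q) in nats between two histograms, with additive smoothing."""
    p_total = sum(p_counts) + eps * len(p_counts)
    q_total = sum(q_counts) + eps * len(q_counts)
    kl = 0.0
    for pc, qc in zip(p_counts, q_counts):
        p = (pc + eps) / p_total
        q = (qc + eps) / q_total
        kl += p * math.log(p / q)
    return kl


edges = [-1.0, -0.5, 0.0, 0.5, 1.0]
original = [-0.9, -0.4, -0.1, 0.2, 0.3, 0.8]
close = [-0.8, -0.4, -0.2, 0.2, 0.4, 0.8]  # roughly preserves the distribution
collapsed = [0.0] * 6                      # every weight mapped to one value

p = histogram(original, edges)
kl_close = kl_divergence(p, histogram(close, edges))
kl_collapsed = kl_divergence(p, histogram(collapsed, edges))
print(kl_close < kl_collapsed)
```

A quantizer whose output histogram stays close to the original one yields a small divergence, while one that collapses the weights onto a single value yields a large one.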

The paper presents extensive experiments on various LLM benchmarks, demonstrating that AdpQ achieves the same accuracy as existing methods while reducing the quantization time by at least 10x. This substantial improvement in efficiency and privacy-preserving capabilities makes AdpQ a valuable contribution to the field of efficient LLM deployment.

Critical Analysis

The paper presents a compelling solution to the challenge of efficiently deploying large language models without sacrificing accuracy or privacy. The authors' use of Adaptive LASSO regression to identify and preserve salient weights is a novel and promising approach.

One potential limitation of the study is the lack of evaluation on a broader range of LLM architectures and tasks. While the authors demonstrate the effectiveness of AdpQ on various benchmarks, it would be valuable to see how the method performs on a more diverse set of models and applications.

Additionally, the paper does not provide a detailed analysis of the computational and memory footprint of AdpQ compared to other PTQ methods. Understanding the trade-offs between the various approaches in terms of deployment efficiency would be helpful for practitioners looking to adopt these techniques.

Overall, the AdpQ method presents a promising step forward in the quest for efficient and privacy-preserving LLM deployment. The authors' use of Adaptive LASSO soft-thresholding to reach state-of-the-art accuracy without calibration data is an intriguing area for further exploration.

Conclusion

This paper introduces AdpQ, a novel zero-shot adaptive post-training quantization (PTQ) method that achieves state-of-the-art performance in low-precision quantization of Large Language Models (LLMs) without requiring any calibration data. By leveraging Adaptive LASSO regression to identify and preserve salient weights, AdpQ offers a significant improvement in efficiency and privacy-preserving capabilities compared to existing PTQ approaches.

The authors' information-theoretic analysis of the method, demonstrating its ability to retain the Shannon information content of the original model, provides a strong theoretical foundation for the observed empirical results. AdpQ's potential to enable efficient and privacy-preserving LLM deployment can have far-reaching implications for the broader adoption and societal impact of these powerful models.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

OAC: Output-adaptive Calibration for Accurate Post-training Quantization

Ali Edalati (Huawei Noah's Ark Lab), Alireza Ghaffari (Huawei Noah's Ark Lab, Department of Mathematics and Statistics, McGill University), Masoud Asgharian (Department of Mathematics and Statistics, McGill University), Lu Hou (Huawei Noah's Ark Lab), Boxing Chen (Huawei Noah's Ark Lab), Vahid Partovi Nia (Huawei Noah's Ark Lab)

Deployment of Large Language Models (LLMs) has major computational costs, due to their rapidly expanding size. Compression of LLMs reduces the memory footprint, latency, and energy required for their inference. Post-training Quantization (PTQ) techniques have been developed to compress LLMs while avoiding expensive re-training. Most PTQ approaches formulate the quantization error based on a layer-wise $\ell_2$ loss, ignoring the model output. Then, each layer is calibrated using its layer-wise Hessian to update the weights towards minimizing the $\ell_2$ quantization error. The Hessian is also used for detecting the weights most sensitive to quantization. Such PTQ approaches are prone to accuracy drop in low-precision quantization. We propose Output-adaptive Calibration (OAC) to incorporate the model output in the calibration process. We formulate the quantization error based on the distortion of the output cross-entropy loss. OAC approximates the output-adaptive Hessian for each layer under reasonable assumptions to reduce the computational complexity. The output-adaptive Hessians are used to update the weight matrices and detect the salient weights towards maintaining the model output. Our proposed method outperforms the state-of-the-art baselines such as SpQR and BiLLM, especially at extreme low-precision (2-bit and binary) quantization.

5/27/2024

cs.LG cs.CL

APTQ: Attention-aware Post-Training Mixed-Precision Quantization for Large Language Models

Ziyi Guan, Hantao Huang, Yupeng Su, Hong Huang, Ngai Wong, Hao Yu

Large Language Models (LLMs) have greatly advanced the natural language processing paradigm. However, the high computational load and huge model sizes pose a grand challenge for deployment on edge devices. To this end, we propose APTQ (Attention-aware Post-Training Mixed-Precision Quantization) for LLMs, which considers not only the second-order information of each layer's weights, but also, for the first time, the nonlinear effect of attention outputs on the entire model. We leverage the Hessian trace as a sensitivity metric for mixed-precision quantization, ensuring an informed precision reduction that retains model performance. Experiments show APTQ surpasses previous quantization methods, achieving a perplexity of 5.22 on the C4 dataset at an average bit width of 4, nearly equivalent to full precision. In addition, APTQ attains state-of-the-art zero-shot accuracy of 68.24% and 70.48% at an average bit width of 3.8 on LLaMa-7B and LLaMa-13B, respectively, demonstrating its effectiveness in producing high-quality quantized LLMs.

4/17/2024

cs.LG cs.AI cs.CL

QLLM: Accurate and Efficient Low-Bitwidth Quantization for Large Language Models

Jing Liu, Ruihao Gong, Xiuying Wei, Zhiwei Dong, Jianfei Cai, Bohan Zhuang

Large Language Models (LLMs) excel in NLP, but their demands hinder their widespread deployment. While Quantization-Aware Training (QAT) offers a solution, its extensive training costs make Post-Training Quantization (PTQ) a more practical approach for LLMs. In existing studies, activation outliers in particular channels are identified as the bottleneck to PTQ accuracy. They propose to transform the magnitudes from activations to weights, which however offers limited alleviation or suffers from unstable gradients, resulting in a severe performance drop at low-bitwidth. In this paper, we propose QLLM, an accurate and efficient low-bitwidth PTQ method designed for LLMs. QLLM introduces an adaptive channel reassembly technique that reallocates the magnitude of outliers to other channels, thereby mitigating their impact on the quantization range. This is achieved by channel disassembly and channel assembly, which first breaks down the outlier channels into several sub-channels to ensure a more balanced distribution of activation magnitudes. Then similar channels are merged to maintain the original channel number for efficiency. Additionally, an adaptive strategy is designed to autonomously determine the optimal number of sub-channels for channel disassembly. To further compensate for the performance loss caused by quantization, we propose an efficient tuning method that only learns a small number of low-rank weights while freezing the pre-trained quantized model. After training, these low-rank parameters can be fused into the frozen weights without affecting inference. Extensive experiments on LLaMA-1 and LLaMA-2 show that QLLM can obtain accurate quantized models efficiently. For example, QLLM quantizes the 4-bit LLaMA-2-70B within 10 hours on a single A100-80G GPU, outperforming the previous state-of-the-art method by 7.89% on the average accuracy across five zero-shot tasks.

4/9/2024

cs.CL cs.AI cs.LG

Outliers and Calibration Sets have Diminishing Effect on Quantization of Modern LLMs

Davide Paglieri, Saurabh Dash, Tim Rocktäschel, Jack Parker-Holder

Post-Training Quantization (PTQ) enhances the efficiency of Large Language Models (LLMs) by enabling faster operation and compatibility with more accessible hardware through reduced memory usage, at the cost of small performance drops. We explore the role of calibration sets in PTQ, specifically their effect on hidden activations in various notable open-source LLMs. Calibration sets are crucial for evaluating activation magnitudes and identifying outliers, which can distort the quantization range and negatively impact performance. Our analysis reveals a marked contrast in quantization effectiveness across models. The older OPT model, upon which much of the quantization literature is based, shows significant performance deterioration and high susceptibility to outliers with varying calibration sets. In contrast, newer models like Llama-2 7B, Llama-3 8B, Command-R 35B, and Mistral 7B demonstrate strong robustness, with Mistral 7B showing near-immunity to outliers and stable activations. These findings suggest a shift in PTQ strategies might be needed. As advancements in pre-training methods reduce the relevance of outliers, there is an emerging need to reassess the fundamentals of current quantization literature. The emphasis should pivot towards optimizing inference speed, rather than primarily focusing on outlier preservation, to align with the evolving characteristics of state-of-the-art LLMs.

6/6/2024

cs.LG cs.AI cs.CL