Optimize Weight Rounding via Signed Gradient Descent for the Quantization of LLMs

Read original: arXiv:2309.05516 - Published 5/24/2024 by Wenhua Cheng, Weiwei Zhang, Haihao Shen, Yiyang Cai, Xin He, Kaokao Lv, Yi Liu

Overview

Large Language Models (LLMs) excel at language-related tasks but require substantial memory and storage.
Weight-only quantization is a promising way to address this challenge.
Previous research has shown that fine-tuning the choice between rounding each weight up or down can enhance performance.
This study introduces SignRound, a method that uses signed gradient descent (SignSGD) to optimize rounding values and weight clipping within 200 steps, combining the strengths of Quantization-Aware Training (QAT) and Post-Training Quantization (PTQ).
SignRound achieves outstanding results compared to recent methods across 2 to 4 bits, while keeping tuning costs low and introducing no additional inference overhead.

Plain English Explanation

Large Language Models (LLMs) are a type of artificial intelligence that are exceptionally good at language-related tasks, such as understanding and generating human-like text. However, these models require a significant amount of memory and storage space to run, which can be a challenge for many real-world applications.
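As a rough illustration of the memory pressure described above, the sketch below estimates the storage needed for the weights alone at different bit widths. The 7-billion-parameter count is a hypothetical example rather than a figure from the paper, and the estimate deliberately ignores quantization metadata (scales, zero points), activations, and any runtime caches.

```python
def weight_memory_gb(num_params: int, bits_per_weight: int) -> float:
    """Approximate storage for the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    total_bits = num_params * bits_per_weight
    return total_bits / 8 / 1e9


if __name__ == "__main__":
    num_params = 7_000_000_000  # hypothetical model size, chosen for illustration
    for bits in (16, 4, 3, 2):
        print(f"{bits:>2}-bit weights: {weight_memory_gb(num_params, bits):.2f} GB")
```

For this hypothetical size, 16-bit weights occupy about 14 GB, while 4-bit weights occupy about 3.5 GB before any quantization metadata is counted.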

To address this issue, the researchers in this study have developed a new method called SignRound. SignRound uses a technique called "signed gradient descent" to quickly optimize the rounding values and weight clipping of the LLM, which helps to reduce the overall memory and storage requirements without significantly impacting the model's performance.

Compared to other recent methods, SignRound achieves outstanding results across different bit sizes, with absolute average accuracy improvements ranging from 6.91% to 33.22% at 2 bits. Additionally, SignRound demonstrates robust generalization to various recent models and can achieve near-lossless quantization in most scenarios at 4 bits.

The key innovation of SignRound is its ability to combine the strengths of two different quantization techniques, Quantization-Aware Training (QAT) and Post-Training Quantization (PTQ), while only requiring a relatively small number of tuning steps (200) to achieve these impressive results.

Technical Explanation

This study introduces a method called SignRound, which utilizes signed gradient descent (SignSGD) to optimize rounding values and weight clipping for weight-only quantization of large language models (LLMs).
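One way to write down what "optimizing rounding values and weight clipping" can mean is sketched below. The symbols are assumptions introduced here for illustration and are not taken from this summary: V is a learnable per-weight rounding offset, alpha and beta are learnable clipping factors, s is the quantization scale, n and m are the lowest and highest integer levels, and eta is the step size. Zero points are omitted for brevity, so this should be read as a simplified formulation rather than the paper's exact definition.

```latex
% Quantize-dequantize with a learnable rounding offset V (assumed notation)
\tilde{W} = s \cdot \operatorname{clip}\!\left( \left\lfloor \frac{W}{s} + V \right\rceil,\; n,\; m \right),
\qquad V \in [-0.5,\, 0.5]

% Scale with learnable clipping factors \alpha, \beta (assumed notation)
s = \frac{\alpha \max(W) - \beta \min(W)}{2^{\text{bit}} - 1},
\qquad \alpha, \beta \in [0, 1]

% Signed gradient descent step on a tunable parameter \theta \in \{V, \alpha, \beta\}
\theta_{t+1} = \theta_t - \eta \cdot \operatorname{sign}\!\left( \frac{\partial \mathcal{L}}{\partial \theta_t} \right)
```

Because V is bounded to half a quantization step, the offset can only flip a weight between rounding up and rounding down, and using only the sign of the gradient gives every parameter the same step size per iteration.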

The researchers designed SignRound to combine the strengths of Quantization-Aware Training (QAT) and Post-Training Quantization (PTQ). QAT simulates quantization while the model is trained or fine-tuned, so the weights can adapt to the reduced precision, whereas PTQ quantizes an already-trained model without end-to-end retraining.

SignRound optimizes the rounding values and weight clipping within just 200 steps, significantly reducing the tuning cost compared to previous methods. The researchers found that this approach can achieve outstanding results across 2 to 4 bits, with absolute average accuracy improvements ranging from 6.91% to 33.22% at 2 bits, while maintaining low tuning costs and without introducing any additional inference overhead.
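The tuning loop described above can be sketched for a single output neuron in plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: it tunes only the rounding offsets (weight clipping is left out), uses a symmetric per-vector scale, a constant step size of 1/steps, a straight-through estimator for the rounding operation, and an output mean-squared-error objective. The function names (`quantize`, `output_mse`, `tune_rounding`) and the random data are hypothetical.

```python
import random


def quantize(w, v, scale, qmin, qmax):
    """Quantize-dequantize: scale * clip(round(w / scale + v), qmin, qmax)."""
    out = []
    for wi, vi in zip(w, v):
        q = round(wi / scale + vi)  # Python rounds exact halves to the even integer
        q = max(qmin, min(qmax, q))
        out.append(q * scale)
    return out


def output_mse(x_rows, w_ref, w_q):
    """Mean squared error between the outputs x . w_ref and x . w_q."""
    total = 0.0
    for x in x_rows:
        diff = sum(xi * (a - b) for xi, a, b in zip(x, w_q, w_ref))
        total += diff * diff
    return total / len(x_rows)


def sign(g):
    """Return -1, 0, or 1 according to the sign of g."""
    return (g > 0) - (g < 0)


def tune_rounding(x_rows, w, bits=4, steps=200):
    """Tune rounding offsets v in [-0.5, 0.5] with signed gradient descent."""
    qmax = 2 ** (bits - 1) - 1
    qmin = -(2 ** (bits - 1))
    scale = max(abs(wi) for wi in w) / qmax  # assumed symmetric scale
    lr = 1.0 / steps                         # assumed constant step size
    v = [0.0] * len(w)                       # v = 0 is plain round-to-nearest
    rtn_loss = output_mse(x_rows, w, quantize(w, v, scale, qmin, qmax))
    best_v, best_loss = list(v), rtn_loss
    for _ in range(steps):
        w_q = quantize(w, v, scale, qmin, qmax)
        # Gradient of the output MSE with respect to the dequantized weights.
        grad = [0.0] * len(w)
        for x in x_rows:
            diff = sum(xi * (a - b) for xi, a, b in zip(x, w_q, w))
            for j, xj in enumerate(x):
                grad[j] += 2.0 * diff * xj / len(x_rows)
        # Straight-through estimator: d w_q / d v is treated as scale > 0,
        # so the sign of the gradient with respect to v equals sign(grad).
        v = [max(-0.5, min(0.5, vj - lr * sign(gj))) for vj, gj in zip(v, grad)]
        loss = output_mse(x_rows, w, quantize(w, v, scale, qmin, qmax))
        if loss < best_loss:  # keep the best offsets seen so far
            best_loss, best_v = loss, list(v)
    return best_v, rtn_loss, best_loss


if __name__ == "__main__":
    random.seed(0)
    w = [random.uniform(-1.0, 1.0) for _ in range(8)]
    x_rows = [[random.gauss(0.0, 1.0) for _ in range(8)] for _ in range(32)]
    _, rtn_loss, tuned_loss = tune_rounding(x_rows, w, bits=3)
    print(f"round-to-nearest MSE: {rtn_loss:.6f}, tuned MSE: {tuned_loss:.6f}")
```

Because the loop starts from round-to-nearest and keeps the best offsets it has seen, the returned loss can never be worse than the round-to-nearest baseline. The tuned offsets only change which integer each weight maps to, so the resulting model has the same format as any other quantized model, which is consistent with the claim of no additional inference overhead.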

The researchers also demonstrate that SignRound is able to generalize robustly to various recent models, and it can achieve near-lossless quantization in most scenarios at 4 bits.

Critical Analysis

The researchers have presented a promising approach to address the substantial memory and storage requirements of large language models (LLMs) through weight-only quantization. The SignRound method they introduced combines the strengths of Quantization-Aware Training (QAT) and Post-Training Quantization (PTQ) techniques, which is a novel contribution.

One potential limitation of the study is that it focuses primarily on the quantization of LLMs and does not explore the implications of this approach for other types of neural networks or machine learning models. It would be valuable to see if the SignRound method can be effectively applied to a broader range of models and applications.

Additionally, while the researchers have demonstrated the effectiveness of SignRound across various bit sizes and models, it would be helpful to have a more in-depth analysis of the tradeoffs and potential drawbacks of this approach. For example, how does SignRound compare to other quantization techniques in terms of inference speed, memory footprint, and energy efficiency?

Overall, the SignRound method presented in this study is a promising contribution to the field of model compression and deployment, and the researchers have provided a solid foundation for further exploration and refinement of this technique.

Conclusion

This study has introduced SignRound, a novel method for weight-only quantization of large language models (LLMs) that combines the strengths of Quantization-Aware Training (QAT) and Post-Training Quantization (PTQ) techniques. SignRound is able to achieve outstanding results across 2 to 4 bits, with significant accuracy improvements compared to recent methods, while maintaining low tuning costs and without introducing additional inference overhead.

The researchers have demonstrated the robustness and generalization capabilities of SignRound, showcasing its potential to enable the efficient deployment of LLMs in real-world applications. This work represents an important step forward in addressing the substantial memory and storage requirements of these powerful language models, and the source code is publicly available at https://github.com/intel/auto-round.

Related Papers

Optimize Weight Rounding via Signed Gradient Descent for the Quantization of LLMs

Wenhua Cheng, Weiwei Zhang, Haihao Shen, Yiyang Cai, Xin He, Kaokao Lv, Yi Liu

Large Language Models (LLMs) have demonstrated exceptional proficiency in language-related tasks, but their deployment poses significant challenges due to substantial memory and storage requirements. Weight-only quantization has emerged as a promising solution to address these challenges. Previous research suggests that fine-tuning through up and down rounding can enhance performance. In this study, we introduce SignRound, a method that utilizes signed gradient descent (SignSGD) to optimize rounding values and weight clipping within just 200 steps. SignRound integrates the advantages of Quantization-Aware Training (QAT) and Post-Training Quantization (PTQ), achieving exceptional results across 2 to 4 bits while maintaining low tuning costs and avoiding additional inference overhead. For example, SignRound achieves absolute average accuracy improvements ranging from 6.91% to 33.22% at 2 bits. It also demonstrates robust generalization to recent models and achieves near-lossless quantization in most scenarios at 4 bits. The source code is publicly available at https://github.com/intel/auto-round.

5/24/2024

FlexRound: Learnable Rounding based on Element-wise Division for Post-Training Quantization

Jung Hyun Lee, Jeonghoon Kim, Se Jung Kwon, Dongsoo Lee

Post-training quantization (PTQ) has been gaining popularity for the deployment of deep neural networks on resource-limited devices since unlike quantization-aware training, neither a full training dataset nor end-to-end training is required at all. As PTQ schemes based on reconstructing each layer or block output turn out to be effective to enhance quantized model performance, recent works have developed algorithms to devise and learn a new weight-rounding scheme so as to better reconstruct each layer or block output. In this work, we propose a simple yet effective new weight-rounding mechanism for PTQ, coined FlexRound, based on element-wise division instead of typical element-wise addition such that FlexRound enables jointly learning a common quantization grid size as well as a different scale for each pre-trained weight. Thanks to the reciprocal rule of derivatives induced by element-wise division, FlexRound is inherently able to exploit pre-trained weights when updating their corresponding scales, and thus, flexibly quantize pre-trained weights depending on their magnitudes. We empirically validate the efficacy of FlexRound on a wide range of models and tasks. To the best of our knowledge, our work is the first to carry out comprehensive experiments on not only image classification and natural language understanding but also natural language generation. Moreover, we demonstrate, for the first time, that large language models can be efficiently quantized, with only a negligible impact on performance compared to half-precision baselines, achieved by reconstructing the output in a block-by-block manner. Our code is available at https://github.com/onliwad101/FlexRound_LRQ.

7/17/2024

Enhancing Computation Efficiency in Large Language Models through Weight and Activation Quantization

Janghwan Lee, Minsoo Kim, Seungcheol Baek, Seok Joong Hwang, Wonyong Sung, Jungwook Choi

Large Language Models (LLMs) are proficient in natural language processing tasks, but their deployment is often restricted by extensive parameter sizes and computational demands. This paper focuses on post-training quantization (PTQ) in LLMs, specifically 4-bit weight and 8-bit activation (W4A8) quantization, to enhance computational efficiency, a topic less explored compared to weight-only quantization. We present two innovative techniques: activation-quantization-aware scaling (AQAS) and sequence-length-aware calibration (SLAC) to enhance PTQ by considering the combined effects on weights and activations and aligning calibration sequence lengths to target tasks. Moreover, we introduce dINT, a hybrid data format combining integer and denormal representations, to address the underflow issue in W4A8 quantization, where small values are rounded to zero. Through rigorous evaluations of LLMs, including OPT and LLaMA, we demonstrate that our techniques significantly boost task accuracies to levels comparable with full-precision models. By developing arithmetic units compatible with dINT, we further confirm that our methods yield a 2× hardware efficiency improvement compared to 8-bit integer MAC unit.

7/19/2024

LeanQuant: Accurate Large Language Model Quantization with Loss-Error-Aware Grid

Tianyi Zhang, Anshumali Shrivastava

Large language models (LLMs) have numerous applications across various domains, but their high computational and memory demands pose significant deployment challenges. Weight quantization is an effective technique for reducing the decoding latency and memory requirements of LLMs. Existing approaches primarily aim to maintain the quality of quantized models by preserving outliers in input features, but they still suffer significant quality loss at lower bit widths. Our approach builds on Optimal Brain Quantization (OBQ), an iterative weight-update-based quantization framework. We identify a key limitation of OBQ, specifically that its uniform quantization grid is suboptimal for maintaining model quality, as it introduces large errors to the task loss. To address this, we propose LeanQuant, which learns a loss-error-aware quantization grid by leveraging the inverse diagonal Hessian. Extensive empirical evaluations demonstrate that LeanQuant is both efficient and accurate; it can quantize a 70-billion-parameter model in 6 hours using a single 32GB GPU and performs favorably compared to competitive baselines in the 4-bit, 3-bit, and 2-bit regions.

7/16/2024