FlexRound: Learnable Rounding based on Element-wise Division for Post-Training Quantization

Read original: arXiv:2306.00317 - Published 7/17/2024 by Jung Hyun Lee, Jeonghoon Kim, Se Jung Kwon, Dongsoo Lee

Overview

This paper proposes a new weight-rounding mechanism called "FlexRound" for post-training quantization (PTQ) of deep neural networks.
PTQ allows for efficient deployment of deep learning models on resource-limited devices without requiring a full training dataset or end-to-end retraining.
The FlexRound method uses element-wise division instead of typical element-wise addition to jointly learn a common quantization grid size and a different scale for each pre-trained weight.
The authors demonstrate the effectiveness of FlexRound on a wide range of models and tasks, including image classification, natural language understanding, and natural language generation.
They also show that large language models can be efficiently quantized with only a negligible impact on performance by reconstructing the output in a block-by-block manner.

Plain English Explanation

Deep neural networks are powerful, but they are also computationally expensive and memory-hungry, which makes them hard to deploy on devices with limited processing power or memory, such as smartphones or embedded systems. Post-training quantization (PTQ) addresses this by reducing the numerical precision of an already-trained network's weights and activations, without needing a full training dataset or end-to-end retraining.

In this paper, the researchers propose a new weight-rounding mechanism called "FlexRound" for PTQ. Earlier learnable-rounding methods typically add a learned adjustment to each weight as it is snapped to the quantization grid. FlexRound instead divides each weight by a learned scale before rounding. This lets the method jointly learn a common quantization grid size and a separate scale for each pre-trained weight, so that each weight can be rounded flexibly depending on its magnitude.

The advantage of this approach is that it can more effectively reconstruct the output of each layer or block in the neural network, which is crucial for maintaining the model's performance after quantization. The researchers show that FlexRound works well across a wide range of models and tasks, including image classification, natural language understanding, and even natural language generation.

Importantly, the researchers also demonstrate that large language models, which are often extremely computationally expensive, can be efficiently quantized using FlexRound with only a small impact on their performance. This is a significant finding, as it could enable the deployment of these powerful models on a much wider range of devices.

Technical Explanation

The key contribution of this work is a new weight-rounding mechanism for post-training quantization (PTQ) called "FlexRound". Like other recent PTQ schemes, FlexRound learns its rounding by reconstructing each layer or block output. Unlike rounding schemes based on element-wise addition, it uses element-wise division to jointly learn a common quantization grid size and a different scale for each pre-trained weight.
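To make the contrast between addition and division concrete, here is a minimal NumPy sketch. It is a simplification under stated assumptions, not the authors' implementation: `s1` stands for the common grid size, `S` for a single per-weight scale (the paper may factor the scale into more terms), and `additive_rounding` is an assumed stand-in for the "typical element-wise addition" schemes. The function names, the signed 4-bit range, and the toy values are all illustrative.

```python
import numpy as np

def int_range(bits):
    # Signed integer range for a given bit width (an assumption of this sketch).
    return -(2 ** (bits - 1)), 2 ** (bits - 1) - 1

def round_to_nearest(W, s1, bits=4):
    # Baseline: snap each weight to the nearest point on a grid of size s1.
    qmin, qmax = int_range(bits)
    return s1 * np.clip(np.round(W / s1), qmin, qmax)

def additive_rounding(W, s1, V, bits=4):
    # Addition-based scheme: round down, then add a per-weight offset V in [0, 1].
    qmin, qmax = int_range(bits)
    return s1 * np.clip(np.floor(W / s1) + V, qmin, qmax)

def flexround(W, s1, S, bits=4):
    # Division-based scheme: divide each weight by its own scale before rounding.
    qmin, qmax = int_range(bits)
    return s1 * np.clip(np.round(W / (s1 * S)), qmin, qmax)

W = np.array([[0.26, -0.74]])   # toy "pre-trained" weights
s1 = 0.1                        # common grid size
S = np.array([[1.05, 0.98]])    # hypothetical learned per-weight scales

print(round_to_nearest(W, s1))            # approximately [[0.3, -0.7]]
print(flexround(W, s1, np.ones_like(W)))  # S = 1 reproduces round-to-nearest
print(flexround(W, s1, S))                # approximately [[0.2, -0.8]]
```

With every scale equal to 1, the division-based form reduces to plain round-to-nearest, so learning can start from that baseline and move individual weights to a neighbouring grid point only when doing so helps.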

The authors argue that this approach is more effective at reconstructing the output of each layer or block, which is crucial for maintaining the performance of the quantized model. The reciprocal rule of derivatives induced by element-wise division allows FlexRound to exploit pre-trained weights when updating their corresponding scales, enabling it to flexibly quantize the weights based on their magnitudes.
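The role of the reciprocal rule can be written out in a short derivation. This is a sketch under assumptions: $s_1$ denotes the common grid size, $S$ collects the per-weight scale into a single symbol, all operations are element-wise, and the rounding operator is treated as the identity in the backward pass (a straight-through estimator), which is a common convention assumed here.

```latex
\widehat{W} = s_1 \left\lfloor \frac{W}{s_1 \odot S} \right\rceil,
\qquad
\frac{\partial \widehat{W}}{\partial S}
  \approx s_1 \cdot \frac{\partial}{\partial S}\!\left( \frac{W}{s_1 \odot S} \right)
  = -\frac{W}{S^{2}}
```

Under these assumptions the gradient with respect to each scale is proportional to the pre-trained weight itself, so weights of larger magnitude receive larger scale updates. An additive offset would have a gradient that does not depend on $W$ in this way.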

To validate the effectiveness of FlexRound, the authors conduct comprehensive experiments on a wide range of models and tasks, including image classification, natural language understanding, and natural language generation. They show that FlexRound outperforms other PTQ methods across these diverse domains.

Notably, the authors also demonstrate, for the first time, that large language models can be efficiently quantized with only a negligible impact on performance compared to half-precision baselines. This is achieved by reconstructing the output in a block-by-block manner, that is, by fitting each quantized block so that its output matches the output of the corresponding full-precision block.
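The reconstruction objective can be sketched on a toy scale. The example below is illustrative only and makes several assumptions: a single linear layer stands in for a block, the grid size `s1` is held fixed rather than learned jointly, the rounding gradient uses a straight-through estimator, and plain gradient descent replaces whatever optimizer the authors use. `learn_scales` and its parameters are hypothetical names.

```python
import numpy as np

def fake_quant(W, s1, S, qmin, qmax):
    # Division-based rounding: s1 * clip(round(W / (s1 * S))).
    return s1 * np.clip(np.round(W / (s1 * S)), qmin, qmax)

def recon_loss(W_hat, W, X):
    # Mean squared error between quantized and full-precision layer outputs.
    return float(np.mean((X @ W_hat.T - X @ W.T) ** 2))

def learn_scales(W, X, bits=4, steps=300, lr=1e-2):
    qmin, qmax = -(2 ** (bits - 1)), 2 ** (bits - 1) - 1
    s1 = np.abs(W).max() / qmax        # fixed here (assumption of this sketch)
    S = np.ones_like(W)                # S = 1 is plain round-to-nearest
    rtn_loss = recon_loss(fake_quant(W, s1, S, qmin, qmax), W, X)
    best_loss, best_S = rtn_loss, S.copy()
    for _ in range(steps):
        W_hat = fake_quant(W, s1, S, qmin, qmax)
        err = X @ W_hat.T - X @ W.T                 # shape (N, out)
        grad_W_hat = 2.0 * err.T @ X / err.size     # dLoss/dW_hat, shape (out, in)
        pre = W / (s1 * S)
        inside = (pre >= qmin) & (pre <= qmax)      # no gradient where clipped
        grad_S = grad_W_hat * (-W / S ** 2) * inside  # straight-through estimate
        S = np.maximum(S - lr * grad_S, 1e-3)       # keep scales positive
        loss = recon_loss(fake_quant(W, s1, S, qmin, qmax), W, X)
        if loss < best_loss:                        # keep the best scales seen
            best_loss, best_S = loss, S.copy()
    return s1, best_S, rtn_loss, best_loss

rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(8, 16))   # toy full-precision weights
X = rng.normal(size=(64, 16))             # toy calibration inputs
s1, S, rtn_loss, learned_loss = learn_scales(W, X)
print(rtn_loss, learned_loss)
```

Because the loop keeps the best scales it has seen, the returned loss is never worse than the round-to-nearest starting point. For a real model the same loop would run one block at a time, with only a small calibration set rather than the full training data.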

Critical Analysis

The paper presents a novel and promising approach to post-training quantization of deep neural networks. The key strength of the FlexRound method is its ability to jointly learn a common quantization grid size and unique scales for each pre-trained weight, which allows it to effectively reconstruct the output of each layer or block.

One potential limitation of the work is that it does not explore the performance of FlexRound on more complex or specialized tasks, such as medical image analysis or multi-modal learning. Additionally, the authors do not provide a detailed analysis of the computational overhead or memory footprint of the FlexRound method compared to other PTQ approaches.

Future research could investigate the scalability of FlexRound to extremely large models, such as the latest generation of transformer-based language models. Exploring the integration of FlexRound with other PTQ techniques, such as low-rank quantization, could also lead to further performance improvements.

Overall, the FlexRound method presented in this paper represents a significant step forward in the field of post-training quantization and has the potential to enable the efficient deployment of deep learning models on resource-constrained devices.

Conclusion

This paper introduces a novel weight-rounding mechanism called "FlexRound" for post-training quantization of deep neural networks. By using element-wise division instead of element-wise addition, FlexRound is able to jointly learn a common quantization grid size and unique scales for each pre-trained weight, allowing it to effectively reconstruct the output of each layer or block.

The authors demonstrate the effectiveness of FlexRound across a wide range of models and tasks, including image classification, natural language understanding, and natural language generation. Notably, they show that FlexRound can be used to efficiently quantize large language models with only a negligible impact on performance, a significant finding that could enable the deployment of these powerful models on a much wider range of devices.

Overall, the FlexRound method presented in this paper represents an important advancement in the field of post-training quantization and has the potential to play a key role in the future deployment of deep learning models on resource-limited devices.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

FlexRound: Learnable Rounding based on Element-wise Division for Post-Training Quantization

Jung Hyun Lee, Jeonghoon Kim, Se Jung Kwon, Dongsoo Lee

Post-training quantization (PTQ) has been gaining popularity for the deployment of deep neural networks on resource-limited devices since unlike quantization-aware training, neither a full training dataset nor end-to-end training is required at all. As PTQ schemes based on reconstructing each layer or block output turn out to be effective to enhance quantized model performance, recent works have developed algorithms to devise and learn a new weight-rounding scheme so as to better reconstruct each layer or block output. In this work, we propose a simple yet effective new weight-rounding mechanism for PTQ, coined FlexRound, based on element-wise division instead of typical element-wise addition such that FlexRound enables jointly learning a common quantization grid size as well as a different scale for each pre-trained weight. Thanks to the reciprocal rule of derivatives induced by element-wise division, FlexRound is inherently able to exploit pre-trained weights when updating their corresponding scales, and thus, flexibly quantize pre-trained weights depending on their magnitudes. We empirically validate the efficacy of FlexRound on a wide range of models and tasks. To the best of our knowledge, our work is the first to carry out comprehensive experiments on not only image classification and natural language understanding but also natural language generation. Moreover, we demonstrate, for the first time, that large language models can be efficiently quantized, with only a negligible impact on performance compared to half-precision baselines, achieved by reconstructing the output in a block-by-block manner. Our code is available at https://github.com/onliwad101/FlexRound_LRQ.

7/17/2024

Optimize Weight Rounding via Signed Gradient Descent for the Quantization of LLMs

Wenhua Cheng, Weiwei Zhang, Haihao Shen, Yiyang Cai, Xin He, Kaokao Lv, Yi Liu

Large Language Models (LLMs) have demonstrated exceptional proficiency in language-related tasks, but their deployment poses significant challenges due to substantial memory and storage requirements. Weight-only quantization has emerged as a promising solution to address these challenges. Previous research suggests that fine-tuning through up and down rounding can enhance performance. In this study, we introduce SignRound, a method that utilizes signed gradient descent (SignSGD) to optimize rounding values and weight clipping within just 200 steps. SignRound integrates the advantages of Quantization-Aware Training (QAT) and Post-Training Quantization (PTQ), achieving exceptional results across 2 to 4 bits while maintaining low tuning costs and avoiding additional inference overhead. For example, SignRound achieves absolute average accuracy improvements ranging from 6.91% to 33.22% at 2 bits. It also demonstrates robust generalization to recent models and achieves near-lossless quantization in most scenarios at 4 bits. The source code is publicly available at https://github.com/intel/auto-round.

5/24/2024

LRQ: Optimizing Post-Training Quantization for Large Language Models by Learning Low-Rank Weight-Scaling Matrices

Jung Hyun Lee, Jeonghoon Kim, June Yong Yang, Se Jung Kwon, Eunho Yang, Kang Min Yoo, Dongsoo Lee

With the commercialization of large language models (LLMs), weight-activation quantization has emerged to compress and accelerate LLMs, achieving high throughput while reducing inference costs. However, existing post-training quantization (PTQ) techniques for quantizing weights and activations of LLMs still suffer from non-negligible accuracy drops, especially on massive multitask language understanding. To address this issue, we propose Low-Rank Quantization (LRQ), a simple yet effective post-training weight quantization method for LLMs that reconstructs the outputs of an intermediate Transformer block by leveraging low-rank weight-scaling matrices, replacing the conventional full weight-scaling matrices that entail as many learnable scales as their associated weights. Thanks to parameter sharing via low-rank structure, LRQ only needs to learn significantly fewer parameters while enabling the individual scaling of weights, thus boosting the generalization capability of quantized LLMs. We show the superiority of LRQ over prior LLM PTQ works under (i) 8-bit weight and per-tensor activation quantization, (ii) 4-bit weight and 8-bit per-token activation quantization, and (iii) low-bit weight-only quantization schemes. Our code is available at https://github.com/onliwad101/FlexRound_LRQ to inspire LLM researchers and engineers.

7/17/2024

CDQuant: Accurate Post-training Weight Quantization of Large Pre-trained Models using Greedy Coordinate Descent

Pranav Ajit Nair, Arun Sai Suggala

Large language models (LLMs) have recently demonstrated remarkable performance across diverse language tasks. But their deployment is often constrained by their substantial computational and storage requirements. Quantization has emerged as a key technique for addressing this challenge, enabling the compression of large models with minimal impact on performance. The recent GPTQ algorithm, a post-training quantization (PTQ) method, has proven highly effective for compressing LLMs, sparking a wave of research that leverages GPTQ as a core component. Recognizing the pivotal role of GPTQ in the PTQ landscape, we introduce CDQuant, a simple and scalable alternative to GPTQ with improved performance. CDQuant uses coordinate descent to minimize the layer-wise reconstruction loss to achieve high-quality quantized weights. Our algorithm is easy to implement and scales efficiently to models with hundreds of billions of parameters. Through extensive evaluation on the PaLM2 model family, we demonstrate that CDQuant consistently outperforms GPTQ across diverse model sizes and quantization levels. In particular, for INT2 quantization of PaLM2-Otter, CDQuant achieves a 10% reduction in perplexity compared to GPTQ.

6/27/2024