2DQuant: Low-bit Post-Training Quantization for Image Super-Resolution

arXiv: 2406.06649

Published 6/12/2024 by Kai Liu, Haotong Qin, Yong Guo, Xin Yuan, Linghe Kong, Guihai Chen, Yulun Zhang


Abstract

Low-bit quantization has become widespread for compressing image super-resolution (SR) models for edge deployment: it gives advanced SR models compact low-bit parameters for storage compression and efficient integer/bitwise operations for inference acceleration. However, low-bit quantization is notorious for degrading the accuracy of SR models compared to their full-precision (FP) counterparts. Despite several efforts to alleviate this degradation, transformer-based SR models still suffer severe degradation because of their distinctive activation distributions. In this work, we present a dual-stage low-bit post-training quantization (PTQ) method for image super-resolution, namely 2DQuant, which achieves efficient and accurate SR under low-bit quantization. We first investigate the weights and activations and find that their distributions are characterized by coexisting symmetry and asymmetry and by long tails. Accordingly, we propose Distribution-Oriented Bound Initialization (DOBI), which uses different search strategies to find a coarse bound for each quantizer. To obtain refined quantizer parameters, we further propose Distillation Quantization Calibration (DQC), which employs a distillation approach to make the quantized model learn from its FP counterpart. Extensive experiments on different bit-widths and scaling factors show that DOBI alone reaches state-of-the-art (SOTA) performance, and that after stage two our method surpasses existing PTQ methods in both metrics and visual quality. When quantized to 2-bit, 2DQuant improves PSNR by as much as 4.52dB on Set5 (x2) compared with the SOTA, and it achieves a 3.60x compression ratio and a 5.08x speedup ratio. The code and models will be available at https://github.com/Kai-Liu001/2DQuant.


Overview

This paper introduces "2DQuant", a low-bit post-training quantization (PTQ) method for efficient image super-resolution (SR) models.
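The key idea is to set the quantizer ranges in two stages: Distribution-Oriented Bound Initialization (DOBI) searches a coarse clipping bound for each quantizer, with a search strategy suited to the shape of the weight or activation distribution, and Distillation Quantization Calibration (DQC) then refines the quantizer parameters by making the quantized model learn from its full-precision (FP) counterpart.
2DQuant outperforms previous PTQ approaches for SR (by as much as 4.52 dB PSNR on Set5 (x2) at 2-bit), enabling efficient deployment of high-performance SR models on resource-constrained devices.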

Plain English Explanation

The research paper presents a technique called "2DQuant" that can make image super-resolution (SR) models more efficient and practical for use on devices with limited computing power, such as smartphones or embedded systems.

Super-resolution is the process of taking a low-resolution image and generating a higher-quality, high-resolution version of it. This is a computationally intensive task, and running state-of-the-art SR models on resource-constrained devices can be challenging.

The paper introduces a way to substantially reduce the memory and computation these models require, while losing much less output quality than earlier quantization methods do.

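The key idea is to reduce the number of bits used to represent the model's weights and activations (intermediate calculations), down to as few as 2 bits, and to choose the range of values each low-bit number can cover in two steps. First, the method searches for a good range based on how the values are distributed. Then it fine-tunes those ranges by having the compressed model imitate the original full-precision model. Fewer bits make the model much smaller and faster, and the careful choice of ranges keeps the super-resolution output quality high.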
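To make "fewer bits" concrete, here is a minimal sketch of uniform fake quantization: a value is clipped to a range [lower, upper], snapped to one of 2^bits evenly spaced levels, and mapped back to a float. The function name `fake_quantize` and the example numbers are illustrative assumptions, not code from the paper.

```python
def fake_quantize(x: float, lower: float, upper: float, bits: int) -> float:
    """Clip x to [lower, upper], snap it to one of 2**bits uniform levels, return that level."""
    steps = 2 ** bits - 1                    # gaps between the lowest and highest level
    scale = (upper - lower) / steps          # width of one quantization step
    clipped = min(max(x, lower), upper)      # values outside the range are clipped
    code = round((clipped - lower) / scale)  # integer code in 0..steps (what would be stored)
    return lower + code * scale              # dequantized value used in later computation


weights = [-0.9, -0.31, 0.02, 0.4, 0.77]
# With 2 bits there are only four levels: -1, -1/3, 1/3 and 1.
print([round(fake_quantize(w, -1.0, 1.0, bits=2), 3) for w in weights])
# [-1.0, -0.333, 0.333, 0.333, 1.0]
```

Every weight collapses onto one of four values, so at 2 bits the choice of `lower` and `upper` largely determines the error.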

This approach outperforms previous methods for post-training quantization (PTQ) of SR models, allowing powerful SR models to be deployed on a wider range of devices, including smartphones, cameras, and other embedded systems.

Technical Explanation

The 2DQuant method builds on previous work in post-training quantization (PTQ) to efficiently deploy image super-resolution (SR) models on resource-constrained devices.

The authors observe that, despite earlier efforts, transformer-based SR models still degrade severely under low-bit quantization because of their distinctive activation distributions. Investigating the weights and activations, they find distributions in which symmetry and asymmetry coexist and which have long tails, so a single rule for choosing quantizer ranges is unlikely to suit every quantizer.
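A small experiment illustrates why the coexistence of symmetric and asymmetric distributions, noted in the abstract, matters. The sketch below uses hypothetical data and illustrative helpers (`fake_quantize`, `mse`): it quantizes non-negative, softmax-like activations to 3 bits, once with a range symmetric around zero and once with the tensor's own [min, max] range. The symmetric range spends half of its eight levels on negative values that never occur, so its error is several times larger.

```python
def fake_quantize(x, lower, upper, bits):
    """Uniform quantize-dequantize of x with 2**bits levels spanning [lower, upper]."""
    scale = (upper - lower) / (2 ** bits - 1)
    code = round((min(max(x, lower), upper) - lower) / scale)
    return lower + code * scale


def mse(values, lower, upper, bits):
    """Mean squared quantization error of a list of values for a given range."""
    return sum((v - fake_quantize(v, lower, upper, bits)) ** 2 for v in values) / len(values)


# Hypothetical non-negative activations, e.g. attention probabilities after a softmax.
acts = [0.0, 0.01, 0.03, 0.05, 0.1, 0.2, 0.4, 0.9]
m = max(abs(a) for a in acts)
sym_err = mse(acts, -m, m, bits=3)                  # symmetric range [-0.9, 0.9]
asym_err = mse(acts, min(acts), max(acts), bits=3)  # asymmetric range [0.0, 0.9]
print(f"symmetric: {sym_err:.4f}  asymmetric: {asym_err:.4f}")
```

For weights that are roughly centered on zero the picture reverses, which is why a quantizer's range rule should depend on the distribution it sees.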

In the first stage, Distribution-Oriented Bound Initialization (DOBI) uses different search strategies to find a coarse clipping bound for each quantizer. According to the authors, DOBI alone already reaches state-of-the-art performance.
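The abstract describes DOBI only as using different searching strategies to find a coarse bound for each quantizer. The sketch below is one plausible reading under stated assumptions, not the authors' algorithm: the objective is the mean squared quantization error; a hypothetical rule treats a tensor as symmetric when its negative and positive extremes are within a factor of two of each other; symmetric tensors get a one-parameter search over [-b, b]; all other tensors get a grid search over independent lower and upper bounds. Every candidate is compared against the plain min/max range, so under this objective the result is never worse than min/max.

```python
def fake_quantize(x, lower, upper, bits):
    """Uniform quantize-dequantize of x with 2**bits levels spanning [lower, upper]."""
    scale = (upper - lower) / (2 ** bits - 1)
    code = round((min(max(x, lower), upper) - lower) / scale)
    return lower + code * scale


def mse(values, lower, upper, bits):
    """Mean squared quantization error of a list of values for a given range."""
    return sum((v - fake_quantize(v, lower, upper, bits)) ** 2 for v in values) / len(values)


def search_bounds(values, bits, grid=20):
    """Hypothetical distribution-dependent bound search; returns (lower, upper)."""
    lo, hi = min(values), max(values)
    if lo == hi:
        return lo, hi  # constant tensor: nothing to search
    best_err, best = mse(values, lo, hi, bits), (lo, hi)  # baseline: min/max range
    if lo < 0 < hi and 0.5 <= -lo / hi <= 2.0:
        # Roughly symmetric around zero: search a single magnitude b for [-b, b].
        m = max(-lo, hi)
        for i in range(1, grid + 1):
            b = m * i / grid
            err = mse(values, -b, b, bits)
            if err < best_err:
                best_err, best = err, (-b, b)
    else:
        # Asymmetric: search lower and upper bounds independently on a grid.
        step = (hi - lo) / grid
        for i in range(grid):
            for j in range(i + 1, grid + 1):
                lower, upper = lo + i * step, lo + j * step
                err = mse(values, lower, upper, bits)
                if err < best_err:
                    best_err, best = err, (lower, upper)
    return best


# Hypothetical long-tailed weights: most values near zero, two outliers.
weights = [i / 100 for i in range(-20, 21)] + [-1.5, 1.3]
# Hypothetical post-activation values: non-negative with one large outlier.
acts = [i / 1000 for i in range(200)] + [2.0]
print(search_bounds(weights, bits=3))  # symmetric branch: a bound tighter than 1.5
print(search_bounds(acts, bits=3))     # asymmetric branch: lower error than [0.0, 2.0]
```

A grid search is used here only because it is simple and deterministic; the actual candidate sets and objective in DOBI may differ.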

In the second stage, Distillation Quantization Calibration (DQC) refines the quantizer parameters with a distillation approach: the quantized model learns from its full-precision counterpart, which acts as the teacher.
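The abstract describes DQC as a distillation approach in which the quantized model learns from its FP counterpart. The sketch below illustrates that idea on a hypothetical single linear layer in dependency-free Python. Two simplifications are assumptions of this sketch, not details from the paper: the distillation loss only matches the final outputs, and the quantizer bounds are refined with a gradient-free coordinate search rather than gradient-based training. Only moves that lower the teacher-student loss are accepted, so the calibrated bounds are never worse than their initialization.

```python
import random


def fake_quantize(x, lower, upper, bits):
    """Uniform quantize-dequantize of x with 2**bits levels spanning [lower, upper]."""
    scale = (upper - lower) / (2 ** bits - 1)
    code = round((min(max(x, lower), upper) - lower) / scale)
    return lower + code * scale


def distill_loss(weights, calib, bounds, bits):
    """MSE between the FP teacher's outputs and the quantized student's outputs."""
    w_lo, w_hi, a_lo, a_hi = bounds
    q_weights = [fake_quantize(w, w_lo, w_hi, bits) for w in weights]
    total = 0.0
    for x in calib:
        teacher = sum(w * v for w, v in zip(weights, x))
        student = sum(qw * fake_quantize(v, a_lo, a_hi, bits) for qw, v in zip(q_weights, x))
        total += (teacher - student) ** 2
    return total / len(calib)


def calibrate(weights, calib, bounds, bits, rounds=30, step=0.05):
    """Hypothetical coordinate search over [w_lo, w_hi, a_lo, a_hi] that lowers distill_loss."""
    params = list(bounds)
    best = distill_loss(weights, calib, params, bits)
    for _ in range(rounds):
        improved = False
        for k in range(4):
            width = params[1] - params[0] if k < 2 else params[3] - params[2]
            for sign in (-1.0, 1.0):
                trial = params[:]
                trial[k] += sign * step * width
                if trial[0] < trial[1] and trial[2] < trial[3]:  # keep ranges valid
                    loss = distill_loss(weights, calib, trial, bits)
                    if loss < best:
                        params, best, improved = trial, loss, True
        if not improved:
            step /= 2  # shrink the move size once no move helps
    return params, best


random.seed(0)
weights = [random.gauss(0.0, 0.5) for _ in range(16)]
calib = [[random.gauss(0.0, 1.0) for _ in range(16)] for _ in range(64)]
init = [min(weights), max(weights), min(min(x) for x in calib), max(max(x) for x in calib)]
print(f"before: {distill_loss(weights, calib, init, bits=2):.4f}")
refined, loss = calibrate(weights, calib, init, bits=2)
print(f"after:  {loss:.4f}")
```

In a real network one would typically use a framework's autograd with a straight-through estimator and may also match intermediate features; the search here just keeps the example self-contained.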

Experiments across different bit-widths and scaling factors show that 2DQuant surpasses existing PTQ methods in both metrics and visual quality. When quantized to 2-bit, it improves PSNR on Set5 (x2) by as much as 4.52 dB over the previous SOTA, and it achieves a 3.60x compression ratio and a 5.08x speedup ratio.
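To put the abstract's 4.52 dB figure in perspective: since PSNR = 10·log10(MAX²/MSE), a gain of Δ dB at a fixed peak value corresponds to dividing the mean squared error by 10^(Δ/10). The short check below applies this relation; the helper names are illustrative.

```python
import math


def psnr(mse: float, max_val: float = 1.0) -> float:
    """Peak signal-to-noise ratio in dB for a given mean squared error."""
    return 10 * math.log10(max_val ** 2 / mse)


def mse_reduction_factor(gain_db: float) -> float:
    """Factor by which MSE must shrink to raise PSNR by gain_db at a fixed max_val."""
    return 10 ** (gain_db / 10)


factor = mse_reduction_factor(4.52)
print(round(factor, 2))  # 2.83
# Sanity check: shrinking an arbitrary MSE by that factor adds 4.52 dB.
print(round(psnr(1e-3 / factor) - psnr(1e-3), 2))  # 4.52
```

Because benchmark PSNR is averaged over images, this is only a rough reading, assuming a single average MSE: the reported gain corresponds to roughly a 2.8x lower reconstruction error than the previous SOTA on that setting.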

Critical Analysis

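The 2DQuant paper presents a compelling approach to deploying high-performance image super-resolution models on resource-constrained devices. The key strength of the work is its two-stage design: a distribution-oriented bound initialization (DOBI) that adapts the search strategy to the shape of each weight or activation distribution, followed by distillation-based calibration (DQC) that refines the quantizers against the full-precision model. Together, these enable aggressive low-bit quantization with substantially less quality loss than previous PTQ methods.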

That said, several potential limitations and directions for further research remain:

Applicability to other model architectures: The method is motivated by the distinctive activation distributions of transformer-based SR models, so it is unclear how well 2DQuant would generalize to other types of SR architectures, such as convolutional or diffusion-based models.
Hardware-specific optimizations: Beyond the reported compression and speedup ratios, it would be useful to see how 2DQuant performs when tailored to specific hardware platforms, such as mobile CPUs or dedicated AI accelerators. Quantization schemes matched to the target hardware could provide additional performance and efficiency gains.
Generalization to other low-level vision tasks: While the focus is on image super-resolution, it would be valuable to investigate how 2DQuant could be applied to other low-level vision tasks, such as image denoising or image inpainting.

Overall, the 2DQuant paper presents a promising approach to efficient image super-resolution, and the insights could potentially be extended to a broader range of low-level vision applications and hardware platforms.

Conclusion

The 2DQuant paper introduces a dual-stage post-training quantization method that enables efficient deployment of high-performance image super-resolution models on resource-constrained devices. By combining distribution-oriented bound initialization with distillation-based calibration, 2DQuant achieves large reductions in model size and inference cost while losing far less SR quality at low bit-widths than previous PTQ methods.

This work demonstrates the potential for advanced quantization methods to bridge the gap between powerful AI models and the practical constraints of real-world deployment scenarios. As the demand for efficient computer vision applications on edge devices continues to grow, techniques like 2DQuant could play a crucial role in enabling the widespread use of these transformative technologies.

This summary was produced with help from an AI and may contain inaccuracies; consult the original paper (arXiv 2406.06649) for details.

Related Papers

PTQ4DiT: Post-training Quantization for Diffusion Transformers

Junyi Wu, Haoxuan Wang, Yuzhang Shang, Mubarak Shah, Yan Yan

The recent introduction of Diffusion Transformers (DiTs) has demonstrated exceptional capabilities in image generation by using a different backbone architecture, departing from traditional U-Nets and embracing the scalable nature of transformers. Despite their advanced capabilities, the wide deployment of DiTs, particularly for real-time applications, is currently hampered by considerable computational demands at the inference stage. Post-training Quantization (PTQ) has emerged as a fast and data-efficient solution that can significantly reduce computation and memory footprint by using low-bit weights and activations. However, its applicability to DiTs has not yet been explored and faces non-trivial difficulties due to the unique design of DiTs. In this paper, we propose PTQ4DiT, a specifically designed PTQ method for DiTs. We discover two primary quantization challenges inherent in DiTs, notably the presence of salient channels with extreme magnitudes and the temporal variability in distributions of salient activation over multiple timesteps. To tackle these challenges, we propose Channel-wise Salience Balancing (CSB) and Spearman's ρ-guided Salience Calibration (SSC). CSB leverages the complementarity property of channel magnitudes to redistribute the extremes, alleviating quantization errors for both activations and weights. SSC extends this approach by dynamically adjusting the balanced salience to capture the temporal variations in activation. Additionally, to eliminate extra computational costs caused by PTQ4DiT during inference, we design an offline re-parameterization strategy for DiTs. Experiments demonstrate that our PTQ4DiT successfully quantizes DiTs to 8-bit precision (W8A8) while preserving comparable generation ability and further enables effective quantization to 4-bit weight precision (W4A8) for the first time.

5/28/2024

cs.CV

MixDQ: Memory-Efficient Few-Step Text-to-Image Diffusion Models with Metric-Decoupled Mixed Precision Quantization

Tianchen Zhao, Xuefei Ning, Tongcheng Fang, Enshu Liu, Guyue Huang, Zinan Lin, Shengen Yan, Guohao Dai, Yu Wang

Diffusion models have achieved significant visual generation quality. However, their significant computational and memory costs pose challenges for their application on resource-constrained mobile devices or even desktop GPUs. Recent few-step diffusion models reduce the inference time by reducing the denoising steps. However, their memory consumption is still excessive. Post-Training Quantization (PTQ) replaces high bit-width FP representations with low-bit integer values (INT4/8), which is an effective and efficient technique to reduce the memory cost. However, when applied to few-step diffusion models, existing quantization methods face challenges in preserving both the image quality and text alignment. To address this issue, we propose a mixed-precision quantization framework, MixDQ. First, we design a specialized BOS-aware quantization method for highly sensitive text embedding quantization. Then, we conduct metric-decoupled sensitivity analysis to measure the sensitivity of each layer. Finally, we develop an integer-programming-based method to conduct bit-width allocation. While existing quantization methods fall short at W8A8, MixDQ could achieve W8A8 without performance loss, and W4A8 with negligible visual degradation. Compared with FP16, we achieve 3-4x reduction in model size and memory cost, and 1.45x latency speedup.

5/31/2024

cs.CV cs.AI


Q-HyViT: Post-Training Quantization of Hybrid Vision Transformers with Bridge Block Reconstruction for IoT Systems

Jemin Lee, Yongin Kwon, Sihyeong Park, Misun Yu, Jeman Park, Hwanjun Song

Recently, vision transformers (ViTs) have superseded convolutional neural networks in numerous applications, including classification, detection, and segmentation. However, the high computational requirements of ViTs hinder their widespread implementation. To address this issue, researchers have proposed efficient hybrid transformer architectures that combine convolutional and transformer layers with optimized attention computation of linear complexity. Additionally, post-training quantization has been proposed as a means of mitigating computational demands. For mobile devices, achieving optimal acceleration for ViTs necessitates the strategic integration of quantization techniques and efficient hybrid transformer structures. However, no prior investigation has applied quantization to efficient hybrid transformers. In this paper, we discover that applying existing post-training quantization (PTQ) methods for ViTs to efficient hybrid transformers leads to a drastic accuracy drop, attributed to the four following challenges: (i) highly dynamic ranges, (ii) zero-point overflow, (iii) diverse normalization, and (iv) limited model parameters (<5M). To overcome these challenges, we propose a new post-training quantization method, which is the first to quantize efficient hybrid ViTs (MobileViTv1, MobileViTv2, Mobile-Former, EfficientFormerV1, EfficientFormerV2). We achieve a significant improvement of 17.73% for 8-bit and 29.75% for 6-bit on average, respectively, compared with existing PTQ methods (EasyQuant, FQ-ViT, PTQ4ViT, and RepQ-ViT). We plan to release our code at https://gitlab.com/ones-ai/q-hyvit.

5/20/2024

cs.CV cs.AI


EfficientDM: Efficient Quantization-Aware Fine-Tuning of Low-Bit Diffusion Models

Yefei He, Jing Liu, Weijia Wu, Hong Zhou, Bohan Zhuang

Diffusion models have demonstrated remarkable capabilities in image synthesis and related generative tasks. Nevertheless, their practicality for real-world applications is constrained by substantial computational costs and latency issues. Quantization is a dominant way to compress and accelerate diffusion models, where post-training quantization (PTQ) and quantization-aware training (QAT) are two main approaches, each bearing its own properties. While PTQ exhibits efficiency in terms of both time and data usage, it may lead to diminished performance in low bit-width. On the other hand, QAT can alleviate performance degradation but comes with substantial demands on computational and data resources. In this paper, we introduce a data-free and parameter-efficient fine-tuning framework for low-bit diffusion models, dubbed EfficientDM, to achieve QAT-level performance with PTQ-like efficiency. Specifically, we propose a quantization-aware variant of the low-rank adapter (QALoRA) that can be merged with model weights and jointly quantized to low bit-width. The fine-tuning process distills the denoising capabilities of the full-precision model into its quantized counterpart, eliminating the requirement for training data. We also introduce scale-aware optimization and temporal learned step-size quantization to further enhance performance. Extensive experimental results demonstrate that our method significantly outperforms previous PTQ-based diffusion models while maintaining similar time and data efficiency. Specifically, there is only a 0.05 sFID increase when quantizing both weights and activations of LDM-4 to 4-bit on ImageNet 256x256. Compared to QAT-based methods, our EfficientDM also boasts a 16.2x faster quantization speed with comparable generation quality. Code is available at https://github.com/ThisisBillhe/EfficientDM.

4/16/2024

cs.CV