FLIQS: One-Shot Mixed-Precision Floating-Point and Integer Quantization Search

Read original: arXiv:2308.03290 - Published 5/2/2024 by Jordan Dotzel, Gang Wu, Andrew Li, Muhammad Umar, Yun Ni, Mohamed S. Abdelfattah, Zhiru Zhang, Liqun Cheng, Martin G. Dixon, Norman P. Jouppi, Quoc V. Le, Sheng Li


Overview

This paper proposes FLIQS, a one-shot search technique for efficiently discovering mixed-precision quantization strategies for deep neural networks across both floating-point and integer formats.
The key innovations are a one-shot mixed-precision quantization search that eliminates the need for retraining, and the ability to search both integer and low-precision floating-point quantization simultaneously.
The authors evaluate FLIQS on multiple convolutional and vision transformer networks, demonstrating improved accuracy compared to prior uniform precision, manual mixed-precision, and integer quantization search methods.

Plain English Explanation

Deep neural networks (DNNs) have become incredibly powerful, but they also require a lot of computing power and memory to run. Quantization is a technique that can compress these models and make them more efficient, by representing the numbers in the network with fewer bits.
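
To make "fewer bits" concrete, here is a minimal sketch of symmetric uniform integer quantization in plain Python. The weight values and the helper names (`quantize_int`, `dequantize`, `mean_abs_error`) are made up for illustration and are not taken from the paper; real frameworks use more elaborate schemes such as per-channel scales.

```python
def quantize_int(values, bits):
    """Symmetric uniform quantization: map floats to signed integer codes."""
    qmax = 2 ** (bits - 1) - 1          # e.g. 127 for 8 bits, 7 for 4 bits
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    # Round to the nearest integer step and clamp to the representable range.
    codes = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return codes, scale


def dequantize(codes, scale):
    """Recover approximate float values from integer codes."""
    return [c * scale for c in codes]


def mean_abs_error(a, b):
    """Average absolute difference between two equally long lists."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


weights = [0.81, -0.33, 0.05, -1.27, 0.64, 0.002]  # made-up example values

for bits in (8, 4, 2):
    codes, scale = quantize_int(weights, bits)
    approx = dequantize(codes, scale)
    print(bits, codes, round(mean_abs_error(weights, approx), 4))
```

Running the loop shows the reconstruction error growing as the bit-width shrinks, which is the accuracy cost that a quantization search has to balance against model size.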

However, finding the right way to quantize a model is tricky. Prior mixed-precision methods either search after training, which compromises accuracy, or run a differentiable search, which uses a lot of memory. This paper proposes a new approach called FLIQS that searches for a mixed-precision quantization strategy in one shot, without needing to retrain the model.

FLIQS can explore both integer quantization and low-precision floating-point quantization, which allows it to discover the best trade-off between model size, speed, and accuracy. The authors show that FLIQS improves ImageNet accuracy by 1.31% for ResNet-18 and 0.90% for ResNet-50 compared to previous quantization methods, at equivalent model cost.
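
The idea of a mixed-precision trade-off can be illustrated with a toy example. The sketch below is not the FLIQS search algorithm, which this summary does not describe; it simply enumerates every per-layer bit-width assignment for a hypothetical three-layer model and picks the one with the lowest proxy error that fits a cost budget. The layer sizes, the `sensitivity` scores, and the `2 ** -bits` error proxy are all assumptions made for illustration.

```python
from itertools import product

# Hypothetical per-layer data: parameter counts and a made-up "sensitivity"
# score (how much the layer suffers from coarse quantization).
layers = [
    {"name": "conv1", "params": 10_000, "sensitivity": 5.0},
    {"name": "conv2", "params": 200_000, "sensitivity": 1.0},
    {"name": "fc", "params": 500_000, "sensitivity": 0.5},
]
candidate_bits = (4, 8)


def model_cost_bits(assignment):
    """Total weight storage in bits for a per-layer bit-width assignment."""
    return sum(layer["params"] * b for layer, b in zip(layers, assignment))


def proxy_error(assignment):
    """Toy error proxy: each layer's error shrinks as 2**-bits."""
    return sum(layer["sensitivity"] * 2.0 ** -b
               for layer, b in zip(layers, assignment))


def best_assignment(budget_bits):
    """Exhaustively pick the lowest-error assignment that fits the budget."""
    feasible = [a for a in product(candidate_bits, repeat=len(layers))
                if model_cost_bits(a) <= budget_bits]
    return min(feasible, key=proxy_error) if feasible else None


uniform4 = (4, 4, 4)
budget = model_cost_bits(uniform4) + 100_000   # slightly above uniform 4-bit
mixed = best_assignment(budget)
print(mixed, proxy_error(mixed), proxy_error(uniform4))
```

In this toy setup, spending the small extra budget on the tiny but sensitive first layer lowers the proxy error far more than uniform 4-bit allows. Exhaustive enumeration grows exponentially with the number of layers, which is why practical methods rely on a guided search instead.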

For the first time, the paper also explores mixed-precision floating-point quantization, and shows FLIQS can improve the performance of MobileNetV2 by up to 0.98% compared to prior state-of-the-art FP8 models.

Finally, the authors extend FLIQS to simultaneously search the quantization strategy and the neural network architecture, which improves ImageNet accuracy by 2.69% at similar model cost on a MobileNetV2 search space.

Technical Explanation

The key innovations in this paper are:

One-shot mixed-precision quantization search: Prior methods have either performed a post-training quantization search, which compromises accuracy, or a differentiable quantization search, which leads to high memory usage. FLIQS eliminates the need for retraining by performing a one-shot search that discovers the optimal mixed-precision quantization strategy.
Searching integer and low-precision floating-point: FLIQS can search both integer quantization and low-precision floating-point quantization, allowing it to find the best trade-off between model size, speed, and accuracy.
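
One way to see why it can help to search both number formats is to compare their rounding error on different value distributions. The sketch below is an illustration under stated assumptions, not code from the paper: `quantize_float` simulates a toy 8-bit float with 4 exponent bits and 3 mantissa bits, treats every exponent code as finite, and does not reproduce the special-value encodings of real FP8 formats. The input values are made up.

```python
import math


def quantize_int(x, bits, max_abs):
    """Symmetric integer quantization followed by dequantization."""
    qmax = 2 ** (bits - 1) - 1
    scale = max_abs / qmax
    code = max(-qmax, min(qmax, round(x / scale)))
    return code * scale


def quantize_float(x, exp_bits, man_bits):
    """Round x to a toy low-precision float (subnormals kept, no inf/NaN)."""
    if x == 0.0:
        return 0.0
    bias = 2 ** (exp_bits - 1) - 1
    min_exp = 1 - bias                     # exponent of the smallest normal
    max_exp = (2 ** exp_bits - 1) - bias   # assumption: all codes are finite
    max_val = (2 - 2.0 ** -man_bits) * 2.0 ** max_exp
    sign = -1.0 if x < 0 else 1.0
    mag = min(abs(x), max_val)             # clamp to the largest magnitude
    exp = max(math.frexp(mag)[1] - 1, min_exp)
    step = 2.0 ** (exp - man_bits)         # spacing of values near mag
    return sign * min(round(mag / step) * step, max_val)


def mean_rel_error(values, quantizer):
    """Average relative rounding error over nonzero values."""
    return sum(abs(quantizer(v) - v) / abs(v) for v in values) / len(values)


def compare(values):
    """Return (int8 error, toy fp8 error) for the given values."""
    max_abs = max(abs(v) for v in values)
    int8 = mean_rel_error(values, lambda v: quantize_int(v, 8, max_abs))
    fp8 = mean_rel_error(values, lambda v: quantize_float(v, 4, 3))
    return int8, fp8


wide = [0.02, 0.3, 1.5, 12.0, 100.0]      # spans several orders of magnitude
narrow = [0.91, 0.95, 0.99, 1.03, 1.07]   # clustered near 1.0

for name, values in (("wide", wide), ("narrow", narrow)):
    int8, fp8 = compare(values)
    print(f"{name}: int8 rel. error {int8:.4f}, toy fp8 rel. error {fp8:.4f}")
```

On the wide-range values the toy float format has the lower relative error, while on the tightly clustered values the integer format wins. Which format suits a given layer therefore depends on its value distribution, which is the kind of per-layer choice a joint integer and floating-point search can make.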

The authors evaluate FLIQS on multiple convolutional and vision transformer networks. The headline results cover ResNet-18, ResNet-50, and MobileNetV2. Compared to prior methods:

For integer models, FLIQS increases the accuracy of ResNet-18 on ImageNet by 1.31% and ResNet-50 by 0.90%, with equivalent model cost.
For low-precision floating-point, FLIQS improves MobileNetV2 by up to 0.98% compared to prior state-of-the-art FP8 models.
By extending FLIQS to search the quantization strategy and neural architecture simultaneously, the authors improve ImageNet accuracy by 2.69% with similar model cost on a MobileNetV2 search space.

Critical Analysis

The paper provides a compelling solution to the problem of efficiently discovering optimal mixed-precision quantization strategies for deep neural networks. The one-shot search approach and the ability to explore both integer and low-precision floating-point quantization are significant contributions.

However, the paper does not address the potential computational overhead of the FLIQS search process itself. While the method eliminates the need for retraining, the search may still be computationally expensive, especially when extended to search the quantization strategy and neural architecture simultaneously.

Additionally, the paper focuses on image classification tasks, and it would be valuable to see how FLIQS performs on a wider range of applications, such as natural language processing or speech recognition, to better understand its broader applicability.

Further research could also explore the robustness of FLIQS-optimized models to distributional shift or adversarial attacks, as quantization can sometimes introduce vulnerabilities.

Conclusion

This paper presents a novel mixed-precision quantization search method called FLIQS that can efficiently discover optimal quantization strategies for deep neural networks. By eliminating the need for retraining and exploring both integer and low-precision floating-point quantization, FLIQS is able to achieve significant accuracy improvements over prior methods while maintaining similar model cost.

The ability to simultaneously search the quantization strategy and neural architecture is particularly promising, as it demonstrates the potential for this approach to further optimize model efficiency. While the paper focuses on image classification tasks, the underlying principles of FLIQS could have broad applicability to a wide range of deep learning applications.

As hardware continues to evolve and support more diverse quantization capabilities, techniques like FLIQS will become increasingly important for deploying high-performance, resource-efficient deep neural networks in real-world settings.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!


Related Papers


FLIQS: One-Shot Mixed-Precision Floating-Point and Integer Quantization Search

Jordan Dotzel, Gang Wu, Andrew Li, Muhammad Umar, Yun Ni, Mohamed S. Abdelfattah, Zhiru Zhang, Liqun Cheng, Martin G. Dixon, Norman P. Jouppi, Quoc V. Le, Sheng Li

Quantization has become a mainstream compression technique for reducing model size, computational requirements, and energy consumption for modern deep neural networks (DNNs). With improved numerical support in recent hardware, including multiple variants of integer and floating point, mixed-precision quantization has become necessary to achieve high-quality results with low model cost. Prior mixed-precision methods have performed either a post-training quantization search, which compromises on accuracy, or a differentiable quantization search, which leads to high memory usage from branching. Therefore, we propose the first one-shot mixed-precision quantization search that eliminates the need for retraining in both integer and low-precision floating point models. We evaluate our search (FLIQS) on multiple convolutional and vision transformer networks to discover Pareto-optimal models. Our approach improves upon uniform precision, manual mixed-precision, and recent integer quantization search methods. With integer models, we increase the accuracy of ResNet-18 on ImageNet by 1.31% and ResNet-50 by 0.90% with equivalent model cost over previous methods. Additionally, for the first time, we explore a novel mixed-precision floating-point search and improve MobileNetV2 by up to 0.98% compared to prior state-of-the-art FP8 models. Finally, we extend FLIQS to simultaneously search a joint quantization and neural architecture space and improve the ImageNet accuracy by 2.69% with similar model cost on a MobileNetV2 search space.

5/2/2024


I-LLM: Efficient Integer-Only Inference for Fully-Quantized Low-Bit Large Language Models

Xing Hu, Yuan Cheng, Dawei Yang, Zhihang Yuan, Jiangyong Yu, Chen Xu, Sifan Zhou

Post-training quantization (PTQ) serves as a potent technique to accelerate the inference of large language models (LLMs). Nonetheless, existing works still necessitate a considerable number of floating-point (FP) operations during inference, including additional quantization and de-quantization, as well as non-linear operators such as RMSNorm and Softmax. This limitation hinders the deployment of LLMs on edge and cloud devices. In this paper, we identify that the primary obstacle to integer-only quantization for LLMs lies in the large fluctuation of activations across channels and tokens in both linear and non-linear operations. To address this issue, we propose I-LLM, a novel integer-only fully-quantized PTQ framework tailored for LLMs. Specifically, (1) we develop Fully-Smooth Block-Reconstruction (FSBR) to aggressively smooth inter-channel variations of all activations and weights; (2) to alleviate degradation caused by inter-token variations, we introduce a novel approach called Dynamic Integer-only MatMul (DI-MatMul), which enables dynamic quantization in full-integer matrix multiplication by dynamically quantizing the input and outputs with integer-only operations; and (3) we design DI-ClippedSoftmax, DI-Exp, and DI-Normalization, which utilize bit shift to execute non-linear operators efficiently while maintaining accuracy. The experiments show that I-LLM achieves comparable accuracy to the FP baseline and outperforms non-integer quantization methods. For example, I-LLM can operate at W4A4 with negligible loss of accuracy. To our knowledge, we are the first to bridge the gap between integer-only quantization and LLMs. We've published our code on anonymous.4open.science, aiming to contribute to the advancement of this field.

6/6/2024

Low-Bitwidth Floating Point Quantization for Efficient High-Quality Diffusion Models

Cheng Chen, Christina Giannoula, Andreas Moshovos

Diffusion models are emerging models that generate images by iteratively denoising random Gaussian noise using deep neural networks. These models typically exhibit high computational and memory demands, necessitating effective post-training quantization for high-performance inference. Recent works propose low-bitwidth (e.g., 8-bit or 4-bit) quantization for diffusion models; however, 4-bit integer quantization typically results in low-quality images. We observe that on several widely used hardware platforms, there is little or no difference in compute capability between floating-point and integer arithmetic operations of the same bitwidth (e.g., 8-bit or 4-bit). Therefore, we propose an effective floating-point quantization method for diffusion models that provides better image quality compared to integer quantization methods. We employ a floating-point quantization method that was effective for other processing tasks, specifically computer vision and natural language tasks, and tailor it for diffusion models by integrating weight rounding learning during the mapping of the full-precision values to the quantized values in the quantization process. We comprehensively study integer and floating-point quantization methods in state-of-the-art diffusion models. Our floating-point quantization method not only generates higher-quality images than integer quantization methods, but also shows no noticeable degradation compared to full-precision models (32-bit floating-point) when both weights and activations are quantized to 8-bit floating-point values, and has minimal degradation with 4-bit weights and 8-bit activations.

8/14/2024

MixDQ: Memory-Efficient Few-Step Text-to-Image Diffusion Models with Metric-Decoupled Mixed Precision Quantization

Tianchen Zhao, Xuefei Ning, Tongcheng Fang, Enshu Liu, Guyue Huang, Zinan Lin, Shengen Yan, Guohao Dai, Yu Wang

Diffusion models have achieved significant visual generation quality. However, their significant computational and memory costs pose challenges for their application on resource-constrained mobile devices or even desktop GPUs. Recent few-step diffusion models reduce the inference time by reducing the denoising steps. However, their memory consumption is still excessive. Post-Training Quantization (PTQ) replaces high bit-width FP representation with low-bit integer values (INT4/8), which is an effective and efficient technique to reduce the memory cost. However, when applied to few-step diffusion models, existing quantization methods face challenges in preserving both the image quality and text alignment. To address this issue, we propose a mixed-precision quantization framework, MixDQ. First, we design a specialized BOS-aware quantization method for highly sensitive text embedding quantization. Then, we conduct metric-decoupled sensitivity analysis to measure the sensitivity of each layer. Finally, we develop an integer-programming-based method to conduct bit-width allocation. While existing quantization methods fall short at W8A8, MixDQ could achieve W8A8 without performance loss, and W4A8 with negligible visual degradation. Compared with FP16, we achieve 3-4x reduction in model size and memory cost, and 1.45x latency speedup.

5/31/2024