Accurate Neural Training with 4-bit Matrix Multiplications at Standard Formats

Read original: arXiv:2112.10769 - Published 6/11/2024 by Brian Chmiel, Ron Banner, Elad Hoffer, Hilla Ben Yaacov, Daniel Soudry

🧠

Overview

Deep Neural Networks (DNNs) require significant computational resources, which can be a barrier to their deployment on resource-constrained devices.
One way to reduce the computational footprint of DNNs is through quantization, which involves representing the weights and activations with fewer bits.
Previous methods have enabled 4-bit quantization of the forward phase (inference), but the training process also requires quantizing the neural gradients, which has proven more challenging.
This paper introduces a new method called Logarithmic Unbiased Quantization (LUQ) that allows for accurate 4-bit quantization of both the forward and backward (training) phases.

Plain English Explanation

Deep learning models, like the ones used for image recognition or language processing, are incredibly powerful but also very computationally intensive. This can make it challenging to use these models on devices with limited computing power, like smartphones or embedded systems.

One way to address this is through a technique called quantization. The idea is to represent the numerical values inside the model (the weights and activations) using fewer bits, which reduces the amount of memory and computation required. Previous work has shown how to effectively quantize the forward pass of a model to 4 bits, but the backward pass during training has been more difficult to quantize.

This paper introduces a new quantization method called Logarithmic Unbiased Quantization (LUQ) that can accurately quantize both the forward and backward passes to 4 bits. The key innovations are:

Maintaining an unbiased quantization, so the quantized values are on average equal to the original values. This helps preserve the accuracy of the model.
Using a logarithmic scale for the quantization, which is well-suited for the wide range of values that can occur in neural network gradients.

By combining these two ideas, the researchers were able to achieve state-of-the-art results in 4-bit training of deep learning models, with only a small degradation in accuracy compared to full-precision training. This could enable deploying powerful deep learning models on a wider range of hardware, including resource-constrained devices.

Technical Explanation

The paper introduces a new quantization method called Logarithmic Unbiased Quantization (LUQ) that can accurately quantize both the forward and backward passes of deep neural network training to 4 bits.

Previous work has shown that accurate 4-bit quantization of neural gradients requires two key properties:

The quantization must be unbiased, meaning the average of the quantized values is equal to the average of the original values. This helps preserve the accuracy of the model.
The quantization must use a logarithmic scale, which is well-suited for the wide range of values that can occur in neural network gradients.

However, no prior work had combined these two ideas. The authors of this paper examine the importance of unbiased quantization in quantized neural network training and propose the LUQ method to achieve both unbiased and logarithmic quantization.

The authors evaluate LUQ on the ResNet50 model trained on the ImageNet dataset. They show that LUQ can achieve a degradation of only 1.1% compared to full-precision training. Furthermore, by combining LUQ with a variance reduction method and a short fine-tuning stage, they are able to reduce the degradation to just 0.32%.

The key technical contributions of this work are:

Demonstrating the importance of unbiased quantization for neural network training.
Proposing the LUQ method that combines unbiased and logarithmic quantization.
Achieving state-of-the-art results in 4-bit training of deep learning models with minimal accuracy degradation.

Critical Analysis

The paper makes a compelling case for the importance of unbiased quantization in training deep neural networks and presents an effective solution in the form of the LUQ method. The experimental results on the ResNet50 model are impressive, showing that 4-bit training with LUQ can achieve a degradation of only 0.32% compared to full-precision training.

One potential limitation of the work is that it has only been evaluated on a single model (ResNet50) and dataset (ImageNet). It would be valuable to see how well LUQ generalizes to a wider range of deep learning architectures and tasks. Additionally, the paper does not provide much insight into the computational or memory savings achieved through 4-bit quantization, which would be an important practical consideration.

Another area for further research could be exploring the interaction between LUQ and other quantization-aware training techniques, such as AdaQAT or Gradient-based Automatic Per-Weight Mixed Precision. Combining LUQ with these methods could potentially lead to even greater computational efficiency without sacrificing model accuracy.

Overall, this paper represents an important step forward in enabling efficient training of deep learning models, which could expand the deployment of these powerful techniques to a wider range of hardware platforms, including resource-constrained devices.

Conclusion

This paper introduces a new quantization method called Logarithmic Unbiased Quantization (LUQ) that can accurately quantize both the forward and backward passes of deep neural network training to 4 bits. By combining unbiased quantization and a logarithmic scale, LUQ achieves state-of-the-art results in 4-bit training, with only a small degradation in accuracy compared to full-precision training.

This work demonstrates the importance of unbiased quantization for preserving model accuracy and provides an effective solution that could enable the deployment of powerful deep learning models on a wider range of hardware, including resource-constrained devices. Further research is needed to evaluate LUQ on a broader range of architectures and tasks, as well as to explore its integration with other quantization-aware training techniques.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Follow @aimodelsfyi on 𝕏 →

Related Papers

🧠

Accurate Neural Training with 4-bit Matrix Multiplications at Standard Formats

Brian Chmiel, Ron Banner, Elad Hoffer, Hilla Ben Yaacov, Daniel Soudry

Quantization of the weights and activations is one of the main methods to reduce the computational footprint of Deep Neural Networks (DNNs) training. Current methods enable 4-bit quantization of the forward phase. However, this constitutes only a third of the training process. Reducing the computational footprint of the entire training process requires the quantization of the neural gradients, i.e., the loss gradients with respect to the outputs of intermediate neural layers. Previous works separately showed that accurate 4-bit quantization of the neural gradients needs to (1) be unbiased and (2) have a log scale. However, no previous work aimed to combine both ideas, as we do in this work. Specifically, we examine the importance of having unbiased quantization in quantized neural network training, where to maintain it, and how to combine it with logarithmic quantization. Based on this, we suggest a $textit{logarithmic unbiased quantization}$ (LUQ) method to quantize both the forward and backward phases to 4-bit, achieving state-of-the-art results in 4-bit training without the overhead. For example, in ResNet50 on ImageNet, we achieved a degradation of 1.1%. We further improve this to a degradation of only 0.32% after three epochs of high precision fine-tuning, combined with a variance reduction method -- where both these methods add overhead comparable to previously suggested methods.

6/11/2024

1-Bit FQT: Pushing the Limit of Fully Quantized Training to 1-bit

Chang Gao, Jianfei Chen, Kang Zhao, Jiaqi Wang, Liping Jing

Fully quantized training (FQT) accelerates the training of deep neural networks by quantizing the activations, weights, and gradients into lower precision. To explore the ultimate limit of FQT (the lowest achievable precision), we make a first attempt to 1-bit FQT. We provide a theoretical analysis of FQT based on Adam and SGD, revealing that the gradient variance influences the convergence of FQT. Building on these theoretical results, we introduce an Activation Gradient Pruning (AGP) strategy. The strategy leverages the heterogeneity of gradients by pruning less informative gradients and enhancing the numerical precision of remaining gradients to mitigate gradient variance. Additionally, we propose Sample Channel joint Quantization (SCQ), which utilizes different quantization strategies in the computation of weight gradients and activation gradients to ensure that the method is friendly to low-bitwidth hardware. Finally, we present a framework to deploy our algorithm. For fine-tuning VGGNet-16 and ResNet-18 on multiple datasets, our algorithm achieves an average accuracy improvement of approximately 6%, compared to per-sample quantization. Moreover, our training speedup can reach a maximum of 5.13x compared to full precision training.

8/27/2024

💬

QLLM: Accurate and Efficient Low-Bitwidth Quantization for Large Language Models

Jing Liu, Ruihao Gong, Xiuying Wei, Zhiwei Dong, Jianfei Cai, Bohan Zhuang

Large Language Models (LLMs) excel in NLP, but their demands hinder their widespread deployment. While Quantization-Aware Training (QAT) offers a solution, its extensive training costs make Post-Training Quantization (PTQ) a more practical approach for LLMs. In existing studies, activation outliers in particular channels are identified as the bottleneck to PTQ accuracy. They propose to transform the magnitudes from activations to weights, which however offers limited alleviation or suffers from unstable gradients, resulting in a severe performance drop at low-bitwidth. In this paper, we propose QLLM, an accurate and efficient low-bitwidth PTQ method designed for LLMs. QLLM introduces an adaptive channel reassembly technique that reallocates the magnitude of outliers to other channels, thereby mitigating their impact on the quantization range. This is achieved by channel disassembly and channel assembly, which first breaks down the outlier channels into several sub-channels to ensure a more balanced distribution of activation magnitudes. Then similar channels are merged to maintain the original channel number for efficiency. Additionally, an adaptive strategy is designed to autonomously determine the optimal number of sub-channels for channel disassembly. To further compensate for the performance loss caused by quantization, we propose an efficient tuning method that only learns a small number of low-rank weights while freezing the pre-trained quantized model. After training, these low-rank parameters can be fused into the frozen weights without affecting inference. Extensive experiments on LLaMA-1 and LLaMA-2 show that QLLM can obtain accurate quantized models efficiently. For example, QLLM quantizes the 4-bit LLaMA-2-70B within 10 hours on a single A100-80G GPU, outperforming the previous state-of-the-art method by 7.89% on the average accuracy across five zero-shot tasks.

4/9/2024

💬

Enhancing Computation Efficiency in Large Language Models through Weight and Activation Quantization

Janghwan Lee, Minsoo Kim, Seungcheol Baek, Seok Joong Hwang, Wonyong Sung, Jungwook Choi

Large Language Models (LLMs) are proficient in natural language processing tasks, but their deployment is often restricted by extensive parameter sizes and computational demands. This paper focuses on post-training quantization (PTQ) in LLMs, specifically 4-bit weight and 8-bit activation (W4A8) quantization, to enhance computational efficiency -- a topic less explored compared to weight-only quantization. We present two innovative techniques: activation-quantization-aware scaling (AQAS) and sequence-length-aware calibration (SLAC) to enhance PTQ by considering the combined effects on weights and activations and aligning calibration sequence lengths to target tasks. Moreover, we introduce dINT, a hybrid data format combining integer and denormal representations, to address the underflow issue in W4A8 quantization, where small values are rounded to zero. Through rigorous evaluations of LLMs, including OPT and LLaMA, we demonstrate that our techniques significantly boost task accuracies to levels comparable with full-precision models. By developing arithmetic units compatible with dINT, we further confirm that our methods yield a 2$times$ hardware efficiency improvement compared to 8-bit integer MAC unit.

7/19/2024