Accuracy Booster: Enabling 4-bit Fixed-point Arithmetic for DNN Training

Read original: arXiv:2211.10737 - Published 6/3/2024 by Simla Burcu Harma, Ayan Chakraborty, Nicholas Sperry, Babak Falsafi, Martin Jaggi, Yunho Oh

Overview

The paper explores efficient numerical encoding for deep neural network (DNN) training
It proposes a single-level scaled format called Accuracy Booster that uses mixed mantissa sizes to maximize arithmetic density while maintaining training accuracy
Accuracy Booster increases arithmetic density by at least 2.3x over other state-of-the-art (SOTA) formats while achieving SOTA accuracies in 4-bit training

Plain English Explanation

The rapid growth of deep learning has created an unprecedented demand for computing resources to train deep neural network (DNN) models. To increase efficiency, researchers have been looking for ways to reduce the numerical precision these models need during training, an approach known as quantization.

Recent SOTA proposals have advocated for using multi-level scaled narrow bitwidth numerical formats. However, this paper shows that a single-level scaling approach is sufficient to maintain training accuracy while maximizing arithmetic density.

The authors identify Hybrid Block Floating Point (HBFP), a single-level scaled format previously proposed for 8-bit training, as the best candidate to shrink further. They then explore the HBFP design space to find opportunities for even smaller encodings across layers and epochs.

Based on their findings, the authors propose Accuracy Booster, a mixed-mantissa HBFP technique that uses 4-bit mantissas for over 99% of all arithmetic operations during training and 6-bit mantissas only in the last epoch and in the first and last layers. This increases arithmetic density by at least 2.3x over other SOTA formats while achieving SOTA accuracies in 4-bit training.

Technical Explanation

The paper conducts a full-scale exploration of the HBFP design space using mathematical tools to study the interplay among various parameters, such as mantissa size and exponent range. This allows the authors to identify opportunities for even smaller encodings across layers and epochs.
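For readers unfamiliar with block floating point, the idea HBFP builds on can be sketched in a few lines: a block of values shares one exponent, each value keeps only a narrow signed integer mantissa, and dot products reduce to integer multiply-accumulates followed by a single rescale. The sketch below is illustrative only. The function names, the round-to-nearest choice, and the block contents are assumptions made for this example, and the exact block size, rounding mode, and exponent width used in the paper are not given in this summary.

```python
import math

def bfp_quantize(block, mantissa_bits):
    """Quantize a list of floats into (shared_exponent, integer mantissas).

    Each value is approximated by mantissa * 2**(shared_exponent - (mantissa_bits - 1)).
    Illustrative sketch, not the paper's implementation.
    """
    max_abs = max(abs(x) for x in block)
    if max_abs == 0.0:
        return 0, [0] * len(block)
    # frexp returns max_abs = f * 2**e with 0.5 <= f < 1, so every |x| < 2**e.
    _, shared_exp = math.frexp(max_abs)
    # One mantissa bit holds the sign; the remaining bits hold the magnitude.
    scale = 2.0 ** (shared_exp - (mantissa_bits - 1))
    lo = -(2 ** (mantissa_bits - 1))
    hi = 2 ** (mantissa_bits - 1) - 1
    # Round to the nearest integer and clamp to the signed mantissa range.
    mantissas = [min(hi, max(lo, round(x / scale))) for x in block]
    return shared_exp, mantissas

def bfp_dequantize(shared_exp, mantissas, mantissa_bits):
    """Map integer mantissas back to floats using the shared exponent."""
    scale = 2.0 ** (shared_exp - (mantissa_bits - 1))
    return [m * scale for m in mantissas]

def bfp_dot(a, b, mantissa_bits):
    """Dot product of two blocks: integer multiply-accumulate, then one rescale."""
    exp_a, mant_a = bfp_quantize(a, mantissa_bits)
    exp_b, mant_b = bfp_quantize(b, mantissa_bits)
    acc = sum(x * y for x, y in zip(mant_a, mant_b))  # pure integer arithmetic
    return acc * 2.0 ** (exp_a + exp_b - 2 * (mantissa_bits - 1))

# Compare the reconstruction of an example block at 4-bit and 6-bit mantissas.
example = [1.0, 0.3, -0.7, 0.05]
for bits in (4, 6):
    exp, mant = bfp_quantize(example, bits)
    print(bits, exp, mant, bfp_dequantize(exp, mant, bits))
```

Because the exponent is stored once per block rather than once per value, the inner loop of `bfp_dot` works only on small integers, which is where the arithmetic density gain of narrow mantissas comes from. The cost is that small values in a block with a large maximum lose precision, and wider mantissas reduce that loss.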

Based on this analysis, the authors propose Accuracy Booster, a mixed-mantissa HBFP technique. It uses 4-bit mantissas for over 99% of all arithmetic operations in training and switches to 6-bit mantissas only in the last epoch and in the first and last layers. The approach rests on the observation that different layers and different stages of training need different levels of numerical precision.
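The mixed-mantissa rule described above can be written as a small selection function. This is a minimal sketch of the rule as this summary states it, not the authors' code: the function name, its parameters, and the zero-based indexing of layers and epochs are assumptions made for the example.

```python
def mantissa_bits_for(layer_idx, num_layers, epoch, num_epochs,
                      low_bits=4, high_bits=6):
    """Pick the mantissa width for one layer at one epoch.

    Hypothetical helper: uses the wider mantissa in the first and last
    layers and throughout the last epoch, and the narrow one elsewhere.
    Layers and epochs are indexed from zero.
    """
    is_edge_layer = layer_idx == 0 or layer_idx == num_layers - 1
    is_last_epoch = epoch == num_epochs - 1
    return high_bits if (is_edge_layer or is_last_epoch) else low_bits

# Print the schedule for a toy 5-layer model trained for 3 epochs.
for epoch in range(3):
    print(epoch, [mantissa_bits_for(layer, 5, epoch, 3) for layer in range(5)])
```

In a real training loop, a function like this would be consulted before each layer's forward and backward pass to choose the quantization width. How large a share of the arithmetic ends up at 4 bits then depends on how much of the compute sits in the first and last layers and on the number of epochs.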

The authors evaluate Accuracy Booster on various benchmark tasks and show that it increases arithmetic density by at least 2.3x over all other SOTA formats while achieving SOTA accuracies in 4-bit training.

Critical Analysis

The paper provides a thorough exploration of the HBFP design space and offers a compelling solution in the form of Accuracy Booster. However, the authors acknowledge that the optimal numerical format may vary across different models and tasks, and further research is needed to understand the generalizability of their findings.

Additionally, the paper does not address the potential challenges of implementing Accuracy Booster in hardware or the energy/latency tradeoffs associated with the mixed-mantissa approach. These practical considerations will be crucial for the real-world deployment of such techniques.

Overall, the research presented in this paper represents a valuable contribution to the ongoing efforts to make deep learning more computationally efficient. It encourages readers to think critically about the role of numerical representation in model optimization and highlights the importance of tailoring quantization strategies to the specific characteristics of each model.

Conclusion

This paper demonstrates that single-level scaling is sufficient to balance training accuracy and arithmetic density for deep learning models. With its mixed-mantissa approach built on HBFP, Accuracy Booster increases arithmetic density by at least 2.3x over other SOTA formats while maintaining state-of-the-art accuracy in 4-bit training.

This work highlights the potential of targeted numerical representation optimizations to drive greater computational efficiency in deep learning, which will be crucial as models continue to grow in size and complexity. The insights and techniques presented in this paper can inspire further research into hardware-aware quantization strategies and the role of numerical representation in model optimization.

Related Papers

Accuracy Booster: Enabling 4-bit Fixed-point Arithmetic for DNN Training

Simla Burcu Harma, Ayan Chakraborty, Nicholas Sperry, Babak Falsafi, Martin Jaggi, Yunho Oh

The unprecedented demand for computing resources to train DNN models has led to a search for minimal numerical encoding. Recent state-of-the-art (SOTA) proposals advocate for multi-level scaled narrow bitwidth numerical formats. In this paper, we show that single-level scaling is sufficient to maintain training accuracy while maximizing arithmetic density. We identify a previously proposed single-level scaled format for 8-bit training, Hybrid Block Floating Point (HBFP), as the optimal candidate to minimize. We perform a full-scale exploration of the HBFP design space using mathematical tools to study the interplay among various parameters and identify opportunities for even smaller encodings across layers and epochs. Based on our findings, we propose Accuracy Booster, a mixed-mantissa HBFP technique that uses 4-bit mantissas for over 99% of all arithmetic operations in training and 6-bit mantissas only in the last epoch and first/last layers. We show Accuracy Booster enables increasing arithmetic density over all other SOTA formats by at least 2.3x while achieving state-of-the-art accuracies in 4-bit training.

6/3/2024

Accurate Neural Training with 4-bit Matrix Multiplications at Standard Formats

Brian Chmiel, Ron Banner, Elad Hoffer, Hilla Ben Yaacov, Daniel Soudry

Quantization of the weights and activations is one of the main methods to reduce the computational footprint of Deep Neural Network (DNN) training. Current methods enable 4-bit quantization of the forward phase. However, this constitutes only a third of the training process. Reducing the computational footprint of the entire training process requires the quantization of the neural gradients, i.e., the loss gradients with respect to the outputs of intermediate neural layers. Previous works separately showed that accurate 4-bit quantization of the neural gradients needs to (1) be unbiased and (2) have a log scale. However, no previous work aimed to combine both ideas, as we do in this work. Specifically, we examine the importance of having unbiased quantization in quantized neural network training, where to maintain it, and how to combine it with logarithmic quantization. Based on this, we suggest a logarithmic unbiased quantization (LUQ) method to quantize both the forward and backward phases to 4-bit, achieving state-of-the-art results in 4-bit training without the overhead. For example, in ResNet50 on ImageNet, we achieved a degradation of 1.1%. We further improve this to a degradation of only 0.32% after three epochs of high precision fine-tuning, combined with a variance reduction method -- where both these methods add overhead comparable to previously suggested methods.

6/11/2024

Accurate Block Quantization in LLMs with Outliers

Nikita Trukhanov, Ilya Soloveychik

The demand for inference on extremely large scale LLMs has seen enormous growth in recent months. It made evident the colossal shortage of dedicated hardware capable of efficient and fast processing of the involved compute and memory movement. The problem is aggravated by the explosive rise in the lengths of the sequences being processed, since those require efficient on-chip storage of the KV-cache of size proportional to the sequence length. To make the required compute feasible and fit the involved data into available memory, numerous quantization techniques have been proposed that allow accurate quantization for both weights and activations. One of the main recent breakthroughs in this direction was the introduction of the family of Block Floating Point (BFP) formats characterized by a block of mantissas with a shared scale factor. These enable memory-, power-, and compute-efficient hardware support of the tensor operations and provide extremely good quantization accuracy. The main issue preventing widespread application of block formats is caused by the presence of outliers in weights and activations, since those affect the accuracy of the other values in the same block. In this paper, we focus on the most critical problem of limited KV-cache storage. We propose a novel approach enabling usage of low precision BFP formats without compromising the resulting model accuracy. We exploit the common channel-wise patterns exhibited by the outliers to rearrange them in such a way that their quantization quality is significantly improved. The methodology yields 2x savings in the memory footprint without significant degradation of the model's accuracy. Importantly, the rearrangement of channels happens at compile time and thus has no impact on the inference latency.

4/1/2024

Towards Federated Learning with On-device Training and Communication in 8-bit Floating Point

Bokun Wang, Axel Berg, Durmus Alp Emre Acar, Chuteng Zhou

Recent work has shown that 8-bit floating point (FP8) can be used for efficiently training neural networks with reduced computational overhead compared to training in FP32/FP16. In this work, we investigate the use of FP8 training in a federated learning context. This brings not only the usual benefits of FP8 which are desirable for on-device training at the edge, but also reduces client-server communication costs due to significant weight compression. We present a novel method for combining FP8 client training while maintaining a global FP32 server model and provide convergence analysis. Experiments with various machine learning models and datasets show that our method consistently yields communication reductions of at least 2.9x across a variety of tasks and models compared to an FP32 baseline.

7/4/2024