Schrödinger's FP: Dynamic Adaptation of Floating-Point Containers for Deep Learning Training

Read original: arXiv:2204.13666 - Published 5/20/2024 by Miloš Nikolić, Enrique Torres Sanchez, Jiahui Wang, Ali Hadi Zadeh, Mostafa Mahmoud, Ameer Abdelhadi, Kareem Ibrahim, Andreas Moshovos

Overview

Neural network training is dominated by the time and energy needed to transfer tensors to and from memory.
Researchers have explored using narrower data representations to improve energy efficiency and performance.
Previous attempts relied on user-directed trial-and-error to achieve convergence.
This paper presents methods that dynamically adjust the size and format of floating-point containers for activations and weights during training.

Plain English Explanation

Training neural networks requires a lot of time and energy to move data, called tensors, between the computer's memory and the processors doing the calculations. Researchers have been looking for ways to use smaller data types, that is, fewer bits per number, to make this process faster and more energy-efficient.

However, previous attempts required a lot of manual trial-and-error by the user to get the network to still work well with the smaller data types. This paper introduces new methods that can automatically adjust the size and format of the floating-point numbers used for the activations (the values flowing through the network) and the weights (the parameters the network learns) during training.

Two of these methods, Quantum Mantissa and Quantum Exponent, learn that many tensors need only 1 or 2 bits for the mantissa (the significant digits of a number) and 3 or 4 bits for the exponent (the power of two that scales those digits). This reduces the overall size, or "footprint," of the data by 4.74 times. A third method, BitWave, adjusts the bit counts for the whole network based on how the training loss changes, and reduces the footprint by 3.19 times.

Additionally, an optional method called Gecko losslessly compresses the exponent values that remain, raising the average footprint reduction to 5.64 times when combined with Quantum Exponent and 4.56 times when combined with BitWave.

Technical Explanation

The proposed methods dynamically adjust the size and format of the floating-point containers used for activations and weights during neural network training. This is done across three key dimensions:

Determining which datatype (e.g., number of mantissa and exponent bits) to use
Applying this per tensor (layer)
Updating it over the course of training
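
To make the three dimensions concrete, here is a minimal sketch of a per-tensor format table that changes over training, together with the footprint ratio it implies relative to FP32. The layer names, element counts, and chosen formats are hypothetical, and the accounting ignores any metadata a real scheme would store, so the printed ratios are illustrative and are not results from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FloatFormat:
    """A floating-point container: 1 sign bit plus exponent and mantissa bits."""
    exponent_bits: int
    mantissa_bits: int

    @property
    def total_bits(self) -> int:
        return 1 + self.exponent_bits + self.mantissa_bits


FP32 = FloatFormat(exponent_bits=8, mantissa_bits=23)


def footprint_reduction(formats, element_counts, baseline=FP32):
    """Ratio of baseline storage to storage under the given per-tensor formats."""
    baseline_bits = sum(n * baseline.total_bits for n in element_counts.values())
    actual_bits = sum(n * formats[name].total_bits
                      for name, n in element_counts.items())
    return baseline_bits / actual_bits


# Hypothetical tensors and a hypothetical schedule (epoch -> format per tensor).
element_counts = {"conv1": 1000, "fc": 3000}
schedule = {
    0: {"conv1": FloatFormat(8, 7), "fc": FloatFormat(8, 7)},   # 16-bit containers
    10: {"conv1": FloatFormat(4, 2), "fc": FloatFormat(3, 1)},  # narrower later on
}

for epoch, formats in schedule.items():
    print(epoch, round(footprint_reduction(formats, element_counts), 2))
```

Keying the table by both epoch and tensor name mirrors the "which datatype, on which tensor, at what time" framing; the adaptive methods described next would fill in such a table automatically rather than by hand.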

The different properties of the exponent and mantissa values lead the authors to develop tailored approaches for each. Two lossy methods, Quantum Mantissa and Quantum Exponent, aim to eliminate as many bits as possible from the mantissa and exponent, respectively, without affecting accuracy.
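
As a rough illustration of what "eliminating bits" means for a single value, the sketch below truncates the mantissa of an FP32 number and checks whether its exponent fits in a narrow window. Both helper functions, the round-toward-zero truncation, and the idea of a window starting at `lowest_exponent` are assumptions made for this example; the paper's methods may round and bias values differently.

```python
import math
import struct


def truncate_mantissa(x: float, mantissa_bits: int) -> float:
    """Keep the top `mantissa_bits` of the 23-bit FP32 mantissa, zeroing the rest."""
    if not 0 <= mantissa_bits <= 23:
        raise ValueError("mantissa_bits must be in [0, 23]")
    drop = 23 - mantissa_bits
    bits = struct.unpack("<I", struct.pack("<f", x))[0]  # FP32 bit pattern
    bits &= (0xFFFFFFFF >> drop) << drop                 # clear the low mantissa bits
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def exponent_fits(x: float, exponent_bits: int, lowest_exponent: int) -> bool:
    """Check whether x's power-of-two exponent fits a window of 2**exponent_bits values."""
    if x == 0.0:
        return True  # assumption: zero has its own encoding
    exponent = math.frexp(abs(x))[1] - 1  # abs(x) = 1.f * 2**exponent
    return lowest_exponent <= exponent < lowest_exponent + (1 << exponent_bits)


print(truncate_mantissa(1.75, 1))      # 1.5  (binary 1.11 becomes 1.1)
print(truncate_mantissa(1.75, 2))      # 1.75 (two mantissa bits are enough)
print(exponent_fits(1.75, 3, -4))      # True:  exponent 0 lies in [-4, 3]
print(exponent_fits(1024.0, 3, -4))    # False: exponent 10 is out of range
```

The example shows why the two fields call for different treatment: dropping mantissa bits introduces a small relative error in every value, whereas too few exponent bits makes some values unrepresentable altogether.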

These methods leverage the gradient descent algorithm during training to automatically learn the minimal bit lengths for each layer. In contrast, the BitWave method observes changes in the loss function to adaptively adjust the bit lengths network-wide.
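
The two adaptation styles can be contrasted with a small sketch. `sample_bitlength` shows one plausible way to use a real-valued, trainable bitlength in a forward pass: round it stochastically so that its expected value equals the learned value (the gradient update itself is not shown). `LossDrivenBitController` is a hypothetical loss-based heuristic in the spirit of BitWave. The function and class names, the smoothing factor, and the tolerance are all assumptions for illustration and are not taken from the paper.

```python
import math
import random


def sample_bitlength(real_bits: float, rng: random.Random) -> int:
    """Stochastically round a real-valued bitlength to a neighbouring integer."""
    low = math.floor(real_bits)
    frac = real_bits - low
    return low + (1 if rng.random() < frac else 0)


class LossDrivenBitController:
    """Hypothetical network-wide bitlength controller driven by the training loss."""

    def __init__(self, bits=7, min_bits=0, max_bits=23,
                 smoothing=0.9, tolerance=0.02):
        self.bits = bits
        self.min_bits = min_bits
        self.max_bits = max_bits
        self.smoothing = smoothing
        self.tolerance = tolerance
        self.avg_loss = None  # exponential moving average of the loss

    def update(self, loss: float) -> int:
        if self.avg_loss is None:
            self.avg_loss = loss
            return self.bits
        if loss > self.avg_loss * (1 + self.tolerance):
            # Loss is clearly worse than the recent trend: restore precision.
            self.bits = min(self.max_bits, self.bits + 1)
        elif loss < self.avg_loss:
            # Loss is still improving: try a narrower container.
            self.bits = max(self.min_bits, self.bits - 1)
        self.avg_loss = (self.smoothing * self.avg_loss
                         + (1 - self.smoothing) * loss)
        return self.bits


rng = random.Random(0)
samples = [sample_bitlength(2.25, rng) for _ in range(10000)]
print(sum(samples) / len(samples))  # close to 2.25 on average

controller = LossDrivenBitController(bits=4)
print([controller.update(loss) for loss in (1.0, 0.9, 0.8, 1.5)])  # [4, 3, 2, 3]
```

The learned approach keeps one bitlength per layer and lets the optimizer move it, while the controller keeps a single network-wide value and reacts only to the loss signal, which matches the per-layer versus network-wide distinction drawn above.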

Overall, the two machine learning-based methods (Quantum Mantissa and Quantum Exponent) achieve a 4.74x reduction in footprint, while BitWave yields a 3.19x reduction. The optional Gecko method can further improve the compression to 5.64x and 4.56x, respectively, by exploiting the naturally emerging, lop-sided exponent distribution.
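
To see why a lop-sided exponent distribution compresses well without loss, consider the simple base-plus-delta scheme sketched below: each small group stores its minimum exponent once, then stores every value as a narrow offset from that minimum. This is an illustrative stand-in, not necessarily the encoding Gecko uses; the group size, field widths, and sample exponents are assumptions, and the code counts bits rather than packing an actual bitstream.

```python
def encode_group(exponents):
    """Encode a group as (base, delta width in bits, deltas from the base)."""
    base = min(exponents)
    deltas = [e - base for e in exponents]
    width = max(deltas).bit_length()  # bits needed for the largest delta
    return base, width, deltas


def decode_group(base, width, deltas):
    """Recover the original exponents exactly (the scheme is lossless)."""
    return [base + d for d in deltas]


def compressed_bits(exponents, group_size=8, exp_bits=8, width_field_bits=4):
    """Total bits: per group, one base, one width field, and fixed-width deltas."""
    total = 0
    for start in range(0, len(exponents), group_size):
        group = exponents[start:start + group_size]
        _, width, deltas = encode_group(group)
        total += exp_bits + width_field_bits + width * len(deltas)
    return total


# Hypothetical 8-bit exponents clustered around a few nearby values.
exponents = [120, 121, 120, 122, 121, 120, 121, 123,
             119, 119, 120, 119, 121, 120, 119, 120]

raw_bits = 8 * len(exponents)
packed_bits = compressed_bits(exponents)
print(raw_bits, packed_bits)  # 128 56

base, width, deltas = encode_group(exponents[:8])
assert decode_group(base, width, deltas) == exponents[:8]
```

Because neighbouring exponents differ by only a few units, two bits per delta suffice here; a uniform spread over the full exponent range would gain nothing from this scheme, which is why the skewed distribution matters.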

Critical Analysis

The paper presents innovative methods to dynamically adjust the bit representations of activations and weights during training, removing the need for manual, trial-and-error tuning. This is a significant improvement over previous approaches that required user intervention.

However, the paper does not discuss the computational overhead of the proposed methods themselves. While they may reduce the overall memory footprint, the extra computation required to determine the optimal bit widths could offset the performance gains, especially for smaller models or devices with limited computational resources.

Additionally, the paper focuses on fully connected and convolutional layers, but does not address how these methods would apply to other common neural network building blocks, such as attention mechanisms or recurrent layers. Further research is needed to understand the generalizability of these techniques.

Finally, the paper does not provide a comprehensive analysis of the resilience of these methods to model drift or distribution shift over time. It would be valuable to understand how well the dynamically adjusted bit widths adapt to changes in the data or network structure during long-running training or deployment.

Conclusion

This paper presents novel methods to dynamically adjust the bit representations of activations and weights during neural network training, reducing the overall memory footprint by up to 5.64 times. These techniques automatically learn the optimal bit widths for each tensor, eliminating the need for manual tuning.

The proposed approaches could have significant implications for improving the energy efficiency and performance of neural networks, especially on resource-constrained devices. However, further research is needed to understand the computational overhead, generalizability to other model architectures, and resilience to distribution shift over time.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Schrödinger's FP: Dynamic Adaptation of Floating-Point Containers for Deep Learning Training

Miloš Nikolić, Enrique Torres Sanchez, Jiahui Wang, Ali Hadi Zadeh, Mostafa Mahmoud, Ameer Abdelhadi, Kareem Ibrahim, Andreas Moshovos

The transfer of tensors from/to memory during neural network training dominates time and energy. To improve energy efficiency and performance, research has been exploring ways to use narrower data representations. So far, these attempts relied on user-directed trial-and-error to achieve convergence. We present methods that relieve users from this responsibility. Our methods dynamically adjust the size and format of the floating-point containers used for activations and weights during training, achieving adaptivity across three dimensions: i) which datatype to use, ii) on which tensor, and iii) how it changes over time. The different meanings and distributions of exponent and mantissas lead us to tailored approaches for each. We present two lossy pairs of methods to eliminate as many mantissa and exponent bits as possible without affecting accuracy. Quantum Mantissa and Quantum Exponent are machine learning compression methods that tap into the gradient descent algorithm to learn the minimal mantissa and exponent bitlengths on a per-layer granularity. They automatically learn that many tensors can use just 1 or 2 mantissa bits and 3 or 4 exponent bits. Overall, the two machine learning methods reduce the footprint by $4.74\times$. Alternatively, BitWave observes changes in the loss function during training to adjust mantissa and exponent bitlengths network-wide, yielding a $3.19\times$ reduction in footprint. Finally, we present an optional method, Gecko, to exploit the naturally emerging, lop-sided exponent distribution to losslessly compress resulting exponents from Quantum Exponent or BitWave and, on average, improve compression rates to $5.64\times$ and $4.56\times$.

5/20/2024

Shedding the Bits: Pushing the Boundaries of Quantization with Minifloats on FPGAs

Shivam Aggarwal, Hans Jakob Damsgaard, Alessandro Pappalardo, Giuseppe Franco, Thomas B. Preußer, Michaela Blott, Tulika Mitra

Post-training quantization (PTQ) is a powerful technique for model compression, reducing the numerical precision in neural networks without additional training overhead. Recent works have investigated adopting 8-bit floating-point formats (FP8) in the context of PTQ for model inference. However, floating-point formats smaller than 8 bits and their relative comparison in terms of accuracy-hardware cost with integers remain unexplored on FPGAs. In this work, we present minifloats, which are reduced-precision floating-point formats capable of further reducing the memory footprint, latency, and energy cost of a model while approaching full-precision model accuracy. We implement a custom FPGA-based multiply-accumulate operator library and explore the vast design space, comparing minifloat and integer representations across 3 to 8 bits for both weights and activations. We also examine the applicability of various integer-based quantization techniques to minifloats. Our experiments show that minifloats offer a promising alternative for emerging workloads such as vision transformers.

7/8/2024

Towards Federated Learning with On-device Training and Communication in 8-bit Floating Point

Bokun Wang, Axel Berg, Durmus Alp Emre Acar, Chuteng Zhou

Recent work has shown that 8-bit floating point (FP8) can be used for efficiently training neural networks with reduced computational overhead compared to training in FP32/FP16. In this work, we investigate the use of FP8 training in a federated learning context. This brings not only the usual benefits of FP8 which are desirable for on-device training at the edge, but also reduces client-server communication costs due to significant weight compression. We present a novel method for combining FP8 client training while maintaining a global FP32 server model and provide convergence analysis. Experiments with various machine learning models and datasets show that our method consistently yields communication reductions of at least 2.9x across a variety of tasks and models compared to an FP32 baseline.

7/4/2024

1-Bit FQT: Pushing the Limit of Fully Quantized Training to 1-bit

Chang Gao, Jianfei Chen, Kang Zhao, Jiaqi Wang, Liping Jing

Fully quantized training (FQT) accelerates the training of deep neural networks by quantizing the activations, weights, and gradients into lower precision. To explore the ultimate limit of FQT (the lowest achievable precision), we make a first attempt to 1-bit FQT. We provide a theoretical analysis of FQT based on Adam and SGD, revealing that the gradient variance influences the convergence of FQT. Building on these theoretical results, we introduce an Activation Gradient Pruning (AGP) strategy. The strategy leverages the heterogeneity of gradients by pruning less informative gradients and enhancing the numerical precision of remaining gradients to mitigate gradient variance. Additionally, we propose Sample Channel joint Quantization (SCQ), which utilizes different quantization strategies in the computation of weight gradients and activation gradients to ensure that the method is friendly to low-bitwidth hardware. Finally, we present a framework to deploy our algorithm. For fine-tuning VGGNet-16 and ResNet-18 on multiple datasets, our algorithm achieves an average accuracy improvement of approximately 6%, compared to per-sample quantization. Moreover, our training speedup can reach a maximum of 5.13x compared to full precision training.

8/27/2024