Scaling FP8 training to trillion-token LLMs

Read original: arXiv:2409.12517 - Published 9/20/2024 by Maxim Fishman, Brian Chmiel, Ron Banner, Daniel Soudry

Overview

Addresses the challenges of scaling FP8 (8-bit floating-point) training of large language models (LLMs) to datasets of up to 2 trillion tokens
Traces instabilities that appear only in long FP8 training runs to outlier amplification by the SwiGLU activation function
Shows that a 7B-parameter model can be trained in FP8 with results on par with a BF16 baseline

Plain English Explanation

The paper discusses the challenges of using the compact 8-bit floating-point (FP8) format to train large language models (LLMs) on trillions of tokens. LLMs are powerful AI systems that can generate human-like text, but they require significant computational resources to train.

The researchers study the problem of "outlier amplification", where a few extremely large values in the model dominate the computation and lead to numerical instability. The paper traces this amplification to the SwiGLU activation function and finds that it emerges only after prolonged training. Outliers are especially harmful in FP8, which has a much smaller range and coarser precision than wider formats such as FP32 or BF16.
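To make the range limitation concrete, the following is a minimal sketch that simulates FP8 rounding in pure Python. It assumes the E4M3 layout (4 exponent bits, 3 mantissa bits, largest finite value 448); this summary does not say which FP8 variant or scaling scheme the paper uses, and real hardware may differ in details such as the maximum value. The helper names `quantize_e4m3`, `roundtrip` and `rel_error` are hypothetical. The example scales a small tensor so that its largest entry fits into FP8, then measures how much a single outlier degrades a small entry.

```python
import math

FP8_MAX = 448.0   # largest finite E4M3 value (assumed format)
MIN_EXP = -6      # smallest normal exponent in E4M3
MAN_BITS = 3      # mantissa bits

def quantize_e4m3(x):
    """Round x to the nearest value on a simulated E4M3 grid, saturating at +/-448."""
    if x == 0.0:
        return 0.0
    sign = -1.0 if x < 0 else 1.0
    a = min(abs(x), FP8_MAX)
    _, ex = math.frexp(a)            # a = m * 2**ex with 0.5 <= m < 1
    e = max(ex - 1, MIN_EXP)         # below MIN_EXP the grid spacing stays fixed (subnormals)
    step = 2.0 ** (e - MAN_BITS)     # spacing of representable values near a
    return sign * min(round(a / step) * step, FP8_MAX)

def roundtrip(values):
    """Scale so the largest magnitude maps to FP8_MAX, quantize, then undo the scale."""
    amax = max(abs(v) for v in values) or 1.0
    scale = FP8_MAX / amax
    return [quantize_e4m3(v * scale) / scale for v in values]

def rel_error(original, restored):
    return abs(original - restored) / abs(original)

small = [0.01, 0.02, 0.03]
with_outlier = small + [1000.0]      # one illustrative outlier

err_clean = rel_error(small[0], roundtrip(small)[0])
err_outlier = rel_error(with_outlier[0], roundtrip(with_outlier)[0])
print(f"relative error on 0.01 without outlier: {err_clean:.3f}")
print(f"relative error on 0.01 with outlier:    {err_outlier:.3f}")
```

With the outlier present, the shared scale pushes the small entries into the coarse subnormal region of the format, so their rounding error grows.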

The paper demonstrates that LLMs can be trained on trillions of tokens in FP8 while maintaining model performance: a 7B-parameter model trained in FP8 matches its BF16 baseline. This matters because FP8 lowers the cost of training these models; the authors report up to roughly 34% higher throughput than BF16.

Technical Explanation

According to the paper's abstract, the researchers identify the source of the instability and introduce two main techniques for FP8 training of LLMs:

Outlier Amplification in SwiGLU: Extending training to 2 trillion tokens exposes instabilities that shorter runs did not show. The authors trace them to outlier amplification by the SwiGLU activation function, which appears only over prolonged training and is linked to a SwiGLU weight alignment process.
Smooth-SwiGLU: A modification of SwiGLU that keeps FP8 training stable without altering the function's behavior.
FP8 Optimizer Moments: Both Adam optimizer moments are quantized to FP8, which the authors describe as a first.
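The paper's abstract attributes the instability to outlier amplification in SwiGLU and describes Smooth-SwiGLU as a fix that does not alter function behavior, but this summary does not spell out the mechanism. One way to read that description is that values are rescaled before FP8 quantization and the scaling is undone afterwards. The sketch below illustrates that general idea only, by contrasting a single per-tensor scale with per-channel scales on SwiGLU outputs. The per-channel choice, the E4M3 format, the weights and the inputs are all illustrative assumptions, and every helper name is hypothetical; it should not be taken as the authors' exact definition.

```python
import math

FP8_MAX = 448.0  # assumed E4M3 maximum

def quantize_e4m3(x):
    """Simulated E4M3 rounding (3 mantissa bits, min normal exponent -6)."""
    if x == 0.0:
        return 0.0
    sign = -1.0 if x < 0 else 1.0
    a = min(abs(x), FP8_MAX)
    _, ex = math.frexp(a)
    step = 2.0 ** (max(ex - 1, -6) - 3)
    return sign * min(round(a / step) * step, FP8_MAX)

def swish(z):
    return z / (1.0 + math.exp(-z))

def swiglu(x, w1, w2):
    """One output per channel i: (w1[i] . x) * swish(w2[i] . x)."""
    def dot(w):
        return sum(wi * xi for wi, xi in zip(w, x))
    return [dot(a) * swish(dot(b)) for a, b in zip(w1, w2)]

def fake_quant(h, scales):
    """Scale each channel, round to FP8, then divide the scale back out."""
    return [[quantize_e4m3(v * s) / s for v, s in zip(row, scales)] for row in h]

def channel_amax(h, c):
    return max(abs(row[c]) for row in h) or 1.0

# Hypothetical weights: channel 1 has large, aligned w1 and w2 rows,
# so its output grows roughly quadratically and acts as an outlier channel.
w1 = [[0.2, 0.1], [40.0, 40.0]]
w2 = [[0.3, 0.1], [40.0, 40.0]]
batch = [[1.0, 0.5], [0.5, 1.0], [0.3, 0.6]]
h = [swiglu(x, w1, w2) for x in batch]

n_ch = len(h[0])
tensor_amax = max(channel_amax(h, c) for c in range(n_ch))
per_tensor = fake_quant(h, [FP8_MAX / tensor_amax] * n_ch)
per_channel = fake_quant(h, [FP8_MAX / channel_amax(h, c) for c in range(n_ch)])

def max_rel_error(q, c):
    """Worst relative error on channel c across the batch."""
    return max(abs(row[c] - qrow[c]) / abs(row[c]) for row, qrow in zip(h, q))

err_tensor = max_rel_error(per_tensor, 0)
err_channel = max_rel_error(per_channel, 0)
print(f"channel 0 worst error, per-tensor scale:  {err_tensor:.3f}")
print(f"channel 0 worst error, per-channel scale: {err_channel:.3f}")
```

Because each scale is divided back out after rounding, the computed function is unchanged apart from rounding error; only the values seen by the quantizer differ.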

Combining these techniques, the researchers trained a 7B-parameter model in FP8 on 256 Intel Gaudi2 accelerators, achieving results on par with the BF16 baseline while improving throughput by up to about 34%.
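The paper's abstract also reports FP8 quantization of both Adam optimizer moments. As a rough illustration of what storing optimizer state in FP8 can look like, the sketch below keeps both moments as simulated FP8 values with one scale per tensor and dequantizes them at each update. The format pairing (E4M3 for the first moment, the wider-range E5M2 for the second), the per-tensor scaling, the toy quadratic loss and all function names are assumptions made for this example, not details taken from this summary.

```python
import math

def quantize_fp8(x, man_bits, min_exp, max_val):
    """Simulated FP8 rounding with saturation; format given by the parameters."""
    if x == 0.0:
        return 0.0
    sign = -1.0 if x < 0 else 1.0
    a = min(abs(x), max_val)
    _, ex = math.frexp(a)
    step = 2.0 ** (max(ex - 1, min_exp) - man_bits)
    return sign * min(round(a / step) * step, max_val)

# Assumed formats: E4M3 for the first moment, E5M2 (wider range) for the second.
E4M3 = {"man_bits": 3, "min_exp": -6, "max_val": 448.0}
E5M2 = {"man_bits": 2, "min_exp": -14, "max_val": 57344.0}

def store(values, fmt):
    """Quantize a list with one shared scale; returns (codes, scale)."""
    amax = max(abs(v) for v in values) or 1.0
    scale = fmt["max_val"] / amax
    return [quantize_fp8(v * scale, **fmt) for v in values], scale

def load(state):
    codes, scale = state
    return [c / scale for c in codes]

def adam_step(params, grads, m_state, v_state, t, lr=0.1, b1=0.9, b2=0.999, eps=1e-8):
    """Standard Adam update, except that both moments are stored in simulated FP8."""
    m = [b1 * mi + (1 - b1) * g for mi, g in zip(load(m_state), grads)]
    v = [b2 * vi + (1 - b2) * g * g for vi, g in zip(load(v_state), grads)]
    new_params = []
    for p, mi, vi in zip(params, m, v):
        m_hat = mi / (1 - b1 ** t)   # bias correction
        v_hat = vi / (1 - b2 ** t)
        new_params.append(p - lr * m_hat / (math.sqrt(v_hat) + eps))
    return new_params, store(m, E4M3), store(v, E5M2)

def loss(params):
    return sum(p * p for p in params)  # toy quadratic objective

params = [1.0, -2.0, 0.5]
initial_loss = loss(params)
m_state = store([0.0] * 3, E4M3)
v_state = store([0.0] * 3, E5M2)
for t in range(1, 31):
    grads = [2.0 * p for p in params]  # gradient of the toy loss
    params, m_state, v_state = adam_step(params, grads, m_state, v_state, t)
final_loss = loss(params)
print(f"loss before: {initial_loss:.3f}, after 30 steps: {final_loss:.3f}")
```

The second moment holds squared gradients, which span more orders of magnitude than the gradients themselves, which is why the sketch gives it the format with more exponent bits.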

Critical Analysis

The paper provides a comprehensive and well-designed study on the challenges and solutions for scaling FP8 training to massive LLMs. However, the authors acknowledge several caveats and areas for further research:

The results are reported for a 7B-parameter model on Intel Gaudi2 accelerators; the techniques may not transfer directly to other model architectures, hardware or training regimes, and further experimentation is needed to understand their generalizability.
The impact of FP8 training on model quality and downstream task performance is not extensively evaluated, and more thorough testing is required.
The abstract quantifies the throughput gain (up to about 34% over BF16), but a fuller breakdown of memory savings and of the trade-offs involved would be beneficial.

Additionally, the paper does not address potential issues related to the reproducibility of FP8 training or the interpretability of the resulting models, which could be important considerations for real-world applications.

Conclusion

The research presented in this paper represents a significant step forward in training language models on trillion-token datasets using the more compact FP8 format. By identifying outlier amplification in SwiGLU as a source of numerical instability and addressing it, the researchers have demonstrated the feasibility of this approach, which could lead to substantial improvements in the computational efficiency and scalability of large-scale language model training.

The techniques developed in this work have the potential to make cutting-edge AI systems more accessible, as the reduced memory and compute requirements of FP8 training could make it easier to train and deploy these models in resource-constrained environments. Further research and refinement of these methods could have far-reaching implications for the field of natural language processing and the development of advanced AI technologies.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Scaling FP8 training to trillion-token LLMs

Maxim Fishman, Brian Chmiel, Ron Banner, Daniel Soudry

We train, for the first time, large language models using FP8 precision on datasets up to 2 trillion tokens -- a 20-fold increase over previous limits. Through these extended training runs, we uncover critical instabilities in FP8 training that were not observable in earlier works with shorter durations. We trace these instabilities to outlier amplification by the SwiGLU activation function. Interestingly, we show, both analytically and empirically, that this amplification happens only over prolonged training periods, and link it to a SwiGLU weight alignment process. To address this newly identified issue, we introduce Smooth-SwiGLU, a novel modification that ensures stable FP8 training without altering function behavior. We also demonstrate, for the first time, FP8 quantization of both Adam optimizer moments. Combining these innovations, we successfully train a 7B parameter model using FP8 precision on 256 Intel Gaudi2 accelerators, achieving on-par results with the BF16 baseline while delivering up to a $\sim 34\%$ throughput improvement.

9/20/2024

To FP8 and Back Again: Quantifying the Effects of Reducing Precision on LLM Training Stability

Joonhyung Lee, Jeongin Bae, Byeongwook Kim, Se Jung Kwon, Dongsoo Lee

The massive computational costs associated with large language model (LLM) pretraining have spurred great interest in reduced-precision floating-point representations to accelerate the process. As a result, the BrainFloat16 (BF16) precision has become the de facto standard for LLM training, with hardware support included in recent accelerators. This trend has gone even further in the latest processors, where FP8 has recently been introduced. However, prior experience with FP16, which was found to be less stable than BF16, raises concerns as to whether FP8, with even fewer bits than FP16, can be a cost-effective option for LLM training. We argue that reduced-precision training schemes must have similar training stability and hyperparameter sensitivities to their higher-precision counterparts in order to be cost-effective. However, we find that currently available methods for FP8 training are not robust enough to allow their use as economical replacements. This prompts us to investigate the stability of reduced-precision LLM training in terms of robustness across random seeds and learning rates. To this end, we propose new evaluation techniques and a new metric for quantifying loss landscape sharpness in autoregressive language models. By simulating incremental bit reductions in floating-point representations, we analyze the relationship between representational power and training stability with the intent of aiding future research into the field.

5/30/2024

Towards Federated Learning with On-device Training and Communication in 8-bit Floating Point

Bokun Wang, Axel Berg, Durmus Alp Emre Acar, Chuteng Zhou

Recent work has shown that 8-bit floating point (FP8) can be used for efficiently training neural networks with reduced computational overhead compared to training in FP32/FP16. In this work, we investigate the use of FP8 training in a federated learning context. This brings not only the usual benefits of FP8 which are desirable for on-device training at the edge, but also reduces client-server communication costs due to significant weight compression. We present a novel method for combining FP8 client training while maintaining a global FP32 server model and provide convergence analysis. Experiments with various machine learning models and datasets show that our method consistently yields communication reductions of at least 2.9x across a variety of tasks and models compared to an FP32 baseline.

7/4/2024

Language Adaptation on a Tight Academic Compute Budget: Tokenizer Swapping Works and Pure bfloat16 Is Enough

Konstantin Dobler, Gerard de Melo

We investigate continued pretraining of LLMs for language adaptation on a tight academic budget: a setting in which only a few GPUs can be used in parallel, for a heavily constrained duration. We focus on adapting Mistral-7B to German or Arabic and evaluate several techniques to improve efficiency and effectiveness in this setting. Our German models adapted on this tight compute budget underperform compared to the base Mistral-7B, while our Arabic models outperform several baselines, showing that for sufficiently well-represented languages, continued pretraining for specialization is not always helpful. Our main findings focus on training precision and tokenizer swapping. Our results show that pure bfloat16 training is a viable alternative to mixed-precision training, while being much faster when only using a few GPUs. Swapping the tokenizer for a specialized one yields more efficient tokenization and is competitive with the original tokenizer, which already contains some German tokens, but did not significantly increase performance for German. Code and model weights are available on GitHub.

8/29/2024