Language Adaptation on a Tight Academic Compute Budget: Tokenizer Swapping Works and Pure bfloat16 Is Enough

Read original: arXiv:2408.15793 - Published 8/29/2024 by Konstantin Dobler, Gerard de Melo

Overview

Provides a plain English summary of a technical research paper
Covers the key ideas, experiment design, architecture, and insights
Discusses the paper's caveats, limitations, and potential areas for further research
Encourages critical thinking about the research and its implications

Plain English Explanation

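The research paper studies how to adapt a large language model (Mistral-7B) to German or Arabic when only a few GPUs are available for a limited time. One of its two main findings concerns bfloat16, a 16-bit number format. Compared with the common 16-bit floating-point format (called float16), bfloat16 covers a much wider range of values at the cost of lower precision, while being just as compact. The researchers show that training entirely in bfloat16 is a viable alternative to mixed-precision training and is much faster when only a few GPUs are used. The other main finding is that swapping the tokenizer for a language-specific one makes tokenization more efficient while staying competitive with the original tokenizer.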

The key idea is that bfloat16 has a different way of representing numbers compared to float16. It allocates more bits to the exponent (the power of 2): 8 instead of 5. It allocates fewer bits to the mantissa (the fractional part): 7 instead of 10. This allows bfloat16 to represent roughly the same range of numbers as 32-bit floats, but with coarser precision than float16. The wider range helps avoid overflow and underflow, which is important when training neural networks.
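The range-versus-precision trade-off can be checked with a few lines of Python. This is a minimal sketch using only the standard library, not code from the paper. The helper names are hypothetical, and bfloat16 is emulated by truncating a float32 to its top 16 bits, which is an assumption made for simplicity (real hardware typically rounds to nearest instead).

```python
import struct


def bf16_round_trip(x: float) -> float:
    """Emulate bfloat16 by keeping only the top 16 bits of the float32 encoding.

    Assumption: truncation is used for simplicity; hardware usually rounds.
    """
    (bits32,) = struct.unpack(">I", struct.pack(">f", x))
    bits16 = bits32 >> 16  # sign (1) + exponent (8) + mantissa (7)
    (value,) = struct.unpack(">f", struct.pack(">I", bits16 << 16))
    return value


def f16_round_trip(x: float) -> float:
    """Round x to IEEE float16 (1 sign, 5 exponent, 10 mantissa bits) and back."""
    (value,) = struct.unpack(">e", struct.pack(">e", x))
    return value


def fits_in_float16(x: float) -> bool:
    """Return False if x is too large to be packed as a finite float16."""
    try:
        struct.pack(">e", x)
        return True
    except OverflowError:
        return False


if __name__ == "__main__":
    # Precision: float16 keeps more mantissa bits near 1.0 than bfloat16.
    print("float16 (1.001):", f16_round_trip(1.001))
    print("bfloat16(1.001):", bf16_round_trip(1.001))
    # Range: 1e5 exceeds float16's largest finite value (65504) but not bfloat16's.
    print("1e5 fits in float16:", fits_in_float16(1e5))
    print("bfloat16(1e5):", bf16_round_trip(1e5))
```

Values near 1.0 survive float16 more faithfully, while large magnitudes survive only in the bfloat16 emulation, which illustrates the exponent/mantissa split described above.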

Technical Explanation

The paper first provides background on precision types like float16 and bfloat16. It explains the numerical properties of these formats and how they differ in terms of range and precision.

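The researchers then describe experiments in which they continue pretraining Mistral-7B on German or Arabic using only a few GPUs for a heavily constrained duration. They compare pure bfloat16 training against mixed-precision training and find that pure bfloat16 is a viable alternative that is much faster in this setting. They also swap the original tokenizer for a specialized one, which yields more efficient tokenization and remains competitive with the original tokenizer, but does not significantly increase performance for German.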
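For a rough sense of why the choice of training precision matters on a small number of GPUs, the sketch below compares the memory needed for model and optimizer state under two layouts. The byte counts per parameter are a common rule of thumb for Adam-style optimizers and are assumptions for illustration, not figures reported in the paper. The parameter count of 7 billion is a round placeholder, and the names are hypothetical.

```python
# Assumed bytes per parameter for each piece of training state.
# These layouts are illustrative assumptions, not taken from the paper.
MIXED_PRECISION = {
    "fp32 master weights": 4,
    "bf16 weight copy": 2,
    "bf16 gradients": 2,
    "fp32 Adam moments (2x)": 8,
}
PURE_BF16 = {
    "bf16 weights": 2,
    "bf16 gradients": 2,
    "bf16 Adam moments (2x)": 4,
}


def training_state_gb(n_params: int, layout: dict) -> float:
    """Memory for weights, gradients, and optimizer state in GB (1 GB = 1e9 bytes).

    Activations and framework overhead are deliberately ignored.
    """
    return n_params * sum(layout.values()) / 1e9


if __name__ == "__main__":
    n_params = 7_000_000_000  # round placeholder for a 7B-parameter model
    print("mixed precision:", training_state_gb(n_params, MIXED_PRECISION), "GB")
    print("pure bfloat16:  ", training_state_gb(n_params, PURE_BF16), "GB")
```

Under these assumptions the pure bfloat16 layout needs about half the state memory, which is one plausible reason it can be attractive when only a few GPUs are available. The actual savings depend on the optimizer and framework used.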

Critical Analysis

The paper acknowledges that bfloat16 may not be suitable for all applications, as the reduced precision could be problematic in some cases. The researchers also note that further research is needed to fully understand the tradeoffs and limitations of this format.

One potential concern is that the paper does not provide a detailed comparison of bfloat16 to other reduced-precision formats, such as FP8, which may offer similar benefits. A more comprehensive analysis of the relative merits of these formats would be helpful for researchers and practitioners.

Conclusion

Overall, the paper shows that pure bfloat16 training and tokenizer swapping are practical options for adapting large language models to a new language on a tight compute budget. The approach has limitations: the German models adapted on this budget underperform the base Mistral-7B, while the Arabic models outperform several baselines, which suggests that continued pretraining is not always helpful for languages that are already well represented. This work contributes to the ongoing efforts to develop efficient and effective training methods for researchers with limited compute.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Language Adaptation on a Tight Academic Compute Budget: Tokenizer Swapping Works and Pure bfloat16 Is Enough

Konstantin Dobler, Gerard de Melo

We investigate continued pretraining of LLMs for language adaptation on a tight academic budget: a setting in which only a few GPUs can be used in parallel, for a heavily constrained duration. We focus on adapting Mistral-7B to German or Arabic and evaluate several techniques to improve efficiency and effectiveness in this setting. Our German models adapted on this tight compute budget underperform compared to the base Mistral-7B, while our Arabic models outperform several baselines, showing that for sufficiently well-represented languages, continued pretraining for specialization is not always helpful. Our main findings focus on training precision and tokenizer swapping. Our results show that pure bfloat16 training is a viable alternative to mixed-precision training, while being much faster when only using a few GPUs. Swapping the tokenizer for a specialized one yields more efficient tokenization and is competitive with the original tokenizer, which already contains some German tokens, but did not significantly increase performance for German. Code and model weights are available on GitHub.

8/29/2024

Accelerating Multilingual Language Model for Excessively Tokenized Languages

Jimin Hong, Gibbeum Lee, Jaewoong Cho

Recent advancements in large language models (LLMs) have remarkably enhanced performances on a variety of tasks in multiple languages. However, tokenizers in LLMs trained primarily on English-centric corpora often overly fragment a text into character or Unicode-level tokens in non-Roman alphabetic languages, leading to inefficient text generation. We introduce a simple yet effective framework to accelerate text generation in such languages. Our approach involves employing a new language model head with a vocabulary set tailored to a specific target language for a pre-trained LLM. This is followed by fine-tuning the new head while incorporating a verification step to ensure the model's performance is preserved. We show that this targeted fine-tuning, while freezing other model parameters, effectively reduces token fragmentation for the target language. Our extensive experiments demonstrate that the proposed framework increases the generation speed by a factor of 1.7 while maintaining the performance of pre-trained multilingual models on target monolingual tasks.

8/7/2024

Bilingual Adaptation of Monolingual Foundation Models

Gurpreet Gosal, Yishi Xu, Gokul Ramakrishnan, Rituraj Joshi, Avraham Sheinin, Zhiming (Charles) Chen, Biswajit Mishra, Natalia Vassilieva, Joel Hestness, Neha Sengupta, Sunil Kumar Sahu, Bokang Jia, Onkar Pandit, Satheesh Katipomu, Samta Kamboj, Samujjwal Ghosh, Rahul Pal, Parvez Mullah, Soundar Doraiswamy, Mohamed El Karim Chami, Preslav Nakov

We present an efficient method for adapting a monolingual Large Language Model (LLM) to another language, addressing challenges of catastrophic forgetting and tokenizer limitations. We focus this study on adapting Llama 2 to Arabic. Our two-stage approach begins with expanding the vocabulary and training only the embeddings matrix, followed by full model continual pre-training on a bilingual corpus. By continually pre-training on a mix of Arabic and English corpora, the model retains its proficiency in English while acquiring capabilities in Arabic. Our approach results in significant improvements in Arabic and slight enhancements in English, demonstrating cost-effective cross-lingual transfer. We perform ablations on embedding initialization techniques, data mix ratios, and learning rates and release a detailed training recipe. To demonstrate generalizability of this approach we also adapted Llama 3 8B to Arabic and Llama 2 13B to Hindi.

7/29/2024

Scaling FP8 training to trillion-token LLMs

Maxim Fishman, Brian Chmiel, Ron Banner, Daniel Soudry

We train, for the first time, large language models using FP8 precision on datasets up to 2 trillion tokens -- a 20-fold increase over previous limits. Through these extended training runs, we uncover critical instabilities in FP8 training that were not observable in earlier works with shorter durations. We trace these instabilities to outlier amplification by the SwiGLU activation function. Interestingly, we show, both analytically and empirically, that this amplification happens only over prolonged training periods, and link it to a SwiGLU weight alignment process. To address this newly identified issue, we introduce Smooth-SwiGLU, a novel modification that ensures stable FP8 training without altering function behavior. We also demonstrate, for the first time, FP8 quantization of both Adam optimizer moments. Combining these innovations, we successfully train a 7B parameter model using FP8 precision on 256 Intel Gaudi2 accelerators, achieving on-par results with the BF16 baseline while delivering up to a $\sim 34\%$ throughput improvement.

9/20/2024