CroissantLLM: A Truly Bilingual French-English Language Model

Read original: arXiv:2402.00786 - Published 4/1/2024 by Manuel Faysse, Patrick Fernandes, Nuno M. Guerreiro, Ant'onio Loison, Duarte M. Alves, Caio Corro, Nicolas Boizard, Jo~ao Alves, Ricardo Rei, Pedro H. Martins and 6 others

CroissantLLM: A Truly Bilingual French-English Language Model

Introduction

The paper introduces several contributions to address limitations in existing large language models (LLMs):

A highly-curated, diverse 303B token French corpus spanning various sources like internet data, literature, transcripts, legal documents, scientific articles, etc. This is claimed to be the largest high-quality multi-source French corpus released for language modeling.
Training CroissantLLM, a truly bilingual English-French language model with a 1:1 ratio of English to French data. This aims to create a model less skewed towards English performance or cultural biases compared to existing multilingual models where English dominates.
FrenchBench: A novel benchmark to evaluate LLMs on various tasks assessing factual knowledge, generative capabilities, and language understanding in French, using open and newly annotated data.
Releasing high-performing, inference-optimal CroissantLLM models under open licenses, trained on a large 3000:1 token-to-parameter ratio. Performance continues improving with lengthened pre-training beyond typical thresholds. The models, checkpoints, and resources are provided for industrial deployment and research purposes, promoting transparency.

The paper does not provide details on specific sections like lack of transparency in existing models, bias towards English, or challenges in using LLMs at scale.

Data

The paper describes the construction of a large bilingual (English-French) training corpus for a language model. The corpus aims to reduce biases present in existing models trained primarily on English data. Key points about the corpus:

French Data: Web pages, legal/administrative documents, cultural materials (books, poems, podcasts), encyclopedic data from French Wikipedia, and industrial data from company documents.
English Data: Drawn largely from the SlimPajama corpus without copyrighted documents. Includes web pages, scientific articles, books, and canary samples to test memorization.
Code Data: Code from various programming languages (Python, Java, C++, etc.) and Jupyter notebooks.
Parallel Data: 400 million English-French translated sentence pairs from domains like OPUS corpus, academic theses, and song lyrics. Rigorous filtering applied.

The diverse sources spanning different domains, genres, and languages aim to create a more balanced, less biased corpus for training multilingual language models with varied knowledge. Explicit content filtering was avoided for song lyrics to preserve cultural nuances.

Training

The paper describes the process of training a bilingual (English-French) language model called CroissantLLM, with a focus on optimizing performance across both languages while being resource-efficient.

The model architecture is based on the Llama transformer, with four different model sizes ranging from 100M to 1.2B parameters. The tokenizer was trained on a 100B token corpus containing English, French, and code data, aiming for improved token efficiency in French compared to existing tokenizers.

To determine the optimal ratio of English and French data for training, the authors leveraged scaling laws by training smaller models on different data mixes and fitting a joint scaling law to predict performance trade-offs for larger models. This analysis revealed that equal ratios of English and French data minimized performance hits across both languages.

The final dataset comprised 1.1T unique tokens, with upsampling of French text and parallel data. The model was trained on a supercomputer using the Megatron-DeepSpeed framework, achieving high training efficiency.

The authors discuss the environmental impact, estimating a carbon footprint of around 3.36 tons of CO2 for the training process. They argue that training a smaller model on more tokens results in an inference-optimized model that is more energy-efficient during deployment.

Evaluation Benchmarks

The paper describes the evaluation of large language models on various benchmarks, both in English and French. Key points:

English Benchmarks:

Standard LLM benchmarks like HellaSwag, PiQA, SciQ, Arc-C, and MT-Bench are used to assess reasoning, common sense, science knowledge, and multi-turn conversation abilities.

French Benchmarks:

FrenchBench is introduced as a novel evaluation benchmark for French language models.
It comprises generative tasks (title generation, summarization, question answering) using datasets like FQuAD, MultiFQuAD, OrangeSum.
It also has multiple-choice tasks for grammar, vocabulary, reasoning (French HellaSwag, Arc-E), and biases.

Other Tasks:

MT-Bench is adapted to French through translation and human review.
Machine translation tasks between French and English using WMT14 data.
French Trivia assesses cultural knowledge with questions formulated in English.
Belebele reading comprehension dataset with parallel English and French splits.

The evaluation aims for transparency, with open-sourced code and publicly available data. It covers a broad range of capabilities to assess truly bilingual pre-training.

Benchmark results

The provided text discusses the evaluation of CroissantLLM, a 1.1B parameter language model trained on English, French, and code data. Key points:

CroissantLLM achieves strong performance compared to other models of similar size on English and French benchmarks for classification and generation tasks.
On English tasks, it slightly trails behind the TinyLlama model which was trained only on English data, but outperforms other similar-sized monolingual and multilingual models.
On French tasks, CroissantLLM significantly outperforms existing monolingual French models and models of similar size. Its performance is on par with the larger 3B Bloom model.
The model continues improving throughout training, with performance emerging more gradually early on, reflecting previous findings on emergent abilities of large language models.
Overall, CroissantLLM displays top performance in its model size category across languages and benchmarks, even edging out some larger models like Bloom 3B, though still behind the strongest 7B models like Llama and Mistral.

The text provides comprehensive evaluation results and analysis comparing CroissantLLM's capabilities to various baselines across languages and benchmarks.

Foundation Model Transparency Index

The paper evaluates the transparency of their work using the Stanford Transparency Index and obtains a score of 81%, higher than most proprietary and open-source models. The upstream categories, including data, compute, and methods, score 88% due to the open-source nature and extensive disclosure of training information. However, difficulties in identifying personal information and licenses prevent a full score.

For the model categories, including model information, risks, limitations, trustworthiness, and mitigations, the work scores 73% on average. This is due to the wide array of reproducible evaluation results reported, but hindered by the lack of third-party external evaluation and a not-as-extensive evaluation of potential harms.

The downstream categories, referring to usage policies, user statistics, distribution, documentation, and model impact assessment, score 80%. The open-access nature and distribution channel avoid transparency pitfalls related to restricted usage policies and user information processing. However, assessing the model's impact remains difficult until its release.

Figure 8: Aggregated FMTI scores by major dimension of transparency. CroissantLLM scores are calculated by the authors, the rest by (Bommasani et al., 2023).

The provided text discusses the ethics statement and other transparency-related details of the CroissantLLM model. Key points:

The aim is to enable studying the impact of language distribution in multilingual pre-training datasets and model biases to inform future model development.
The models and resources are openly released with no usage restrictions under an MIT license. No monitoring of individual usage is done.
For risk mitigation, the base model is released as-is without methods beyond data curation to remove toxic content. The chat model variant includes instructions to avoid certain prompts.
Experiments confirmed only extreme cases of data repetition led to memorization, allowing confident release without fear of private data leakage.
A risk assessment finds the small model scale allows transparent release without major risk of misuse beyond existing larger models.
Contributions from various authors across academia and industry are acknowledged.
The methodology for the FMTI (Foundation Model Transparency Index) evaluation is discussed, with a conservative self-scoring approach.

No first-person language or apologies were used in the summary.

Ethics Statement

Here is a summary of the provided section:

The work aims to enable studying the impact of language distribution in pre-training datasets for multilingual language models. Valuable resources are offered to understand induced model behaviors and biases in this multilingual setup, to inform future model and dataset development for better inclusivity.

The models and artifacts are openly released with an MIT license and no usage restrictions. Users are responsible for any generated content, and no redress mechanisms exist for harmful disclosures. All training artifacts are transparently released for full transparency.

The chat model variant includes alignment instructions to avoid responding to certain prompts. Data leakage experiments confirm only extreme cases of data repetition lead to memorization within the model.

A risk assessment was conducted, including staged releases to individuals for testing, finding no warning flags for toxic or harmful behavior. Compliance with transparency guidelines is prioritized.

Contributions from authors at academic and industry partners are detailed, as well as acknowledgments. The research involved compute resources from supercomputers through grants.

ontributors and Acknowledgements

The provided text summarizes the contributions of different authors and researchers involved in the project. It lists the core team members (Manuel, Patrick, Nuno, and Pierre) and their main roles, such as project coordination, data collection, design decisions, model evaluation, scaling law efforts, finetuning, and securing compute grants.

Several other contributors are mentioned for their specific roles, such as developing the pre-training codebase, constructing parallel data, assisting with finetuning and data collection efforts, and providing feedback.

The text also mentions the affiliations of the core authors, which include academic institutions (CentraleSupélec, Instituto Superior Técnico de Lisboa, Sorbonne Université, Imperial College London) and industrial partners (Illuin Technology, Unbabel, Equall). It lists the compute resources used for training, including the Jean Zay supercomputer and the Adastra cluster.

Additionally, the work received support from various funding sources, including the EU's Horizon Europe Research and Innovation Actions, the Center for Responsible AI, and the Fundação para a Ciência e Tecnologia.

Finally, the text expresses gratitude to individuals who provided assistance, such as technical support, grant obtention, and interesting discussions, without listing them as authors.

Appendix A FMTI

The provided text discusses the methodology and disclaimers for the FMTI (Foundation Model Transparency Index) grid, which is meant to assess foundation models. The key points are:

The FMTI grid is primarily focused on evaluating base models, as fine-tuning for instructions or chat datasets is now widely accessible and can be done by individuals, leading to different legal and copyright implications.
The authors acknowledge the potential bias in self-evaluating their model and have taken a conservative approach in attributing points for the FMTI criteria.
Transparency evaluation should ideally be done by an independent third party, and the authors are open to discussions for potential scoring modifications.
The technical report includes additional information to validate certain criteria, which may give the authors an advantage compared to work published before the index's release.
The FMTI scores are considered a reflection of the authors' compliance efforts to the listed transparency principles, rather than scores comparable to larger foundation models with different usage objectives.

The text does not provide a methodology section for the training or evaluation process of the Croissant model itself.

Figure 9: Aggregated FMTI

Appendix B Additional data details

The paper describes the datasets used for training the language model. Here are the key points:

French Data:

A mix of 24 French datasets totaling 1258.7 GB and 303.5 billion tokens
Includes data from sources like Wikipedia, legislation, books, and open data

English Data:

3 English datasets totaling 2351.1 GB and 655.6 billion tokens
Includes Project Gutenberg books and web crawled data

Code Data:

17 code datasets across languages like Java, Python, C++, SQL totaling 366.9 GB and 141.4 billion tokens
Includes code from GitHub repositories and coding contests

Parallel Data:

3 parallel French-English datasets totaling 113.9 GB and 35.8 billion tokens
Includes data from translation services and song lyrics

Scaling Law Corpus:

A 50 billion token subset sampled from the French, English and code datasets
Used for scaling law experiments to study language distribution impacts

The datasets cover a diverse range of sources in different languages and domains like natural text, code, and translations. Their sizes and token counts are provided.

Appendix C Chat examples

Here is a summary of the provided text:

The text describes the Principality of Sealand, an unrecognized micronation located on an offshore platform called Roughs Tower in the North Sea near the coast of England. Built by the British during World War II, the decommissioned platform has been occupied since 1967 by the family and associates of Paddy Roy Bates, who seized control from pirate radio broadcasters with the intent of setting up his own station. Bates and his group have repelled attempted incursions from rival pirate radio stations and the British navy using firearms and petrol bombs. Since 1987, when the UK extended its territorial waters, the platform lies within British territory.

The text then provides some suggestions for winter activities in Marseille, France, including visiting the old port area, taking boat tours of nearby islands and inlets, hiking in a national park, seeing a cathedral, and going to a museum about Marseille's history and culture.

Next, it gives an example cover letter in French for applying to a part-time student bartending job, highlighting relevant experience, scheduling availability, and enthusiasm for the role.

The text also includes Python code for implementing a depth-first search algorithm on a graph data structure using a stack.

Finally, it provides some general tips for addressing back issues, such as staying active, eating a healthy diet, managing stress, getting regular checkups, and seeking professional medical help if needed. However, it disclaims providing any actual medical advice.

Appendix D Results

The methodology section outlines the evaluation process for the base models using the LM Evaluation Harness framework. For classification tasks, the answer with the highest log likelihood when concatenated with the prompt is chosen. For generative tasks, greedy sampling is used without higher temperature values to avoid stochasticity. A maximum of 2000 samples are used per benchmark task to reduce evaluation time.

The perplexity results compare CroissantLLM to PagnolXL, a 1.5B French language model, on the French Treebank dataset. CroissantLLM achieves a perplexity of 10.37, outperforming PagnolXL's reported 16.40.

The MT-Bench results show that smaller models struggle with reasoning tasks and constrained generation in both Turn 1 and Turn 2 prompts.

For bias assessment, the CROWS benchmark covering stereotypes across different bias types is used. CroissantLLM is found to be in line with or slightly less biased than other models, notably in French.

There are no details provided for the sections on Code Data mix or Arithmetic Tasks.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Follow @aimodelsfyi on 𝕏 →

Related Papers

CroissantLLM: A Truly Bilingual French-English Language Model

Manuel Faysse, Patrick Fernandes, Nuno M. Guerreiro, Ant'onio Loison, Duarte M. Alves, Caio Corro, Nicolas Boizard, Jo~ao Alves, Ricardo Rei, Pedro H. Martins, Antoni Bigata Casademunt, Franc{c}ois Yvon, Andr'e F. T. Martins, Gautier Viaud, C'eline Hudelot, Pierre Colombo

We introduce CroissantLLM, a 1.3B language model pretrained on a set of 3T English and French tokens, to bring to the research and industrial community a high-performance, fully open-sourced bilingual model that runs swiftly on consumer-grade local hardware. To that end, we pioneer the approach of training an intrinsically bilingual model with a 1:1 English-to-French pretraining data ratio, a custom tokenizer, and bilingual finetuning datasets. We release the training dataset, notably containing a French split with manually curated, high-quality, and varied data sources. To assess performance outside of English, we craft a novel benchmark, FrenchBench, consisting of an array of classification and generation tasks, covering various orthogonal aspects of model performance in the French Language. Additionally, rooted in transparency and to foster further Large Language Model research, we release codebases, and dozens of checkpoints across various model sizes, training data distributions, and training steps, as well as fine-tuned Chat models, and strong translation models. We evaluate our model through the FMTI framework, and validate 81 % of the transparency criteria, far beyond the scores of even most open initiatives. This work enriches the NLP landscape, breaking away from previous English-centric work in order to strengthen our understanding of multilinguality in language models.

4/1/2024

🏷️

Bilingual Adaptation of Monolingual Foundation Models

Gurpreet Gosal (Charles), Yishi Xu (Charles), Gokul Ramakrishnan (Charles), Rituraj Joshi (Charles), Avraham Sheinin (Charles), Zhiming (Charles), Chen, Biswajit Mishra, Natalia Vassilieva, Joel Hestness, Neha Sengupta, Sunil Kumar Sahu, Bokang Jia, Onkar Pandit, Satheesh Katipomu, Samta Kamboj, Samujjwal Ghosh, Rahul Pal, Parvez Mullah, Soundar Doraiswamy, Mohamed El Karim Chami, Preslav Nakov

We present an efficient method for adapting a monolingual Large Language Model (LLM) to another language, addressing challenges of catastrophic forgetting and tokenizer limitations. We focus this study on adapting Llama 2 to Arabic. Our two-stage approach begins with expanding the vocabulary and training only the embeddings matrix, followed by full model continual pre-training on a bilingual corpus. By continually pre-training on a mix of Arabic and English corpora, the model retains its proficiency in English while acquiring capabilities in Arabic. Our approach results in significant improvements in Arabic and slight enhancements in English, demonstrating cost-effective cross-lingual transfer. We perform ablations on embedding initialization techniques, data mix ratios, and learning rates and release a detailed training recipe. To demonstrate generalizability of this approach we also adapted Llama 3 8B to Arabic and Llama 2 13B to Hindi.

7/29/2024

Crosslingual Capabilities and Knowledge Barriers in Multilingual Large Language Models

Lynn Chua, Badih Ghazi, Yangsibo Huang, Pritish Kamath, Ravi Kumar, Pasin Manurangsi, Amer Sinha, Chulin Xie, Chiyuan Zhang

Large language models (LLMs) are typically multilingual due to pretraining on diverse multilingual corpora. But can these models relate corresponding concepts across languages, effectively being crosslingual? This study evaluates six state-of-the-art LLMs on inherently crosslingual tasks. We observe that while these models show promising surface-level crosslingual abilities on machine translation and embedding space analyses, they struggle with deeper crosslingual knowledge transfer, revealing a crosslingual knowledge barrier in both general (MMLU benchmark) and domain-specific (Harry Potter quiz) contexts. We observe that simple inference-time mitigation methods offer only limited improvement. On the other hand, we propose fine-tuning of LLMs on mixed-language data, which effectively reduces these gaps, even when using out-of-domain datasets like WikiText. Our findings suggest the need for explicit optimization to unlock the full crosslingual potential of LLMs. Our code is publicly available at https://github.com/google-research/crosslingual-knowledge-barriers.

6/26/2024

💬

Investigating the translation capabilities of Large Language Models trained on parallel data only

Javier Garc'ia Gilabert, Carlos Escolano, Aleix Sant Savall, Francesca De Luca Fornaciari, Audrey Mash, Xixian Liao, Maite Melero

In recent years, Large Language Models (LLMs) have demonstrated exceptional proficiency across a broad spectrum of Natural Language Processing (NLP) tasks, including Machine Translation. However, previous methods predominantly relied on iterative processes such as instruction fine-tuning or continual pre-training, leaving unexplored the challenges of training LLMs solely on parallel data. In this work, we introduce PLUME (Parallel Language Model), a collection of three 2B LLMs featuring varying vocabulary sizes (32k, 128k, and 256k) trained exclusively on Catalan-centric parallel examples. These models perform comparably to previous encoder-decoder architectures on 16 supervised translation directions and 56 zero-shot ones. Utilizing this set of models, we conduct a thorough investigation into the translation capabilities of LLMs, probing their performance, the impact of the different elements of the prompt, and their cross-lingual representation space.

6/14/2024