FuxiTranyu: A Multilingual Large Language Model Trained with Balanced Data

Read original: arXiv:2408.06273 - Published 8/14/2024 by Haoran Sun, Renren Jin, Shaoyang Xu, Leiyu Pan, Supryadi, Menglong Cui, Jiangcun Du, Yikun Lei, Lei Yang, Ling Shi, Juesi Xiao, Shaolin Zhu, Deyi Xiong

Overview

FuxiTranyu is an open-source multilingual large language model; its base model, FuxiTranyu-8B, has 8 billion parameters and was trained from scratch.
It targets balanced, high-quality understanding and generation across 43 natural languages and 16 programming languages.
Its pretraining data, about 600 billion tokens, was balanced across languages so that a few high-resource languages do not dominate.

Plain English Explanation

FuxiTranyu is a large language model that can understand and generate text in many languages. Unlike earlier models that focused mainly on high-resource languages such as English, it was trained on data deliberately balanced across a wide range of languages. This helps the model perform well across many languages, not just the most common ones.

The researchers recognized that many existing language models are biased towards a few dominant languages, which limits their usefulness in applications that need multilingual capabilities. By carefully curating and balancing the training data, they built a model intended to handle a much broader range of languages.

This balanced approach to training large language models like FuxiTranyu is an important step forward in making these powerful AI systems more accessible and useful for people from diverse linguistic backgrounds. It helps break down barriers and enables more inclusive and equitable access to advanced language technologies.

Technical Explanation

The FuxiTranyu model was trained using a novel approach to balance the data across multiple languages. This is an important consideration, as many existing large language models tend to be biased towards high-resource languages like English, which can limit their performance and applicability in more diverse linguistic contexts.

To address this, the researchers assembled a balanced multilingual data repository of about 600 billion tokens covering 43 natural languages and 16 programming languages. The mix was deliberately balanced so that no single language dominates training.
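
One common way to keep high-resource languages from dominating a multilingual mix is temperature-based sampling, where each language is drawn with probability proportional to its data share raised to an exponent below 1. The sketch below illustrates that general idea only; it is not confirmed to be the procedure the FuxiTranyu authors used, and the function name, the `alpha` value, and the token counts are all hypothetical.

```python
def sampling_probs(token_counts, alpha=0.3):
    """Return per-language sampling probabilities p_i proportional to (n_i / N) ** alpha.

    alpha = 1.0 reproduces the raw data proportions; alpha = 0.0 gives a
    uniform distribution over languages. Values in between up-weight
    low-resource languages relative to their raw share.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    total = sum(token_counts.values())
    weights = {lang: (n / total) ** alpha for lang, n in token_counts.items()}
    norm = sum(weights.values())
    return {lang: w / norm for lang, w in weights.items()}


# Hypothetical token counts in billions; NOT figures from the paper.
counts = {"en": 900.0, "zh": 300.0, "sw": 3.0}

raw = sampling_probs(counts, alpha=1.0)   # proportional to raw size
flat = sampling_probs(counts, alpha=0.3)  # flattened toward uniform

for lang in counts:
    print(f"{lang}: raw={raw[lang]:.4f} flattened={flat[lang]:.4f}")
```

With a smaller `alpha`, the low-resource language (`sw` here) is sampled far more often than its raw share would suggest, at the cost of repeating its data more times per epoch.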

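According to the paper's abstract, the base model, FuxiTranyu-8B, was pretrained from scratch on this data. Two instruction-tuned variants build on it: FuxiTranyu-8B-SFT, fine-tuned on a diverse multilingual instruction dataset, and FuxiTranyu-8B-DPO, further refined with DPO on a preference dataset for better alignment. The goal is a capable multilingual model that can be applied to a wide range of downstream tasks.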
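
The paper's abstract says one released variant, FuxiTranyu-8B-DPO, is refined with DPO (Direct Preference Optimization) on a preference dataset. As a rough illustration of that objective for a single preference pair, the sketch below computes the standard DPO loss from sequence log-probabilities. The function name, the `beta` value, and the example numbers are assumptions for illustration, not details from the paper.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one (chosen, rejected) response pair.

    Each argument is the summed log-probability of a full response under
    either the policy being trained or the frozen reference model.
    loss = -log(sigmoid(beta * ((pc - rc) - (pr - rr))))
    """
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_logratio - rejected_logratio)
    # Numerically stable form of -log(sigmoid(margin)) = log(1 + exp(-margin)).
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Made-up log-probabilities: the policy has shifted toward the chosen answer.
loss_improved = dpo_loss(-10.0, -14.0, -12.0, -12.0)
# Policy identical to the reference: zero margin, loss equals log(2).
loss_neutral = dpo_loss(-12.0, -12.0, -12.0, -12.0)
print(loss_improved, loss_neutral)
```

The loss falls below log(2) only when the policy raises the chosen response's likelihood relative to the rejected one by more than the reference model does.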

Critical Analysis

The FuxiTranyu research presents a compelling approach to training large language models that can perform well across diverse linguistic landscapes. By deliberately addressing the issue of data imbalance, the researchers have taken an important step towards creating more inclusive and equitable AI systems.

However, the reported experiments compare FuxiTranyu with other multilingual models on a range of multilingual benchmarks, and it is not clear from them how the model performs on each individual language represented in the training data. More detailed per-language assessments would help readers understand its strengths and weaknesses.
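
To make the idea of a per-language assessment concrete, the sketch below breaks benchmark results down by language and reports the macro average alongside the gap between the best and worst language. It is a generic illustration, not the paper's evaluation code; the function name and the toy results are hypothetical.

```python
from statistics import mean


def summarize_by_language(results):
    """Aggregate (language, is_correct) pairs into per-language accuracy.

    Returns the per-language accuracies, their unweighted (macro) average,
    and the gap between the best and worst language.
    """
    per_lang = {}
    for lang, correct in results:
        per_lang.setdefault(lang, []).append(1.0 if correct else 0.0)
    accuracy = {lang: mean(scores) for lang, scores in per_lang.items()}
    values = list(accuracy.values())
    return {
        "accuracy": accuracy,
        "macro_avg": mean(values),
        "gap": max(values) - min(values),
    }


# Toy results; NOT data from the paper.
toy_results = [("en", True), ("en", True), ("sw", True), ("sw", False)]
summary = summarize_by_language(toy_results)
print(summary)
```

A single aggregate score can hide a large gap; reporting the per-language table and the best-to-worst spread makes an imbalance visible.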

Additionally, the paper does not appear to address biases or limitations that may remain in the FuxiTranyu model, such as potentially problematic or offensive content in the training data. Further research and testing would be needed to characterize and reduce any harmful biases.

Overall, the FuxiTranyu research represents an important contribution to the field of multilingual language models, and it will be interesting to see how this approach is further refined and applied in the future.

Conclusion

FuxiTranyu is a promising large language model that has been trained using a novel approach to balance the data across multiple languages. This helps to address the common issue of language bias in many existing language models, making the FuxiTranyu model more inclusive and useful for a wider range of applications and users.

While more research is needed to fully understand the model's capabilities and limitations, the FuxiTranyu project represents an important step forward in the development of advanced, multilingual language technologies that can benefit people from diverse linguistic backgrounds.

Related Papers

FuxiTranyu: A Multilingual Large Language Model Trained with Balanced Data

Haoran Sun, Renren Jin, Shaoyang Xu, Leiyu Pan, Supryadi, Menglong Cui, Jiangcun Du, Yikun Lei, Lei Yang, Ling Shi, Juesi Xiao, Shaolin Zhu, Deyi Xiong

Large language models (LLMs) have demonstrated prowess in a wide range of tasks. However, many LLMs exhibit significant performance discrepancies between high- and low-resource languages. To mitigate this challenge, we present FuxiTranyu, an open-source multilingual LLM, which is designed to satisfy the need of the research community for balanced and high-performing multilingual capabilities. FuxiTranyu-8B, the base model with 8 billion parameters, is trained from scratch on a meticulously balanced multilingual data repository that contains 600 billion tokens covering 43 natural languages and 16 programming languages. In addition to the base model, we also develop two instruction-tuned models: FuxiTranyu-8B-SFT that is fine-tuned on a diverse multilingual instruction dataset, and FuxiTranyu-8B-DPO that is further refined with DPO on a preference dataset for enhanced alignment ability. Extensive experiments on a wide range of multilingual benchmarks demonstrate the competitive performance of FuxiTranyu against existing multilingual LLMs, e.g., BLOOM-7B, PolyLM-13B, Llama-2-Chat-7B and Mistral-7B-Instruct. Interpretability analyses at both the neuron and representation level suggest that FuxiTranyu is able to learn consistent multilingual representations across different languages. To promote further research into multilingual LLMs and their working mechanisms, we release both the base and instruction-tuned FuxiTranyu models together with 58 pretraining checkpoints at HuggingFace and Github.

8/14/2024

LLaMAX: Scaling Linguistic Horizons of LLM by Enhancing Translation Capabilities Beyond 100 Languages

Yinquan Lu, Wenhao Zhu, Lei Li, Yu Qiao, Fei Yuan

Large Language Models (LLMs) demonstrate remarkable translation capabilities in high-resource language tasks, yet their performance in low-resource languages is hindered by insufficient multilingual data during pre-training. To address this, we dedicate 35,000 A100-SXM4-80GB GPU hours in conducting extensive multilingual continual pre-training on the LLaMA series models, enabling translation support across more than 100 languages. Through a comprehensive analysis of training strategies, such as vocabulary expansion and data augmentation, we develop LLaMAX. Remarkably, without sacrificing its generalization ability, LLaMAX achieves significantly higher translation performance compared to existing open-source LLMs (by more than 10 spBLEU points) and performs on-par with specialized translation model (M2M-100-12B) on the Flores-101 benchmark. Extensive experiments indicate that LLaMAX can serve as a robust multilingual foundation model. The code (https://github.com/CONE-MT/LLaMAX/) and models (https://huggingface.co/LLaMAX/) are publicly available.

7/9/2024

YuLan: An Open-source Large Language Model

Yutao Zhu, Kun Zhou, Kelong Mao, Wentong Chen, Yiding Sun, Zhipeng Chen, Qian Cao, Yihan Wu, Yushuo Chen, Feng Wang, Lei Zhang, Junyi Li, Xiaolei Wang, Lei Wang, Beichen Zhang, Zican Dong, Xiaoxue Cheng, Yuhan Chen, Xinyu Tang, Yupeng Hou, Qiangqiang Ren, Xincheng Pang, Shufang Xie, Wayne Xin Zhao, Zhicheng Dou, Jiaxin Mao, Yankai Lin, Ruihua Song, Jun Xu, Xu Chen, Rui Yan, Zhewei Wei, Di Hu, Wenbing Huang, Ze-Feng Gao, Yueguo Chen, Weizheng Lu, Ji-Rong Wen

Large language models (LLMs) have become the foundation of many applications, leveraging their extensive capabilities in processing and understanding natural language. While many open-source LLMs have been released with technical reports, the lack of training details hinders further research and development. This paper presents the development of YuLan, a series of open-source LLMs with 12 billion parameters. The base model of YuLan is pre-trained on approximately 1.7T tokens derived from a diverse corpus, including massive English, Chinese, and multilingual texts. We design a three-stage pre-training method to enhance YuLan's overall capabilities. Subsequent phases of training incorporate instruction-tuning and human alignment, employing a substantial volume of high-quality synthesized data. To facilitate the learning of complex and long-tail knowledge, we devise a curriculum-learning framework across these stages, which helps LLMs learn knowledge in an easy-to-hard manner. YuLan's training was finished in January 2024 and has achieved performance on par with state-of-the-art LLMs across various English and Chinese benchmarks. This paper outlines a comprehensive technical roadmap for developing LLMs from scratch. Our model and codes are available at https://github.com/RUC-GSAI/YuLan-Chat.

7/1/2024

Chinese Tiny LLM: Pretraining a Chinese-Centric Large Language Model

Xinrun Du, Zhouliang Yu, Songyang Gao, Ding Pan, Yuyang Cheng, Ziyang Ma, Ruibin Yuan, Xingwei Qu, Jiaheng Liu, Tianyu Zheng, Xinchen Luo, Guorui Zhou, Wenhu Chen, Ge Zhang

In this study, we introduce CT-LLM, a 2B large language model (LLM) that illustrates a pivotal shift towards prioritizing the Chinese language in developing LLMs. Uniquely initiated from scratch, CT-LLM diverges from the conventional methodology by primarily incorporating Chinese textual data, utilizing an extensive corpus of 1,200 billion tokens, including 800 billion Chinese tokens, 300 billion English tokens, and 100 billion code tokens. This strategic composition facilitates the model's exceptional proficiency in understanding and processing Chinese, a capability further enhanced through alignment techniques. Demonstrating remarkable performance on the CHC-Bench, CT-LLM excels in Chinese language tasks, and showcases its adeptness in English through SFT. This research challenges the prevailing paradigm of training LLMs predominantly on English corpora and then adapting them to other languages, broadening the horizons for LLM training methodologies. By open-sourcing the full process of training a Chinese LLM, including a detailed data processing procedure with the obtained Massive Appropriate Pretraining Chinese Corpus (MAP-CC), a well-chosen multidisciplinary Chinese Hard Case Benchmark (CHC-Bench), and the 2B-size Chinese Tiny LLM (CT-LLM), we aim to foster further exploration and innovation in both academia and industry, paving the way for more inclusive and versatile language models.

9/16/2024