Tele-FLM Technical Report

2404.16645

Published 4/26/2024 by Xiang Li, Yiqun Yao, Xin Jiang, Xuezhi Fang, Chao Wang, Xinzhang Liu, Zihan Wang, Yu Zhao, Xin Wang, Yuyao Huang and 10 others

cs.CL cs.AI

Abstract

Large language models (LLMs) have showcased profound capabilities in language understanding and generation, facilitating a wide array of applications. However, there is a notable paucity of detailed, open-sourced methodologies on efficiently scaling LLMs beyond 50 billion parameters with minimum trial-and-error cost and computational resources. In this report, we introduce Tele-FLM (aka FLM-2), a 52B open-sourced multilingual large language model that features a stable, efficient pre-training paradigm and enhanced factual judgment capabilities. Tele-FLM demonstrates superior multilingual language modeling abilities, measured by BPB on textual corpus. Besides, in both English and Chinese foundation model evaluation, it is comparable to strong open-sourced models that involve larger pre-training FLOPs, such as Llama2-70B and DeepSeek-67B. In addition to the model weights, we share the core designs, engineering practices, and training details, which we expect to benefit both the academic and industrial communities.

Overview

Large language models (LLMs) have shown remarkable capabilities in language understanding and generation, enabling a wide range of applications.
However, there is a lack of detailed, open-sourced methodologies on efficiently scaling LLMs beyond 50 billion parameters with minimal trial-and-error cost and computational resources.
This report introduces Tele-FLM (aka FLM-2), a 52-billion-parameter open-sourced multilingual large language model that features a stable, efficient pre-training paradigm and enhanced factual judgment capabilities.

Plain English Explanation

Tele-FLM is a new large language model that has been developed and made publicly available. Large language models are AI systems that can understand and generate human-like text, and they have become increasingly powerful in recent years. However, building models with more than 50 billion parameters (the parameter count is a rough measure of a model's size and complexity) is challenging and resource-intensive.

This report describes an approach for scaling large language models beyond 50 billion parameters efficiently, keeping trial-and-error cost and computational resources low. Tele-FLM has 52 billion parameters and is multilingual rather than English-only. It is also designed to make better factual judgments, an important capability for many real-world applications of language models.

The researchers behind Tele-FLM have shared the model's architecture, training details, and other key technical information. This openness is intended to benefit both academic and industry researchers working on large language models and related technologies.

Technical Explanation

The Tele-FLM paper introduces a 52-billion-parameter open-sourced multilingual large language model that the authors call "Tele-FLM" (or "FLM-2"). The model features a stable and efficient pre-training approach, as well as enhanced factual judgment capabilities.

Tele-FLM demonstrates strong multilingual language modeling performance, as measured by bits per byte (BPB) on textual corpora. Furthermore, in evaluations on both English and Chinese foundation model tasks, Tele-FLM performs comparably to other large open-sourced models, such as Llama2-70B and DeepSeek-67B, despite involving less pre-training compute.
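The two quantities in the paragraph above, BPB and pre-training compute, can be illustrated with a minimal sketch. It is not taken from the Tele-FLM report. The function names are hypothetical, the BPB function uses the common definition (total negative log-likelihood converted to bits, divided by the UTF-8 byte count of the text), and the compute function uses the widely cited rule of thumb of roughly 6 FLOPs per parameter per training token. The equal token budget in the usage lines is an assumption made only for illustration.

```python
import math


def bits_per_byte(mean_loss_nats: float, num_tokens: int, text: str) -> float:
    """Convert a mean per-token cross-entropy (in nats) to bits per byte.

    Dividing by the byte count instead of the token count makes the number
    comparable across models that use different tokenizers.
    """
    num_bytes = len(text.encode("utf-8"))
    total_bits = mean_loss_nats * num_tokens / math.log(2)
    return total_bits / num_bytes


def approx_training_flops(num_params: float, num_tokens: float) -> float:
    """Rough training compute using the ~6 * N * D rule of thumb."""
    return 6.0 * num_params * num_tokens


# Hypothetical numbers: a 12-byte text split into 3 tokens, mean loss 2.0 nats.
bpb = bits_per_byte(mean_loss_nats=2.0, num_tokens=3, text="hello world!")

# ASSUMPTION: both models see the same number of training tokens.
same_budget = 1.0e12
ratio = approx_training_flops(52e9, same_budget) / approx_training_flops(70e9, same_budget)
print(f"BPB: {bpb:.3f}, 52B vs 70B compute ratio: {ratio:.2f}")
```

Under the equal-token assumption, the compute ratio depends only on the parameter counts, which is why a 52B model can be described as using less pre-training compute than a 70B model trained on a similar amount of data.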

In addition to the model weights, the researchers have shared the core design decisions, engineering practices, and training details of Tele-FLM. This open-sourcing of methodologies is intended to benefit both academic and industrial communities working on large language models and related technologies.

Critical Analysis

The Tele-FLM paper presents a novel approach for efficiently scaling up large language models beyond 50 billion parameters. The researchers have done a commendable job in developing Tele-FLM and making it openly available, which is a valuable contribution to the field.

One potential area for further research mentioned in the paper is the need to explore more advanced techniques for factual judgment and knowledge integration in large language models. While Tele-FLM has shown enhanced factual judgment capabilities, there is still room for improvement in this area, particularly as language models are increasingly being deployed in real-world applications that require a high degree of factual accuracy.

Additionally, the paper does not delve deeply into potential biases or societal impacts that may arise from the use of such large-scale language models. As these models become more powerful and widely adopted, it will be crucial for researchers to carefully examine and address any ethical concerns or unintended consequences that may emerge.

Overall, the Tele-FLM research represents an important step forward in the development of efficient and scalable large language models. The open-sourcing of the methodology and model weights is a welcome contribution that should spur further advancements in this rapidly evolving field.

Conclusion

The Tele-FLM paper introduces a novel 52-billion-parameter open-sourced multilingual large language model that showcases a stable and efficient pre-training paradigm, as well as enhanced factual judgment capabilities.

Tele-FLM demonstrates strong multilingual language modeling performance and is comparable to other large open-sourced models in various foundation model tasks, despite involving less pre-training compute. The open-sourcing of the model's technical details and methodologies is a valuable contribution that should benefit both academic and industry researchers working on large language models and related technologies.

While the Tele-FLM research represents an important step forward, further work is needed to explore more advanced techniques for factual judgment and knowledge integration, as well as to carefully examine the potential biases and societal impacts of such large-scale language models. Nonetheless, the Tele-FLM project is a significant and promising development in the field of large language models.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

GenTranslate: Large Language Models are Generative Multilingual Speech and Machine Translators

Yuchen Hu, Chen Chen, Chao-Han Huck Yang, Ruizhe Li, Dong Zhang, Zhehuai Chen, Eng Siong Chng

Recent advances in large language models (LLMs) have stepped forward the development of multilingual speech and machine translation by its reduced representation errors and incorporated external knowledge. However, both translation tasks typically utilize beam search decoding and top-1 hypothesis selection for inference. These techniques struggle to fully exploit the rich information in the diverse N-best hypotheses, making them less optimal for translation tasks that require a single, high-quality output sequence. In this paper, we propose a new generative paradigm for translation tasks, namely GenTranslate, which builds upon LLMs to generate better results from the diverse translation versions in N-best list. Leveraging the rich linguistic knowledge and strong reasoning abilities of LLMs, our new paradigm can integrate the rich information in N-best candidates to generate a higher-quality translation result. Furthermore, to support LLM finetuning, we build and release a HypoTranslate dataset that contains over 592K hypotheses-translation pairs in 11 languages. Experiments on various speech and machine translation benchmarks (e.g., FLEURS, CoVoST-2, WMT) demonstrate that our GenTranslate significantly outperforms the state-of-the-art model.

5/17/2024

cs.CL cs.AI cs.LG cs.SD eess.AS

Chinese Tiny LLM: Pretraining a Chinese-Centric Large Language Model

Xinrun Du, Zhouliang Yu, Songyang Gao, Ding Pan, Yuyang Cheng, Ziyang Ma, Ruibin Yuan, Xingwei Qu, Jiaheng Liu, Tianyu Zheng, Xinchen Luo, Guorui Zhou, Binhang Yuan, Wenhu Chen, Jie Fu, Ge Zhang

In this study, we introduce CT-LLM, a 2B large language model (LLM) that illustrates a pivotal shift towards prioritizing the Chinese language in developing LLMs. Uniquely initiated from scratch, CT-LLM diverges from the conventional methodology by primarily incorporating Chinese textual data, utilizing an extensive corpus of 1,200 billion tokens, including 800 billion Chinese tokens, 300 billion English tokens, and 100 billion code tokens. This strategic composition facilitates the model's exceptional proficiency in understanding and processing Chinese, a capability further enhanced through alignment techniques. Demonstrating remarkable performance on the CHC-Bench, CT-LLM excels in Chinese language tasks, and showcases its adeptness in English through SFT. This research challenges the prevailing paradigm of training LLMs predominantly on English corpora and then adapting them to other languages, broadening the horizons for LLM training methodologies. By open-sourcing the full process of training a Chinese LLM, including a detailed data processing procedure with the obtained Massive Appropriate Pretraining Chinese Corpus (MAP-CC), a well-chosen multidisciplinary Chinese Hard Case Benchmark (CHC-Bench), and the 2B-size Chinese Tiny LLM (CT-LLM), we aim to foster further exploration and innovation in both academia and industry, paving the way for more inclusive and versatile language models.

4/10/2024

cs.CL cs.AI

A Novel Paradigm Boosting Translation Capabilities of Large Language Models

Jiaxin Guo, Hao Yang, Zongyao Li, Daimeng Wei, Hengchao Shang, Xiaoyu Chen

This paper presents a study on strategies to enhance the translation capabilities of large language models (LLMs) in the context of machine translation (MT) tasks. The paper proposes a novel paradigm consisting of three stages: Secondary Pre-training using Extensive Monolingual Data, Continual Pre-training with Interlinear Text Format Documents, and Leveraging Source-Language Consistent Instruction for Supervised Fine-Tuning. Previous research on LLMs focused on various strategies for supervised fine-tuning (SFT), but their effectiveness has been limited. While traditional machine translation approaches rely on vast amounts of parallel bilingual data, our paradigm highlights the importance of using smaller sets of high-quality bilingual data. We argue that the focus should be on augmenting LLMs' cross-lingual alignment abilities during pre-training rather than solely relying on extensive bilingual data during SFT. Experimental results conducted using the Llama2 model, particularly on Chinese-Llama2 after monolingual augmentation, demonstrate the improved translation capabilities of LLMs. A significant contribution of our approach lies in Stage2: Continual Pre-training with Interlinear Text Format Documents, which requires less than 1B training data, making our method highly efficient. Additionally, in Stage3, we observed that setting instructions consistent with the source language benefits the supervised fine-tuning process. Experimental results demonstrate that our approach surpasses previous work and achieves superior performance compared to models such as NLLB-54B and GPT3.5-text-davinci-003, despite having a significantly smaller parameter count of only 7B or 13B. This achievement establishes our method as a pioneering strategy in the field of machine translation.

4/16/2024

cs.CL

SambaLingo: Teaching Large Language Models New Languages

Zoltan Csaki, Bo Li, Jonathan Li, Qiantong Xu, Pian Pawakapan, Leon Zhang, Yun Du, Hengyu Zhao, Changran Hu, Urmish Thakker

Despite the widespread availability of LLMs, there remains a substantial gap in their capabilities and availability across diverse languages. One approach to address these issues has been to take an existing pre-trained LLM and continue to train it on new languages. While prior works have experimented with language adaptation, many questions around best practices and methodology have not been covered. In this paper, we present a comprehensive investigation into the adaptation of LLMs to new languages. Our study covers the key components in this process, including vocabulary extension, direct preference optimization and the data scarcity problem for human alignment in low-resource languages. We scale these experiments across 9 languages and 2 parameter scales (7B and 70B). We compare our models against Llama 2, Aya-101, XGLM, BLOOM and existing language experts, outperforming all prior published baselines. Additionally, all evaluation code and checkpoints are made public to facilitate future research.

4/10/2024

cs.CL cs.AI cs.LG