Pretraining Data and Tokenizer for Indic LLM

Read original: arXiv:2407.12481 - Published 7/18/2024 by Rahul Kumar, Shubham Kakde, Divyansh Rajput, Daud Ibrahim, Rishabh Nahata, Pidathala Sowjanya, Deepak Kumar

Overview

This paper presents research on building a pretraining dataset and tokenizer for large language models (LLMs) in Indic languages.
The researchers collected a large corpus of text data in 13 Indic languages and developed a custom tokenizer to handle the unique linguistic features of these languages.
The goal is to enable the training of high-quality LLMs that can understand and generate content in a wide range of Indic languages, supporting advancements in natural language processing for these underserved languages.

Plain English Explanation

The paper focuses on building the necessary foundations for training powerful language models in Indic languages, which include languages like Hindi, Bengali, Tamil, and others. Language models are AI systems that can understand and generate human-like text, and they are the backbone of many natural language processing applications.

However, most existing language models have been developed for English and other widely spoken languages. This leaves a gap for Indic languages, which are spoken by more than a billion people around the world. To address this, the researchers in this paper collected a large dataset of text in 13 different Indic languages, from sources like news articles, websites, and books.

They also developed a custom tokenizer, which is a critical component that breaks down the text into the basic units that the language model can understand. Indic languages have unique writing systems, grammatical structures, and vocabulary, so a standard tokenizer designed for English would not work well. The custom tokenizer they created is better able to handle the complexities of Indic languages.
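One concrete reason an English-centric tokenizer struggles with Indic text can be shown with a small sketch. Devanagari code points each take three bytes in UTF-8, so a tokenizer that falls back to bytes, or whose vocabulary has few Indic merges, tends to split a short Hindi word into many pieces. The helper name `utf8_stats` below is hypothetical and is not taken from the paper.

```python
# Compare how much raw UTF-8 a short Hindi word needs versus an English one.
# The Hindi word is written with escapes so the exact code points are unambiguous.
hindi = "\u0928\u092e\u0938\u094d\u0924\u0947"  # "namaste" in Devanagari
english = "hello"

def utf8_stats(word):
    """Return (number of code points, number of UTF-8 bytes) for a word."""
    return len(word), len(word.encode("utf-8"))

print("hindi  :", utf8_stats(hindi))    # 6 code points, 18 bytes
print("english:", utf8_stats(english))  # 5 code points, 5 bytes
```

The Hindi word is about the same length as the English one in characters, but more than three times as long in bytes, which is the unit many general-purpose tokenizers start from.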

With this high-quality pretraining data and tokenizer, the researchers can now train large language models that can understand and generate fluent text in a variety of Indic languages. This lays the groundwork for a wide range of applications, from smart assistants and translation tools to educational resources and creative writing aids, that can serve the diverse Indic-speaking population around the world.

Technical Explanation

The paper first describes the pretraining dataset collected by the researchers. This dataset covers 13 Indic languages, including Hindi, Bengali, Tamil, and Telugu, and draws on open-source and proprietary sources such as Common Crawl, Indic books, news articles, and Wikipedia. A custom preprocessing pipeline for each language removes redundant and low-quality text, and the Common Crawl data is deduplicated. The total corpus size is over 1 trillion tokens, making it one of the largest pretraining datasets for Indic languages.
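The paper's abstract mentions deduplicating Common Crawl data. As a rough illustration of the idea only, the sketch below removes exact duplicates by hashing a normalized form of each document. It is an assumption, not the authors' pipeline: the normalization steps and the names `normalize` and `deduplicate` are hypothetical, and real web-scale pipelines often add near-duplicate detection on top of exact matching.

```python
import hashlib
import unicodedata

def normalize(text):
    """NFC-normalize, lowercase, and collapse whitespace so trivial variants hash alike."""
    text = unicodedata.normalize("NFC", text)
    return " ".join(text.lower().split())

def deduplicate(docs):
    """Keep the first occurrence of each document, comparing by normalized SHA-256."""
    seen = set()
    unique = []
    for doc in docs:
        digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique

# Toy input: the first two pages differ only in case and spacing.
pages = ["Breaking  news today", "breaking news today", "A different page"]
unique_pages = deduplicate(pages)
print(unique_pages)
```

Unicode normalization matters for Indic scripts in particular, because the same visible text can be encoded with different code point sequences.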

To handle the unique linguistic properties of these Indic languages, the researchers developed a custom tokenizer based on the Byte-Pair Encoding (BPE) algorithm. The tokenizer was trained on the pretraining dataset and is able to effectively split the text into meaningful subword units, accounting for factors like complex morphology, script variations, and the presence of loanwords from other languages.
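To make the Byte-Pair Encoding idea concrete, here is a minimal toy sketch of BPE training: start from individual characters and repeatedly merge the most frequent adjacent pair. This is the textbook algorithm on a tiny English corpus, not the authors' training strategy, and the function names are hypothetical. Production tokenizers add byte fallback, pre-tokenization rules, and much larger corpora.

```python
from collections import Counter

def get_pair_counts(vocab):
    """Count adjacent symbol pairs across all words, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in vocab.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, vocab):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in vocab.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

def train_bpe(corpus, num_merges):
    """Learn up to `num_merges` merges from a whitespace-split toy corpus."""
    word_freqs = Counter(corpus.split())
    # Each word starts as a tuple of single Unicode code points.
    vocab = {tuple(word): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pairs = get_pair_counts(vocab)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent adjacent pair
        vocab = merge_pair(best, vocab)
        merges.append(best)
    return merges, vocab

merges, vocab = train_bpe("low low low lower lowest", num_merges=2)
print(merges)
print(vocab)
```

Because words are split into code points, Indic vowel signs and the virama begin as separate symbols and are only joined to their consonants if the training data makes that merge frequent enough. This is one reason the composition of the training corpus matters so much for Indic tokenizers.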

Related work provides resources for evaluation. L3Cube-IndicNews and L3Cube-MahaNews are news-based text datasets in Indic languages, produced by other research groups rather than introduced in this paper, and they can be used to assess the performance of models trained on the pretraining corpus.

Finally, the researchers discuss the potential for using the pretraining dataset and tokenizer to train large language models that can serve the Indic-speaking population, as well as the broader implications for advancing natural language processing in these underrepresented languages. They note that the availability of high-quality resources like this can enable the development of smaller and more efficient language models tailored for specific language communities, in the vein of Chinese-Tiny LLM.

Critical Analysis

The paper presents a comprehensive and well-designed approach to building pretraining resources for Indic language models. The researchers have clearly put significant effort into collecting a large and diverse dataset that covers a wide range of Indic languages, which is a crucial first step in enabling the development of high-quality language models.

One potential limitation of the study is that the pretraining dataset, while large, may not fully capture the linguistic diversity and regional variations within each Indic language. The researchers acknowledge this and suggest that further work is needed to curate more region-specific or domain-specific datasets to supplement the main corpus.

Additionally, while the custom tokenizer appears to be a significant improvement over using a standard tokenizer designed for English, there may still be room for further refinements and optimizations to better handle the unique features of Indic languages. Since BPE is itself a subword method, the researchers could explore alternative subword models or language-specific rules to further enhance the tokenizer's performance.

It would also be valuable for the researchers to conduct more extensive evaluations of the pretraining dataset and tokenizer, such as by training language models and assessing their performance on a wider range of benchmarks and real-world applications. This would help to better understand the practical implications and limitations of the resources they have developed.
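One evaluation metric named in the paper's abstract is the token-to-word ratio, where lower means fewer tokens are spent per word. A minimal sketch of how such a ratio could be computed follows. It assumes words are separated by whitespace, and the two tokenizers are hypothetical stand-ins used only to show the extremes; neither is the paper's tokenizer nor Tiktoken.

```python
def token_to_word_ratio(texts, tokenize):
    """Average number of tokens produced per whitespace-separated word."""
    total_tokens = sum(len(tokenize(t)) for t in texts)
    total_words = sum(len(t.split()) for t in texts)
    if total_words == 0:
        raise ValueError("no words in input")
    return total_tokens / total_words

def char_tokenizer(text):
    """Stand-in worst case: one token per non-space code point."""
    return [ch for ch in text if not ch.isspace()]

def word_tokenizer(text):
    """Stand-in best case: one token per word."""
    return text.split()

# A four-word Hindi sentence, written with escapes to pin the exact code points.
sample = ["\u092d\u093e\u0930\u0924 \u090f\u0915 \u0926\u0947\u0936 \u0939\u0948"]

print(token_to_word_ratio(sample, char_tokenizer))  # 11 code points / 4 words
print(token_to_word_ratio(sample, word_tokenizer))  # 4 tokens / 4 words
```

To compare real tokenizers, the same function could be called with each tokenizer's encode function on a held-out sample for every language.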

Conclusion

This paper presents a significant contribution to the field of natural language processing for Indic languages. By developing a large-scale pretraining dataset and a custom tokenizer, the researchers have laid the groundwork for training powerful language models that can understand and generate content in a diverse range of Indic languages.

The availability of these high-quality resources has the potential to drive advancements in a wide variety of applications, from intelligent assistants and machine translation to educational tools and creative writing aids, that can better serve the large and underrepresented Indic-speaking population around the world. This work represents an important step towards bridging the gap between Indic languages and the state-of-the-art in natural language processing.

Related Papers

Pretraining Data and Tokenizer for Indic LLM

Rahul Kumar, Shubham Kakde, Divyansh Rajput, Daud Ibrahim, Rishabh Nahata, Pidathala Sowjanya, Deepak Kumar

We present a novel approach to data preparation for developing multilingual Indic large language model. Our meticulous data acquisition spans open-source and proprietary sources, including Common Crawl, Indic books, news articles, and Wikipedia, ensuring a diverse and rich linguistic representation. For each Indic language, we design a custom preprocessing pipeline to effectively eliminate redundant and low-quality text content. Additionally, we perform deduplication on Common Crawl data to address the redundancy present in 70% of the crawled web pages. This study focuses on developing high-quality data, optimizing tokenization for our multilingual dataset for Indic large language models with 3B and 7B parameters, engineered for superior performance in Indic languages. We introduce a novel multilingual tokenizer training strategy, demonstrating our custom-trained Indic tokenizer outperforms the state-of-the-art OpenAI Tiktoken tokenizer, achieving a superior token-to-word ratio for Indic languages.

7/18/2024

Building pre-train LLM Dataset for the INDIC Languages: a case study on Hindi

Shantipriya Parida, Shakshi Panwar, Kusum Lata, Sanskruti Mishra, Sambit Sekhar

Large language models (LLMs) demonstrated transformative capabilities in many applications that require automatically generating responses based on human instruction. However, the major challenge for building LLMs, particularly in Indic languages, is the availability of high-quality data for building foundation LLMs. In this paper, we are proposing a large pre-train dataset in Hindi useful for the Indic language Hindi. We have collected the data span across several domains including major dialects in Hindi. The dataset contains 1.28 billion Hindi tokens. We have explained our pipeline including data collection, pre-processing, and availability for LLM pre-training. The proposed approach can be easily extended to other Indic and low-resource languages and will be available freely for LLM pre-training and LLM research purposes.

7/16/2024

Decoding the Diversity: A Review of the Indic AI Research Landscape

Sankalp KJ, Vinija Jain, Sreyoshi Bhaduri, Tamoghna Roy, Aman Chadha

This review paper provides a comprehensive overview of large language model (LLM) research directions within Indic languages. Indic languages are those spoken in the Indian subcontinent, including India, Pakistan, Bangladesh, Sri Lanka, Nepal, and Bhutan, among others. These languages have a rich cultural and linguistic heritage and are spoken by over 1.5 billion people worldwide. With the tremendous market potential and growing demand for natural language processing (NLP) based applications in diverse languages, generative applications for Indic languages pose unique challenges and opportunities for research. Our paper deep dives into the recent advancements in Indic generative modeling, contributing with a taxonomy of research directions, tabulating 84 recent publications. Research directions surveyed in this paper include LLM development, fine-tuning existing LLMs, development of corpora, benchmarking and evaluation, as well as publications around specific techniques, tools, and applications. We found that researchers across the publications emphasize the challenges associated with limited data availability, lack of standardization, and the peculiar linguistic complexities of Indic languages. This work aims to serve as a valuable resource for researchers and practitioners working in the field of NLP, particularly those focused on Indic languages, and contributes to the development of more accurate and efficient LLM applications for these languages.

6/17/2024

L3Cube-IndicNews: News-based Short Text and Long Document Classification Datasets in Indic Languages

Aishwarya Mirashi, Srushti Sonavane, Purva Lingayat, Tejas Padhiyar, Raviraj Joshi

In this work, we introduce L3Cube-IndicNews, a multilingual text classification corpus aimed at curating a high-quality dataset for Indian regional languages, with a specific focus on news headlines and articles. We have centered our work on 10 prominent Indic languages, including Hindi, Bengali, Marathi, Telugu, Tamil, Gujarati, Kannada, Odia, Malayalam, and Punjabi. Each of these news datasets comprises 10 or more classes of news articles. L3Cube-IndicNews offers 3 distinct datasets tailored to handle different document lengths that are classified as: Short Headlines Classification (SHC) dataset containing the news headline and news category, Long Document Classification (LDC) dataset containing the whole news article and the news category, and Long Paragraph Classification (LPC) containing sub-articles of the news and the news category. We maintain consistent labeling across all 3 datasets for in-depth length-based analysis. We evaluate each of these Indic language datasets using 4 different models including monolingual BERT, multilingual Indic Sentence BERT (IndicSBERT), and IndicBERT. This research contributes significantly to expanding the pool of available text classification datasets and also makes it possible to develop topic classification models for Indian regional languages. This also serves as an excellent resource for cross-lingual analysis owing to the high overlap of labels among languages. The datasets and models are shared publicly at https://github.com/l3cube-pune/indic-nlp

4/30/2024