Building pre-train LLM Dataset for the INDIC Languages: a case study on Hindi

Read original: arXiv:2407.09855 - Published 7/16/2024 by Shantipriya Parida, Shakshi Panwar, Kusum Lata, Sanskruti Mishra, Sambit Sekhar

Overview

This paper describes the process of building a pre-training dataset for large language models (LLMs) in the Indic languages, with a focus on Hindi.
The authors highlight the importance of developing high-quality datasets for Indic languages to enable the creation of more accurate and robust LLMs.
The paper covers the steps involved in dataset construction, including data collection, cleaning, and curation.

Plain English Explanation

The paper discusses the challenge of creating high-quality datasets for training large language models (LLMs) in Indic languages, which are a group of languages spoken in the Indian subcontinent. The authors focus on the case of Hindi, which is one of the most widely spoken Indic languages.

To build LLMs that can understand and generate text effectively in Hindi, researchers need access to a large corpus of high-quality text data. However, creating such datasets for Indic languages can be challenging due to factors like limited availability of digital resources and diverse dialects.

The paper outlines the steps the authors took to address this challenge and build a pre-training dataset for Hindi LLMs. This includes collecting data from various online sources, cleaning and processing the text to remove errors and inconsistencies, and curating the dataset to ensure it is representative of the language and its usage.

By sharing their experience and the insights gained from this process, the authors aim to provide a roadmap for researchers and developers working on building LLMs for other Indic languages. This can help accelerate the development of more accurate and robust language models that can better serve the needs of Indic language speakers.

Technical Explanation

The paper's central contribution is a Hindi pre-training dataset of 1.28 billion tokens, collected across several domains and covering major Hindi dialects. The authors motivate it by the shortage of high-quality data for building foundation LLMs in Indic languages.

The authors first conducted a literature survey to understand the existing research landscape and identify the challenges in building datasets for Indic languages. They then outlined the steps involved in their dataset construction process, which included data collection, cleaning, and curation.

For data collection, the authors leveraged various online sources, such as news articles, Wikipedia, and social media platforms, to amass a diverse corpus of Hindi text. They then employed a range of text processing techniques, including language detection, deduplication, and normalization, to clean and prepare the data for curation.
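
The cleaning steps named above (normalization, language detection, deduplication) can be sketched roughly as follows. This is a minimal illustration, not the authors' pipeline: the function names, the 0.6 script-ratio threshold, and the use of a Devanagari character ratio as a stand-in for language detection are all assumptions made for the example, and deduplication here is exact-match only.

```python
import hashlib
import re
import unicodedata

# The Devanagari Unicode block (U+0900 to U+097F) covers the Hindi script.
DEVANAGARI = re.compile(r"[\u0900-\u097F]")
WHITESPACE = re.compile(r"\s+")


def normalize(text: str) -> str:
    """Apply Unicode NFC normalization and collapse runs of whitespace."""
    text = unicodedata.normalize("NFC", text)
    return WHITESPACE.sub(" ", text).strip()


def devanagari_ratio(text: str) -> float:
    """Fraction of non-whitespace characters that fall in the Devanagari block."""
    chars = [ch for ch in text if not ch.isspace()]
    if not chars:
        return 0.0
    return sum(1 for ch in chars if DEVANAGARI.match(ch)) / len(chars)


def clean_corpus(docs, min_ratio=0.6):
    """Normalize, keep mostly-Devanagari documents, and drop exact duplicates.

    min_ratio is a hypothetical threshold chosen for illustration.
    """
    seen = set()
    cleaned = []
    for doc in docs:
        text = normalize(doc)
        if not text or devanagari_ratio(text) < min_ratio:
            continue  # empty or not predominantly Hindi script
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest in seen:
            continue  # exact duplicate after normalization
        seen.add(digest)
        cleaned.append(text)
    return cleaned


docs = ["यह एक  वाक्य है।", "यह एक वाक्य है।", "This is English.", ""]
print(clean_corpus(docs))
```

A script-ratio check cannot distinguish Hindi from other languages written in Devanagari, such as Marathi or Nepali, so a real pipeline would likely pair it with a trained language identifier and with near-duplicate detection.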

The curation process involved several steps to ensure the dataset's quality and representativeness. This included filtering out low-quality or irrelevant content, maintaining the diversity of the corpus in terms of domains and genres, and addressing potential biases or skewed representations.
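
One way to picture the curation step is a simple quality filter followed by a per-domain count that makes skew visible. The sketch below assumes each record is a dictionary with hypothetical "text" and "domain" keys. The heuristics (a minimum word count and a cap on how much of a document a single word may occupy) are illustrative choices, not the criteria used in the paper.

```python
from collections import Counter


def is_low_quality(text, min_words=5, max_repeat_fraction=0.5):
    """Flag very short texts and texts dominated by one repeated word."""
    words = text.split()
    if len(words) < min_words:
        return True
    most_common_count = Counter(words).most_common(1)[0][1]
    return most_common_count / len(words) > max_repeat_fraction


def curate(records, min_words=5):
    """Drop low-quality records and report how many remain per domain."""
    kept = [r for r in records if not is_low_quality(r["text"], min_words)]
    domain_counts = Counter(r["domain"] for r in kept)
    return kept, domain_counts


records = [
    {"text": "भारत में आज मौसम साफ रहेगा", "domain": "news"},
    {"text": "ठीक है", "domain": "social"},
    {"text": "हा हा हा हा हा हा", "domain": "social"},
    {"text": "हिंदी भारत की एक प्रमुख भाषा है", "domain": "wiki"},
]
kept, domain_counts = curate(records)
print(len(kept), dict(domain_counts))
```

Inspecting the domain counts after filtering shows whether a filter has removed one source disproportionately. In this toy input both social media records are dropped, which is the kind of skew a curation step would want to notice.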

The authors also discussed the challenges they faced during the dataset construction process, such as the limited availability of high-quality digital resources in Hindi and the need to handle the language's complex morphology and script.
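
A small example of why the script needs care: the same Devanagari letter can be encoded in more than one way, so visually identical words may fail to match unless the text is normalized first. The snippet below uses Python's standard unicodedata module. It illustrates the general issue and is not a step taken from the paper.

```python
import unicodedata


def canonical(text: str) -> str:
    """Return the NFC form so equivalent encodings compare equal."""
    return unicodedata.normalize("NFC", text)


# The letter "qa" can arrive as a single code point (U+0958) or as
# "ka" (U+0915) followed by the nukta sign (U+093C).
precomposed = "\u0958"
decomposed = "\u0915\u093c"

print(precomposed == decomposed)                        # False: raw strings differ
print(canonical(precomposed) == canonical(decomposed))  # True after normalization
```

Without this step, exact-match deduplication and vocabulary counts would treat the two encodings as different strings.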

Critical Analysis

The paper provides a comprehensive and well-documented approach to building a pre-training dataset for Hindi LLMs, which can serve as a valuable reference for researchers and developers working on Indic language models.

However, the work has limitations. Although the dataset is described as covering major Hindi dialects, it may not fully capture the linguistic diversity of Hindi: if the sources lean toward standard formal Hindi, colloquial and regional variations may be underrepresented.

Additionally, although the curation step is described as addressing potential biases or skewed representations, how such biases were identified and measured is not detailed, which could have implications for the fairness and inclusiveness of the resulting LLMs. Future research could explore methods for proactively identifying and mitigating such biases, drawing on initiatives like IndicBias and IndicGenBench.

There is also an opportunity to extend the dataset construction process to other Indic languages, potentially leveraging cross-lingual techniques and insights from related dataset and model efforts such as Tagengo and Chinese Tiny LLM. This could lead to more comprehensive and interoperable language models for the Indic language ecosystem.

Conclusion

This paper presents a detailed case study on the process of building a pre-training dataset for Hindi LLMs, highlighting the importance of developing high-quality datasets for Indic languages. The authors' approach provides a valuable blueprint for researchers and developers working on language models for other Indic languages.

By addressing the challenges of data collection, cleaning, and curation, the paper contributes to the growing body of research on Indic language technologies. The insights gained from this work can help accelerate the development of more accurate and inclusive LLMs that can better serve the needs of Indic language speakers, ultimately promoting digital inclusion and empowerment.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Building pre-train LLM Dataset for the INDIC Languages: a case study on Hindi

Shantipriya Parida, Shakshi Panwar, Kusum Lata, Sanskruti Mishra, Sambit Sekhar

Large language models (LLMs) demonstrated transformative capabilities in many applications that require automatically generating responses based on human instruction. However, the major challenge for building LLMs, particularly in Indic languages, is the availability of high-quality data for building foundation LLMs. In this paper, we are proposing a large pre-train dataset in Hindi useful for the Indic language Hindi. We have collected the data span across several domains including major dialects in Hindi. The dataset contains 1.28 billion Hindi tokens. We have explained our pipeline including data collection, pre-processing, and availability for LLM pre-training. The proposed approach can be easily extended to other Indic and low-resource languages and will be available freely for LLM pre-training and LLM research purposes.

7/16/2024

Pretraining Data and Tokenizer for Indic LLM

Rahul Kumar, Shubham Kakde, Divyansh Rajput, Daud Ibrahim, Rishabh Nahata, Pidathala Sowjanya, Deepak Kumar

We present a novel approach to data preparation for developing multilingual Indic large language model. Our meticulous data acquisition spans open-source and proprietary sources, including Common Crawl, Indic books, news articles, and Wikipedia, ensuring a diverse and rich linguistic representation. For each Indic language, we design a custom preprocessing pipeline to effectively eliminate redundant and low-quality text content. Additionally, we perform deduplication on Common Crawl data to address the redundancy present in 70% of the crawled web pages. This study focuses on developing high-quality data, optimizing tokenization for our multilingual dataset for Indic large language models with 3B and 7B parameters, engineered for superior performance in Indic languages. We introduce a novel multilingual tokenizer training strategy, demonstrating our custom-trained Indic tokenizer outperforms the state-of-the-art OpenAI Tiktoken tokenizer, achieving a superior token-to-word ratio for Indic languages.

7/18/2024

INDIC QA BENCHMARK: A Multilingual Benchmark to Evaluate Question Answering capability of LLMs for Indic Languages

Abhishek Kumar Singh, Rudra Murthy, Vishwajeet kumar, Jaydeep Sen, Ganesh Ramakrishnan

Large Language Models (LLMs) have demonstrated remarkable zero-shot and few-shot capabilities in unseen tasks, including context-grounded question answering (QA) in English. However, the evaluation of LLMs' capabilities in non-English languages for context-based QA is limited by the scarcity of benchmarks in non-English languages. To address this gap, we introduce Indic-QA, the largest publicly available context-grounded question-answering dataset for 11 major Indian languages from two language families. The dataset comprises both extractive and abstractive question-answering tasks and includes existing datasets as well as English QA datasets translated into Indian languages. Additionally, we generate a synthetic dataset using the Gemini model to create question-answer pairs given a passage, which is then manually verified for quality assurance. We evaluate various multilingual Large Language Models and their instruction-fine-tuned variants on the benchmark and observe that their performance is subpar, particularly for low-resource languages. We hope that the release of this dataset will stimulate further research on the question-answering abilities of LLMs for low-resource languages.

7/19/2024

Decoding the Diversity: A Review of the Indic AI Research Landscape

Sankalp KJ, Vinija Jain, Sreyoshi Bhaduri, Tamoghna Roy, Aman Chadha

This review paper provides a comprehensive overview of large language model (LLM) research directions within Indic languages. Indic languages are those spoken in the Indian subcontinent, including India, Pakistan, Bangladesh, Sri Lanka, Nepal, and Bhutan, among others. These languages have a rich cultural and linguistic heritage and are spoken by over 1.5 billion people worldwide. With the tremendous market potential and growing demand for natural language processing (NLP) based applications in diverse languages, generative applications for Indic languages pose unique challenges and opportunities for research. Our paper deep dives into the recent advancements in Indic generative modeling, contributing with a taxonomy of research directions, tabulating 84 recent publications. Research directions surveyed in this paper include LLM development, fine-tuning existing LLMs, development of corpora, benchmarking and evaluation, as well as publications around specific techniques, tools, and applications. We found that researchers across the publications emphasize the challenges associated with limited data availability, lack of standardization, and the peculiar linguistic complexities of Indic languages. This work aims to serve as a valuable resource for researchers and practitioners working in the field of NLP, particularly those focused on Indic languages, and contributes to the development of more accurate and efficient LLM applications for these languages.

6/17/2024