L3Cube-IndicNews: News-based Short Text and Long Document Classification Datasets in Indic Languages

Read original: arXiv:2401.02254 - Published 4/30/2024 by Aishwarya Mirashi, Srushti Sonavane, Purva Lingayat, Tejas Padhiyar, Raviraj Joshi

L3Cube-IndicNews: News-based Short Text and Long Document Classification Datasets in Indic Languages

Overview

This paper presents the L3Cube-IndicNews dataset, a collection of news articles in multiple Indic languages for short text and long document classification tasks.
The dataset covers diverse topics and includes annotations for various categories, making it a valuable resource for natural language processing research in Indic languages.
The paper also introduces several Indic language benchmarks, including IndicGenBench, IndiBias, and MahaSquAD, which aim to evaluate the performance of language models in these languages.

Plain English Explanation

The researchers have created a new dataset called L3Cube-IndicNews, which contains news articles in various Indic languages. These articles cover a wide range of topics and have been categorized into different classes, such as politics, sports, and entertainment. This dataset can be used by researchers and developers to train and test machine learning models that can classify short text snippets or longer documents in Indic languages.

The researchers have also developed several other benchmarks, like IndicGenBench, IndiBias, and MahaSquAD, which are designed to evaluate the performance of language models in Indic languages. These benchmarks can help researchers and developers understand how well their models perform on tasks like text generation, bias detection, and question answering in these languages.

Technical Explanation

The L3Cube-IndicNews dataset was created by crawling and curating news articles from various Indic language news portals. The dataset covers 12 Indic languages, including Hindi, Bengali, Marathi, and Tamil, and includes annotations for both short text (headline) and long document (full article) classification tasks.

The researchers used a two-step approach to create the dataset. First, they collected news articles from various sources and preprocessed the data, removing non-textual elements and applying language-specific cleaning techniques. Then, they manually annotated a subset of the articles, assigning them to one of several predefined categories, such as politics, sports, and entertainment.

The resulting dataset consists of over 100,000 news articles, with a balanced distribution across the different categories and languages. To ensure the quality of the dataset, the researchers employed multiple annotators and used standard inter-annotator agreement measures.

In addition to the L3Cube-IndicNews dataset, the paper introduces several other Indic language benchmarks, including IndicGenBench, IndiBias, and MahaSquAD. These benchmarks are designed to evaluate the performance of language models on a variety of tasks, including text generation, bias detection, and question answering.

Critical Analysis

The L3Cube-IndicNews dataset is a valuable addition to the field of natural language processing for Indic languages. The dataset's diversity in terms of topics and languages, as well as its careful curation and annotation process, make it a reliable resource for researchers and developers.

However, the paper does not provide detailed information about the distribution of articles across the different categories and languages, which could be useful for understanding the dataset's potential biases and limitations. Additionally, the paper does not discuss the challenges faced during the data collection and annotation process, which could provide valuable insights for future researchers working on similar datasets.

While the introduction of the Indic language benchmarks, such as IndicGenBench, IndiBias, and MahaSquAD, is a positive step, the paper does not provide a detailed evaluation of the performance of existing language models on these benchmarks. This information could have been useful for understanding the current state of the art in Indic language processing and identifying areas for further research.

Conclusion

The L3Cube-IndicNews dataset and the associated Indic language benchmarks introduced in this paper represent a significant contribution to the field of natural language processing for Indic languages. The dataset's diversity and the benchmarks' focus on evaluating various language processing capabilities can help researchers and developers advance the state of the art in Indic language understanding and generation.

By providing these resources, the researchers have taken an important step towards bridging the linguistic divide and promoting the development of more inclusive and accessible language technologies for Indic language speakers. As the field continues to evolve, it will be essential to build upon these foundational efforts and address the remaining challenges in Indic language processing.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Follow @aimodelsfyi on 𝕏 →

Related Papers

L3Cube-IndicNews: News-based Short Text and Long Document Classification Datasets in Indic Languages

Aishwarya Mirashi, Srushti Sonavane, Purva Lingayat, Tejas Padhiyar, Raviraj Joshi

In this work, we introduce L3Cube-IndicNews, a multilingual text classification corpus aimed at curating a high-quality dataset for Indian regional languages, with a specific focus on news headlines and articles. We have centered our work on 10 prominent Indic languages, including Hindi, Bengali, Marathi, Telugu, Tamil, Gujarati, Kannada, Odia, Malayalam, and Punjabi. Each of these news datasets comprises 10 or more classes of news articles. L3Cube-IndicNews offers 3 distinct datasets tailored to handle different document lengths that are classified as: Short Headlines Classification (SHC) dataset containing the news headline and news category, Long Document Classification (LDC) dataset containing the whole news article and the news category, and Long Paragraph Classification (LPC) containing sub-articles of the news and the news category. We maintain consistent labeling across all 3 datasets for in-depth length-based analysis. We evaluate each of these Indic language datasets using 4 different models including monolingual BERT, multilingual Indic Sentence BERT (IndicSBERT), and IndicBERT. This research contributes significantly to expanding the pool of available text classification datasets and also makes it possible to develop topic classification models for Indian regional languages. This also serves as an excellent resource for cross-lingual analysis owing to the high overlap of labels among languages. The datasets and models are shared publicly at https://github.com/l3cube-pune/indic-nlp

4/30/2024

L3Cube-MahaNews: News-based Short Text and Long Document Classification Datasets in Marathi

Saloni Mittal, Vidula Magdum, Omkar Dhekane, Sharayu Hiwarkhedkar, Raviraj Joshi

The availability of text or topic classification datasets in the low-resource Marathi language is limited, typically consisting of fewer than 4 target labels, with some achieving nearly perfect accuracy. In this work, we introduce L3Cube-MahaNews, a Marathi text classification corpus that focuses on News headlines and articles. This corpus stands out as the largest supervised Marathi Corpus, containing over 1.05L records classified into a diverse range of 12 categories. To accommodate different document lengths, MahaNews comprises three supervised datasets specifically designed for short text, long documents, and medium paragraphs. The consistent labeling across these datasets facilitates document length-based analysis. We provide detailed data statistics and baseline results on these datasets using state-of-the-art pre-trained BERT models. We conduct a comparative analysis between monolingual and multilingual BERT models, including MahaBERT, IndicBERT, and MuRIL. The monolingual MahaBERT model outperforms all others on every dataset. These resources also serve as Marathi topic classification datasets or models and are publicly available at https://github.com/l3cube-pune/MarathiNLP .

4/30/2024

New!L3Cube-IndicQuest: A Benchmark Questing Answering Dataset for Evaluating Knowledge of LLMs in Indic Context

Pritika Rohera, Chaitrali Ginimav, Akanksha Salunke, Gayatri Sawant, Raviraj Joshi

Large Language Models (LLMs) have made significant progress in incorporating Indic languages within multilingual models. However, it is crucial to quantitatively assess whether these languages perform comparably to globally dominant ones, such as English. Currently, there is a lack of benchmark datasets specifically designed to evaluate the regional knowledge of LLMs in various Indic languages. In this paper, we present the L3Cube-IndicQuest, a gold-standard question-answering benchmark dataset designed to evaluate how well multilingual LLMs capture regional knowledge across various Indic languages. The dataset contains 200 question-answer pairs, each for English and 19 Indic languages, covering five domains specific to the Indic region. We aim for this dataset to serve as a benchmark, providing ground truth for evaluating the performance of LLMs in understanding and representing knowledge relevant to the Indian context. The IndicQuest can be used for both reference-based evaluation and LLM-as-a-judge evaluation. The dataset is shared publicly at https://github.com/l3cube-pune/indic-nlp .

9/16/2024

Pretraining Data and Tokenizer for Indic LLM

Rahul Kumar, Shubham Kakde, Divyansh Rajput, Daud Ibrahim, Rishabh Nahata, Pidathala Sowjanya, Deepak Kumar

We present a novel approach to data preparation for developing multilingual Indic large language model. Our meticulous data acquisition spans open-source and proprietary sources, including Common Crawl, Indic books, news articles, and Wikipedia, ensuring a diverse and rich linguistic representation. For each Indic language, we design a custom preprocessing pipeline to effectively eliminate redundant and low-quality text content. Additionally, we perform deduplication on Common Crawl data to address the redundancy present in 70% of the crawled web pages. This study focuses on developing high-quality data, optimizing tokenization for our multilingual dataset for Indic large language models with 3B and 7B parameters, engineered for superior performance in Indic languages. We introduce a novel multilingual tokenizer training strategy, demonstrating our custom-trained Indic tokenizer outperforms the state-of-the-art OpenAI Tiktoken tokenizer, achieving a superior token-to-word ratio for Indic languages.

7/18/2024