Taxi1500: A Multilingual Dataset for Text Classification in 1500 Languages

Read original: arXiv:2305.08487 - Published 6/5/2024 by Chunlan Ma, Ayyoob ImaniGooghari, Haotian Ye, Renhao Pei, Ehsaneddin Asgari, Hinrich Schutze

Overview

This paper aims to address the lack of evaluation datasets for a wide range of the world's over 7000 languages, including low-resource and endangered ones.
The researchers leverage parallel translations of the Bible to construct a text classification dataset covering more than 1500 languages.
They extensively benchmark several existing multilingual language models using this new dataset, and plan to release the dataset and code to facilitate further research in this area.

Plain English Explanation

The paper focuses on the challenge of developing natural language processing (NLP) tools for the wide range of languages spoken around the world. While NLP has progressed significantly for some major languages, most of the world's more than 7000 languages still lack adequate resources and evaluation datasets.

To tackle this problem, the researchers turned to an unlikely source: the Bible. By using the parallel translations of the Bible available in many languages, they were able to construct a text classification dataset covering more than 1500 languages, including many that currently have little to no annotated data available.

The process involved first developing relevant topics for the text classification task, and then using crowdsourcing to collect annotated data in English. The researchers were then able to project these annotations onto the other language versions of the Bible passages, effectively creating a multilingual dataset.

With this new dataset in hand, the researchers extensively benchmarked several existing multilingual language models. Benchmarking on this scale indicates where current models fall short, which can guide work on NLP for a much broader range of the world's languages.

Notably, the researchers plan to release both the dataset and the code used to create it, which will allow other researchers to build upon this work and accelerate progress in this important area of NLP.

Technical Explanation

The key innovation in this paper is the use of parallel Bible translations to construct a large-scale text classification dataset covering over 1500 languages. This approach allows the researchers to bypass the significant time and cost required to manually annotate data for such a diverse set of languages.

First, the researchers developed a set of topics applicable to Bible verses. The paper's six classes are Recommendation, Faith, Description, Sin, Grace, and Violence, rather than general-purpose categories such as politics or science. They then used a crowdsourcing tool to collect annotations for the English verses.

By aligning the annotated English text with the corresponding verses in translations of the Bible, the researchers were able to automatically project the labels onto the other language versions. This allowed them to generate text classification datasets for a vast number of languages, many of which currently have little to no annotated data available.
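The projection step described above can be sketched in a few lines. This is a minimal illustration, not the authors' released code: the function name `project_labels`, the verse IDs, the placeholder labels, and the in-memory dictionary layout are all assumptions made for the example. It only assumes that each translation indexes its verses by the same verse ID as the English text.

```python
from typing import Dict, List, Tuple

# Hypothetical English annotations: verse ID -> topic label.
# IDs and labels are placeholders, not taken from the dataset.
english_labels: Dict[str, str] = {
    "v001": "topic_a",
    "v002": "topic_b",
    "v003": "topic_a",
}

# Hypothetical translations: language code -> {verse ID -> verse text}.
# A translation may cover only part of the annotated verses.
translations: Dict[str, Dict[str, str]] = {
    "lang_x": {"v001": "x1", "v003": "x3"},
    "lang_y": {"v002": "y2", "v004": "y4"},
}


def project_labels(
    labels: Dict[str, str],
    parallel: Dict[str, Dict[str, str]],
) -> Dict[str, List[Tuple[str, str]]]:
    """Copy each English verse label onto the verse with the same ID
    in every translation, skipping verses that were never annotated."""
    datasets: Dict[str, List[Tuple[str, str]]] = {}
    for lang, verses in parallel.items():
        datasets[lang] = [
            (text, labels[verse_id])
            for verse_id, text in sorted(verses.items())
            if verse_id in labels
        ]
    return datasets


projected = project_labels(english_labels, translations)
```

Because the labels travel with the verse ID, no annotator needs to read the target language. The size of each per-language dataset then depends on how many annotated verses that translation contains.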

The researchers then benchmarked several existing multilingual language models on the new dataset. Because the languages share one label set and verse-aligned text, scores can be compared across languages, including low-resource and endangered ones, which helps show where current models handle linguistic diversity poorly.
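A per-language benchmark of this kind reduces to scoring predictions against the projected labels, one language at a time. The sketch below uses macro-averaged F1 as the metric. That choice is an assumption made for illustration, since this summary does not say which metric the paper reports, and the names `macro_f1` and `score_by_language` and the toy predictions are hypothetical.

```python
from typing import Dict, Sequence, Tuple


def macro_f1(gold: Sequence[str], pred: Sequence[str]) -> float:
    """Unweighted mean of per-class F1 over every class seen in gold or pred."""
    labels = sorted(set(gold) | set(pred))
    scores = []
    for label in labels:
        tp = sum(g == label and p == label for g, p in zip(gold, pred))
        fp = sum(g != label and p == label for g, p in zip(gold, pred))
        fn = sum(g == label and p != label for g, p in zip(gold, pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores) if scores else 0.0


def score_by_language(
    runs: Dict[str, Tuple[Sequence[str], Sequence[str]]],
) -> Dict[str, float]:
    """Map each language code to the macro-F1 of its (gold, predicted) labels."""
    return {lang: macro_f1(gold, pred) for lang, (gold, pred) in runs.items()}


# Toy data: placeholder language codes and labels, not real results.
runs = {
    "lang_x": (["a", "a", "b", "b"], ["a", "b", "b", "b"]),
    "lang_y": (["a", "b"], ["a", "b"]),
}
scores = score_by_language(runs)
```

Macro averaging weights every class equally, which matters when some topics are much rarer than others. Plain accuracy would be a simpler alternative.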

Critical Analysis

While the researchers' approach of leveraging parallel Bible translations is a clever and innovative solution to the data scarcity problem, there are some potential limitations to consider.

One concern is the potential bias inherent in using religious texts as the basis for the dataset. The topics and language used in the Bible may not be fully representative of the breadth of human knowledge and communication, which could limit the broader applicability of models trained on this data.

Additionally, the accuracy of the automatically projected labels may vary across languages, depending on the fidelity of each translation and how closely its verses align with the English version. The resulting label noise could make the benchmarking results less reliable for some languages.

Further research would be needed to assess the generalizability of models trained on this dataset and to explore ways to diversify the source material beyond religious texts. Incorporating other multilingual datasets could also help to strengthen the robustness and coverage of the evaluation.

Conclusion

This paper presents a novel approach to addressing the challenge of developing NLP capabilities for a wide range of the world's languages. By leveraging parallel Bible translations, the researchers were able to create a large-scale text classification dataset covering more than 1500 languages, many of which currently lack adequate annotated data.

The benchmarking of existing multilingual language models using this new dataset will provide valuable insights to guide the future development of more robust and inclusive NLP systems. The researchers' plan to release both the dataset and the code used to create it is a commendable step that will enable other researchers to build upon this work and accelerate progress in this important field.

While the approach has some potential limitations, the overall contribution of this work is significant, as it represents an important step towards addressing the language diversity challenge and ensuring that the benefits of NLP technology can be more equitably shared across the world's population.

Related Papers

Taxi1500: A Multilingual Dataset for Text Classification in 1500 Languages

Chunlan Ma, Ayyoob ImaniGooghari, Haotian Ye, Renhao Pei, Ehsaneddin Asgari, Hinrich Schutze

While natural language processing tools have been developed extensively for some of the world's languages, a significant portion of the world's over 7000 languages are still neglected. One reason for this is that evaluation datasets do not yet cover a wide range of languages, including low-resource and endangered ones. We aim to address this issue by creating a text classification dataset encompassing a large number of languages, many of which currently have little to no annotated data available. We leverage parallel translations of the Bible to construct such a dataset by first developing applicable topics and employing a crowdsourcing tool to collect annotated data. By annotating the English side of the data and projecting the labels onto other languages through aligned verses, we generate text classification datasets for more than 1500 languages. We extensively benchmark several existing multilingual language models using our dataset. To facilitate the advancement of research in this area, we will release our dataset and code.

6/5/2024

Tagengo: A Multilingual Chat Dataset

Peter Devine

Open source large language models (LLMs) have shown great improvements in recent times. However, many of these models are focused solely on popular spoken languages. We present a high quality dataset of more than 70k prompt-response pairs in 74 languages which consist of human generated prompts and synthetic responses. We use this dataset to train a state-of-the-art open source English LLM to chat multilingually. We evaluate our model on MT-Bench chat benchmarks in 6 languages, finding that our multilingual model outperforms previous state-of-the-art open source LLMs across each language. We further find that training on more multilingual data is beneficial to the performance in a chosen target language (Japanese) compared to simply training on only data in that language. These results indicate the necessity of training on large amounts of high quality multilingual data to make a more accessible LLM.

5/22/2024

A multi-level multi-label text classification dataset of 19th century Ottoman and Russian literary and critical texts

Gokcen Gokceoglu, Devrim Cavusoglu, Emre Akbas, Ozen Nergis Dolcerocca

This paper introduces a multi-level, multi-label text classification dataset comprising over 3000 documents. The dataset features literary and critical texts from 19th-century Ottoman Turkish and Russian. It is the first study to apply large language models (LLMs) to this dataset, sourced from prominent literary periodicals of the era. The texts have been meticulously organized and labeled. This was done according to a taxonomic framework that takes into account both their structural and semantic attributes. Articles are categorized and tagged with bibliometric metadata by human experts. We present baseline classification results using a classical bag-of-words (BoW) naive Bayes model and three modern LLMs: multilingual BERT, Falcon, and Llama-v2. We found that in certain cases, Bag of Words (BoW) outperforms Large Language Models (LLMs), emphasizing the need for additional research, especially in low-resource language settings. This dataset is expected to be a valuable resource for researchers in natural language processing and machine learning, especially for historical and low-resource languages. The dataset is publicly available.

7/23/2024

Universal Cross-Lingual Text Classification

Riya Savant, Anushka Shelke, Sakshi Todmal, Sanskruti Kanphade, Ananya Joshi, Raviraj Joshi

Text classification, an integral task in natural language processing, involves the automatic categorization of text into predefined classes. Creating supervised labeled datasets for low-resource languages poses a considerable challenge. Unlocking the language potential of low-resource languages requires robust datasets with supervised labels. However, such datasets are scarce, and the label space is often limited. In our pursuit to address this gap, we aim to optimize existing labels/datasets in different languages. This research proposes a novel perspective on Universal Cross-Lingual Text Classification, leveraging a unified model across languages. Our approach involves blending supervised data from different languages during training to create a universal model. The supervised data for a target classification task might come from different languages covering different labels. The primary goal is to enhance label and language coverage, aiming for a label set that represents a union of labels from various languages. We propose the usage of a strong multilingual SBERT as our base model, making our novel training strategy feasible. This strategy contributes to the adaptability and effectiveness of the model in cross-lingual language transfer scenarios, where it can categorize text in languages not encountered during training. Thus, the paper delves into the intricacies of cross-lingual text classification, with a particular focus on its application for low-resource languages, exploring methodologies and implications for the development of a robust and adaptable universal cross-lingual model.

6/18/2024