A multi-level multi-label text classification dataset of 19th century Ottoman and Russian literary and critical texts

Read original: arXiv:2407.15136 - Published 7/23/2024 by Gokcen Gokceoglu, Devrim Cavusoglu, Emre Akbas, Ozen Nergis Dolcerocca

Overview

This paper introduces a new multi-level, multi-label text classification dataset of 19th century Ottoman and Russian literary and critical texts.
The dataset covers a range of topics and genres, including poetry, prose, and literary criticism.
The texts are annotated with multiple labels at different levels, providing a rich resource for text classification research.

Plain English Explanation

The researchers have created a new dataset of historical texts from the 19th century Ottoman and Russian empires. This dataset contains a wide variety of literary works, including poems, stories, and literary reviews. Each text in the dataset has been carefully labeled with multiple tags, describing the different topics and genres it covers.

This multi-level, multi-label approach lets researchers study these texts in more depth than single-label classification allows. For example, a poem might be tagged with the themes romance, nature, and social commentary. This richer annotation gives a more nuanced picture of the content and supports more sophisticated text analysis.

The researchers hope that this new dataset will be a valuable resource for scholars studying 19th century literature, Ottoman and Russian culture, and advanced text classification techniques. By making this data publicly available, they aim to facilitate new research and insights into this important historical period.

Technical Explanation

The paper introduces a new multi-level multi-label text classification dataset of 19th century Ottoman and Russian literary and critical texts. The dataset contains a diverse collection of poems, prose works, and literary criticism, with each text annotated with multiple labels at different hierarchical levels.

The authors developed a comprehensive annotation schema to capture the nuanced content of the texts, including literary genre, thematic elements, and stylistic features. This multi-label approach allows for a richer understanding of the texts compared to traditional single-label classification.
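A minimal sketch of how a multi-level, multi-label annotation could be represented in code. The level names (`genre`, `theme`), the label names, and the `AnnotatedText` and `multi_hot` helpers are illustrative assumptions, not the paper's actual schema.

```python
from dataclasses import dataclass, field

# Hypothetical two-level taxonomy; the real label names are not given in this summary.
TAXONOMY = {
    "genre": ["poetry", "prose", "criticism"],
    "theme": ["romance", "nature", "social_commentary"],
}


@dataclass
class AnnotatedText:
    text: str
    # Maps each taxonomy level to the set of labels assigned at that level.
    labels: dict[str, set[str]] = field(default_factory=dict)


def multi_hot(doc: AnnotatedText, level: str) -> list[int]:
    """Encode the labels of one level as a 0/1 vector in taxonomy order."""
    assigned = doc.labels.get(level, set())
    unknown = assigned - set(TAXONOMY[level])
    if unknown:
        raise ValueError(f"labels not in level {level!r}: {sorted(unknown)}")
    return [1 if name in assigned else 0 for name in TAXONOMY[level]]


poem = AnnotatedText(
    text="(placeholder text)",
    labels={"genre": {"poetry"}, "theme": {"romance", "nature"}},
)
print(multi_hot(poem, "genre"))  # [1, 0, 0]
print(multi_hot(poem, "theme"))  # [1, 1, 0]
```

Keeping one label set per level means a classifier can be trained and evaluated on each level separately, while a single document can still carry several labels within a level.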

To create the dataset, the authors curated a corpus of over 3000 documents sourced from prominent literary periodicals of the era. Human experts then categorized the articles and tagged them with bibliometric metadata according to the schema.

The resulting dataset provides a unique resource for advanced text classification research. The multi-level structure and diverse annotations enable the exploration of complex relationships between different aspects of the texts, opening up new avenues for literary analysis and natural language processing applications.
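The paper's abstract reports baseline results from a classical bag-of-words naive Bayes model. The sketch below shows one way such a baseline could handle multiple labels: one binary multinomial naive Bayes classifier per label (often called binary relevance). The whitespace tokenizer, the add-one smoothing, the toy documents, and the binary-relevance setup are all assumptions for illustration; the summary does not say how the authors configured their model.

```python
import math
from collections import Counter


def tokenize(text):
    """Naive whitespace tokenizer (an assumption; real texts need more care)."""
    return text.lower().split()


class NaiveBayes:
    """Multinomial naive Bayes over bag-of-words counts with add-one smoothing."""

    def fit(self, docs, targets):
        self.classes = sorted(set(targets))
        self.doc_counts = Counter(targets)
        self.word_counts = {c: Counter() for c in self.classes}
        for doc, target in zip(docs, targets):
            self.word_counts[target].update(tokenize(doc))
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def predict(self, doc):
        total_docs = sum(self.doc_counts.values())
        best_class, best_score = None, -math.inf
        for c in self.classes:
            score = math.log(self.doc_counts[c] / total_docs)  # log prior
            denom = sum(self.word_counts[c].values()) + len(self.vocab)
            for word in tokenize(doc):
                if word in self.vocab:  # skip words never seen in training
                    score += math.log((self.word_counts[c][word] + 1) / denom)
            if score > best_score:
                best_class, best_score = c, score
        return best_class


def fit_binary_relevance(docs, label_sets):
    """Train one has-label / lacks-label classifier per label."""
    all_labels = sorted(set().union(*label_sets))
    return {
        label: NaiveBayes().fit(docs, [label in s for s in label_sets])
        for label in all_labels
    }


def predict_labels(models, doc):
    return {label for label, model in models.items() if model.predict(doc)}


# Toy corpus with made-up labels, for illustration only.
docs = [
    "moon rose garden love",
    "love letter heart",
    "river forest moon",
    "forest mountain river",
]
label_sets = [{"romance", "nature"}, {"romance"}, {"nature"}, {"nature"}]

models = fit_binary_relevance(docs, label_sets)
print(predict_labels(models, "love heart"))    # {'romance'}
print(predict_labels(models, "river forest"))  # {'nature'}
```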

Critical Analysis

The authors have made a commendable effort in creating a comprehensive dataset that captures the complexity of 19th century Ottoman and Russian literary texts. The multi-level, multi-label approach is a significant improvement over traditional single-label classification, as it better reflects the nuanced nature of these works.

However, the authors acknowledge certain limitations of the dataset, such as the potential for annotation bias and the uneven distribution of texts across genres and topics. These caveats should be considered when using the dataset for specific research tasks.
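One practical consequence of an uneven label distribution is that the choice of evaluation metric matters. The sketch below, on made-up labels and predictions, compares micro-averaged F1 (pooled over all label decisions) with macro-averaged F1 (mean of per-label scores). The `micro_macro_f1` helper and the toy data are assumptions for illustration and do not reflect the paper's evaluation setup.

```python
from collections import Counter


def f1(tp, fp, fn):
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def micro_macro_f1(gold, pred):
    """Return (micro F1, macro F1) for lists of gold and predicted label sets."""
    labels = sorted(set().union(*gold, *pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        for label in labels:
            if label in g and label in p:
                tp[label] += 1
            elif label in p:
                fp[label] += 1
            elif label in g:
                fn[label] += 1
    macro = sum(f1(tp[l], fp[l], fn[l]) for l in labels) / len(labels)
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    return micro, macro


# Toy case: a model that always predicts the majority label.
gold = [{"prose"}, {"prose"}, {"prose"}, {"prose"}, {"poetry"}]
pred = [{"prose"}, {"prose"}, {"prose"}, {"prose"}, {"prose"}]

micro, macro = micro_macro_f1(gold, pred)
print(f"micro F1 = {micro:.3f}, macro F1 = {macro:.3f}")
# micro F1 = 0.800, macro F1 = 0.444
```

Here the micro average looks reasonable even though the rare label is never predicted, while the macro average exposes the failure. Reporting both is a common way to guard against this.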

Additionally, applying modern text classification techniques to historical texts poses challenges of its own, and may require specialized preprocessing or domain-specific adaptations. The paper's abstract notes that a bag-of-words model outperforms LLMs in certain cases, which underlines the point. Further research is needed to explore the unique characteristics of this dataset and develop appropriate modeling approaches.

Conclusion

The multi-level multi-label text classification dataset introduced in this paper represents a significant contribution to the field of digital humanities and natural language processing. By providing a rich, annotated corpus of 19th century Ottoman and Russian literary and critical texts, the authors have opened up new opportunities for advanced text analysis, comparative literary studies, and cross-cultural investigations.

The dataset's multilingual and multidisciplinary nature also makes it a potentially valuable resource for broader applications, such as machine translation, information retrieval, and educational initiatives. As the field of text classification continues to evolve, this dataset could help drive further research, particularly for historical and low-resource languages.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

A multi-level multi-label text classification dataset of 19th century Ottoman and Russian literary and critical texts

Gokcen Gokceoglu, Devrim Cavusoglu, Emre Akbas, Ozen Nergis Dolcerocca

This paper introduces a multi-level, multi-label text classification dataset comprising over 3000 documents. The dataset features literary and critical texts from 19th-century Ottoman Turkish and Russian. It is the first study to apply large language models (LLMs) to this dataset, sourced from prominent literary periodicals of the era. The texts have been meticulously organized and labeled. This was done according to a taxonomic framework that takes into account both their structural and semantic attributes. Articles are categorized and tagged with bibliometric metadata by human experts. We present baseline classification results using a classical bag-of-words (BoW) naive Bayes model and three modern LLMs: multilingual BERT, Falcon, and Llama-v2. We found that in certain cases, Bag of Words (BoW) outperforms Large Language Models (LLMs), emphasizing the need for additional research, especially in low-resource language settings. This dataset is expected to be a valuable resource for researchers in natural language processing and machine learning, especially for historical and low-resource languages. The dataset is publicly available.

7/23/2024

Taxi1500: A Multilingual Dataset for Text Classification in 1500 Languages

Chunlan Ma, Ayyoob ImaniGooghari, Haotian Ye, Renhao Pei, Ehsaneddin Asgari, Hinrich Schütze

While natural language processing tools have been developed extensively for some of the world's languages, a significant portion of the world's over 7000 languages are still neglected. One reason for this is that evaluation datasets do not yet cover a wide range of languages, including low-resource and endangered ones. We aim to address this issue by creating a text classification dataset encompassing a large number of languages, many of which currently have little to no annotated data available. We leverage parallel translations of the Bible to construct such a dataset by first developing applicable topics and employing a crowdsourcing tool to collect annotated data. By annotating the English side of the data and projecting the labels onto other languages through aligned verses, we generate text classification datasets for more than 1500 languages. We extensively benchmark several existing multilingual language models using our dataset. To facilitate the advancement of research in this area, we will release our dataset and code.

6/5/2024

mOSCAR: A Large-scale Multilingual and Multimodal Document-level Corpus

Matthieu Futeral, Armel Zebaze, Pedro Ortiz Suarez, Julien Abadji, Rémi Lacroix, Cordelia Schmid, Rachel Bawden, Benoît Sagot

Multimodal Large Language Models (mLLMs) are trained on a large amount of text-image data. While most mLLMs are trained on caption-like data only, Alayrac et al. [2022] showed that additionally training them on interleaved sequences of text and images can lead to the emergence of in-context learning capabilities. However, the dataset they used, M3W, is not public and is only in English. There have been attempts to reproduce their results but the released datasets are English-only. In contrast, current multilingual and multimodal datasets are either composed of caption-like only or medium-scale or fully private data. This limits mLLM research for the 7,000 other languages spoken in the world. We therefore introduce mOSCAR, to the best of our knowledge the first large-scale multilingual and multimodal document corpus crawled from the web. It covers 163 languages, 315M documents, 214B tokens and 1.2B images. We carefully conduct a set of filtering and evaluation steps to make sure mOSCAR is sufficiently safe, diverse and of good quality. We additionally train two types of multilingual model to prove the benefits of mOSCAR: (1) a model trained on a subset of mOSCAR and captioning data and (2) a model trained on captioning data only. The model additionally trained on mOSCAR shows a strong boost in few-shot learning performance across various multilingual image-text tasks and benchmarks, confirming previous findings for English-only mLLMs.

6/14/2024

Russian-Language Multimodal Dataset for Automatic Summarization of Scientific Papers

Alena Tsanda, Elena Bruches

The paper discusses the creation of a multimodal dataset of Russian-language scientific papers and testing of existing language models for the task of automatic text summarization. A feature of the dataset is its multimodal data, which includes texts, tables and figures. The paper presents the results of experiments with two language models: Gigachat from SBER and YandexGPT from Yandex. The dataset consists of 420 papers and is publicly available on https://github.com/iis-research-team/summarization-dataset.

5/14/2024