KazSAnDRA: Kazakh Sentiment Analysis Dataset of Reviews and Attitudes

Read original: arXiv:2403.19335 - Published 4/11/2024 by Rustem Yeshpanov, Huseyin Atakan Varol

Overview

This paper introduces KazSAnDRA, a new Kazakh sentiment analysis dataset of reviews and attitudes.
The dataset comprises 180,064 Kazakh reviews obtained from various sources, each with a numerical rating from 1 to 5.
The authors aim to advance Kazakh natural language processing (NLP) research by providing a high-quality benchmark dataset.

Plain English Explanation

The researchers created a new dataset called KazSAnDRA that contains 180,064 reviews written in Kazakh. Each review comes with a rating from 1 to 5, which gives a numerical indication of the reviewer's attitude. The dataset is intended to help researchers working on Kazakh language processing tasks like sentiment analysis, who can use the rated reviews to train and test their models. Having a high-quality, standardized dataset matters for Kazakh NLP, which is still less studied than widely resourced languages like English or Chinese.

Technical Explanation

The KazQAD dataset and the KazParC corpus have helped drive progress in Kazakh language understanding, while the KazEmoTTS dataset supports emotional text-to-speech synthesis. Building on this, the authors of KazSAnDRA aimed to create a large-scale dataset to support sentiment analysis in the Kazakh language.

According to the paper's abstract, the reviews were obtained from various sources, and each one includes a numerical rating from 1 to 5 that provides a quantitative representation of customer attitudes. The authors use the data for two tasks: polarity classification and score classification.
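To make the two tasks named in the abstract concrete, the sketch below shows one way a 1-to-5 rating could be turned into a score label and a polarity label. It is a minimal illustration, not the authors' preprocessing: the function names, the `positive_from` cut-off, and the sample reviews are all hypothetical, and the paper should be consulted for the mapping actually used.

```python
# Hypothetical helpers: names and the polarity cut-off are assumptions,
# not taken from the KazSAnDRA paper or repository.

def score_label(rating: int) -> int:
    """Score classification target: the rating itself, validated to 1..5."""
    if not isinstance(rating, int) or not 1 <= rating <= 5:
        raise ValueError(f"rating must be an integer from 1 to 5, got {rating!r}")
    return rating


def polarity_label(rating: int, positive_from: int = 4) -> str:
    """Polarity target: ratings at or above the assumed cut-off are positive."""
    return "positive" if score_label(rating) >= positive_from else "negative"


# Made-up sample reviews (text, rating) used only to exercise the helpers.
reviews = [("Өте жақсы тауар", 5), ("Нашар қызмет", 1), ("Орташа", 3)]

for text, rating in reviews:
    print(text, score_label(rating), polarity_label(rating))
```

Keeping the cut-off as a parameter makes it easy to check how sensitive the polarity task is to where mid-range ratings such as 3 are assigned.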

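The full KazSAnDRA dataset contains 180,064 reviews. The authors developed and evaluated four machine learning models for both polarity classification and score classification, considering balanced and imbalanced scenarios. The most successful model attained an F1-score of 0.81 for polarity classification and 0.39 for score classification on the test sets, demonstrating the dataset's utility as a standardized evaluation resource for Kazakh NLP.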
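The paper's abstract reports results as F1-scores (0.81 for polarity classification and 0.39 for score classification). As a reminder of what that metric measures, here is a small pure-Python sketch of per-class and macro-averaged F1. The abstract does not say which averaging the authors used, so the macro average here is an assumption, and the labels in the example are invented.

```python
from collections import Counter
from typing import Dict, Sequence


def per_class_f1(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[int, float]:
    """F1 per label, computed as 2*TP / (2*TP + FP + FN)."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted label gains a false positive
            fn[truth] += 1  # true label gains a false negative
    scores = {}
    for label in sorted(set(y_true) | set(y_pred)):
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores[label] = 2 * tp[label] / denom if denom else 0.0
    return scores


def macro_f1(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Unweighted mean of per-class F1 (an assumed averaging choice)."""
    scores = per_class_f1(y_true, y_pred)
    return sum(scores.values()) / len(scores)


# Invented, imbalanced toy labels: 1 = positive, 0 = negative.
y_true = [1, 1, 1, 1, 0, 0]
y_pred = [1, 1, 1, 1, 1, 0]

print(per_class_f1(y_true, y_pred))
print(macro_f1(y_true, y_pred))
```

On imbalanced data like this toy example, the minority class scores lower than the majority class, which is one reason evaluating in both balanced and imbalanced scenarios is informative.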

Critical Analysis

The KazSAnDRA dataset represents an important contribution to Kazakh language processing research. However, it has limitations. The dataset is relatively small compared to sentiment benchmarks in other languages, and the text samples come from a limited set of domains. Additionally, the labeling could potentially introduce biases.

Further research is needed to expand the dataset size and diversity, as well as explore more advanced sentiment analysis techniques tailored to the Kazakh language. Comparisons to human performance on the task would also provide valuable insights.

Conclusion

The KazSAnDRA dataset fills a critical gap in Kazakh natural language processing by providing a standardized benchmark for sentiment analysis. This resource can enable researchers to develop more accurate and robust Kazakh language models, ultimately benefiting Kazakh-speaking communities. As Kazakh NLP continues to mature, datasets like KazSAnDRA will play a vital role in driving progress and innovation.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

KazSAnDRA: Kazakh Sentiment Analysis Dataset of Reviews and Attitudes

Rustem Yeshpanov, Huseyin Atakan Varol

This paper presents KazSAnDRA, a dataset developed for Kazakh sentiment analysis that is the first and largest publicly available dataset of its kind. KazSAnDRA comprises an extensive collection of 180,064 reviews obtained from various sources and includes numerical ratings ranging from 1 to 5, providing a quantitative representation of customer attitudes. The study also pursued the automation of Kazakh sentiment classification through the development and evaluation of four machine learning models trained for both polarity classification and score classification. Experimental analysis included evaluation of the results considering both balanced and imbalanced scenarios. The most successful model attained an F1-score of 0.81 for polarity classification and 0.39 for score classification on the test sets. The dataset and fine-tuned models are open access and available for download under the Creative Commons Attribution 4.0 International License (CC BY 4.0) through our GitHub repository.

4/11/2024

KazEmoTTS: A Dataset for Kazakh Emotional Text-to-Speech Synthesis

Adal Abilbekov, Saida Mussakhojayeva, Rustem Yeshpanov, Huseyin Atakan Varol

This study focuses on the creation of the KazEmoTTS dataset, designed for emotional Kazakh text-to-speech (TTS) applications. KazEmoTTS is a collection of 54,760 audio-text pairs, with a total duration of 74.85 hours, featuring 34.23 hours delivered by a female narrator and 40.62 hours by two male narrators. The emotions considered include neutral, angry, happy, sad, scared, and surprised. We also developed a TTS model trained on the KazEmoTTS dataset. Objective and subjective evaluations were employed to assess the quality of synthesized speech, yielding an MCD score within the range of 6.02 to 7.67, alongside a MOS that spanned from 3.51 to 3.57. To facilitate reproducibility and inspire further research, we have made our code, pre-trained model, and dataset accessible in our GitHub repository.

4/11/2024

KazQAD: Kazakh Open-Domain Question Answering Dataset

Rustem Yeshpanov, Pavel Efimov, Leonid Boytsov, Ardak Shalkarbayuli, Pavel Braslavski

We introduce KazQAD -- a Kazakh open-domain question answering (ODQA) dataset -- that can be used in both reading comprehension and full ODQA settings, as well as for information retrieval experiments. KazQAD contains just under 6,000 unique questions with extracted short answers and nearly 12,000 passage-level relevance judgements. We use a combination of machine translation, Wikipedia search, and in-house manual annotation to ensure annotation efficiency and data quality. The questions come from two sources: translated items from the Natural Questions (NQ) dataset (only for training) and the original Kazakh Unified National Testing (UNT) exam (for development and testing). The accompanying text corpus contains more than 800,000 passages from the Kazakh Wikipedia. As a supplementary dataset, we release around 61,000 question-passage-answer triples from the NQ dataset that have been machine-translated into Kazakh. We develop baseline retrievers and readers that achieve reasonable scores in retrieval (NDCG@10 = 0.389, MRR = 0.382), reading comprehension (EM = 38.5, F1 = 54.2), and full ODQA (EM = 17.8, F1 = 28.7) settings. Nevertheless, these results are substantially lower than state-of-the-art results for English QA collections, and we think that there should still be ample room for improvement. We also show that OpenAI's current ChatGPTv3.5 is not able to answer KazQAD test questions in the closed-book setting with acceptable quality. The dataset is freely available under the Creative Commons licence (CC BY-SA) at https://github.com/IS2AI/KazQAD.

4/9/2024

New Directions in Text Classification Research: Maximizing The Performance of Sentiment Classification from Limited Data

Surya Agustian, Muhammad Irfan Syah, Nurul Fatiara, Rahmad Abdillah

The stakeholders' needs in sentiment analysis for various issues, whether positive or negative, are speed and accuracy. One new challenge in sentiment analysis tasks is the limited training data, which often leads to suboptimal machine learning models and poor performance on test data. This paper discusses the problem of text classification based on limited training data (300 to 600 samples) into three classes: positive, negative, and neutral. A benchmark dataset is provided for training and testing data on the issue of Kaesang Pangarep's appointment as Chairman of PSI. External data for aggregation and augmentation purposes are provided, consisting of two datasets: the topic of Covid Vaccination sentiment and an open topic. The official score used is the F1-score, which balances precision and recall among the three classes, positive, negative, and neutral. A baseline score is provided as a reference for researchers for unoptimized classification methods. The optimized score is provided as a reference for the target score to be achieved by any proposed method. Both scores (baseline and optimized) use the SVM method, which is widely reported as the state-of-the-art in conventional machine learning methods. The F1-scores achieved by the baseline and optimized methods are 40.83% and 51.28%, respectively.

7/9/2024