BnSentMix: A Diverse Bengali-English Code-Mixed Dataset for Sentiment Analysis

Read original: arXiv:2408.08964 - Published 8/20/2024 by Sadia Alam, Md Farhan Ishmam, Navid Hasin Alvee, Md Shahnewaz Siddique, Md Azam Hossain, Abu Raihan Mostofa Kamal

Overview

This paper introduces BnSentMix, a diverse Bengali-English code-mixed dataset for sentiment analysis.
The dataset contains 20,000 code-mixed samples from Facebook, YouTube, and e-commerce sites, each annotated with one of four sentiment labels.
The authors benchmark 14 baseline methods on the dataset, including transformer encoders further pre-trained on code-mixed Bengali-English.

Plain English Explanation

The study focuses on creating a dataset of social media posts that use a mix of Bengali and English, known as "code-mixing." Code-mixing is a common linguistic phenomenon where people switch between two or more languages within the same conversation or text. This type of language usage is prevalent in many multilingual communities, including in South Asia.

The researchers compiled a large dataset of 20,000 code-mixed posts that cover a diverse range of topics and express a variety of sentiments. They labeled each post with one of four sentiment labels, creating a comprehensive resource for training and evaluating sentiment analysis models.

By benchmarking the performance of leading AI models on this dataset, the researchers were able to identify strengths and limitations in how well current techniques can handle the complexities of code-mixed text. This provides valuable insights for improving sentiment analysis in multilingual, code-mixed contexts, which has important applications in areas like social media monitoring, customer service, and market research.

Technical Explanation

The authors introduce the BnSentMix dataset, which contains 20,000 code-mixed Bengali-English samples collected from Facebook, YouTube, and e-commerce sites. The authors make the dataset available for download. The posts cover a diverse range of topics, including entertainment, politics, sports, and more, and have been manually annotated with one of four sentiment labels.

To create the dataset, the researchers used a combination of automated data collection and manual review and curation. They developed custom web crawlers to scrape code-mixed text from social media, online forums, and other user-generated content sources. The collected data was then reviewed by annotators fluent in both Bengali and English to validate the code-mixing and assign sentiment labels.

The authors benchmark 14 baseline methods on the BnSentMix dataset. These include transformer-based models like BERT, as well as transformer encoders further pre-trained on code-mixed Bengali-English. The best reported results are an overall accuracy of 69.8% and an F1 score of 69.1%. Performance varies across sentiment labels and text types, which reveals both the strengths and limitations of current techniques in handling the unique challenges posed by code-mixed text.
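The two headline metrics used in this kind of benchmark, accuracy and F1, can be illustrated with a minimal sketch. This is not the authors' evaluation code: the label names, the toy predictions, and the choice of macro averaging for F1 are all assumptions made for illustration, since the source only states that there are four sentiment labels and does not say how F1 is averaged.

```python
# Minimal sketch of accuracy and macro-averaged F1 for a 4-class task.
# ASSUMPTIONS: label names are placeholders (the source only says there
# are four labels), and macro averaging is one common choice, not
# necessarily the one used in the paper.

LABELS = ["positive", "negative", "neutral", "mixed"]  # hypothetical names


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the gold labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def macro_f1(y_true, y_pred, labels):
    """Unweighted mean of the per-label F1 scores."""
    f1_scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        if precision + recall:
            f1_scores.append(2 * precision * recall / (precision + recall))
        else:
            f1_scores.append(0.0)  # label never predicted correctly
    return sum(f1_scores) / len(labels)


# Toy gold labels and predictions, invented for this example.
y_true = ["positive", "negative", "neutral", "mixed", "positive", "negative"]
y_pred = ["positive", "negative", "neutral", "positive", "positive", "neutral"]

print(f"accuracy = {accuracy(y_true, y_pred):.3f}")
print(f"macro F1 = {macro_f1(y_true, y_pred, LABELS):.3f}")
```

Macro averaging weights every label equally, so a label the model rarely gets right pulls the score down even when overall accuracy looks reasonable. That is one way differences across sentiment labels can show up in a single number.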

Critical Analysis

The BnSentMix dataset and the associated benchmarking represent a valuable contribution to the field of sentiment analysis, particularly in the context of code-mixed language. The diversity of the dataset, in terms of both topic coverage and sentiment distribution, makes it a robust resource for training and evaluating models.

However, the authors acknowledge some limitations of the dataset, such as the potential for annotation bias and the exclusion of certain demographic groups. Additionally, the benchmarking results highlight the need for further advancements in model architectures and training techniques to improve performance on code-mixed data.

Future research could explore incorporating additional linguistic features, such as code-switching patterns and pragmatic context, to enhance the understanding and modeling of code-mixed sentiment. Expanding the dataset to include more languages and code-mixing scenarios would also be a valuable direction for further development.

Conclusion

This paper presents the BnSentMix dataset, a comprehensive resource for sentiment analysis in Bengali-English code-mixed text. The dataset and the associated benchmarking provide valuable insights into the current state of the art and the challenges of working with code-mixed language data.

The findings from this research have important implications for the development of more robust and inclusive sentiment analysis models, which are crucial for effectively monitoring and understanding the conversations taking place in multilingual online communities. By addressing the complexities of code-mixing, the BnSentMix dataset and the insights derived from it can contribute to the broader goal of building AI systems that better reflect the linguistic diversity of the real world.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

BnSentMix: A Diverse Bengali-English Code-Mixed Dataset for Sentiment Analysis

Sadia Alam, Md Farhan Ishmam, Navid Hasin Alvee, Md Shahnewaz Siddique, Md Azam Hossain, Abu Raihan Mostofa Kamal

The widespread availability of code-mixed data can provide valuable insights into low-resource languages like Bengali, which have limited datasets. Sentiment analysis has been a fundamental text classification task across several languages for code-mixed data. However, there has yet to be a large-scale and diverse sentiment analysis dataset on code-mixed Bengali. We address this limitation by introducing BnSentMix, a sentiment analysis dataset on code-mixed Bengali consisting of 20,000 samples with 4 sentiment labels from Facebook, YouTube, and e-commerce sites. We ensure diversity in data sources to replicate realistic code-mixed scenarios. Additionally, we propose 14 baseline methods including novel transformer encoders further pre-trained on code-mixed Bengali-English, achieving an overall accuracy of 69.8% and an F1 score of 69.1% on sentiment classification tasks. Detailed analyses reveal variations in performance across different sentiment labels and text types, highlighting areas for future improvement.

8/20/2024

EmoMix-3L: A Code-Mixed Dataset for Bangla-English-Hindi Emotion Detection

Nishat Raihan, Dhiman Goswami, Antara Mahmud, Antonios Anastasopoulos, Marcos Zampieri

Code-mixing is a well-studied linguistic phenomenon that occurs when two or more languages are mixed in text or speech. Several studies have been conducted on building datasets and performing downstream NLP tasks on code-mixed data. Although it is not uncommon to observe code-mixing of three or more languages, most available datasets in this domain contain code-mixed data from only two languages. In this paper, we introduce EmoMix-3L, a novel multi-label emotion detection dataset containing code-mixed data from three different languages. We experiment with several models on EmoMix-3L and we report that MuRIL outperforms other models on this dataset.

5/14/2024

Code-mixed Sentiment and Hate-speech Prediction

Anjali Yadav, Tanya Garg, Matej Klemen, Matej Ulcar, Basant Agarwal, Marko Robnik Sikonja

Code-mixed discourse combines multiple languages in a single text. It is commonly used in informal discourse in countries with several official languages, but also in many other countries in combination with English or neighboring languages. As recently large language models have dominated most natural language processing tasks, we investigated their performance in code-mixed settings for relevant tasks. We first created four new bilingual pre-trained masked language models for English-Hindi and English-Slovene languages, specifically aimed to support informal language. Then we performed an evaluation of monolingual, bilingual, few-lingual, and massively multilingual models on several languages, using two tasks that frequently contain code-mixed text, in particular, sentiment analysis and offensive language detection in social media texts. The results show that the most successful classifiers are fine-tuned bilingual models and multilingual models, specialized for social media texts, followed by non-specialized massively multilingual and monolingual models, while huge generative models are not competitive. For our affective problems, the models mostly perform slightly better on code-mixed data compared to non-code-mixed data.

5/22/2024

Improving code-mixed hate detection by native sample mixing: A case study for Hindi-English code-mixed scenario

Debajyoti Mazumder, Aakash Kumar, Jasabanta Patro

Hate detection has long been a challenging task for the NLP community. The task becomes complex in a code-mixed environment because the models must understand the context and the hate expressed through language alteration. Compared to the monolingual setup, there is very little work on code-mixed hate, as large-scale annotated hate corpora are unavailable for such studies. To overcome this bottleneck, we propose using native language hate samples. We hypothesise that in the era of multilingual language models (MLMs), hate in code-mixed settings can be detected by relying mainly on the native language samples. Even though the NLP literature reports the effectiveness of MLMs on hate detection in many cross-lingual settings, their extensive evaluation in a code-mixed scenario is yet to be done. This paper attempts to fill this gap through rigorous empirical experiments. We considered the Hindi-English code-mixed setup as a case study as we have the linguistic expertise for the same. Some of our interesting observations are: (i) adding native hate samples to the code-mixed training set, even in small quantities, improved the performance of MLMs for code-mixed hate detection, (ii) MLMs trained with native samples alone were observed to detect code-mixed hate to a large extent, (iii) the visualisation of attention scores revealed that, when native samples were included in training, MLMs could better focus on the hate-emitting words in the code-mixed context, and (iv) finally, when hate is subjective or sarcastic, naively mixing native samples doesn't help much to detect code-mixed hate. We will release the data and code repository to reproduce the reported results.

6/3/2024