Code-mixed Sentiment and Hate-speech Prediction

Read original: arXiv:2405.12929 - Published 5/22/2024 by Anjali Yadav, Tanya Garg, Matej Klemen, Matej Ulcar, Basant Agarwal, Marko Robnik Sikonja

Overview

Code-mixing, the combination of multiple languages in a single text, is common in informal communication in multilingual societies.
As large language models have become dominant in natural language processing, the researchers investigated their performance on code-mixed data for tasks like sentiment analysis and offensive language detection.
The researchers created new bilingual pre-trained language models for English-Hindi and English-Slovene, then evaluated monolingual, bilingual, and multilingual models on code-mixed and non-code-mixed data.

Plain English Explanation

In many parts of the world, people often mix two or more languages when they communicate informally. This is called "code-mixing." For example, someone might write a social media post that includes both English and Hindi words. Because large language models now dominate most natural language processing tasks, the researchers wanted to see how well these models perform on text that contains this kind of code-mixing.

First, the researchers created new language models that were trained on a mix of English and Hindi, as well as English and Slovene. These models were designed to handle informal, code-mixed language better than more general-purpose models.

Then, the researchers tested different types of language models - monolingual (single-language), bilingual (two-language), and multilingual (many-language) - on two tasks: sentiment analysis (determining if a piece of text expresses positive or negative feelings) and detecting offensive language in social media posts. They compared the models' performance on code-mixed data versus non-code-mixed data.

The results showed that the fine-tuned bilingual and multilingual models, especially those specialized for social media text, performed best on these tasks. Non-specialized massively multilingual and monolingual models came next, while the very large generative language models were not competitive. Interestingly, the models often performed slightly better on the code-mixed data than on the non-code-mixed data for these affective tasks.

Technical Explanation

The researchers first created four new bilingual pre-trained masked language models for the English-Hindi and English-Slovene language pairs, with the goal of better supporting informal, code-mixed language.
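To make "masked language model" concrete, here is a minimal sketch of the masking step used in BERT-style pre-training. It is an illustration under stated assumptions, not the authors' recipe: this summary does not give the paper's masking rate, tokenizer, or corruption scheme. The `mask_tokens` function, the 80/10/10 split, whitespace tokenization, and the example sentence are all assumptions chosen for illustration.

```python
import random

MASK_TOKEN = "[MASK]"  # placeholder symbol; real tokenizers define their own


def mask_tokens(tokens, vocab, mask_prob=0.15, seed=0):
    """Return (corrupted_tokens, labels) using BERT-style masking.

    labels[i] holds the original token at positions the model must predict,
    and None at all other positions.
    """
    rng = random.Random(seed)
    corrupted = list(tokens)
    labels = [None] * len(tokens)
    for i, token in enumerate(tokens):
        if rng.random() >= mask_prob:
            continue  # position not selected for prediction
        labels[i] = token
        roll = rng.random()
        if roll < 0.8:
            corrupted[i] = MASK_TOKEN         # 80%: replace with the mask symbol
        elif roll < 0.9:
            corrupted[i] = rng.choice(vocab)  # 10%: replace with a random token
        # remaining 10%: keep the original token unchanged
    return corrupted, labels


# Hypothetical English-Hindi example, whitespace-tokenized for simplicity.
sentence = "yaar this movie was bahut accha but ending thoda boring tha".split()
vocab = sorted(set(sentence))
corrupted, labels = mask_tokens(sentence, vocab, mask_prob=0.3, seed=1)
```

During pre-training, the model sees `corrupted` and is trained to recover the original token wherever `labels` is not `None`. Training on informal bilingual text in this way exposes the model to mixed-language contexts.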

They then evaluated the performance of monolingual, bilingual, few-lingual, and massively multilingual models on two tasks that frequently involve code-mixed text: sentiment analysis and offensive language detection in social media posts. The models included fine-tuned bilingual models, fine-tuned multilingual models specialized for social media, as well as non-specialized massively multilingual and monolingual models.

The results showed that the most successful classifiers were the fine-tuned bilingual and multilingual models specialized for social media texts. These outperformed the non-specialized massively multilingual and monolingual models. The huge generative language models were not competitive for these affective tasks.
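Ranking classifiers like this requires a scoring metric. This summary does not say which metric the paper reports, so the sketch below assumes macro-averaged F1, a common choice for sentiment and offensive-language tasks where classes are imbalanced. The `macro_f1` helper and the toy labels are hypothetical.

```python
def macro_f1(y_true, y_pred):
    """Average the per-class F1 scores, weighting every class equally."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and equal in length")
    labels = sorted(set(y_true) | set(y_pred))
    f1_scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never appears
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)


# Toy example: one positive post is misclassified as negative.
gold = ["pos", "pos", "neg", "neg"]
pred = ["pos", "neg", "neg", "neg"]
score = macro_f1(gold, pred)  # (2/3 + 4/5) / 2, roughly 0.733
```

Because every class counts equally, a model that ignores a rare class (for example, the offensive class) is penalized more heavily than it would be under plain accuracy.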

Interestingly, the models generally performed slightly better on code-mixed data than on non-code-mixed data for the sentiment analysis and offensive language detection tasks. This may indicate that the models handle code-mixing effectively and can draw on the additional linguistic information present in mixed-language text, though the difference is small.
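A comparison of this kind can be sketched as follows. This summary does not describe how the paper separated code-mixed from non-code-mixed texts, so the sketch assumes per-token language tags are available and uses the utterance-level Code-Mixing Index (CMI) in the Das and Gambäck formulation as the splitting criterion. The tag names, the threshold, and both helper functions are hypothetical.

```python
from collections import Counter


def code_mixing_index(lang_tags, neutral_tag="univ"):
    """CMI = 100 * (1 - max_lang_count / language_tagged_tokens).

    Tokens tagged `neutral_tag` (punctuation, emoji, names) are excluded.
    Returns 0.0 for monolingual text or text with no language-tagged tokens.
    """
    counts = Counter(tag for tag in lang_tags if tag != neutral_tag)
    lang_tokens = sum(counts.values())
    if lang_tokens == 0:
        return 0.0
    return 100.0 * (1 - max(counts.values()) / lang_tokens)


def accuracy_by_subset(examples, threshold=0.0):
    """Split (lang_tags, gold, pred) triples by CMI and score each subset."""
    buckets = {"code_mixed": [0, 0], "non_code_mixed": [0, 0]}
    for tags, gold, pred in examples:
        key = "code_mixed" if code_mixing_index(tags) > threshold else "non_code_mixed"
        buckets[key][0] += int(gold == pred)  # correct predictions
        buckets[key][1] += 1                  # total examples
    return {k: (c / t if t else None) for k, (c, t) in buckets.items()}


# Toy data: two code-mixed posts and one English-only post.
examples = [
    (["en", "hi", "hi"], "pos", "pos"),
    (["en", "hi"], "neg", "pos"),
    (["en", "en"], "neg", "neg"),
]
report = accuracy_by_subset(examples)
```

Reporting the two subsets side by side, rather than a single pooled score, is what makes a statement like "slightly better on code-mixed data" checkable.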

Critical Analysis

The researchers acknowledge some limitations of their work. They only evaluated the models on two specific tasks - sentiment analysis and offensive language detection. The performance of these models may differ for other NLP tasks that involve code-mixed data.

Additionally, the study was limited to the English-Hindi and English-Slovene language pairs. Code-mixing can occur in many other language combinations, and the findings may not generalize to those cases.

While the researchers created new bilingual pre-trained models, it's unclear if these models were truly optimized for informal, code-mixed language, or if further dataset curation and model architecture innovations would be needed to fully unlock the potential of language models for this domain.

Furthermore, the use of synthetic data generation techniques to augment the training data was not explored in this work, which could potentially improve the models' ability to handle code-mixing.

Overall, this research provides a useful benchmark for the current state of language models on code-mixed data, but there is still room for improvement and further exploration in this area.

Conclusion

This study investigated the performance of various language models, including monolingual, bilingual, and multilingual models, on code-mixed text for sentiment analysis and offensive language detection tasks. The key finding is that fine-tuned bilingual and multilingual models specialized for social media performed the best on these code-mixed datasets, outperforming even massive generalist language models.

These results suggest that tailored models that can effectively handle the linguistic complexities of code-mixing are needed to unlock the full potential of natural language processing in multilingual societies. Further research is needed to extend these findings to other language pairs and NLP tasks, as well as to explore additional modeling techniques to improve performance on code-mixed data.

Related Papers

Code-mixed Sentiment and Hate-speech Prediction

Anjali Yadav, Tanya Garg, Matej Klemen, Matej Ulcar, Basant Agarwal, Marko Robnik Sikonja

Code-mixed discourse combines multiple languages in a single text. It is commonly used in informal discourse in countries with several official languages, but also in many other countries in combination with English or neighboring languages. As recently large language models have dominated most natural language processing tasks, we investigated their performance in code-mixed settings for relevant tasks. We first created four new bilingual pre-trained masked language models for English-Hindi and English-Slovene languages, specifically aimed to support informal language. Then we performed an evaluation of monolingual, bilingual, few-lingual, and massively multilingual models on several languages, using two tasks that frequently contain code-mixed text, in particular, sentiment analysis and offensive language detection in social media texts. The results show that the most successful classifiers are fine-tuned bilingual models and multilingual models, specialized for social media texts, followed by non-specialized massively multilingual and monolingual models, while huge generative models are not competitive. For our affective problems, the models mostly perform slightly better on code-mixed data compared to non-code-mixed data.

5/22/2024

Improving code-mixed hate detection by native sample mixing: A case study for Hindi-English code-mixed scenario

Debajyoti Mazumder, Aakash Kumar, Jasabanta Patro

Hate detection has long been a challenging task for the NLP community. The task becomes complex in a code-mixed environment because the models must understand the context and the hate expressed through language alteration. Compared to the monolingual setup, we see very little work on code-mixed hate as large-scale annotated hate corpora are unavailable to make the study. To overcome this bottleneck, we propose using native language hate samples. We hypothesise that in the era of multilingual language models (MLMs), hate in code-mixed settings can be detected by majorly relying on the native language samples. Even though the NLP literature reports the effectiveness of MLMs on hate detection in many cross-lingual settings, their extensive evaluation in a code-mixed scenario is yet to be done. This paper attempts to fill this gap through rigorous empirical experiments. We considered the Hindi-English code-mixed setup as a case study as we have the linguistic expertise for the same. Some of the interesting observations we got are: (i) adding native hate samples in the code-mixed training set, even in small quantity, improved the performance of MLMs for code-mixed hate detection, (ii) MLMs trained with native samples alone were observed to detect code-mixed hate to a large extent, (iii) the visualisation of attention scores revealed that, when native samples were included in training, MLMs could better focus on the hate emitting words in the code-mixed context, and (iv) finally, when hate is subjective or sarcastic, naively mixing native samples doesn't help much to detect code-mixed hate. We will release the data and code repository to reproduce the reported results.

6/3/2024

From Human Judgements to Predictive Models: Unravelling Acceptability in Code-Mixed Sentences

Prashant Kodali, Anmol Goel, Likhith Asapu, Vamshi Krishna Bonagiri, Anirudh Govil, Monojit Choudhury, Manish Shrivastava, Ponnurangam Kumaraguru

Current computational approaches for analysing or generating code-mixed sentences do not explicitly model naturalness or acceptability of code-mixed sentences, but rely on training corpora to reflect distribution of acceptable code-mixed sentences. Modelling human judgement for the acceptability of code-mixed text can help in distinguishing natural code-mixed text and enable quality-controlled generation of code-mixed text. To this end, we construct Cline - a dataset containing human acceptability judgements for English-Hindi (en-hi) code-mixed text. Cline is the largest of its kind with 16,642 sentences, consisting of samples sourced from two sources: synthetically generated code-mixed text and samples collected from online social media. Our analysis establishes that popular code-mixing metrics such as CMI, Number of Switch Points, and Burstiness, which are used to filter/curate/compare code-mixed corpora, have low correlation with human acceptability judgements, underlining the necessity of our dataset. Experiments using Cline demonstrate that simple Multilayer Perceptron (MLP) models trained solely on code-mixing metrics are outperformed by fine-tuned pre-trained Multilingual Large Language Models (MLLMs). Specifically, XLM-Roberta and Bernice outperform IndicBERT across different configurations in challenging data settings. Comparison with ChatGPT's zero and fewshot capabilities shows that MLLMs fine-tuned on larger data outperform ChatGPT, providing scope for improvement in code-mixed tasks. Zero-shot transfer from English-Hindi to English-Telugu acceptability judgments using our model checkpoints proves superior to random baselines, enabling application to other code-mixed language pairs and providing further avenues of research. We publicly release our human-annotated dataset, trained checkpoints, code-mix corpus, and code for data generation and model training.

5/10/2024

BnSentMix: A Diverse Bengali-English Code-Mixed Dataset for Sentiment Analysis

Sadia Alam, Md Farhan Ishmam, Navid Hasin Alvee, Md Shahnewaz Siddique, Md Azam Hossain, Abu Raihan Mostofa Kamal

The widespread availability of code-mixed data can provide valuable insights into low-resource languages like Bengali, which have limited datasets. Sentiment analysis has been a fundamental text classification task across several languages for code-mixed data. However, there has yet to be a large-scale and diverse sentiment analysis dataset on code-mixed Bengali. We address this limitation by introducing BnSentMix, a sentiment analysis dataset on code-mixed Bengali consisting of 20,000 samples with 4 sentiment labels from Facebook, YouTube, and e-commerce sites. We ensure diversity in data sources to replicate realistic code-mixed scenarios. Additionally, we propose 14 baseline methods including novel transformer encoders further pre-trained on code-mixed Bengali-English, achieving an overall accuracy of 69.8% and an F1 score of 69.1% on sentiment classification tasks. Detailed analyses reveal variations in performance across different sentiment labels and text types, highlighting areas for future improvement.

8/20/2024