Improving code-mixed hate detection by native sample mixing: A case study for Hindi-English code-mixed scenario

Read original: arXiv:2405.20755 - Published 6/3/2024 by Debajyoti Mazumder, Aakash Kumar, Jasabanta Patro

Overview

Hate detection is a challenging task for natural language processing (NLP) models, especially in code-mixed environments where multiple languages are mixed in a single text.
Annotated hate speech datasets for code-mixed settings are scarce, making it difficult to train effective models.
This paper explores the use of native language hate samples to improve the performance of multilingual language models (MLMs) on code-mixed hate detection.

Plain English Explanation

Detecting hate speech in text is a difficult problem for computers to solve. It becomes even more complicated when the text contains a mix of multiple languages, a common phenomenon known as "code-mixing." In a code-mixed environment, the models need to understand the context and the way language is being used to express hate.

Compared to working with text in a single language, there is much less research on detecting hate in code-mixed settings. This is because large-scale datasets with annotations for code-mixed hate speech are hard to come by. To address this issue, the researchers in this paper propose using hate speech samples from the native languages as a way to improve the performance of MLMs on code-mixed hate detection.

The key idea is that even though the text being analyzed contains a mix of languages, the underlying hateful sentiment may be expressed using words and patterns from the native language. By incorporating native language hate samples into the training data, the researchers hypothesized that MLMs could better detect hate in code-mixed text.

Technical Explanation

The researchers conducted a case study focusing on Hindi-English code-mixed text. They evaluated the performance of MLMs on code-mixed hate detection, both with and without the inclusion of native Hindi hate samples in the training data.
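As a rough illustration of what mixing native samples into a code-mixed training set could look like, here is a minimal Python sketch. The function name `mix_native_samples`, the `native_ratio` parameter, and the placeholder texts are assumptions made for illustration only; this summary does not describe the paper's actual sampling procedure or data format.

```python
import random
from typing import List, Tuple

# (text, label) pairs; label 1 = hate, 0 = non-hate (an assumed encoding).
Sample = Tuple[str, int]


def mix_native_samples(
    code_mixed: List[Sample],
    native: List[Sample],
    native_ratio: float,
    seed: int = 0,
) -> List[Sample]:
    """Return the code-mixed set plus a random subset of native samples.

    native_ratio is the number of native samples to add, expressed as a
    fraction of the code-mixed set size. The result is shuffled.
    """
    if native_ratio < 0:
        raise ValueError("native_ratio must be non-negative")
    rng = random.Random(seed)
    # Never request more native samples than are available.
    n_native = min(len(native), int(native_ratio * len(code_mixed)))
    mixed = list(code_mixed) + rng.sample(native, n_native)
    rng.shuffle(mixed)
    return mixed


# Placeholder data standing in for real annotated samples.
code_mixed_train = [(f"code_mixed_text_{i}", i % 2) for i in range(8)]
native_hindi = [(f"native_hindi_text_{i}", 1) for i in range(5)]

train_set = mix_native_samples(code_mixed_train, native_hindi, native_ratio=0.25)
print(len(train_set))  # 10: the 8 code-mixed samples plus 2 native samples
```

Setting `native_ratio=0.0` gives the code-mixed-only baseline, so the same function can produce both training conditions being compared.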

Their experiments revealed several interesting findings:

Adding even a small amount of native hate samples to the code-mixed training data improved the performance of MLMs on code-mixed hate detection.
MLMs trained on native hate samples alone were able to detect code-mixed hate to a large extent.
Visualizing the attention scores of the MLMs showed that when native samples were included in training, the models could better focus on the words that were indicative of hate in the code-mixed context.
However, for cases where the hate was more subjective or sarcastic, simply mixing in native samples did not help much in detecting the code-mixed hate.
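The attention-based observation above can be made concrete with a small sketch that ranks tokens by their attention score. It assumes that one row of attention weights per head (for example, from a classification token to every input token) has already been extracted from a model; the helper names, the placeholder tokens, and the numbers are hypothetical, and the summary does not state how the authors extracted or aggregated attention.

```python
from typing import List, Sequence, Tuple


def average_heads(attn_per_head: Sequence[Sequence[float]]) -> List[float]:
    """Average per-head attention rows into one score per token."""
    n_heads = len(attn_per_head)
    n_tokens = len(attn_per_head[0])
    return [
        sum(head[i] for head in attn_per_head) / n_heads
        for i in range(n_tokens)
    ]


def top_attended_tokens(
    tokens: Sequence[str], scores: Sequence[float], k: int = 3
) -> List[Tuple[str, float]]:
    """Return the k tokens with the highest attention scores."""
    if len(tokens) != len(scores):
        raise ValueError("tokens and scores must have the same length")
    ranked = sorted(zip(tokens, scores), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]


# Placeholder tokens and made-up attention rows for two heads.
tokens = ["[CLS]", "tok_a", "tok_b", "tok_c", "[SEP]"]
attention_rows = [
    [0.10, 0.20, 0.60, 0.05, 0.05],
    [0.10, 0.40, 0.40, 0.05, 0.05],
]

scores = average_heads(attention_rows)
top_two = [token for token, _ in top_attended_tokens(tokens, scores, k=2)]
print(top_two)  # ['tok_b', 'tok_a']
```

Comparing such rankings for a model trained with and without native samples is one simple way to check whether attention shifts toward the words that carry the hate.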

The researchers plan to release the data and code used in this study to allow for the reproduction of their results.

Critical Analysis

The paper presents a promising approach to code-mixed hate detection, an important yet understudied problem in NLP. Its main contribution is showing that native language hate samples improve the performance of MLMs on this task, even when they are added in small quantities.

However, the authors acknowledge that their approach may have limitations when it comes to detecting more subtle or sarcastic forms of hate speech in code-mixed text. This highlights the need for further research into more advanced techniques for understanding the nuances of hate expression in multilingual contexts.

Additionally, the availability and quality of annotated code-mixed hate speech datasets remain a significant challenge. The researchers' plan to release their data and code is a positive step, but more work is needed to create comprehensive, high-quality resources for the community to advance research in this area.

Conclusion

This paper demonstrates the potential of using native language hate samples to enhance the performance of MLMs on the task of code-mixed hate detection. The researchers' findings suggest that this approach can be a valuable tool for tackling the challenge of hate speech identification in multilingual settings.

As the use of code-mixing continues to grow, especially on social media and other online platforms, the ability to effectively detect and mitigate hateful content in these environments becomes increasingly important. The insights gained from this study can contribute to the development of more robust and comprehensive solutions for addressing the problem of hate speech in a diverse, globalized world.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Improving code-mixed hate detection by native sample mixing: A case study for Hindi-English code-mixed scenario

Debajyoti Mazumder, Aakash Kumar, Jasabanta Patro

Hate detection has long been a challenging task for the NLP community. The task becomes complex in a code-mixed environment because the models must understand the context and the hate expressed through language alteration. Compared to the monolingual setup, we see very little work on code-mixed hate, as large-scale annotated hate corpora are unavailable for such studies. To overcome this bottleneck, we propose using native language hate samples. We hypothesise that in the era of multilingual language models (MLMs), hate in code-mixed settings can be detected by relying mainly on the native language samples. Even though the NLP literature reports the effectiveness of MLMs on hate detection in many cross-lingual settings, their extensive evaluation in a code-mixed scenario is yet to be done. This paper attempts to fill this gap through rigorous empirical experiments. We considered the Hindi-English code-mixed setup as a case study as we have the linguistic expertise for it. Some of the interesting observations we got are: (i) adding native hate samples to the code-mixed training set, even in small quantity, improved the performance of MLMs for code-mixed hate detection, (ii) MLMs trained with native samples alone were observed to detect code-mixed hate to a large extent, (iii) the visualisation of attention scores revealed that, when native samples were included in training, MLMs could better focus on the hate-emitting words in the code-mixed context, and (iv) finally, when hate is subjective or sarcastic, naively mixing native samples doesn't help much to detect code-mixed hate. We will release the data and code repository to reproduce the reported results.

6/3/2024

Code-mixed Sentiment and Hate-speech Prediction

Anjali Yadav, Tanya Garg, Matej Klemen, Matej Ulcar, Basant Agarwal, Marko Robnik Sikonja

Code-mixed discourse combines multiple languages in a single text. It is commonly used in informal discourse in countries with several official languages, but also in many other countries in combination with English or neighboring languages. As recently large language models have dominated most natural language processing tasks, we investigated their performance in code-mixed settings for relevant tasks. We first created four new bilingual pre-trained masked language models for English-Hindi and English-Slovene languages, specifically aimed to support informal language. Then we performed an evaluation of monolingual, bilingual, few-lingual, and massively multilingual models on several languages, using two tasks that frequently contain code-mixed text, in particular, sentiment analysis and offensive language detection in social media texts. The results show that the most successful classifiers are fine-tuned bilingual models and multilingual models, specialized for social media texts, followed by non-specialized massively multilingual and monolingual models, while huge generative models are not competitive. For our affective problems, the models mostly perform slightly better on code-mixed data compared to non-code-mixed data.

5/22/2024

From Human Judgements to Predictive Models: Unravelling Acceptability in Code-Mixed Sentences

Prashant Kodali, Anmol Goel, Likhith Asapu, Vamshi Krishna Bonagiri, Anirudh Govil, Monojit Choudhury, Manish Shrivastava, Ponnurangam Kumaraguru

Current computational approaches for analysing or generating code-mixed sentences do not explicitly model naturalness or acceptability of code-mixed sentences, but rely on training corpora to reflect distribution of acceptable code-mixed sentences. Modelling human judgement for the acceptability of code-mixed text can help in distinguishing natural code-mixed text and enable quality-controlled generation of code-mixed text. To this end, we construct Cline - a dataset containing human acceptability judgements for English-Hindi (en-hi) code-mixed text. Cline is the largest of its kind with 16,642 sentences, consisting of samples sourced from two sources: synthetically generated code-mixed text and samples collected from online social media. Our analysis establishes that popular code-mixing metrics such as CMI, Number of Switch Points, and Burstiness, which are used to filter/curate/compare code-mixed corpora, have low correlation with human acceptability judgements, underlining the necessity of our dataset. Experiments using Cline demonstrate that simple Multilayer Perceptron (MLP) models trained solely on code-mixing metrics are outperformed by fine-tuned pre-trained Multilingual Large Language Models (MLLMs). Specifically, XLM-Roberta and Bernice outperform IndicBERT across different configurations in challenging data settings. Comparison with ChatGPT's zero-shot and few-shot capabilities shows that MLLMs fine-tuned on larger data outperform ChatGPT, providing scope for improvement in code-mixed tasks. Zero-shot transfer from English-Hindi to English-Telugu acceptability judgments using our model checkpoints proves superior to random baselines, enabling application to other code-mixed language pairs and providing further avenues of research. We publicly release our human-annotated dataset, trained checkpoints, code-mix corpus, and code for data generation and model training.

5/10/2024

EmoMix-3L: A Code-Mixed Dataset for Bangla-English-Hindi Emotion Detection

Nishat Raihan, Dhiman Goswami, Antara Mahmud, Antonios Anastasopoulos, Marcos Zampieri

Code-mixing is a well-studied linguistic phenomenon that occurs when two or more languages are mixed in text or speech. Several studies have been conducted on building datasets and performing downstream NLP tasks on code-mixed data. Although it is not uncommon to observe code-mixing of three or more languages, most available datasets in this domain contain code-mixed data from only two languages. In this paper, we introduce EmoMix-3L, a novel multi-label emotion detection dataset containing code-mixed data from three different languages. We experiment with several models on EmoMix-3L and we report that MuRIL outperforms other models on this dataset.

5/14/2024