From Human Judgements to Predictive Models: Unravelling Acceptability in Code-Mixed Sentences

Read original: arXiv:2405.05572 - Published 5/10/2024 by Prashant Kodali, Anmol Goel, Likhith Asapu, Vamshi Krishna Bonagiri, Anirudh Govil, Monojit Choudhury, Manish Shrivastava, Ponnurangam Kumaraguru

Overview

The paper proposes a dataset called Cline, which contains human acceptability judgments for English-Hindi code-mixed text, to help model the naturalness of code-mixed sentences.
The dataset is the largest of its kind, with 16,642 sentences, and includes samples from synthetic code-mixed text and online social media.
The paper finds that popular code-mixing metrics, such as CMI and Number of Switch Points, have low correlation with human acceptability judgments, highlighting the need for the Cline dataset.
Experiments show that fine-tuned Multilingual Large Language Models (MLLMs) like XLM-Roberta and Bernice outperform simpler models and even ChatGPT in code-mixed acceptability tasks.
The paper also shows that models fine-tuned on English-Hindi data transfer zero-shot to English-Telugu acceptability judgments, beating random baselines, which suggests applicability to other code-mixed language pairs.

Plain English Explanation

Code-mixing is the practice of combining words or phrases from different languages within a single sentence or conversation. Computational approaches for analyzing or generating code-mixed sentences often rely on training data that reflects the distribution of acceptable code-mixed sentences, but they do not explicitly model the naturalness or acceptability of these sentences.

The researchers in this paper aimed to address this by constructing a dataset called Cline, which contains human judgments on the acceptability of English-Hindi code-mixed text. The dataset includes both synthetically generated code-mixed sentences and samples collected from online social media and, with 16,642 sentences, is the largest of its kind.

The researchers found that popular code-mixing metrics, such as the Code-Mixing Index (CMI) and the Number of Switch Points, have a low correlation with human acceptability judgments. This underscores the need for a dataset like Cline, which can help distinguish natural code-mixed text and enable quality-controlled generation of such text.
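To make the two metrics concrete, here is a minimal sketch of how CMI and the number of switch points can be computed from token-level language tags, and how either one can be rank-correlated with human ratings. It assumes the commonly used definition CMI = 100 * (1 - max_lang_count / (n - u)), where n is the number of tokens and u the number of language-independent tokens; the summary does not say which variant or which correlation coefficient the paper uses. The tag names, the `NEUTRAL` label, and the toy ratings are all made up for illustration.

```python
from collections import Counter

NEUTRAL = "other"  # assumed tag for language-independent tokens (names, punctuation)


def cmi(tags):
    """Code-Mixing Index: 100 * (1 - max_lang_count / (n - u)), or 0 if n == u."""
    n = len(tags)
    lang_counts = Counter(t for t in tags if t != NEUTRAL)
    u = n - sum(lang_counts.values())  # number of language-independent tokens
    if n == u:  # no language-tagged tokens at all
        return 0.0
    return 100.0 * (1.0 - max(lang_counts.values()) / (n - u))


def switch_points(tags):
    """Count language changes between consecutive language-tagged tokens."""
    langs = [t for t in tags if t != NEUTRAL]
    return sum(1 for a, b in zip(langs, langs[1:]) if a != b)


def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rank correlation (Pearson correlation of the ranks).

    Raises ZeroDivisionError if either input is constant.
    """
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / (vx * vy) ** 0.5


if __name__ == "__main__":
    # Toy, made-up data: (token-level tags, hypothetical human rating).
    samples = [
        (["hi", "hi", "en", "hi"], 4.5),
        (["en", "hi", "en", "hi"], 2.0),
        (["en", "en", "en", "other"], 4.0),
        (["hi", "en", "en", "hi", "hi"], 3.5),
    ]
    cmis = [cmi(tags) for tags, _ in samples]
    ratings = [rating for _, rating in samples]
    print("CMI per sentence:", cmis)
    print("Spearman(CMI, rating):", spearman(cmis, ratings))
```

A coefficient near zero on real annotated data would correspond to the "low correlation" the paper reports; the toy numbers here say nothing about the actual result.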

To demonstrate the usefulness of the Cline dataset, the researchers conducted experiments using various machine learning models. They found that fine-tuned Multilingual Large Language Models (MLLMs) like XLM-Roberta and Bernice outperformed simpler models and even the popular language model ChatGPT in code-mixed acceptability tasks.

Additionally, the researchers showed that models fine-tuned on the English-Hindi data could be applied zero-shot (without any further training) to English-Telugu acceptability judgments, suggesting the potential for applying this approach to other code-mixed language pairs.

Technical Explanation

The researchers constructed the Cline dataset, which contains 16,642 sentences with human acceptability judgments for English-Hindi code-mixed text. The dataset includes samples from two sources: synthetically generated code-mixed text and samples collected from online social media.

To evaluate the usefulness of the Cline dataset, the researchers conducted experiments using various machine learning models. They found that simple Multilayer Perceptron (MLP) models trained solely on code-mixing metrics were outperformed by fine-tuned pre-trained Multilingual Large Language Models (MLLMs) like XLM-Roberta and Bernice.

The researchers also compared the performance of their fine-tuned MLLMs with the zero-shot and few-shot capabilities of ChatGPT, a popular language model. The results showed that MLLMs fine-tuned on larger amounts of training data outperformed ChatGPT, which leaves scope for improving general-purpose models on code-mixed tasks.
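The difference between the zero-shot and few-shot settings is only whether rated examples are placed in the prompt. The sketch below illustrates that idea; the prompt wording, the 1-to-5 scale, and the function name `build_prompt` are assumptions for illustration, since this summary does not give the prompts or rating scale used in the paper.

```python
def build_prompt(sentence, examples=()):
    """Build an acceptability prompt.

    With no examples this is a zero-shot prompt; with (text, rating)
    pairs in `examples` it becomes a few-shot prompt.
    """
    lines = [
        # Hypothetical instruction text and rating scale.
        "Rate how natural the following English-Hindi code-mixed sentence "
        "sounds on a scale from 1 (unnatural) to 5 (natural). "
        "Answer with a single number.",
        "",
    ]
    for text, rating in examples:
        lines.append(f"Sentence: {text}")
        lines.append(f"Rating: {rating}")
        lines.append("")
    lines.append(f"Sentence: {sentence}")
    lines.append("Rating:")
    return "\n".join(lines)


if __name__ == "__main__":
    # Placeholder sentences and ratings, not taken from the dataset.
    shots = [("<rated example sentence 1>", 4), ("<rated example sentence 2>", 2)]
    print(build_prompt("<sentence to rate>"))          # zero-shot
    print(build_prompt("<sentence to rate>", shots))   # few-shot
```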

Additionally, the researchers applied their English-Hindi model checkpoints zero-shot to English-Telugu acceptability judgments, where they outperformed random baselines. This suggests the potential for applying the Cline dataset and the researchers' approach to other code-mixed language pairs.
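A comparison against a random baseline can be sketched as follows. This is an illustration under assumptions, not the paper's evaluation protocol: it uses root-mean-square error as the metric and draws random guesses uniformly from a rating range, neither of which is specified in this summary. The numbers in the usage example are made up.

```python
import random


def rmse(predictions, targets):
    """Root-mean-square error between predicted and human ratings."""
    squared = [(p - t) ** 2 for p, t in zip(predictions, targets)]
    return (sum(squared) / len(targets)) ** 0.5


def random_baseline_rmse(targets, low, high, trials=1000, seed=0):
    """Average RMSE of uniform random guesses in [low, high] over several trials."""
    rng = random.Random(seed)  # fixed seed so the baseline is reproducible
    total = 0.0
    for _ in range(trials):
        guesses = [rng.uniform(low, high) for _ in targets]
        total += rmse(guesses, targets)
    return total / trials


if __name__ == "__main__":
    # Hypothetical human ratings and zero-shot model predictions.
    human = [1.0, 3.0, 5.0, 4.0]
    model = [1.5, 3.0, 4.5, 3.5]
    print("model RMSE:", rmse(model, human))
    print("random RMSE:", random_baseline_rmse(human, low=1.0, high=5.0))
```

A transferred model is only informative on the new language pair if its error is clearly below the random-guess error.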

Critical Analysis

The researchers acknowledge several limitations and areas for further research in their paper. For example, they note that the Cline dataset is limited to English-Hindi code-mixed text and suggest that expanding the dataset to include other language pairs would be valuable.

Additionally, the researchers highlight the need for more comprehensive metrics to assess the naturalness and acceptability of code-mixed text beyond the traditional code-mixing metrics they investigated. Exploring alternative approaches to model human judgments of code-mixed text could be a fruitful area for future research.

While the researchers demonstrate the superiority of fine-tuned MLLMs over simpler models and ChatGPT, it would be interesting to see how these models perform on more diverse and challenging code-mixed tasks, such as generation or translation, in addition to the acceptability judgments explored in this paper.

Overall, the Cline dataset and the researchers' findings provide a valuable contribution to the field of code-mixed language processing, highlighting the importance of modeling human judgments and the potential of advanced language models in this domain.

Conclusion

This paper presents the Cline dataset, a large-scale collection of human acceptability judgments for English-Hindi code-mixed text. The researchers found that popular code-mixing metrics have limited correlation with human judgments, underscoring the need for datasets like Cline to better understand and model the naturalness of code-mixed language.

Experiments using the Cline dataset demonstrated the superior performance of fine-tuned Multilingual Large Language Models (MLLMs) over simpler models and even the popular ChatGPT language model in code-mixed acceptability tasks. The researchers also showed that their English-Hindi checkpoints transfer zero-shot to English-Telugu, suggesting that the approach may apply to other code-mixed language pairs.

The Cline dataset and the researchers' findings offer valuable insights for the development of more reliable and accurate computational approaches to analyzing and generating code-mixed text. This work has the potential to contribute to various applications, from code-mixed language processing in social media to quality-controlled generation of code-mixed content.

Related Papers

From Human Judgements to Predictive Models: Unravelling Acceptability in Code-Mixed Sentences

Prashant Kodali, Anmol Goel, Likhith Asapu, Vamshi Krishna Bonagiri, Anirudh Govil, Monojit Choudhury, Manish Shrivastava, Ponnurangam Kumaraguru

Current computational approaches for analysing or generating code-mixed sentences do not explicitly model naturalness or acceptability of code-mixed sentences, but rely on training corpora to reflect distribution of acceptable code-mixed sentences. Modelling human judgement for the acceptability of code-mixed text can help in distinguishing natural code-mixed text and enable quality-controlled generation of code-mixed text. To this end, we construct Cline - a dataset containing human acceptability judgements for English-Hindi (en-hi) code-mixed text. Cline is the largest of its kind with 16,642 sentences, consisting of samples sourced from two sources: synthetically generated code-mixed text and samples collected from online social media. Our analysis establishes that popular code-mixing metrics such as CMI, Number of Switch Points, and Burstiness, which are used to filter/curate/compare code-mixed corpora, have low correlation with human acceptability judgements, underlining the necessity of our dataset. Experiments using Cline demonstrate that simple Multilayer Perceptron (MLP) models trained solely on code-mixing metrics are outperformed by fine-tuned pre-trained Multilingual Large Language Models (MLLMs). Specifically, XLM-Roberta and Bernice outperform IndicBERT across different configurations in challenging data settings. Comparison with ChatGPT's zero-shot and few-shot capabilities shows that MLLMs fine-tuned on larger data outperform ChatGPT, providing scope for improvement in code-mixed tasks. Zero-shot transfer from English-Hindi to English-Telugu acceptability judgments using our model checkpoints proves superior to random baselines, enabling application to other code-mixed language pairs and providing further avenues of research. We publicly release our human-annotated dataset, trained checkpoints, code-mix corpus, and code for data generation and model training.

5/10/2024

Code-mixed Sentiment and Hate-speech Prediction

Anjali Yadav, Tanya Garg, Matej Klemen, Matej Ulcar, Basant Agarwal, Marko Robnik Sikonja

Code-mixed discourse combines multiple languages in a single text. It is commonly used in informal discourse in countries with several official languages, but also in many other countries in combination with English or neighboring languages. As large language models have recently come to dominate most natural language processing tasks, we investigated their performance in code-mixed settings for relevant tasks. We first created four new bilingual pre-trained masked language models for English-Hindi and English-Slovene languages, specifically aimed to support informal language. Then we performed an evaluation of monolingual, bilingual, few-lingual, and massively multilingual models on several languages, using two tasks that frequently contain code-mixed text, in particular, sentiment analysis and offensive language detection in social media texts. The results show that the most successful classifiers are fine-tuned bilingual models and multilingual models, specialized for social media texts, followed by non-specialized massively multilingual and monolingual models, while huge generative models are not competitive. For our affective problems, the models mostly perform slightly better on code-mixed data compared to non-code-mixed data.

5/22/2024

Improving code-mixed hate detection by native sample mixing: A case study for Hindi-English code-mixed scenario

Debajyoti Mazumder, Aakash Kumar, Jasabanta Patro

Hate detection has long been a challenging task for the NLP community. The task becomes complex in a code-mixed environment because the models must understand the context and the hate expressed through language alteration. Compared to the monolingual setup, we see very little work on code-mixed hate, as large-scale annotated hate corpora are unavailable for such studies. To overcome this bottleneck, we propose using native language hate samples. We hypothesise that in the era of multilingual language models (MLMs), hate in code-mixed settings can be detected by relying mainly on the native language samples. Even though the NLP literature reports the effectiveness of MLMs on hate detection in many cross-lingual settings, their extensive evaluation in a code-mixed scenario is yet to be done. This paper attempts to fill this gap through rigorous empirical experiments. We considered the Hindi-English code-mixed setup as a case study as we have the linguistic expertise for the same. Some of the interesting observations we got are: (i) adding native hate samples to the code-mixed training set, even in small quantity, improved the performance of MLMs for code-mixed hate detection, (ii) MLMs trained with native samples alone were observed to detect code-mixed hate to a large extent, (iii) the visualisation of attention scores revealed that, when native samples were included in training, MLMs could better focus on the hate-emitting words in the code-mixed context, and (iv) finally, when hate is subjective or sarcastic, naively mixing native samples doesn't help much to detect code-mixed hate. We will release the data and code repository to reproduce the reported results.

6/3/2024

Synthetic Data Generation and Joint Learning for Robust Code-Mixed Translation

Kartik Kartik, Sanjana Soni, Anoop Kunchukuttan, Tanmoy Chakraborty, Md Shad Akhtar

The widespread online communication in a modern multilingual world has provided opportunities to blend more than one language (aka code-mixed language) in a single utterance. This has resulted in a formidable challenge for computational models due to the scarcity of annotated data and the presence of noise. A potential solution to mitigate the data scarcity problem in a low-resource setup is to leverage existing data in a resource-rich language through translation. In this paper, we tackle the problem of code-mixed (Hinglish and Bengalish) to English machine translation. First, we synthetically develop HINMIX, a parallel corpus of Hinglish to English, with ~4.2M sentence pairs. Subsequently, we propose RCMT, a robust perturbation-based joint-training model that learns to handle noise in real-world code-mixed text by parameter sharing across clean and noisy words. Further, we show the adaptability of RCMT in a zero-shot setup for Bengalish to English translation. Our evaluation and comprehensive analyses qualitatively and quantitatively demonstrate the superiority of RCMT over state-of-the-art code-mixed and robust translation methods.

5/1/2024