Mapping Violence: Developing an Extensive Framework to Build a Bangla Sectarian Expression Dataset from Social Media Interactions

Read original: arXiv:2404.11752 - Published 4/19/2024 by Nazia Tasnim, Sujan Sen Gupta, Md. Istiak Hossain Shihab, Fatiha Islam Juee, Arunima Tahsin, Pritom Ghum, Kanij Fatema, Marshia Haque, Wasema Farzana, Prionti Nasir, Ashique KhudaBukhsh, Farig Sadeque, Asif Sushmit

Overview

Communal violence in online forums is a significant problem in South Asia, where diverse communities often exhibit strong in-group loyalty and out-group hostility.
Researchers have developed a comprehensive framework for automatically detecting markers of communal violence in Bangla (Bengali) online content, along with a large dataset of 13,000 social media interactions classified into four major violence categories and 16 subcategories.
The study involved a rigorous 7-step expert annotation process, drawing insights from social scientists, linguists, and psychologists.
The research found that, aside from the Non-communal violence category, Religio-communal violence is particularly prevalent in Bangla text, and preliminary benchmarks suggest that fine-tuning language models is an effective way to identify violent comments.

Plain English Explanation

In South Asia, online communities often exhibit strong bonds within their own groups but also hostility towards other groups. This can lead to conflicts that escalate into violence. Researchers have created a system to automatically detect signs of this "communal violence" in Bangla (the language of Bangladesh and parts of India) online content.

They built a large dataset of 13,000 social media interactions that they categorized into four main types of violence, with 16 subcategories. This involved a thorough process of getting input from experts in fields like social science, linguistics, and psychology.

The researchers found that one type of violence, called "Religio-communal violence," is particularly common in the Bangla text they analyzed. They also showed that using advanced language models, which can learn patterns in text, can be a good way to identify violent comments online.

Technical Explanation

The researchers developed a comprehensive framework for automatically detecting communal violence markers in Bangla online content. They created the largest known dataset of its kind, with 13,000 raw social media sentences categorized into four major violence classes and 16 subcategories.
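To make the two-level labeling concrete, the sketch below models a taxonomy of four classes and 16 fine-grained labels. It is only an illustration: "Religio-communal" and "Non-communal" are the two class names this summary mentions, while the remaining class names, all expression names, and the even 4 x 4 split are placeholder assumptions, not the authors' schema.

```python
from dataclasses import dataclass

# Two class names come from the summary; the other two, every expression
# name, and the even 4 x 4 split are hypothetical placeholders.
VIOLENCE_CLASSES = (
    "Religio-communal",
    "Non-communal",
    "class_3_placeholder",
    "class_4_placeholder",
)
EXPRESSIONS = tuple(f"expression_{i}" for i in range(1, 5))

# Each (class, expression) pair is one fine-grained label: 4 * 4 = 16.
LABELS = tuple((c, e) for c in VIOLENCE_CLASSES for e in EXPRESSIONS)


@dataclass(frozen=True)
class AnnotatedSentence:
    text: str
    violence_class: str
    expression: str

    def __post_init__(self):
        # Reject labels that are not part of the taxonomy.
        if (self.violence_class, self.expression) not in LABELS:
            raise ValueError(
                f"unknown label: {self.violence_class}/{self.expression}"
            )


def label_id(sentence: AnnotatedSentence) -> int:
    """Map a (class, expression) pair to an integer id in [0, 16)."""
    return LABELS.index((sentence.violence_class, sentence.expression))
```

Flattening the pair into one integer id is convenient for a single 16-way classifier; keeping the two fields separate would instead suit a coarse classifier followed by a fine one.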

The framework involved a 7-step expert annotation process that incorporated insights from social scientists, linguists, and psychologists. This rigorous approach helped ensure the quality and reliability of the dataset.
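The summary does not say how annotation reliability was measured. One common check, shown here purely as an assumption and not as the authors' procedure, is Cohen's kappa between two annotators, which corrects raw agreement for the agreement expected by chance. The function name and the label strings are illustrative.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two annotators' label lists."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)

    # Fraction of items on which the two annotators agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Agreement expected if each annotator labeled at random according to
    # their own label frequencies.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)

    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1.0 - expected)
```

A value near 1 indicates strong agreement, and a value near 0 indicates agreement no better than chance. With more than two annotators, a multi-rater statistic would be needed instead.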

The researchers analyzed the data statistics and ran preliminary benchmarks with a state-of-the-art Bangla language model to identify violent comments. They found that, aside from the "Non-communal violence" category, "Religio-communal violence" was particularly prevalent in the Bangla text.
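Because the classes are imbalanced, a benchmark on data like this is often reported with macro-averaged F1, which weights every class equally regardless of size. The summary does not state which metric the authors used, so the sketch below is an assumption about a reasonable choice, written without external libraries.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all labels seen in either list."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("need two non-empty label lists of equal length")

    pairs = list(zip(y_true, y_pred))
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the denominator is 0.
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)
```

Plain accuracy can look high when a model predicts only the largest class, whereas macro F1 drops sharply in that case, which is why it is a common choice for skewed label distributions.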

Furthermore, the study's preliminary benchmarking showed that fine-tuning a pretrained language model on the dataset is effective for identifying violent comments. The results are early, but they suggest that adapting an existing model to this task is a practical approach.

Critical Analysis

The work has several limitations. For example, the dataset is limited to Bangla text, so the framework may need to be adapted for other languages. Additionally, the annotation process, while thorough, could still introduce some biases or inconsistencies.

One potential concern is the reliance on language models, which can sometimes reflect and amplify societal biases present in the training data. The researchers may need to investigate the fairness and robustness of their models to ensure they are not perpetuating harmful stereotypes or unfairly targeting certain communities.
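One simple way to probe the concern about unfair targeting is to compare false-positive rates across the communities that comments mention. The sketch below is a hypothetical audit helper, not something described in the paper; the record layout (group name, true label, predicted label) and the group names used in any example are assumptions.

```python
from collections import defaultdict


def false_positive_rate_by_group(records):
    """Return {group: FPR} from (group, is_violent_true, is_violent_pred) tuples.

    Groups with no truly non-violent comments are omitted, because their
    false-positive rate is undefined.
    """
    false_positives = defaultdict(int)
    negatives = defaultdict(int)
    for group, truth, pred in records:
        if not truth:
            negatives[group] += 1
            if pred:
                # Non-violent comment wrongly flagged as violent.
                false_positives[group] += 1
    return {g: false_positives[g] / n for g, n in negatives.items()}
```

A large gap between groups would suggest that the model flags benign comments about one community more often than another, which is worth investigating before deployment.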

Further research could also explore the root causes of communal violence in online forums, such as underlying social, political, or economic factors. Understanding these deeper issues could inform more holistic, long-term solutions beyond just automated detection.

Overall, the researchers have made an important contribution by developing a robust framework and dataset for studying communal violence in Bangla online content. However, continued critical examination and multidisciplinary collaboration will be crucial to address this complex and pervasive problem effectively.

Conclusion

This study presents a comprehensive framework for the automatic detection of communal violence markers in Bangla online content. By creating a large dataset of social media interactions and leveraging expert annotation, the researchers have provided a valuable resource for understanding and addressing this pressing issue.

The finding that Religio-communal violence is particularly prevalent in Bangla text highlights the need for targeted interventions and further research in this area. The demonstrated effectiveness of fine-tuning language models for this task suggests a promising technical approach, though it will need to be implemented thoughtfully and with a focus on fairness and ethical considerations.

This work is a step toward countering communal violence and fostering more inclusive online communities, in South Asia and potentially in other regions. Turning these research insights into lasting solutions will require continued interdisciplinary collaboration and critical analysis.

Related Papers

Mapping Violence: Developing an Extensive Framework to Build a Bangla Sectarian Expression Dataset from Social Media Interactions

Nazia Tasnim, Sujan Sen Gupta, Md. Istiak Hossain Shihab, Fatiha Islam Juee, Arunima Tahsin, Pritom Ghum, Kanij Fatema, Marshia Haque, Wasema Farzana, Prionti Nasir, Ashique KhudaBukhsh, Farig Sadeque, Asif Sushmit

Communal violence in online forums has become extremely prevalent in South Asia, where many communities of different cultures coexist and share resources. These societies exhibit a phenomenon characterized by strong bonds within their own groups and animosity towards others, leading to conflicts that frequently escalate into violent confrontations. To address this issue, we have developed the first comprehensive framework for the automatic detection of communal violence markers in online Bangla content accompanying the largest collection (13K raw sentences) of social media interactions that fall under the definition of four major violence class and their 16 coarse expressions. Our workflow introduces a 7-step expert annotation process incorporating insights from social scientists, linguists, and psychologists. By presenting data statistics and benchmarking performance using this dataset, we have determined that, aside from the category of Non-communal violence, Religio-communal violence is particularly pervasive in Bangla text. Moreover, we have substantiated the effectiveness of fine-tuning language models in identifying violent comments by conducting preliminary benchmarking on the state-of-the-art Bangla deep learning model.

4/19/2024

Assessing the Level of Toxicity Against Distinct Groups in Bangla Social Media Comments: A Comprehensive Investigation

Mukaffi Bin Moin, Pronay Debnath, Usafa Akther Rifa, Rijeet Bin Anis

Social media platforms have a vital role in the modern world, serving as conduits for communication, the exchange of ideas, and the establishment of networks. However, the misuse of these platforms through toxic comments, which can range from offensive remarks to hate speech, is a concerning issue. This study focuses on identifying toxic comments in the Bengali language targeting three specific groups: transgender people, indigenous people, and migrant people, from multiple social media sources. The study delves into the intricate process of identifying and categorizing toxic language while considering the varying degrees of toxicity: high, medium, and low. The methodology involves creating a dataset, manual annotation, and employing pre-trained transformer models like Bangla-BERT, bangla-bert-base, distil-BERT, and Bert-base-multilingual-cased for classification. Diverse assessment metrics such as accuracy, recall, precision, and F1-score are employed to evaluate the model's effectiveness. The experimental findings reveal that Bangla-BERT surpasses alternative models, achieving an F1-score of 0.8903. This research exposes the complexity of toxicity in Bangla social media dialogues, revealing its differing impacts on diverse demographic groups.

9/26/2024

Exploring Boundaries and Intensities in Offensive and Hate Speech: Unveiling the Complex Spectrum of Social Media Discourse

Abinew Ali Ayele, Esubalew Alemneh Jalew, Adem Chanie Ali, Seid Muhie Yimam, Chris Biemann

The prevalence of digital media and evolving sociopolitical dynamics have significantly amplified the dissemination of hateful content. Existing studies mainly focus on classifying texts into binary categories, often overlooking the continuous spectrum of offensiveness and hatefulness inherent in the text. In this research, we present an extensive benchmark dataset for Amharic, comprising 8,258 tweets annotated for three distinct tasks: category classification, identification of hate targets, and rating offensiveness and hatefulness intensities. Our study highlights that a considerable majority of tweets belong to the less offensive and less hate intensity levels, underscoring the need for early interventions by stakeholders. The prevalence of ethnic and political hatred targets, with significant overlaps in our dataset, emphasizes the complex relationships within Ethiopia's sociopolitical landscape. We build classification and regression models and investigate the efficacy of models in handling these tasks. Our results reveal that hate and offensive speech can not be addressed by a simplistic binary classification, instead manifesting as variables across a continuous range of values. The Afro-XLMR-large model exhibits the best performances achieving F1-scores of 75.30%, 70.59%, and 29.42% for the category, target, and regression tasks, respectively. The 80.22% correlation coefficient of the Afro-XLMR-large model indicates strong alignments.

4/19/2024

The Uli Dataset: An Exercise in Experience Led Annotation of oGBV

Arnav Arora, Maha Jinadoss, Cheshta Arora, Denny George, Brindaalakshmi, Haseena Dawood Khan, Kirti Rawat, Div, Ritash, Seema Mathur, Shivani Yadav, Shehla Rashid Shora, Rie Raut, Sumit Pawar, Apurva Paithane, Sonia, Vivek, Dharini Priscilla, Khairunnisha, Grace Banu, Ambika Tandon, Rishav Thakker, Rahul Dev Korra, Aatman Vaidya, Tarunima Prabhakar

Online gender based violence has grown concomitantly with adoption of the internet and social media. Its effects are worse in the Global majority where many users use social media in languages other than English. The scale and volume of conversations on the internet has necessitated the need for automated detection of hate speech, and more specifically gendered abuse. There is, however, a lack of language specific and contextual data to build such automated tools. In this paper we present a dataset on gendered abuse in three languages- Hindi, Tamil and Indian English. The dataset comprises of tweets annotated along three questions pertaining to the experience of gender abuse, by experts who identify as women or a member of the LGBTQIA community in South Asia. Through this dataset we demonstrate a participatory approach to creating datasets that drive AI systems.

6/26/2024