MHS-STMA: Multimodal Hate Speech Detection via Scalable Transformer-Based Multilevel Attention Framework

Read original: arXiv:2409.05136 - Published 9/18/2024 by Anusha Chhabra, Dinesh Kumar Vishwakarma

Overview

Social media has a significant impact on people's lives.
Hate speech on social media is a serious societal issue.
Multimodal data (text and images) is commonly shared on social media.
Previous approaches have focused mainly on unimodal (single-mode) analysis, and multimodal approaches often fail to preserve the unique characteristics of each modality.

Plain English Explanation

Social media plays a major role in many people's daily lives. One of the most pressing problems associated with social media is the prevalence of hateful or abusive content, known as hate speech. This type of content can take different forms, including text and images.

Past research on detecting hate speech has primarily analyzed one type of data at a time, such as just text or just images. Approaches that do combine modalities often fail to preserve the distinctive properties of each data type. To address both gaps, the paper proposes a Scalable Transformer-based Multilevel Attention (STMA) architecture for multimodal hate content detection.

The key idea is to use various attention mechanisms to process both text and images simultaneously, allowing the system to capture the distinctive characteristics of each data modality. This helps to provide a more comprehensive and accurate identification of hateful content on social media platforms.

Technical Explanation

The STMA architecture consists of three main components:

Combined Attention-based Deep Learning Mechanism: This component uses different attention processes to jointly analyze text and images for hate content detection.
Vision Attention-Mechanism Encoder: This module focuses on processing visual data (images) using an attention-based approach.
Caption Attention-Mechanism Encoder: This encoder specializes in handling textual data (captions) using an attention-based mechanism.
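
The summary does not give layer-level details, so the following is only a minimal sketch of how two modality-specific attention encoders and a combined attention step could fit together. Every name here (`attention`, `encode`, `fuse`, `mean_pool`) and the toy feature vectors are illustrative assumptions, not the authors' implementation. The sketch uses plain Python with no learned weights; a real model would use trained projections in a deep learning framework.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def attention(queries, keys, values):
    """Scaled dot-product attention over lists of equal-length vectors."""
    scale = math.sqrt(len(keys[0]))
    dim = len(values[0])
    out = []
    for q in queries:
        weights = softmax([dot(q, k) / scale for k in keys])
        # Each output vector is a weighted average of the value vectors.
        out.append([sum(w * v[i] for w, v in zip(weights, values))
                    for i in range(dim)])
    return out


def encode(tokens):
    """Stand-in for a modality encoder: self-attention within one modality."""
    return attention(tokens, tokens, tokens)


def mean_pool(vectors):
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]


def fuse(image_patches, caption_tokens):
    """Hypothetical combined step: each modality attends to the other."""
    vis = encode(image_patches)      # vision encoder (assumed form)
    cap = encode(caption_tokens)     # caption encoder (assumed form)
    cap_on_vis = attention(cap, vis, vis)   # caption queries over image
    vis_on_cap = attention(vis, cap, cap)   # image queries over caption
    # Concatenate the two pooled views into one joint feature vector.
    return mean_pool(cap_on_vis) + mean_pool(vis_on_cap)


# Toy 3-dimensional features, invented purely for illustration.
image_patches = [[1.0, 0.0, 0.5], [0.2, 0.9, 0.1]]
caption_tokens = [[0.3, 0.3, 0.3], [0.0, 1.0, 0.0], [0.7, 0.1, 0.2]]
fused = fuse(image_patches, caption_tokens)
print(len(fused))
```

In this sketch, each encoder runs separately before fusion, so modality-specific structure is captured before the cross-modal step mixes the two streams. The fused vector would then feed a classifier head, which is omitted here.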

The researchers evaluated the STMA approach on three hate speech datasets: Hateful Memes, MultiOff, and MMHS150K. The results demonstrate that the proposed STMA strategy outperforms baseline methods on all three datasets, indicating its effectiveness in detecting multimodal hate content.
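
The summary does not say which evaluation metrics were reported, so the sketch below assumes a binary hateful/not-hateful labelling and the usual accuracy, precision, recall, and F1 scores. The function name `binary_metrics` and the label lists are hypothetical and invented for illustration; they are not results from the paper.

```python
def binary_metrics(y_true, y_pred):
    """Compute common binary classification metrics (1 = hateful, 0 = not)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)

    accuracy = (tp + tn) / len(pairs)
    # Guard against division by zero when a class is never predicted.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Made-up labels and predictions, for illustration only.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

Comparing a model against a baseline would mean computing these scores for both sets of predictions on the same held-out split of each dataset.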

Critical Analysis

The paper provides a novel and comprehensive solution for addressing the issue of hate speech detection on social media. By leveraging attention mechanisms to process both text and images simultaneously, the STMA approach offers a more holistic way to identify hateful content compared to previous unimodal techniques.

However, the paper does not discuss potential limitations or caveats of the proposed architecture. For example, it would be valuable to understand how the STMA system performs on edge cases, such as subtle or ambiguous hateful content, or how it might handle multilingual or cross-cultural hate speech. Additionally, the paper could have explored the computational efficiency and resource requirements of the STMA model, which are crucial factors for real-world deployment.

Further research could also investigate the generalizability of the STMA approach to other domains beyond social media, such as detecting hate speech in online forums, news articles, or even in multimodal communication in the physical world.

Conclusion

This paper presents an innovative Scalable Transformer-based Multilevel Attention (STMA) architecture for detecting hate speech on social media by jointly processing text and images. The proposed solution outperforms baseline methods on all three evaluated datasets, demonstrating the benefits of a multimodal approach to this important societal issue. While the paper could have addressed certain limitations, it represents a step forward in the field of hate speech detection and has the potential to contribute to safer and more inclusive online communities.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

MHS-STMA: Multimodal Hate Speech Detection via Scalable Transformer-Based Multilevel Attention Framework

Anusha Chhabra, Dinesh Kumar Vishwakarma

Social media has a significant impact on people's lives. Hate speech on social media has emerged as one of society's most serious issues in recent years. Text and pictures are two forms of multimodal data that are distributed within articles. Unimodal analysis has been the primary emphasis of earlier approaches. Additionally, when doing multimodal analysis, researchers neglect to preserve the distinctive qualities associated with each modality. To address these shortcomings, the present article suggests a scalable architecture for multimodal hate content detection called transformer-based multilevel attention (STMA). This architecture consists of three main parts: a combined attention-based deep learning mechanism, a vision attention-mechanism encoder, and a caption attention-mechanism encoder. To identify hate content, each component uses various attention processes and handles multimodal data in a unique way. Several studies employing multiple assessment criteria on three hate speech datasets such as Hateful memes, MultiOff, and MMHS150K, validate the suggested architecture's efficacy. The outcomes demonstrate that on all three datasets, the suggested strategy performs better than the baseline approaches.

9/18/2024

Exploiting Hatred by Targets for Hate Speech Detection on Vietnamese Social Media Texts

Cuong Nhat Vo, Khanh Bao Huynh, Son T. Luu, Trong-Hop Do

The growth of social networks makes toxic content spread rapidly. Hate speech detection is a task to help decrease the number of harmful comments. With the diversity in the hate speech created by users, it is necessary to interpret the hate speech besides detecting it. Hence, we propose a methodology to construct a system for targeted hate speech detection from online streaming texts from social media. We first introduce the ViTHSD - a targeted hate speech detection dataset for Vietnamese Social Media Texts. The dataset contains 10K comments, each comment is labeled to specific targets with three levels: clean, offensive, and hate. There are 5 targets in the dataset, and each target is labeled with the corresponding level manually by humans with strict annotation guidelines. The inter-annotator agreement obtained from the dataset is 0.45 by Cohen's Kappa index, which is indicated as a moderate level. Then, we construct a baseline for this task by combining the Bi-GRU-LSTM-CNN with the pre-trained language model to leverage the power of text representation of BERTology. Finally, we suggest a methodology to integrate the baseline model for targeted hate speech detection into the online streaming system for practical application in preventing hateful and offensive content on social media.

5/1/2024

M2SA: Multimodal and Multilingual Model for Sentiment Analysis of Tweets

Gaurish Thakkar, Sherzod Hakimov, Marko Tadić

In recent years, multimodal natural language processing, aimed at learning from diverse data types, has garnered significant attention. However, there needs to be more clarity when it comes to analysing multimodal tasks in multi-lingual contexts. While prior studies on sentiment analysis of tweets have predominantly focused on the English language, this paper addresses this gap by transforming an existing textual Twitter sentiment dataset into a multimodal format through a straightforward curation process. Our work opens up new avenues for sentiment-related research within the research community. Additionally, we conduct baseline experiments utilising this augmented dataset and report the findings. Notably, our evaluations reveal that when comparing unimodal and multimodal configurations, using a sentiment-tuned large language model as a text encoder performs exceptionally well.

6/13/2024

Multimodal Detection of Bots on X (Twitter) using Transformers

Loukas Ilias, Ioannis Michail Kazelidis, Dimitris Askounis

Although not all bots are malicious, the vast majority of them are responsible for spreading misinformation and manipulating the public opinion about several issues, i.e., elections and many more. Therefore, the early detection of bots is crucial. Although there have been proposed methods for detecting bots in social media, there are still substantial limitations. For instance, existing research initiatives still extract a large number of features and train traditional machine learning algorithms or use GloVe embeddings and train LSTMs. However, feature extraction is a tedious procedure demanding domain expertise. Also, language models based on transformers have been proved to be better than LSTMs. Other approaches create large graphs and train graph neural networks requiring in this way many hours for training and access to computational resources. To tackle these limitations, this is the first study employing only the user description field and images of three channels denoting the type and content of tweets posted by the users. Firstly, we create digital DNA sequences, transform them to 3d images, and apply pretrained models of the vision domain, including EfficientNet, AlexNet, VGG16, etc. Next, we propose a multimodal approach, where we use TwHIN-BERT for getting the textual representation of the user description field and employ VGG16 for acquiring the visual representation for the image modality. We propose three different fusion methods, namely concatenation, gated multimodal unit, and crossmodal attention, for fusing the different modalities and compare their performances. Finally, we present a qualitative analysis of the behavior of our best performing model. Extensive experiments conducted on the Cresci'17 and TwiBot-20 datasets demonstrate valuable advantages of our introduced approaches over state-of-the-art ones.

7/25/2024