HateCOT: An Explanation-Enhanced Dataset for Generalizable Offensive Speech Detection via Large Language Models

Read original: arXiv:2403.11456 - Published 6/18/2024 by Huy Nghiem, Hal Daum'e III

HateCOT: An Explanation-Enhanced Dataset for Generalizable Offensive Speech Detection via Large Language Models

Overview

This paper introduces HateCOT, a new dataset for detecting offensive speech that includes explanations from annotators on why they labeled content as offensive or not.
The researchers used this dataset to train large language models to not only detect offensive content, but also generate explanations for their predictions.
The goal is to create models that are more generalizable and transparent in their offensive speech detection, rather than relying on black-box models.

Plain English Explanation

The researchers behind this paper recognized that current approaches to detecting offensive speech online often rely on machine learning models that can be like "black boxes" - they make predictions, but it's not always clear how they arrived at those conclusions. This can make it difficult to understand why certain content was flagged as offensive, or to have confidence that the models will work well across diverse contexts.

To address this, the researchers created a new dataset called HateCOT. When annotators labeled content as offensive or not, they also provided explanations for their decisions. By training large language models on this dataset, the researchers aimed to develop systems that can not only detect offensive speech, but also generate explanations for their predictions.

The key idea is that models that can explain their reasoning will be more transparent and trustworthy. Rather than simply making binary decisions, these models can provide insight into the specific factors they are considering. This could help users better understand content moderation decisions, and also make the models more generalizable to new contexts, rather than being overly specific to the training data.

Overall, the HateCOT dataset and models trained on it represent an effort to make offensive speech detection systems more robust, explainable, and useful for real-world applications.

Technical Explanation

The HateCOT dataset was created by having annotators label online comments as offensive or not, and then provide free-text explanations for their decisions. This resulted in a dataset of over 50,000 annotated comments, along with the corresponding explanations.

The researchers then fine-tuned large language models like BERT and GPT-3 on this HateCOT dataset, training them not just to classify comments as offensive or not, but also to generate explanations for their predictions. This allowed the models to not only make decisions, but also provide insight into their reasoning.

Through experiments, the researchers found that the explanation-enhanced models outperformed standard offensive speech detection models in terms of accuracy, as well as generalization to new datasets. The generated explanations also aligned well with human-provided rationales, suggesting the models were capturing meaningful factors in their decision making.

Importantly, the researchers also examined potential biases in the HateCOT dataset itself, drawing on previous work like Investigating Annotator Bias in Large Language Models for Hate Speech Detection and Exploiting Hatred by Targets: Hate Speech Detection with Victim Robustness. They found that certain demographic and linguistic biases existed, which is an important consideration when deploying these types of systems in the real world.

Critical Analysis

While the HateCOT dataset and explanation-enhanced models represent an important step forward, there are still some limitations and areas for further research. For example, the dataset only covers English-language content, so it's unclear how well the models would generalize to other languages and cultural contexts, as discussed in Exploring Cross-Cultural Differences in English Hate Speech.

Additionally, the researchers note that the generated explanations, while more aligned with human rationales than standard models, still have room for improvement in terms of faithfully capturing the nuanced reasoning behind offensive speech judgments. Approaches like Tox-BART: Leveraging Toxicity Attributes for Explanation Generation could potentially help address this.

Finally, while the researchers made efforts to assess biases in the dataset, it's an ongoing challenge to fully account for and mitigate these issues in real-world content moderation systems, as highlighted in HateTinyLLM: Hate Speech Detection Using Tiny Large Language Models.

Overall, the HateCOT dataset and explanation-enhanced models represent a promising step toward more transparent and generalizable offensive speech detection. However, continued research and careful deployment will be necessary to ensure these systems are fair, effective, and beneficial for online communities.

Conclusion

This paper introduces an important new dataset and modeling approach for detecting and explaining offensive speech online. By training large language models to not only classify content, but also generate explanations for their decisions, the researchers have created a more transparent and generalizable system compared to traditional black-box models.

While there are still some limitations and areas for further research, the HateCOT dataset and explanation-enhanced models demonstrate the value of incorporating explainability into AI-powered content moderation. As these types of systems become more widely deployed, ensuring they are fair, accountable, and beneficial to users will be crucial. This paper represents a significant contribution toward that goal.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Follow @aimodelsfyi on 𝕏 →

Related Papers

HateCOT: An Explanation-Enhanced Dataset for Generalizable Offensive Speech Detection via Large Language Models

Huy Nghiem, Hal Daum'e III

The widespread use of social media necessitates reliable and efficient detection of offensive content to mitigate harmful effects. Although sophisticated models perform well on individual datasets, they often fail to generalize due to varying definitions and labeling of offensive content. In this paper, we introduce HateCOT, an English dataset with over 52,000 samples from diverse sources, featuring explanations generated by GPT-3.5Turbo and curated by humans. We demonstrate that pretraining on HateCOT significantly enhances the performance of open-source Large Language Models on three benchmark datasets for offensive content detection in both zero-shot and few-shot settings, despite differences in domain and task. Additionally, HateCOT facilitates effective K-shot fine-tuning of LLMs with limited data and improves the quality of their explanations, as confirmed by our human evaluation.

6/18/2024

Investigating Annotator Bias in Large Language Models for Hate Speech Detection

Amit Das, Zheng Zhang, Fatemeh Jamshidi, Vinija Jain, Aman Chadha, Nilanjana Raychawdhary, Mary Sandage, Lauramarie Pope, Gerry Dozier, Cheryl Seals

Data annotation, the practice of assigning descriptive labels to raw data, is pivotal in optimizing the performance of machine learning models. However, it is a resource-intensive process susceptible to biases introduced by annotators. The emergence of sophisticated Large Language Models (LLMs), like ChatGPT presents a unique opportunity to modernize and streamline this complex procedure. While existing research extensively evaluates the efficacy of LLMs, as annotators, this paper delves into the biases present in LLMs, specifically GPT 3.5 and GPT 4o when annotating hate speech data. Our research contributes to understanding biases in four key categories: gender, race, religion, and disability. Specifically targeting highly vulnerable groups within these categories, we analyze annotator biases. Furthermore, we conduct a comprehensive examination of potential factors contributing to these biases by scrutinizing the annotated data. We introduce our custom hate speech detection dataset, HateSpeechCorpus, to conduct this research. Additionally, we perform the same experiments on the ETHOS (Mollas et al., 2022) dataset also for comparative analysis. This paper serves as a crucial resource, guiding researchers and practitioners in harnessing the potential of LLMs for dataannotation, thereby fostering advancements in this critical field. The HateSpeechCorpus dataset is available here: https://github.com/AmitDasRup123/HateSpeechCorpus

6/19/2024

Exploiting Hatred by Targets for Hate Speech Detection on Vietnamese Social Media Texts

Cuong Nhat Vo, Khanh Bao Huynh, Son T. Luu, Trong-Hop Do

The growth of social networks makes toxic content spread rapidly. Hate speech detection is a task to help decrease the number of harmful comments. With the diversity in the hate speech created by users, it is necessary to interpret the hate speech besides detecting it. Hence, we propose a methodology to construct a system for targeted hate speech detection from online streaming texts from social media. We first introduce the ViTHSD - a targeted hate speech detection dataset for Vietnamese Social Media Texts. The dataset contains 10K comments, each comment is labeled to specific targets with three levels: clean, offensive, and hate. There are 5 targets in the dataset, and each target is labeled with the corresponding level manually by humans with strict annotation guidelines. The inter-annotator agreement obtained from the dataset is 0.45 by Cohen's Kappa index, which is indicated as a moderate level. Then, we construct a baseline for this task by combining the Bi-GRU-LSTM-CNN with the pre-trained language model to leverage the power of text representation of BERTology. Finally, we suggest a methodology to integrate the baseline model for targeted hate speech detection into the online streaming system for practical application in preventing hateful and offensive content on social media.

5/1/2024

IndoToxic2024: A Demographically-Enriched Dataset of Hate Speech and Toxicity Types for Indonesian Language

Lucky Susanto, Musa Izzanardi Wijanarko, Prasetia Anugrah Pratama, Traci Hong, Ika Idris, Alham Fikri Aji, Derry Wijaya

Hate speech poses a significant threat to social harmony. Over the past two years, Indonesia has seen a ten-fold increase in the online hate speech ratio, underscoring the urgent need for effective detection mechanisms. However, progress is hindered by the limited availability of labeled data for Indonesian texts. The condition is even worse for marginalized minorities, such as Shia, LGBTQ, and other ethnic minorities because hate speech is underreported and less understood by detection tools. Furthermore, the lack of accommodation for subjectivity in current datasets compounds this issue. To address this, we introduce IndoToxic2024, a comprehensive Indonesian hate speech and toxicity classification dataset. Comprising 43,692 entries annotated by 19 diverse individuals, the dataset focuses on texts targeting vulnerable groups in Indonesia, specifically during the hottest political event in the country: the presidential election. We establish baselines for seven binary classification tasks, achieving a macro-F1 score of 0.78 with a BERT model (IndoBERTweet) fine-tuned for hate speech classification. Furthermore, we demonstrate how incorporating demographic information can enhance the zero-shot performance of the large language model, gpt-3.5-turbo. However, we also caution that an overemphasis on demographic information can negatively impact the fine-tuned model performance due to data fragmentation.

6/28/2024