Grounding Toxicity in Real-World Events across Languages

2405.13754

Published 5/24/2024 by Wondimagegnhue Tsegaye Tufa, Ilia Markov, Piek Vossen

Abstract

Social media conversations frequently suffer from toxicity, creating significant issues for users, moderators, and entire communities. Events in the real world, like elections or conflicts, can initiate and escalate toxic behavior online. Our study investigates how real-world events influence the origin and spread of toxicity in online discussions across various languages and regions. We gathered Reddit data comprising 4.5 million comments from 31 thousand posts in six different languages (Dutch, English, German, Arabic, Turkish and Spanish). We target fifteen major social and political world events that occurred between 2020 and 2023. We observe significant variations in toxicity, negative sentiment, and emotion expressions across different events and language communities, showing that toxicity is a complex phenomenon in which many different factors interact and still need to be investigated. We will release the data for further research along with our code.

Overview

Examines how real-world events influence the origin and spread of toxicity in online discussions across various languages and regions
Analyzed 4.5 million comments from 31,000 Reddit posts in six languages (Dutch, English, German, Arabic, Turkish, and Spanish)
Focused on 15 major social and political events that occurred between 2020 and 2023
Observed significant variations in toxicity, negative sentiment, and emotion expressions across different events and language communities

Plain English Explanation

This study investigates how real-world events, such as elections or conflicts, affect the level of toxicity and negativity in online discussions. The researchers gathered a dataset of 4.5 million Reddit comments from about 31,000 posts in six languages. They focused on 15 significant social and political events that happened between 2020 and 2023 to see how these events influenced the tone and emotions expressed in these online conversations.

The results show that there are significant differences in the amount of toxicity, negative sentiment, and emotional expressions across the various events and language communities. This suggests that toxicity is a complex phenomenon where many different factors interact, and more research is still needed to fully understand it. By making the dataset and code available, the researchers hope to enable further studies in this important area.

Technical Explanation

The researchers collected a dataset of 4.5 million comments from 31,000 Reddit posts across six languages: Dutch, English, German, Arabic, Turkish, and Spanish. They targeted 15 major social and political world events, such as elections and conflicts, that occurred between 2020 and 2023.

The team analyzed this data to measure the levels of toxicity, negative sentiment, and emotional expressions (such as anger, fear, or sadness) present in the online discussions surrounding these real-world events. They observed significant variations in these metrics across the different events and language communities, suggesting that toxicity is a complex phenomenon influenced by multiple factors.
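The per-event, per-language comparison described above can be illustrated with a minimal Python sketch. It assumes each comment already carries a toxicity score between 0 and 1 from some classifier. The summary does not say which classifier or threshold the authors used, so the `ScoredComment` schema, the `toxic_share_by_group` helper, the 0.5 threshold, and the toy rows are all hypothetical and are not the paper's method or data.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class ScoredComment:
    """One comment with a precomputed toxicity score (hypothetical schema)."""
    event: str       # label of the real-world event the post relates to
    language: str    # language code of the community, e.g. "nl"
    toxicity: float  # score in [0, 1] from some classifier (assumed)


def toxic_share_by_group(comments, threshold=0.5):
    """Return {(event, language): share of comments scoring >= threshold}."""
    totals = defaultdict(int)
    toxic = defaultdict(int)
    for comment in comments:
        key = (comment.event, comment.language)
        totals[key] += 1
        if comment.toxicity >= threshold:
            toxic[key] += 1
    return {key: toxic[key] / totals[key] for key in totals}


# Toy rows for illustration only; these are not values from the dataset.
sample = [
    ScoredComment("event_a", "en", 0.9),
    ScoredComment("event_a", "en", 0.1),
    ScoredComment("event_a", "nl", 0.2),
    ScoredComment("event_b", "en", 0.7),
]

for (event, language), share in sorted(toxic_share_by_group(sample).items()):
    print(f"{event} / {language}: {share:.2f}")
```

Grouping by the (event, language) pair, rather than by either one alone, is what makes it possible to see whether the same event plays out differently across language communities.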

The researchers state that they will release the dataset and code, which would enable further research into the dynamics of toxicity in social media discussions and how real-world events influence them.

Critical Analysis

The study provides valuable insights into how real-world events relate to the level of toxicity and negativity in online discussions. However, the analysis is limited to a single platform (Reddit) and a selected set of fifteen events. It would be interesting to see whether similar patterns emerge on other social media platforms or in response to a wider range of events.

Additionally, the study does not delve deeply into the underlying causes or triggers of toxic behavior in these discussions. Further research could explore the specific factors, such as political polarization, misinformation, or group dynamics, that contribute to the observed variations in toxicity.

Overall, this study is a valuable contribution to the understanding of how real-world events can shape online discourse and the complex interplay between offline and online dynamics. The planned release of the dataset and code is particularly commendable, as it will enable other researchers to build upon this work and expand our knowledge in this important area.

Conclusion

This study sheds light on the intricate relationship between real-world events and the prevalence of toxicity in online discussions. By analyzing a large dataset of Reddit comments in six languages, the researchers show that toxicity, negative sentiment, and emotion expressions vary significantly across events and language communities. The findings underscore the need for continued research into the factors that contribute to the spread of toxicity online, with the ultimate goal of fostering more constructive and inclusive digital communities.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

The Constant in HATE: Analyzing Toxicity in Reddit across Topics and Languages

Wondimagegnhue Tsegaye Tufa, Ilia Markov, Piek Vossen

Toxic language remains an ongoing challenge on social media platforms, presenting significant issues for users and communities. This paper provides a cross-topic and cross-lingual analysis of toxicity in Reddit conversations. We collect 1.5 million comment threads from 481 communities in six languages: English, German, Spanish, Turkish, Arabic, and Dutch, covering 80 topics such as Culture, Politics, and News. We thoroughly analyze how toxicity spikes within different communities in relation to specific topics. We observe consistent patterns of increased toxicity across languages for certain topics, while also noting significant variations within specific language communities.

4/30/2024

cs.CL

Analyzing Toxicity in Deep Conversations: A Reddit Case Study

Vigneshwaran Shankaran, Rajesh Sharma

Online social media has become increasingly popular in recent years due to its ease of access and ability to connect with others. One of social media's main draws is its anonymity, allowing users to share their thoughts and opinions without fear of judgment or retribution. This anonymity has also made social media prone to harmful content, which requires moderation to ensure responsible and productive use. Several methods using artificial intelligence have been employed to detect harmful content. However, conversation and contextual analysis of hate speech are still understudied. Most promising works only analyze a single text at a time rather than the conversation supporting it. In this work, we employ a tree-based approach to understand how users behave concerning toxicity in public conversation settings. To this end, we collect both the posts and the comment sections of the top 100 posts from 8 Reddit communities that allow profanity, totaling over 1 million responses. We find that toxic comments increase the likelihood of subsequent toxic comments being produced in online conversations. Our analysis also shows that immediate context plays a vital role in shaping a response rather than the original post. We also study the effect of consensual profanity and observe overlapping similarities with non-consensual profanity in terms of user behavior and patterns.

4/12/2024

cs.CL cs.CY cs.SI

U.S. Election Hardens Hate Universe

Akshay Verma, Richard Sear, Neil F. Johnson

Local or national politics can trigger potentially dangerous hate in someone. But with a third of the world's population eligible to vote in elections in 2024 alone, we lack understanding of how individual-level hate multiplies up to hate behavior at the collective global scale. Here we show, based on the most recent U.S. election, that offline events are associated with a rapid adaptation of the global online hate universe that hardens (strengthens) both its network-of-networks structure and the 'flavors' of hate content that it collectively produces. Approximately 50 million potential voters in hate communities are drawn closer to each other and to the broad mainstream of approximately 2 billion others. It triggers new hate content at scale around immigration, ethnicity, and antisemitism that aligns with conspiracy theories about Jewish-led replacement before blending in hate around gender identity/sexual orientation, and religion. Telegram acts as a key hardening agent - yet is overlooked by U.S. Congressional hearings and new E.U. legislation. Because the hate universe has remained robust since 2020, anti-hate messaging surrounding not only upcoming elections but also other events like the war in Gaza, should pivot to blending multiple hate 'flavors' while targeting previously untouched social media structures.

5/2/2024

cs.SI cs.HC

IndoToxic2024: A Demographically-Enriched Dataset of Hate Speech and Toxicity Types for Indonesian Language

Lucky Susanto, Musa Izzanardi Wijanarko, Prasetia Anugrah Pratama, Traci Hong, Ika Idris, Alham Fikri Aji, Derry Wijaya

Hate speech poses a significant threat to social harmony. Over the past two years, Indonesia has seen a ten-fold increase in the online hate speech ratio, underscoring the urgent need for effective detection mechanisms. However, progress is hindered by the limited availability of labeled data for Indonesian texts. The condition is even worse for marginalized minorities, such as Shia, LGBTQ, and other ethnic minorities because hate speech is underreported and less understood by detection tools. Furthermore, the lack of accommodation for subjectivity in current datasets compounds this issue. To address this, we introduce IndoToxic2024, a comprehensive Indonesian hate speech and toxicity classification dataset. Comprising 43,692 entries annotated by 19 diverse individuals, the dataset focuses on texts targeting vulnerable groups in Indonesia, specifically during the hottest political event in the country: the presidential election. We establish baselines for seven binary classification tasks, achieving a macro-F1 score of 0.78 with a BERT model (IndoBERTweet) fine-tuned for hate speech classification. Furthermore, we demonstrate how incorporating demographic information can enhance the zero-shot performance of the large language model, gpt-3.5-turbo. However, we also caution that an overemphasis on demographic information can negatively impact the fine-tuned model performance due to data fragmentation.

6/28/2024

cs.CL cs.AI