Reddit-Impacts: A Named Entity Recognition Dataset for Analyzing Clinical and Social Effects of Substance Use Derived from Social Media

2405.06145

Published 5/13/2024 by Yao Ge, Sudeshna Das, Karen O'Connor, Mohammed Ali Al-Garadi, Graciela Gonzalez-Hernandez, Abeed Sarker

cs.CL cs.AI cs.LG

Reddit-Impacts: A Named Entity Recognition Dataset for Analyzing Clinical and Social Effects of Substance Use Derived from Social Media

Abstract

Substance use disorders (SUDs) are a growing concern globally, necessitating enhanced understanding of the problem and its trends through data-driven research. Social media are unique and important sources of information about SUDs, particularly since the data in such sources are often generated by people with lived experiences. In this paper, we introduce Reddit-Impacts, a challenging Named Entity Recognition (NER) dataset curated from subreddits dedicated to discussions on prescription and illicit opioids, as well as medications for opioid use disorder. The dataset specifically concentrates on the lesser-studied, yet critically important, aspects of substance use--its clinical and social impacts. We collected data from chosen subreddits using the publicly available Application Programming Interface for Reddit. We manually annotated text spans representing clinical and social impacts reported by people who also reported personal nonmedical use of substances including but not limited to opioids, stimulants and benzodiazepines. Our objective is to create a resource that can enable the development of systems that can automatically detect clinical and social impacts of substance use from text-based social media data. The successful development of such systems may enable us to better understand how nonmedical use of substances affects individual health and societal dynamics, aiding the development of effective public health strategies. In addition to creating the annotated data set, we applied several machine learning models to establish baseline performances. Specifically, we experimented with transformer models like BERT, and RoBERTa, one few-shot learning model DANN by leveraging the full training dataset, and GPT-3.5 by using one-shot learning, for automatic NER of clinical and social impacts. The dataset has been made available through the 2024 SMM4H shared tasks.

Create account to get full access

Overview

This paper introduces the Reddit-Impacts dataset, a Named Entity Recognition (NER) dataset derived from Reddit posts related to substance use and its impacts.
The dataset aims to enable analysis of the clinical and social effects of substance use as reflected in user-generated social media content.
The dataset covers a wide range of substance-related entities, including drug names, routes of administration, effects, and social/interpersonal impacts.
The paper describes the dataset creation process, benchmark NER model performance, and potential applications in understanding substance use patterns and impacts.

Plain English Explanation

The researchers created a dataset called Reddit-Impacts that contains information from Reddit posts about the use of different substances, such as drugs or alcohol, and the effects they have on people's lives. This dataset can be used to study how substance use is discussed on social media and what kinds of impacts it has, both medically and socially.

The dataset covers a lot of different types of substance-related information, like the names of specific drugs, how people are taking them, what kinds of effects the drugs have, and how substance use affects people's relationships and daily lives. The researchers used machine learning techniques to automatically identify and classify all this information in the Reddit posts.

This dataset can be really valuable for researchers and organizations who want to better understand patterns of substance use and the challenges people face. By analyzing the social media conversations captured in the dataset, they can get insights into things like the sentiment around different substances, how people talk about addiction, and the social effects of substance use. This information could inform better prevention and treatment programs.

Technical Explanation

The Reddit-Impacts dataset was created by extracting and annotating substance-related entities from a large corpus of Reddit posts. The researchers used a combination of keyword searches, language model-based classifiers, and manual review to identify relevant posts and annotate a wide range of entities, including drug names, routes of administration, effects, and social/interpersonal impacts.

The dataset consists of over 100,000 annotated entity mentions across 25,000 Reddit posts. The researchers evaluated the performance of various Named Entity Recognition (NER) models on the dataset, including transformer-based models like BERT, and found that they achieved strong results, with F1 scores over 0.80 for many entity types.

The Reddit-Impacts dataset and associated NER models have a number of potential applications. They could be used to enhance suicide risk detection on social media by identifying substance-related risk factors. The dataset could also enable large-scale analysis of substance use patterns and their impacts on social, interpersonal, and clinical outcomes.

Critical Analysis

The Reddit-Impacts dataset represents a valuable resource for understanding the experiences and impacts of substance use as reflected in user-generated social media content. By capturing a wide range of entity types, the dataset enables rich analysis of the complex relationships between substances, their effects, and their broader social and clinical implications.

However, the dataset is limited to a single social media platform (Reddit), and the posts may not be representative of the broader population or substance use experiences. Additionally, the dataset relies on the accuracy of the automated annotation process, which could introduce errors or biases.

Future research could explore ways to expand the dataset, such as incorporating data from other social media platforms or incorporating additional context and metadata. Evaluating the dataset's performance on real-world applications, such as suicide risk detection or substance use monitoring, would also be an important next step.

Overall, the Reddit-Impacts dataset is a promising tool for advancing research on the societal and clinical impacts of substance use, but its limitations and potential biases should be carefully considered when using the data.

Conclusion

The Reddit-Impacts dataset provides a unique and valuable resource for studying the patterns and impacts of substance use as reflected in user-generated social media content. By annotating a wide range of substance-related entities, the dataset enables rich analysis of the complex relationships between substances, their effects, and their broader social and clinical implications.

The dataset and associated NER models have significant potential applications, from enhancing suicide risk detection to enabling large-scale analysis of substance use patterns and their impacts. While the dataset has some limitations, it represents an important step forward in leveraging social media data to improve our understanding of substance use and its effects on individuals and communities.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

🤿

Identifying Self-Disclosures of Use, Misuse and Addiction in Community-based Social Media Posts

Chenghao Yang, Tuhin Chakrabarty, Karli R Hochstatter, Melissa N Slavin, Nabila El-Bassel, Smaranda Muresan

In the last decade, the United States has lost more than 500,000 people from an overdose involving prescription and illicit opioids making it a national public health emergency (USDHHS, 2017). Medical practitioners require robust and timely tools that can effectively identify at-risk patients. Community-based social media platforms such as Reddit allow self-disclosure for users to discuss otherwise sensitive drug-related behaviors. We present a moderate size corpus of 2500 opioid-related posts from various subreddits labeled with six different phases of opioid use: Medical Use, Misuse, Addiction, Recovery, Relapse, Not Using. For every post, we annotate span-level extractive explanations and crucially study their role both in annotation quality and model development. We evaluate several state-of-the-art models in a supervised, few-shot, or zero-shot setting. Experimental results and error analysis show that identifying the phases of opioid use disorder is highly contextual and challenging. However, we find that using explanations during modeling leads to a significant boost in classification accuracy demonstrating their beneficial role in a high-stakes domain such as studying the opioid use disorder continuum.

6/17/2024

cs.CL

Decoding the Narratives: Analyzing Personal Drug Experiences Shared on Reddit

Layla Bouzoubaa, Elham Aghakhani, Max Song, Minh Trinh, Rezvaneh Rezapour

Online communities such as drug-related subreddits serve as safe spaces for people who use drugs (PWUD), fostering discussions on substance use experiences, harm reduction, and addiction recovery. Users' shared narratives on these forums provide insights into the likelihood of developing a substance use disorder (SUD) and recovery potential. Our study aims to develop a multi-level, multi-label classification model to analyze online user-generated texts about substance use experiences. For this purpose, we first introduce a novel taxonomy to assess the nature of posts, including their intended connections (Inquisition or Disclosure), subjects (e.g., Recovery, Dependency), and specific objectives (e.g., Relapse, Quality, Safety). Using various multi-label classification algorithms on a set of annotated data, we show that GPT-4, when prompted with instructions, definitions, and examples, outperformed all other models. We apply this model to label an additional 1,000 posts and analyze the categories of linguistic expression used within posts in each class. Our analysis shows that topics such as Safety, Combination of Substances, and Mental Health see more disclosure, while discussions about physiological Effects focus on harm reduction. Our work enriches the understanding of PWUD's experiences and informs the broader knowledge base on SUD and drug use.

6/19/2024

cs.CL

🗣️

Exploring Social Media Posts for Depression Identification: A Study on Reddit Dataset

Nandigramam Sai Harshit, Nilesh Kumar Sahu, Haroon R. Lone

Depression is one of the most common mental disorders affecting an individual's personal and professional life. In this work, we investigated the possibility of utilizing social media posts to identify depression in individuals. To achieve this goal, we conducted a preliminary study where we extracted and analyzed the top Reddit posts made in 2022 from depression-related forums. The collected data were labeled as depressive and non-depressive using UMLS Metathesaurus. Further, the pre-processed data were fed to classical machine learning models, where we achieved an accuracy of 92.28% in predicting the depressive and non-depressive posts.

5/14/2024

cs.CL cs.SI

iDRAMA-Scored-2024: A Dataset of the Scored Social Media Platform from 2020 to 2023

Jay Patel, Pujan Paudel, Emiliano De Cristofaro, Gianluca Stringhini, Jeremy Blackburn

Online web communities often face bans for violating platform policies, encouraging their migration to alternative platforms. This migration, however, can result in increased toxicity and unforeseen consequences on the new platform. In recent years, researchers have collected data from many alternative platforms, indicating coordinated efforts leading to offline events, conspiracy movements, hate speech propagation, and harassment. Thus, it becomes crucial to characterize and understand these alternative platforms. To advance research in this direction, we collect and release a large-scale dataset from Scored -- an alternative Reddit platform that sheltered banned fringe communities, for example, c/TheDonald (a prominent right-wing community) and c/GreatAwakening (a conspiratorial community). Over four years, we collected approximately 57M posts from Scored, with at least 58 communities identified as migrating from Reddit and over 950 communities created since the platform's inception. Furthermore, we provide sentence embeddings of all posts in our dataset, generated through a state-of-the-art model, to further advance the field in characterizing the discussions within these communities. We aim to provide these resources to facilitate their investigations without the need for extensive data collection and processing efforts.

5/17/2024

cs.SI cs.CY cs.IR