A Big Data Analytics System for Predicting Suicidal Ideation in Real-Time Based on Social Media Streaming Data

2404.12394

Published 4/22/2024 by Mohamed A. Allayla, Serkan Ayvaz

Abstract

Online social media platforms have recently become integral to our society and daily routines. Every day, users worldwide spend a couple of hours on such platforms, expressing their sentiments and emotional state and contacting each other. Analyzing such huge amounts of data from these platforms can provide a clear insight into public sentiments and help detect users' mental status. The early identification of these health condition risks may assist in preventing or reducing the incidence of suicidal ideation and potentially save people's lives. Traditional techniques have become ineffective in processing such streams and large-scale datasets. Therefore, this paper proposes a new methodology based on a big data architecture to predict suicidal ideation from social media content. The proposed approach provides a practical analysis of social media data in two phases: batch processing and real-time streaming prediction. The batch dataset was collected from the Reddit forum and used for model building and training, while streaming big data was extracted using the Twitter streaming API and used for real-time prediction. After the raw data was preprocessed, the extracted features were fed to multiple Apache Spark ML classifiers: Naive Bayes (NB), Logistic Regression (LR), LinearSVC, Decision Tree (DT), Random Forest (RF), and Multilayer Perceptron (MLP). We conducted experiments using various feature-extraction techniques with different testing scenarios. The experimental results of the batch processing phase showed that features extracted with (Unigram + Bigram) + CV-IDF, combined with the MLP classifier, provided high performance for classifying suicidal ideation, with an accuracy of 93.47%; this configuration was then applied in the real-time streaming prediction phase.

Overview

This paper presents a big data analytics system for predicting suicidal ideation in real-time based on social media streaming data.
The system aims to detect early warning signs of suicide risk by analyzing user posts on social media platforms.
The researchers developed a multi-stage architecture that combines natural language processing, machine learning, and stream processing technologies.
The goal is to enable timely intervention and support for individuals at risk of suicide.

Plain English Explanation

The paper describes a new system designed to detect signs of suicidal thoughts on social media. The researchers recognized that people often express their struggles and distress on platforms like Reddit and Twitter, the two sources used in the study. By analyzing the text of these online posts, the system can flag content that suggests a user may be at risk of suicide, so that support services could be alerted.

The key idea is to use a combination of natural language processing techniques, machine learning models, and real-time data processing to continuously monitor social media activity. When the system detects warning signs, like mentions of self-harm or hopelessness, its predictions could be routed to mental health professionals or suicide prevention hotlines. This would allow for early intervention, which is crucial for preventing tragic outcomes.

The researchers trained and tested their models on a batch dataset of Reddit posts. The configuration highlighted in the abstract, unigram and bigram features with CV-IDF weighting and an MLP classifier, reached an accuracy of 93.47%. This builds on previous research like the SOS-1K dataset for fine-grained suicide risk classification and efforts to enhance suicide risk assessment using speech-based automated approaches.

Overall, the goal is to leverage the wealth of data on social media to identify individuals who may be struggling with suicidal thoughts and get them the support they need. This kind of proactive, technology-driven approach has significant potential to save lives.

Technical Explanation

The proposed system consists of a multi-stage architecture that combines natural language processing, machine learning, and stream processing technologies to detect suicidal ideation in real-time from social media data.

The first stage involves data collection. A batch dataset collected from Reddit is used to build and train the models, while live posts are pulled through the Twitter streaming API for real-time prediction. This builds on research into detecting financial opportunities from micro-blogging data using stacked models.

Next, the raw text is preprocessed and converted into numeric features. The abstract reports experiments with several feature-extraction techniques, with the combination of unigram and bigram terms and CV-IDF weighting selected for the final model. This is similar to the EmoScan system for automatic screening of depression symptoms in Romanized Sinhala.
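
The feature extraction named in the abstract, (Unigram + Bigram) + CV-IDF, can be sketched in plain Python. This is a minimal illustration rather than the authors' code: it reads CV-IDF as raw term counts weighted by inverse document frequency (an assumption about the abbreviation), uses the smoothed formula log((N + 1) / (df + 1)) that Spark MLlib documents for its IDF estimator, and runs on made-up toy posts. In Spark ML the corresponding transformers would be NGram, CountVectorizer, and IDF, though whether the paper used exactly these is not stated here.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately simple tokenizer)."""
    return text.lower().split()


def unigrams_and_bigrams(tokens):
    """Return the unigram terms followed by space-joined bigram terms."""
    bigrams = [f"{a} {b}" for a, b in zip(tokens, tokens[1:])]
    return tokens + bigrams


def fit_idf(docs_terms):
    """Smoothed IDF per term: log((N + 1) / (df + 1))."""
    n_docs = len(docs_terms)
    doc_freq = Counter()
    for terms in docs_terms:
        doc_freq.update(set(terms))  # count each term once per document
    return {t: math.log((n_docs + 1) / (df + 1)) for t, df in doc_freq.items()}


def transform(terms, idf):
    """Weight raw term counts by IDF; terms unseen during fitting are dropped."""
    counts = Counter(terms)
    return {t: c * idf[t] for t, c in counts.items() if t in idf}


# Hypothetical toy posts, for illustration only.
posts = ["I feel hopeless tonight", "I feel great tonight", "great game tonight"]
docs_terms = [unigrams_and_bigrams(tokenize(p)) for p in posts]
idf = fit_idf(docs_terms)
vectors = [transform(terms, idf) for terms in docs_terms]
```

A term that appears in every post, such as "tonight" in the toy data, receives a weight of zero, while rarer terms and bigrams such as "feel hopeless" keep a positive weight. That down-weighting of common terms is the reason for adding IDF on top of plain counts.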

The extracted features are then fed to Apache Spark ML classifiers trained to detect suicidal ideation. The researchers compared Naive Bayes, Logistic Regression, LinearSVC, Decision Tree, Random Forest, and Multilayer Perceptron (MLP) models across several feature sets and testing scenarios, and report 93.47% accuracy for the MLP. This aligns with research on assessing machine learning classification algorithms and NLP techniques for depression detection.
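
To make the train-then-evaluate step concrete, here is a small sketch of one classifier family the abstract lists, Naive Bayes, written in plain Python with add-one smoothing. It is a stand-in for illustration, not the Spark ML implementation and not the MLP the paper selected; the class name, the toy documents, and the labels (1 for suicidal ideation, 0 otherwise) are all hypothetical.

```python
import math
from collections import Counter


class MultinomialNB:
    """Minimal multinomial Naive Bayes with Laplace (add-one) smoothing."""

    def fit(self, docs, labels):
        self.classes = sorted(set(labels))
        self.vocab = {term for doc in docs for term in doc}
        self.term_counts = {c: Counter() for c in self.classes}
        for doc, label in zip(docs, labels):
            self.term_counts[label].update(doc)
        class_docs = Counter(labels)
        self.log_prior = {c: math.log(class_docs[c] / len(docs)) for c in self.classes}
        self.total_terms = {c: sum(self.term_counts[c].values()) for c in self.classes}
        return self

    def _log_likelihood(self, term, c):
        # Add-one smoothing keeps unseen (term, class) pairs from zeroing the score.
        numerator = self.term_counts[c][term] + 1
        denominator = self.total_terms[c] + len(self.vocab)
        return math.log(numerator / denominator)

    def predict(self, doc):
        scores = {}
        for c in self.classes:
            # Terms outside the training vocabulary are ignored.
            known = [t for t in doc if t in self.vocab]
            scores[c] = self.log_prior[c] + sum(self._log_likelihood(t, c) for t in known)
        return max(scores, key=scores.get)


def accuracy(model, docs, labels):
    """Fraction of documents whose predicted label matches the true label."""
    correct = sum(model.predict(d) == y for d, y in zip(docs, labels))
    return correct / len(labels)


# Hypothetical toy training data, for illustration only.
train_docs = [
    ["feel", "hopeless", "alone"],
    ["want", "to", "disappear"],
    ["great", "game", "tonight"],
    ["love", "this", "song"],
]
train_labels = [1, 1, 0, 0]
model = MultinomialNB().fit(train_docs, train_labels)
```

Accuracy measured on the training data, as `accuracy(model, train_docs, train_labels)` would do here, is only a sanity check. A figure like the reported 93.47% is meaningful only when computed on held-out posts.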

Finally, the system uses stream processing to classify incoming social media data in real time with the model selected in the batch phase, flagging posts when potential suicidal ideation is detected. This allows for timely intervention and support to be provided to at-risk individuals.
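
The real-time phase can be pictured as a loop that takes posts from a stream in small batches and applies an already trained model to each one. The sketch below simulates that in plain Python under loud assumptions: the stream is an in-memory iterator rather than the Twitter streaming API, the micro-batch size is arbitrary, and `keyword_predict` is a hypothetical keyword rule standing in for the trained classifier.

```python
from itertools import islice


def micro_batches(stream, batch_size):
    """Group an iterable of posts into lists of at most batch_size items."""
    iterator = iter(stream)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def score_stream(stream, predict, batch_size=2):
    """Yield (post, label) for each incoming post, one micro-batch at a time."""
    for batch in micro_batches(stream, batch_size):
        for post in batch:
            yield post, predict(post)


# Hypothetical stand-in for a trained classifier: a simple keyword rule.
RISK_TERMS = {"hopeless", "disappear"}


def keyword_predict(post):
    """Return 1 if any risk term appears in the post, else 0."""
    return int(any(token in RISK_TERMS for token in post.lower().split()))


# Hypothetical incoming posts, for illustration only.
incoming = iter([
    "great game tonight",
    "i feel hopeless",
    "love this song",
    "i want to disappear",
    "see you tomorrow",
])
flagged = [post for post, label in score_stream(incoming, keyword_predict) if label == 1]
```

Because `score_stream` is a generator, posts are scored as they arrive rather than after the stream ends, which is the property a real-time prediction phase depends on.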

Critical Analysis

The paper presents a promising approach to addressing the critical issue of suicide prevention. By leveraging the wealth of data available on social media, the proposed system has the potential to identify individuals in distress and connect them with the appropriate support resources.

However, the researchers acknowledge several limitations and areas for further research. For example, the system's accuracy may be affected by the quality and reliability of the social media data, as well as the potential for individuals to hide or disguise their suicidal thoughts online. Additionally, there are ethical considerations around privacy and the use of personal data for this purpose.

Furthermore, the paper does not delve into the long-term sustainability and scalability of the system. Implementing this technology at a large scale would require substantial resources and coordination between various stakeholders, such as mental health professionals, social media companies, and policymakers.

Despite these challenges, the core idea of using advanced analytics and artificial intelligence to enhance suicide prevention efforts is compelling. Continued research and refinement of the system, along with careful consideration of the ethical implications, could lead to significant improvements in how we identify and support individuals at risk of suicide.

Conclusion

This paper presents a novel big data analytics system for predicting suicidal ideation in real-time based on social media streaming data. The system combines natural language processing, machine learning, and stream processing technologies to continuously monitor social media activity and detect early warning signs of suicide risk.

The researchers report an accuracy of 93.47% in the batch processing phase and then apply the selected model to streaming data. By leveraging the vast amount of data available on social media, the system has the potential to identify individuals in distress and trigger timely intervention and support, potentially saving lives.

While there are some limitations and challenges that need to be addressed, the core idea of this research is highly relevant and impactful. As the use of social media continues to grow, the ability to harness this data for mental health applications could have far-reaching consequences for suicide prevention efforts worldwide.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Supervised Learning and Large Language Model Benchmarks on Mental Health Datasets: Cognitive Distortions and Suicidal Risks in Chinese Social Media

Hongzhi Qi, Qing Zhao, Jianqiang Li, Changwei Song, Wei Zhai, Dan Luo, Shuo Liu, Yi Jing Yu, Fan Wang, Huijing Zou, Bing Xiang Yang, Guanghui Fu

On social media, users often express their personal feelings, which may exhibit cognitive distortions or even suicidal tendencies on certain specific topics. Early recognition of these signs is critical for effective psychological intervention. In this paper, we introduce two novel datasets from Chinese social media: SOS-HL-1K for suicidal risk classification and SocialCD-3K for cognitive distortions detection. The SOS-HL-1K dataset contains 1,249 posts, and the SocialCD-3K dataset is a multi-label classification dataset containing 3,407 posts. We propose a comprehensive evaluation using two supervised learning methods and eight large language models (LLMs) on the proposed datasets. From the prompt engineering perspective, we experimented with two types of prompt strategies, including four zero-shot and five few-shot strategies. We also evaluated the performance of the LLMs after fine-tuning on the proposed tasks. The experimental results show that there is still a huge gap between LLMs relying only on prompt engineering and supervised learning. In the suicide classification task, this gap is 6.95% points in F1-score, while in the cognitive distortion task, the gap is even more pronounced, reaching 31.53% points in F1-score. However, after fine-tuning, this difference is significantly reduced. In the suicide and cognitive distortion classification tasks, the gap decreases to 4.31% and 3.14%, respectively. This research highlights the potential of LLMs in psychological contexts, but supervised learning remains necessary for more challenging tasks. All datasets and code are made available.

6/11/2024

cs.CL cs.LG

Enhancing Suicide Risk Detection on Social Media through Semi-Supervised Deep Label Smoothing

Matthew Squires, Xiaohui Tao, Soman Elangovan, U Rajendra Acharya, Raj Gururajan, Haoran Xie, Xujuan Zhou

Suicide is a prominent issue in society. Unfortunately, many people at risk for suicide do not receive the support required. Barriers to people receiving support include social stigma and lack of access to mental health care. With the popularity of social media, people have turned to online forums, such as Reddit, to express their feelings and seek support. This provides the opportunity to support people with the aid of artificial intelligence. Social media posts can be classified, using text classification, to help connect people with professional help. However, these systems fail to account for the inherent uncertainty in classifying mental health conditions. Unlike other areas of healthcare, mental health conditions have no objective measurements of disease, often relying on expert opinion. Thus, when formulating deep learning problems involving mental health, using hard, binary labels does not accurately represent the true nature of the data. In these settings, where human experts may disagree, fuzzy or soft labels may be more appropriate. The current work introduces a novel label smoothing method which we use to capture any uncertainty within the data. We test our approach on a five-label multi-class classification problem. We show that our semi-supervised deep label smoothing method improves classification accuracy above the existing state of the art. Where existing research reports an accuracy of 43% on the Reddit C-SSRS dataset, using empirical experiments to evaluate our novel label smoothing method, we improve upon this existing benchmark to 52%. These improvements in model performance have the potential to better support those experiencing mental distress. Future work should explore the use of probabilistic methods in both natural language processing and quantifying contributions of both epistemic and aleatoric uncertainty in noisy datasets.

5/10/2024

cs.LG

Multi Class Depression Detection Through Tweets using Artificial Intelligence

Muhammad Osama Nusrat, Waseem Shahzad, Saad Ahmed Jamal

Depression is a significant issue nowadays. As per the World Health Organization (WHO), in 2023, over 280 million individuals are grappling with depression. This is a huge number; if not taken seriously, these numbers will increase rapidly. About 4.89 billion individuals are social media users. People express their feelings and emotions on platforms like Twitter, Facebook, Reddit, Instagram, etc. These platforms contain valuable information which can be used for research purposes. Considerable research has been conducted across various social media platforms. However, certain limitations persist in these endeavors. Particularly, previous studies were only focused on detecting depression and the intensity of depression in tweets. Also, there existed inaccuracies in dataset labeling. In this research work, five types of depression (Bipolar, major, psychotic, atypical, and postpartum) were predicted using tweets from the Twitter database based on lexicon labeling. Explainable AI was used to provide reasoning by highlighting the parts of tweets that represent type of depression. Bidirectional Encoder Representations from Transformers (BERT) was used for feature extraction and training. Machine learning and deep learning methodologies were used to train the model. The BERT model presented the most promising results, achieving an overall accuracy of 0.96.

4/23/2024

cs.CL cs.AI

SOS-1K: A Fine-grained Suicide Risk Classification Dataset for Chinese Social Media Analysis

Hongzhi Qi, Hanfei Liu, Jianqiang Li, Qing Zhao, Wei Zhai, Dan Luo, Tian Yu He, Shuo Liu, Bing Xiang Yang, Guanghui Fu

On social media, users frequently express personal emotions, a subset of which may indicate potential suicidal tendencies. The implicit and varied forms of expression in internet language complicate accurate and rapid identification of suicidal intent on social media, thus creating challenges for timely intervention efforts. The development of deep learning models for suicide risk detection is a promising solution, but there is a notable lack of relevant datasets, especially in the Chinese context. To address this gap, this study presents a Chinese social media dataset designed for fine-grained suicide risk classification, focusing on indicators such as expressions of suicide intent, methods of suicide, and urgency of timing. Seven pre-trained models were evaluated in two tasks: high and low suicide risk, and fine-grained suicide risk classification on a level of 0 to 10. In our experiments, deep learning models show good performance in distinguishing between high and low suicide risk, with the best model achieving an F1 score of 88.39%. However, the results for fine-grained suicide risk classification were still unsatisfactory, with a weighted F1 score of 50.89%. To address the issues of data imbalance and limited dataset size, we investigated both traditional and advanced, large language model based data augmentation techniques, demonstrating that data augmentation can enhance model performance by up to 4.65% points in F1-score. Notably, the Chinese MentalBERT model, which was pre-trained on psychological domain data, shows superior performance in both tasks. This study provides valuable insights for automatic identification of suicidal individuals, facilitating timely psychological intervention on social media platforms. The source code and data are publicly available.

4/22/2024

cs.CL