Large Language Models Perform on Par with Experts Identifying Mental Health Factors in Adolescent Online Forums

2404.16461

Published 4/29/2024 by Isabelle Lorge, Dan W. Joyce, Andrey Kormilitzin

💬

Abstract

Mental health in children and adolescents has been steadily deteriorating over the past few years. The recent advent of Large Language Models (LLMs) offers much hope for cost and time efficient scaling of monitoring and intervention, yet despite specifically prevalent issues such as school bullying and eating disorders, previous studies on have not investigated performance in this domain or for open information extraction where the set of answers is not predetermined. We create a new dataset of Reddit posts from adolescents aged 12-19 annotated by expert psychiatrists for the following categories: TRAUMA, PRECARITY, CONDITION, SYMPTOMS, SUICIDALITY and TREATMENT and compare expert labels to annotations from two top performing LLMs (GPT3.5 and GPT4). In addition, we create two synthetic datasets to assess whether LLMs perform better when annotating data as they generate it. We find GPT4 to be on par with human inter-annotator agreement and performance on synthetic data to be substantially higher, however we find the model still occasionally errs on issues of negation and factuality and higher performance on synthetic data is driven by greater complexity of real data rather than inherent advantage.

Create account to get full access

Overview

Mental health issues among children and adolescents have been on the rise in recent years.
Large Language Models (LLMs) offer potential for cost-effective and scalable monitoring and intervention.
Previous studies have not investigated LLM performance in this domain, particularly for open-ended information extraction.
The researchers created a new dataset of Reddit posts from adolescents aged 12-19, annotated by expert psychiatrists for categories like trauma, precarity, condition, symptoms, suicidality, and treatment.
They compared expert labels to annotations from two top-performing LLMs (GPT-3.5 and GPT-4) and also created synthetic datasets to assess LLM performance on generated data.

Plain English Explanation

Mental health problems in young people have been increasing over the past few years. New large language models (LLMs) like GPT-3 and GPT-4 could potentially help monitor and address these issues more efficiently and at a larger scale than humans alone.

However, previous research has not looked at how well these LLMs perform when it comes to understanding and categorizing mental health-related content, especially in open-ended situations where the possible responses are not pre-defined.

In this study, the researchers created a new dataset of Reddit posts made by adolescents aged 12-19. Expert psychiatrists labeled these posts for different mental health-related categories, like trauma, symptoms of disorders, suicidal thoughts, and treatment. The researchers then compared how well the LLMs GPT-3.5 and GPT-4 were able to match the expert labels.

They also created some synthetic (artificial) datasets to see if the LLMs performed better when the content was simpler and more straightforward, compared to the more complex real-world Reddit posts.

Technical Explanation

The researchers created a new dataset of Reddit posts from adolescents aged 12-19, which was annotated by expert psychiatrists for the following categories: TRAUMA, PRECARITY, CONDITION, SYMPTOMS, SUICIDALITY, and TREATMENT. They then compared the expert labels to annotations made by two top-performing LLMs: GPT-3.5 and GPT-4.

In addition, the researchers created two synthetic datasets to assess whether LLMs perform better when annotating data they generate themselves, as opposed to real-world social media posts. This was done to understand if the complexity of the real-world data was the primary challenge, or if there were inherent limitations in the LLMs' ability to handle certain concepts like negation and factuality.

The results showed that GPT-4 performed on par with human inter-annotator agreement on the real-world Reddit dataset. The performance on the synthetic datasets was substantially higher, suggesting that the complexity of the real-world data was a key factor in the LLMs' performance. However, the researchers found that the models still occasionally made mistakes in understanding negation and factuality, even on the simpler synthetic data.

Critical Analysis

While the findings suggest that LLMs like GPT-4 can achieve human-level performance on mental health-related text annotation tasks, the researchers acknowledge several limitations and areas for further research:

The dataset, while novel, is still relatively small, and may not capture the full diversity of mental health experiences among adolescents. Expanding the dataset size and scope could help validate the findings.
The synthetic datasets, while useful for isolating the impact of data complexity, may not fully reflect the nuances and contextual challenges of real-world social media posts. Further research is needed to understand the limitations of LLMs in handling such natural language data.
The study focused on a limited set of mental health-related categories. Expanding the scope to cover a wider range of issues, such as school bullying and eating disorders, could provide a more comprehensive evaluation of LLM capabilities in this domain.
The study did not explore the potential biases or ethical considerations of using LLMs for mental health monitoring and intervention. Ongoing research in this area should be considered to ensure the responsible development and deployment of such systems.

Conclusion

This study represents an important step in understanding the potential of Large Language Models to assist in addressing the growing mental health challenges faced by children and adolescents. The findings suggest that LLMs can achieve human-level performance on certain mental health-related text annotation tasks, but also highlight the need for further research to address the complexities and ethical considerations of using these models in real-world settings. As the use of AI systems in mental health care continues to evolve, it will be crucial to carefully evaluate their capabilities and limitations to ensure they are deployed in a safe and responsible manner.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Unveiling and Mitigating Bias in Mental Health Analysis with Large Language Models

Yuqing Wang, Yun Zhao, Sara Alessandra Keller, Anne de Hond, Marieke M. van Buchem, Malvika Pillai, Tina Hernandez-Boussard

The advancement of large language models (LLMs) has demonstrated strong capabilities across various applications, including mental health analysis. However, existing studies have focused on predictive performance, leaving the critical issue of fairness underexplored, posing significant risks to vulnerable populations. Despite acknowledging potential biases, previous works have lacked thorough investigations into these biases and their impacts. To address this gap, we systematically evaluate biases across seven social factors (e.g., gender, age, religion) using ten LLMs with different prompting methods on eight diverse mental health datasets. Our results show that GPT-4 achieves the best overall balance in performance and fairness among LLMs, although it still lags behind domain-specific models like MentalRoBERTa in some cases. Additionally, our tailored fairness-aware prompts can effectively mitigate bias in mental health predictions, highlighting the great potential for fair analysis in this field.

6/21/2024

cs.CL

Can AI Relate: Testing Large Language Model Response for Mental Health Support

Saadia Gabriel, Isha Puri, Xuhai Xu, Matteo Malgaroli, Marzyeh Ghassemi

Large language models (LLMs) are already being piloted for clinical use in hospital systems like NYU Langone, Dana-Farber and the NHS. A proposed deployment use case is psychotherapy, where a LLM-powered chatbot can treat a patient undergoing a mental health crisis. Deployment of LLMs for mental health response could hypothetically broaden access to psychotherapy and provide new possibilities for personalizing care. However, recent high-profile failures, like damaging dieting advice offered by the Tessa chatbot to patients with eating disorders, have led to doubt about their reliability in high-stakes and safety-critical settings. In this work, we develop an evaluation framework for determining whether LLM response is a viable and ethical path forward for the automation of mental health treatment. Using human evaluation with trained clinicians and automatic quality-of-care metrics grounded in psychology research, we compare the responses provided by peer-to-peer responders to those provided by a state-of-the-art LLM. We show that LLMs like GPT-4 use implicit and explicit cues to infer patient demographics like race. We then show that there are statistically significant discrepancies between patient subgroups: Responses to Black posters consistently have lower empathy than for any other demographic group (2%-13% lower than the control group). Promisingly, we do find that the manner in which responses are generated significantly impacts the quality of the response. We conclude by proposing safety guidelines for the potential deployment of LLMs for mental health response.

5/21/2024

cs.CL

💬

Large Language Model for Mental Health: A Systematic Review

Zhijun Guo, Alvina Lai, Johan Hilge Thygesen, Joseph Farrington, Thomas Keen, Kezhi Li

Large language models (LLMs) have attracted significant attention for potential applications in digital health, while their application in mental health is subject to ongoing debate. This systematic review aims to evaluate the usage of LLMs in mental health, focusing on their strengths and limitations in early screening, digital interventions, and clinical applications. Adhering to PRISMA guidelines, we searched PubMed, IEEE Xplore, Scopus, and the JMIR using keywords: 'mental health OR mental illness OR mental disorder OR psychiatry' AND 'large language models'. We included articles published between January 1, 2017, and December 31, 2023, excluding non-English articles. 30 articles were evaluated, which included research on mental illness and suicidal ideation detection through text (n=12), usage of LLMs for mental health conversational agents (CAs) (n=5), and other applications and evaluations of LLMs in mental health (n=13). LLMs exhibit substantial effectiveness in detecting mental health issues and providing accessible, de-stigmatized eHealth services. However, the current risks associated with the clinical use might surpass their benefits. The study identifies several significant issues: the lack of multilingual datasets annotated by experts, concerns about the accuracy and reliability of the content generated, challenges in interpretability due to the 'black box' nature of LLMs, and persistent ethical dilemmas. These include the lack of a clear ethical framework, concerns about data privacy, and the potential for over-reliance on LLMs by both therapists and patients, which could compromise traditional medical practice. Despite these issues, the rapid development of LLMs underscores their potential as new clinical aids, emphasizing the need for continued research and development in this area.

5/31/2024

cs.CY cs.AI cs.CL

💬

Utilizing Large Language Models to Generate Synthetic Data to Increase the Performance of BERT-Based Neural Networks

Chancellor R. Woolsey, Prakash Bisht, Joshua Rothman, Gondy Leroy

An important issue impacting healthcare is a lack of available experts. Machine learning (ML) models could resolve this by aiding in diagnosing patients. However, creating datasets large enough to train these models is expensive. We evaluated large language models (LLMs) for data creation. Using Autism Spectrum Disorders (ASD), we prompted ChatGPT and GPT-Premium to generate 4,200 synthetic observations to augment existing medical data. Our goal is to label behaviors corresponding to autism criteria and improve model accuracy with synthetic training data. We used a BERT classifier pre-trained on biomedical literature to assess differences in performance between models. A random sample (N=140) from the LLM-generated data was evaluated by a clinician and found to contain 83% correct example-label pairs. Augmenting data increased recall by 13% but decreased precision by 16%, correlating with higher quality and lower accuracy across pairs. Future work will analyze how different synthetic data traits affect ML outcomes.

5/14/2024

cs.CL cs.AI