Evaluating Large Language Models for Anxiety and Depression Classification using Counseling and Psychotherapy Transcripts

Read original: arXiv:2407.13228 - Published 7/19/2024 by Junwei Sun, Siqi Ma, Yiran Fan, Peter Washington

Overview

Researchers evaluated the effectiveness of traditional machine learning and large language models (LLMs) in classifying anxiety and depression from long conversational transcripts.
They fine-tuned established transformer models (BERT, RoBERTa, Longformer) and a recent large model (Mistral-7B), trained a Support Vector Machine with feature engineering, and assessed GPT models through prompting.
The results show that state-of-the-art models did not improve classification outcomes over traditional machine learning methods.

Plain English Explanation

The researchers wanted to see how well different AI models could detect signs of anxiety and depression in lengthy conversation transcripts. They tested an established machine learning method alongside transformer-based language models: the older BERT, RoBERTa, and Longformer, the more recent Mistral-7B, and GPT models used through prompting.

The traditional approach involved training a Support Vector Machine (a common machine learning algorithm) on carefully selected features from the text. The transformer-based approach involved fine-tuning pre-trained models to classify the transcripts for anxiety and depression.

Surprisingly, the researchers found that the state-of-the-art LLMs did not significantly outperform the traditional machine learning method. In other words, the latest and greatest AI models did not provide a clear advantage over more established techniques when it came to detecting mental health issues from conversational data.

This suggests that there may still be room for improvement in applying large language models to mental health assessment tasks. The existing models may be missing key capabilities or insights that the traditional feature-engineering approach was able to capture more effectively.

Technical Explanation

The researchers explored the performance of both traditional machine learning and large language models (LLMs) in classifying anxiety and depression from lengthy conversation transcripts. They fine-tuned several established transformer-based models, including BERT, RoBERTa, and Longformer, as well as the more recent Mistral-7B model.
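One practical obstacle with lengthy transcripts is that BERT and RoBERTa accept at most 512 tokens per input, whereas Longformer was designed for longer sequences. The summary does not say how the authors handled this, so the following is only a minimal sketch of one common workaround: split the transcript into overlapping windows, classify each window, and aggregate the window-level labels. The function names, the `stride` value, and the majority-vote aggregation are assumptions made for illustration, and plain token lists stand in for a real subword tokenizer.

```python
from collections import Counter

def chunk_tokens(tokens, max_len=512, stride=256):
    """Split a token list into overlapping windows of at most max_len tokens.

    Hypothetical helper: the paper's actual preprocessing is not described
    in this summary.
    """
    if max_len <= 0 or stride <= 0:
        raise ValueError("max_len and stride must be positive")
    chunks = []
    start = 0
    while True:
        chunks.append(tokens[start:start + max_len])
        # Stop once the current window reaches the end of the transcript.
        if start + max_len >= len(tokens):
            break
        start += stride
    return chunks

def majority_vote(chunk_labels):
    """Aggregate per-chunk labels into a single transcript-level label."""
    return Counter(chunk_labels).most_common(1)[0][0]
```

In such a setup, each window would be passed to the fine-tuned classifier and `majority_vote` would turn the per-window predictions into one label per transcript. Averaging class probabilities across windows is an equally plausible alternative.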

In parallel, they trained a Support Vector Machine (SVM) model with carefully engineered features from the conversation transcripts. They also explored using GPT models through a prompting approach.
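To make the traditional pipeline concrete, here is a minimal, dependency-free sketch of feature engineering followed by a linear SVM. The summary does not list the features the authors engineered, so the two features below (negative-word rate and first-person pronoun rate), the word lists, and all hyperparameters are hypothetical placeholders. The training loop is plain subgradient descent on the regularized hinge loss, which is one standard way to fit a linear SVM and not necessarily the solver used in the paper.

```python
import re

# Hypothetical word lists, for illustration only; not taken from the paper.
NEGATIVE_WORDS = {"sad", "hopeless", "worried", "tired", "afraid"}
FIRST_PERSON = {"i", "me", "my", "myself"}

def extract_features(transcript):
    """Map a transcript to a small hand-crafted feature vector."""
    words = re.findall(r"[a-z']+", transcript.lower())
    n = max(len(words), 1)  # avoid division by zero on empty input
    return [
        sum(w in NEGATIVE_WORDS for w in words) / n,  # negative-word rate
        sum(w in FIRST_PERSON for w in words) / n,    # first-person rate
        1.0,                                          # constant bias feature
    ]

def train_linear_svm(X, y, epochs=200, lr=0.1, reg=0.01):
    """Fit a linear SVM by subgradient descent on the regularized hinge loss.

    Labels in y must be +1 or -1.
    """
    w = [0.0] * len(X[0])
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            margin = yi * sum(wj * xj for wj, xj in zip(w, xi))
            for j in range(len(w)):
                # The hinge term contributes only when the margin is violated.
                grad = reg * w[j] - (yi * xi[j] if margin < 1 else 0.0)
                w[j] -= lr * grad
    return w

def predict(w, x):
    """Return +1 or -1 depending on the side of the decision boundary."""
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) >= 0 else -1

# Toy, made-up examples: +1 = flagged, -1 = not flagged.
texts = [
    "I feel sad and hopeless and I am tired",
    "The weather was nice and we went hiking today",
    "I am worried and afraid my life is hopeless",
    "We talked about the plan for next week",
]
labels = [1, -1, 1, -1]
X = [extract_features(t) for t in texts]
w = train_linear_svm(X, labels)
```

The appeal of this style of model is that every input dimension is a quantity a human chose and can inspect, which may be part of why it remains competitive on small datasets.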

Contrary to expectations, the state-of-the-art LLMs showed no clear advantage over the traditional machine learning method in classification performance. The SVM with feature engineering achieved results comparable to, if not better than, those of the fine-tuned transformers and the prompted GPT models.
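Comparing classifiers in this way requires a shared metric computed on the same held-out transcripts. This summary does not state which metrics the paper reports, so the sketch below simply assumes the usual precision, recall, and F1 for a binary label; the function name and the label encoding are illustrative.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Compute precision, recall, and F1 for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"precision": precision, "recall": recall, "f1": f1}

# Made-up predictions from a hypothetical model on six transcripts.
scores = binary_metrics([1, 1, 1, 0, 0, 0], [1, 1, 0, 1, 0, 0])
```

Running the same function over each model's predictions gives directly comparable numbers, which is the kind of side-by-side comparison the finding above rests on.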

This suggests that while LLMs have shown impressive capabilities in many natural language processing tasks, there may still be room for improvement when it comes to applying them to mental health assessment from conversational data. The existing models may be missing key insights or capabilities that the traditional feature engineering approach was able to capture more effectively.

Critical Analysis

The researchers acknowledge several limitations to their study, including the relatively small size of the dataset and the potential for bias in the transcript annotations. They also note that the performance of the LLMs may be improved with more extensive fine-tuning or the use of ensemble techniques.

Additionally, the paper does not delve into the potential reasons why the traditional machine learning approach was able to match or outperform the LLMs in this specific task. It would be helpful to understand the underlying factors that contributed to this outcome, as it could provide valuable insights for future research in this area.

Further research could explore more advanced LLM architectures, or draw on related work such as conversational topic recommendation and the assessment of ML classification algorithms, to see whether they better capture the nuances of mental health assessment from conversational data. A more comprehensive review of large language models for mental health could also clarify the strengths and limitations of this approach.

Conclusion

This study provides a thought-provoking comparison of traditional machine learning and state-of-the-art large language models in the task of classifying anxiety and depression from conversational transcripts. The key finding that LLMs did not outperform the traditional approach is a valuable lesson in the continued importance of feature engineering and traditional machine learning techniques, even as LLMs continue to advance.

The results suggest that while LLMs have shown impressive capabilities in many NLP tasks, there is still work to be done in adapting these models to the specific challenges of mental health assessment from conversational data. Ongoing research in this area could lead to important advancements in the use of AI for supporting mental health interventions and improving patient outcomes.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Optimizing Psychological Counseling with Instruction-Tuned Large Language Models

Wenjie Li, Tianyu Sun, Kun Qian, Wenhong Wang

The advent of large language models (LLMs) has significantly advanced various fields, including natural language processing and automated dialogue systems. This paper explores the application of LLMs in psychological counseling, addressing the increasing demand for mental health services. We present a method for instruction tuning LLMs with specialized prompts to enhance their performance in providing empathetic, relevant, and supportive responses. Our approach involves developing a comprehensive dataset of counseling-specific prompts, refining them through feedback from professional counselors, and conducting rigorous evaluations using both automatic metrics and human assessments. The results demonstrate that our instruction-tuned model outperforms several baseline LLMs, highlighting its potential as a scalable and accessible tool for mental health support.

6/21/2024

Integrating Large Language Models into a Tri-Modal Architecture for Automated Depression Classification

Santosh V. Patapati

Major Depressive Disorder (MDD) is a pervasive mental health condition that affects 300 million people worldwide. This work presents a novel, BiLSTM-based tri-modal model-level fusion architecture for the binary classification of depression from clinical interview recordings. The proposed architecture incorporates Mel Frequency Cepstral Coefficients, Facial Action Units, and uses a two-shot learning based GPT-4 model to process text data. This is the first work to incorporate large language models into a multi-modal architecture for this task. It achieves impressive results on the DAIC-WOZ AVEC 2016 Challenge cross-validation split and Leave-One-Subject-Out cross-validation split, surpassing all baseline models and multiple state-of-the-art models. In Leave-One-Subject-Out testing, it achieves an accuracy of 91.01%, an F1-Score of 85.95%, a precision of 80%, and a recall of 92.86%.

8/20/2024

From Classification to Clinical Insights: Towards Analyzing and Reasoning About Mobile and Behavioral Health Data With Large Language Models

Zachary Englhardt, Chengqian Ma, Margaret E. Morris, Xuhai Orson Xu, Chun-Cheng Chang, Lianhui Qin, Daniel McDuff, Xin Liu, Shwetak Patel, Vikram Iyer

Passively collected behavioral health data from ubiquitous sensors holds significant promise to provide mental health professionals insights from patient's daily lives; however, developing analysis tools to use this data in clinical practice requires addressing challenges of generalization across devices and weak or ambiguous correlations between the measured signals and an individual's mental health. To address these challenges, we take a novel approach that leverages large language models (LLMs) to synthesize clinically useful insights from multi-sensor data. We develop chain of thought prompting methods that use LLMs to generate reasoning about how trends in data such as step count and sleep relate to conditions like depression and anxiety. We first demonstrate binary depression classification with LLMs achieving accuracies of 61.1% which exceed the state of the art. While it is not robust for clinical use, this leads us to our key finding: even more impactful and valued than classification is a new human-AI collaboration approach in which clinician experts interactively query these tools and combine their domain expertise and context about the patient with AI generated reasoning to support clinical decision-making. We find models like GPT-4 correctly reference numerical data 75% of the time, and clinician participants express strong interest in using this approach to interpret self-tracking data.

8/27/2024