A Novel Audio-Visual Information Fusion System for Mental Disorders Detection

Read original: arXiv:2409.02243 - Published 9/5/2024 by Yichun Li, Shuanglin Li, Syed Mohsen Naqvi

Overview

This paper presents a novel audio-visual information fusion system for detecting mental disorders like depression and ADHD.
The system combines audio and visual data from patients to improve the accuracy of mental disorder diagnosis.
The authors evaluated their system on a real multimodal ADHD dataset and on the AVEC 2014 depression dataset.

Plain English Explanation

The researchers developed a new way to detect mental health issues like depression and ADHD. Their system combines audio and visual information from patients to make a more accurate diagnosis.

Typically, doctors rely on just one type of data, like what a patient says or how they behave. But the researchers thought that using both audio (speech patterns) and visual (facial expressions, body language) cues could provide a more complete picture. This multimodal approach has been used before to identify other mental health conditions like schizophrenia.

The researchers tested their system on existing mental health datasets. They found that combining audio and visual information led to better detection of depression and ADHD compared to using just one type of data. This aligns with other research showing the benefits of multimodal methods for detecting depression.

Overall, this new system has the potential to help healthcare providers make more accurate diagnoses and provide better treatment for patients with mental health issues.

Technical Explanation

The researchers developed a multimodal fusion system that leverages both audio and visual information to detect mental disorders. The system consists of two main components:

Audio Feature Extraction: The researchers used various signal processing techniques to extract relevant acoustic features from patient speech data, such as pitch, energy, and voice quality measures.
Visual Feature Extraction: They applied computer vision methods to extract visual features from patient video data, including facial expressions, head pose, and body movements.
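The summary does not say which signal processing techniques were used, so the following is only a minimal sketch of two of the acoustic features named above (energy and pitch), not the authors' pipeline. It assumes a synthetic 200 Hz tone in place of real speech, and the function names `short_time_energy` and `autocorr_pitch` are hypothetical, chosen for illustration.

```python
import math

def short_time_energy(frame):
    """Mean squared amplitude of one frame of audio samples."""
    return sum(x * x for x in frame) / len(frame)

def autocorr_pitch(frame, sample_rate, fmin=50.0, fmax=400.0):
    """Estimate fundamental frequency (Hz) as the lag with the largest autocorrelation."""
    min_lag = int(sample_rate / fmax)  # shortest period considered
    max_lag = int(sample_rate / fmin)  # longest period considered
    best_lag, best_score = min_lag, float("-inf")
    for lag in range(min_lag, max_lag + 1):
        # Unnormalized autocorrelation of the frame with itself shifted by `lag`
        score = sum(frame[i] * frame[i + lag] for i in range(len(frame) - lag))
        if score > best_score:
            best_lag, best_score = lag, score
    return sample_rate / best_lag

# Assumption: a synthetic sine wave stands in for a voiced speech frame.
sample_rate = 8000
tone_hz = 200.0
frame = [0.5 * math.sin(2 * math.pi * tone_hz * n / sample_rate) for n in range(800)]

energy = short_time_energy(frame)
pitch = autocorr_pitch(frame, sample_rate)
print(f"energy={energy:.3f}, pitch={pitch:.1f} Hz")
```

In practice such features would be computed per frame over a whole recording, and the resulting sequences passed on to the model.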

The extracted audio and visual features were then fed into a multimodal fusion model that learned to combine the complementary information from both modalities. The researchers experimented with different fusion strategies, such as early fusion (concatenating features) and late fusion (ensemble of unimodal models).
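The difference between the two fusion strategies mentioned above can be sketched as follows. This is an illustrative toy, assuming simple linear classifiers with hand-picked weights. The abstract describes the actual system as based on spatial-temporal attention networks, which this sketch does not attempt to reproduce. All feature values, weights, and function names here are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def linear_classifier(features, weights, bias):
    """Probability of the positive class from a linear model."""
    return sigmoid(sum(f * w for f, w in zip(features, weights)) + bias)

def early_fusion(audio_feats, visual_feats, weights, bias):
    """Concatenate the features of both modalities, then classify once."""
    fused = list(audio_feats) + list(visual_feats)
    return linear_classifier(fused, weights, bias)

def late_fusion(audio_feats, visual_feats, audio_model, visual_model):
    """Classify each modality separately, then average the two probabilities."""
    p_audio = linear_classifier(audio_feats, *audio_model)
    p_visual = linear_classifier(visual_feats, *visual_model)
    return 0.5 * (p_audio + p_visual)

# Hypothetical inputs and weights (illustrative only, not learned from data).
audio_feats = [0.2, -1.0]
visual_feats = [0.5, 0.3, -0.4]
audio_model = ([1.0, -0.5], 0.0)        # (weights, bias)
visual_model = ([0.8, 0.2, -1.0], 0.1)  # (weights, bias)
joint_weights = audio_model[0] + visual_model[0]

p_early = early_fusion(audio_feats, visual_feats, joint_weights, 0.1)
p_late = late_fusion(audio_feats, visual_feats, audio_model, visual_model)
print(f"early fusion: {p_early:.3f}, late fusion: {p_late:.3f}")
```

Early fusion lets a single model learn interactions between audio and visual features, while late fusion keeps the unimodal models independent and can still produce a prediction when one modality is missing.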

The multimodal fusion system was evaluated on a real multimodal ADHD dataset and on the AVEC 2014 depression dataset. According to the abstract, it achieves over 80% accuracy on the ADHD dataset and state-of-the-art results on AVEC 2014. The combined audio-visual approach outperformed unimodal systems that used only audio or only visual data, demonstrating the benefit of fusing information from both modalities.

Critical Analysis

The researchers provide a comprehensive evaluation of their multimodal fusion system, testing it on multiple mental health datasets. However, the paper does not discuss potential limitations or caveats of the approach.

For example, the system relies on patients providing both audio and video data, which may not always be feasible in real-world clinical settings. There could also be privacy concerns around collecting and analyzing such sensitive personal data.

Additionally, the paper does not address the interpretability of the multimodal fusion model. It would be valuable to understand how the system combines audio and visual features to arrive at a diagnosis, as this could help clinicians trust and better apply the technology.

Further research could also explore the generalizability of the approach to a wider range of mental health conditions, as the current focus is primarily on depression and ADHD.

Conclusion

This paper presents a novel audio-visual information fusion system for detecting mental disorders, such as depression and ADHD. The system's ability to combine complementary audio and visual cues leads to improved diagnostic accuracy compared to using a single modality.

The researchers demonstrated the effectiveness of their approach through experiments on ADHD and depression datasets. While the paper does not address certain limitations, the proposed multimodal fusion system shows promise for improving mental health diagnosis and treatment.

Related Papers

A Novel Audio-Visual Information Fusion System for Mental Disorders Detection

Yichun Li, Shuanglin Li, Syed Mohsen Naqvi

Mental disorders are among the foremost contributors to the global healthcare challenge. Research indicates that timely diagnosis and intervention are vital in treating various mental disorders. However, the early somatization symptoms of certain mental disorders may not be immediately evident, often resulting in their oversight and misdiagnosis. Additionally, traditional diagnosis methods are time-consuming and costly. Deep learning methods based on fMRI and EEG have improved the efficiency of the mental disorder detection process. However, the cost of the equipment and trained staff is generally high. Moreover, most systems are only trained for a specific mental disorder and are not general-purpose. Recently, physiological studies have shown that there are some speech and facial-related symptoms in a few mental disorders (e.g., depression and ADHD). In this paper, we focus on the emotional expression features of mental disorders and introduce a multimodal mental disorder diagnosis system based on audio-visual information input. Our proposed system is based on spatial-temporal attention networks and innovatively uses a less computationally intensive pre-trained audio recognition network to fine-tune the video recognition module for better results. We also apply the unified system to multiple mental disorders (ADHD and depression) for the first time. The proposed system achieves over 80% accuracy on the real multimodal ADHD dataset and achieves state-of-the-art results on the depression dataset AVEC 2014.

9/5/2024

Depression Detection and Analysis using Large Language Models on Textual and Audio-Visual Modalities

Avinash Anand, Chayan Tank, Sarthak Pol, Vinayak Katoch, Shaina Mehta, Rajiv Ratn Shah

Depression has proven to be a significant public health issue, profoundly affecting the psychological well-being of individuals. If it remains undiagnosed, depression can lead to severe health issues, which can manifest physically and even lead to suicide. Generally, diagnosing depression or any other mental disorder involves clinicians and mental health professionals conducting semi-structured interviews alongside supplementary questionnaires, including variants of the Patient Health Questionnaire (PHQ). This approach places significant reliance on the experience and judgment of trained physicians, making the diagnosis susceptible to personal biases. Given that the underlying mechanisms causing depression are still being actively researched, physicians often face challenges in diagnosing and treating the condition, particularly in its early stages of clinical presentation. Recently, significant strides have been made in artificial neural computing to solve problems involving text, image, and speech in various domains. Our analysis has aimed to leverage these state-of-the-art (SOTA) models in our experiments to achieve optimal outcomes using multiple modalities. The experiments were performed on the Extended Distress Analysis Interview Corpus Wizard of Oz (E-DAIC) dataset presented in the Audio/Visual Emotion Challenge (AVEC) 2019. The proposed solutions demonstrate better results achieved by proprietary and open-source Large Language Models (LLMs), which achieved a Root Mean Square Error (RMSE) score of 3.98 on the textual modality, beating the AVEC 2019 challenge baseline results and current SOTA regression analysis architectures. Additionally, the proposed solution achieved an accuracy of 71.43% in the classification task. The paper also includes a novel audio-visual multi-modal network that predicts PHQ-8 scores with an RMSE of 6.51.

7/9/2024

Mental-Perceiver: Audio-Textual Multimodal Learning for Mental Health Assessment

Jinghui Qin, Changsong Liu, Tianchi Tang, Dahuang Liu, Minghao Wang, Qianying Huang, Yang Xu, Rumin Zhang

Mental disorders, such as anxiety and depression, have become a global issue that affects the regular lives of people across different ages. Without proper detection and treatment, anxiety and depression can hinder the sufferer's study, work, and daily life. Fortunately, recent advancements of digital and AI technologies provide new opportunities for better mental health care and many efforts have been made in developing automatic anxiety and depression assessment techniques. However, this field still lacks a publicly available large-scale dataset that can facilitate the development and evaluation of AI-based techniques. To address this limitation, we have constructed a new large-scale Multi-Modal Psychological assessment corpus (MMPsy) on anxiety and depression assessment of Mandarin-speaking adolescents. The MMPsy contains audios and extracted transcripts of responses from automated anxiety or depression assessment interviews along with the self-reported anxiety or depression evaluations of the participants using standard mental health assessment questionnaires. Our dataset contains over 7,700 post-processed recordings of interviews for anxiety assessment and over 4,200 recordings for depression assessment. Using this dataset, we have developed a novel deep-learning based mental disorder estimation model, named Mental-Perceiver, to detect anxious/depressive mental states from recorded audio and transcript data. Extensive experiments on our MMPsy and the commonly-used DAIC-WOZ datasets have shown the effectiveness and superiority of our proposed Mental-Perceiver model in anxiety and depression detection. The MMPsy dataset will be made publicly available later to facilitate the research and development of AI-based techniques in the mental health care field.

8/23/2024

A multi-modal approach for identifying schizophrenia using cross-modal attention

Gowtham Premananth, Yashish M. Siriwardena, Philip Resnik, Carol Espy-Wilson

This study focuses on how different modalities of human communication can be used to distinguish between healthy controls and subjects with schizophrenia who exhibit strong positive symptoms. We developed a multi-modal schizophrenia classification system using audio, video, and text. Facial action units and vocal tract variables were extracted as low-level features from video and audio respectively, which were then used to compute high-level coordination features that served as the inputs to the audio and video modalities. Context-independent text embeddings extracted from transcriptions of speech were used as the input for the text modality. The multi-modal system is developed by fusing a segment-to-session-level classifier for video and audio modalities with a text model based on a Hierarchical Attention Network (HAN) with cross-modal attention. The proposed multi-modal system outperforms the previous state-of-the-art multi-modal system by 8.53% in the weighted average F1 score.

4/22/2024