A multi-modal approach for identifying schizophrenia using cross-modal attention

2309.15136

Published 4/22/2024 by Gowtham Premananth, Yashish M. Siriwardena, Philip Resnik, Carol Espy-Wilson

Abstract

This study focuses on how different modalities of human communication can be used to distinguish between healthy controls and subjects with schizophrenia who exhibit strong positive symptoms. We developed a multi-modal schizophrenia classification system using audio, video, and text. Facial action units and vocal tract variables were extracted as low-level features from video and audio respectively, which were then used to compute high-level coordination features that served as the inputs to the audio and video modalities. Context-independent text embeddings extracted from transcriptions of speech were used as the input for the text modality. The multi-modal system is developed by fusing a segment-to-session-level classifier for video and audio modalities with a text model based on a Hierarchical Attention Network (HAN) with cross-modal attention. The proposed multi-modal system outperforms the previous state-of-the-art multi-modal system by 8.53% in the weighted average F1 score.
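For readers unfamiliar with the metric, the weighted average F1 score named in the abstract can be sketched as follows. This is a generic illustration of the metric, not the authors' evaluation code. The labels (0 for healthy control, 1 for schizophrenia) and the toy predictions are made up for the example.

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Per-class F1, averaged with weights equal to each class's share of y_true."""
    support = Counter(y_true)
    total = len(y_true)
    score = 0.0
    for cls, n in support.items():
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == cls and p == cls)
        fp = sum(1 for t, p in pairs if t != cls and p == cls)
        fn = sum(1 for t, p in pairs if t == cls and p != cls)
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        score += (n / total) * f1
    return score


# Hypothetical labels: 0 = healthy control, 1 = schizophrenia.
y_true = [0, 0, 0, 1]
y_pred = [0, 0, 1, 1]
score = weighted_f1(y_true, y_pred)
```

Weighting by class support keeps a small class from being ignored, while still reflecting how common each class is in the evaluation set.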

Methodology

Overview

The paper proposes a multi-modal system that distinguishes healthy controls from subjects with schizophrenia who exhibit strong positive symptoms.
The system combines audio, video, and text rather than relying on a single modality.
Cross-modal attention lets the model weight features in one modality by their relevance to another.

Plain English Explanation

The researchers wanted to develop a better way to identify people with schizophrenia, a mental illness that can cause hallucinations, delusions, and difficulty with thinking and behavior. Their premise is that combining several sources of information, namely a person's voice, facial expressions, and a transcript of what they said, gives a more complete picture than any single source.

To do this, they used a cross-modal attention mechanism, which is a way for the computer model to learn which parts of the different types of information are most important for identifying schizophrenia. This allows the model to focus on the key details across the various sources of information, rather than trying to use everything equally.

Technical Explanation

The proposed system fuses a segment-to-session-level classifier for the audio and video modalities with a text model based on a Hierarchical Attention Network (HAN), using cross-modal attention. The attention mechanism learns which features in one modality are most relevant given another, which helps the model classify a subject as a healthy control or as having schizophrenia.
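A minimal sketch of cross-modal attention in general may help make the idea concrete. It uses single-head scaled dot-product attention in NumPy, with text features as queries and audio features as keys and values. The choice of query modality, the dimensions, and the random projection matrices are all assumptions for illustration; the paper's actual layer configuration is not given in this summary.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_modal_attention(query_feats, context_feats, w_q, w_k, w_v):
    """Let each query-modality step attend over all context-modality steps."""
    q = query_feats @ w_q      # (T_query, d)
    k = context_feats @ w_k    # (T_context, d)
    v = context_feats @ w_v    # (T_context, d)
    scores = q @ k.T / np.sqrt(q.shape[-1])
    weights = softmax(scores, axis=-1)  # each row sums to 1
    return weights @ v, weights


rng = np.random.default_rng(0)
text = rng.normal(size=(5, 16))    # hypothetical: 5 text steps, 16-dim
audio = rng.normal(size=(8, 12))   # hypothetical: 8 audio segments, 12-dim
d = 8
w_q = rng.normal(size=(16, d))
w_k = rng.normal(size=(12, d))
w_v = rng.normal(size=(12, d))

attended, weights = cross_modal_attention(text, audio, w_q, w_k, w_v)
```

Each row of `weights` shows how strongly one text step draws on each audio segment. In a trained model the projection matrices would be learned rather than random.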

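According to the abstract, the model first extracts low-level features: facial action units from video and vocal tract variables from audio. From these it computes high-level coordination features, which serve as the inputs to the video and audio branches. The text branch takes context-independent embeddings extracted from transcriptions of the subject's speech. Cross-modal attention then fuses the branches so that the model can focus on the most salient information for schizophrenia classification.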
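The abstract states that high-level coordination features are computed from low-level features (facial action units and vocal tract variables), but this summary does not say how. One plausible way to quantify coordination between channels is to correlate them at several time delays. The sketch below takes that route purely as an assumption; the function name, the delays, and the random input are hypothetical.

```python
import numpy as np


def delayed_correlation_features(signals, delays=(0, 1, 2)):
    """Correlate every pair of channels at each delay.

    signals: array of shape (n_channels, n_frames).
    Returns an array of shape (len(delays), n_channels, n_channels).
    """
    mean = signals.mean(axis=1, keepdims=True)
    std = signals.std(axis=1, keepdims=True) + 1e-8
    z = (signals - mean) / std          # z-score each channel
    n_frames = z.shape[1]
    mats = []
    for d in delays:
        a = z[:, : n_frames - d]        # channels at time t
        b = z[:, d:]                    # channels at time t + d
        mats.append(a @ b.T / (n_frames - d))
    return np.stack(mats)


rng = np.random.default_rng(0)
low_level = rng.normal(size=(4, 200))   # hypothetical: 4 channels, 200 frames
feats = delayed_correlation_features(low_level)
```

At delay 0 the result is the ordinary channel correlation matrix. Larger delays capture whether one channel tends to lead or lag another.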

The abstract reports that the resulting multi-modal system outperforms the previous state-of-the-art multi-modal system by 8.53% in weighted average F1 score.
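The abstract describes a segment-to-session-level classifier for the audio and video modalities. As a rough illustration of going from segment-level scores to one session-level decision, the sketch below simply averages segment probabilities. Mean pooling and the 0.5 threshold are assumptions made here; the paper's classifier may aggregate segments differently, for example with learned layers.

```python
import numpy as np


def session_prediction(segment_probs, threshold=0.5):
    """Average per-segment probabilities into one session-level decision."""
    segment_probs = np.asarray(segment_probs, dtype=float)
    session_prob = float(segment_probs.mean())
    return session_prob, int(session_prob >= threshold)


# Hypothetical per-segment probabilities of the schizophrenia class.
prob, label = session_prediction([0.2, 0.9, 0.7])
```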

Critical Analysis

The paper offers a promising approach to using multi-modal data for schizophrenia classification. Cross-modal attention gives the model a principled way to combine information from different sources, and the reported 8.53% gain in weighted average F1 over the previous state-of-the-art multi-modal system suggests the combination is effective.

However, the paper does not address some potential limitations of the approach. For example, it is unclear how the model would perform with incomplete or noisy data, such as when only some modalities are available or the data quality is poor. Additionally, the paper does not discuss the interpretability of the model's predictions, which is an important consideration for clinical applications.

Further research could explore the generalizability of the approach to other mental health conditions, as well as investigate ways to make the model's decision-making process more transparent and explainable.

Conclusion

This paper presents a multi-modal approach for identifying schizophrenia that combines audio, video, and text using cross-modal attention. The reported 8.53% improvement in weighted average F1 over the previous state-of-the-art multi-modal system indicates that drawing on several modalities can improve classification. While the approach shows promise, further research is needed to address its limitations and explore its broader applicability.

Related Papers

A Multimodal Framework for the Assessment of the Schizophrenia Spectrum

Gowtham Premananth, Yashish M. Siriwardena, Philip Resnik, Sonia Bansal, Deanna L. Kelly, Carol Espy-Wilson

This paper presents a novel multimodal framework to distinguish between different symptom classes of subjects in the schizophrenia spectrum and healthy controls using audio, video, and text modalities. We implemented Convolution Neural Network and Long Short Term Memory based unimodal models and experimented on various multimodal fusion approaches to come up with the proposed framework. We utilized a minimal Gated multimodal unit (mGMU) to obtain a bi-modal intermediate fusion of the features extracted from the input modalities before finally fusing the outputs of the bimodal fusions to perform subject-wise classifications. The use of mGMU units in the multimodal framework improved the performance in both weighted f1-score and weighted AUC-ROC scores.

6/17/2024

eess.AS

Audio-Visual Person Verification based on Recursive Fusion of Joint Cross-Attention

R. Gnana Praveen, Jahangir Alam

Person or identity verification has been recently gaining a lot of attention using audio-visual fusion as faces and voices share close associations with each other. Conventional approaches based on audio-visual fusion rely on score-level or early feature-level fusion techniques. Though existing approaches showed improvement over unimodal systems, the potential of audio-visual fusion for person verification is not fully exploited. In this paper, we have investigated the prospect of effectively capturing both the intra- and inter-modal relationships across audio and visual modalities, which can play a crucial role in significantly improving the fusion performance over unimodal systems. In particular, we introduce a recursive fusion of a joint cross-attentional model, where a joint audio-visual feature representation is employed in the cross-attention framework in a recursive fashion to progressively refine the feature representations that can efficiently capture the intra-and inter-modal relationships. To further enhance the audio-visual feature representations, we have also explored BLSTMs to improve the temporal modeling of audio-visual feature representations. Extensive experiments are conducted on the Voxceleb1 dataset to evaluate the proposed model. Results indicate that the proposed model shows promising improvement in fusion performance by adeptly capturing the intra-and inter-modal relationships across audio and visual modalities.

4/29/2024

cs.CV cs.SD eess.AS

Missingness-resilient Video-enhanced Multimodal Disfluency Detection

Payal Mohapatra, Shamika Likhite, Subrata Biswas, Bashima Islam, Qi Zhu

Most existing speech disfluency detection techniques only rely upon acoustic data. In this work, we present a practical multimodal disfluency detection approach that leverages available video data together with audio. We curate an audiovisual dataset and propose a novel fusion technique with unified weight-sharing modality-agnostic encoders to learn the temporal and semantic context. Our resilient design accommodates real-world scenarios where the video modality may sometimes be missing during inference. We also present alternative fusion strategies when both modalities are assured to be complete. In experiments across five disfluency-detection tasks, our unified multimodal approach significantly outperforms Audio-only unimodal methods, yielding an average absolute improvement of 10% (i.e., 10 percentage point increase) when both video and audio modalities are always available, and 7% even when video modality is missing in half of the samples.

6/12/2024

cs.CL cs.MM cs.SD eess.AS

Separate in the Speech Chain: Cross-Modal Conditional Audio-Visual Target Speech Extraction

Zhaoxi Mu, Xinyu Yang

The integration of visual cues has revitalized the performance of the target speech extraction task, elevating it to the forefront of the field. Nevertheless, this multi-modal learning paradigm often encounters the challenge of modality imbalance. In audio-visual target speech extraction tasks, the audio modality tends to dominate, potentially overshadowing the importance of visual guidance. To tackle this issue, we propose AVSepChain, drawing inspiration from the speech chain concept. Our approach partitions the audio-visual target speech extraction task into two stages: speech perception and speech production. In the speech perception stage, audio serves as the dominant modality, while visual information acts as the conditional modality. Conversely, in the speech production stage, the roles are reversed. This transformation of modality status aims to alleviate the problem of modality imbalance. Additionally, we introduce a contrastive semantic matching loss to ensure that the semantic information conveyed by the generated speech aligns with the semantic information conveyed by lip movements during the speech production stage. Through extensive experiments conducted on multiple benchmark datasets for audio-visual target speech extraction, we showcase the superior performance achieved by our proposed method.

5/7/2024

cs.SD cs.CV cs.LG cs.MM eess.AS