Towards objective and interpretable speech disorder assessment: a comparative analysis of CNN and transformer-based models

2406.07576

Published 6/13/2024 by Malo Maisonneuve, Corinne Fredouille, Muriel Lalain, Alain Ghio, Virginie Woisard

Towards objective and interpretable speech disorder assessment: a comparative analysis of CNN and transformer-based models

Abstract

Head and Neck Cancers (HNC) significantly impact patients' ability to speak, affecting their quality of life. Commonly used metrics for assessing pathological speech are subjective, prompting the need for automated and unbiased evaluation methods. This study proposes a self-supervised Wav2Vec2-based model for phone classification with HNC patients, to enhance accuracy and improve the discrimination of phonetic features for subsequent interpretability purpose. The impact of pre-training datasets, model size, and fine-tuning datasets and parameters are explored. Evaluation on diverse corpora reveals the effectiveness of the Wav2Vec2 architecture, outperforming a CNN-based approach, used in previous work. Correlation with perceptual measures also affirms the model relevance for impaired speech analysis. This work paves the way for better understanding of pathological speech with interpretable approaches for clinicians, by leveraging complex self-learnt speech representations.

Create account to get full access

Overview

This paper presents a comparative analysis of convolutional neural network (CNN) and transformer-based models for objective and interpretable speech disorder assessment.
The researchers evaluated the performance of these models on two different speech disorder datasets, exploring their ability to accurately detect and assess speech impairments.
The paper also investigates the interpretability of the models, providing insights into the features they use to make their assessments.

Plain English Explanation

The researchers in this study wanted to find better ways to objectively and clearly assess speech disorders using artificial intelligence (AI) models. They compared two different types of AI models - convolutional neural networks (CNNs) and transformer-based models - to see which one was better at detecting and evaluating speech problems.

CNNs are a type of AI model that are good at processing images and sound, while transformer-based models are better at understanding language and context. The researchers tested these models on two different datasets of people with speech disorders, to see how accurately they could identify the speech issues.

In addition to looking at the models' performance, the researchers also investigated how the models make their assessments. This "interpretability" is important, as it allows doctors and patients to understand why the AI has made a certain diagnosis or recommendation. The researchers wanted to see which model provided more insight into the specific speech features it was using to assess the disorders.

Technical Explanation

The researchers in this study compared the performance of convolutional neural network (CNN) and transformer-based models for the task of objective and interpretable speech disorder assessment. They evaluated the models on two different speech disorder datasets - the Pathological Speech Quality Assessment (PSQA) dataset and the Tuning Analysis of Audio Classifier (TAAC) dataset.

The CNN model used in this study was designed to be end-to-end interpretable, allowing the researchers to analyze the specific acoustic features the model was using to make its assessments. The transformer-based model leveraged self-supervised pre-training on pathological speech data to improve its performance on the speech disorder tasks.

The researchers found that the transformer-based model outperformed the CNN model in terms of overall classification accuracy on both datasets. However, the CNN model provided more interpretable insights into the specific acoustic features it was using to detect and assess the speech disorders.

Critical Analysis

The paper provides a valuable contribution by comparing the performance and interpretability of different AI models for speech disorder assessment. The researchers' focus on interpretability is particularly important, as it allows clinicians and patients to understand the reasoning behind the models' assessments.

One limitation of the study is that it only evaluated the models on two specific speech disorder datasets. It would be helpful to see how the models perform on a wider range of datasets and speech disorder types to better understand their generalizability.

Additionally, the paper does not delve deeply into the specific acoustic features that the models were using to make their assessments. Further research could explore these features in more detail, potentially leading to a better understanding of the underlying speech characteristics associated with different disorders.

Conclusion

This paper presents a comparative analysis of CNN and transformer-based models for objective and interpretable speech disorder assessment. The results suggest that while transformer-based models may outperform CNNs in terms of classification accuracy, the interpretability of the CNN model can provide valuable insights into the specific acoustic features associated with speech disorders.

The findings of this research have important implications for the development of AI-powered tools for speech disorder diagnosis and monitoring, as they highlight the need to balance model performance with model interpretability. By understanding the specific features that AI models use to assess speech disorders, clinicians and patients can better trust and integrate these technologies into their care and treatment plans.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Exploring Pathological Speech Quality Assessment with ASR-Powered Wav2Vec2 in Data-Scarce Context

Tuan Nguyen, Corinne Fredouille, Alain Ghio, Mathieu Balaguer, Virginie Woisard

Automatic speech quality assessment has raised more attention as an alternative or support to traditional perceptual clinical evaluation. However, most research so far only gains good results on simple tasks such as binary classification, largely due to data scarcity. To deal with this challenge, current works tend to segment patients' audio files into many samples to augment the datasets. Nevertheless, this approach has limitations, as it indirectly relates overall audio scores to individual segments. This paper introduces a novel approach where the system learns at the audio level instead of segments despite data scarcity. This paper proposes to use the pre-trained Wav2Vec2 architecture for both SSL, and ASR as feature extractor in speech assessment. Carried out on the HNC dataset, our ASR-driven approach established a new baseline compared with other approaches, obtaining average $MSE=0.73$ and $MSE=1.15$ for the prediction of intelligibility and severity scores respectively, using only 95 training samples. It shows that the ASR based Wav2Vec2 model brings the best results and may indicate a strong correlation between ASR and speech quality assessment. We also measure its ability on variable segment durations and speech content, exploring factors influencing its decision.

4/1/2024

eess.AS cs.CL cs.LG cs.SD

Voice Disorder Analysis: a Transformer-based Approach

Alkis Koudounas, Gabriele Ciravegna, Marco Fantini, Giovanni Succo, Erika Crosetti, Tania Cerquitelli, Elena Baralis

Voice disorders are pathologies significantly affecting patient quality of life. However, non-invasive automated diagnosis of these pathologies is still under-explored, due to both a shortage of pathological voice data, and diversity of the recording types used for the diagnosis. This paper proposes a novel solution that adopts transformers directly working on raw voice signals and addresses data shortage through synthetic data generation and data augmentation. Further, we consider many recording types at the same time, such as sentence reading and sustained vowel emission, by employing a Mixture of Expert ensemble to align the predictions on different data types. The experimental results, obtained on both public and private datasets, show the effectiveness of our solution in the disorder detection and classification tasks and largely improve over existing approaches.

6/24/2024

eess.AS cs.LG

🗣️

New!Interpreting Pretrained Speech Models for Automatic Speech Assessment of Voice Disorders

Hok-Shing Lau, Mark Huntly, Nathon Morgan, Adesua Iyenoma, Biao Zeng, Tim Bashford

Speech contains information that is clinically relevant to some diseases, which has the potential to be used for health assessment. Recent work shows an interest in applying deep learning algorithms, especially pretrained large speech models to the applications of Automatic Speech Assessment. One question that has not been explored is how these models output the results based on their inputs. In this work, we train and compare two configurations of Audio Spectrogram Transformer in the context of Voice Disorder Detection and apply the attention rollout method to produce model relevance maps, the computed relevance of the spectrogram regions when the model makes predictions. We use these maps to analyse how models make predictions in different conditions and to show that the spread of attention is reduced as a model is finetuned, and the model attention is concentrated on specific phoneme regions.

7/2/2024

cs.SD cs.AI eess.AS

🚀

Tuning In: Analysis of Audio Classifier Performance in Clinical Settings with Limited Data

Hamza Mahdi, Eptehal Nashnoush, Rami Saab, Arjun Balachandar, Rishit Dagli, Lucas X. Perri, Houman Khosravani

This study assesses deep learning models for audio classification in a clinical setting with the constraint of small datasets reflecting real-world prospective data collection. We analyze CNNs, including DenseNet and ConvNeXt, alongside transformer models like ViT, SWIN, and AST, and compare them against pre-trained audio models such as YAMNet and VGGish. Our method highlights the benefits of pre-training on large datasets before fine-tuning on specific clinical data. We prospectively collected two first-of-their-kind patient audio datasets from stroke patients. We investigated various preprocessing techniques, finding that RGB and grayscale spectrogram transformations affect model performance differently based on the priors they learn from pre-training. Our findings indicate CNNs can match or exceed transformer models in small dataset contexts, with DenseNet-Contrastive and AST models showing notable performance. This study highlights the significance of incremental marginal gains through model selection, pre-training, and preprocessing in sound classification; this offers valuable insights for clinical diagnostics that rely on audio classification.

4/9/2024

cs.SD cs.AI cs.LG eess.AS