Developing vocal system impaired patient-aimed voice quality assessment approach using ASR representation-included multiple features

Read original: arXiv:2408.12279 - Published 8/23/2024 by Shaoxiang Dang, Tetsuya Matsumoto, Yoshinori Takeuchi, Takashi Tsuboi, Yasuhiro Tanaka, Daisuke Nakatsubo, Satoshi Maesawa, Ryuta Saito, Masahisa Katsuno, Hiroaki Kudo

Developing vocal system impaired patient-aimed voice quality assessment approach using ASR representation-included multiple features

Overview

The research paper proposes a new approach for assessing voice quality in patients with vocal system impairments.
The approach uses multiple features, including those derived from automatic speech recognition (ASR) representations, to provide a more comprehensive evaluation.
The goal is to develop a voice quality assessment system that is tailored specifically for patients with vocal system disorders.

Plain English Explanation

The paper focuses on finding better ways to measure the quality of voice in people who have problems with their vocal system. This could include conditions like vocal cord paralysis or spasmodic dysphonia.

The researchers developed a new system that uses a variety of different measurements to evaluate voice quality. This includes not just traditional acoustic features, but also features extracted from automatic speech recognition (ASR) models. ASR models are computer programs that can transcribe speech into text. The researchers found that the representations learned by these ASR models contain useful information for assessing voice problems.

By combining multiple types of voice features, the new approach aims to provide a more comprehensive and accurate evaluation of voice quality, specifically tailored for patients with vocal system disorders. This could lead to better diagnosis, monitoring, and treatment of these conditions.

Technical Explanation

The paper proposes a novel voice quality assessment approach that leverages multiple features, including those derived from automatic speech recognition (ASR) representations. The key elements are:

Feature Extraction: The system extracts a diverse set of acoustic features, such as pitch, energy, and spectral characteristics, as well as features derived from pre-trained ASR models. These ASR-based features capture higher-level speech representations that can be informative for assessing vocal system impairments.
Model Training: The extracted features are used to train a machine learning model (e.g., support vector regression) to predict overall voice quality scores. The model is trained on a dataset of voice samples from patients with various vocal disorders, along with corresponding subjective voice quality ratings provided by expert clinicians.
Evaluation: The proposed approach is evaluated on a held-out test set, and its performance is compared to other voice quality assessment methods. The results demonstrate that the inclusion of ASR-derived features can lead to improved correlation with subjective voice quality ratings, compared to using only traditional acoustic features.

The key insight is that the rich, high-level representations learned by state-of-the-art ASR models can provide valuable information for assessing voice quality, beyond what can be captured by low-level acoustic features alone. This multi-faceted approach aims to enable more accurate and patient-tailored voice quality evaluation for those with vocal system disorders.

Critical Analysis

The paper provides a well-designed and thorough investigation of using ASR-derived features for voice quality assessment in patients with vocal system impairments. The authors carefully construct their feature set, model training procedure, and evaluation methodology, which strengthens the credibility of their findings.

However, the study does have some limitations that could be addressed in future research:

Diversity of Vocal Disorders: The dataset used in the study may not have covered the full spectrum of vocal system disorders. Evaluating the approach on a more diverse patient population would help validate its broader applicability.
Real-World Deployment: The paper focuses on the technical performance of the proposed method, but does not discuss practical considerations for deploying such a system in clinical settings. Factors like ease of use, integration with existing workflows, and interpretability of the results would be important to consider.
Explainability of ASR-derived Features: While the ASR-based features improve performance, their specific contributions and relationships to voice quality characteristics are not fully explored. Providing more insight into how these features relate to clinically relevant voice attributes could enhance the interpretability of the system.

Overall, the research represents a valuable step towards developing more comprehensive and patient-tailored voice quality assessment tools. Further exploration of the limitations mentioned above could strengthen the real-world applicability and clinical utility of this approach.

Conclusion

This paper presents a novel voice quality assessment method that leverages multiple features, including those derived from automatic speech recognition (ASR) models, to provide a more comprehensive and patient-focused evaluation for individuals with vocal system impairments.

The key innovation is the incorporation of ASR-based features, which capture higher-level speech representations that can be informative for assessing vocal disorders. By combining these ASR-derived features with traditional acoustic measures, the proposed approach demonstrates improved performance in predicting subjective voice quality ratings compared to using acoustic features alone.

The findings of this research have the potential to lead to more accurate diagnosis, monitoring, and treatment of vocal system disorders. Further development and validation of this approach in diverse clinical settings could enhance its real-world applicability and impact on patient care.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Follow @aimodelsfyi on 𝕏 →

Related Papers

Developing vocal system impaired patient-aimed voice quality assessment approach using ASR representation-included multiple features

Shaoxiang Dang, Tetsuya Matsumoto, Yoshinori Takeuchi, Takashi Tsuboi, Yasuhiro Tanaka, Daisuke Nakatsubo, Satoshi Maesawa, Ryuta Saito, Masahisa Katsuno, Hiroaki Kudo

The potential of deep learning in clinical speech processing is immense, yet the hurdles of limited and imbalanced clinical data samples loom large. This article addresses these challenges by showcasing the utilization of automatic speech recognition and self-supervised learning representations, pre-trained on extensive datasets of normal speech. This innovative approach aims to estimate voice quality of patients with impaired vocal systems. Experiments involve checks on PVQD dataset, covering various causes of vocal system damage in English, and a Japanese dataset focusing on patients with Parkinson's disease before and after undergoing subthalamic nucleus deep brain stimulation (STN-DBS) surgery. The results on PVQD reveal a notable correlation (>0.8 on PCC) and an extraordinary accuracy (<0.5 on MSE) in predicting Grade, Breathy, and Asthenic indicators. Meanwhile, progress has been achieved in predicting the voice quality of patients in the context of STN-DBS.

8/23/2024

Voice Disorder Analysis: a Transformer-based Approach

Alkis Koudounas, Gabriele Ciravegna, Marco Fantini, Giovanni Succo, Erika Crosetti, Tania Cerquitelli, Elena Baralis

Voice disorders are pathologies significantly affecting patient quality of life. However, non-invasive automated diagnosis of these pathologies is still under-explored, due to both a shortage of pathological voice data, and diversity of the recording types used for the diagnosis. This paper proposes a novel solution that adopts transformers directly working on raw voice signals and addresses data shortage through synthetic data generation and data augmentation. Further, we consider many recording types at the same time, such as sentence reading and sustained vowel emission, by employing a Mixture of Expert ensemble to align the predictions on different data types. The experimental results, obtained on both public and private datasets, show the effectiveness of our solution in the disorder detection and classification tasks and largely improve over existing approaches.

6/24/2024

Exploring Pathological Speech Quality Assessment with ASR-Powered Wav2Vec2 in Data-Scarce Context

Tuan Nguyen, Corinne Fredouille, Alain Ghio, Mathieu Balaguer, Virginie Woisard

Automatic speech quality assessment has raised more attention as an alternative or support to traditional perceptual clinical evaluation. However, most research so far only gains good results on simple tasks such as binary classification, largely due to data scarcity. To deal with this challenge, current works tend to segment patients' audio files into many samples to augment the datasets. Nevertheless, this approach has limitations, as it indirectly relates overall audio scores to individual segments. This paper introduces a novel approach where the system learns at the audio level instead of segments despite data scarcity. This paper proposes to use the pre-trained Wav2Vec2 architecture for both SSL, and ASR as feature extractor in speech assessment. Carried out on the HNC dataset, our ASR-driven approach established a new baseline compared with other approaches, obtaining average $MSE=0.73$ and $MSE=1.15$ for the prediction of intelligibility and severity scores respectively, using only 95 training samples. It shows that the ASR based Wav2Vec2 model brings the best results and may indicate a strong correlation between ASR and speech quality assessment. We also measure its ability on variable segment durations and speech content, exploring factors influencing its decision.

4/1/2024

🗣️

Interpreting Pretrained Speech Models for Automatic Speech Assessment of Voice Disorders

Hok-Shing Lau, Mark Huntly, Nathon Morgan, Adesua Iyenoma, Biao Zeng, Tim Bashford

Speech contains information that is clinically relevant to some diseases, which has the potential to be used for health assessment. Recent work shows an interest in applying deep learning algorithms, especially pretrained large speech models to the applications of Automatic Speech Assessment. One question that has not been explored is how these models output the results based on their inputs. In this work, we train and compare two configurations of Audio Spectrogram Transformer in the context of Voice Disorder Detection and apply the attention rollout method to produce model relevance maps, the computed relevance of the spectrogram regions when the model makes predictions. We use these maps to analyse how models make predictions in different conditions and to show that the spread of attention is reduced as a model is finetuned, and the model attention is concentrated on specific phoneme regions.

7/2/2024