Training Data Augmentation for Dysarthric Automatic Speech Recognition by Text-to-Dysarthric-Speech Synthesis

Read original: arXiv:2406.08568 - Published 6/14/2024 by Wing-Zin Leung, Mattias Cross, Anton Ragni, Stefan Goetze

Overview

This paper explores using text-to-dysarthric-speech synthesis to augment training data for automatic speech recognition (ASR) of dysarthric speech.
Dysarthria is a motor speech disorder that can make speech difficult to understand, posing challenges for ASR systems.
The proposed approach aims to generate synthetic dysarthric speech samples to supplement the limited real-world dysarthric speech data available for training ASR models.

Plain English Explanation

Automatic speech recognition (ASR) systems sometimes struggle to understand speech from people with dysarthria, a condition that can make speech sound slurred or unclear. This is because ASR models are typically trained on large datasets of "normal" speech, which may not capture the variability and unique characteristics of dysarthric speech.

To address this, the researchers in this paper explored a novel approach using text-to-dysarthric-speech synthesis. The idea is to generate synthetic speech samples that mimic the acoustic properties of dysarthric speech, and then use these generated samples to supplement the limited real-world dysarthric speech data available for training ASR models.

By increasing the amount and diversity of training data, the researchers hoped to improve the ASR system's ability to accurately recognize speech from people with dysarthria. This is an important problem to solve, as dysarthric speech can pose significant communication challenges for individuals affected by conditions like Parkinson's disease, cerebral palsy, or stroke.

Technical Explanation

The researchers developed a text-to-dysarthric-speech (TTDS) synthesis system to generate synthetic samples that resemble dysarthric speech. They used this synthetic speech as additional training data, alongside the limited real-world dysarthric speech data, to fine-tune an ASR foundation model (Whisper).

According to the paper's abstract, the synthesis system uses multi-speaker diffusion-based text-to-speech (TTS) models, which can produce speech samples similar to dysarthric speech.

The researchers evaluated the performance of the ASR model trained with the augmented dataset on a held-out test set of real-world dysarthric speech samples. They report improved synthesis metrics and improved ASR performance for the data augmentation approach compared to current dysarthric ASR baselines.
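ASR performance of this kind is commonly reported as word error rate (WER). The abstracts listed under Related Papers use this metric, though this summary does not state which metric the authors report. As a minimal sketch, assuming WER is the metric, the function below computes it from a word-level edit distance. The name `word_error_rate` is illustrative and does not come from the paper.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] is the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)
```

For example, `word_error_rate("turn the light on", "turn a light")` gives 0.5: one substitution and one deletion against four reference words.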

Critical Analysis

The researchers acknowledge that their approach relies on the ability of the text-to-dysarthric-speech synthesis system to generate realistic and diverse synthetic samples. If the synthetic samples do not accurately capture the full range of variability in real-world dysarthric speech, the benefits of data augmentation may be limited.

Additionally, the paper does not discuss how the synthetic samples were integrated into the training process (e.g., mixing real and synthetic data, weighting, or other techniques). Further research may be needed to explore the optimal strategies for leveraging the synthetic data to improve ASR performance.
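One simple mixing strategy is to cap the synthetic data at a fixed multiple of the real data and shuffle the two together. The sketch below illustrates that idea only. It is not the authors' procedure, and `build_training_mix`, `synthetic_ratio`, and the file names are hypothetical.

```python
import random


def build_training_mix(real, synthetic, synthetic_ratio=1.0, seed=0):
    """Return a shuffled list of (utterance, source) pairs.

    Keeps every real utterance and samples at most
    round(synthetic_ratio * len(real)) synthetic ones without replacement.
    """
    if synthetic_ratio < 0:
        raise ValueError("synthetic_ratio must be non-negative")
    rng = random.Random(seed)  # fixed seed so the mix is reproducible
    n_synth = min(len(synthetic), round(synthetic_ratio * len(real)))
    chosen = rng.sample(list(synthetic), n_synth)
    mix = [(u, "real") for u in real] + [(u, "synthetic") for u in chosen]
    rng.shuffle(mix)
    return mix


# Hypothetical file names, for illustration only.
real = ["real_001.wav", "real_002.wav"]
synthetic = [f"synth_{i:03d}.wav" for i in range(10)]
mix = build_training_mix(real, synthetic, synthetic_ratio=2.0)
```

Sweeping `synthetic_ratio` over a few values and comparing held-out error rates would be one way to probe how much synthetic data helps before it starts to dominate the real recordings.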

Finally, the paper focuses on evaluating the ASR model's performance on a held-out test set, but does not address potential real-world deployment challenges, such as how the system would perform with speech from individuals with different types or severities of dysarthria, or how it would scale to larger, more diverse populations.
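A per-group breakdown is one way to examine the first of these concerns. The sketch below assumes each test utterance carries a severity label together with precomputed word-error and reference-word counts. The labels, the counts, and `wer_by_group` are made up for illustration and are not results from the paper.

```python
from collections import defaultdict


def wer_by_group(results):
    """Aggregate WER per group from (group, word_errors, reference_words) tuples.

    Errors and reference words are summed per group before dividing, so long
    utterances weigh more than short ones.
    """
    errors = defaultdict(int)
    words = defaultdict(int)
    for group, n_errors, n_ref_words in results:
        errors[group] += n_errors
        words[group] += n_ref_words
    return {g: errors[g] / words[g] for g in words if words[g] > 0}


# Made-up counts, for illustration only.
results = [
    ("mild", 1, 10),
    ("mild", 2, 10),
    ("severe", 6, 10),
    ("severe", 4, 10),
]
per_group = wer_by_group(results)
```

A single pooled error rate can hide a large gap between groups, which is why a breakdown like this is worth reporting alongside the overall figure.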

Conclusion

This research demonstrates the potential of using text-to-dysarthric-speech synthesis to augment training data for dysarthric automatic speech recognition. By adding synthetic samples that resemble dysarthric speech to the training data, the researchers report improved ASR performance on real-world dysarthric speech compared to current baselines.

If further developed and validated, this approach could have important implications for improving communication and accessibility for individuals with dysarthria, who often struggle with existing speech recognition technologies. Continued research in this area may also yield broader insights into data augmentation techniques for speech recognition and other language technologies.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Training Data Augmentation for Dysarthric Automatic Speech Recognition by Text-to-Dysarthric-Speech Synthesis

Wing-Zin Leung, Mattias Cross, Anton Ragni, Stefan Goetze

Automatic speech recognition (ASR) research has achieved impressive performance in recent years and has significant potential for enabling access for people with dysarthria (PwD) in augmentative and alternative communication (AAC) and home environment systems. However, progress in dysarthric ASR (DASR) has been limited by high variability in dysarthric speech and limited public availability of dysarthric training data. This paper demonstrates that data augmentation using text-to-dysarthric-speech (TTDS) synthesis for fine-tuning large ASR models is effective for DASR. Specifically, diffusion-based text-to-speech (TTS) models can produce speech samples similar to dysarthric speech that can be used as additional training data for fine-tuning ASR foundation models, in this case Whisper. Results show improved synthesis metrics and ASR performance for the proposed multi-speaker diffusion-based TTDS data augmentation for ASR fine-tuning compared to current DASR baselines.

6/14/2024

Improving Accented Speech Recognition using Data Augmentation based on Unsupervised Text-to-Speech Synthesis

Cong-Thanh Do, Shuhei Imai, Rama Doddipatla, Thomas Hain

This paper investigates the use of unsupervised text-to-speech synthesis (TTS) as a data augmentation method to improve accented speech recognition. TTS systems are trained with a small amount of accented speech training data and their pseudo-labels rather than manual transcriptions, and are hence unsupervised. This approach enables the use of accented speech data without manual transcriptions to perform data augmentation for accented speech recognition. Synthetic accented speech data, generated from text prompts by using the TTS systems, are then combined with available non-accented speech data to train automatic speech recognition (ASR) systems. ASR experiments are performed in a self-supervised learning framework using a Wav2vec2.0 model which was pre-trained on a large amount of unsupervised accented speech data. The accented speech data for training the unsupervised TTS are read speech, selected from the L2-ARCTIC and British Isles corpora, while spontaneous conversational speech from the Edinburgh international accents of English corpus is used as the evaluation data. Experimental results show that Wav2vec2.0 models which are fine-tuned to the downstream ASR task with synthetic accented speech data, generated by the unsupervised TTS, yield up to 6.1% relative word error rate reductions compared to a Wav2vec2.0 baseline which is fine-tuned with the non-accented speech data from the Librispeech corpus.

7/8/2024

On the Effect of Purely Synthetic Training Data for Different Automatic Speech Recognition Architectures

Nick Rossenbach, Benedikt Hilmes, Ralf Schlüter

In this work we evaluate the utility of synthetic data for training automatic speech recognition (ASR). We use the ASR training data to train a text-to-speech (TTS) system similar to FastSpeech-2. With this TTS we reproduce the original training data, training ASR systems solely on synthetic data. For ASR, we use three different architectures, attention-based encoder-decoder, hybrid deep neural network hidden Markov model and a Gaussian mixture hidden Markov model, showing the different sensitivity of the models to synthetic data generation. In order to extend previous work, we present a number of ablation studies on the effectiveness of synthetic vs. real training data for ASR. In particular we focus on how the gap between training on synthetic and real data changes by varying the speaker embedding or by scaling the model size. For the latter we show that the TTS models generalize well, even when training scores indicate overfitting.

7/26/2024

Inclusive ASR for Disfluent Speech: Cascaded Large-Scale Self-Supervised Learning with Targeted Fine-Tuning and Data Augmentation

Dena Mujtaba, Nihar R. Mahapatra, Megan Arney, J. Scott Yaruss, Caryn Herring, Jia Bin

Automatic speech recognition (ASR) systems often falter while processing stuttering-related disfluencies -- such as involuntary blocks and word repetitions -- yielding inaccurate transcripts. A critical barrier to progress is the scarcity of large, annotated disfluent speech datasets. Therefore, we present an inclusive ASR design approach, leveraging large-scale self-supervised learning on standard speech followed by targeted fine-tuning and data augmentation on a smaller, curated dataset of disfluent speech. Our data augmentation technique enriches training datasets with various disfluencies, enhancing ASR processing of these speech patterns. Results show that fine-tuning wav2vec 2.0 with even a relatively small, labeled dataset, alongside data augmentation, can significantly reduce word error rates for disfluent speech. Our approach not only advances ASR inclusivity for people who stutter, but also paves the way for ASRs that can accommodate wider speech variations.

6/17/2024