Improving Accented Speech Recognition using Data Augmentation based on Unsupervised Text-to-Speech Synthesis

Read original: arXiv:2407.04047 - Published 7/8/2024 by Cong-Thanh Do, Shuhei Imai, Rama Doddipatla, Thomas Hain

Overview

Improving the accuracy of speech recognition for accented speech using data augmentation based on unsupervised text-to-speech synthesis
Addresses the challenge of limited training data for accented speech recognition
Proposes a novel approach to generate synthetic accented speech samples for data augmentation

Plain English Explanation

Speech recognition models often struggle to accurately transcribe speech from individuals with accents, as the models are typically trained on speech samples from native speakers. This can be a significant barrier for many users, especially those from diverse linguistic backgrounds.

To address this issue, the researchers developed a data augmentation approach that leverages unsupervised text-to-speech (TTS) synthesis. The key idea is to generate realistic synthetic speech samples with different accents, which can then be used to augment the training data for the speech recognition model.

The process works as follows:

Obtain Accented Speech Samples: The researchers select a small set of accented read-speech recordings from the L2-ARCTIC and British Isles corpora.
Train Unsupervised TTS Model: They train TTS systems on this accented speech paired with pseudo-labels instead of manual transcriptions, which is why the approach is called unsupervised.
Generate Synthetic Accented Speech: The trained TTS systems generate synthetic accented speech from text prompts.
Augment Training Data: The synthetic accented speech is combined with the available non-accented speech data from the LibriSpeech corpus to train the speech recognition model.
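
The final mixing step above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the `Utterance` record, the `augment_training_set` function, and the `synthesize` callable (a stand-in for a trained TTS system that maps a text prompt to an audio file path) are all hypothetical names introduced here.

```python
import random
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Utterance:
    audio_path: str  # path to a waveform file (hypothetical layout)
    text: str        # transcript, or the text prompt for synthetic audio
    source: str      # "real" or "synthetic"


def augment_training_set(
    non_accented: List[Utterance],
    text_prompts: List[str],
    synthesize: Callable[[str], str],
    seed: int = 0,
) -> List[Utterance]:
    """Mix real non-accented data with synthetic accented utterances.

    `synthesize` is a placeholder for the trained TTS system: it takes a
    text prompt and returns the path of the generated audio file.
    """
    synthetic = [
        Utterance(audio_path=synthesize(prompt), text=prompt, source="synthetic")
        for prompt in text_prompts
    ]
    mixed = list(non_accented) + synthetic
    # Shuffle with a fixed seed so the mixed training order is reproducible.
    random.Random(seed).shuffle(mixed)
    return mixed
```

Because the synthetic audio is generated from a text prompt, the prompt itself serves as the transcript, so no manual labelling is needed for the added data.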

With this augmentation, the fine-tuned speech recognition model achieved up to a 6.1% relative reduction in word error rate on accented speech, compared with a baseline fine-tuned only on non-accented LibriSpeech data.

Technical Explanation

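The ASR experiments use a self-supervised learning framework: a Wav2vec 2.0 model, pre-trained on a large amount of unlabeled accented speech, is fine-tuned for the downstream ASR task, with the fine-tuning data augmented by synthetic accented speech generated by the unsupervised TTS systems.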

The TTS systems are trained on a small amount of accented speech paired with pseudo-labels rather than manual transcriptions. This makes it possible to use accented speech data that has no manual transcripts for augmentation: once trained, the TTS systems generate synthetic accented speech from text prompts.
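
The paper's abstract states that the TTS systems are trained on pseudo-labels instead of manual transcriptions, but this summary does not say how those pseudo-labels are produced. A common choice, assumed here, is to decode the untranscribed audio with an existing ASR model. The sketch below shows only that pairing step; `build_tts_pairs` and the `transcribe` callable are hypothetical names, and the lowercasing and empty-output filter are illustrative choices rather than details from the paper.

```python
from typing import Callable, Iterable, List, Tuple


def build_tts_pairs(
    audio_paths: Iterable[str],
    transcribe: Callable[[str], str],
) -> List[Tuple[str, str]]:
    """Pair each untranscribed recording with an ASR-generated pseudo-label.

    `transcribe` stands in for any existing ASR model (an assumption):
    it takes an audio file path and returns the recognized text.
    """
    pairs = []
    for path in audio_paths:
        # Light normalization; an illustrative choice, not from the paper.
        pseudo_label = transcribe(path).strip().lower()
        if pseudo_label:  # skip recordings for which the ASR returned nothing
            pairs.append((path, pseudo_label))
    return pairs
```

The resulting (audio, pseudo-label) pairs play the role that (audio, manual transcript) pairs play in conventional supervised TTS training.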

The accented speech used to train the TTS systems is read speech, while evaluation uses spontaneous conversational speech from the Edinburgh international accents of English corpus. Wav2vec 2.0 models fine-tuned with the synthetic accented speech yielded up to 6.1% relative word error rate reductions compared with a Wav2vec 2.0 baseline fine-tuned only on non-accented speech from the LibriSpeech corpus.
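
The paper reports its gains as relative word error rate (WER) reductions. As a reminder of what that metric means, here is a small self-contained sketch: WER is the word-level edit distance between reference and hypothesis divided by the reference length, and the relative reduction compares a new WER against a baseline WER. The function names are chosen here for illustration, and the numbers in the usage note are made up to show the arithmetic, not taken from the paper.

```python
from typing import List


def word_errors(reference: List[str], hypothesis: List[str]) -> int:
    """Word-level Levenshtein distance (substitutions, deletions, insertions)."""
    prev = list(range(len(hypothesis) + 1))
    for i, ref_word in enumerate(reference, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hypothesis, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate of a hypothesis against a non-empty reference."""
    ref = reference.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    return word_errors(ref, hypothesis.split()) / len(ref)


def relative_wer_reduction(baseline_wer: float, new_wer: float) -> float:
    """Fraction by which new_wer improves on baseline_wer."""
    return (baseline_wer - new_wer) / baseline_wer
```

For example, with an illustrative baseline WER of 0.20 and an augmented-model WER of 0.1878, `relative_wer_reduction(0.20, 0.1878)` is about 0.061, that is, a 6.1% relative reduction even though the absolute difference is only 1.22 percentage points.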

The experiments also showed that the unsupervised TTS-based data augmentation approach outperformed other techniques, such as audio perturbation or phonemic-prosodic annotation.

Critical Analysis

The proposed approach has several strengths, including its ability to generate synthetic accented speech without requiring manually transcribed accented data, and its demonstrated effectiveness in improving speech recognition performance on accented speech.

However, the researchers acknowledge several limitations and areas for further research. First, the quality and diversity of the synthetic speech samples generated by the TTS model may still be limited, and more advanced TTS architectures or techniques could potentially improve the results.

Second, the approach relies on the availability of a small set of accented speech samples to train the TTS model. In cases where such samples are scarce, the effectiveness of the data augmentation may be reduced.

Third, the researchers note that the proposed method may not generalize equally well to all types of accents, and further investigation is needed to understand the model's performance on a broader range of linguistic variations.

Finally, the paper does not explore the potential trade-offs or unintended consequences of using synthetic speech samples for data augmentation, such as the risk of introducing biases or artifacts into the speech recognition model.

Overall, the research presents a promising approach to improving accented speech recognition, but further refinement and more comprehensive evaluation are needed to fully understand its limitations and potential real-world applicability.

Conclusion

This paper introduces a novel data augmentation technique for improving the accuracy of speech recognition models on accented speech. By leveraging unsupervised text-to-speech synthesis to generate synthetic accented speech samples, the researchers were able to significantly boost the performance of the Wav2vec 2.0 speech recognition model on non-native English speakers.

The proposed approach addresses an important challenge in the field of automatic speech recognition, where models often struggle with linguistic diversity and accents. While the method has some limitations, the results demonstrate the potential of using synthetic data generation to enhance the robustness and inclusivity of speech recognition systems, with broader implications for accessibility and user experience in a wide range of applications.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Improving Accented Speech Recognition using Data Augmentation based on Unsupervised Text-to-Speech Synthesis

Cong-Thanh Do, Shuhei Imai, Rama Doddipatla, Thomas Hain

This paper investigates the use of unsupervised text-to-speech synthesis (TTS) as a data augmentation method to improve accented speech recognition. TTS systems are trained with a small amount of accented speech training data and their pseudo-labels rather than manual transcriptions, and are hence unsupervised. This approach enables the use of accented speech data without manual transcriptions to perform data augmentation for accented speech recognition. Synthetic accented speech data, generated from text prompts by using the TTS systems, are then combined with available non-accented speech data to train automatic speech recognition (ASR) systems. ASR experiments are performed in a self-supervised learning framework using a Wav2vec2.0 model which was pre-trained on a large amount of unsupervised accented speech data. The accented speech data for training the unsupervised TTS are read speech, selected from L2-ARCTIC and British Isles corpora, while spontaneous conversational speech from the Edinburgh international accents of English corpus is used as the evaluation data. Experimental results show that Wav2vec2.0 models which are fine-tuned to downstream ASR task with synthetic accented speech data, generated by the unsupervised TTS, yield up to 6.1% relative word error rate reductions compared to a Wav2vec2.0 baseline which is fine-tuned with the non-accented speech data from the LibriSpeech corpus.

7/8/2024

Training Data Augmentation for Dysarthric Automatic Speech Recognition by Text-to-Dysarthric-Speech Synthesis

Wing-Zin Leung, Mattias Cross, Anton Ragni, Stefan Goetze

Automatic speech recognition (ASR) research has achieved impressive performance in recent years and has significant potential for enabling access for people with dysarthria (PwD) in augmentative and alternative communication (AAC) and home environment systems. However, progress in dysarthric ASR (DASR) has been limited by high variability in dysarthric speech and limited public availability of dysarthric training data. This paper demonstrates that data augmentation using text-to-dysarthric-speech (TTDS) synthesis for finetuning large ASR models is effective for DASR. Specifically, diffusion-based text-to-speech (TTS) models can produce speech samples similar to dysarthric speech that can be used as additional training data for fine-tuning ASR foundation models, in this case Whisper. Results show improved synthesis metrics and ASR performance for the proposed multi-speaker diffusion-based TTDS data augmentation for ASR fine-tuning compared to current DASR baselines.

6/14/2024

On the Effect of Purely Synthetic Training Data for Different Automatic Speech Recognition Architectures

Nick Rossenbach, Benedikt Hilmes, Ralf Schlüter

In this work we evaluate the utility of synthetic data for training automatic speech recognition (ASR). We use the ASR training data to train a text-to-speech (TTS) system similar to FastSpeech-2. With this TTS we reproduce the original training data, training ASR systems solely on synthetic data. For ASR, we use three different architectures, attention-based encoder-decoder, hybrid deep neural network hidden Markov model and a Gaussian mixture hidden Markov model, showing the different sensitivity of the models to synthetic data generation. In order to extend previous work, we present a number of ablation studies on the effectiveness of synthetic vs. real training data for ASR. In particular we focus on how the gap between training on synthetic and real data changes by varying the speaker embedding or by scaling the model size. For the latter we show that the TTS models generalize well, even when training scores indicate overfitting.

7/26/2024

On the Problem of Text-To-Speech Model Selection for Synthetic Data Generation in Automatic Speech Recognition

Nick Rossenbach, Ralf Schlüter, Sakriani Sakti

The rapid development of neural text-to-speech (TTS) systems enabled its usage in other areas of natural language processing such as automatic speech recognition (ASR) or spoken language translation (SLT). Due to the large number of different TTS architectures and their extensions, selecting which TTS systems to use for synthetic data creation is not an easy task. We use the comparison of five different TTS decoder architectures in the scope of synthetic data generation to show the impact on CTC-based speech recognition training. We compare the recognition results to computable metrics like NISQA MOS and intelligibility, finding that there are no clear relations to the ASR performance. We also observe that for data generation auto-regressive decoding performs better than non-autoregressive decoding, and propose an approach to quantify TTS generalization capabilities.

8/1/2024