Zero Shot Text to Speech Augmentation for Automatic Speech Recognition on Low-Resource Accented Speech Corpora

Read original: arXiv:2409.11107 - Published 9/18/2024 by Francesco Nespoli, Daniel Barreda, Patrick A. Naylor

Overview

This paper proposes a zero-shot text-to-speech (TTS) augmentation technique to improve automatic speech recognition (ASR) on low-resource accented speech corpora.
The method uses a pre-trained TTS model to generate synthetic speech samples from text, which are then used to augment the training data for the ASR model.
The authors demonstrate the effectiveness of their approach on three low-resource accented speech datasets, showing improvements in ASR performance.

Plain English Explanation

The paper addresses a common challenge in speech recognition: how to build accurate models for languages or accents that have limited training data available. To overcome this, the researchers developed a technique that uses text-to-speech (TTS) technology to artificially generate more training data.

Here's how it works:

The researchers start with a pre-trained TTS model that can convert text into synthetic speech. This model doesn't need to be specific to the low-resource language or accent they're targeting.
They then take the limited real speech data they have and use the TTS model to generate additional synthetic speech samples by feeding in the corresponding text transcripts.
These synthetic samples are then combined with the original real speech data to train the final ASR model, expanding the training dataset without needing more real-world recordings.
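
The combination in the last step could be sketched as follows. This is a minimal illustration rather than the authors' procedure: the function name `mix_training_set`, the `real_fraction` parameter, and the fixed random seed are all assumptions introduced here, and the summary does not state which mixing ratios the paper used.

```python
import random
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def mix_training_set(
    real: Sequence[T],
    synthetic: Sequence[T],
    real_fraction: float,
    total: int,
    seed: int = 0,
) -> List[T]:
    """Build a training set of `total` items with a chosen share of real data.

    `real_fraction` is a hypothetical knob (not taken from the paper) that
    controls how much of the mix comes from real recordings.
    """
    if not 0.0 <= real_fraction <= 1.0:
        raise ValueError("real_fraction must be between 0 and 1")

    n_real = round(total * real_fraction)
    n_synthetic = total - n_real
    if n_real > len(real) or n_synthetic > len(synthetic):
        raise ValueError("not enough samples to build the requested mix")

    rng = random.Random(seed)  # seeded so the mix is reproducible
    mixed = rng.sample(list(real), n_real) + rng.sample(list(synthetic), n_synthetic)
    rng.shuffle(mixed)  # avoid all real samples appearing first
    return mixed
```

Sampling without replacement keeps every utterance unique in the mix; oversampling the scarce real data would be an equally plausible design, but it is not shown here.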

The key insight is that even though the synthetic speech may not be perfect, it can still provide a valuable training signal to the ASR model and help it become more robust to the target accents. According to the paper's abstract, a model trained on a mix of real and synthetic data outperforms one trained on real accented speech alone.

Technical Explanation

The proposed zero-shot text-to-speech (TTS) augmentation approach works as follows:

Obtain a pre-trained TTS model: The researchers use a pre-trained TTS model that can convert text into synthetic speech, without requiring the TTS model to be specifically trained on the low-resource language or accent.
Generate synthetic speech samples: For each utterance in the limited real speech dataset, the authors use the pre-trained TTS model to generate a corresponding synthetic speech sample by feeding in the text transcript.
Augment the training data: The synthetic speech samples are then combined with the original real speech data to create an augmented training dataset for the final ASR model.
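
The three steps above can be sketched in a few lines. This is a hedged outline under stated assumptions, not the authors' implementation: the `Utterance` type, the `augment_with_tts` function, and the `Synthesizer` callable signature are hypothetical. In particular, passing the real recording as a speaker reference is an assumption about how a zero-shot TTS system might be conditioned; the summary only says that the transcript is fed in.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass(frozen=True)
class Utterance:
    audio: bytes     # placeholder for a waveform
    text: str        # transcript of the utterance
    synthetic: bool  # True when the audio came from TTS


# Hypothetical TTS interface: (transcript, reference_audio) -> synthesized audio.
# The reference argument is an assumption, not something the summary specifies.
Synthesizer = Callable[[str, bytes], bytes]


def augment_with_tts(
    real: Iterable[Utterance], synthesize: Synthesizer
) -> List[Utterance]:
    """Return the real utterances followed by one synthetic copy of each."""
    real_list = list(real)
    augmented = list(real_list)
    for utt in real_list:
        # Re-synthesize the same transcript with the pre-trained TTS model.
        audio = synthesize(utt.text, utt.audio)
        augmented.append(Utterance(audio=audio, text=utt.text, synthetic=True))
    return augmented
```

Keeping a `synthetic` flag on each sample makes it easy to vary or ablate the share of generated data later, for example when comparing against a baseline trained only on real speech.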

The intuition is that imperfect synthetic speech can still provide a useful training signal and help the ASR model become more robust to the target accents. The authors evaluate their approach on three low-resource accented speech datasets: CommonVoice French Canadian, CommonVoice Kenyan English, and NCHLT Afrikaans. According to the paper's abstract, the augmentation mitigates the performance loss on accented data by up to 5% word error rate reduction (WERR), and a model trained on a modest fraction of real data combined with synthetic data outperforms one trained only on real accented speech by up to 14% WERR.
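
For readers unfamiliar with the metric, the sketch below shows one common way to compute word error rate (WER) and a relative word error rate reduction (WERR). Treating WERR as a relative reduction against a baseline is an assumption made here; the summary does not say which definition the paper uses. The function names are hypothetical and the numbers in the usage note are illustrative, not results from the paper.

```python
from typing import List, Sequence


def word_errors(reference: List[str], hypothesis: List[str]) -> int:
    """Word-level Levenshtein distance (substitutions, insertions, deletions)."""
    prev = list(range(len(hypothesis) + 1))
    for i, ref_word in enumerate(reference, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hypothesis, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def wer(references: Sequence[str], hypotheses: Sequence[str]) -> float:
    """Corpus-level WER: total word errors divided by total reference words."""
    total_words = sum(len(r.split()) for r in references)
    if total_words == 0:
        raise ValueError("references contain no words")
    errors = sum(word_errors(r.split(), h.split())
                 for r, h in zip(references, hypotheses))
    return errors / total_words


def relative_werr(baseline_wer: float, new_wer: float) -> float:
    """Relative reduction in WER with respect to a baseline (assumed definition)."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return (baseline_wer - new_wer) / baseline_wer
```

As an illustration, a baseline WER of 0.20 that drops to 0.17 after augmentation corresponds to a relative WERR of about 15% under this definition.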

Critical Analysis

The authors acknowledge several limitations and caveats in their work:

Dependence on pre-trained TTS model: The performance of the proposed approach relies heavily on the quality of the pre-trained TTS model used. If the TTS model produces poor-quality synthetic speech, it may not provide sufficient training signal to the ASR model.
Potential mismatch between synthetic and real speech: Even with a high-quality TTS model, there may still be differences in characteristics like prosody, intonation, and acoustic properties between the synthetic and real speech samples. This mismatch could limit the effectiveness of the augmentation.
Evaluation on limited datasets: The authors only evaluate their approach on three low-resource accented speech datasets. Further research is needed to assess the generalizability of the technique to a wider range of low-resource languages and accents.

Additionally, it would be valuable to explore the impact of the TTS model's synthesis quality on the final ASR performance, and to investigate techniques that better align the characteristics of synthetic and real speech to further improve the augmentation.

Conclusion

This paper presents a zero-shot text-to-speech (TTS) augmentation approach to enhance automatic speech recognition (ASR) performance on low-resource accented speech corpora. By using a pre-trained TTS model to generate synthetic speech samples, the authors expand the training data for the ASR model without requiring additional real-world speech recordings.

The reported results support the technique, with word error rate reductions of up to 14% relative to training on real accented speech alone. This work highlights the potential of data augmentation techniques that draw on auxiliary resources, like pre-trained TTS models, to address limited training data, a common issue in speech recognition, especially for underrepresented languages and accents.

While the approach has some limitations, the authors' insights and findings serve as a valuable contribution to the field of speech recognition, providing a promising direction for further research and development in this area.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Zero Shot Text to Speech Augmentation for Automatic Speech Recognition on Low-Resource Accented Speech Corpora

Francesco Nespoli, Daniel Barreda, Patrick A. Naylor

In recent years, automatic speech recognition (ASR) models greatly improved transcription performance both in clean, low noise, acoustic conditions and in reverberant environments. However, all these systems rely on the availability of hundreds of hours of labelled training data in specific acoustic conditions. When such a training dataset is not available, the performance of the system is heavily impacted. For example, this happens when a specific acoustic environment or a particular population of speakers is under-represented in the training dataset. Specifically, in this paper we investigate the effect of accented speech data on an off-the-shelf ASR system. Furthermore, we suggest a strategy based on zero-shot text-to-speech to augment the accented speech corpora. We show that this augmentation method is able to mitigate the loss in performance of the ASR system on accented data up to 5% word error rate reduction (WERR). In conclusion, we demonstrate that by incorporating a modest fraction of real with synthetically generated data, the ASR system exhibits superior performance compared to a model trained exclusively on authentic accented speech with up to 14% WERR.

9/18/2024

Improving Accented Speech Recognition using Data Augmentation based on Unsupervised Text-to-Speech Synthesis

Cong-Thanh Do, Shuhei Imai, Rama Doddipatla, Thomas Hain

This paper investigates the use of unsupervised text-to-speech synthesis (TTS) as a data augmentation method to improve accented speech recognition. TTS systems are trained with a small amount of accented speech training data and their pseudo-labels rather than manual transcriptions, and hence unsupervised. This approach enables the use of accented speech data without manual transcriptions to perform data augmentation for accented speech recognition. Synthetic accented speech data, generated from text prompts by using the TTS systems, are then combined with available non-accented speech data to train automatic speech recognition (ASR) systems. ASR experiments are performed in a self-supervised learning framework using a Wav2vec2.0 model which was pre-trained on large amount of unsupervised accented speech data. The accented speech data for training the unsupervised TTS are read speech, selected from L2-ARCTIC and British Isles corpora, while spontaneous conversational speech from the Edinburgh international accents of English corpus are used as the evaluation data. Experimental results show that Wav2vec2.0 models which are fine-tuned to downstream ASR task with synthetic accented speech data, generated by the unsupervised TTS, yield up to 6.1% relative word error rate reductions compared to a Wav2vec2.0 baseline which is fine-tuned with the non-accented speech data from Librispeech corpus.

7/8/2024

Enabling ASR for Low-Resource Languages: A Comprehensive Dataset Creation Approach

Ara Yeroyan (Data Science Department, American University of Armenia), Nikolay Karpov (Nvidia, NeMo Conversational AI team)

In recent years, automatic speech recognition (ASR) systems have significantly improved, especially in languages with a vast amount of transcribed speech data. However, ASR systems tend to perform poorly for low-resource languages with fewer resources, such as minority and regional languages. This study introduces a novel pipeline designed to generate ASR training datasets from audiobooks, which typically feature a single transcript associated with hours-long audios. The common structure of these audiobooks poses a unique challenge due to the extensive length of audio segments, whereas optimal ASR training requires segments ranging from 4 to 15 seconds. To address this, we propose a method for effectively aligning audio with its corresponding text and segmenting it into lengths suitable for ASR training. Our approach simplifies data preparation for ASR systems in low-resource languages and demonstrates its application through a case study involving the Armenian language. Our method, which is portable to many low-resource languages, not only mitigates the issue of data scarcity but also enhances the performance of ASR models for underrepresented languages.

6/4/2024

Exploring the Impact of Data Quantity on ASR in Extremely Low-resource Languages

Yao-Fei Cheng, Li-Wei Chen, Hung-Shin Lee, Hsin-Min Wang

This study investigates the efficacy of data augmentation techniques for low-resource automatic speech recognition (ASR), focusing on two endangered Austronesian languages, Amis and Seediq. Recognizing the potential of self-supervised learning (SSL) in low-resource settings, we explore the impact of data volume on the continued pre-training of SSL models. We propose a novel data-selection scheme leveraging a multilingual corpus to augment the limited target language data. This scheme utilizes a language classifier to extract utterance embeddings and employs one-class classifiers to identify utterances phonetically and phonologically proximate to the target languages. Utterances are ranked and selected based on their decision scores, ensuring the inclusion of highly relevant data in the SSL-ASR pipeline. Our experimental results demonstrate the effectiveness of this approach, yielding substantial improvements in ASR performance for both Amis and Seediq. These findings underscore the feasibility and promise of data augmentation through cross-lingual transfer learning for low-resource language ASR.

9/16/2024