Enabling ASR for Low-Resource Languages: A Comprehensive Dataset Creation Approach

2406.01446

Published 6/4/2024 by Ara Yeroyan (Data Science Department, American University of Armenia), Nikolay Karpov (Nvidia, NeMo Conversational AI team)

cs.CL cs.LG eess.AS eess.SP

Abstract

In recent years, automatic speech recognition (ASR) systems have significantly improved, especially in languages with a vast amount of transcribed speech data. However, ASR systems tend to perform poorly for low-resource languages, such as minority and regional languages. This study introduces a novel pipeline designed to generate ASR training datasets from audiobooks, which typically feature a single transcript associated with hours-long audio. The common structure of these audiobooks poses a unique challenge due to the extensive length of audio segments, whereas optimal ASR training requires segments ranging from 4 to 15 seconds. To address this, we propose a method for effectively aligning audio with its corresponding text and segmenting it into lengths suitable for ASR training. Our approach simplifies data preparation for ASR systems in low-resource languages and demonstrates its application through a case study involving the Armenian language. Our method, which is portable to many low-resource languages, not only mitigates the issue of data scarcity but also enhances the performance of ASR models for underrepresented languages.

Overview

This study focuses on addressing the challenge of low-resource languages in automatic speech recognition (ASR) systems.
The researchers introduce a novel pipeline to generate ASR training datasets from audiobooks, which typically have longer audio segments than the optimal length for ASR training.
The proposed method aligns the audio with the corresponding text and segments it into shorter lengths suitable for ASR model training.
The researchers demonstrate the effectiveness of their approach through a case study involving the Armenian language, a low-resource language.

Plain English Explanation

Automatic speech recognition (ASR) systems have made significant progress in recent years, especially for languages with abundant transcribed speech data. However, these systems tend to perform poorly for low-resource languages, such as minority and regional languages, for which little transcribed speech exists.

To address this issue, the researchers in this study developed a new method to create ASR training datasets from audiobooks. Audiobooks typically pair a single transcript with hours-long audio, which poses a unique challenge for ASR training: optimal training requires audio segments between 4 and 15 seconds long, far shorter than a typical audiobook recording.

The researchers' proposed method solves this problem by aligning the audio with the corresponding text and then breaking it down into shorter segments that are more suitable for training ASR models. This approach simplifies the data preparation process for low-resource languages and can enhance the performance of ASR models for underrepresented languages.

To demonstrate the effectiveness of their method, the researchers applied it to the Armenian language, a low-resource language. Their case study shows that this approach can significantly improve ASR performance for languages with limited resources, making it a valuable tool for expanding the reach of speech recognition technologies.

Technical Explanation

The researchers' proposed pipeline addresses the challenge of generating ASR training datasets from audiobooks, whose audio segments are typically far longer than the 4 to 15 seconds best suited to ASR model training.

To solve this problem, the researchers developed a method for effectively aligning the audio with its corresponding text and then segmenting the audio into shorter lengths suitable for ASR training. This involves several key steps:

Audio-Text Alignment: The researchers use advanced alignment techniques to precisely match the audio recordings with their corresponding textual transcripts, even in the presence of long audio segments.
Audio Segmentation: After the alignment, the researchers divide the long audio segments into shorter chunks ranging from 4 to 15 seconds, which is the optimal length for ASR training.
Dataset Generation: The aligned and segmented audio-text pairs are then used to create a high-quality ASR training dataset, which can be used to train robust ASR models for low-resource languages.
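
The segmentation and dataset-generation steps above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the alignment step has already produced word-level timestamps, and it uses a simple greedy grouping rule chosen here for clarity. The names `Word`, `Segment`, `segment_words`, and `to_manifest_lines`, as well as the manifest field names, are hypothetical.

```python
import json
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class Word:
    """One aligned word (assumed output of the alignment step)."""
    text: str
    start: float  # seconds from the start of the recording
    end: float    # seconds from the start of the recording


@dataclass
class Segment:
    text: str
    start: float
    end: float

    @property
    def duration(self) -> float:
        return self.end - self.start


def _close(words: List[Word]) -> Segment:
    """Merge consecutive words into one segment."""
    return Segment(" ".join(w.text for w in words), words[0].start, words[-1].end)


def segment_words(words: Iterable[Word],
                  min_len: float = 4.0,
                  max_len: float = 15.0) -> List[Segment]:
    """Greedily group words into segments no longer than max_len seconds,
    then drop any segment outside the [min_len, max_len] range."""
    segments: List[Segment] = []
    current: List[Word] = []
    for word in words:
        # Close the running segment if adding this word would exceed max_len.
        if current and word.end - current[0].start > max_len:
            segments.append(_close(current))
            current = []
        current.append(word)
    if current:
        segments.append(_close(current))
    return [s for s in segments if min_len <= s.duration <= max_len]


def to_manifest_lines(segments: Iterable[Segment], audio_path: str) -> List[str]:
    """Serialize segments as JSON lines (field names are illustrative)."""
    return [
        json.dumps({
            "audio_filepath": audio_path,
            "offset": s.start,
            "duration": s.duration,
            "text": s.text,
        }, ensure_ascii=False)  # keep non-Latin scripts such as Armenian readable
        for s in segments
    ]
```

A greedy rule keeps the sketch short, but it cuts wherever the length limit is reached. A production pipeline would more likely cut at pauses or sentence boundaries so that segments do not end mid-phrase.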

The researchers evaluated their method through a case study focusing on the Armenian language, a low-resource language with limited speech data available. Their results demonstrate the effectiveness of this approach in enhancing the performance of ASR models for underrepresented languages.
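
This summary does not state which metric the evaluation used. Word error rate (WER) is the customary measure of ASR performance, so the sketch below assumes it. The function name `word_error_rate` is hypothetical, and the code illustrates the metric only; it does not reproduce any result from the paper.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance between hypothesis and reference,
    divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)
```

Lower is better: a WER of 0.0 means the hypothesis matches the reference word for word, and values above 1.0 are possible when the hypothesis contains many inserted words.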

Critical Analysis

The researchers have presented a novel and promising solution to address the data scarcity challenge in ASR systems for low-resource languages. By leveraging the abundant audio and text resources available in audiobooks, their method effectively mitigates the issue of limited training data.

However, the researchers acknowledge potential limitations of their approach. For instance, the accuracy of the audio-text alignment may be affected by factors such as speaker accents, audio quality, and the presence of background noise. Additionally, the researchers note that their method may not be equally effective for all low-resource languages, and further evaluation is needed to assess its broader applicability.

Furthermore, the researchers do not provide a detailed comparison of their approach with other techniques for generating ASR training data in low-resource settings. Such a comparison could help readers better understand the relative strengths and weaknesses of the proposed method.

Despite these potential limitations, the researchers' work represents a significant contribution to the field of ASR, particularly in the context of underrepresented languages. Their novel pipeline demonstrates the potential of leveraging audiobooks as a rich source of training data and provides a valuable framework for future research and development in this area.

Conclusion

This study introduces a novel pipeline for generating ASR training datasets from audiobooks, which are a valuable but underutilized resource for low-resource languages. The researchers' approach effectively aligns the audio with the corresponding text and then segments the audio into shorter lengths suitable for training ASR models.

The researchers' case study on the Armenian language showcases the potential of their method to enhance the performance of ASR systems for underrepresented languages. By mitigating the issue of data scarcity, this work paves the way for more inclusive and accessible speech recognition technologies, benefiting a wider range of language communities.

Overall, the researchers' innovative approach and its promising results highlight the importance of exploring novel data sources and techniques to address the challenges faced by ASR systems in low-resource settings. This study serves as an inspiring example of how researchers can leverage existing resources to overcome the barriers faced by minority and regional languages in the field of automatic speech recognition.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Transcribe, Align and Segment: Creating speech datasets for low-resource languages

Taras Sereda

In this work, we showcase a cost-effective method for generating training data for speech processing tasks. First, we transcribe unlabeled speech using a state-of-the-art Automatic Speech Recognition (ASR) model. Next, we align generated transcripts with the audio and apply segmentation on short utterances. Our focus is on ASR for low-resource languages, such as Ukrainian, using podcasts as a source of unlabeled speech. We release a new dataset UK-PODS that features modern conversational Ukrainian language. It contains over 50 hours of text-audio pairs as well as uk-pods-conformer, a 121M-parameter ASR model that is trained on MCV-10 and UK-PODS and achieves a 3x reduction of Word Error Rate (WER) on podcasts compared to the publicly available uk-nvidia-citrinet while maintaining comparable WER on the MCV-10 test split. Both the dataset UK-PODS https://huggingface.co/datasets/taras-sereda/uk-pods and the ASR model uk-pods-conformer https://huggingface.co/taras-sereda/uk-pods-conformer are available on the Hugging Face hub.

6/19/2024

eess.AS

Error-preserving Automatic Speech Recognition of Young English Learners' Language

Janick Michot, Manuela Hurlimann, Jan Deriu, Luzia Sauer, Katsiaryna Mlynchyk, Mark Cieliebak

One of the central skills that language learners need to practice is speaking the language. Currently, students in school do not get enough speaking opportunities and lack conversational practice. Recent advances in speech technology and natural language processing allow for the creation of novel tools to practice their speaking skills. In this work, we tackle the first component of such a pipeline, namely, the automated speech recognition module (ASR), which faces a number of challenges: first, state-of-the-art ASR models are often trained on adult read-aloud data by native speakers and do not transfer well to young language learners' speech. Second, most ASR systems contain a powerful language model, which smooths out errors made by the speakers. To give corrective feedback, which is a crucial part of language learning, the ASR systems in our setting need to preserve the errors made by the language learners. In this work, we build an ASR system that satisfies these requirements: it works on spontaneous speech by young language learners and preserves their errors. For this, we collected a corpus containing around 85 hours of English audio spoken by learners in Switzerland from grades 4 to 6 on different language learning tasks, which we used to train an ASR model. Our experiments show that our model benefits from direct fine-tuning on children's voices and has a much higher error preservation rate than other models.

6/6/2024

cs.CL cs.AI

GigaSpeech 2: An Evolving, Large-Scale and Multi-domain ASR Corpus for Low-Resource Languages with Automated Crawling, Transcription and Refinement

Yifan Yang, Zheshu Song, Jianheng Zhuo, Mingyu Cui, Jinpeng Li, Bo Yang, Yexing Du, Ziyang Ma, Xunying Liu, Ziyuan Wang, Ke Li, Shuai Fan, Kai Yu, Wei-Qiang Zhang, Guoguo Chen, Xie Chen

The evolution of speech technology has been spurred by the rapid increase in dataset sizes. Traditional speech models generally depend on a large amount of labeled training data, which is scarce for low-resource languages. This paper presents GigaSpeech 2, a large-scale, multi-domain, multilingual speech recognition corpus. It is designed for low-resource languages and does not rely on paired speech and text data. GigaSpeech 2 comprises about 30,000 hours of automatically transcribed speech, including Thai, Indonesian, and Vietnamese, gathered from unlabeled YouTube videos. We also introduce an automated pipeline for data crawling, transcription, and label refinement. Specifically, this pipeline uses Whisper for initial transcription and TorchAudio for forced alignment, combined with multi-dimensional filtering for data quality assurance. A modified Noisy Student Training is developed to further refine flawed pseudo labels iteratively, thus enhancing model performance. Experimental results on our manually transcribed evaluation set and two public test sets from Common Voice and FLEURS confirm our corpus's high quality and broad applicability. Notably, ASR models trained on GigaSpeech 2 can reduce the word error rate for Thai, Indonesian, and Vietnamese on our challenging and realistic YouTube test set by 25% to 40% compared to the Whisper large-v3 model, with merely 10% of the model parameters. Furthermore, our ASR models trained on GigaSpeech 2 yield superior performance compared to commercial services. We believe that our newly introduced corpus and pipeline will open a new avenue for low-resource speech recognition and significantly facilitate research in this area.

6/18/2024

eess.AS cs.CL cs.SD

Unveiling the Potential of LLM-Based ASR on Chinese Open-Source Datasets

Xuelong Geng, Tianyi Xu, Kun Wei, Bingshen Mu, Hongfei Xue, He Wang, Yangze Li, Pengcheng Guo, Yuhang Dai, Longhao Li, Mingchen Shao, Lei Xie

Large Language Models (LLMs) have demonstrated unparalleled effectiveness in various NLP tasks, and integrating LLMs with automatic speech recognition (ASR) is becoming a mainstream paradigm. Building upon this momentum, our research delves into an in-depth examination of this paradigm on a large open-source Chinese dataset. Specifically, our research aims to evaluate the impact of various configurations of speech encoders, LLMs, and projector modules in the context of the speech foundation encoder-LLM ASR paradigm. Furthermore, we introduce a three-stage training approach, expressly developed to enhance the model's ability to align auditory and textual information. The implementation of this approach, alongside the strategic integration of ASR components, enabled us to achieve the SOTA performance on the AISHELL-1, Test_Net, and Test_Meeting test sets. Our analysis presents an empirical foundation for future research in LLM-based ASR systems and offers insights into optimizing performance using Chinese datasets. We will publicly release all scripts used for data preparation, training, inference, and scoring, as well as pre-trained models and training logs to promote reproducible research.

5/7/2024

cs.SD cs.CL eess.AS