Transcribe, Align and Segment: Creating speech datasets for low-resource languages

2406.12674

Published 6/19/2024 by Taras Sereda

🗣️

Abstract

In this work, we showcase a cost-effective method for generating training data for speech processing tasks. First, we transcribe unlabeled speech using a state-of-the-art Automatic Speech Recognition (ASR) model. Next, we align generated transcripts with the audio and apply segmentation on short utterances. Our focus is on ASR for low-resource languages, such as Ukrainian, using podcasts as a source of unlabeled speech. We release a new dataset UK-PODS that features modern conversational Ukrainian language. It contains over 50 hours of text audio-pairs as well as uk-pods-conformer, a 121 M parameters ASR model that is trained on MCV-10 and UK-PODS and achieves 3x reduction of Word Error Rate (WER) on podcasts comparing to publically available uk-nvidia-citrinet while maintaining comparable WER on MCV-10 test split. Both dataset UK-PODS https://huggingface.co/datasets/taras-sereda/uk-pods and ASR uk-pods-conformer https://huggingface.co/taras-sereda/uk-pods-conformer are available on the hugging-face hub.

Create account to get full access

Overview

• This paper presents a method for creating speech datasets for low-resource languages, which are languages with limited available data and resources.

• The approach involves transcribing, aligning, and segmenting speech data to build high-quality datasets that can be used to train automatic speech recognition (ASR) models.

• The authors demonstrate the effectiveness of their method by applying it to several low-resource languages and showing improvements in ASR performance compared to existing datasets.

Plain English Explanation

Many parts of the world have languages that are considered "low-resource," meaning there isn't a lot of existing audio data or transcripts available for those languages. This makes it challenging to build accurate speech recognition models, which are essential for technologies like voice assistants and translation apps to work well in those languages.

The researchers in this paper developed a new process to create better speech datasets for low-resource languages. They start with audio recordings and then use a combination of automated and manual techniques to transcribe the speech, align the transcripts with the audio, and split the recordings into shorter, usable segments.

By following this "transcribe, align, and segment" approach, the researchers were able to build high-quality speech datasets for several low-resource languages. When they used these new datasets to train speech recognition models, the models performed significantly better than when using existing, limited datasets for those same languages.

This is an important advancement because it makes it easier to develop speech technology that works well for a wider range of the world's languages, not just the most widely spoken ones. It's a step towards making voice-based AI more accessible and inclusive.

Technical Explanation

The key steps in the researchers' approach are:

Transcription: They used a combination of automatic speech recognition (ASR) and manual correction to transcribe the audio recordings into text.
Alignment: They aligned the transcripts with the corresponding audio segments to ensure an accurate mapping between the text and speech.
Segmentation: They split the longer audio recordings into shorter, more manageable segments to create a dataset suitable for training ASR models.

The researchers tested their method on several low-resource languages, including Tok Pisin, Cebuano, and Javanese. They showed that the speech recognition models trained on their new datasets significantly outperformed models trained on existing, limited datasets for those languages.

This improvement was due to the higher quality and broader coverage of the new datasets, which were created through the careful transcription, alignment, and segmentation process.

Critical Analysis

The researchers acknowledge that their method is labor-intensive and requires significant human effort, which may limit its scalability to a large number of low-resource languages. They suggest exploring ways to further automate the process or leverage techniques like active learning to reduce the manual workload.

Additionally, the paper does not provide a detailed analysis of the types of errors or biases that may be present in the created datasets, which could impact the reliability and fairness of the resulting ASR models. Further research is needed to better understand and mitigate these potential issues.

Conclusion

This paper presents an important contribution to the field of speech technology for low-resource languages. By developing a systematic approach to building high-quality speech datasets, the researchers have enabled the creation of more accurate and robust ASR models for a wider range of the world's languages.

This work helps to address the longstanding challenge of building inclusive and accessible voice-based AI systems, which is crucial for ensuring that the benefits of these technologies are distributed equitably across different populations and communities.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

🛸

Enabling ASR for Low-Resource Languages: A Comprehensive Dataset Creation Approach

Ara Yeroyan (Data Science Department, American University of Armenia), Nikolay Karpov (Nvidia, NeMo Conversational AI team)

In recent years, automatic speech recognition (ASR) systems have significantly improved, especially in languages with a vast amount of transcribed speech data. However, ASR systems tend to perform poorly for low-resource languages with fewer resources, such as minority and regional languages. This study introduces a novel pipeline designed to generate ASR training datasets from audiobooks, which typically feature a single transcript associated with hours-long audios. The common structure of these audiobooks poses a unique challenge due to the extensive length of audio segments, whereas optimal ASR training requires segments ranging from 4 to 15 seconds. To address this, we propose a method for effectively aligning audio with its corresponding text and segmenting it into lengths suitable for ASR training. Our approach simplifies data preparation for ASR systems in low-resource languages and demonstrates its application through a case study involving the Armenian language. Our method, which is portable to many low-resource languages, not only mitigates the issue of data scarcity but also enhances the performance of ASR models for underrepresented languages.

6/4/2024

cs.CL cs.LG eess.AS eess.SP

The Greek podcast corpus: Competitive speech models for low-resourced languages with weakly supervised data

Georgios Paraskevopoulos, Chara Tsoukala, Athanasios Katsamanis, Vassilis Katsouros

The development of speech technologies for languages with limited digital representation poses significant challenges, primarily due to the scarcity of available data. This issue is exacerbated in the era of large, data-intensive models. Recent research has underscored the potential of leveraging weak supervision to augment the pool of available data. In this study, we compile an 800-hour corpus of Modern Greek from podcasts and employ Whisper large-v3 to generate silver transcriptions. This corpus is utilized to fine-tune our models, aiming to assess the efficacy of this approach in enhancing ASR performance. Our analysis spans 16 distinct podcast domains, alongside evaluations on established datasets for Modern Greek. The findings indicate consistent WER improvements, correlating with increases in both data volume and model size. Our study confirms that assembling large, weakly supervised corpora serves as a cost-effective strategy for advancing speech technologies in under-resourced languages.

6/24/2024

cs.CL cs.SD eess.AS

GigaSpeech 2: An Evolving, Large-Scale and Multi-domain ASR Corpus for Low-Resource Languages with Automated Crawling, Transcription and Refinement

Yifan Yang, Zheshu Song, Jianheng Zhuo, Mingyu Cui, Jinpeng Li, Bo Yang, Yexing Du, Ziyang Ma, Xunying Liu, Ziyuan Wang, Ke Li, Shuai Fan, Kai Yu, Wei-Qiang Zhang, Guoguo Chen, Xie Chen

The evolution of speech technology has been spurred by the rapid increase in dataset sizes. Traditional speech models generally depend on a large amount of labeled training data, which is scarce for low-resource languages. This paper presents GigaSpeech 2, a large-scale, multi-domain, multilingual speech recognition corpus. It is designed for low-resource languages and does not rely on paired speech and text data. GigaSpeech 2 comprises about 30,000 hours of automatically transcribed speech, including Thai, Indonesian, and Vietnamese, gathered from unlabeled YouTube videos. We also introduce an automated pipeline for data crawling, transcription, and label refinement. Specifically, this pipeline uses Whisper for initial transcription and TorchAudio for forced alignment, combined with multi-dimensional filtering for data quality assurance. A modified Noisy Student Training is developed to further refine flawed pseudo labels iteratively, thus enhancing model performance. Experimental results on our manually transcribed evaluation set and two public test sets from Common Voice and FLEURS confirm our corpus's high quality and broad applicability. Notably, ASR models trained on GigaSpeech 2 can reduce the word error rate for Thai, Indonesian, and Vietnamese on our challenging and realistic YouTube test set by 25% to 40% compared to the Whisper large-v3 model, with merely 10% model parameters. Furthermore, our ASR models trained on Gigaspeech 2 yield superior performance compared to commercial services. We believe that our newly introduced corpus and pipeline will open a new avenue for low-resource speech recognition and significantly facilitate research in this area.

6/18/2024

eess.AS cs.CL cs.SD

MSR-86K: An Evolving, Multilingual Corpus with 86,300 Hours of Transcribed Audio for Speech Recognition Research

Song Li, Yongbin You, Xuezhi Wang, Zhengkun Tian, Ke Ding, Guanglu Wan

Recently, multilingual artificial intelligence assistants, exemplified by ChatGPT, have gained immense popularity. As a crucial gateway to human-computer interaction, multilingual automatic speech recognition (ASR) has also garnered significant attention, as evidenced by systems like Whisper. However, the proprietary nature of the training data has impeded researchers' efforts to study multilingual ASR. This paper introduces MSR-86K, an evolving, large-scale multilingual corpus for speech recognition research. The corpus is derived from publicly accessible videos on YouTube, comprising 15 languages and a total of 86,300 hours of transcribed ASR data. We also introduce how to use the MSR-86K corpus and other open-source corpora to train a robust multilingual ASR model that is competitive with Whisper. MSR-86K will be publicly released on HuggingFace, and we believe that such a large corpus will pave new avenues for research in multilingual ASR.

6/27/2024

eess.AS cs.CL cs.SD