ManaTTS Persian: a recipe for creating TTS datasets for lower resource languages

Read original: arXiv:2409.07259 - Published 9/12/2024 by Mahta Fetrat Qharabagh, Zahra Dehghanian, Hamid R. Rabiee

Overview

The paper presents a recipe for creating text-to-speech (TTS) datasets for lower-resource languages, using Persian as the example.
It introduces ManaTTS, a single-speaker Persian corpus of approximately 86 hours of audio sampled at 44.1 kHz and released under the CC-0 license, together with an MIT-licensed pipeline for collecting transcribed speech.
The authors validate the corpus by training a Tacotron2-based TTS model on it, which reaches a Mean Opinion Score (MOS) of 3.76, compared with 4.01 for natural speech.

Plain English Explanation

The paper discusses a technique for building text-to-speech (TTS) systems for languages with limited available data, using Persian as a case study. TTS systems convert written text into synthesized speech. Building a high-quality TTS model typically requires a large dataset of recorded speech, which can be hard to obtain for lower-resource languages.

The researchers built ManaTTS, a corpus of roughly 86 hours of Persian speech from a single speaker, and released it openly along with the tools used to create it. Those tools split text into sentences, cut long recordings into segments of bounded length, and match each audio segment to its transcript (a step known as forced alignment). The authors show that a TTS model trained on the corpus is rated close to natural speech by listeners, even for a lower-resource language like Persian.

This research is significant because it provides a blueprint for building TTS capabilities for languages that lack extensive data. Because both the corpus and the pipeline are openly licensed, other teams can reuse them instead of starting from scratch. This could enable more accessible and inclusive voice technologies for a wider range of languages.

Technical Explanation

The paper presents ManaTTS, a single-speaker TTS corpus for Persian, a language with limited available speech data, and the framework used to build it.

According to the abstract, the corpus comprises approximately 86 hours of audio at a 44.1 kHz sampling rate and is released under the CC-0 license. It is supported by a fully transparent, MIT-licensed pipeline that includes tools for sentence tokenization, bounded audio segmentation, and a novel forced alignment method designed for low-resource languages.

To evaluate the Persian speech recognition models used for forced alignment, the authors also generated a second dataset, VirgoolInformal, which extends over 5 hours of audio.
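The paper's abstract lists bounded audio segmentation among the pipeline's tools. As a rough illustration of the general idea only (this is not the authors' algorithm), the sketch below cuts a long recording at silence points while keeping each segment under a maximum length. The function name `bounded_segments`, the silence timestamps, and the 2 to 12 second bounds are all assumptions made for this example.

```python
from typing import List, Tuple


def bounded_segments(
    silences: List[float],
    total: float,
    min_len: float = 2.0,
    max_len: float = 12.0,
) -> List[Tuple[float, float]]:
    """Cut a recording of `total` seconds into (start, end) segments.

    Hypothetical illustration, not the paper's method. `silences` holds
    candidate cut times in seconds, for example midpoints of detected pauses.
    """
    if not 0 < min_len <= max_len:
        raise ValueError("need 0 < min_len <= max_len")
    cuts = sorted(silences)
    segments: List[Tuple[float, float]] = []
    start = 0.0
    # Keep cutting while the remaining audio is too long for one segment.
    while total - start > max_len:
        # Silences that would give a segment within [min_len, max_len].
        candidates = [t for t in cuts if start + min_len <= t <= start + max_len]
        # Prefer the latest usable silence; otherwise force a cut at max_len.
        end = candidates[-1] if candidates else start + max_len
        segments.append((start, end))
        start = end
    # Whatever remains fits within max_len (it may be shorter than min_len).
    segments.append((start, total))
    return segments


if __name__ == "__main__":
    print(bounded_segments([3.0, 9.5, 14.0, 21.0, 30.0], total=33.0))
    # prints [(0.0, 9.5), (9.5, 21.0), (21.0, 33.0)]
```

Choosing the latest usable silence is a greedy design choice that yields fewer, longer segments. A real pipeline would also need a silence detector and a policy for very short trailing segments, neither of which is shown here.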

The researchers then trained a Tacotron2-based TTS model on the corpus to test whether the data is suitable for speech synthesis.

In the authors' evaluation, the model achieved a Mean Opinion Score (MOS) of 3.76. For comparison, utterances generated by the same vocoder from natural spectrograms scored 3.86, and natural waveforms scored 4.01. The small gap between the synthesized and reference scores suggests that the corpus and the pipeline are effective for a language with limited speech data.
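For readers unfamiliar with the metric, a Mean Opinion Score is the average of listener ratings, usually on a 1 to 5 scale. The sketch below shows one common way to compute it with an approximate 95% confidence interval, and then works out the gaps between the three MOS values reported in the paper's abstract. The helper `mos_with_ci` and the toy ratings are assumptions for illustration; the paper's actual listening-test protocol is not described in this summary.

```python
import math
from statistics import mean, stdev
from typing import Sequence, Tuple


def mos_with_ci(ratings: Sequence[float], z: float = 1.96) -> Tuple[float, float]:
    """Return (MOS, half-width of an approximate 95% confidence interval).

    Uses the normal approximation; hypothetical helper, not from the paper.
    """
    if len(ratings) < 2:
        raise ValueError("need at least two ratings")
    mos = mean(ratings)
    half_width = z * stdev(ratings) / math.sqrt(len(ratings))
    return mos, half_width


# MOS values as reported in the paper's abstract.
REPORTED = {
    "tacotron2_tts": 3.76,
    "vocoder_on_natural_spectrogram": 3.86,
    "natural_waveform": 4.01,
}

if __name__ == "__main__":
    toy_ratings = [4, 4, 3, 5, 4]  # made-up ratings, for illustration only
    mos, half_width = mos_with_ci(toy_ratings)
    print(f"toy MOS = {mos:.2f} +/- {half_width:.2f}")

    tts = REPORTED["tacotron2_tts"]
    for name in ("vocoder_on_natural_spectrogram", "natural_waveform"):
        print(f"gap to {name}: {REPORTED[name] - tts:.2f}")
```

On the reported numbers, the trained model sits 0.10 below the vocoder reference and 0.25 below natural speech.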

Critical Analysis

The paper provides a thoughtful and well-executed approach to building a TTS dataset and system for a low-resource language. Releasing the corpus under CC-0 and the pipeline under the MIT license makes the work easy to reuse, and the separate VirgoolInformal dataset gives a way to evaluate the speech recognition models used for forced alignment.

One potential limitation is that ManaTTS is a single-speaker corpus. A model trained on it reproduces one voice, and the data cannot represent the demographic or dialectal diversity of Persian speakers.

Additionally, the paper does not provide a detailed analysis of the specific challenges encountered when working with the Persian language. It would be helpful to understand the unique linguistic and cultural factors that had to be considered when designing the data collection and model training procedures.

Despite these minor concerns, the ManaTTS Persian method appears to be a promising approach for developing TTS capabilities for low-resource languages. The authors' work could serve as a model for other researchers and practitioners seeking to expand access to voice technologies in underserved communities.

Conclusion

This paper presents ManaTTS, a single-speaker Persian TTS corpus of approximately 86 hours, together with an open pipeline for building such datasets in a language with limited available speech data. The pipeline covers sentence tokenization, bounded audio segmentation, and forced alignment, and the authors used the resulting corpus to train a Tacotron2-based TTS model.

The authors' results support the approach: the trained model reaches a MOS of 3.76, close to the 4.01 measured for natural waveforms. This work provides a valuable blueprint for developing TTS capabilities for other languages that lack extensive speech data, which could help expand the accessibility and inclusivity of voice technologies worldwide.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

ManaTTS Persian: a recipe for creating TTS datasets for lower resource languages

Mahta Fetrat Qharabagh, Zahra Dehghanian, Hamid R. Rabiee

In this study, we introduce ManaTTS, the most extensive publicly accessible single-speaker Persian corpus, and a comprehensive framework for collecting transcribed speech datasets for the Persian language. ManaTTS, released under the open CC-0 license, comprises approximately 86 hours of audio with a sampling rate of 44.1 kHz. Alongside ManaTTS, we also generated the VirgoolInformal dataset to evaluate Persian speech recognition models used for forced alignment, extending over 5 hours of audio. The datasets are supported by a fully transparent, MIT-licensed pipeline, a testament to innovation in the field. It includes unique tools for sentence tokenization, bounded audio segmentation, and a novel forced alignment method. This alignment technique is specifically designed for low-resource languages, addressing a crucial need in the field. With this dataset, we trained a Tacotron2-based TTS model, achieving a Mean Opinion Score (MOS) of 3.76, which is remarkably close to the MOS of 3.86 for the utterances generated by the same vocoder and natural spectrogram, and the MOS of 4.01 for the natural waveform, demonstrating the exceptional quality and effectiveness of the corpus.

9/12/2024

A multilingual training strategy for low resource Text to Speech

Asma Amalas, Mounir Ghogho, Mohamed Chetouani, Rachid Oulad Haj Thami

Recent speech technologies have led to produce high quality synthesised speech due to recent advances in neural Text to Speech (TTS). However, such TTS models depend on extensive amounts of data that can be costly to produce and is hardly scalable to all existing languages, especially that seldom attention is given to low resource languages. With techniques such as knowledge transfer, the burden of creating datasets can be alleviated. In this paper, we therefore investigate two aspects; firstly, whether data from social media can be used for a small TTS dataset construction, and secondly whether cross lingual transfer learning (TL) for a low resource language can work with this type of data. In this aspect, we specifically assess to what extent multilingual modeling can be leveraged as an alternative to training on monolingual corporas. To do so, we explore how data from foreign languages may be selected and pooled to train a TTS model for a target low resource language. Our findings show that multilingual pre-training is better than monolingual pre-training at increasing the intelligibility and naturalness of the generated speech.

9/4/2024

Transcribe, Align and Segment: Creating speech datasets for low-resource languages

Taras Sereda

In this work, we showcase a cost-effective method for generating training data for speech processing tasks. First, we transcribe unlabeled speech using a state-of-the-art Automatic Speech Recognition (ASR) model. Next, we align generated transcripts with the audio and apply segmentation on short utterances. Our focus is on ASR for low-resource languages, such as Ukrainian, using podcasts as a source of unlabeled speech. We release a new dataset UK-PODS that features modern conversational Ukrainian language. It contains over 50 hours of text audio-pairs as well as uk-pods-conformer, a 121 M parameters ASR model that is trained on MCV-10 and UK-PODS and achieves 3x reduction of Word Error Rate (WER) on podcasts comparing to publically available uk-nvidia-citrinet while maintaining comparable WER on MCV-10 test split. Both dataset UK-PODS https://huggingface.co/datasets/taras-sereda/uk-pods and ASR uk-pods-conformer https://huggingface.co/taras-sereda/uk-pods-conformer are available on the hugging-face hub.

6/19/2024

WenetSpeech4TTS: A 12,800-hour Mandarin TTS Corpus for Large Speech Generation Model Benchmark

Linhan Ma, Dake Guo, Kun Song, Yuepeng Jiang, Shuai Wang, Liumeng Xue, Weiming Xu, Huan Zhao, Binbin Zhang, Lei Xie

With the development of large text-to-speech (TTS) models and scale-up of the training data, state-of-the-art TTS systems have achieved impressive performance. In this paper, we present WenetSpeech4TTS, a multi-domain Mandarin corpus derived from the open-sourced WenetSpeech dataset. Tailored for the text-to-speech tasks, we refined WenetSpeech by adjusting segment boundaries, enhancing the audio quality, and eliminating speaker mixing within each segment. Following a more accurate transcription process and quality-based data filtering process, the obtained WenetSpeech4TTS corpus contains 12,800 hours of paired audio-text data. Furthermore, we have created subsets of varying sizes, categorized by segment quality scores to allow for TTS model training and fine-tuning. VALL-E and NaturalSpeech 2 systems are trained and fine-tuned on these subsets to validate the usability of WenetSpeech4TTS, establishing baselines on benchmark for fair comparison of TTS systems. The corpus and corresponding benchmarks are publicly available on huggingface.

6/21/2024