Breaking Walls: Pioneering Automatic Speech Recognition for Central Kurdish: End-to-End Transformer Paradigm

Read original: arXiv:2406.02561 - Published 9/10/2024 by Abdulhady Abas Abdullah, Hadi Veisi, Tarik Rashid


Overview

This paper develops an automatic speech recognition (ASR) system for Central Kurdish, a low-resource language.
The researchers built the system with end-to-end transformer models and transfer learning.
They created a 224-hour Central Kurdish speech corpus and used it to train the acoustic model.
The resulting model achieved a state-of-the-art Word Error Rate (WER) of 13% on the Asosoft test set, a significant advance in ASR technology for Central Kurdish.

Plain English Explanation

Automatic speech recognition (ASR) is a technology that allows computers to transcribe spoken language into text. It is an important area of speech processing with many practical applications. Two current challenges in ASR are making systems robust to noise and building them for "low-resource" languages, which lack large amounts of available training data.

In this research paper, the authors focused on developing an ASR system for Central Kurdish, a low-resource language. Central Kurdish, also known as Sorani, is one of the three main dialects of Kurdish, a language spoken by over 30 million people.

To build the ASR system, the researchers used a type of neural network architecture called an "end-to-end transformer." This approach lets a single model convert the input speech directly into text, rather than chaining together separately built components such as an acoustic model and a pronunciation dictionary.

The researchers also used a technique called "transfer learning," which takes a model already trained on a large dataset and fine-tunes it on a smaller, more specific dataset. This can improve performance, especially for low-resource languages where little training data exists.

The team collected a 224-hour speech corpus and used it to train the acoustic model for the Central Kurdish ASR system. The resulting model achieved a Word Error Rate (WER) of 13% on the Asosoft test set. WER counts the word substitutions, deletions, and insertions needed to turn the system's transcript into the reference transcript, divided by the number of words in the reference, so lower is better. This result represents an important advance in ASR technology for Central Kurdish.
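To make the WER figure concrete, here is a minimal sketch of how the metric is commonly computed: a word-level edit distance between a reference transcript and the system's output, divided by the reference length. The function name `word_error_rate` and the English sample sentences are illustrative assumptions, not taken from the paper, and the authors' evaluation may apply text normalization that this sketch omits.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missing
                curr[j - 1] + 1,     # insertion: extra hypothesis word
                prev[j - 1] + cost,  # substitution, or match when cost is 0
            )
        prev = curr
    return prev[-1] / len(ref)


# Placeholder sentences (hypothetical, English for readability).
reference = "the cat sat on the mat"
hypothesis = "the cat sit on mat"
print(f"WER: {word_error_rate(reference, hypothesis):.3f}")
```

In this example the hypothesis contains one substituted word and one dropped word against a six-word reference. Because insertions count as errors, WER can exceed 100% when the hypothesis is much longer than the reference.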

Technical Explanation

The researchers developed an automatic speech recognition (ASR) system for Central Kurdish, a low-resource language. They built the acoustic model with end-to-end transformers, so a single network maps input speech directly to text instead of relying on a pipeline of separately trained components.

To train the acoustic model, the team collected 224 hours of Central Kurdish speech from various sources. They then applied transfer learning, fine-tuning a pretrained model on this corpus to improve its performance on Central Kurdish.
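The idea behind transfer learning can be illustrated with a deliberately tiny sketch. It is a toy under stated assumptions, not the authors' training setup: a hypothetical "pretrained" encoder is kept frozen, and only a small new output layer is trained on a handful of target-task examples. All names (`TinyModel`, `fine_tune`, `pretrained_encoder`) and all data values are invented for illustration. Real speech models have far more parameters, and practitioners often update part or all of the pretrained network as well.

```python
from dataclasses import dataclass


@dataclass
class TinyModel:
    """Toy stand-in for a pretrained network plus a new output layer."""
    encoder: tuple[float, ...]  # "pretrained" weights, kept frozen
    head_w: float = 0.0         # new task-specific weight, trained
    head_b: float = 0.0         # new task-specific bias, trained

    def features(self, x):
        # Frozen feature extractor: a fixed linear projection of the input.
        return sum(w * xi for w, xi in zip(self.encoder, x))

    def predict(self, x):
        return self.head_w * self.features(x) + self.head_b


def mean_squared_error(model, data):
    return sum((model.predict(x) - y) ** 2 for x, y in data) / len(data)


def fine_tune(model, data, lr=0.5, epochs=200):
    """Gradient descent on the head only; encoder weights are never updated."""
    n = len(data)
    for _ in range(epochs):
        grad_w = grad_b = 0.0
        for x, y in data:
            err = model.predict(x) - y
            grad_w += 2 * err * model.features(x) / n
            grad_b += 2 * err / n
        model.head_w -= lr * grad_w
        model.head_b -= lr * grad_b
    return model


# Hypothetical "pretrained" encoder and a small target-task dataset.
pretrained_encoder = (0.5, -0.25)
small_dataset = [
    ((1.0, 0.0), 2.5),
    ((0.0, 1.0), 0.25),
    ((1.0, 1.0), 1.75),
    ((2.0, 1.0), 3.25),
]

model = TinyModel(encoder=pretrained_encoder)
loss_before = mean_squared_error(model, small_dataset)
fine_tune(model, small_dataset)
loss_after = mean_squared_error(model, small_dataset)
print(f"loss before: {loss_before:.4f}, after: {loss_after:.6f}")
```

Because the encoder already supplies useful features, only two parameters need to be learned from four examples. That is the appeal of the approach when target-language data is scarce.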

The resulting model achieved a state-of-the-art Word Error Rate (WER) of 13% on the Asosoft test set, a notable result for low-resource ASR. It represents a significant advance in ASR technology for Central Kurdish, one of the three main dialects of Kurdish, a language spoken by over 30 million people.

The researchers' approach of leveraging end-to-end transformers and transfer learning techniques to develop a robust ASR system for a low-resource language like Central Kurdish is an important contribution to the field of speech recognition.

Critical Analysis

The researchers have made a notable contribution by developing an ASR system for Central Kurdish, a low-resource language. However, the paper provides little detail on the specific challenges of working with this language or on the limitations of the current approach.

One potential area for improvement could be exploring ways to further enhance the robustness of the ASR system, such as by incorporating techniques to handle noise, accents, or other variations in the input speech. Additionally, the researchers could investigate the performance of the system on different types of speech data, such as spontaneous conversations or specialized domains, to assess its broader applicability.

It would also be interesting to see how the Central Kurdish ASR system compares to other low-resource language ASR systems, both in terms of performance and the techniques used. Providing such a comparative analysis could help contextualize the significance of the researchers' achievements and identify potential areas for further research and development.

Overall, the paper presents a valuable contribution to the field of ASR for low-resource languages, and the researchers' use of end-to-end transformers and transfer learning techniques is a promising approach that could be further explored and refined in future studies.

Conclusion

This research paper describes the development of an automatic speech recognition (ASR) system for the Central Kurdish language, which is a low-resource language. The researchers utilized end-to-end transformers and transfer learning techniques to build the acoustic model, and they collected a 224-hour speech corpus to train the system.

The resulting ASR model achieved a state-of-the-art Word Error Rate (WER) of 13% on the Asosoft test set, which represents a significant advancement in ASR technology for the Central Kurdish language. This work is an important contribution to the field of speech recognition, particularly in the context of low-resource language ASR.

The researchers' approach of leveraging end-to-end transformers and transfer learning techniques to develop a robust ASR system for a low-resource language like Central Kurdish could serve as a model for future efforts in speech recognition for other underserved languages.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!


Related Papers


Breaking Walls: Pioneering Automatic Speech Recognition for Central Kurdish: End-to-End Transformer Paradigm

Abdulhady Abas Abdullah, Hadi Veisi, Tarik Rashid

End-to-end transformer-based models epitomize the cutting edge in Automatic Speech Recognition (ASR) systems. Despite their substantial benefits, these models demand extensive training data to perform optimally, presenting a significant challenge for low-resource languages such as Central Kurdish. Addressing this issue requires innovative methods and techniques. This paper aims to develop an ASR system for Central Kurdish by collecting a robust corpus of speech, using an n-gram language model, and utilizing an external Kurdish tokenizer for refinement and integration techniques to enhance the model's performance. We collect a comprehensive 100-hour speech corpus from diverse sources. Additionally, we applied fine-tuning techniques to our speech corpus on Persian, English, and Arabic pre-trained models, specifically utilizing the xls-r-300m, xls-r-1b, and xls-r-2b Wav2vec 2.0 models. We also utilized 3-gram and 4-gram language models trained on a large text corpus of 300 million tokens. The fine-tuned xls-r-2b model, combined with a 3-gram language model and the external Kurdish tokenizer, achieved the best performance, yielding a Word Error Rate (WER) of 10.0% on the validation set and 11.8% on the Asosoft test set. The ASR model has demonstrated the advantages of having a large vocabulary compared to the existing Kurdish ASR models. Compared to other models, it produced more accurate results with a lower error rate.

9/10/2024


Central Kurdish Text-to-Speech Synthesis with Novel End-to-End Transformer Training

Hawraz A. Ahmad, Tarik A. Rashid

Recent advancements in text-to-speech (TTS) models have aimed to streamline the two-stage process into a single-stage training approach. However, many single-stage models still lag behind in audio quality, particularly when handling Kurdish text and speech. There is a critical need to enhance text-to-speech conversion for the Kurdish language, particularly for the Sorani dialect, which has been relatively neglected and is underrepresented in recent text-to-speech advancements. This study introduces an end-to-end TTS model for efficiently generating high-quality Kurdish audio. The proposed method leverages a variational autoencoder (VAE) that is pre-trained for audio waveform reconstruction and is augmented by adversarial training. This involves aligning the prior distribution established by the pre-trained encoder with the posterior distribution of the text encoder within latent variables. Additionally, a stochastic duration predictor is incorporated to imbue synthesized Kurdish speech with diverse rhythms. By aligning latent distributions and integrating the stochastic duration predictor, the proposed method facilitates the real-time generation of natural Kurdish speech audio, offering flexibility in pitches and rhythms. Empirical evaluation via the mean opinion score (MOS) on a custom dataset confirms the superior performance of our approach (MOS of 3.94) compared with that of a one-stage system and other two-staged systems as assessed through a subjective human evaluation.

8/9/2024

Error-preserving Automatic Speech Recognition of Young English Learners' Language

Janick Michot, Manuela Hurlimann, Jan Deriu, Luzia Sauer, Katsiaryna Mlynchyk, Mark Cieliebak

One of the central skills that language learners need to practice is speaking the language. Currently, students in school do not get enough speaking opportunities and lack conversational practice. Recent advances in speech technology and natural language processing allow for the creation of novel tools to practice their speaking skills. In this work, we tackle the first component of such a pipeline, namely, the automated speech recognition module (ASR), which faces a number of challenges: first, state-of-the-art ASR models are often trained on adult read-aloud data by native speakers and do not transfer well to young language learners' speech. Second, most ASR systems contain a powerful language model, which smooths out errors made by the speakers. To give corrective feedback, which is a crucial part of language learning, the ASR systems in our setting need to preserve the errors made by the language learners. In this work, we build an ASR system that satisfies these requirements: it works on spontaneous speech by young language learners and preserves their errors. For this, we collected a corpus containing around 85 hours of English audio spoken by learners in Switzerland from grades 4 to 6 on different language learning tasks, which we used to train an ASR model. Our experiments show that our model benefits from direct fine-tuning on children's voices and has a much higher error preservation rate than other models.

6/6/2024

Keyword-Guided Adaptation of Automatic Speech Recognition

Aviv Shamsian, Aviv Navon, Neta Glazer, Gill Hetz, Joseph Keshet

Automatic Speech Recognition (ASR) technology has made significant progress in recent years, providing accurate transcription across various domains. However, some challenges remain, especially in noisy environments and specialized jargon. In this paper, we propose a novel approach for improved jargon word recognition by contextual biasing Whisper-based models. We employ a keyword spotting model that leverages the Whisper encoder representation to dynamically generate prompts for guiding the decoder during the transcription process. We introduce two approaches to effectively steer the decoder towards these prompts: KG-Whisper, which is aimed at fine-tuning the Whisper decoder, and KG-Whisper-PT, which learns a prompt prefix. Our results show a significant improvement in the recognition accuracy of specified keywords and in reducing the overall word error rates. Specifically, in unseen language generalization, we demonstrate an average WER improvement of 5.1% over Whisper.

6/6/2024