Automatic speech recognition for the Nepali language using CNN, bidirectional LSTM and ResNet

arXiv:2406.17825

Published 6/27/2024 by Manish Dhakal, Arman Chhetri, Aman Kumar Gupta, Prabin Lamichhane, Suraj Pandey, Subarna Shakya

Abstract

This paper presents an end-to-end deep learning model for Automatic Speech Recognition (ASR) that transcribes Nepali speech to text. The model was trained and tested on the OpenSLR (audio, text) dataset. Most of the audio clips in the dataset have silent gaps at both ends, which are clipped during preprocessing for a more uniform mapping between audio frames and their corresponding text. Mel Frequency Cepstral Coefficients (MFCCs) are used as the audio features fed into the model. The model that pairs a bidirectional LSTM with ResNet and a one-dimensional CNN produces the best results on this dataset out of all the models trained so far (neural networks with variations of LSTM, GRU, CNN, and ResNet). This novel model uses the Connectionist Temporal Classification (CTC) function for loss calculation during training and CTC beam search decoding to predict the most likely sequence of Nepali characters. On the test dataset, it achieves a character error rate (CER) of 17.06 percent. The source code is available at: https://github.com/manishdhakal/ASR-Nepali-using-CNN-BiLSTM-ResNet.

Overview

This paper presents a deep learning model for Automatic Speech Recognition (ASR) that transcribes Nepali speech to text.
The model was trained and tested on the OpenSLR dataset, which contains audio and text data.
The paper describes the model architecture, training process, and evaluation results.

Plain English Explanation

This research describes a machine learning model that can automatically transcribe Nepali speech into written text. The model was developed and tested using a dataset of Nepali audio recordings and their corresponding text transcripts.

To prepare the dataset, the researchers removed any silent gaps at the beginning and end of the audio recordings. This helped create a more consistent mapping between the audio features and the text. The key audio features used were Mel Frequency Cepstral Coefficients (MFCCs), which are commonly used in speech recognition models.

The model architecture combines several powerful deep learning techniques, including Bidirectional Long Short-Term Memory (LSTM), Convolutional Neural Networks (CNNs), and Residual Networks (ResNets). This combination of techniques helps the model effectively process and understand the Nepali speech data.

The model was trained using a Connectionist Temporal Classification (CTC) loss function, which is well-suited for speech recognition tasks. During prediction, the model uses a CTC beam search algorithm to determine the most likely sequence of Nepali characters.

On a test dataset, the model achieved a character error rate (CER) of 17.06%, which is a promising result for this challenging task of Nepali speech recognition.

Technical Explanation

The researchers developed an end-to-end deep learning model for Automatic Speech Recognition (ASR) of Nepali speech. They used the OpenSLR dataset, which contains audio recordings and corresponding text transcripts in Nepali.

To preprocess the dataset, the researchers clipped any silent gaps at the beginning and end of the audio recordings. This helped create a more uniform mapping between the audio features and the text data. They used Mel Frequency Cepstral Coefficients (MFCCs) as the primary audio features to feed into the model.
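The silence-clipping step can be sketched as follows. This is a minimal illustration using only the Python standard library, not the authors' preprocessing code: the function name `trim_silence`, the amplitude threshold, and the frame length are all assumptions chosen for the example, and MFCC extraction (which would follow this step) is not shown.

```python
def trim_silence(samples, threshold=0.01, frame_len=400):
    """Drop leading and trailing frames whose mean absolute amplitude is below threshold.

    `threshold` and `frame_len` are illustrative values, not taken from the paper.
    """
    # Split the signal into non-overlapping frames.
    frames = [samples[i:i + frame_len] for i in range(0, len(samples), frame_len)]
    # A frame counts as voiced if its mean absolute amplitude reaches the threshold.
    voiced = [sum(abs(s) for s in frame) / len(frame) >= threshold for frame in frames]
    if not any(voiced):
        return []
    first = voiced.index(True)
    last = len(voiced) - 1 - voiced[::-1].index(True)
    # Keep everything from the first voiced frame through the last voiced frame.
    return samples[first * frame_len:(last + 1) * frame_len]


# Synthetic clip: 800 silent samples, 800 loud samples, 800 silent samples.
clip = [0.0] * 800 + [0.5, -0.5] * 400 + [0.0] * 800
trimmed = trim_silence(clip)
```

Only the edges are trimmed here; pauses inside the utterance are kept, which matches the summary's description of clipping gaps "at the beginning and end" of each recording.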

The model architecture combines several powerful deep learning techniques:

Bidirectional LSTM: This type of recurrent neural network can effectively capture long-range dependencies in the speech data.
Convolutional Neural Network (CNN): The CNN layers help extract local features from the audio data.
Residual Network (ResNet): The ResNet architecture allows for deeper neural networks that can learn more complex representations.
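To make the residual idea concrete, here is a toy sketch of a skip connection wrapped around two one-dimensional convolutions, written in plain Python over lists of floats. It is not the paper's architecture: the layer sizes, kernel, and block layout are assumptions for illustration, the kernel is assumed to have odd length, and the bidirectional LSTM is not covered.

```python
def conv1d(seq, kernel):
    """Same-length 1D convolution (cross-correlation) with zero padding."""
    pad = len(kernel) // 2  # assumes an odd-length kernel
    padded = [0.0] * pad + list(seq) + [0.0] * pad
    return [
        sum(k * padded[i + j] for j, k in enumerate(kernel))
        for i in range(len(seq))
    ]


def relu(seq):
    """Element-wise rectified linear unit."""
    return [max(0.0, v) for v in seq]


def residual_block(seq, kernel):
    """Compute relu(x + conv(relu(conv(x)))): the input is added back to the conv output."""
    out = conv1d(relu(conv1d(seq, kernel)), kernel)
    return relu([x + o for x, o in zip(seq, out)])


features = [1.0, 2.0, 3.0]
block_out = residual_block(features, [0.0, 1.0, 0.0])
```

Because the input is added back to the convolution output, a block whose weights are all zero still passes its input through, which is one common explanation for why residual networks remain trainable as they get deeper.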

The model was trained using Connectionist Temporal Classification (CTC) loss, which suits speech recognition because it does not require a frame-level alignment between the audio and its transcript. During prediction, the model uses a CTC beam search algorithm to determine the most likely sequence of Nepali characters.
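The decoding step can be illustrated with a small CTC prefix beam search. This is a generic textbook-style sketch without a language model, not the decoder from the authors' repository; the function name, the `beam_width` default, and the toy alphabet are assumptions. It takes per-frame probabilities (not log probabilities) and treats index 0 as the CTC blank.

```python
from collections import defaultdict


def ctc_beam_search(probs, alphabet, beam_width=3, blank=0):
    """Return the most probable collapsed string under CTC.

    probs: list of T rows, each a list of per-symbol probabilities.
    alphabet: alphabet[i] is the character for index i (the blank entry is unused).
    """
    # For each prefix, track (p_blank, p_non_blank): the probability of all
    # paths that collapse to the prefix and end in a blank / a non-blank symbol.
    beams = {(): (1.0, 0.0)}
    for row in probs:
        next_beams = defaultdict(lambda: [0.0, 0.0])
        for prefix, (p_b, p_nb) in beams.items():
            for idx, p in enumerate(row):
                if idx == blank:
                    # A blank leaves the prefix unchanged.
                    next_beams[prefix][0] += (p_b + p_nb) * p
                elif prefix and prefix[-1] == idx:
                    # Repeating the last symbol collapses into the same prefix...
                    next_beams[prefix][1] += p_nb * p
                    # ...unless a blank separated the two occurrences.
                    next_beams[prefix + (idx,)][1] += p_b * p
                else:
                    next_beams[prefix + (idx,)][1] += (p_b + p_nb) * p
        # Prune to the most probable prefixes.
        ranked = sorted(next_beams.items(), key=lambda kv: kv[1][0] + kv[1][1], reverse=True)
        beams = {prefix: (pb, pnb) for prefix, (pb, pnb) in ranked[:beam_width]}
    best = max(beams.items(), key=lambda kv: kv[1][0] + kv[1][1])[0]
    return "".join(alphabet[i] for i in best)


# Two frames, alphabet = blank plus "a". The single best path is blank-blank
# (0.36), but the paths collapsing to "a" sum to 0.24 + 0.24 + 0.16 = 0.64.
decoded = ctc_beam_search([[0.6, 0.4], [0.6, 0.4]], "_a")
```

The toy input shows why beam search can beat greedy decoding: picking the most likely symbol per frame yields an empty string, while summing over all paths that collapse to the same text favors "a".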

On the test dataset, the model achieved a character error rate (CER) of 17.06%, which is a promising result for this challenging task of Nepali speech recognition.
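For reference, CER is conventionally computed as the Levenshtein (edit) distance between the predicted and reference transcripts, divided by the reference length. The sketch below follows that standard definition; it is not taken from the paper's evaluation code, and the function names are illustrative. Note one assumption: Python iterates strings by Unicode code point, so Devanagari combining marks count as separate characters here, and this summary does not say how the paper counts them.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance: minimum insertions, deletions, and substitutions."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # delete r from the reference
                curr[j - 1] + 1,         # insert h into the reference
                prev[j - 1] + (r != h),  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


def cer(ref, hyp):
    """Character error rate: edit distance normalized by reference length."""
    if not ref:
        raise ValueError("reference transcript must be non-empty")
    return edit_distance(ref, hyp) / len(ref)


example = cer("abcd", "abed")  # one substitution over four characters
```

Under this definition, a CER of 17.06% means roughly 17 character edits are needed per 100 reference characters to turn the model output into the reference transcript.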

Critical Analysis

The researchers have made a valuable contribution to the field of Nepali speech recognition by developing a novel deep learning model and evaluating its performance on a real-world dataset. However, there are a few areas that could be further explored or addressed:

Dataset Size and Diversity: The OpenSLR dataset used in this study, while useful, may be relatively small and lack diversity in terms of speakers, recording conditions, and speech styles. Expanding the dataset or evaluating the model on additional Nepali speech datasets could help better assess its robustness and generalization capabilities.
Model Interpretability: The proposed model is a complex, end-to-end deep learning architecture. While such models can achieve impressive results, it can be difficult to interpret them and to understand the mechanisms behind their performance. Exploring techniques to improve the interpretability of the model could provide valuable insights for further model development and optimization.
Real-world Deployment Considerations: The paper focuses on the model's performance on the test dataset, but does not discuss potential challenges and considerations for deploying the model in real-world Nepali speech recognition applications. Factors such as computational efficiency, latency, and integration with other system components should be investigated to assess the model's practical viability.
Comparison to Other Nepali ASR Models: While the paper mentions that the proposed model outperforms other neural network architectures, it would be helpful to compare its performance to other published Nepali ASR models, either traditional or deep learning-based, to provide a more comprehensive evaluation of the model's strengths and weaknesses.

Overall, the researchers have presented a promising deep learning approach for Nepali speech recognition, but further research and real-world evaluation would be beneficial to fully understand the model's capabilities and limitations.

Conclusion

This research paper introduces a novel deep learning model for Automatic Speech Recognition (ASR) of Nepali speech. The model combines several state-of-the-art deep learning techniques, including Bidirectional LSTM, CNNs, and ResNets, to effectively process and transcribe Nepali speech.

The researchers demonstrated the model's performance on the OpenSLR dataset, achieving a character error rate (CER) of 17.06% on the test set. This is a promising result for the challenging task of Nepali speech recognition, which has received less attention than speech recognition for more widely spoken languages.

The development of this model represents an important step forward in enabling more accessible and accurate Nepali speech-to-text transcription, which could have significant applications in areas such as language preservation, educational resources, and voice-based interfaces for Nepali-speaking communities. Further research and real-world deployment of this technology could help bridge the gap in speech recognition capabilities across different languages and empower more people to interact with digital systems using their native tongue.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Artificial Neural Networks to Recognize Speakers Division from Continuous Bengali Speech

Hasmot Ali, Md. Fahad Hossain, Md. Mehedi Hasan, Sheikh Abujar, Sheak Rashed Haider Noori

Voice-based applications are ruling over the era of automation because speech has a lot of factors that determine a speaker's information as well as speech. Modern Automatic Speech Recognition (ASR) is a blessing in the field of Human-Computer Interaction (HCI) for efficient communication among humans and devices using Artificial Intelligence technology. Speech is one of the easiest mediums of communication because it has a lot of identical features for different speakers. Nowadays it is possible to determine speakers and their identity using their speech in terms of speaker recognition. In this paper, we presented a method that will provide a speaker's geographical identity in a certain region using continuous Bengali speech. We consider eight different divisions of Bangladesh as the geographical region. We applied the Mel Frequency Cepstral Coefficient (MFCC) and Delta features on an Artificial Neural Network to classify a speaker's division. We performed some preprocessing tasks like noise reduction and 8-10 second segmentation of raw audio before feature extraction. We used our dataset of more than 45 hours of audio data from 633 individual male and female speakers. We recorded the highest accuracy of 85.44%.

4/24/2024

eess.AS cs.HC cs.LG cs.SD

Attention Based Encoder Decoder Model for Video Captioning in Nepali (2023)

Kabita Parajuli, Shashidhar Ram Joshi

Video captioning in Nepali, a language written in the Devanagari script, presents a unique challenge due to the lack of existing academic work in this domain. This work develops a novel encoder-decoder paradigm for Nepali video captioning to tackle this difficulty. LSTM and GRU sequence-to-sequence models are used in the model to produce related textual descriptions based on features retrieved from video frames using CNNs. Using Google Translate and manual post-editing, a Nepali video captioning dataset is generated from the Microsoft Research Video Description Corpus (MSVD) dataset. The efficiency of the model for Devanagari-scripted video captioning is demonstrated by BLEU, METEOR, and ROUGE measures, which are used to assess its performance.

5/21/2024

cs.CV

Automatic Speech Recognition for Hindi

Anish Saha, A. G. Ramakrishnan

Automatic speech recognition (ASR) is a key area in computational linguistics, focusing on developing technologies that enable computers to convert spoken language into text. This field combines linguistics and machine learning. ASR models, which map speech audio to transcripts through supervised learning, require handling real and unrestricted text. Text-to-speech systems directly work with real text, while ASR systems rely on language models trained on large text corpora. High-quality transcribed data is essential for training predictive models. The research involved two main components: developing a web application and designing a web interface for speech recognition. The web application, created with JavaScript and Node.js, manages large volumes of audio files and their transcriptions, facilitating collaborative human correction of ASR transcripts. It operates in real-time using a client-server architecture. The web interface for speech recognition records 16 kHz mono audio from any device running the web app, performs voice activity detection (VAD), and sends the audio to the recognition engine. VAD detects human speech presence, aiding efficient speech processing and reducing unnecessary processing during non-speech intervals, thus saving computation and network bandwidth in VoIP applications. The final phase of the research tested a neural network for accurately aligning the speech signal to hidden Markov model (HMM) states. This included implementing a novel backpropagation method that utilizes prior statistics of node co-activations.

6/27/2024

cs.CL cs.SD eess.AS

Automatic Speech Recognition for Biomedical Data in Bengali Language

Shariar Kabir, Nazmun Nahar, Shyamasree Saha, Mamunur Rashid

This paper presents the development of a prototype Automatic Speech Recognition (ASR) system specifically designed for Bengali biomedical data. Recent advancements in Bengali ASR are encouraging, but a lack of domain-specific data limits the creation of practical healthcare ASR models. This project bridges this gap by developing an ASR system tailored for Bengali medical terms like symptoms, severity levels, and diseases, encompassing two major dialects: Bengali and Sylheti. We train and evaluate two popular ASR frameworks on a comprehensive 46-hour Bengali medical corpus. Our core objective is to create deployable health-domain ASR systems for digital health applications, ultimately increasing accessibility for non-technical users in the healthcare sector.

6/21/2024

eess.AS cs.CL cs.SD