Artificial Neural Networks to Recognize Speakers Division from Continuous Bengali Speech

2404.15168

Published 4/24/2024 by Hasmot Ali, Md. Fahad Hossain, Md. Mehedi Hasan, Sheikh Abujar, Sheak Rashed Haider Noori

Abstract

Voice-based applications are ruling over the era of automation because speech has a lot of factors that determine a speaker's information as well as speech. Modern Automatic Speech Recognition (ASR) is a blessing in the field of Human-Computer Interaction (HCI) for efficient communication among humans and devices using Artificial Intelligence technology. Speech is one of the easiest mediums of communication because it has a lot of identical features for different speakers. Nowadays it is possible to determine speakers and their identity using their speech in terms of speaker recognition. In this paper, we present a method that provides a speaker's geographical identity in a certain region using continuous Bengali speech. We consider eight different divisions of Bangladesh as the geographical region. We applied the Mel Frequency Cepstral Coefficient (MFCC) and Delta features on an Artificial Neural Network to classify a speaker's division. We performed some preprocessing tasks like noise reduction and 8-10 second segmentation of raw audio before feature extraction. We used our dataset of more than 45 hours of audio data from 633 individual male and female speakers. We recorded the highest accuracy of 85.44%.

Overview

Speech recognition is a crucial technology for modern human-computer interaction (HCI).
This paper presents a method to determine a speaker's geographic identity from continuous Bengali speech.
The researchers used Mel Frequency Cepstral Coefficients (MFCC) and Delta features with an Artificial Neural Network to classify speakers by their division in Bangladesh.
They report a highest accuracy of 85.44% in classifying a speaker's division.

Plain English Explanation

Speech is one of the easiest and most natural ways for humans to communicate with each other and with machines. Modern Automatic Speech Recognition (ASR) systems have made great strides in allowing efficient communication between people and devices using artificial intelligence.

One interesting aspect of speech is that it can reveal information about the speaker, beyond just the words they're saying. For example, a person's geographic origin can often be inferred from their accent and speech patterns. In this research paper, the authors developed a method to identify which division of Bangladesh a Bengali speaker is from, based on their continuous speech.

The researchers used signal processing techniques like Mel Frequency Cepstral Coefficients (MFCC) to extract features from the audio data. They then fed these features into an Artificial Neural Network, which was trained to classify the speakers into one of eight different geographic regions of Bangladesh.

After reducing noise in the audio and segmenting it into 8-10 second clips, the researchers report a highest accuracy of 85.44% in identifying a speaker's division. This suggests that speech analysis can reveal aspects of a person's background and identity, not only what they say.

Technical Explanation

The researchers in this paper developed a method to determine the geographic identity of Bengali speakers based on their continuous speech. They focused on identifying speakers from the eight different administrative divisions of Bangladesh.

First, the researchers collected over 45 hours of audio data from 633 individual male and female Bengali speakers. They performed some preprocessing steps on the raw audio, including noise reduction and segmentation into 8-10 second clips.
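The segmentation step can be illustrated with a minimal sketch. The summary only states that clips were 8-10 seconds long, so everything else here is an assumption: the 16 kHz sample rate, the hypothetical function name `segment_bounds`, and the policy of cutting fixed 10-second clips and keeping a trailing remainder only if it lasts at least 8 seconds. It is not the authors' procedure.

```python
def segment_bounds(n_samples, sample_rate, min_sec=8.0, max_sec=10.0):
    """Return (start, end) sample indices for consecutive clips.

    Assumed policy (not from the paper): cut clips of max_sec seconds and
    keep the final remainder only if it is at least min_sec seconds long.
    """
    max_len = int(max_sec * sample_rate)
    min_len = int(min_sec * sample_rate)
    bounds = []
    start = 0
    # Stop as soon as what is left is shorter than the minimum clip length.
    while n_samples - start >= min_len:
        end = min(start + max_len, n_samples)
        bounds.append((start, end))
        start = end
    return bounds


# A 29-second recording at an assumed 16 kHz yields clips of 10 s, 10 s and 9 s.
clips = segment_bounds(29 * 16000, 16000)
```

With this policy a 25-second recording would produce two clips, and the final 5 seconds would be discarded because they fall below the 8-second minimum.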

Next, they extracted Mel Frequency Cepstral Coefficient (MFCC) and Delta features from the audio data. MFCCs are a widely used feature representation in automatic speech recognition (ASR) systems; they summarize the short-term spectral envelope of speech on the perceptually motivated mel frequency scale. Delta features represent the frame-to-frame change in the MFCCs over time, providing additional information about speech dynamics.
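To make the notion of Delta features concrete, the sketch below computes them with the common regression formula, where the delta at frame t is the sum over n = 1..N of n * (c[t+n] - c[t-n]), divided by 2 * (1^2 + ... + N^2). The window width N = 2, the edge handling (repeating the first and last frame), and the function name `delta` are assumptions for illustration; the summary does not say which variant the authors used. The input is assumed to be a list of per-frame MFCC vectors computed elsewhere.

```python
def delta(frames, width=2):
    """Regression-based delta for a list of per-frame coefficient lists."""
    n_frames = len(frames)
    denom = 2 * sum(n * n for n in range(1, width + 1))
    out = []
    for t in range(n_frames):
        row = []
        for k in range(len(frames[t])):
            acc = 0.0
            for n in range(1, width + 1):
                # Clamp indices so edge frames reuse the first/last frame.
                nxt = frames[min(t + n, n_frames - 1)][k]
                prv = frames[max(t - n, 0)][k]
                acc += n * (nxt - prv)
            row.append(acc / denom)
        out.append(row)
    return out


# A coefficient that rises by 1 per frame has a delta of 1 away from the edges.
ramp = [[0.0], [1.0], [2.0], [3.0], [4.0]]
ramp_delta = delta(ramp)

# Static and dynamic features are typically concatenated frame by frame.
stacked = [c + d for c, d in zip(ramp, ramp_delta)]
```

Concatenating each frame's MFCCs with their deltas doubles the feature dimension, which is one common way to give a classifier access to both the spectral shape and how it is changing.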

The researchers then fed these acoustic features into an Artificial Neural Network (ANN) model, training it to classify each speaker into one of the eight divisions of Bangladesh. The highest accuracy they report is 85.44%.
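The classification step can be pictured as a small feed-forward network ending in an eight-way softmax, one output per division. The sketch below shows only the forward pass with untrained random weights. The layer sizes, the ReLU activation, the 26-dimensional input (for example 13 MFCCs plus 13 deltas), and all function names are assumptions for illustration, since the summary does not describe the authors' architecture. The division names are the eight administrative divisions of Bangladesh.

```python
import math
import random

DIVISIONS = ["Barishal", "Chattogram", "Dhaka", "Khulna",
             "Mymensingh", "Rajshahi", "Rangpur", "Sylhet"]


def dense(x, weights, biases):
    """Fully connected layer: one weight row and one bias per output unit."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, biases)]


def relu(v):
    return [max(0.0, a) for a in v]


def softmax(v):
    """Numerically stable softmax (subtract the max before exponentiating)."""
    m = max(v)
    exps = [math.exp(a - m) for a in v]
    total = sum(exps)
    return [e / total for e in exps]


def init_layer(n_in, n_out, rng):
    """Small random weights and zero biases (untrained, illustration only)."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)]
               for _ in range(n_out)]
    return weights, [0.0] * n_out


def predict(x, layers):
    """Run hidden layers with ReLU, then a softmax output over divisions."""
    h = x
    for weights, biases in layers[:-1]:
        h = relu(dense(h, weights, biases))
    weights, biases = layers[-1]
    probs = softmax(dense(h, weights, biases))
    best = max(range(len(probs)), key=probs.__getitem__)
    return DIVISIONS[best], probs
```

A hypothetical call such as `predict(features, [init_layer(26, 16, rng), init_layer(16, 8, rng)])` with `rng = random.Random(0)` returns a division name and eight probabilities that sum to one. The prediction is meaningless until the weights are trained on labeled clips.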

Critical Analysis

The researchers provide a thorough explanation of their methodology and present promising results in terms of the high accuracy achieved. However, there are a few potential limitations and areas for further exploration:

The study was limited to a specific language (Bengali) and geographic region (Bangladesh). It would be valuable to investigate whether a similar approach could be generalized to other languages and regions, or if there are unique characteristics of Bengali speech that enabled this high level of accuracy.
The dataset, while substantial in size, may not fully represent the diversity of speakers across Bangladesh. Collecting a larger and more diverse dataset could help validate the robustness of the model's performance.
The researchers did not provide much insight into the specific speech features or patterns that were most indicative of a speaker's geographic origin. Exploring these linguistic cues could yield interesting insights and potentially inform the development of more effective speaker recognition systems.
While the 85.44% accuracy is impressive, there is still room for improvement. Investigating alternative model architectures, feature engineering techniques, or data augmentation strategies could potentially further enhance the system's performance.

Overall, this research demonstrates the potential of using speech analysis to infer speakers' geographic identities, with promising implications for various applications in human-computer interaction and speaker recognition. However, as with any machine learning-based system, continued refinement and validation will be important to ensure the reliability and real-world applicability of this approach.

Conclusion

This paper presents a method for determining a speaker's geographic identity based on their continuous Bengali speech. By extracting Mel Frequency Cepstral Coefficients (MFCC) and Delta features from the audio data and feeding them into an Artificial Neural Network, the researchers report a highest accuracy of 85.44% in classifying speakers into one of the eight divisions of Bangladesh.

The findings of this study highlight the potential of speech analysis to reveal aspects of a speaker's background and identity. This technology could have applications in areas like human-computer interaction, speaker recognition, and sociolinguistic research.

While the results are promising, further research is needed to explore the generalizability of this approach to other languages and regions, as well as to investigate the specific speech features that are most indicative of geographic identity. Continued refinement and validation of this technology will be crucial to ensure its reliability and real-world applicability.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Automatic Speech Recognition for Hindi

Anish Saha, A. G. Ramakrishnan

Automatic speech recognition (ASR) is a key area in computational linguistics, focusing on developing technologies that enable computers to convert spoken language into text. This field combines linguistics and machine learning. ASR models, which map speech audio to transcripts through supervised learning, require handling real and unrestricted text. Text-to-speech systems directly work with real text, while ASR systems rely on language models trained on large text corpora. High-quality transcribed data is essential for training predictive models. The research involved two main components: developing a web application and designing a web interface for speech recognition. The web application, created with JavaScript and Node.js, manages large volumes of audio files and their transcriptions, facilitating collaborative human correction of ASR transcripts. It operates in real-time using a client-server architecture. The web interface for speech recognition records 16 kHz mono audio from any device running the web app, performs voice activity detection (VAD), and sends the audio to the recognition engine. VAD detects human speech presence, aiding efficient speech processing and reducing unnecessary processing during non-speech intervals, thus saving computation and network bandwidth in VoIP applications. The final phase of the research tested a neural network for accurately aligning the speech signal to hidden Markov model (HMM) states. This included implementing a novel backpropagation method that utilizes prior statistics of node co-activations.

6/27/2024

cs.CL cs.SD eess.AS

Automatic Speech Recognition for Biomedical Data in Bengali Language

Shariar Kabir, Nazmun Nahar, Shyamasree Saha, Mamunur Rashid

This paper presents the development of a prototype Automatic Speech Recognition (ASR) system specifically designed for Bengali biomedical data. Recent advancements in Bengali ASR are encouraging, but a lack of domain-specific data limits the creation of practical healthcare ASR models. This project bridges this gap by developing an ASR system tailored for Bengali medical terms like symptoms, severity levels, and diseases, encompassing two major dialects: Bengali and Sylheti. We train and evaluate two popular ASR frameworks on a comprehensive 46-hour Bengali medical corpus. Our core objective is to create deployable health-domain ASR systems for digital health applications, ultimately increasing accessibility for non-technical users in the healthcare sector.

6/21/2024

eess.AS cs.CL cs.SD

Automatic speech recognition for the Nepali language using CNN, bidirectional LSTM and ResNet

Manish Dhakal, Arman Chhetri, Aman Kumar Gupta, Prabin Lamichhane, Suraj Pandey, Subarna Shakya

This paper presents an end-to-end deep learning model for Automatic Speech Recognition (ASR) that transcribes Nepali speech to text. The model was trained and tested on the OpenSLR (audio, text) dataset. Most of the audio clips have silent gaps at both ends, which are clipped during dataset preprocessing for a more uniform mapping of audio frames and their corresponding texts. Mel Frequency Cepstral Coefficients (MFCCs) are used as audio features to feed into the model. The model that pairs a Bidirectional LSTM with ResNet and a one-dimensional CNN produces the best results for this dataset out of all the models (neural networks with variations of LSTM, GRU, CNN, and ResNet) that have been trained so far. This novel model uses the Connectionist Temporal Classification (CTC) function for loss calculation during training and CTC beam search decoding for predicting characters as the most likely sequence of Nepali text. On the test dataset, a character error rate (CER) of 17.06 percent has been achieved. The source code is available at: https://github.com/manishdhakal/ASR-Nepali-using-CNN-BiLSTM-ResNet.

6/27/2024

cs.CL cs.SD eess.AS

System Description for the Displace Speaker Diarization Challenge 2023

Ali Aliyev

This paper describes our solution for the Diarization of Speaker and Language in Conversational Environments Challenge (Displace 2023). We used a combination of VAD for finding segments with speech, a ResNet-based CNN for feature extraction from these segments, and spectral clustering for clustering the features. Even though it was not trained on Hindi, the described algorithm achieves DER 27.1% and DER 27.4% on the development and phase-1 evaluation parts of the dataset, respectively.

6/26/2024

cs.CL cs.SD eess.AS