Semantically Corrected Amharic Automatic Speech Recognition

2404.13362

Published 4/23/2024 by Samuael Adnew, Paul Pu Liang


Abstract

Automatic Speech Recognition (ASR) can play a crucial role in enhancing the accessibility of spoken languages worldwide. In this paper, we build a set of ASR tools for Amharic, a language spoken by more than 50 million people primarily in eastern Africa. Amharic is written in the Ge'ez script, a sequence of graphemes with spacings denoting word boundaries. This makes computational processing of Amharic challenging since the location of spacings can significantly impact the meaning of formed sentences. We find that existing benchmarks for Amharic ASR do not account for these spacings and only measure individual grapheme error rates, leading to significantly inflated measurements of in-the-wild performance. In this paper, we first release corrected transcriptions of existing Amharic ASR test datasets, enabling the community to accurately evaluate progress. Furthermore, we introduce a post-processing approach using a transformer encoder-decoder architecture to organize raw ASR outputs into a grammatically complete and semantically meaningful Amharic sentence. Through experiments on the corrected test dataset, our model enhances the semantic correctness of Amharic speech recognition systems, achieving a Character Error Rate (CER) of 5.5% and a Word Error Rate (WER) of 23.3%.


Overview

This paper focuses on developing automatic speech recognition (ASR) tools for the Amharic language, which is spoken by over 50 million people primarily in eastern Africa.
Amharic is written in the Ge'ez script, which uses a sequence of graphemes (written symbols) with spaces to denote word boundaries.
The authors find that existing Amharic ASR benchmarks ignore these spacings and measure only grapheme-level error rates, which significantly inflates estimates of in-the-wild performance.
To address this, the authors release corrected transcriptions of existing Amharic ASR test datasets and introduce a new post-processing approach to improve the semantic correctness of Amharic speech recognition.

Plain English Explanation

Automatic speech recognition (ASR) is a key technology that can make it easier for people to use spoken languages on computers and other devices. In this paper, the researchers focused on developing ASR tools for Amharic, a language spoken by over 50 million people in eastern Africa.

Amharic is written using a script called Ge'ez, as a sequence of graphemes (written symbols) with spaces marking where one word ends and the next begins. Where those spaces fall matters: moving a space regroups the same graphemes into different words and can change what the sentence means. This makes Amharic text challenging to process computationally.

The researchers found that existing benchmarks for evaluating Amharic ASR systems measured only grapheme-level errors and ignored spacing, so the reported scores overstated how well these systems perform in real-world use. To address this, the researchers first released corrected versions of the existing Amharic ASR test datasets, so that the community can more accurately evaluate progress in this area.

Furthermore, the researchers developed a new post-processing approach using a type of artificial intelligence model called a transformer encoder-decoder. This model takes the raw output of an Amharic ASR system and reorganizes it into a grammatically correct and semantically meaningful Amharic sentence, improving the overall quality of the speech recognition.

Technical Explanation

The authors first identify the challenges of developing ASR systems for Amharic, a language written in the Ge'ez script. Amharic text is a sequence of graphemes in which spaces denote word boundaries, and the location of those spaces can significantly change the meaning of the resulting sentence. An ASR output can therefore contain every grapheme correctly and still be wrong at the word level if the spaces are misplaced, which a grapheme-only error rate does not capture.

To address this, the authors first release corrected transcriptions of existing Amharic ASR test datasets. This enables the research community to more accurately evaluate the performance of Amharic ASR systems, as previous benchmarks did not properly account for the spacing issues.
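The gap between a spacing-blind grapheme error rate and a word-level error rate can be illustrated with a small sketch. This is not the authors' evaluation code: the function names are hypothetical, the metric definitions are the conventional edit-distance ones, and the strings are Latin-letter placeholders rather than real Amharic.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,           # deletion
                            curr[j - 1] + 1,       # insertion
                            prev[j - 1] + cost))   # substitution or match
        prev = curr
    return prev[-1]


def error_rate(ref_units, hyp_units):
    """Edit distance normalised by the reference length."""
    return edit_distance(ref_units, hyp_units) / len(ref_units)


def grapheme_error_rate_no_spaces(ref, hyp):
    """Grapheme-level error rate that ignores spacing (assumed benchmark style)."""
    return error_rate(ref.replace(" ", ""), hyp.replace(" ", ""))


def wer(ref, hyp):
    """Word error rate over whitespace-separated tokens."""
    return error_rate(ref.split(), hyp.split())


# Placeholder strings: same graphemes, different word boundaries.
reference = "ab cd ef"
hypothesis = "abc def"

print(grapheme_error_rate_no_spaces(reference, hypothesis))  # 0.0
print(wer(reference, hypothesis))                            # 1.0
```

In this toy case the spacing-blind metric reports a perfect transcription while every word is wrong, which is the kind of inflation the corrected transcriptions are meant to expose.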

The core technical contribution of the paper is a post-processing approach using a transformer encoder-decoder architecture. This model takes the raw output of an Amharic ASR system and reorganizes it into a grammatically complete and semantically meaningful Amharic sentence. Through experiments on the corrected Amharic ASR test dataset, the authors demonstrate that their model can enhance the overall quality of Amharic speech recognition, achieving a Character Error Rate (CER) of 5.5% and a Word Error Rate (WER) of 23.3%.
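For reference, both metrics are conventionally defined from an edit-distance alignment between the system output and the reference transcription. The notation below is the standard one and is an assumption here; the summary does not state the exact formulas the authors used.

```latex
\mathrm{CER} = \frac{S_c + D_c + I_c}{N_c},
\qquad
\mathrm{WER} = \frac{S_w + D_w + I_w}{N_w}
```

Here $S$, $D$, and $I$ count substitutions, deletions, and insertions at the character level (subscript $c$) or word level (subscript $w$), and $N$ is the number of characters or words in the reference. Because one misplaced space can turn two correct words into two wrong ones while changing a single character, a WER well above the CER (23.3% against 5.5%) is consistent with errors concentrated at word boundaries.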

Critical Analysis

The authors acknowledge several limitations and areas for further research in their work. First, while the corrected Amharic ASR test datasets enable more accurate performance evaluations, the authors note that these datasets are still relatively small in size. Expanding the available Amharic speech data would likely further improve the robustness and generalization of ASR systems for this language.

Additionally, the authors' post-processing approach focuses on reorganizing the raw ASR output into grammatically correct Amharic sentences. However, there may be opportunities to further enhance the semantic accuracy of the generated text, such as by incorporating more advanced natural language processing techniques.

It would also be valuable to evaluate the authors' post-processing approach on a broader range of Amharic ASR systems, beyond just the specific model used in their experiments. This could help determine the generalizability and versatility of their technique.

Overall, this paper makes an important contribution to the development of Amharic ASR systems, but there is still room for further research and refinement to fully unlock the potential of this technology for the millions of Amharic speakers worldwide.

Conclusion

This paper addresses the challenges of developing automatic speech recognition (ASR) systems for the Amharic language, which is spoken by over 50 million people primarily in eastern Africa. The authors find that existing Amharic ASR benchmarks measure only grapheme-level errors and ignore the spaces that mark word boundaries, leading to inflated performance measurements.

To address this, the authors first release corrected transcriptions of existing Amharic ASR test datasets, enabling the research community to more accurately evaluate progress in this area. They also introduce a novel post-processing approach using a transformer encoder-decoder model to reorganize raw ASR outputs into grammatically correct and semantically meaningful Amharic sentences.

Through experiments, the authors demonstrate that their post-processing model improves the semantic correctness of Amharic speech recognition, achieving a Character Error Rate of 5.5% and a Word Error Rate of 23.3% on the corrected test dataset. This work represents a step toward making ASR technology more accessible and useful for Amharic speakers worldwide.


Related Papers


Automatic Speech Recognition for Hindi

Anish Saha, A. G. Ramakrishnan

Automatic speech recognition (ASR) is a key area in computational linguistics, focusing on developing technologies that enable computers to convert spoken language into text. This field combines linguistics and machine learning. ASR models, which map speech audio to transcripts through supervised learning, require handling real and unrestricted text. Text-to-speech systems directly work with real text, while ASR systems rely on language models trained on large text corpora. High-quality transcribed data is essential for training predictive models. The research involved two main components: developing a web application and designing a web interface for speech recognition. The web application, created with JavaScript and Node.js, manages large volumes of audio files and their transcriptions, facilitating collaborative human correction of ASR transcripts. It operates in real-time using a client-server architecture. The web interface for speech recognition records 16 kHz mono audio from any device running the web app, performs voice activity detection (VAD), and sends the audio to the recognition engine. VAD detects human speech presence, aiding efficient speech processing and reducing unnecessary processing during non-speech intervals, thus saving computation and network bandwidth in VoIP applications. The final phase of the research tested a neural network for accurately aligning the speech signal to hidden Markov model (HMM) states. This included implementing a novel backpropagation method that utilizes prior statistics of node co-activations.

6/27/2024

cs.CL cs.SD eess.AS

Error-preserving Automatic Speech Recognition of Young English Learners' Language

Janick Michot, Manuela Hurlimann, Jan Deriu, Luzia Sauer, Katsiaryna Mlynchyk, Mark Cieliebak

One of the central skills that language learners need to practice is speaking the language. Currently, students in school do not get enough speaking opportunities and lack conversational practice. Recent advances in speech technology and natural language processing allow for the creation of novel tools to practice their speaking skills. In this work, we tackle the first component of such a pipeline, namely, the automated speech recognition module (ASR), which faces a number of challenges: first, state-of-the-art ASR models are often trained on adult read-aloud data by native speakers and do not transfer well to young language learners' speech. Second, most ASR systems contain a powerful language model, which smooths out errors made by the speakers. To give corrective feedback, which is a crucial part of language learning, the ASR systems in our setting need to preserve the errors made by the language learners. In this work, we build an ASR system that satisfies these requirements: it works on spontaneous speech by young language learners and preserves their errors. For this, we collected a corpus containing around 85 hours of English audio spoken by learners in Switzerland from grades 4 to 6 on different language learning tasks, which we used to train an ASR model. Our experiments show that our model benefits from direct fine-tuning on children's voices and has a much higher error preservation rate than other models.

6/6/2024

cs.CL cs.AI


Enabling ASR for Low-Resource Languages: A Comprehensive Dataset Creation Approach

Ara Yeroyan (Data Science Department, American University of Armenia), Nikolay Karpov (Nvidia, NeMo Conversational AI team)

In recent years, automatic speech recognition (ASR) systems have significantly improved, especially in languages with a vast amount of transcribed speech data. However, ASR systems tend to perform poorly for low-resource languages with fewer resources, such as minority and regional languages. This study introduces a novel pipeline designed to generate ASR training datasets from audiobooks, which typically feature a single transcript associated with hours-long audios. The common structure of these audiobooks poses a unique challenge due to the extensive length of audio segments, whereas optimal ASR training requires segments ranging from 4 to 15 seconds. To address this, we propose a method for effectively aligning audio with its corresponding text and segmenting it into lengths suitable for ASR training. Our approach simplifies data preparation for ASR systems in low-resource languages and demonstrates its application through a case study involving the Armenian language. Our method, which is portable to many low-resource languages, not only mitigates the issue of data scarcity but also enhances the performance of ASR models for underrepresented languages.

6/4/2024

cs.CL cs.LG eess.AS eess.SP

Listen Again and Choose the Right Answer: A New Paradigm for Automatic Speech Recognition with Large Language Models

Yuchen Hu, Chen Chen, Chengwei Qin, Qiushi Zhu, Eng Siong Chng, Ruizhe Li

Recent advances in large language models (LLMs) have promoted generative error correction (GER) for automatic speech recognition (ASR), which aims to predict the ground-truth transcription from the decoded N-best hypotheses. Thanks to the strong language generation ability of LLMs and rich information in the N-best list, GER shows great effectiveness in enhancing ASR results. However, it still suffers from two limitations: 1) LLMs are unaware of the source speech during GER, which may lead to results that are grammatically correct but violate the source speech content, 2) N-best hypotheses usually only vary in a few tokens, making it redundant to send all of them for GER, which could confuse LLM about which tokens to focus on and thus lead to increased miscorrection. In this paper, we propose ClozeGER, a new paradigm for ASR generative error correction. First, we introduce a multimodal LLM (i.e., SpeechGPT) to receive source speech as extra input to improve the fidelity of correction output. Then, we reformat GER as a cloze test with logits calibration to remove the input information redundancy and simplify GER with clear instructions. Experiments show that ClozeGER achieves a new breakthrough over vanilla GER on 9 popular ASR datasets.

5/17/2024

cs.CL cs.AI cs.LG cs.SD eess.AS