Mai Ho'om=auna i ka 'Ai: Language Models Improve Automatic Speech Recognition in Hawaiian

2404.03073

Published 4/5/2024 by Kaavya Chaparala, Guido Zarrella, Bruce Torres Fischer, Larry Kimura, Oiwi Parker Jones

Mai Ho'om=auna i ka 'Ai: Language Models Improve Automatic Speech Recognition in Hawaiian

Abstract

In this paper we address the challenge of improving Automatic Speech Recognition (ASR) for a low-resource language, Hawaiian, by incorporating large amounts of independent text data into an ASR foundation model, Whisper. To do this, we train an external language model (LM) on ~1.5M words of Hawaiian text. We then use the LM to rescore Whisper and compute word error rates (WERs) on a manually curated test set of labeled Hawaiian data. As a baseline, we use Whisper without an external LM. Experimental results reveal a small but significant improvement in WER when ASR outputs are rescored with a Hawaiian LM. The results support leveraging all available data in the development of ASR systems for underrepresented languages.

Create account to get full access

Overview

This research paper explores how language models can improve automatic speech recognition (ASR) for the Hawaiian language.
The researchers developed Hawaiian language models (LMs) and integrated them into an ASR system to evaluate their impact.
The findings show that the Hawaiian LMs significantly improved the accuracy of the ASR system compared to using only acoustic models.

Plain English Explanation

The Hawaiian language is an endangered language, and developing effective speech recognition technology for it is crucial for preserving and revitalizing the language. This research aimed to improve the accuracy of automatic speech recognition (ASR) for Hawaiian by incorporating language models (LMs) into the ASR system.

Language models are AI systems that can predict the likelihood of a sequence of words, based on patterns in large datasets of text. By integrating these language models with the acoustic models that recognize speech sounds, the researchers were able to significantly improve the overall accuracy of the Hawaiian ASR system.

The key insight is that language models can provide important contextual information to the ASR system, helping it better understand and transcribe the Hawaiian speech. For example, the language model can learn that certain word combinations are more common in Hawaiian, and use that knowledge to correct mistakes the acoustic model might make.

This research builds on previous work exploring how large language models can be used to improve spoken language understanding and how universal language models can be adapted for diverse speech tasks. The successful integration of Hawaiian-specific language models demonstrates the potential for this approach to benefit other low-resource languages as well.

Technical Explanation

The researchers developed two Hawaiian language models (LMs) - a n-gram LM and a transformer-based LM - and integrated them into an end-to-end ASR system. The n-gram LM was trained on a corpus of Hawaiian text, while the transformer LM was further pre-trained on the MALA-500 dataset, a large multilingual text corpus that includes Hawaiian.

These Hawaiian LMs were then combined with acoustic models to create the final ASR system. The researchers evaluated the performance of this hybrid ASR system on a Hawaiian speech corpus, comparing it to a baseline system that used only acoustic models.

The results showed that incorporating the Hawaiian LMs led to significant improvements in word error rate (WER) - a key metric for ASR accuracy. The n-gram LM reduced the WER by 16.3% relative, while the transformer LM achieved a 20.1% relative reduction in WER.

This demonstrates the power of language models to enhance spoken language understanding for low-resource languages like Hawaiian. By learning the patterns and structure of the language from text data, the LMs were able to provide crucial contextual information to the ASR system, helping it transcribe the speech more accurately.

Critical Analysis

The researchers acknowledge several limitations of their work. First, the Hawaiian text corpora used to train the LMs were relatively small, which may have constrained the models' performance. Expanding the available Hawaiian text data could potentially further improve the LMs and the overall ASR accuracy.

Additionally, the researchers did not explore the use of multilingual LMs that could potentially leverage similarities between Hawaiian and other Polynesian languages. Incorporating such cross-lingual knowledge could lead to additional performance gains.

It is also worth noting that the evaluation was conducted on a controlled dataset, and the real-world performance of the ASR system may differ when faced with more diverse or noisy speech samples. Further testing in more realistic scenarios would help validate the practical applicability of this approach.

Overall, this research represents an important step forward in developing effective speech recognition technology for endangered languages like Hawaiian. By demonstrating the value of incorporating language models, the study paves the way for similar techniques to be applied to other low-resource languages, ultimately supporting their preservation and revitalization.

Conclusion

This paper presents a novel approach to improving automatic speech recognition (ASR) for the Hawaiian language by integrating language models (LMs) into the ASR system. The findings show that the Hawaiian LMs, including both n-gram and transformer-based models, significantly enhance the accuracy of the ASR system compared to using only acoustic models.

The success of this approach highlights the potential for language models to play a crucial role in enabling effective speech recognition for endangered and under-resourced languages. By leveraging the patterns and structures learned from text data, the LMs can provide valuable contextual information to the ASR system, resulting in more accurate transcriptions.

This research builds on a growing body of work exploring how large language models can be adapted and applied to various spoken language understanding tasks, and demonstrates the potential for this technique to benefit other low-resource languages beyond Hawaiian. As the field of speech recognition continues to evolve, integrating language models will likely become an increasingly important strategy for developing robust and inclusive systems that can support the preservation and revitalization of endangered languages worldwide.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

MaLa-ASR: Multimedia-Assisted LLM-Based ASR

Guanrou Yang, Ziyang Ma, Fan Yu, Zhifu Gao, Shiliang Zhang, Xie Chen

As more and more information-rich data like video become available, utilizing multi-modal auxiliary information to enhance audio tasks has sparked widespread research interest. The recent surge in research on LLM-based audio models provides fresh perspectives for tackling audio tasks. Given that LLM can flexibly ingest multiple inputs, we propose MaLa-ASR, an LLM-based ASR model that can integrate textual keywords extracted from presentation slides to improve recognition of conference content. MaLa-ASR yields average WERs of 9.4% and 11.7% on the L95 and S95 subsets of the SlideSpeech corpus, representing a significant relative WER drop of 27.9% and 44.7% over the baseline model reported in SlideSpeech. MaLa-ASR underscores LLM's strong performance in speech tasks and the capability to integrate auxiliary information conveniently. By adding keywords to the input prompt, the biased word error rate (B-WER) reduces relatively by 46.0% and 44.2%, establishing a new SOTA on this dataset.

6/14/2024

eess.AS cs.AI

LoRA-Whisper: Parameter-Efficient and Extensible Multilingual ASR

Zheshu Song, Jianheng Zhuo, Yifan Yang, Ziyang Ma, Shixiong Zhang, Xie Chen

Recent years have witnessed significant progress in multilingual automatic speech recognition (ASR), driven by the emergence of end-to-end (E2E) models and the scaling of multilingual datasets. Despite that, two main challenges persist in multilingual ASR: language interference and the incorporation of new languages without degrading the performance of the existing ones. This paper proposes LoRA-Whisper, which incorporates LoRA matrix into Whisper for multilingual ASR, effectively mitigating language interference. Furthermore, by leveraging LoRA and the similarities between languages, we can achieve better performance on new languages while upholding consistent performance on original ones. Experiments on a real-world task across eight languages demonstrate that our proposed LoRA-Whisper yields a relative gain of 18.5% and 23.0% over the baseline system for multilingual ASR and language expansion respectively.

6/12/2024

eess.AS cs.AI cs.CL

Efficient Compression of Multitask Multilingual Speech Models

Thomas Palmeira Ferraz

Whisper is a multitask and multilingual speech model covering 99 languages. It yields commendable automatic speech recognition (ASR) results in a subset of its covered languages, but the model still underperforms on a non-negligible number of under-represented languages, a problem exacerbated in smaller model versions. In this work, we examine its limitations, demonstrating the presence of speaker-related (gender, age) and model-related (resourcefulness and model size) bias. Despite that, we show that only model-related bias are amplified by quantization, impacting more low-resource languages and smaller models. Searching for a better compression approach, we propose DistilWhisper, an approach that is able to bridge the performance gap in ASR for these languages while retaining the advantages of multitask and multilingual capabilities. Our approach involves two key strategies: lightweight modular ASR fine-tuning of whisper-small using language-specific experts, and knowledge distillation from whisper-large-v2. This dual approach allows us to effectively boost ASR performance while keeping the robustness inherited from the multitask and multilingual pre-training. Results demonstrate that our approach is more effective than standard fine-tuning or LoRA adapters, boosting performance in the targeted languages for both in- and out-of-domain test sets, while introducing only a negligible parameter overhead at inference.

5/3/2024

cs.CL cs.AI cs.SD eess.AS

Error-preserving Automatic Speech Recognition of Young English Learners' Language

Janick Michot, Manuela Hurlimann, Jan Deriu, Luzia Sauer, Katsiaryna Mlynchyk, Mark Cieliebak

One of the central skills that language learners need to practice is speaking the language. Currently, students in school do not get enough speaking opportunities and lack conversational practice. Recent advances in speech technology and natural language processing allow for the creation of novel tools to practice their speaking skills. In this work, we tackle the first component of such a pipeline, namely, the automated speech recognition module (ASR), which faces a number of challenges: first, state-of-the-art ASR models are often trained on adult read-aloud data by native speakers and do not transfer well to young language learners' speech. Second, most ASR systems contain a powerful language model, which smooths out errors made by the speakers. To give corrective feedback, which is a crucial part of language learning, the ASR systems in our setting need to preserve the errors made by the language learners. In this work, we build an ASR system that satisfies these requirements: it works on spontaneous speech by young language learners and preserves their errors. For this, we collected a corpus containing around 85 hours of English audio spoken by learners in Switzerland from grades 4 to 6 on different language learning tasks, which we used to train an ASR model. Our experiments show that our model benefits from direct fine-tuning on children's voices and has a much higher error preservation rate than other models.

6/6/2024

cs.CL cs.AI