PhoWhisper: Automatic Speech Recognition for Vietnamese

Read original: arXiv:2406.02555 - Published 6/6/2024 by Thanh-Thien Le, Linh The Nguyen, Dat Quoc Nguyen

Overview

Researchers introduce PhoWhisper, a set of five automatic speech recognition (ASR) models for Vietnamese.
PhoWhisper is built by fine-tuning the Whisper model on an 844-hour dataset covering diverse Vietnamese accents.
The paper reports state-of-the-art performance on benchmark Vietnamese ASR datasets.

Plain English Explanation

Automatic speech recognition (ASR) is the process of converting spoken language into text. Researchers have developed a new Vietnamese ASR model called PhoWhisper that aims to be highly accurate and robust across different Vietnamese accents and dialects.

PhoWhisper is based on the popular Whisper speech recognition model, which was originally trained on a large, multilingual dataset. To create PhoWhisper, the researchers "fine-tuned" the Whisper model by training it further on an 844-hour dataset of Vietnamese speech, encompassing a variety of regional accents. This allows PhoWhisper to better recognize the unique characteristics of the Vietnamese language.

The researchers tested PhoWhisper on standard Vietnamese ASR benchmark datasets and claim it outperforms other state-of-the-art models. This suggests PhoWhisper could be a valuable tool for a range of Vietnamese language applications, from voice assistants to transcription services.

Technical Explanation

The researchers introduce five versions of their PhoWhisper model for Vietnamese automatic speech recognition. The core of PhoWhisper is the Whisper model, a large, multilingual speech recognition system developed by OpenAI. To create PhoWhisper, the researchers fine-tuned the Whisper model on an 844-hour dataset of Vietnamese speech that encompasses diverse accents.
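For readers who want to try the released models, the following is a minimal sketch of transcribing a file with the Hugging Face transformers library. The model IDs and the size names (tiny through large) are assumptions based on how the project is distributed and are not stated in this summary; the function name and audio path are hypothetical. Check the PhoWhisper repository for the authoritative instructions.

```python
# Assumed Hugging Face model IDs, one per PhoWhisper size.
# These names are not given in the summary; verify them against the repository.
PHOWHISPER_MODELS = {
    "tiny": "vinai/PhoWhisper-tiny",
    "base": "vinai/PhoWhisper-base",
    "small": "vinai/PhoWhisper-small",
    "medium": "vinai/PhoWhisper-medium",
    "large": "vinai/PhoWhisper-large",
}


def transcribe(audio_path: str, size: str = "small") -> str:
    """Transcribe one audio file with the chosen PhoWhisper checkpoint."""
    # Imported lazily so this module loads even without transformers installed.
    from transformers import pipeline

    # Downloads the checkpoint on first use.
    asr = pipeline("automatic-speech-recognition", model=PHOWHISPER_MODELS[size])
    return asr(audio_path)["text"]
```

Calling transcribe("sample.wav") with a real audio file would return the recognized Vietnamese text. Smaller sizes trade accuracy for speed, so the size argument is the main knob to adjust.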

Through this fine-tuning process, PhoWhisper is able to better recognize the phonetic and linguistic properties of the Vietnamese language. The researchers evaluated PhoWhisper on benchmark Vietnamese ASR datasets and report state-of-the-art results.
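ASR benchmarks are usually scored with word error rate (WER): the number of word substitutions, deletions, and insertions needed to turn the model output into the reference transcript, divided by the number of reference words. The sketch below is a plain illustration of that metric, not the paper's evaluation script. The function names and example sentences are made up for illustration, and real evaluations typically normalize text (case, punctuation) first.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] is the edit distance between the reference words seen so far
    # (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missing from output
                curr[j - 1] + 1,     # insertion: extra word in output
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(baseline_wer: float, new_wer: float) -> float:
    """Relative WER reduction in percent."""
    return (baseline_wer - new_wer) / baseline_wer * 100


# One wrong word out of five reference words.
print(word_error_rate("tôi đang học tiếng việt", "tôi đang học tiếng anh"))  # 0.2
```

The same arithmetic explains relative improvements quoted in ASR papers. For instance, a drop from 51.8% to 29.6% WER, as in the VietMed abstract listed under Related Papers, gives relative_reduction(51.8, 29.6) of about 42.9%.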

The paper's abstract does not describe model compression or keyword-guided adaptation; these are related lines of work on Whisper rather than techniques attributed to PhoWhisper. Likewise, Whispy and Mai Hoʻomana i ka AI are separate Whisper-related efforts, not Vietnamese ASR baselines reported in the abstract.

Critical Analysis

The researchers provide a thorough evaluation of PhoWhisper's performance on standard Vietnamese ASR benchmarks, demonstrating its state-of-the-art capabilities. However, the paper does not delve into potential limitations or real-world deployment challenges.

For example, the researchers do not discuss how PhoWhisper would perform on highly colloquial, spontaneous Vietnamese speech, or in noisy real-world environments. Additionally, the paper does not address potential biases in the training data, which could lead to disparities in PhoWhisper's performance across different demographic groups.

Further research could also explore the computational efficiency and latency of the PhoWhisper models, as these factors are crucial for practical applications like voice assistants. Lastly, the researchers could collaborate with Vietnamese language experts to gain deeper insights into the linguistic nuances captured by the PhoWhisper models.
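One common way to quantify the latency question raised above is the real-time factor (RTF): processing time divided by audio duration, where values below 1 mean the system runs faster than real time. The following is a minimal sketch under stated assumptions. The transcribe argument stands in for any ASR callable, and fake_transcribe and the file name are placeholders rather than real PhoWhisper calls.

```python
import time
from typing import Callable


def real_time_factor(
    transcribe: Callable[[str], str], audio_path: str, audio_seconds: float
) -> float:
    """Return processing time divided by audio duration for one file."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    start = time.perf_counter()
    transcribe(audio_path)
    elapsed = time.perf_counter() - start
    return elapsed / audio_seconds


def fake_transcribe(path: str) -> str:
    """Placeholder model: sleeps briefly instead of running real inference."""
    time.sleep(0.01)
    return "xin chào"


# Hypothetical 10-second clip; swap in a real model call and duration to measure.
rtf = real_time_factor(fake_transcribe, "sample.wav", 10.0)
print(f"RTF: {rtf:.4f}")
```

Averaging RTF over many files, and measuring each model size separately, would give a fairer picture than a single run.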

Conclusion

In summary, the researchers have introduced PhoWhisper, a set of high-performing Vietnamese automatic speech recognition models built by fine-tuning the Whisper model on a large, diverse dataset of Vietnamese speech. PhoWhisper demonstrates state-of-the-art results on standard benchmarks, suggesting it could be a valuable tool for a range of Vietnamese language applications.

While the technical details and evaluation are impressive, the paper could be strengthened by addressing potential limitations and avenues for future research. Nonetheless, the development of PhoWhisper represents an important step towards more inclusive and effective speech recognition technology for the Vietnamese language.

Related Papers

PhoWhisper: Automatic Speech Recognition for Vietnamese

Thanh-Thien Le, Linh The Nguyen, Dat Quoc Nguyen

We introduce PhoWhisper in five versions for Vietnamese automatic speech recognition. PhoWhisper's robustness is achieved through fine-tuning the Whisper model on an 844-hour dataset that encompasses diverse Vietnamese accents. Our experimental study demonstrates state-of-the-art performances of PhoWhisper on benchmark Vietnamese ASR datasets. We have open-sourced PhoWhisper at: https://github.com/VinAIResearch/PhoWhisper

6/6/2024

VietMed: A Dataset and Benchmark for Automatic Speech Recognition of Vietnamese in the Medical Domain

Khai Le-Duc

Due to privacy restrictions, there's a shortage of publicly available speech recognition datasets in the medical domain. In this work, we present VietMed - a Vietnamese speech recognition dataset in the medical domain comprising 16h of labeled medical speech, 1000h of unlabeled medical speech and 1200h of unlabeled general-domain speech. To our best knowledge, VietMed is by far the world's largest public medical speech recognition dataset in 7 aspects: total duration, number of speakers, diseases, recording conditions, speaker roles, unique medical terms and accents. VietMed is also by far the largest public Vietnamese speech dataset in terms of total duration. Additionally, we are the first to present a medical ASR dataset covering all ICD-10 disease groups and all accents within a country. Moreover, we release the first public large-scale pre-trained models for Vietnamese ASR, w2v2-Viet and XLSR-53-Viet, along with the first public large-scale fine-tuned models for medical ASR. Even without any medical data in unsupervised pre-training, our best pre-trained model XLSR-53-Viet generalizes very well to the medical domain by outperforming state-of-the-art XLSR-53, from 51.8% to 29.6% WER on test set (a relative reduction of more than 40%). All code, data and models are made publicly available: https://github.com/leduckhai/MultiMed.

5/29/2024

Efficient Compression of Multitask Multilingual Speech Models

Thomas Palmeira Ferraz

Whisper is a multitask and multilingual speech model covering 99 languages. It yields commendable automatic speech recognition (ASR) results in a subset of its covered languages, but the model still underperforms on a non-negligible number of under-represented languages, a problem exacerbated in smaller model versions. In this work, we examine its limitations, demonstrating the presence of speaker-related (gender, age) and model-related (resourcefulness and model size) bias. Despite that, we show that only model-related bias are amplified by quantization, impacting more low-resource languages and smaller models. Searching for a better compression approach, we propose DistilWhisper, an approach that is able to bridge the performance gap in ASR for these languages while retaining the advantages of multitask and multilingual capabilities. Our approach involves two key strategies: lightweight modular ASR fine-tuning of whisper-small using language-specific experts, and knowledge distillation from whisper-large-v2. This dual approach allows us to effectively boost ASR performance while keeping the robustness inherited from the multitask and multilingual pre-training. Results demonstrate that our approach is more effective than standard fine-tuning or LoRA adapters, boosting performance in the targeted languages for both in- and out-of-domain test sets, while introducing only a negligible parameter overhead at inference.

5/3/2024

Automatic Speech Recognition of Non-Native Child Speech for Language Learning Applications

Simone Wills, Yu Bai, Cristian Tejedor-Garcia, Catia Cucchiarini, Helmer Strik

Voicebots have provided a new avenue for supporting the development of language skills, particularly within the context of second language learning. Voicebots, though, have largely been geared towards native adult speakers. We sought to assess the performance of two state-of-the-art ASR systems, Wav2Vec2.0 and Whisper AI, with a view to developing a voicebot that can support children acquiring a foreign language. We evaluated their performance on read and extemporaneous speech of native and non-native Dutch children. We also investigated the utility of using ASR technology to provide insight into the children's pronunciation and fluency. The results show that recent, pre-trained ASR transformer-based models achieve acceptable performance from which detailed feedback on phoneme pronunciation quality can be extracted, despite the challenging nature of child and non-native speech.

7/24/2024