Automatic Speech Recognition of Non-Native Child Speech for Language Learning Applications

Read original: arXiv:2306.16710 - Published 7/24/2024 by Simone Wills, Yu Bai, Cristian Tejedor-Garcia, Catia Cucchiarini, Helmer Strik

Overview

Voicebots have the potential to support language learning, especially second language acquisition.
Most voicebots are designed for native adult speakers, not children.
This paper evaluates the performance of two state-of-the-art speech recognition models, Wav2Vec2.0 and Whisper AI, on read and spontaneous speech from native and non-native Dutch-speaking children.
The goal is to assess the feasibility of using these models to provide feedback on pronunciation and fluency for children learning a foreign language.

Plain English Explanation

Voicebots, which are computer programs that can understand and respond to voice commands, have opened up new possibilities for supporting the development of language skills, particularly for people learning a second language. However, most voicebots have been designed with adult native speakers in mind, not children.

The researchers in this study wanted to evaluate how well two advanced speech recognition models, called Wav2Vec2.0 and Whisper AI, could handle speech from native and non-native Dutch children. They tested the models on both read speech, where the children read aloud, and spontaneous speech, where the children spoke freely.

The goal was to see if these speech recognition models could provide useful feedback to children learning a foreign language, such as insights into their pronunciation and fluency. Even though child and non-native speech can be very challenging for speech recognition systems, the results showed that the Wav2Vec2.0 and Whisper AI models were able to achieve acceptable performance. This suggests they could potentially be used to build voicebots that support language learning for children.

Technical Explanation

The researchers evaluated the performance of two state-of-the-art automatic speech recognition (ASR) systems, Wav2Vec2.0 and Whisper AI, on speech data from native and non-native Dutch children. Wav2Vec2.0 and Whisper AI are transformer-based models that have shown strong results on various speech recognition tasks.

The researchers collected speech data from 64 children speaking Dutch, including both native speakers and non-native learners. The children read a set of sentences aloud and also produced spontaneous speech. The audio recordings were then processed by the Wav2Vec2.0 and Whisper AI models to evaluate how well each model transcribed the children's speech.
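Transcription quality in ASR evaluations of this kind is commonly reported as word error rate (WER): the number of word substitutions, deletions and insertions needed to turn the model's transcript into the reference transcript, divided by the number of reference words. This summary does not state exactly which metric or tooling the authors used, so the following is only a minimal sketch of a standard WER computation. The function name and the Dutch example sentence are hypothetical.

```python
from typing import List


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref: List[str] = reference.lower().split()
    hyp: List[str] = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] = edit distance between the reference words seen so far
    # (before the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missing from hypothesis
                curr[j - 1] + 1,     # insertion: extra word in hypothesis
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical example: the recognizer drops one of six reference words.
reference = "de kat zit op de mat"
hypothesis = "de kat zit op mat"
print(f"WER = {word_error_rate(reference, hypothesis):.3f}")
```

Lowercasing and whitespace tokenization are simplifying assumptions here. Real evaluations usually also normalize punctuation and numerals before scoring.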

The results showed that, despite the challenges posed by child and non-native speech, the pre-trained transformer-based ASR models achieved acceptable performance. The researchers also found that the models' output could be used to derive detailed feedback on the children's pronunciation, down to the quality of individual phonemes, and to give insight into their fluency.
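One plausible way to turn recognizer output into pronunciation feedback is to align the phoneme sequence a child was expected to produce with the sequence the recognizer actually heard, and to report the differences. The summary does not describe how the authors extracted their feedback, so the sketch below is an assumption-laden illustration rather than their method. It assumes a phoneme-level transcript is already available, and the function name and example phoneme symbols are hypothetical.

```python
from difflib import SequenceMatcher
from typing import List, Tuple

Feedback = Tuple[str, List[str], List[str]]

# Map difflib opcode tags to the usual error terminology.
_LABELS = {"replace": "substitution", "delete": "deletion", "insert": "insertion"}


def phoneme_feedback(expected: List[str], recognized: List[str]) -> List[Feedback]:
    """List mismatches between expected and recognized phoneme sequences.

    Each item is (error type, expected phonemes, recognized phonemes).
    """
    matcher = SequenceMatcher(None, expected, recognized, autojunk=False)
    feedback: List[Feedback] = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            continue  # pronounced as expected, nothing to report
        feedback.append((_LABELS[tag], expected[i1:i2], recognized[j1:j2]))
    return feedback


# Hypothetical example: a learner says "school" with /k/ instead of /x/.
expected = ["s", "x", "o", "l"]
recognized = ["s", "k", "o", "l"]
for kind, exp, rec in phoneme_feedback(expected, recognized):
    print(kind, exp, rec)
```

`difflib.SequenceMatcher` is used only because it ships with Python. It finds long matching blocks rather than a guaranteed minimum-edit alignment, so a dedicated aligner may be preferable in practice.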

Critical Analysis

The study provides promising evidence that state-of-the-art speech recognition models like Wav2Vec2.0 and Whisper AI can be leveraged to support language learning, even for young non-native speakers. However, the researchers acknowledge several limitations and areas for further research.

First, the study was conducted in a relatively controlled setting with a small sample size. It remains to be seen how the models would perform in more realistic, noisy environments or with a larger and more diverse set of learners. Additionally, the researchers did not evaluate the models' ability to provide feedback that is actually useful and actionable for the children's language learning process.

Further research is needed to better understand the specific types of feedback the models can generate and how effective that feedback is in helping children improve their language skills. The researchers also note that additional model fine-tuning or data augmentation techniques may be necessary to further improve the models' performance on child and non-native speech.

Conclusion

This study demonstrates the potential of using advanced speech recognition models, such as Wav2Vec2.0 and Whisper AI, to support the development of language skills, particularly for children learning a foreign language. The models were able to achieve acceptable performance on transcribing the speech of both native and non-native Dutch children, suggesting they could be used to provide detailed feedback on pronunciation and fluency.

While further research is needed to address the limitations and refine the models' capabilities, this work represents an important step towards leveraging voicebots to enhance language learning experiences for young learners. By bridging the gap between state-of-the-art speech recognition and the unique challenges of child and non-native speech, this research opens up new possibilities for more personalized and effective language learning tools.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Automatic Speech Recognition of Non-Native Child Speech for Language Learning Applications

Simone Wills, Yu Bai, Cristian Tejedor-Garcia, Catia Cucchiarini, Helmer Strik

Voicebots have provided a new avenue for supporting the development of language skills, particularly within the context of second language learning. Voicebots, though, have largely been geared towards native adult speakers. We sought to assess the performance of two state-of-the-art ASR systems, Wav2Vec2.0 and Whisper AI, with a view to developing a voicebot that can support children acquiring a foreign language. We evaluated their performance on read and extemporaneous speech of native and non-native Dutch children. We also investigated the utility of using ASR technology to provide insight into the children's pronunciation and fluency. The results show that recent, pre-trained ASR transformer-based models achieve acceptable performance from which detailed feedback on phoneme pronunciation quality can be extracted, despite the challenging nature of child and non-native speech.

7/24/2024

Error-preserving Automatic Speech Recognition of Young English Learners' Language

Janick Michot, Manuela Hurlimann, Jan Deriu, Luzia Sauer, Katsiaryna Mlynchyk, Mark Cieliebak

One of the central skills that language learners need to practice is speaking the language. Currently, students in school do not get enough speaking opportunities and lack conversational practice. Recent advances in speech technology and natural language processing allow for the creation of novel tools to practice their speaking skills. In this work, we tackle the first component of such a pipeline, namely, the automated speech recognition module (ASR), which faces a number of challenges: first, state-of-the-art ASR models are often trained on adult read-aloud data by native speakers and do not transfer well to young language learners' speech. Second, most ASR systems contain a powerful language model, which smooths out errors made by the speakers. To give corrective feedback, which is a crucial part of language learning, the ASR systems in our setting need to preserve the errors made by the language learners. In this work, we build an ASR system that satisfies these requirements: it works on spontaneous speech by young language learners and preserves their errors. For this, we collected a corpus containing around 85 hours of English audio spoken by learners in Switzerland from grades 4 to 6 on different language learning tasks, which we used to train an ASR model. Our experiments show that our model benefits from direct fine-tuning on children's voices and has a much higher error preservation rate than other models.

6/6/2024

Child Speech Recognition in Human-Robot Interaction: Problem Solved?

Ruben Janssens, Eva Verhelst, Giulio Antonio Abbo, Qiaoqiao Ren, Maria Jose Pinto Bernal, Tony Belpaeme

Automated Speech Recognition shows superhuman performance for adult English speech on a range of benchmarks, but disappoints when fed children's speech. This has long sat in the way of child-robot interaction. Recent evolutions in data-driven speech recognition, including the availability of Transformer architectures and unprecedented volumes of training data, might mean a breakthrough for child speech recognition and social robot applications aimed at children. We revisit a study on child speech recognition from 2017 and show that indeed performance has increased, with newcomer OpenAI Whisper doing markedly better than leading commercial cloud services. While transcription is not perfect yet, the best model recognises 60.3% of sentences correctly barring small grammatical differences, with sub-second transcription time running on a local GPU, showing potential for usable autonomous child-robot speech interactions.

4/29/2024

Kid-Whisper: Towards Bridging the Performance Gap in Automatic Speech Recognition for Children VS. Adults

Ahmed Adel Attia, Jing Liu, Wei Ai, Dorottya Demszky, Carol Espy-Wilson

Recent advancements in Automatic Speech Recognition (ASR) systems, exemplified by Whisper, have demonstrated the potential of these systems to approach human-level performance given sufficient data. However, this progress doesn't readily extend to ASR for children due to the limited availability of suitable child-specific databases and the distinct characteristics of children's speech. A recent study investigated leveraging the My Science Tutor (MyST) children's speech corpus to enhance Whisper's performance in recognizing children's speech. They were able to demonstrate some improvement on a limited testset. This paper builds on these findings by enhancing the utility of the MyST dataset through more efficient data preprocessing. We reduce the Word Error Rate (WER) on the MyST testset from 13.93% to 9.11% with Whisper-Small and from 13.23% to 8.61% with Whisper-Medium and show that this improvement can be generalized to unseen datasets. We also highlight important challenges towards improving children's ASR performance. The results showcase the viable and efficient integration of Whisper for effective children's speech recognition.

5/16/2024