LearnerVoice: A Dataset of Non-Native English Learners' Spontaneous Speech

Read original: arXiv:2407.04280 - Published 7/8/2024 by Haechan Kim, Junho Myung, Seoyoung Kim, Sungpah Lee, Dongyeop Kang, Juho Kim

Overview

The paper introduces the LearnerVoice dataset, which consists of spontaneous speech recordings from non-native English learners.
The dataset aims to support research on speech recognition and language learning technologies for non-native speakers.
According to the paper's abstract, the dataset contains 50.04 hours of audio with transcriptions of L2 learners' spontaneous speech.

Plain English Explanation

The LearnerVoice dataset provides recordings of non-native English speakers talking spontaneously rather than reading from a script. This is useful for developing technologies that can better understand and assist language learners. The dataset pairs about 50 hours of audio with text transcripts that retain the hesitations and grammatical slips learners actually produce. This lets researchers study the patterns and challenges in non-native speech and build speech recognition and language learning tools tailored to non-native English speakers.

Technical Explanation

The LearnerVoice dataset consists of 50.04 hours of audio and transcriptions of spontaneous speech from second language (L2) learners of English. The authors' linguistic analysis finds that the transcriptions contain what they call L2S (L2 learner's Spontaneous speech) features: ungrammatical expressions and disfluencies such as filler words, word repetitions, self-repairs, and false starts. These features occur significantly more often in LearnerVoice than in native speech datasets.
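
The abstract lists filler words and word repetitions among the disfluencies found in learner transcripts. As a rough illustration of how two of these could be counted, here is a minimal Python sketch. The filler list and the function name `count_disfluencies` are assumptions made for this example, not the authors' annotation scheme, and self-repairs and false starts are beyond what such a simple heuristic can detect.

```python
import re

# Hypothetical filler inventory, chosen for illustration only.
FILLERS = {"um", "uh", "er", "hmm"}


def count_disfluencies(transcript: str) -> dict:
    """Count filler words and immediate word repetitions in a transcript."""
    # Lowercase and keep only alphabetic tokens (apostrophes allowed).
    tokens = re.findall(r"[a-z']+", transcript.lower())
    fillers = sum(1 for t in tokens if t in FILLERS)
    # A repetition is the same non-filler word appearing twice in a row.
    repetitions = sum(
        1
        for a, b in zip(tokens, tokens[1:])
        if a == b and a not in FILLERS
    )
    return {"tokens": len(tokens), "fillers": fillers, "repetitions": repetitions}


if __name__ == "__main__":
    print(count_disfluencies("I um went to the the store, uh, yesterday"))
```

A heuristic like this will miscount deliberate repetitions ("very very good"), so it is only a starting point for the kind of analysis the paper describes.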

The dataset was designed to support research in speech recognition, language learning technologies, and the analysis of non-native speech patterns. As a benchmark, the authors fine-tune whisper-small.en on LearnerVoice and report a word error rate (WER) of 10.26%, which is 44.2% lower than the vanilla whisper-small.en model. By providing a corpus of real-world learner speech, LearnerVoice aims to enable more robust and inclusive speech technologies for language learners.
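
The headline result in the paper's abstract is stated as word error rate (WER). For readers unfamiliar with the metric, the sketch below computes WER using the standard definition: word-level edit distance divided by the number of reference words. It is not the authors' evaluation script, and their text normalization is not known here. The helper `implied_baseline` is hypothetical and assumes the reported 44.2% is a relative reduction, which the abstract does not state explicitly.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between ref[:i-1] and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            )
        prev = curr
    return prev[-1] / len(ref)


def implied_baseline(wer: float, relative_reduction: float) -> float:
    """Baseline WER implied by a final WER and a relative reduction (assumption)."""
    if not 0 <= relative_reduction < 1:
        raise ValueError("relative_reduction must be in [0, 1)")
    return wer / (1 - relative_reduction)


if __name__ == "__main__":
    # An ASR system that "cleans up" a learner's disfluencies is penalized
    # when the reference transcript preserves them.
    print(word_error_rate("i um went to the the store", "i went to the store"))
    print(implied_baseline(10.26, 0.442))
```

Under the relative-reduction assumption, `implied_baseline(10.26, 0.442)` gives roughly 18.39% for the vanilla model.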

Critical Analysis

The LearnerVoice dataset is a valuable resource for advancing research in speech recognition and language learning for non-native English speakers. The authors' qualitative analysis is a particular strength: it attributes 54.2% of the vanilla model's errors on LearnerVoice to L2S features and reports that 48.1% of those errors are reduced after fine-tuning, which ties the accuracy gain directly to the characteristics of learner speech.

However, the abstract does not describe the data collection process, such as the prompts or tasks given to participants, which could affect how spontaneous and natural the speech samples are. It also does not describe the speakers' language backgrounds or geographical distribution, so it is unclear from the abstract how well the dataset represents the global population of non-native English learners.

Further research could explore the use of the LearnerVoice dataset in real-world applications, such as the development of personalized language learning systems or virtual language assistants tailored to individual learners' needs and backgrounds.

Conclusion

The LearnerVoice dataset is a significant contribution to speech and language technology research for non-native English speakers. By providing 50.04 hours of spontaneous learner speech with transcriptions that preserve ungrammatical expressions and disfluencies, it enables more inclusive and effective speech recognition and language learning tools. The reported drop in WER after fine-tuning suggests the dataset could meaningfully improve how ASR systems handle learner speech.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

LearnerVoice: A Dataset of Non-Native English Learners' Spontaneous Speech

Haechan Kim, Junho Myung, Seoyoung Kim, Sungpah Lee, Dongyeop Kang, Juho Kim

Prevalent ungrammatical expressions and disfluencies in spontaneous speech from second language (L2) learners pose unique challenges to Automatic Speech Recognition (ASR) systems. However, few datasets are tailored to L2 learner speech. We publicly release LearnerVoice, a dataset consisting of 50.04 hours of audio and transcriptions of L2 learners' spontaneous speech. Our linguistic analysis reveals that transcriptions in our dataset contain L2S (L2 learner's Spontaneous speech) features, consisting of ungrammatical expressions and disfluencies (e.g., filler words, word repetitions, self-repairs, false starts), significantly more than native speech datasets. Fine-tuning whisper-small.en with LearnerVoice achieves a WER of 10.26%, 44.2% lower than vanilla whisper-small.en. Furthermore, our qualitative analysis indicates that 54.2% of errors from the vanilla model on LearnerVoice are attributable to L2S features, with 48.1% of them being reduced in the fine-tuned model.

7/8/2024

DisfluencySpeech -- Single-Speaker Conversational Speech Dataset with Paralanguage

Kyra Wang, Dorien Herremans

Laughing, sighing, stuttering, and other forms of paralanguage do not contribute any direct lexical meaning to speech, but they provide crucial propositional context that aids semantic and pragmatic processes such as irony. It is thus important for artificial social agents to both understand and be able to generate speech with semantically-important paralanguage. Most speech datasets do not include transcribed non-lexical speech sounds and disfluencies, while those that do are typically multi-speaker datasets where each speaker provides relatively little audio. This makes it challenging to train conversational Text-to-Speech (TTS) synthesis models that include such paralinguistic components. We thus present DisfluencySpeech, a studio-quality labeled English speech dataset with paralanguage. A single speaker recreates nearly 10 hours of expressive utterances from the Switchboard-1 Telephone Speech Corpus (Switchboard), simulating realistic informal conversations. To aid the development of a TTS model that is able to predictively synthesise paralanguage from text without such components, we provide three different transcripts at different levels of information removal (removal of non-speech events, removal of non-sentence elements, and removal of false starts), as well as benchmark TTS models trained on each of these levels.

6/14/2024

Automatic Speech Recognition of Non-Native Child Speech for Language Learning Applications

Simone Wills, Yu Bai, Cristian Tejedor-Garcia, Catia Cucchiarini, Helmer Strik

Voicebots have provided a new avenue for supporting the development of language skills, particularly within the context of second language learning. Voicebots, though, have largely been geared towards native adult speakers. We sought to assess the performance of two state-of-the-art ASR systems, Wav2Vec2.0 and Whisper AI, with a view to developing a voicebot that can support children acquiring a foreign language. We evaluated their performance on read and extemporaneous speech of native and non-native Dutch children. We also investigated the utility of using ASR technology to provide insight into the children's pronunciation and fluency. The results show that recent, pre-trained ASR transformer-based models achieve acceptable performance from which detailed feedback on phoneme pronunciation quality can be extracted, despite the challenging nature of child and non-native speech.

7/24/2024

Error-preserving Automatic Speech Recognition of Young English Learners' Language

Janick Michot, Manuela Hurlimann, Jan Deriu, Luzia Sauer, Katsiaryna Mlynchyk, Mark Cieliebak

One of the central skills that language learners need to practice is speaking the language. Currently, students in school do not get enough speaking opportunities and lack conversational practice. Recent advances in speech technology and natural language processing allow for the creation of novel tools to practice their speaking skills. In this work, we tackle the first component of such a pipeline, namely, the automated speech recognition module (ASR), which faces a number of challenges: first, state-of-the-art ASR models are often trained on adult read-aloud data by native speakers and do not transfer well to young language learners' speech. Second, most ASR systems contain a powerful language model, which smooths out errors made by the speakers. To give corrective feedback, which is a crucial part of language learning, the ASR systems in our setting need to preserve the errors made by the language learners. In this work, we build an ASR system that satisfies these requirements: it works on spontaneous speech by young language learners and preserves their errors. For this, we collected a corpus containing around 85 hours of English audio spoken by learners in Switzerland from grades 4 to 6 on different language learning tasks, which we used to train an ASR model. Our experiments show that our model benefits from direct fine-tuning on children's voices and has a much higher error preservation rate than other models.

6/6/2024