Enhancing Child Vocalization Classification with Phonetically-Tuned Embeddings for Assisting Autism Diagnosis

Read original: arXiv:2309.07287 - Published 6/7/2024 by Jialu Li, Mark Hasegawa-Johnson, Karrie Karahalios

Overview

This paper explores the use of Wav2Vec 2.0 (W2V2), a self-supervised speech model, to assist in the assessment of children at risk of autism.
The model is pre-trained on 4300 hours of home audio recordings of children under 5 years old and is used for two main tasks: clinician-child speaker diarization and vocalization classification (VC).
To improve VC, the researchers build a W2V2 phoneme recognition system for children under 4 years old, and either use its phonetically-tuned embeddings as auxiliary features or add recognition of pseudo phonetic transcripts as an auxiliary task.
The method is tested on two corpora, Rapid-ABC and BabbleCor, and yields consistent improvements on both; it also outperforms the state of the art on the reproducible subset of BabbleCor.

Plain English Explanation

Assessing whether a child is at risk of autism typically involves a clinician observing the child, taking notes, and rating the child's behaviors. However, this process can be time-consuming and labor-intensive. The researchers in this study have developed a machine learning model that can help make this process more efficient.

The model is built on Wav2Vec 2.0 (W2V2), which was pre-trained on 4300 hours of home audio recordings of children under 5 years old. It can automatically identify when a clinician is speaking and when the child is vocalizing, which can save the clinician a lot of time and effort.

To further improve the model's ability to classify the child's vocalizations, the researchers have also incorporated a specialized phoneme recognition system. This system is trained to recognize the specific sounds that young children make, which can help the model better understand the child's speech patterns and communication.

The researchers tested their approach on two datasets, Rapid-ABC and BabbleCor, and found consistent improvements on both. On the reproducible subset of BabbleCor, it also outperforms the previous state of the art. This suggests that their approach could be a valuable tool for clinicians who are assessing children for autism risk.

Technical Explanation

The researchers in this study leveraged the Wav2Vec 2.0 (W2V2) model, which was pre-trained on 4300 hours of home audio recordings of children under 5 years old, to build a unified system for two key tasks: clinician-child speaker diarization and vocalization classification (VC).
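To make the diarization task concrete, here is a minimal sketch of how per-frame speaker labels could be turned into time segments. It assumes the system emits one label per encoder frame and that frames are about 20 ms apart, which is the usual Wav2Vec 2.0 stride. The label names (`adult`, `child`, `silence`) and the function `frames_to_segments` are hypothetical and are not taken from the paper.

```python
from itertools import groupby

# Assumed frame stride: Wav2Vec 2.0 encoders emit roughly one frame per 20 ms.
FRAME_SEC = 0.02


def frames_to_segments(labels, frame_sec=FRAME_SEC):
    """Merge runs of identical per-frame labels into (label, start, end) segments.

    Frames labeled "silence" (a hypothetical label) are skipped.
    """
    segments = []
    index = 0
    for label, run in groupby(labels):
        length = len(list(run))
        start = round(index * frame_sec, 3)
        end = round((index + length) * frame_sec, 3)
        if label != "silence":
            segments.append((label, start, end))
        index += length
    return segments


# Hypothetical frame-level predictions for a 160 ms clip.
example = ["silence", "adult", "adult", "adult", "child", "child", "silence", "child"]
segments = frames_to_segments(example)
for label, start, end in segments:
    print(f"{label}: {start:.2f}s to {end:.2f}s")
```

A clinician-facing tool would likely smooth very short runs before merging, but that step is left out here to keep the sketch short.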

To enhance the performance of the VC task, the researchers built a W2V2 phoneme recognition system specifically for children under 4 years old. They either incorporated the phonetically-tuned embeddings from this system as auxiliary features, or added recognition of pseudo phonetic transcripts as an auxiliary task.
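The two ways of using the phoneme recognizer can be sketched roughly as follows. This is an illustration under stated assumptions, not the paper's implementation: it assumes utterance-level mean pooling followed by concatenation for the auxiliary-feature option, and a simple weighted sum of losses for the auxiliary-task option. The function names and the `aux_weight` value are hypothetical.

```python
def mean_pool(frames):
    """Average a list of equal-length frame vectors into one utterance vector."""
    dim = len(frames[0])
    return [sum(frame[i] for frame in frames) / len(frames) for i in range(dim)]


def fuse_as_auxiliary_features(base_frames, phonetic_frames):
    """Option 1 (assumed form): concatenate pooled base and phonetic embeddings."""
    return mean_pool(base_frames) + mean_pool(phonetic_frames)


def multitask_loss(vc_loss, phoneme_loss, aux_weight=0.3):
    """Option 2 (assumed form): add a weighted phoneme-recognition loss to the VC loss.

    aux_weight is a hypothetical hyperparameter, not a value from the paper.
    """
    return vc_loss + aux_weight * phoneme_loss


# Toy embeddings: two frames each, with 2-dim base and 3-dim phonetic vectors.
base = [[1.0, 3.0], [3.0, 5.0]]
phonetic = [[0.0, 2.0, 4.0], [2.0, 4.0, 6.0]]

fused = fuse_as_auxiliary_features(base, phonetic)
total = multitask_loss(vc_loss=1.0, phoneme_loss=2.0, aux_weight=0.5)
print(fused, total)
```

In the first option the classifier sees a wider input vector; in the second the input is unchanged and the phonetic signal only shapes training through the extra loss term.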

The researchers tested their method on two corpora, Rapid-ABC and BabbleCor, and obtained consistent improvements on both. In addition, their approach outperformed the state of the art on the reproducible subset of BabbleCor.

Critical Analysis

The researchers have presented a promising approach to automating the assessment of children at risk of autism, which could significantly reduce the labor required by clinicians. However, the study has some limitations that should be considered:

The model was pre-trained on 4300 hours of home audio, which may still not be sufficient to capture the full diversity of children's speech patterns. Expanding the dataset could potentially improve the model's performance.
The researchers only tested their model on two corpora, and it would be valuable to see how it performs on a wider range of datasets to assess its generalizability.
The study did not address potential privacy and ethical concerns around the use of audio recordings of children for machine learning models. These issues should be carefully considered before deploying such systems in clinical settings.

Further research could also explore ways to make the model more interpretable and transparent, so that clinicians can better understand the reasoning behind its decisions and have greater confidence in its outputs.

Conclusion

This study presents a novel approach to automating the assessment of children at risk of autism, using a machine learning model pre-trained on a large dataset of home audio recordings of young children. The researchers have demonstrated consistent improvements on two corpora and better-than-state-of-the-art performance on the reproducible subset of BabbleCor, which could potentially save clinicians a significant amount of time and effort in the assessment process.

While the study has some limitations, the researchers' work represents an important step forward in the development of speech recognition systems for children. By leveraging the power of machine learning, clinicians may be able to more effectively capture critical events and better communicate with parents, ultimately leading to improved outcomes for children at risk of autism.

Related Papers

Enhancing Child Vocalization Classification with Phonetically-Tuned Embeddings for Assisting Autism Diagnosis

Jialu Li, Mark Hasegawa-Johnson, Karrie Karahalios

The assessment of children at risk of autism typically involves a clinician observing, taking notes, and rating children's behaviors. A machine learning model that can label adult and child audio may largely save labor in coding children's behaviors, helping clinicians capture critical events and better communicate with parents. In this study, we leverage Wav2Vec 2.0 (W2V2), pre-trained on 4300-hour of home audio of children under 5 years old, to build a unified system for tasks of clinician-child speaker diarization and vocalization classification (VC). To enhance children's VC, we build a W2V2 phoneme recognition system for children under 4 years old, and we incorporate its phonetically-tuned embeddings as auxiliary features or recognize pseudo phonetic transcripts as an auxiliary task. We test our method on two corpora (Rapid-ABC and BabbleCor) and obtain consistent improvements. Additionally, we outperform the state-of-the-art performance on the reproducible subset of BabbleCor. Code available at https://huggingface.co/lijialudew

6/7/2024

Analysis of Self-Supervised Speech Models on Children's Speech and Infant Vocalizations

Jialu Li, Mark Hasegawa-Johnson, Nancy L. McElwain

To understand why self-supervised learning (SSL) models have empirically achieved strong performances on several speech-processing downstream tasks, numerous studies have focused on analyzing the encoded information of the SSL layer representations in adult speech. Limited work has investigated how pre-training and fine-tuning affect SSL models encoding children's speech and vocalizations. In this study, we aim to bridge this gap by probing SSL models on two relevant downstream tasks: (1) phoneme recognition (PR) on the speech of adults, older children (8-10 years old), and younger children (1-4 years old), and (2) vocalization classification (VC) distinguishing cry, fuss, and babble for infants under 14 months old. For younger children's PR, the superiority of fine-tuned SSL models is largely due to their ability to learn features that represent older children's speech and then adapt those features to the speech of younger children. For infant VC, SSL models pre-trained on large-scale home recordings learn to leverage phonetic representations at middle layers, and thereby enhance the performance of this task.

6/7/2024

Automatic Speech Recognition of Non-Native Child Speech for Language Learning Applications

Simone Wills, Yu Bai, Cristian Tejedor-Garcia, Catia Cucchiarini, Helmer Strik

Voicebots have provided a new avenue for supporting the development of language skills, particularly within the context of second language learning. Voicebots, though, have largely been geared towards native adult speakers. We sought to assess the performance of two state-of-the-art ASR systems, Wav2Vec2.0 and Whisper AI, with a view to developing a voicebot that can support children acquiring a foreign language. We evaluated their performance on read and extemporaneous speech of native and non-native Dutch children. We also investigated the utility of using ASR technology to provide insight into the children's pronunciation and fluency. The results show that recent, pre-trained ASR transformer-based models achieve acceptable performance from which detailed feedback on phoneme pronunciation quality can be extracted, despite the challenging nature of child and non-native speech.

7/24/2024

Improving child speech recognition with augmented child-like speech

Yuanyuan Zhang, Zhengjun Yue, Tanvina Patel, Odette Scharenborg

State-of-the-art ASRs show suboptimal performance for child speech. The scarcity of child speech limits the development of child speech recognition (CSR). Therefore, we studied child-to-child voice conversion (VC) from existing child speakers in the dataset and additional (new) child speakers via monolingual and cross-lingual (Dutch-to-German) VC, respectively. The results showed that cross-lingual child-to-child VC significantly improved child ASR performance. Experiments on the impact of the quantity of child-to-child cross-lingual VC-generated data on fine-tuning (FT) ASR models gave the best results with two-fold augmentation for our FT-Conformer model and FT-Whisper model which reduced WERs with ~3% absolute compared to the baseline, and with six-fold augmentation for the model trained from scratch, which improved by an absolute 3.6% WER. Moreover, using a small amount of high-quality VC-generated data achieved similar results to those of our best-FT models.

6/18/2024