Sound Tagging in Infant-centric Home Soundscapes

Read original: arXiv:2406.17190 - Published 6/26/2024 by Mohammad Nur Hossain Khan, Jialu Li, Nancy L. McElwain, Mark Hasegawa-Johnson, Bashima Islam

Overview

This paper evaluates sound tagging for infant-centric home soundscapes, using audio recorded by an infant-worn device in the homes of 22 families.
It fine-tunes a large pre-trained Audio Spectrogram Transformer (AST) to classify domestic sound events.
The goal is to find out how well such a model identifies the sounds that surround an infant at home, and which training strategies help when the data are sparse and imbalanced.

Plain English Explanation

The paper is about creating a system that can automatically identify and label different sounds that occur in a home environment, particularly those that are relevant to caring for infants. This is important because being able to understand the sounds that infants are exposed to in their daily lives can provide valuable insights into their development and experiences.

The researchers used a type of machine learning model called an Audio Spectrogram Transformer (AST) to detect and classify various domestic sound events, such as speech, crying, and household appliances. The model comes pre-trained, and the researchers fine-tuned it on recordings made in homes by a device the infant wears, so the audio reflects what the infant actually hears rather than what a microphone elsewhere in the room picks up.

By using this sound tagging system, researchers can get a better understanding of the acoustic environment that infants are experiencing in their homes, which can inform our knowledge of child development and help identify any potential issues or areas for support.

Technical Explanation

The researchers developed a sound tagging system for infant-centric home soundscapes using an audio spectrogram transformer model. This model takes audio recordings as input and outputs a set of labels or "tags" that identify the various sound events present in the recording.
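As a rough illustration of the "audio in, tags out" step, the sketch below turns per-tag model scores into a list of tags. It assumes each tag is scored independently and kept when its probability clears a threshold; the paper may instead assign a single label per clip. The tag names, the `tag_clip` helper, and the 0.5 threshold are all hypothetical and are not taken from the paper.

```python
import math

# Hypothetical tag set for illustration; the paper's actual label set is not given here.
TAGS = ["speech", "infant_cry", "appliance", "tv"]


def sigmoid(x):
    """Map a raw score (logit) to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-x))


def tag_clip(logits, threshold=0.5):
    """Return (tag, probability) pairs whose probability reaches the threshold."""
    probs = [sigmoid(z) for z in logits]
    return [(tag, round(p, 3)) for tag, p in zip(TAGS, probs) if p >= threshold]


# Example: made-up scores for one clip, one score per tag.
tags = tag_clip([2.0, -1.0, 0.3, -3.0])
print(tags)
```

With these made-up scores, only the tags whose probability is at least 0.5 are returned; lowering the threshold trades precision for recall.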

The model was fine-tuned on audio collected unobtrusively from 22 families through an infant-worn recording device, which the researchers labeled with the sound events present. Because these infant-centric data are sparse and imbalanced, the researchers compared several training strategies: resampling, training on public home-environment datasets, mixing public and infant-centric training sets, and data augmentation using noise and masking.
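The paper's abstract names resampling and masking-based augmentation among the strategies it uses for sparse, imbalanced data. The sketch below shows one common form of each: random oversampling of minority classes, and zeroing a random band of time frames and frequency bins in a spectrogram. It is a minimal illustration under assumed details, not the authors' procedure; the function names, labels, and mask sizes are hypothetical.

```python
import random
from collections import Counter, defaultdict


def oversample(examples, rng):
    """Repeat minority-class examples until every class matches the largest one."""
    by_label = defaultdict(list)
    for features, label in examples:
        by_label[label].append((features, label))
    target = max(len(items) for items in by_label.values())
    balanced = []
    for items in by_label.values():
        balanced.extend(items)
        balanced.extend(rng.choice(items) for _ in range(target - len(items)))
    rng.shuffle(balanced)
    return balanced


def mask_spectrogram(spec, max_time, max_freq, rng):
    """Return a copy with one random time band and one frequency band set to zero."""
    n_freq, n_time = len(spec), len(spec[0])
    t_width = rng.randint(0, min(max_time, n_time))
    t_start = rng.randint(0, n_time - t_width)
    f_width = rng.randint(0, min(max_freq, n_freq))
    f_start = rng.randint(0, n_freq - f_width)
    masked = [row[:] for row in spec]  # copy so the input is left untouched
    for f in range(n_freq):
        for t in range(n_time):
            in_time_band = t_start <= t < t_start + t_width
            in_freq_band = f_start <= f < f_start + f_width
            if in_time_band or in_freq_band:
                masked[f][t] = 0.0
    return masked


rng = random.Random(0)
# Toy imbalanced set: five "speech" clips and one "infant_cry" clip (made-up labels).
clips = [(f"clip{i}", "speech") for i in range(5)] + [("clip5", "infant_cry")]
balanced = oversample(clips, rng)
print(Counter(label for _, label in balanced))

spec = [[1.0] * 10 for _ in range(8)]  # 8 frequency bins x 10 time frames
masked = mask_spectrogram(spec, max_time=3, max_freq=2, rng=rng)
```

Oversampling only repeats existing clips, so it is usually paired with augmentation such as masking to keep the repeated examples from being identical.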

The work does not propose a new architecture. Its contribution is the infant-centric dataset and an evaluation of how an existing large pre-trained model, the Audio Spectrogram Transformer (AST), performs on it. AST operates on spectrograms, so it can capture both the temporal and the spectral characteristics of sounds, which has made it effective for sound event classification.

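The researchers evaluated the sound tagging system on a held-out test set of home audio recordings. Fine-tuning on the collected dataset combined with public datasets gave the best results: an F1-score of 0.84, compared with 0.11 when training only on public datasets and 0.76 when training only on the collected dataset. Cohen's Kappa rose similarly, to 0.83 from 0.013 and 0.77. The very low scores for public-only training indicate that infant-centric training data is essential, and that public data helps mainly as a supplement.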
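The paper's abstract reports its results as F1-score and Cohen's Kappa. For readers unfamiliar with these metrics, the sketch below computes both from scratch for single-label predictions. The labels and predictions are made up, and macro averaging of F1 is an assumption; the abstract does not state which averaging the authors used.

```python
from collections import Counter


def macro_f1(y_true, y_pred):
    """Average the per-class F1 scores, weighting every class equally."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def cohen_kappa(y_true, y_pred):
    """Agreement between labels and predictions, corrected for chance agreement."""
    n = len(y_true)
    observed = sum(t == p for t, p in zip(y_true, y_pred)) / n
    true_counts, pred_counts = Counter(y_true), Counter(y_pred)
    expected = sum(true_counts[k] * pred_counts[k] for k in true_counts) / (n * n)
    if expected == 1.0:  # only one class present on both sides
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Made-up labels for six clips.
y_true = ["speech", "speech", "cry", "tv", "cry", "speech"]
y_pred = ["speech", "cry", "cry", "tv", "cry", "speech"]
print(round(macro_f1(y_true, y_pred), 3), round(cohen_kappa(y_true, y_pred), 3))
```

Kappa is a useful companion to F1 on imbalanced data: a model that always predicts the majority class can score well on raw agreement, but its kappa stays near zero because that agreement is what chance alone would produce.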

Critical Analysis

The researchers acknowledge several limitations of their work, such as the relatively small size of the dataset used for training and the potential for bias in the annotations. They also note that the performance of the sound tagging system may vary depending on the specific home environment and the characteristics of the infant's experiences.

Additionally, the researchers did not explore the potential applications of this sound tagging system beyond the research context, such as its use in clinical or educational settings. Further research is needed to investigate the real-world implications and practical applications of this technology.

It is also worth considering the ethical implications of using such a system, particularly in terms of privacy and the potential for misuse of the collected data. The researchers should address these concerns and outline appropriate safeguards and guidelines for the use of this technology.

Conclusion

Overall, this research represents a promising step towards developing more sophisticated and accurate systems for understanding infant-centric home soundscapes. By using advanced machine learning techniques like the audio spectrogram transformer, the researchers have demonstrated the potential to accurately identify and classify a wide range of domestic sound events that are relevant to infant care and development.

While there are still some limitations and areas for further exploration, this work could have significant implications for our understanding of child development and the design of interventions and support systems for families with young children. As the field of sound event detection continues to evolve, the insights and techniques presented in this paper may also find applications in other domains, such as healthcare and smart home automation.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Sound Tagging in Infant-centric Home Soundscapes

Mohammad Nur Hossain Khan, Jialu Li, Nancy L. McElwain, Mark Hasegawa-Johnson, Bashima Islam

Certain environmental noises have been associated with negative developmental outcomes for infants and young children. Though classifying or tagging sound events in a domestic environment is an active research area, previous studies focused on data collected from a non-stationary microphone placed in the environment or from the perspective of adults. Further, many of these works ignore infants or young children in the environment or have data collected from only a single family where noise from the fixed sound source can be moderate at the infant's position or vice versa. Thus, despite the recent success of large pre-trained models for noise event detection, the performance of these models on infant-centric noise soundscapes in the home is yet to be explored. To bridge this gap, we have collected and labeled noises in home soundscapes from 22 families in an unobtrusive manner, where the data are collected through an infant-worn recording device. In this paper, we explore the performance of a large pre-trained model (Audio Spectrogram Transformer [AST]) on our noise-conditioned infant-centric environmental data as well as publicly available home environmental datasets. Utilizing different training strategies such as resampling, utilizing public datasets, mixing public and infant-centric training sets, and data augmentation using noise and masking, we evaluate the performance of a large pre-trained model on sparse and imbalanced infant-centric data. Our results show that fine-tuning the large pre-trained model by combining our collected dataset with public datasets increases the F1-score from 0.11 (public datasets) and 0.76 (collected datasets) to 0.84 (combined datasets) and Cohen's Kappa from 0.013 (public datasets) and 0.77 (collected datasets) to 0.83 (combined datasets) compared to only training with public or collected datasets, respectively.

6/26/2024

The Sounds of Home: A Speech-Removed Residential Audio Dataset for Sound Event Detection

Gabriel Bibbò, Thomas Deacon, Arshdeep Singh, Mark D. Plumbley

This paper presents a residential audio dataset to support sound event detection research for smart home applications aimed at promoting wellbeing for older adults. The dataset is constructed by deploying audio recording systems in the homes of 8 participants aged 55-80 years for a 7-day period. Acoustic characteristics are documented through detailed floor plans and construction material information to enable replication of the recording environments for AI model deployment. A novel automated speech removal pipeline is developed, using pre-trained audio neural networks to detect and remove segments containing spoken voice, while preserving segments containing other sound events. The resulting dataset consists of privacy-compliant audio recordings that accurately capture the soundscapes and activities of daily living within residential spaces. The paper details the dataset creation methodology, the speech removal pipeline utilizing cascaded model architectures, and an analysis of the vocal label distribution to validate the speech removal process. This dataset enables the development and benchmarking of sound event detection models tailored specifically for in-home applications.

9/18/2024

Machine listening in a neonatal intensive care unit

Modan Tailleur (LS2N, Nantes Univ - ECN, LS2N - équipe SIMS), Vincent Lostanlen (LS2N, LS2N - équipe SIMS, Nantes Univ - ECN), Jean-Philippe Rivière (Nantes Univ, Nantes Univ - UFR FLCE, LS2N, LS2N - équipe PACCE), Pierre Aumond

Oxygenators, alarm devices, and footsteps are some of the most common sound sources in a hospital. Detecting them has scientific value for environmental psychology but comes with challenges of its own: namely, privacy preservation and limited labeled data. In this paper, we address these two challenges via a combination of edge computing and cloud computing. For privacy preservation, we have designed an acoustic sensor which computes third-octave spectrograms on the fly instead of recording audio waveforms. For sample-efficient machine learning, we have repurposed a pretrained audio neural network (PANN) via spectral transcoding and label space adaptation. A small-scale study in a neonatological intensive care unit (NICU) confirms that the time series of detected events align with another modality of measurement: i.e., electronic badges for parents and healthcare professionals. Hence, this paper demonstrates the feasibility of polyphonic machine listening in a hospital ward while guaranteeing privacy by design.

9/19/2024

Soundscape Captioning using Sound Affective Quality Network and Large Language Model

Yuanbo Hou, Qiaoqiao Ren, Andrew Mitchell, Wenwu Wang, Jian Kang, Tony Belpaeme, Dick Botteldooren

We live in a rich and varied acoustic world, which is experienced by individuals or communities as a soundscape. Computational auditory scene analysis, disentangling acoustic scenes by detecting and classifying events, focuses on objective attributes of sounds, such as their category and temporal characteristics, ignoring the effect of sounds on people and failing to explore the relationship between sounds and the emotions they evoke within a context. To fill this gap and to automate soundscape analysis, which traditionally relies on labour-intensive subjective ratings and surveys, we propose the soundscape captioning (SoundSCap) task. SoundSCap generates context-aware soundscape descriptions by capturing the acoustic scene, event information, and the corresponding human affective qualities. To this end, we propose an automatic soundscape captioner (SoundSCaper) composed of an acoustic model, SoundAQnet, and a general large language model (LLM). SoundAQnet simultaneously models multi-scale information about acoustic scenes, events, and perceived affective qualities, while LLM generates soundscape captions by parsing the information captured by SoundAQnet to a common language. The soundscape caption's quality is assessed by a jury of 16 audio/soundscape experts. The average score (out of 5) of SoundSCaper-generated captions is lower than the score of captions generated by two soundscape experts by 0.21 and 0.25, respectively, on the evaluation set and the model-unknown mixed external dataset with varying lengths and acoustic properties, but the differences are not statistically significant. Overall, SoundSCaper-generated captions show promising performance compared to captions annotated by soundscape experts. The models' code, LLM scripts, human assessment data and instructions, and expert evaluation statistics are all publicly available.

6/11/2024