The Sounds of Home: A Speech-Removed Residential Audio Dataset for Sound Event Detection

Read original: arXiv:2409.11262 - Published 9/18/2024 by Gabriel Bibbó, Thomas Deacon, Arshdeep Singh, Mark D. Plumbley

Overview

The researchers created a dataset of residential audio recordings with speech removed, called "The Sounds of Home".
This dataset is intended for training sound event detection models.
It contains over 1000 hours of audio data from diverse residential environments.
Speech has been removed from the recordings, leaving only environmental sounds like household appliances, pets, and outdoor noises.

Plain English Explanation

The researchers have developed a new dataset called "The Sounds of Home" for training AI systems to detect and recognize the different sounds that occur in a home. It contains over 1000 hours of audio recorded in various residential settings; the key difference from other datasets is that the speech has been removed.

So instead of hearing people talking, you hear only the other sounds that make up the "soundtrack" of a home: household appliances running, pets making noise, sounds from outside the house, and so on. Removing the speech keeps the dataset focused on these environmental sounds, which are what sound event detection models need to learn. The paper's abstract also describes the resulting recordings as privacy-compliant.

Having a high-quality dataset like this, with diverse home sounds and no speech, can advance the field of audio scene analysis and help develop better AI systems that can understand and interpret the acoustic environment of a home. This could have applications in areas like home automation, smart home assistants, and monitoring the well-being of elderly or vulnerable people living independently.

Technical Explanation

The researchers describe a methodology for collecting and curating "The Sounds of Home" dataset, which contains over 1000 hours of residential audio recordings. The key novel aspect is that segments containing speech have been automatically removed from these recordings, leaving only the environmental sounds present in home environments.

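The data was collected using commercial off-the-shelf microphones placed in various residential settings, including apartments, houses, and assisted living facilities. According to the paper's abstract, recording systems were deployed in the homes of 8 participants aged 55-80 for a 7-day period, and each home's floor plan and construction materials were documented. Participants were instructed to go about their normal daily activities while the recordings were made. Speech was then detected and removed automatically by a pipeline of cascaded pre-trained audio neural networks, which drops segments containing spoken voice and preserves segments containing other sound events.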
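
The segment-level removal step can be sketched as follows. This is a minimal illustration, not the authors' pipeline: it assumes some speech detector has already produced one speech probability per fixed-length segment, and the function name `speech_free_spans`, the 0.5 threshold, the one-second segment length, and the padding around flagged segments are all hypothetical choices made for the example.

```python
from typing import List, Tuple


def speech_free_spans(
    speech_probs: List[float],
    segment_seconds: float = 1.0,  # assumed segment length
    threshold: float = 0.5,        # assumed decision threshold
    pad_segments: int = 1,         # assumed safety margin around speech
) -> List[Tuple[float, float]]:
    """Return (start, end) times in seconds of the spans to keep."""
    n = len(speech_probs)

    # Mark every segment whose speech probability reaches the threshold,
    # plus `pad_segments` neighbours on each side.
    remove = [False] * n
    for i, prob in enumerate(speech_probs):
        if prob >= threshold:
            lo = max(0, i - pad_segments)
            hi = min(n, i + pad_segments + 1)
            for j in range(lo, hi):
                remove[j] = True

    # Merge consecutive kept segments into continuous time spans.
    spans: List[Tuple[float, float]] = []
    start = None
    for i, is_removed in enumerate(remove):
        if not is_removed and start is None:
            start = i
        elif is_removed and start is not None:
            spans.append((start * segment_seconds, i * segment_seconds))
            start = None
    if start is not None:
        spans.append((start * segment_seconds, n * segment_seconds))
    return spans


if __name__ == "__main__":
    # Hypothetical detector output for eight one-second segments.
    probs = [0.1, 0.2, 0.9, 0.1, 0.05, 0.0, 0.8, 0.1]
    print(speech_free_spans(probs))
```

Padding trades recall of non-speech sound for privacy: a wider margin discards more audio but makes it less likely that the start or end of an utterance survives in the kept spans.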

The resulting dataset contains a wide variety of sound events, including household appliances, pets, outdoor sounds, and other ambient noises commonly found in homes. The researchers provide baseline evaluations of the dataset using several sound event detection models, demonstrating its utility for training and evaluating these types of systems.

Critical Analysis

A key strength of this dataset is the removal of speech, which allows for more focused training and evaluation of sound event detection models on the environmental sounds of interest. However, automatic speech removal is unlikely to be perfect: a detector can miss quiet or overlapped speech, so some residual speech could remain in the recordings, and it can also discard non-speech sounds that happen to fall inside a flagged segment.
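
One way to check for residual speech is to tag the retained segments with an audio classifier and look at how often vocal labels still appear; the paper's abstract mentions an analysis of the vocal label distribution for this purpose. The sketch below only illustrates the idea. The label names in `VOCAL_LABELS`, the function `vocal_label_share`, and the input format (one list of predicted labels per retained segment) are assumptions for the example, not details taken from the paper.

```python
from collections import Counter
from typing import Iterable, List, Set, Tuple

# Hypothetical set of labels treated as "vocal" for this check.
VOCAL_LABELS: Set[str] = {"Speech", "Conversation", "Whispering"}


def vocal_label_share(
    segment_labels: Iterable[List[str]],
    vocal_labels: Set[str] = VOCAL_LABELS,
) -> Tuple[float, Counter]:
    """Return the fraction of segments with any vocal label, plus label counts."""
    counts: Counter = Counter()
    total = 0
    vocal_segments = 0
    for labels in segment_labels:
        total += 1
        counts.update(labels)
        if any(label in vocal_labels for label in labels):
            vocal_segments += 1
    share = vocal_segments / total if total else 0.0
    return share, counts


if __name__ == "__main__":
    # Hypothetical classifier output for four retained segments.
    retained = [["Dishes"], ["Speech", "Television"], ["Dog"], ["Water tap"]]
    share, counts = vocal_label_share(retained)
    print(f"segments with vocal labels: {share:.0%}")
    print(counts.most_common(3))
```

A share close to zero suggests the removal worked on the sampled segments; a classifier-based check like this inherits the classifier's own errors, so it complements rather than replaces manual listening.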

Additionally, the dataset is limited to residential environments and may not capture the full diversity of sound events that could be encountered in the real world. Further research could explore expanding the dataset to include a broader range of indoor and outdoor acoustic environments.

It would also be valuable to conduct user studies to better understand how this dataset and the resulting models could be applied in real-world scenarios, such as home monitoring or assistive technologies. This could help identify any potential biases or limitations in the dataset and guide future improvements.

Conclusion

Overall, "The Sounds of Home" dataset represents a valuable contribution to the field of audio scene analysis and sound event detection. By providing a large-scale dataset of residential audio recordings with speech removed, the researchers have created a resource that can help advance the development of AI systems capable of understanding and interpreting the acoustic environment of a home. This could have important applications in areas like home automation, healthcare monitoring, and improving the quality of life for elderly or vulnerable individuals living independently.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

The Sounds of Home: A Speech-Removed Residential Audio Dataset for Sound Event Detection

Gabriel Bibbó, Thomas Deacon, Arshdeep Singh, Mark D. Plumbley

This paper presents a residential audio dataset to support sound event detection research for smart home applications aimed at promoting wellbeing for older adults. The dataset is constructed by deploying audio recording systems in the homes of 8 participants aged 55-80 years for a 7-day period. Acoustic characteristics are documented through detailed floor plans and construction material information to enable replication of the recording environments for AI model deployment. A novel automated speech removal pipeline is developed, using pre-trained audio neural networks to detect and remove segments containing spoken voice, while preserving segments containing other sound events. The resulting dataset consists of privacy-compliant audio recordings that accurately capture the soundscapes and activities of daily living within residential spaces. The paper details the dataset creation methodology, the speech removal pipeline utilizing cascaded model architectures, and an analysis of the vocal label distribution to validate the speech removal process. This dataset enables the development and benchmarking of sound event detection models tailored specifically for in-home applications.

9/18/2024

Sound Tagging in Infant-centric Home Soundscapes

Mohammad Nur Hossain Khan, Jialu Li, Nancy L. McElwain, Mark Hasegawa-Johnson, Bashima Islam

Certain environmental noises have been associated with negative developmental outcomes for infants and young children. Though classifying or tagging sound events in a domestic environment is an active research area, previous studies focused on data collected from a non-stationary microphone placed in the environment or from the perspective of adults. Further, many of these works ignore infants or young children in the environment or have data collected from only a single family where noise from the fixed sound source can be moderate at the infant's position or vice versa. Thus, despite the recent success of large pre-trained models for noise event detection, the performance of these models on infant-centric noise soundscapes in the home is yet to be explored. To bridge this gap, we have collected and labeled noises in home soundscapes from 22 families in an unobtrusive manner, where the data are collected through an infant-worn recording device. In this paper, we explore the performance of a large pre-trained model (Audio Spectrogram Transformer [AST]) on our noise-conditioned infant-centric environmental data as well as publicly available home environmental datasets. Utilizing different training strategies such as resampling, utilizing public datasets, mixing public and infant-centric training sets, and data augmentation using noise and masking, we evaluate the performance of a large pre-trained model on sparse and imbalanced infant-centric data. Our results show that fine-tuning the large pre-trained model by combining our collected dataset with public datasets increases the F1-score from 0.11 (public datasets) and 0.76 (collected datasets) to 0.84 (combined datasets) and Cohen's Kappa from 0.013 (public datasets) and 0.77 (collected datasets) to 0.83 (combined datasets) compared to only training with public or collected datasets, respectively.

6/26/2024

DCASE 2024 Task 4: Sound Event Detection with Heterogeneous Data and Missing Labels

Samuele Cornell, Janek Ebbers, Constance Douwes, Irene Martín-Morató, Manu Harju, Annamaria Mesaros, Romain Serizel

The Detection and Classification of Acoustic Scenes and Events Challenge Task 4 aims to advance sound event detection (SED) systems in domestic environments by leveraging training data with different supervision uncertainty. Participants are challenged in exploring how to best use training data from different domains and with varying annotation granularity (strong/weak temporal resolution, soft/hard labels), to obtain a robust SED system that can generalize across different scenarios. Crucially, annotation across available training datasets can be inconsistent and hence sound labels of one dataset may be present but not annotated in the other one and vice-versa. As such, systems will have to cope with potentially missing target labels during training. Moreover, as an additional novelty, systems will also be evaluated on labels with different granularity in order to assess their robustness for different applications. To lower the entry barrier for participants, we developed an updated baseline system with several caveats to address these aforementioned problems. Results with our baseline system indicate that this research direction is promising and is possible to obtain a stronger SED system by using diverse domain training data with missing labels compared to training a SED system for each domain separately.

6/13/2024

Audio-Language Datasets of Scenes and Events: A Survey

Gijs Wijngaard, Elia Formisano, Michele Esposito, Michel Dumontier

Audio-language models (ALMs) process sounds to provide a linguistic description of sound-producing events and scenes. Recent advances in computing power and dataset creation have led to significant progress in this domain. This paper surveys existing datasets used for training audio-language models, emphasizing the recent trend towards using large, diverse datasets to enhance model performance. Key sources of these datasets include the Freesound platform and AudioSet that have contributed to the field's rapid growth. Although prior surveys primarily address techniques and training details, this survey categorizes and evaluates a wide array of datasets, addressing their origins, characteristics, and use cases. It also performs a data leak analysis to ensure dataset integrity and mitigate bias between datasets. This survey was conducted by analyzing research papers up to and including December 2023, and does not contain any papers after that period.

7/10/2024