DCASE 2024 Task 4: Sound Event Detection with Heterogeneous Data and Missing Labels

2406.08056

Published 6/13/2024 by Samuele Cornell, Janek Ebbers, Constance Douwes, Irene Martín-Morató, Manu Harju, Annamaria Mesaros, Romain Serizel

eess.AS cs.SD

Abstract

The Detection and Classification of Acoustic Scenes and Events Challenge Task 4 aims to advance sound event detection (SED) systems in domestic environments by leveraging training data with different supervision uncertainty. Participants are challenged to explore how best to use training data from different domains and with varying annotation granularity (strong/weak temporal resolution, soft/hard labels) to obtain a robust SED system that can generalize across different scenarios. Crucially, annotation across the available training datasets can be inconsistent, and hence sound labels of one dataset may be present but not annotated in the other one and vice versa. As such, systems will have to cope with potentially missing target labels during training. Moreover, as an additional novelty, systems will also be evaluated on labels with different granularity in order to assess their robustness for different applications. To lower the entry barrier for participants, we developed an updated baseline system with several caveats to address these aforementioned problems. Results with our baseline system indicate that this research direction is promising and that it is possible to obtain a stronger SED system by using diverse domain training data with missing labels compared to training a SED system for each domain separately.

Overview

• The paper discusses DCASE 2024 Task 4: Sound Event Detection with Heterogeneous Data and Missing Labels, which aims to develop sound event detection systems that can handle diverse audio data and incomplete annotation information.

• The challenge involves detecting sound events in domestic audio recordings, even when the training datasets come from different domains, differ in annotation granularity (strong/weak temporal resolution, soft/hard labels), or lack labels for some sound classes.

• Addressing this task can lead to advancements in areas like smart home automation, hearing aids, and wildlife monitoring, where accurate sound event detection is crucial.

Plain English Explanation

The paper is about a challenge in the field of sound event detection, which is the task of identifying which sounds occur in an audio recording and when each one starts and stops. This is an important capability for various applications, such as smart home automation, hearing aids, and wildlife monitoring.

The challenge, called DCASE 2024 Task 4, aims to develop sound event detection systems that can handle diverse audio data and missing labels in the training data. This means the systems need to detect sound events accurately even when the training recordings come from different domains, are annotated at different levels of detail (strong or weak temporal resolution, soft or hard labels), and lack annotations for some sound classes: a sound that is labeled in one dataset may be present but unannotated in another.

Addressing this challenge can lead to more robust and versatile sound event detection systems that can be deployed in real-world scenarios, where the available data may not be perfect or uniform.

Technical Explanation

The paper outlines DCASE 2024 Task 4, which focuses on sound event detection with heterogeneous data and missing labels. The task involves developing systems that detect sound events in domestic audio recordings, even when the training data comes from different domains and some of the labels (information about the types of sounds present) are missing.
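The annotation granularities named in the abstract (strong/weak temporal resolution, soft/hard labels) can be made concrete with a small sketch. The data layout, the class names, and the 0.5 threshold below are illustrative assumptions, not the challenge's actual file format or settings.

```python
from dataclasses import dataclass


@dataclass
class StrongEvent:
    """One strongly labeled event: a class name plus onset/offset in seconds."""
    label: str
    onset: float
    offset: float


def strong_to_weak(events):
    """Collapse timestamped (strong) labels into a clip-level (weak) label list."""
    return sorted({event.label for event in events})


def soft_to_hard(soft_scores, threshold=0.5):
    """Binarize soft labels (scores in [0, 1]); the threshold is a hypothetical choice."""
    return {label: score >= threshold for label, score in soft_scores.items()}


# Illustrative clip with strong annotations (class names are examples only).
clip = [
    StrongEvent("Speech", 0.4, 2.1),
    StrongEvent("Dishes", 1.5, 3.0),
    StrongEvent("Speech", 4.2, 5.0),
]
weak = strong_to_weak(clip)

# Illustrative soft labels for one time segment of another clip.
hard = soft_to_hard({"people talking": 0.8, "cutlery and dishes": 0.3})
```

Going from strong to weak, or from soft to hard, discards information, which is one reason a system trained on mixed granularities has to treat each supervision type differently.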

To achieve this, the task provides participants with training data from different domains and with varying annotation granularity (strong/weak temporal resolution, soft/hard labels), where a sound class annotated in one dataset may be present but not annotated in another. Participants need to design and train models that handle this heterogeneous data and these missing labels while still detecting sound events accurately. Systems are also evaluated on labels with different granularity to assess their robustness for different applications.
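One common way to cope with missing target labels is to exclude unannotated classes from the training loss, so that a sound which is present but unlabeled is not penalized as a false positive. The sketch below illustrates that general idea with a masked binary cross-entropy. It is an assumption for illustration and is not claimed to be the baseline's implementation; the function name and the mask convention are hypothetical.

```python
import math


def masked_bce(probs, targets, mask, eps=1e-7):
    """Mean binary cross-entropy over annotated classes only.

    probs:   predicted probability per class
    targets: 0/1 (or soft) target per class; ignored where mask is 0
    mask:    1 if the class is annotated in the clip's source dataset, else 0
    """
    total, count = 0.0, 0
    for p, t, m in zip(probs, targets, mask):
        if not m:
            continue  # class not annotated in this dataset: contributes no loss
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total += -(t * math.log(p) + (1.0 - t) * math.log(1.0 - p))
        count += 1
    return total / count if count else 0.0


# Three classes; the third is not annotated in this clip's source dataset.
probs = [0.9, 0.2, 0.7]
targets = [1.0, 0.0, 0.0]
mask = [1, 1, 0]
loss = masked_bce(probs, targets, mask)
```

Because the third class is masked out, its placeholder target of 0.0 has no effect on the loss, whatever the model predicts for it.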

The task aims to push the boundaries of sound event detection systems, moving beyond the typical assumption of having well-curated and fully-labeled training data. By addressing the challenges posed by this task, researchers can develop more robust and versatile sound event detection models that can be deployed in real-world applications where the available data may be diverse and incomplete.

Critical Analysis

The paper highlights the importance of addressing the challenge of sound event detection with heterogeneous data and missing labels, as it represents a more realistic scenario that sound event detection systems are likely to encounter in real-world applications. By requiring participants to handle diverse audio data and incomplete annotations, the task encourages the development of more adaptable and generalizable models.

However, the paper does not delve into the specific technical approaches or evaluation metrics that will be used in the challenge. It would be helpful to have more details on the dataset characteristics, the types of sound events to be detected, and the evaluation criteria to better understand the scope and difficulties of the task.

Additionally, the paper does not discuss potential limitations or caveats of the proposed approach. It would be valuable to consider how the models developed for this task may perform in even more complex or challenging scenarios, such as when dealing with a larger number of sound event classes, highly overlapping sound events, or significant variations in the recording conditions.

Conclusion

The DCASE 2024 Task 4 on Sound Event Detection with Heterogeneous Data and Missing Labels presents an important challenge in the field of sound event detection. By requiring participants to handle diverse audio data and incomplete annotations, the task aims to push the boundaries of current sound event detection systems and facilitate the development of more robust and versatile models.

Addressing this challenge can lead to advancements in various applications, such as smart home automation, hearing aids, and wildlife monitoring, where accurate and reliable sound event detection is crucial. The insights gained from this task can contribute to the broader goal of creating sound event detection systems that can operate effectively in real-world scenarios with imperfect or heterogeneous data.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Sound event detection based on auxiliary decoder and maximum probability aggregation for DCASE Challenge 2024 Task 4

Sang Won Son, Jongyeon Park, Hong Kook Kim, Sulaiman Vesal, Jeong Eun Lim

In this report, we propose three novel methods for developing a sound event detection (SED) model for the DCASE 2024 Challenge Task 4. First, we propose an auxiliary decoder attached to the final convolutional block to improve feature extraction capabilities while reducing dependency on embeddings from pre-trained large models. The proposed auxiliary decoder operates independently from the main decoder, enhancing performance of the convolutional block during the initial training stages by assigning a different weight strategy between main and auxiliary decoder losses. Next, to address the time interval issue between the DESED and MAESTRO datasets, we propose maximum probability aggregation (MPA) during the training step. The proposed MPA method enables the model's output to be aligned with soft labels of 1 s in the MAESTRO dataset. Finally, we propose a multi-channel input feature that employs various versions of logmel and MFCC features to generate time-frequency patterns. The experimental results demonstrate the efficacy of these proposed methods in terms of improving SED performance by achieving a balanced enhancement across different datasets and label types. Ultimately, this approach presents a significant step forward in developing more robust and flexible SED models.

6/26/2024

eess.AS cs.SD

Sound Event Bounding Boxes

Janek Ebbers, Francois G. Germain, Gordon Wichern, Jonathan Le Roux

Sound event detection is the task of recognizing sounds and determining their extent (onset/offset times) within an audio clip. Existing systems commonly predict sound presence confidence in short time frames. Then, thresholding produces binary frame-level presence decisions, with the extent of individual events determined by merging consecutive positive frames. In this paper, we show that frame-level thresholding degrades the prediction of the event extent by coupling it with the system's sound presence confidence. We propose to decouple the prediction of event extent and confidence by introducing SEBBs, which format each sound event prediction as a tuple of a class type, extent, and overall confidence. We also propose a change-detection-based algorithm to convert legacy frame-level outputs into SEBBs. We find the algorithm significantly improves the performance of DCASE 2023 Challenge systems, boosting the state of the art from .644 to .686 PSDS1.

6/7/2024

eess.AS cs.SD

Description and Discussion on DCASE 2024 Challenge Task 2: First-Shot Unsupervised Anomalous Sound Detection for Machine Condition Monitoring

Tomoya Nishida, Noboru Harada, Daisuke Niizumi, Davide Albertini, Roberto Sannino, Simone Pradolini, Filippo Augusti, Keisuke Imoto, Kota Dohi, Harsh Purohit, Takashi Endo, Yohei Kawaguchi

We present the task description of the Detection and Classification of Acoustic Scenes and Events (DCASE) 2024 Challenge Task 2: First-shot unsupervised anomalous sound detection (ASD) for machine condition monitoring. Continuing from last year's DCASE 2023 Challenge Task 2, we organize the task as a first-shot problem under domain generalization required settings. The main goal of the first-shot problem is to enable rapid deployment of ASD systems for new kinds of machines without the need for machine-specific hyperparameter tuning. This problem setting was realized by (1) giving only one section for each machine type and (2) having completely different machine types for the development and evaluation datasets. For the DCASE 2024 Challenge Task 2, data of completely new machine types were newly collected and provided as the evaluation dataset. In addition, attribute information such as the machine operation conditions was concealed for several machine types to mimic situations where such information is unavailable. We will add challenge results and analysis of the submissions after the challenge submission deadline.

6/12/2024

eess.AS cs.LG cs.SD

Data-Efficient Low-Complexity Acoustic Scene Classification in the DCASE 2024 Challenge

Florian Schmid, Paul Primus, Toni Heittola, Annamaria Mesaros, Irene Martín-Morató, Khaled Koutini, Gerhard Widmer

This article describes the Data-Efficient Low-Complexity Acoustic Scene Classification Task in the DCASE 2024 Challenge and the corresponding baseline system. The task setup is a continuation of previous editions (2022 and 2023), which focused on recording device mismatches and low-complexity constraints. This year's edition introduces an additional real-world problem: participants must develop data-efficient systems for five scenarios, which progressively limit the available training data. The provided baseline system is based on an efficient, factorized CNN architecture constructed from inverted residual blocks and uses Freq-MixStyle to tackle the device mismatch problem. The baseline system's accuracy ranges from 42.40% on the smallest to 56.99% on the largest training set.

5/17/2024

eess.AS cs.SD