FMSG-JLESS Submission for DCASE 2024 Task4 on Sound Event Detection with Heterogeneous Training Dataset and Potentially Missing Labels

Read original: arXiv:2407.00291 - Published 7/2/2024 by Yang Xiao, Han Yin, Jisheng Bai, Rohan Kumar Das

Overview

This paper presents the FMSG-JLESS submission for DCASE 2024 Task 4 on sound event detection with a heterogeneous training dataset and potentially missing labels.
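The task targets sound event detection (SED) in domestic environments: recognizing event classes and their time boundaries when multiple events may overlap, using training data from two sources whose annotations can be inconsistent.
According to the authors' abstract, the submission takes a domain generalization approach. It combines features from bidirectional encoder representations from audio transformers with a convolutional recurrent neural network, applies mixstyle along the frequency dimension, trains with dataset-specific losses, and post-processes predictions with sound event bounding boxes.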

Plain English Explanation

The researchers have developed a new method for detecting sound events in complex audio environments. This is important for applications like smart home assistants and acoustic monitoring.

The key idea is domain generalization: training a single model that works well on audio from different sources without being told which source a clip came from. To do this, the authors apply a technique called "mixstyle" along the frequency dimension of the spectrograms during training, which makes the model less sensitive to the recording characteristics of any one dataset. They also compute the training loss separately for each dataset, using only the sound classes that dataset actually labels, and refine the model's output with a post-processing step called sound event bounding boxes.

The DCASE 2024 Task 4 challenge combines training datasets that differ in domain and annotation style, so a sound that is labeled in one dataset may be present but unlabeled in the other. The researchers' approach aims to handle these challenges, leading to better sound event detection performance.

Technical Explanation

The paper proposes the FMSG-JLESS system for DCASE 2024 Task 4 on sound event detection with heterogeneous training data and potentially missing labels. Based on the authors' abstract, the key technical components are:

Feature Fusion: The model integrates features from bidirectional encoder representations from audio transformers with a convolutional recurrent neural network (CRNN).
Frequency-Wise Mixstyle: Mixstyle is applied along the frequency dimension to adapt mel-spectrograms from different domains, supporting domain generalization when the source of a clip is unknown at evaluation time.
Dataset-Specific Loss: The training loss is computed separately for each dataset over its corresponding classes. The authors describe this as an independent learning framework that helps the model extract domain-specific features.
Sound Event Bounding Boxes: Predictions are post-processed with the sound event bounding boxes method.
Heterogeneous Dataset: The DCASE 2024 Task 4 training data comes from two sources, and the source of an audio clip is not known during evaluation.
Potentially Missing Labels: Annotations across the training datasets can be inconsistent, so an event labeled in one dataset may be present but not annotated in the other. Systems therefore have to cope with missing target labels during training.
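
One way to picture training with missing labels is a loss that skips any class the clip's source dataset does not annotate. The sketch below is an illustration of that general idea, not the authors' implementation: the function name `masked_bce`, the `class_mask` argument, and the three-class example are all assumptions made for this example.

```python
import math


def masked_bce(probs, targets, class_mask, eps=1e-7):
    """Mean binary cross-entropy over annotated classes only.

    probs: predicted probabilities, one per class.
    targets: 0/1 labels, one per class.
    class_mask: True where the clip's source dataset annotates the class
    (a hypothetical per-dataset mask, assumed for this sketch).
    """
    total, count = 0.0, 0
    for p, t, keep in zip(probs, targets, class_mask):
        if not keep:
            continue  # unannotated class: contributes nothing to the loss
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total += -(t * math.log(p) + (1 - t) * math.log(1 - p))
        count += 1
    return total / count if count else 0.0


# Hypothetical three-class clip: the third class is not annotated in the
# clip's source dataset, so its prediction is excluded from the loss.
loss = masked_bce([0.9, 0.2, 0.5], [1, 0, 1], [True, True, False])
```

Because masked classes are skipped entirely, a confident prediction for an event that the source dataset never labels is neither rewarded nor penalized.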

The authors evaluate the FMSG-JLESS approach on the DCASE 2024 Challenge Task 4 validation dataset and public evaluation dataset. They report superior macro-average pAUC and polyphonic SED score (PSDS) performance on both.
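
Of the strategies named in the authors' abstract, mixstyle along the frequency dimension is the easiest to illustrate in isolation. The sketch below assumes a common formulation in which each frequency bin's mean and standard deviation over time are mixed with those of another clip in the batch. The function name `freq_mixstyle`, the nested-list layout, and the default `alpha` and `eps` values are assumptions for this example, not details taken from the paper.

```python
import random
import statistics


def freq_mixstyle(batch, alpha=0.3, eps=1e-6, rng=None):
    """Mix per-frequency-bin statistics between clips in a batch.

    batch: list of spectrograms; each spectrogram is a list of frequency
    rows, and each row is a list of float values over time.
    alpha, eps: hypothetical defaults chosen for illustration only.
    """
    rng = rng or random.Random()
    partners = list(range(len(batch)))
    rng.shuffle(partners)  # random permutation; a clip may pair with itself
    mixed = []
    for spec, j in zip(batch, partners):
        lam = rng.betavariate(alpha, alpha)  # mixing weight in [0, 1]
        new_spec = []
        for row, other in zip(spec, batch[j]):
            mu, sd = statistics.fmean(row), statistics.pstdev(row) + eps
            mu_o, sd_o = statistics.fmean(other), statistics.pstdev(other) + eps
            mix_mu = lam * mu + (1 - lam) * mu_o
            mix_sd = lam * sd + (1 - lam) * sd_o
            # Normalize the row, then re-scale it with the mixed statistics.
            new_spec.append([(x - mu) / sd * mix_sd + mix_mu for x in row])
        mixed.append(new_spec)
    return mixed
```

The temporal pattern within each frequency bin is preserved. Only its offset and scale move toward those of the partner clip, which is what lets spectrograms from one domain take on the frequency characteristics of another.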

Critical Analysis

The paper provides a well-designed solution to the challenges of DCASE 2024 Task 4, which closely reflects real-world sound event detection scenarios. Combining frequency-wise mixstyle, dataset-specific training losses, and sound event bounding boxes post-processing is a promising way to handle heterogeneous datasets with potentially missing labels.

However, the paper does not provide a detailed analysis of the limitations of the proposed method. For example, it would be helpful to understand how the method performs on specific types of sound events or in the presence of different levels of label noise. Additionally, the authors could explore the trade-offs between the complexity of the model and its performance, as well as the computational and memory requirements of the FMSG-JLESS approach.

Further research could also investigate the generalizability of the method to other sound event detection tasks or datasets, as well as explore the potential for incorporating additional techniques, such as data augmentation or transfer learning, to further improve the model's performance.

Conclusion

The FMSG-JLESS submission for DCASE 2024 Task 4 presents a domain generalization approach to sound event detection in challenging real-world environments. By combining frequency-wise mixstyle, dataset-specific training losses, and sound event bounding boxes post-processing, the method is reported to handle heterogeneous training data with potentially missing labels effectively.

The technical insights and experimental results provided in this paper contribute valuable knowledge to the field of sound event detection, which has important applications in areas such as smart home automation, acoustic monitoring, and audio-based surveillance. As the researchers continue to refine and expand upon this work, the potential impact on real-world sound event detection systems is likely to grow.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

FMSG-JLESS Submission for DCASE 2024 Task4 on Sound Event Detection with Heterogeneous Training Dataset and Potentially Missing Labels

Yang Xiao, Han Yin, Jisheng Bai, Rohan Kumar Das

This report presents the systems developed and submitted by Fortemedia Singapore (FMSG) and Joint Laboratory of Environmental Sound Sensing (JLESS) for DCASE 2024 Task 4. The task focuses on recognizing event classes and their time boundaries, given that multiple events can be present and may overlap in an audio recording. The novelty this year is a dataset with two sources, making it challenging to achieve good performance without knowing the source of the audio clips during evaluation. To address this, we propose a sound event detection method using domain generalization. Our approach integrates features from bidirectional encoder representations from audio transformers and a convolutional recurrent neural network. We focus on three main strategies to improve our method. First, we apply mixstyle to the frequency dimension to adapt the mel-spectrograms from different domains. Second, we consider the training loss of our model specific to each dataset for its corresponding classes. This independent learning framework helps the model extract domain-specific features effectively. Lastly, we use the sound event bounding boxes method for post-processing. Our proposed method shows superior macro-average pAUC and polyphonic SED score performance on the DCASE 2024 Challenge Task 4 validation dataset and public evaluation dataset.

7/2/2024

DCASE 2024 Task 4: Sound Event Detection with Heterogeneous Data and Missing Labels

Samuele Cornell, Janek Ebbers, Constance Douwes, Irene Martín-Morató, Manu Harju, Annamaria Mesaros, Romain Serizel

The Detection and Classification of Acoustic Scenes and Events Challenge Task 4 aims to advance sound event detection (SED) systems in domestic environments by leveraging training data with different supervision uncertainty. Participants are challenged in exploring how to best use training data from different domains and with varying annotation granularity (strong/weak temporal resolution, soft/hard labels), to obtain a robust SED system that can generalize across different scenarios. Crucially, annotation across available training datasets can be inconsistent and hence sound labels of one dataset may be present but not annotated in the other one and vice-versa. As such, systems will have to cope with potentially missing target labels during training. Moreover, as an additional novelty, systems will also be evaluated on labels with different granularity in order to assess their robustness for different applications. To lower the entry barrier for participants, we developed an updated baseline system with several caveats to address these aforementioned problems. Results with our baseline system indicate that this research direction is promising and that it is possible to obtain a stronger SED system by using diverse domain training data with missing labels compared to training a SED system for each domain separately.

6/13/2024

Mixstyle based Domain Generalization for Sound Event Detection with Heterogeneous Training Data

Yang Xiao, Han Yin, Jisheng Bai, Rohan Kumar Das

This work explores domain generalization (DG) for sound event detection (SED), advancing adaptability towards real-world scenarios. Our approach employs a mean-teacher framework with domain generalization to integrate heterogeneous training data, while preserving the SED model performance across the datasets. Specifically, we first apply mixstyle to the frequency dimension to adapt the mel-spectrograms from different domains. Next, we use the adaptive residual normalization method to generalize features across multiple domains by applying instance normalization in the frequency dimension. Lastly, we use the sound event bounding boxes method for post-processing. Our approach integrates features from bidirectional encoder representations from audio transformers and a convolutional recurrent neural network. We evaluate the proposed approach on DCASE 2024 Challenge Task 4 dataset, measuring polyphonic SED score (PSDS) on the DESED dataset and macro-average pAUC on the MAESTRO dataset. The results indicate that the proposed DG-based method improves both PSDS and macro-average pAUC compared to the challenge baseline.

8/30/2024

Multi-Iteration Multi-Stage Fine-Tuning of Transformers for Sound Event Detection with Heterogeneous Datasets

Florian Schmid, Paul Primus, Tobias Morocutti, Jonathan Greif, Gerhard Widmer

A central problem in building effective sound event detection systems is the lack of high-quality, strongly annotated sound event datasets. For this reason, Task 4 of the DCASE 2024 challenge proposes learning from two heterogeneous datasets, including audio clips labeled with varying annotation granularity and with different sets of possible events. We propose a multi-iteration, multi-stage procedure for fine-tuning Audio Spectrogram Transformers on the joint DESED and MAESTRO Real datasets. The first stage closely matches the baseline system setup and trains a CRNN model while keeping the pre-trained transformer model frozen. In the second stage, both CRNN and transformer are fine-tuned using heavily weighted self-supervised losses. After the second stage, we compute strong pseudo-labels for all audio clips in the training set using an ensemble of fine-tuned transformers. Then, in a second iteration, we repeat the two-stage training process and include a distillation loss based on the pseudo-labels, achieving a new single-model, state-of-the-art performance on the public evaluation set of DESED with a PSDS1 of 0.692. A single model and an ensemble, both based on our proposed training procedure, ranked first in Task 4 of the DCASE Challenge 2024.

7/19/2024