Self Training and Ensembling Frequency Dependent Networks with Coarse Prediction Pooling and Sound Event Bounding Boxes

Read original: arXiv:2406.15725 - Published 6/26/2024 by Hyeonuk Nam, Deokki Min, Seungdeok Choi, Inhan Choi, Yong-Hwa Park

Self Training and Ensembling Frequency Dependent Networks with Coarse Prediction Pooling and Sound Event Bounding Boxes

Overview

This paper proposes a novel approach for sound event detection and localization using frequency-dependent neural networks, self-training, and ensemble methods.
The key ideas include coarse prediction pooling, sound event bounding boxes, and leveraging frequency-adaptive convolutional kernels to capture different sound characteristics.
The method also incorporates self-training and model ensembling techniques to improve performance.

Plain English Explanation

The paper focuses on the task of sound event detection and localization, which involves identifying the presence and location of various sounds like speech, music, or environmental noises in audio recordings. The authors propose a system that uses specialized neural network models to tackle this problem.

The key innovation is the use of "frequency-dependent" neural networks, which means the model has different components that are specialized to process different frequency ranges of the audio signal. This allows the model to better capture the distinctive characteristics of different types of sounds.

The authors also introduce a technique called "coarse prediction pooling" that helps the model make more accurate predictions by aggregating information across different frequency bands. Another novel aspect is the use of "sound event bounding boxes" to precisely locate the position of sound events in the audio.

To further boost performance, the researchers leverage self-training and model ensembling. Self-training involves the model refining its own predictions on unlabeled data, while ensembling combines the outputs of multiple models to arrive at more reliable results.

Overall, this work aims to advance the state-of-the-art in sound event detection and localization, which has important applications in areas like smart home technology, autonomous vehicles, and audio surveillance.

Technical Explanation

The paper presents a multi-stage framework for sound event detection and localization. At the core of the system are frequency-dependent neural networks, which have separate convolutional and pooling layers specialized for different frequency bands of the input audio spectrogram.

This frequency-dependent architecture is inspired by work on frequency-adaptive convolution kernels and auxiliary decoders for sound event detection. The authors hypothesize that this structure can better capture the diverse spectral characteristics of different sound classes.

To further improve performance, the authors introduce "coarse prediction pooling", which aggregates the frequency-dependent predictions into a single output. This helps the model make more robust decisions by considering evidence across the full spectrum.

Additionally, the authors utilize sound event bounding boxes to provide localization information along with the sound event detection.

To leverage unlabeled data, the authors employ a self-training strategy similar to frequency-mix knowledge distillation. They also create an ensemble of the frequency-dependent models to further boost performance.

The proposed framework is evaluated on public sound event detection benchmarks, demonstrating state-of-the-art results. The authors attribute the improvements to the synergistic effects of the frequency-dependent architecture, coarse prediction pooling, sound event bounding boxes, self-training, and model ensembling.

Critical Analysis

The paper presents a well-designed and comprehensive framework that effectively leverages multiple novel techniques to advance the state-of-the-art in sound event detection and localization. The frequency-dependent architecture and coarse prediction pooling are particularly interesting contributions that could have broader applications beyond this specific task.

However, the paper does not provide much insight into the relative importance of each individual component. It would be helpful to understand the performance impact of the different techniques, as well as any potential trade-offs or limitations. For example, the self-training strategy may be sensitive to the quality of the unlabeled data, and the ensemble approach could increase computational complexity during inference.

Additionally, the paper focuses on established benchmark datasets, but it would be valuable to see how the proposed system performs on real-world, unconstrained audio recordings, which may present additional challenges.

Overall, this work represents a significant advancement in the field and provides a solid foundation for future research. By continuing to explore the synergies between specialized network architectures, self-supervision, and ensemble methods, the authors may uncover even more powerful solutions for audio scene understanding.

Conclusion

This paper introduces a novel framework for sound event detection and localization that combines frequency-dependent neural networks, coarse prediction pooling, sound event bounding boxes, self-training, and model ensembling. The key innovations enable the system to better capture the diverse spectral characteristics of different sound classes and leverage unlabeled data to improve performance.

The proposed approach demonstrates state-of-the-art results on established benchmarks, highlighting its potential for real-world applications in areas such as smart home technology, autonomous vehicles, and audio surveillance. By continuing to explore the synergies between specialized network architectures, self-supervision, and ensemble methods, the authors may uncover even more powerful solutions for audio scene understanding.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Follow @aimodelsfyi on 𝕏 →

Related Papers

Self Training and Ensembling Frequency Dependent Networks with Coarse Prediction Pooling and Sound Event Bounding Boxes

Hyeonuk Nam, Deokki Min, Seungdeok Choi, Inhan Choi, Yong-Hwa Park

To tackle sound event detection (SED) task, we propose frequency dependent networks (FreDNets), which heavily leverage frequency-dependent methods. We apply frequency warping and FilterAugment, which are frequency-dependent data augmentation methods. The model architecture consists of 3 branches: audio teacher-student transformer (ATST) branch, BEATs branch and CNN branch including either partial dilated frequency dynamic convolution (PDFD) or squeeze-and-Excitation (SE) with time-frame frequency-wise SE (tfwSE). To train MAESTRO labels with coarse temporal resolution, we apply max pooling on prediction for the MAESTRO dataset. Using best ensemble model, we apply self training to obtain pseudo label from DESED weak set, DESED unlabeled set and AudioSet. AudioSet labels are filtered to focus on high-confidence pseudo labels and AudioSet pseudo labels are used to train on DESED labels only. We used change-detection-based sound event bounding boxes (cSEBBs) as post processing for ensemble models on self training and submission models.

6/26/2024

Multi-Iteration Multi-Stage Fine-Tuning of Transformers for Sound Event Detection with Heterogeneous Datasets

Florian Schmid, Paul Primus, Tobias Morocutti, Jonathan Greif, Gerhard Widmer

A central problem in building effective sound event detection systems is the lack of high-quality, strongly annotated sound event datasets. For this reason, Task 4 of the DCASE 2024 challenge proposes learning from two heterogeneous datasets, including audio clips labeled with varying annotation granularity and with different sets of possible events. We propose a multi-iteration, multi-stage procedure for fine-tuning Audio Spectrogram Transformers on the joint DESED and MAESTRO Real datasets. The first stage closely matches the baseline system setup and trains a CRNN model while keeping the pre-trained transformer model frozen. In the second stage, both CRNN and transformer are fine-tuned using heavily weighted self-supervised losses. After the second stage, we compute strong pseudo-labels for all audio clips in the training set using an ensemble of fine-tuned transformers. Then, in a second iteration, we repeat the two-stage training process and include a distillation loss based on the pseudo-labels, achieving a new single-model, state-of-the-art performance on the public evaluation set of DESED with a PSDS1 of 0.692. A single model and an ensemble, both based on our proposed training procedure, ranked first in Task 4 of the DCASE Challenge 2024.

7/19/2024

Improving Audio Spectrogram Transformers for Sound Event Detection Through Multi-Stage Training

Florian Schmid, Paul Primus, Tobias Morocutti, Jonathan Greif, Gerhard Widmer

This technical report describes the CP-JKU team's submission for Task 4 Sound Event Detection with Heterogeneous Training Datasets and Potentially Missing Labels of the DCASE 24 Challenge. We fine-tune three large Audio Spectrogram Transformers, PaSST, BEATs, and ATST, on the joint DESED and MAESTRO datasets in a two-stage training procedure. The first stage closely matches the baseline system setup and trains a CRNN model while keeping the large pre-trained transformer model frozen. In the second stage, both CRNN and transformer are fine-tuned using heavily weighted self-supervised losses. After the second stage, we compute strong pseudo-labels for all audio clips in the training set using an ensemble of all three fine-tuned transformers. Then, in a second iteration, we repeat the two-stage training process and include a distillation loss based on the pseudo-labels, boosting single-model performance substantially. Additionally, we pre-train PaSST and ATST on the subset of AudioSet that comes with strong temporal labels, before fine-tuning them on the Task 4 datasets.

8/6/2024

Mixstyle based Domain Generalization for Sound Event Detection with Heterogeneous Training Data

Yang Xiao, Han Yin, Jisheng Bai, Rohan Kumar Das

This work explores domain generalization (DG) for sound event detection (SED), advancing adaptability towards real-world scenarios. Our approach employs a mean-teacher framework with domain generalization to integrate heterogeneous training data, while preserving the SED model performance across the datasets. Specifically, we first apply mixstyle to the frequency dimension to adapt the mel-spectrograms from different domains. Next, we use the adaptive residual normalization method to generalize features across multiple domains by applying instance normalization in the frequency dimension. Lastly, we use the sound event bounding boxes method for post-processing. Our approach integrates features from bidirectional encoder representations from audio transformers and a convolutional recurrent neural network. We evaluate the proposed approach on DCASE 2024 Challenge Task 4 dataset, measuring polyphonic SED score (PSDS) on the DESED dataset and macro-average pAUC on the MAESTRO dataset. The results indicate that the proposed DG-based method improves both PSDS and macro-average pAUC compared to the challenge baseline.

8/30/2024