Pushing the Limit of Sound Event Detection with Multi-Dilated Frequency Dynamic Convolution

Read original: arXiv:2406.13312 - Published 7/9/2024 by Hyeonuk Nam, Yong-Hwa Park

Overview

This paper proposes two variants of frequency dynamic convolution (FDY conv) for sound event detection (SED): partial frequency dynamic convolution (PFD conv) and multi-dilated frequency dynamic convolution (MDFD conv).
PFD conv concatenates the output of a static convolution with the output of a dynamic FDY conv, which limits the growth in model size caused by FDY conv's multiple basis kernels while maintaining performance. MDFD conv integrates several dilated frequency dynamic convolution (DFD conv) branches with different dilation size sets, plus a static branch, within a single convolution module.
The authors report that MDFD conv improves the polyphonic sound detection score (PSDS) by 3.17% over FDY conv, and they support both methods with extensive ablation studies.

Plain English Explanation

The paper describes a new machine learning model designed to identify different types of sounds, such as a dog barking, a car honking, or a person clapping. This task, known as sound event detection, is an important capability for many real-world applications like smart home automation, autonomous vehicles, and audio analysis.

The core idea builds on frequency dynamic convolution (FDY conv), a layer that applies different convolution kernels to different frequency regions of the audio instead of one fixed kernel everywhere. The paper adds two refinements. Partial frequency dynamic convolution (PFD conv) keeps part of the layer as an ordinary static convolution to limit model size. Multi-dilated frequency dynamic convolution (MDFD conv) runs several dilated branches side by side so the layer can pick up patterns of different extents in the spectrogram.

With these changes, the authors report a 3.17% improvement in polyphonic sound detection score (PSDS), an evaluation metric for this task, over FDY conv. Better detection of this kind could support practical applications such as speech detection for virtual assistants or environmental sound monitoring for smart cities.

Technical Explanation

The paper's two technical contributions are partial frequency dynamic convolution (PFD conv) and multi-dilated frequency dynamic convolution (MDFD conv), both variants of frequency dynamic convolution (FDY conv).

MDFD conv integrates multiple dilated frequency dynamic convolution (DFD conv) branches, each with a different set of dilation sizes, and a static branch within a single convolution module. Dilation spaces out the taps of a kernel, so branches with different dilation sizes cover different spectro-temporal ranges without needing larger kernels.
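
To make the effect of dilation concrete, here is a minimal sketch in plain Python of a 1-D dilated convolution run at several dilation rates. It is a generic illustration, not the paper's module: the function names, the ramp input, and the difference kernel are all hypothetical choices made for this example.

```python
def dilated_conv1d(signal, kernel, dilation):
    """Valid (unpadded) 1-D cross-correlation with kernel taps `dilation` apart."""
    # Receptive field of one output sample, measured in input samples.
    span = (len(kernel) - 1) * dilation + 1
    out = []
    for start in range(len(signal) - span + 1):
        out.append(sum(w * signal[start + i * dilation]
                       for i, w in enumerate(kernel)))
    return out


def multi_dilated(signal, kernel, dilations):
    """Apply the same kernel at several dilation rates (one output per rate)."""
    return {d: dilated_conv1d(signal, kernel, d) for d in dilations}


# Hypothetical input: a ramp, filtered with a simple difference kernel.
signal = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
kernel = [1, 0, -1]
outputs = multi_dilated(signal, kernel, dilations=[1, 2, 3])

print(outputs[1])  # [-2, -2, -2, -2, -2, -2, -2, -2]
print(outputs[3])  # [-6, -6, -6, -6]
```

The same three-tap kernel spans 3, 5, and 7 input samples at dilation rates 1, 2, and 3, which is why the outputs get shorter and the measured difference grows as the rate increases.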

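FDY conv, the mechanism these variants extend, uses frequency-adaptive kernels obtained from a frequency-varying combination of basis kernels, so the kernel applied to the input differs from one frequency region to another. Because it stores multiple basis kernels, FDY conv substantially increases model size. PFD conv addresses this by concatenating the output of a static convolution with the output of a dynamic FDY conv, which limits the size increase while maintaining performance.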
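
The idea of a frequency-adaptive kernel can be sketched as a per-frequency weighted mix of a few basis kernels. The snippet below is a simplified illustration under stated assumptions: kernels are flat lists of taps, and the attention logits are fixed, hypothetical numbers, whereas a real model would have to produce them from the input. The function names are invented for this example.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def frequency_adaptive_kernels(basis_kernels, logits_per_bin):
    """Mix K basis kernels into one kernel per frequency bin.

    basis_kernels: list of K kernels, each a flat list of taps.
    logits_per_bin: list of F lists, each holding K attention logits
                    (hypothetical fixed values in this sketch).
    """
    n_taps = len(basis_kernels[0])
    mixed = []
    for logits in logits_per_bin:
        weights = softmax(logits)  # weights over basis kernels sum to 1
        mixed.append([
            sum(w * kern[i] for w, kern in zip(weights, basis_kernels))
            for i in range(n_taps)
        ])
    return mixed


basis = [[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]   # two basis kernels
logits = [[0.0, 0.0], [10.0, -10.0]]          # two frequency bins
kernels = frequency_adaptive_kernels(basis, logits)
```

In this toy case the first frequency bin receives an even blend of both basis kernels, while the second bin is dominated by the first basis kernel, so the two bins are filtered differently even though they share the same basis.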

According to the abstract, MDFD conv achieves a 3.17% improvement in polyphonic sound detection score (PSDS) over FDY conv, and extensive ablation studies clarify how the FDY conv variants behave and how to use them.

Critical Analysis

The paper supports its proposals with extensive ablation studies. However, the authors do not extensively discuss potential limitations or areas for future work.

For example, the model's performance may depend on the availability of large, high-quality training datasets, which can be hard to obtain for many real-world sound event detection applications. Additionally, although PFD conv is designed to limit model size, the computational cost of stacking several dilated dynamic branches could still limit deployment on resource-constrained devices.

Further research could explore strategies to improve the model's sample efficiency, such as leveraging auxiliary decoders or knowledge distillation techniques. Examining the model's robustness to noisy or ambiguous audio inputs would also be a valuable direction for future work.

Conclusion

The MDFD conv module proposed in this paper advances frequency dynamic convolution for sound event detection, with a reported 3.17% PSDS improvement over FDY conv, while PFD conv limits the accompanying growth in model size. Combining multiple dilated dynamic branches with a static branch allows the model to capture the complex spectrotemporal patterns that characterize different sound events.

The potential applications are broad, ranging from smart home automation and environmental monitoring to assistive technologies and audio-based surveillance. Further research on the model's potential limitations and on additional real-world scenarios would clarify how well these gains carry over in practice.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Pushing the Limit of Sound Event Detection with Multi-Dilated Frequency Dynamic Convolution

Hyeonuk Nam, Yong-Hwa Park

Frequency dynamic convolution (FDY conv) has been a milestone in the sound event detection (SED) field, but it involves a substantial increase in model size due to multiple basis kernels. In this work, we propose partial frequency dynamic convolution (PFD conv), which concatenates static convolution output and dynamic FDY conv output in order to minimize model size increase while maintaining the performance. Additionally, we propose multi-dilated frequency dynamic convolution (MDFD conv), which integrates multiple dilated frequency dynamic convolution (DFD conv) branches with different dilation size sets and a static branch within a single convolution module, achieving a 3.17% improvement in polyphonic sound detection score (PSDS) over FDY conv. Proposed methods with extensive ablation studies further enhance understanding and usability of FDY conv variants.

7/9/2024
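
The model-size argument in the abstract above can be illustrated with simple weight counting. This is a rough sketch under loud assumptions: it assumes PFD conv splits the output channels between a static part and a dynamic part, counts only kernel weights (no biases, no attention module), and uses hypothetical layer sizes. None of these numbers come from the paper.

```python
def conv_weights(c_in, c_out, k):
    """Weights in one static k x k convolution (biases ignored)."""
    return c_in * c_out * k * k


def fdy_weights(c_in, c_out, k, n_basis):
    """Weights in a fully dynamic layer holding n_basis basis kernels."""
    return n_basis * conv_weights(c_in, c_out, k)


def pfd_weights(c_in, c_out, k, n_basis, dynamic_channels):
    """Weights when only `dynamic_channels` output channels are dynamic.

    Assumption: the remaining output channels come from a static
    convolution and the two outputs are concatenated along channels.
    """
    static_channels = c_out - dynamic_channels
    return (conv_weights(c_in, static_channels, k)
            + fdy_weights(c_in, dynamic_channels, k, n_basis))


# Hypothetical layer: 64 -> 64 channels, 3x3 kernels, 4 basis kernels.
static = conv_weights(64, 64, 3)
full_dynamic = fdy_weights(64, 64, 3, n_basis=4)
partial = pfd_weights(64, 64, 3, n_basis=4, dynamic_channels=16)

print(static, full_dynamic, partial)  # 36864 147456 64512
```

Under these assumptions, making a quarter of the output channels dynamic costs well under half the weights of the fully dynamic layer, which is the kind of saving the abstract describes.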

Diversifying and Expanding Frequency-Adaptive Convolution Kernels for Sound Event Detection

Hyeonuk Nam, Seong-Hu Kim, Deokki Min, Junhyeok Lee, Yong-Hwa Park

Frequency dynamic convolution (FDY conv) has shown state-of-the-art performance in sound event detection (SED) using frequency-adaptive kernels obtained by a frequency-varying combination of basis kernels. However, FDY conv lacks an explicit means to diversify frequency-adaptive kernels, potentially limiting performance. In addition, the size of basis kernels is limited, while time-frequency patterns span a larger spectro-temporal range. Therefore, we propose dilated frequency dynamic convolution (DFD conv), which diversifies and expands frequency-adaptive kernels by introducing different dilation sizes to basis kernels. Experiments showed the advantages of varying dilation sizes along the frequency dimension, and analysis of attention weight variance proved that dilated basis kernels are effectively diversified. By adapting a class-wise median filter with an intersection-based F1 score, the proposed DFD-CRNN outperforms FDY-CRNN by 3.12% in terms of polyphonic sound detection score (PSDS).

6/11/2024
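
The class-wise median filter mentioned in the abstract above is a post-processing step that smooths each class's frame-level decisions with its own window length. The following is a minimal sketch, assuming odd window lengths and edge-replicating padding; the class names, window lengths, and decision sequences are hypothetical, and the paper's actual filter settings are not given here.

```python
from statistics import median


def median_filter(seq, length):
    """Sliding median with an odd window; edges are padded by replication."""
    half = length // 2
    padded = [seq[0]] * half + list(seq) + [seq[-1]] * half
    return [median(padded[i:i + length]) for i in range(len(seq))]


def classwise_median_filter(decisions, lengths):
    """Smooth each class with its own window length.

    decisions: {class_name: [0/1 per frame]}
    lengths:   {class_name: odd window length in frames}
    """
    return {name: median_filter(frames, lengths[name])
            for name, frames in decisions.items()}


# Hypothetical frame-level decisions for two classes.
decisions = {
    "alarm":  [0, 1, 0, 0, 1, 1, 1, 0, 1, 1],
    "speech": [1, 1, 0, 1, 1, 0, 0, 0, 1, 0],
}
lengths = {"alarm": 3, "speech": 5}
smoothed = classwise_median_filter(decisions, lengths)

print(smoothed["alarm"])   # [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
print(smoothed["speech"])  # [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
```

Short spurious activations and short gaps are removed, and the longer window used for "speech" removes longer ones, which is the reason for choosing the window per class.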

Full-frequency dynamic convolution: a physical frequency-dependent convolution for sound event detection

Haobo Yue, Zhicheng Zhang, Da Mu, Yonghao Dang, Jianqin Yin, Jin Tang

Recently, 2D convolution has been found unqualified in sound event detection (SED). It enforces translation equivariance on sound events along the frequency axis, which is not a shift-invariant dimension. To address this issue, dynamic convolution is used to model the frequency dependency of sound events. In this paper, we propose the first full-dynamic method, named full-frequency dynamic convolution (FFDConv). FFDConv generates frequency kernels for every frequency band, which is designed directly in the structure for frequency-dependent modeling. It physically furnishes 2D convolution with the capability of frequency-dependent modeling. FFDConv not only outperforms the baseline by 6.6% on the DESED real validation dataset in terms of PSDS1, but also outperforms the other full-dynamic methods. In addition, by visualizing features of sound events, we observed that FFDConv could effectively extract coherent features in specific frequency bands, consistent with the vocal continuity of sound events. This proves that FFDConv has great frequency-dependent perception ability.

8/23/2024
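
The translation equivariance that the abstract above objects to can be checked directly: with one shared kernel, shifting a pattern along the frequency axis simply shifts the output. The sketch below demonstrates this property for an ordinary convolution over a single spectrogram column. It does not implement FFDConv, and the helper names and values are hypothetical.

```python
def conv_same(x, kernel):
    """Same-size 1-D cross-correlation with zero padding (odd kernel length)."""
    half = len(kernel) // 2
    padded = [0.0] * half + list(x) + [0.0] * half
    return [sum(w * padded[i + j] for j, w in enumerate(kernel))
            for i in range(len(x))]


def shift(x, n):
    """Shift a sequence right by n bins, filling the front with zeros."""
    return [0.0] * n + list(x[:len(x) - n])


# Hypothetical energy profile across 9 frequency bins for one time frame.
column = [0.0, 0.0, 1.0, 2.0, 1.0, 0.0, 0.0, 0.0, 0.0]
kernel = [1.0, 2.0, 1.0]  # one kernel shared by every frequency bin

shifted_then_filtered = conv_same(shift(column, 3), kernel)
filtered_then_shifted = shift(conv_same(column, kernel), 3)

print(shifted_then_filtered == filtered_then_shifted)  # True
```

Because the two results match, a shared kernel cannot tell whether the pattern sits in a low or a high band, which is the motivation for kernels that depend on frequency.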

Self Training and Ensembling Frequency Dependent Networks with Coarse Prediction Pooling and Sound Event Bounding Boxes

Hyeonuk Nam, Deokki Min, Seungdeok Choi, Inhan Choi, Yong-Hwa Park

To tackle the sound event detection (SED) task, we propose frequency dependent networks (FreDNets), which heavily leverage frequency-dependent methods. We apply frequency warping and FilterAugment, which are frequency-dependent data augmentation methods. The model architecture consists of 3 branches: an audio teacher-student transformer (ATST) branch, a BEATs branch, and a CNN branch including either partial dilated frequency dynamic convolution (PDFD) or Squeeze-and-Excitation (SE) with time-frame frequency-wise SE (tfwSE). To train on MAESTRO labels with coarse temporal resolution, we apply max pooling on predictions for the MAESTRO dataset. Using the best ensemble model, we apply self training to obtain pseudo labels from the DESED weak set, the DESED unlabeled set, and AudioSet. AudioSet labels are filtered to focus on high-confidence pseudo labels, and AudioSet pseudo labels are used to train on DESED labels only. We used change-detection-based sound event bounding boxes (cSEBBs) as post-processing for ensemble models on self training and submission models.

6/26/2024
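
The max pooling on predictions described in the abstract above reduces fine-grained frame outputs to the coarser resolution of the labels. A minimal sketch follows, assuming non-overlapping segments of a fixed number of frames; the function name, segment size, and probabilities are hypothetical.

```python
def max_pool_predictions(frame_probs, frames_per_segment):
    """Collapse frame-level probabilities to one value per coarse segment.

    Each segment takes the maximum of its frames; a shorter final
    segment is pooled over the frames that remain.
    """
    return [
        max(frame_probs[i:i + frames_per_segment])
        for i in range(0, len(frame_probs), frames_per_segment)
    ]


# Hypothetical frame-level probabilities for one class.
frame_probs = [0.1, 0.2, 0.9, 0.3, 0.0, 0.1, 0.2, 0.1, 0.7, 0.8]
coarse = max_pool_predictions(frame_probs, frames_per_segment=4)

print(coarse)  # [0.9, 0.2, 0.8]
```

Taking the maximum means a segment counts as active if the event is predicted anywhere inside it, which matches a label that only says whether the event occurred within that coarse interval.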