Full-frequency dynamic convolution: a physical frequency-dependent convolution for sound event detection

Read original: arXiv:2401.04976 - Published 8/23/2024 by Haobo Yue, Zhicheng Zhang, Da Mu, Yonghao Dang, Jianqin Yin, Jin Tang

Overview

The paper proposes a new convolution method called "full-frequency dynamic convolution" for sound event detection.
This method takes into account the frequency-dependent nature of sound events, improving upon standard convolution approaches.
The authors report that the method outperforms their baseline by 6.6% in PSDS1 on the DESED real validation dataset, and that it also outperforms other full-dynamic methods.

Plain English Explanation

The paper describes a new way of processing audio data for the task of sound event detection. In the convolutional neural networks typically used for this problem, the convolution operation applies the same filters at every frequency of the audio signal, so all frequencies are treated alike.

However, the authors argue that different sound events have different frequency characteristics: the rumble of a car engine sits at low frequencies, for example, while a bird call is high-pitched. The proposed "full-frequency dynamic convolution" method allows the network to dynamically adjust how it processes different frequency bands in the audio, depending on the characteristics of the sound event.

This frequency-aware convolution aims to better capture the physical properties of the audio, leading to improved performance on sound event detection benchmarks compared to standard convolution approaches.

Technical Explanation

The key innovation in this paper is the "full-frequency dynamic convolution" module, which extends the standard convolution operation. In a typical convolution, a fixed set of filters is applied across the entire frequency spectrum of the input audio.
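
To make this weight sharing concrete, here is a minimal pure-Python sketch (not code from the paper). The helper names `conv2d_shared` and `shift_freq`, the kernel values, and the toy spectrogram are all assumptions chosen for illustration. It shows that when one kernel is shared by every frequency bin, moving a pattern along the frequency axis simply moves the response with it.

```python
def conv2d_shared(spec, kernel):
    """Valid 2D cross-correlation of spec[freq][time] with one kernel shared by all bins."""
    kf, kt = len(kernel), len(kernel[0])
    out = []
    for f in range(len(spec) - kf + 1):
        row = []
        for t in range(len(spec[0]) - kt + 1):
            row.append(sum(kernel[i][j] * spec[f + i][t + j]
                           for i in range(kf) for j in range(kt)))
        out.append(row)
    return out


def shift_freq(spec, k):
    """Move every row up by k frequency bins, zero-filling the lowest k bins."""
    n_t = len(spec[0])
    return [[0.0] * n_t for _ in range(k)] + [row[:] for row in spec[:len(spec) - k]]


# Toy spectrogram: 6 frequency bins x 5 time frames with one small pattern.
spec = [[0.0] * 5 for _ in range(6)]
spec[1][1], spec[1][2] = 1.0, 2.0
spec[2][1], spec[2][2] = 3.0, 4.0

kernel = [[1.0, 2.0], [0.5, -1.0]]  # arbitrary illustrative weights

out = conv2d_shared(spec, kernel)
out_shifted = conv2d_shared(shift_freq(spec, 2), kernel)

# The response to the shifted input is the original response moved by 2 bins.
equivariant = out_shifted[2:] == out[:3]
print(equivariant)
```

A shared kernel therefore cannot tell a pattern at low frequency from the same pattern at high frequency, which is the behavior that frequency-dependent convolution is meant to relax.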

Instead, the full-frequency dynamic convolution uses frequency-dependent convolution kernels that can adapt to the specific frequency characteristics of the sound event being detected. This is achieved by predicting separate convolution weights for different frequency bands within the audio input.
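
The idea of predicting separate weights per frequency band can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' FFDConv implementation: each band gets a 1D kernel along time rather than a full 2D kernel, and the kernel generator is a hypothetical per-tap linear map (`w`, `b`) of the band's mean energy. The function names `band_kernels` and `freq_dependent_conv` are also made up for this example.

```python
def band_kernels(spec, w, b):
    """Predict one temporal kernel per frequency band from that band's mean energy.

    w and b are hypothetical generator parameters: one weight and one bias per kernel tap.
    """
    kernels = []
    for band in spec:
        mean = sum(band) / len(band)
        kernels.append([wi * mean + bi for wi, bi in zip(w, b)])
    return kernels


def freq_dependent_conv(spec, kernels):
    """Apply kernels[f] along time to band f only (valid cross-correlation)."""
    out = []
    for band, k in zip(spec, kernels):
        n_out = len(band) - len(k) + 1
        out.append([sum(k[j] * band[t + j] for j in range(len(k)))
                    for t in range(n_out)])
    return out


# Toy spectrogram with 2 frequency bands and 4 time frames.
spec = [
    [1.0, 1.0, 1.0, 1.0],  # steady band
    [0.0, 4.0, 0.0, 4.0],  # fluctuating band
]
w = [1.0, -1.0]  # assumed generator weights (illustrative only)
b = [0.0, 1.0]   # assumed generator biases (illustrative only)

kernels = band_kernels(spec, w, b)
out = freq_dependent_conv(spec, kernels)
print(kernels)
print(out)
```

Because the kernels are computed from the input, the two bands end up filtered differently even though they share the same generator parameters. A real layer would learn those parameters and would typically work on multi-channel feature maps.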

The authors integrate this dynamic convolution module into a convolutional neural network architecture for sound event detection. They evaluate this approach on several benchmark datasets, demonstrating improved performance over standard convolution-based methods, especially for sound events with distinct frequency profiles.

Critical Analysis

The paper offers a well-motivated approach to improving sound event detection through frequency-aware convolution. The authors point out that while typical convolution treats all frequencies equally, real-world sound events often have distinctive frequency characteristics that matter for accurate detection.

One potential limitation is that the full-frequency dynamic convolution may be more computationally expensive than standard convolution, due to the need to predict separate convolution weights for each frequency band. The authors do not provide detailed analysis of the computational overhead or potential trade-offs.
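
A rough back-of-the-envelope count illustrates where such overhead could come from. The layer sizes below are hypothetical and not taken from the paper, and the sketch assumes a full kernel is materialized for every frequency band, which the actual FFDConv design may avoid.

```python
def static_kernel_weights(c_in, c_out, k):
    """Weights in one standard k x k convolution kernel bank."""
    return c_in * c_out * k * k


def per_band_kernel_weights(c_in, c_out, k, n_bands):
    """Weights needed if every frequency band had its own full kernel bank."""
    return n_bands * static_kernel_weights(c_in, c_out, k)


# Hypothetical layer: 16 input channels, 32 output channels, 3 x 3 kernels, 128 bands.
static = static_kernel_weights(16, 32, 3)
per_band = per_band_kernel_weights(16, 32, 3, 128)
print(static, per_band, per_band // static)
```

Under these assumptions the kernel storage grows linearly with the number of frequency bands, so measured numbers from the authors would be needed to judge the real trade-off.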

Additionally, the paper focuses on evaluating the approach on established sound event detection benchmarks. Further research could explore how the frequency-aware convolution performs on more diverse or challenging real-world audio datasets, or investigate potential applications beyond just sound event detection.

Overall, the proposed full-frequency dynamic convolution is a promising technique that could help advance the state-of-the-art in audio processing and understanding tasks.

Conclusion

This paper introduces a novel convolution method called "full-frequency dynamic convolution" that adapts to the frequency characteristics of sound events, improving upon standard convolution approaches for the task of sound event detection.

By dynamically adjusting the convolution weights based on the frequency content of the audio input, the authors demonstrate improved performance on several benchmark datasets. This frequency-aware convolution technique represents an important step towards more physically-grounded and effective audio processing models.

The findings from this research could have broader implications for other audio-related applications, such as speech recognition, music analysis, and environmental sound understanding. Further exploration of the full-frequency dynamic convolution approach could lead to continued advancements in the field of computational auditory scene analysis.

Related Papers

Full-frequency dynamic convolution: a physical frequency-dependent convolution for sound event detection

Haobo Yue, Zhicheng Zhang, Da Mu, Yonghao Dang, Jianqin Yin, Jin Tang

Recently, 2D convolution has been found unqualified in sound event detection (SED). It enforces translation equivariance on sound events along frequency axis, which is not a shift-invariant dimension. To address this issue, dynamic convolution is used to model the frequency dependency of sound events. In this paper, we proposed the first full-dynamic method named full-frequency dynamic convolution (FFDConv). FFDConv generates frequency kernels for every frequency band, which is designed directly in the structure for frequency-dependent modeling. It physically furnished 2D convolution with the capability of frequency-dependent modeling. FFDConv outperforms not only the baseline by 6.6% in DESED real validation dataset in terms of PSDS1, but outperforms the other full-dynamic methods. In addition, by visualizing features of sound events, we observed that FFDConv could effectively extract coherent features in specific frequency bands, consistent with the vocal continuity of sound events. This proves that FFDConv has great frequency-dependent perception ability.

8/23/2024

Pushing the Limit of Sound Event Detection with Multi-Dilated Frequency Dynamic Convolution

Hyeonuk Nam, Yong-Hwa Park

Frequency dynamic convolution (FDY conv) has been a milestone in the sound event detection (SED) field, but it involves a substantial increase in model size due to multiple basis kernels. In this work, we propose partial frequency dynamic convolution (PFD conv), which concatenates static convolution output and dynamic FDY conv output in order to minimize model size increase while maintaining the performance. Additionally, we propose multi-dilated frequency dynamic convolution (MDFD conv), which integrates multiple dilated frequency dynamic convolution (DFD conv) branches with different dilation size sets and a static branch within a single convolution module, achieving a 3.17% improvement in polyphonic sound detection score (PSDS) over FDY conv. Proposed methods with extensive ablation studies further enhance understanding and usability of FDY conv variants.

7/9/2024

Diversifying and Expanding Frequency-Adaptive Convolution Kernels for Sound Event Detection

Hyeonuk Nam, Seong-Hu Kim, Deokki Min, Junhyeok Lee, Yong-Hwa Park

Frequency dynamic convolution (FDY conv) has shown the state-of-the-art performance in sound event detection (SED) using frequency-adaptive kernels obtained by frequency-varying combination of basis kernels. However, FDY conv lacks an explicit mean to diversify frequency-adaptive kernels, potentially limiting the performance. In addition, size of basis kernels is limited while time-frequency patterns span larger spectro-temporal range. Therefore, we propose dilated frequency dynamic convolution (DFD conv) which diversifies and expands frequency-adaptive kernels by introducing different dilation sizes to basis kernels. Experiments showed advantages of varying dilation sizes along frequency dimension, and analysis on attention weight variance proved dilated basis kernels are effectively diversified. By adapting class-wise median filter with intersection-based F1 score, proposed DFD-CRNN outperforms FDY-CRNN by 3.12% in terms of polyphonic sound detection score (PSDS).

6/11/2024

Self Training and Ensembling Frequency Dependent Networks with Coarse Prediction Pooling and Sound Event Bounding Boxes

Hyeonuk Nam, Deokki Min, Seungdeok Choi, Inhan Choi, Yong-Hwa Park

To tackle sound event detection (SED) task, we propose frequency dependent networks (FreDNets), which heavily leverage frequency-dependent methods. We apply frequency warping and FilterAugment, which are frequency-dependent data augmentation methods. The model architecture consists of 3 branches: audio teacher-student transformer (ATST) branch, BEATs branch and CNN branch including either partial dilated frequency dynamic convolution (PDFD) or squeeze-and-Excitation (SE) with time-frame frequency-wise SE (tfwSE). To train MAESTRO labels with coarse temporal resolution, we apply max pooling on prediction for the MAESTRO dataset. Using best ensemble model, we apply self training to obtain pseudo label from DESED weak set, DESED unlabeled set and AudioSet. AudioSet labels are filtered to focus on high-confidence pseudo labels and AudioSet pseudo labels are used to train on DESED labels only. We used change-detection-based sound event bounding boxes (cSEBBs) as post processing for ensemble models on self training and submission models.

6/26/2024