Sound Event Bounding Boxes

2406.04212

Published 6/7/2024 by Janek Ebbers, Francois G. Germain, Gordon Wichern, Jonathan Le Roux

Abstract

Sound event detection is the task of recognizing sounds and determining their extent (onset/offset times) within an audio clip. Existing systems commonly predict sound presence confidence in short time frames. Then, thresholding produces binary frame-level presence decisions, with the extent of individual events determined by merging consecutive positive frames. In this paper, we show that frame-level thresholding degrades the prediction of the event extent by coupling it with the system's sound presence confidence. We propose to decouple the prediction of event extent and confidence by introducing SEBBs, which format each sound event prediction as a tuple of a class type, extent, and overall confidence. We also propose a change-detection-based algorithm to convert legacy frame-level outputs into SEBBs. We find the algorithm significantly improves the performance of DCASE 2023 Challenge systems, boosting the state of the art from .644 to .686 PSDS1.

Overview

The paper discusses sound event detection, which is the task of recognizing sounds and determining their start and end times within an audio clip.
Existing systems commonly predict sound presence confidence in short time frames, then use thresholding to produce binary frame-level presence decisions, and merge consecutive positive frames to determine the extent of individual events.
The paper proposes a new approach called Sound Event Bounding Boxes (SEBBs) that decouples the prediction of event extent and confidence, formatting each sound event prediction as a tuple of class type, extent, and overall confidence.
The paper also introduces a change-detection-based algorithm to convert legacy frame-level outputs into SEBBs, which significantly improves the performance of DCASE 2023 Challenge systems.

Plain English Explanation

Sound event detection is the process of identifying sounds and figuring out when they start and stop within an audio recording. Existing systems often work by predicting how confident they are that a sound is present in short time intervals, and then using a threshold to decide whether a sound is actually there. They then combine the consecutive time intervals where a sound was detected to determine the full duration of the sound event.

The researchers behind this paper found that this approach of using thresholds to make binary decisions about sound presence can actually degrade the ability to accurately predict the start and end times of sound events. Instead, they propose a new way of formatting sound event predictions called "Sound Event Bounding Boxes" (SEBBs). SEBBs separate the prediction of the sound's class (e.g. a car horn, a dog bark), its duration, and the overall confidence in the detection. This decoupling allows the system to better capture the full extent of sound events.

The researchers also developed an algorithm that takes the traditional frame-level outputs of existing sound detection systems and converts them into the new SEBB format. When they applied this algorithm to systems that took part in the DCASE 2023 Challenge, it significantly improved those systems' performance, raising the state of the art from 0.644 to 0.686 on the PSDS1 metric.

Technical Explanation

Existing sound event detection systems commonly use a two-step process: first, they predict the confidence that a sound is present in short time frames; then, they apply a threshold to convert these frame-level confidence scores into binary decisions about sound presence. The extent (start and end times) of individual sound events is determined by merging consecutive positive frames.
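The threshold-and-merge step described above can be sketched as follows. This is a minimal illustration, not code from the paper: the function name `threshold_and_merge`, the one-second frame duration, and the example scores are all assumptions chosen for readability.

```python
from typing import List, Tuple


def threshold_and_merge(
    scores: List[float], threshold: float, frame_dur: float
) -> List[Tuple[float, float]]:
    """Binarize frame scores, then merge runs of consecutive positive
    frames into (onset, offset) pairs in seconds."""
    events = []
    onset = None  # index of the first frame of the currently open event
    for i, score in enumerate(scores):
        active = score > threshold
        if active and onset is None:
            onset = i  # a run of positive frames starts here
        elif not active and onset is not None:
            events.append((onset * frame_dur, i * frame_dur))
            onset = None  # the run ended on the previous frame
    if onset is not None:
        # The clip ended while an event was still active.
        events.append((onset * frame_dur, len(scores) * frame_dur))
    return events


# Hypothetical frame-level scores for one sound class (1 s frames).
scores = [0.1, 0.4, 0.8, 0.9, 0.7, 0.3, 0.1]

low = threshold_and_merge(scores, threshold=0.2, frame_dur=1.0)
high = threshold_and_merge(scores, threshold=0.75, frame_dur=1.0)
print(low)   # event spans frames 1 to 5
print(high)  # event spans frames 2 to 3 only
```

In this toy example the same score curve yields a five-second event at the lower threshold and a two-second event at the higher one, so the predicted extent shifts with the operating point. That is the kind of coupling between confidence and extent that the paper sets out to remove.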

The paper argues that this frame-level thresholding approach couples the prediction of sound presence confidence with the prediction of event extent, which can degrade performance on the latter. To address this, the authors propose a new way of formatting sound event predictions called "Sound Event Bounding Boxes" (SEBBs). SEBBs represent each detected sound event as a tuple containing the sound class, the temporal extent of the event, and an overall confidence score.
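One way to picture the SEBB tuple is as a small record type. The sketch below is an assumption about how such a record could be represented in Python; the class name `SEBB`, its field names, the helper `detections_at`, and the example values are hypothetical and not taken from the paper.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class SEBB:
    """One predicted sound event: class, temporal extent, overall confidence."""
    event_class: str
    onset: float       # seconds
    offset: float      # seconds
    confidence: float  # a single score for the whole event


def detections_at(sebbs: List[SEBB], threshold: float) -> List[SEBB]:
    """Keep events whose overall confidence exceeds the threshold.
    The extents of the surviving events are left untouched."""
    return [s for s in sebbs if s.confidence > threshold]


# Hypothetical predictions for one audio clip.
predictions = [
    SEBB("dog_bark", onset=1.0, offset=6.0, confidence=0.9),
    SEBB("car_horn", onset=7.5, offset=8.0, confidence=0.35),
]

strict = detections_at(predictions, threshold=0.5)
lenient = detections_at(predictions, threshold=0.2)
```

Because the threshold is applied to one confidence value per event rather than to each frame, changing it only decides which events are kept. The onset and offset of a kept event are the same at every operating point.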

The paper also introduces a change-detection-based algorithm to convert legacy frame-level outputs into the SEBB format. This algorithm identifies significant changes in the frame-level confidence scores to determine the start and end times of sound events, and aggregates the per-frame confidence scores into an overall confidence for each SEBB.
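The general idea can be illustrated with a deliberately simplified sketch. It is not the authors' algorithm: it assumes that a sharp rise between neighboring frames marks an onset, a sharp fall marks an offset, and the mean score inside the extent serves as the overall confidence. The function name `frames_to_sebbs`, the `min_step` parameter, and the example scores are hypothetical.

```python
from typing import List, NamedTuple


class SEBB(NamedTuple):
    event_class: str
    onset: float       # seconds
    offset: float      # seconds
    confidence: float  # mean frame score inside the extent


def frames_to_sebbs(
    scores: List[float], event_class: str, frame_dur: float, min_step: float
) -> List[SEBB]:
    """Simplified change detection: open an event at a rise larger than
    min_step, close it at the next fall larger than min_step."""
    sebbs = []
    onset = None
    n = len(scores)
    for i in range(n + 1):
        # Treat the signal as zero outside the clip so that events
        # touching the clip borders can still be opened and closed.
        prev = scores[i - 1] if i > 0 else 0.0
        cur = scores[i] if i < n else 0.0
        delta = cur - prev
        if onset is None and delta > min_step:
            onset = i
        elif onset is not None and delta < -min_step:
            segment = scores[onset:i]
            confidence = sum(segment) / len(segment)
            sebbs.append(SEBB(event_class, onset * frame_dur, i * frame_dur, confidence))
            onset = None
    if onset is not None:
        # No sharp fall was found: close the event at the end of the clip.
        segment = scores[onset:]
        confidence = sum(segment) / len(segment)
        sebbs.append(SEBB(event_class, onset * frame_dur, n * frame_dur, confidence))
    return sebbs


# Hypothetical frame-level scores for one class (1 s frames).
scores = [0.0, 0.25, 0.75, 1.0, 0.75, 0.25, 0.0]
result = frames_to_sebbs(scores, "dog_bark", frame_dur=1.0, min_step=0.4)
print(result)
```

Here the boundaries come from where the score curve changes, not from where it crosses a decision threshold, so a later threshold on `confidence` cannot shorten or lengthen the event. A practical implementation would likely need smoothing and a rule for merging neighboring segments, which this sketch leaves out.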

When the authors applied this conversion algorithm to the outputs of systems that took part in the DCASE 2023 Challenge, it significantly improved those systems' performance, boosting the state of the art from 0.644 to 0.686 on the PSDS1 metric.

Critical Analysis

The paper presents a compelling approach to improving sound event detection by decoupling the prediction of sound presence confidence and event extent. The proposed SEBB format and conversion algorithm seem promising, as evidenced by the significant performance gains on the DCASE 2023 Challenge.

However, the paper does not provide much insight into the specific reasons why the frame-level thresholding approach degrades event extent prediction. More analysis on the failure modes of this traditional approach could have strengthened the justification for the SEBB method.

Additionally, the paper does not discuss the potential computational or memory overhead of the SEBB format compared to the legacy frame-level outputs. As sound event detection is often deployed in real-time or resource-constrained applications, the efficiency implications of the new approach should be considered.

Future work could also explore the robustness of the SEBB method to noisy or ambiguous audio data, as well as its generalizability to a wider range of sound event detection tasks beyond the DCASE Challenge. Incorporating multi-modal cues could also be an interesting avenue to improve the reliability and accuracy of sound event predictions.

Conclusion

This paper introduces a novel approach to sound event detection called "Sound Event Bounding Boxes" (SEBBs), which decouples the prediction of sound event extent and confidence. The authors also present a change-detection-based algorithm to convert legacy frame-level outputs into the SEBB format, significantly improving the performance of state-of-the-art systems on the DCASE 2023 Challenge.

By reformulating the sound event detection task in this way, the paper demonstrates the potential to enhance the accuracy and robustness of audio event understanding, with applications in areas such as smart home monitoring, autonomous driving, and environmental sound analysis. The insights presented here could inspire further research into more sophisticated, context-aware approaches to audio perception and scene understanding.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

DCASE 2024 Task 4: Sound Event Detection with Heterogeneous Data and Missing Labels

Samuele Cornell, Janek Ebbers, Constance Douwes, Irene Martín-Morató, Manu Harju, Annamaria Mesaros, Romain Serizel

The Detection and Classification of Acoustic Scenes and Events Challenge Task 4 aims to advance sound event detection (SED) systems in domestic environments by leveraging training data with different supervision uncertainty. Participants are challenged in exploring how to best use training data from different domains and with varying annotation granularity (strong/weak temporal resolution, soft/hard labels), to obtain a robust SED system that can generalize across different scenarios. Crucially, annotation across available training datasets can be inconsistent and hence sound labels of one dataset may be present but not annotated in the other one and vice-versa. As such, systems will have to cope with potentially missing target labels during training. Moreover, as an additional novelty, systems will also be evaluated on labels with different granularity in order to assess their robustness for different applications. To lower the entry barrier for participants, we developed an updated baseline system with several caveats to address these aforementioned problems. Results with our baseline system indicate that this research direction is promising and that it is possible to obtain a stronger SED system by using diverse domain training data with missing labels compared to training a SED system for each domain separately.

6/13/2024

eess.AS cs.SD

Sound Event Detection and Localization with Distance Estimation

Daniel Aleksander Krause, Archontis Politis, Annamaria Mesaros

Sound Event Detection and Localization (SELD) is a combined task of identifying sound events and their corresponding direction-of-arrival (DOA). While this task has numerous applications and has been extensively researched in recent years, it fails to provide full information about the sound source position. In this paper, we overcome this problem by extending the task to Sound Event Detection, Localization with Distance Estimation (3D SELD). We study two ways of integrating distance estimation within the SELD core - a multi-task approach, in which the problem is tackled by a separate model output, and a single-task approach obtained by extending the multi-ACCDOA method to include distance information. We investigate both methods for the Ambisonic and binaural versions of STARSS23: Sony-TAU Realistic Spatial Soundscapes 2023. Moreover, our study involves experiments on the loss function related to the distance estimation part. Our results show that it is possible to perform 3D SELD without any degradation of performance in sound event detection and DOA estimation.

6/13/2024

cs.SD cs.LG eess.AS

Sound event detection based on auxiliary decoder and maximum probability aggregation for DCASE Challenge 2024 Task 4

Sang Won Son, Jongyeon Park, Hong Kook Kim, Sulaiman Vesal, Jeong Eun Lim

In this report, we propose three novel methods for developing a sound event detection (SED) model for the DCASE 2024 Challenge Task 4. First, we propose an auxiliary decoder attached to the final convolutional block to improve feature extraction capabilities while reducing dependency on embeddings from pre-trained large models. The proposed auxiliary decoder operates independently from the main decoder, enhancing performance of the convolutional block during the initial training stages by assigning a different weight strategy between main and auxiliary decoder losses. Next, to address the time interval issue between the DESED and MAESTRO datasets, we propose maximum probability aggregation (MPA) during the training step. The proposed MPA method enables the model's output to be aligned with soft labels of 1 s in the MAESTRO dataset. Finally, we propose a multi-channel input feature that employs various versions of logmel and MFCC features to generate time-frequency patterns. The experimental results demonstrate the efficacy of these proposed methods in improving SED performance by achieving a balanced enhancement across different datasets and label types. Ultimately, this approach presents a significant step forward in developing more robust and flexible SED models.

6/26/2024

eess.AS cs.SD

Text-Queried Target Sound Event Localization

Jinzheng Zhao, Xinyuan Qian, Yong Xu, Haohe Liu, Yin Cao, Davide Berghi, Wenwu Wang

Sound event localization and detection (SELD) aims to determine the appearance of sound classes, together with their Direction of Arrival (DOA). However, current SELD systems can only predict the activities of specific classes, for example, 13 classes in DCASE challenges. In this paper, we propose text-queried target sound event localization (SEL), a new paradigm that allows the user to input the text to describe the sound event, and the SEL model can predict the location of the related sound event. The proposed task presents a more user-friendly way for human-computer interaction. We provide a benchmark study for the proposed task and perform experiments on datasets created by simulated room impulse response (RIR) and real RIR to validate the effectiveness of the proposed methods. We hope that our benchmark will inspire the interest and additional research for text-queried sound source localization.

6/25/2024

eess.AS