Text-Queried Target Sound Event Localization

2406.16058

Published 6/25/2024 by Jinzheng Zhao, Xinyuan Qian, Yong Xu, Haohe Liu, Yin Cao, Davide Berghi, Wenwu Wang

Text-Queried Target Sound Event Localization

Abstract

Sound event localization and detection (SELD) aims to determine the appearance of sound classes, together with their Direction of Arrival (DOA). However, current SELD systems can only predict the activities of specific classes, for example, 13 classes in DCASE challenges. In this paper, we propose text-queried target sound event localization (SEL), a new paradigm that allows the user to input the text to describe the sound event, and the SEL model can predict the location of the related sound event. The proposed task presents a more user-friendly way for human-computer interaction. We provide a benchmark study for the proposed task and perform experiments on datasets created by simulated room impulse response (RIR) and real RIR to validate the effectiveness of the proposed methods. We hope that our benchmark will inspire the interest and additional research for text-queried sound source localization.

Create account to get full access

Overview

This paper presents a novel approach for locating and detecting target sound events based on text queries.
The proposed method fuses audio and visual information to improve the accuracy and robustness of sound event localization and detection.
The researchers evaluate their approach on several benchmark datasets, demonstrating its effectiveness compared to existing techniques.

Plain English Explanation

In this research, the authors have developed a new way to locate and identify specific sound events based on text descriptions. For example, if you wanted to find the location of a barking dog in an audio recording, you could provide a text query like "dog barking" and the system would pinpoint where that sound is coming from.

The key innovation is that the system uses both audio and visual information to make these detections and localizations more accurate and reliable. By combining cues from the sound and any accompanying video, the model can better distinguish the target sound event from background noise or other irrelevant sounds.

The researchers tested their approach on several established datasets used for evaluating sound event localization and detection. The results showed that their text-queried, multimodal fusion method outperformed previous state-of-the-art techniques. This suggests the approach could be very useful for applications like sound event detection, audio-visual event recognition, and locating specific sounds in complex audio environments.

Technical Explanation

The paper proposes a text-queried target sound event localization (TQTSEL) model that leverages both audio and visual information to detect and locate specific sound events of interest. The architecture consists of several key components:

Audio Encoder: This module processes the input audio waveform and extracts relevant acoustic features.
Visual Encoder: This component analyzes any accompanying video frames to capture visual cues about the scene.
Text Encoder: This part encodes the text query describing the target sound event.
Multimodal Fusion: The audio, visual, and text features are combined through a series of fusion layers to produce the final sound event localization and detection outputs.

The researchers evaluate their TQTSEL model on several benchmark datasets, including DCASE 2024 Task 4 for general sound event detection and WASN for outdoor sound event localization. The results demonstrate that the text-queried, multimodal approach outperforms unimodal and other state-of-the-art methods, particularly in terms of localization accuracy and robustness to challenging audio environments.

Critical Analysis

The paper presents a compelling approach for improving the performance of sound event localization and detection tasks by leveraging multimodal information. The use of text queries to specify the target sound of interest is a novel and practical extension of previous work in this area, which has typically relied on predefined sound classes or events.

However, the paper does not extensively discuss the potential limitations or caveats of the proposed TQTSEL model. For example, it is unclear how the model would perform on open-ended text queries that go beyond the predefined sound events in the training data, or how sensitive the model is to variations in the wording of the text queries.

Additionally, the paper could have provided more insight into the tradeoffs or failure modes of the multimodal fusion approach compared to unimodal approaches. It would be useful to understand the specific scenarios where the audio-visual fusion provides the greatest benefits, as well as any cases where the fusion may introduce errors or reduce performance.

Overall, the research represents an important step forward in sound event detection and localization, and the insights gained could inform the design of more robust and flexible multimodal systems for real-world applications.

Conclusion

This paper introduces a novel text-queried target sound event localization (TQTSEL) model that combines audio and visual information to detect and locate specific sound events of interest. The proposed approach demonstrated strong performance on several benchmark datasets, outperforming previous state-of-the-art techniques.

The key innovation is the use of text queries to specify the target sound, which makes the system more flexible and practical for real-world applications compared to approaches that rely on predefined sound classes. The multimodal fusion of audio and visual cues also improves the overall accuracy and robustness of the sound event localization and detection.

While the paper could have provided more insight into the limitations and tradeoffs of the TQTSEL model, the research represents an important advancement in the field of audio-visual information fusion for sound event understanding. The findings could have significant implications for a wide range of applications, from security and surveillance to assistive technologies and human-computer interaction.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Sound Event Detection and Localization with Distance Estimation

Daniel Aleksander Krause, Archontis Politis, Annamaria Mesaros

Sound Event Detection and Localization (SELD) is a combined task of identifying sound events and their corresponding direction-of-arrival (DOA). While this task has numerous applications and has been extensively researched in recent years, it fails to provide full information about the sound source position. In this paper, we overcome this problem by extending the task to Sound Event Detection, Localization with Distance Estimation (3D SELD). We study two ways of integrating distance estimation within the SELD core - a multi-task approach, in which the problem is tackled by a separate model output, and a single-task approach obtained by extending the multi-ACCDOA method to include distance information. We investigate both methods for the Ambisonic and binaural versions of STARSS23: Sony-TAU Realistic Spatial Soundscapes 2023. Moreover, our study involves experiments on the loss function related to the distance estimation part. Our results show that it is possible to perform 3D SELD without any degradation of performance in sound event detection and DOA estimation.

6/13/2024

cs.SD cs.LG eess.AS

Exploring Audio-Visual Information Fusion for Sound Event Localization and Detection In Low-Resource Realistic Scenarios

Ya Jiang, Qing Wang, Jun Du, Maocheng Hu, Pengfei Hu, Zeyan Liu, Shi Cheng, Zhaoxu Nian, Yuxuan Dong, Mingqi Cai, Xin Fang, Chin-Hui Lee

This study presents an audio-visual information fusion approach to sound event localization and detection (SELD) in low-resource scenarios. We aim at utilizing audio and video modality information through cross-modal learning and multi-modal fusion. First, we propose a cross-modal teacher-student learning (TSL) framework to transfer information from an audio-only teacher model, trained on a rich collection of audio data with multiple data augmentation techniques, to an audio-visual student model trained with only a limited set of multi-modal data. Next, we propose a two-stage audio-visual fusion strategy, consisting of an early feature fusion and a late video-guided decision fusion to exploit synergies between audio and video modalities. Finally, we introduce an innovative video pixel swapping (VPS) technique to extend an audio channel swapping (ACS) method to an audio-visual joint augmentation. Evaluation results on the Detection and Classification of Acoustic Scenes and Events (DCASE) 2023 Challenge data set demonstrate significant improvements in SELD performances. Furthermore, our submission to the SELD task of the DCASE 2023 Challenge ranks first place by effectively integrating the proposed techniques into a model ensemble.

6/24/2024

eess.AS eess.SP

🔎

DCASE 2024 Task 4: Sound Event Detection with Heterogeneous Data and Missing Labels

Samuele Cornell, Janek Ebbers, Constance Douwes, Irene Mart'in-Morat'o, Manu Harju, Annamaria Mesaros, Romain Serizel

The Detection and Classification of Acoustic Scenes and Events Challenge Task 4 aims to advance sound event detection (SED) systems in domestic environments by leveraging training data with different supervision uncertainty. Participants are challenged in exploring how to best use training data from different domains and with varying annotation granularity (strong/weak temporal resolution, soft/hard labels), to obtain a robust SED system that can generalize across different scenarios. Crucially, annotation across available training datasets can be inconsistent and hence sound labels of one dataset may be present but not annotated in the other one and vice-versa. As such, systems will have to cope with potentially missing target labels during training. Moreover, as an additional novelty, systems will also be evaluated on labels with different granularity in order to assess their robustness for different applications. To lower the entry barrier for participants, we developed an updated baseline system with several caveats to address these aforementioned problems. Results with our baseline system indicate that this research direction is promising and is possible to obtain a stronger SED system by using diverse domain training data with missing labels compared to training a SED system for each domain separately.

6/13/2024

eess.AS cs.SD

Sound event localization and classification using WASN in Outdoor Environment

Dongzhe Zhang, Jianfeng Chen, Jisheng Bai, Mou Wang

Deep learning-based sound event localization and classification is an emerging research area within wireless acoustic sensor networks. However, current methods for sound event localization and classification typically rely on a single microphone array, making them susceptible to signal attenuation and environmental noise, which limits their monitoring range. Moreover, methods using multiple microphone arrays often focus solely on source localization, neglecting the aspect of sound event classification. In this paper, we propose a deep learning-based method that employs multiple features and attention mechanisms to estimate the location and class of sound source. We introduce a Soundmap feature to capture spatial information across multiple frequency bands. We also use the Gammatone filter to generate acoustic features more suitable for outdoor environments. Furthermore, we integrate attention mechanisms to learn channel-wise relationships and temporal dependencies within the acoustic features. To evaluate our proposed method, we conduct experiments using simulated datasets with different levels of noise and size of monitoring areas, as well as different arrays and source positions. The experimental results demonstrate the superiority of our proposed method over state-of-the-art methods in both sound event classification and sound source localization tasks. And we provide further analysis to explain the reasons for the observed errors.

4/1/2024

cs.SD cs.LG eess.AS