Exploring Audio-Visual Information Fusion for Sound Event Localization and Detection In Low-Resource Realistic Scenarios

2406.15160

Published 6/24/2024 by Ya Jiang, Qing Wang, Jun Du, Maocheng Hu, Pengfei Hu, Zeyan Liu, Shi Cheng, Zhaoxu Nian, Yuxuan Dong, Mingqi Cai and 2 others

eess.AS eess.SP

Exploring Audio-Visual Information Fusion for Sound Event Localization and Detection In Low-Resource Realistic Scenarios

Abstract

This study presents an audio-visual information fusion approach to sound event localization and detection (SELD) in low-resource scenarios. We aim at utilizing audio and video modality information through cross-modal learning and multi-modal fusion. First, we propose a cross-modal teacher-student learning (TSL) framework to transfer information from an audio-only teacher model, trained on a rich collection of audio data with multiple data augmentation techniques, to an audio-visual student model trained with only a limited set of multi-modal data. Next, we propose a two-stage audio-visual fusion strategy, consisting of an early feature fusion and a late video-guided decision fusion to exploit synergies between audio and video modalities. Finally, we introduce an innovative video pixel swapping (VPS) technique to extend an audio channel swapping (ACS) method to an audio-visual joint augmentation. Evaluation results on the Detection and Classification of Acoustic Scenes and Events (DCASE) 2023 Challenge data set demonstrate significant improvements in SELD performances. Furthermore, our submission to the SELD task of the DCASE 2023 Challenge ranks first place by effectively integrating the proposed techniques into a model ensemble.

Create account to get full access

Overview

This paper explores the use of audio-visual information fusion for sound event localization and detection in low-resource realistic scenarios.
The researchers investigate cross-modal teacher-student learning and multi-modal fusion techniques to improve the performance of sound event localization and detection systems.
They also explore audio channel swapping and video pixel swapping as methods to enhance the model's ability to learn cross-modal associations.

Plain English Explanation

The paper focuses on developing better systems for detecting and localizing sound events, such as a car horn or a person speaking, using both audio and visual information. Many real-world scenarios, like a noisy city street, can make it challenging for audio-only systems to accurately identify and pinpoint the sources of various sounds.

To address this, the researchers explored ways to fuse audio and visual data to create more robust and reliable sound event detection and localization models. One approach they tried was "cross-modal teacher-student learning," where a model trained on both audio and video data teaches a model that only has access to audio data. This helps the audio-only model learn to make better use of the audio information.

The researchers also experimented with "multi-modal fusion," which involves combining the audio and visual data in different ways to get the most useful information for the task. Additionally, they looked at techniques like "audio channel swapping" and "video pixel swapping" to help the models better understand the relationships between the audio and visual cues.

The goal of this work is to develop sound event detection and localization systems that can work well even in challenging real-world environments where there may be limited audio or visual data available. By leveraging both audio and visual information, these systems can become more accurate and reliable.

Technical Explanation

The researchers explored several techniques for fusing audio and visual information to improve sound event localization and detection (SELD) performance in low-resource realistic scenarios:

Cross-Modal Teacher-Student Learning: The researchers trained a teacher model on both audio and video data, then used that model to train a student model that only had access to audio data. This helped the student model learn to better utilize the audio information.
Multi-Modal Fusion: The researchers experimented with different ways of combining the audio and visual features, such as early fusion (concatenating features) and late fusion (combining predictions). This allowed the model to take advantage of complementary information from both modalities.
Audio Channel Swapping and Video Pixel Swapping: To encourage the model to learn cross-modal associations, the researchers randomly swapped audio channels or video pixels during training. This forced the model to learn robust cross-modal representations.

The researchers evaluated their approaches on the DCASE 2022 Task 3 dataset, which contains sound events in realistic low-resource scenarios. They found that the cross-modal teacher-student learning and multi-modal fusion techniques led to significant improvements in SELD performance compared to audio-only baselines.

Critical Analysis

The researchers acknowledge several limitations and areas for further research in their work:

The proposed techniques may not generalize well to other datasets or real-world scenarios beyond the DCASE 2022 Task 3 setup.
The audio channel swapping and video pixel swapping methods, while effective, may not be the optimal way to encourage cross-modal learning.
The paper does not explore the tradeoffs between performance and computational complexity/model size, which would be important for real-world deployment.

Additionally, one could question whether the specific fusion techniques used in the paper are the most effective way to combine audio and visual information. There may be other approaches, such as those explored in related work on audio-visual talker localization, sound event detection and localization, or unified audio-visual perception, that could yield even better results.

Conclusion

This paper presents an interesting exploration of using audio-visual information fusion to improve sound event localization and detection in challenging real-world scenarios. The researchers demonstrated the benefits of cross-modal teacher-student learning and multi-modal fusion techniques, as well as the potential of audio channel swapping and video pixel swapping to encourage robust cross-modal representations.

While the paper has some limitations and open questions, it contributes valuable insights to the field of text-guided visual sound source localization and audio-visual perception. The findings could help inform the development of more accurate and reliable sound event detection and localization systems for real-world applications.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

T-VSL: Text-Guided Visual Sound Source Localization in Mixtures

Tanvir Mahmud, Yapeng Tian, Diana Marculescu

Visual sound source localization poses a significant challenge in identifying the semantic region of each sounding source within a video. Existing self-supervised and weakly supervised source localization methods struggle to accurately distinguish the semantic regions of each sounding object, particularly in multi-source mixtures. These methods often rely on audio-visual correspondence as guidance, which can lead to substantial performance drops in complex multi-source localization scenarios. The lack of access to individual source sounds in multi-source mixtures during training exacerbates the difficulty of learning effective audio-visual correspondence for localization. To address this limitation, in this paper, we propose incorporating the text modality as an intermediate feature guide using tri-modal joint embedding models (e.g., AudioCLIP) to disentangle the semantic audio-visual source correspondence in multi-source mixtures. Our framework, dubbed T-VSL, begins by predicting the class of sounding entities in mixtures. Subsequently, the textual representation of each sounding source is employed as guidance to disentangle fine-grained audio-visual source correspondence from multi-source mixtures, leveraging the tri-modal AudioCLIP embedding. This approach enables our framework to handle a flexible number of sources and exhibits promising zero-shot transferability to unseen classes during test time. Extensive experiments conducted on the MUSIC, VGGSound, and VGGSound-Instruments datasets demonstrate significant performance improvements over state-of-the-art methods.

4/3/2024

cs.CV cs.SD eess.AS

UniAV: Unified Audio-Visual Perception for Multi-Task Video Localization

Tiantian Geng, Teng Wang, Yanfu Zhang, Jinming Duan, Weili Guan, Feng Zheng

Video localization tasks aim to temporally locate specific instances in videos, including temporal action localization (TAL), sound event detection (SED) and audio-visual event localization (AVEL). Existing methods over-specialize on each task, overlooking the fact that these instances often occur in the same video to form the complete video content. In this work, we present UniAV, a Unified Audio-Visual perception network, to achieve joint learning of TAL, SED and AVEL tasks for the first time. UniAV can leverage diverse data available in task-specific datasets, allowing the model to learn and share mutually beneficial knowledge across tasks and modalities. To tackle the challenges posed by substantial variations in datasets (size/domain/duration) and distinct task characteristics, we propose to uniformly encode visual and audio modalities of all videos to derive generic representations, while also designing task-specific experts to capture unique knowledge for each task. Besides, we develop a unified language-aware classifier by utilizing a pre-trained text encoder, enabling the model to flexibly detect various types of instances and previously unseen ones by simply changing prompts during inference. UniAV outperforms its single-task counterparts by a large margin with fewer parameters, achieving on-par or superior performances compared to state-of-the-art task-specific methods across ActivityNet 1.3, DESED and UnAV-100 benchmarks.

4/5/2024

cs.CV cs.MM cs.SD eess.AS

Audio-Visual Talker Localization in Video for Spatial Sound Reproduction

Davide Berghi, Philip J. B. Jackson

Object-based audio production requires the positional metadata to be defined for each point-source object, including the key elements in the foreground of the sound scene. In many media production use cases, both cameras and microphones are employed to make recordings, and the human voice is often a key element. In this research, we detect and locate the active speaker in the video, facilitating the automatic extraction of the positional metadata of the talker relative to the camera's reference frame. With the integration of the visual modality, this study expands upon our previous investigation focused solely on audio-based active speaker detection and localization. Our experiments compare conventional audio-visual approaches for active speaker detection that leverage monaural audio, our previous audio-only method that leverages multichannel recordings from a microphone array, and a novel audio-visual approach integrating vision and multichannel audio. We found the role of the two modalities to complement each other. Multichannel audio, overcoming the problem of visual occlusions, provides a double-digit reduction in detection error compared to audio-visual methods with single-channel audio. The combination of multichannel audio and vision further enhances spatial accuracy, leading to a four-percentage point increase in F1 score on the Tragic Talkers dataset. Future investigations will assess the robustness of the model in noisy and highly reverberant environments, as well as tackle the problem of off-screen speakers.

6/4/2024

eess.AS cs.CV

Sound Event Detection and Localization with Distance Estimation

Daniel Aleksander Krause, Archontis Politis, Annamaria Mesaros

Sound Event Detection and Localization (SELD) is a combined task of identifying sound events and their corresponding direction-of-arrival (DOA). While this task has numerous applications and has been extensively researched in recent years, it fails to provide full information about the sound source position. In this paper, we overcome this problem by extending the task to Sound Event Detection, Localization with Distance Estimation (3D SELD). We study two ways of integrating distance estimation within the SELD core - a multi-task approach, in which the problem is tackled by a separate model output, and a single-task approach obtained by extending the multi-ACCDOA method to include distance information. We investigate both methods for the Ambisonic and binaural versions of STARSS23: Sony-TAU Realistic Spatial Soundscapes 2023. Moreover, our study involves experiments on the loss function related to the distance estimation part. Our results show that it is possible to perform 3D SELD without any degradation of performance in sound event detection and DOA estimation.

6/13/2024

cs.SD cs.LG eess.AS