IPDnet: A Universal Direct-Path IPD Estimation Network for Sound Source Localization

Read original: arXiv:2405.07021 - Published 5/14/2024 by Yabo Wang, Bing Yang, Xiaofei Li


Overview

This paper proposes a neural network called IPDnet that can estimate the direct-path inter-channel phase difference (DP-IPD) of sound sources from microphone array signals.
The estimated DP-IPD can be used to determine the location of the sound sources based on the known geometry of the microphone array.
The key innovations include a full-band and narrow-band fusion network for DP-IPD estimation, a multi-track DP-IPD learning target for localization of multiple sound sources, and the ability to handle variable microphone array configurations.

Plain English Explanation

When trying to locate the source of a sound using multiple microphones, it's important to extract the direct-path information: the part of the sound that travels straight from the source to the microphones, without first bouncing off walls or other surfaces. This is crucial for accurately determining the location of the sound source, especially in noisy or reverberant environments, where reflections and background noise blur the spatial cues.

The researchers developed a neural network called IPDnet that can estimate the direct-path inter-channel phase difference (DP-IPD) from the microphone array signals. The DP-IPD refers to the difference in the phase of the sound waves reaching different microphones, which contains information about the location of the sound source.
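To make the DP-IPD concrete, here is a minimal sketch of the direct-path phase difference in an idealized case: a far-field source, two microphones in a plane, and no reflections. The function name `direct_path_ipd`, the speed-of-sound value, and the sign convention (phase of the second microphone relative to the first) are assumptions made for illustration, not details taken from the paper.

```python
import numpy as np

SPEED_OF_SOUND = 343.0  # m/s, assumed room-temperature value


def direct_path_ipd(mic_a, mic_b, azimuth_deg, freqs_hz):
    """Idealized far-field DP-IPD of mic_b relative to mic_a.

    Returns one unit-modulus complex number per frequency.
    """
    theta = np.deg2rad(azimuth_deg)
    # Unit vector pointing from the array toward the source.
    direction = np.array([np.cos(theta), np.sin(theta)])
    # Arrival-time difference t_b - t_a: the microphone closer to the
    # source along `direction` receives the wavefront earlier.
    offset = np.asarray(mic_a, dtype=float) - np.asarray(mic_b, dtype=float)
    tdoa = np.dot(offset, direction) / SPEED_OF_SOUND
    # A delay of tdoa seconds appears as a phase of -2*pi*f*tdoa.
    return np.exp(-1j * 2.0 * np.pi * np.asarray(freqs_hz, dtype=float) * tdoa)


# Two microphones 8 cm apart on the x-axis, source at 30 degrees.
freqs = np.array([500.0, 1000.0, 2000.0])
ipd = direct_path_ipd([-0.04, 0.0], [0.04, 0.0], 30.0, freqs)
print(np.angle(ipd))  # phase difference in radians at each frequency
```

The phase difference grows linearly with frequency for a fixed direction, which is one reason the pattern across the whole spectrum is more informative than any single band.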

The key innovations in this work include:

A full-band and narrow-band fusion network, in which narrow-band layers estimate rough DP-IPD information within each frequency band and full-band layers capture how the DP-IPD correlates across frequencies.
A new multi-track DP-IPD learning target that allows the network to localize a flexible number of sound sources.
The ability to handle variable microphone arrays, so the trained model can process arbitrary microphone arrays with different numbers of channels and array topologies.

These advancements help the IPDnet achieve excellent sound source localization performance, even in challenging real-world scenarios with multiple moving sound sources.

Technical Explanation

The proposed IPDnet is a neural network that estimates the direct-path inter-channel phase difference (DP-IPD) of sound sources from microphone array signals. The DP-IPD is a crucial spatial feature for accurate sound source localization, especially in adverse acoustic environments with background noise and reverberation.
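As a rough illustration of how a DP-IPD estimate can be turned into a direction using the known array geometry, the sketch below compares an estimated DP-IPD vector with idealized templates computed for a grid of candidate directions and picks the best match. The far-field model, the 5-degree grid, the two-microphone layout, and the names `ipd_templates` and `localize` are all assumptions for this example; it is not presented as the paper's exact procedure.

```python
import numpy as np

SPEED_OF_SOUND = 343.0  # m/s, assumed


def ipd_templates(mic_a, mic_b, candidate_az_deg, freqs_hz):
    """Idealized far-field DP-IPD for every candidate azimuth: shape (D, F)."""
    theta = np.deg2rad(np.asarray(candidate_az_deg, dtype=float))
    directions = np.stack([np.cos(theta), np.sin(theta)], axis=1)  # (D, 2)
    offset = np.asarray(mic_a, dtype=float) - np.asarray(mic_b, dtype=float)
    tdoa = directions @ offset / SPEED_OF_SOUND  # (D,)
    return np.exp(-1j * 2.0 * np.pi * np.outer(tdoa, freqs_hz))


def localize(estimated_ipd, templates, candidate_az_deg):
    """Return the candidate azimuth whose template best matches the estimate."""
    # Real part of the inner product over frequency: largest when phases align.
    scores = np.real(templates @ np.conj(estimated_ipd))  # (D,)
    return candidate_az_deg[int(np.argmax(scores))], scores


freqs = np.linspace(0.0, 8000.0, 257)
# A two-microphone pair on a line cannot tell front from back,
# so candidates are limited to 0..180 degrees.
candidates = np.arange(0, 181, 5)
templates = ipd_templates([-0.04, 0.0], [0.04, 0.0], candidates, freqs)

# Stand-in for a network output: the ideal DP-IPD of a source at 60 degrees.
estimate = templates[list(candidates).index(60)]
azimuth, _ = localize(estimate, templates, candidates)
print(azimuth)
```

With more than two microphones, the same matching could be summed over microphone pairs, which also helps resolve the ambiguities of a single pair.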

The key innovations in the IPDnet architecture include:

Full-band and Narrow-band Fusion Network: The network alternates narrow-band and full-band layers. The narrow-band layers estimate rough DP-IPD information within a single frequency band, while the full-band layers capture the correlations of the DP-IPD across frequencies. Combining the two lets per-band estimates be refined with information from the rest of the spectrum.
Multi-track DP-IPD Learning Target: The researchers propose a new learning target that allows the network to localize a flexible number of sound sources. Instead of predicting a single DP-IPD value, the network outputs a multi-track DP-IPD vector, where each track corresponds to the DP-IPD of one sound source.
Variable Microphone Array Handling: The IPDnet can be extended to handle variable microphone arrays, meaning that once trained, the model can process arbitrary microphone arrays with different numbers of channels and array topologies. This makes the approach more practical for real-world applications, where the microphone array configuration may not be fixed.
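The alternation between narrow-band and full-band processing described above comes down to which axis of the feature tensor is treated as the sequence. The sketch below shows only that data layout: `sequence_layer` is a deliberately simple stand-in (a running mean followed by a shared linear map), not the layer type used in the paper, and the layer order, tensor shapes, and function names are assumptions.

```python
import numpy as np


def sequence_layer(x, weight):
    """Toy stand-in for a sequence model. x has shape (batch, steps, channels)."""
    steps = np.arange(1, x.shape[1] + 1)[None, :, None]
    running_mean = np.cumsum(x, axis=1) / steps  # context along the sequence axis
    return np.tanh(running_mean @ weight)


def narrow_band_layer(x, weight):
    """x: (F, T, C). Each frequency is its own sequence over time.

    The same weights are shared by all frequencies.
    """
    return sequence_layer(x, weight)  # batch = F, steps = T


def full_band_layer(x, weight):
    """x: (F, T, C). Each time frame is its own sequence over frequency."""
    y = sequence_layer(x.transpose(1, 0, 2), weight)  # batch = T, steps = F
    return y.transpose(1, 0, 2)


def fusion_stack(x, layer_weights):
    """Alternate full-band and narrow-band layers (order assumed here)."""
    for w_full, w_narrow in layer_weights:
        x = full_band_layer(x, w_full)
        x = narrow_band_layer(x, w_narrow)
    return x


rng = np.random.default_rng(0)
features = rng.standard_normal((129, 50, 8))  # (frequencies, frames, channels)
weights = [(rng.standard_normal((8, 8)), rng.standard_normal((8, 8))) for _ in range(2)]
print(fusion_stack(features, weights).shape)
```

In this layout a narrow-band layer never mixes information between frequencies and a full-band layer never mixes information between frames, so each kind of layer supplies the context the other lacks.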

Experiments on both simulated and real-world data for multiple-moving-speaker localization demonstrate the excellent performance of the proposed full-band and narrow-band fusion network and the multi-track DP-IPD learning target. Additionally, the variable-array model generalized well to unseen microphone array configurations, showcasing its robustness and flexibility.
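One way to picture a multi-track learning target is as a fixed number of output tracks, each holding the DP-IPD of one source as real and imaginary parts, with unused or silent tracks left empty. The sketch below assumes that inactive tracks are filled with zeros and that sources are assigned to tracks in their given order; both choices, along with the function names and the 0.5 threshold, are illustrative assumptions rather than details confirmed by this summary.

```python
import numpy as np


def multi_track_target(source_ipds, activity, num_tracks):
    """Build a (num_tracks, T, 2F) target from per-source DP-IPDs.

    source_ipds: (S, T, F) complex, unit-modulus DP-IPD of each source.
    activity:    (S, T) bool, True where the source is active.
    """
    num_sources, num_frames, num_freqs = source_ipds.shape
    if num_sources > num_tracks:
        raise ValueError("more sources than available tracks")
    target = np.zeros((num_tracks, num_frames, 2 * num_freqs))
    masked = source_ipds * activity[:, :, None]  # zero out silent frames
    target[:num_sources, :, :num_freqs] = masked.real
    target[:num_sources, :, num_freqs:] = masked.imag
    return target


def count_active_tracks(track_output, threshold=0.5):
    """Count, per frame, the tracks whose mean DP-IPD magnitude exceeds threshold."""
    num_freqs = track_output.shape[-1] // 2
    real, imag = track_output[..., :num_freqs], track_output[..., num_freqs:]
    magnitude = np.sqrt(real**2 + imag**2).mean(axis=-1)  # (tracks, T)
    return (magnitude > threshold).sum(axis=0)


# Two sources, three tracks: the second source starts speaking at frame 2.
ipds = np.exp(1j * np.ones((2, 4, 5)))
activity = np.array([[1, 1, 1, 1], [0, 0, 1, 1]], dtype=bool)
target = multi_track_target(ipds, activity, num_tracks=3)
print(target.shape, count_active_tracks(target))
```

Because a silent track carries an all-zero vector while an active one carries unit-modulus values, the number of sources can be read from the output itself instead of being fixed in advance.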

Critical Analysis

The researchers have addressed several important challenges in sound source localization, such as the need to extract direct-path spatial features, handle multiple sound sources, and accommodate variable microphone array configurations. The proposed IPDnet architecture and learning targets represent significant advancements in the field.

However, the paper does not provide a detailed analysis of the computational complexity and runtime performance of the IPDnet, which are crucial factors for real-world deployment. Additionally, the authors could have explored the interpretability of the trained model to better understand the network's decision-making process and potential biases.

Future research could also explore the integration of physical constraints or domain knowledge to further improve the model's robustness and generalization capabilities.

Conclusion

This paper presents the IPDnet, a novel neural network architecture for estimating the direct-path inter-channel phase difference (DP-IPD) from microphone array signals. The key innovations, including the full-band and narrow-band fusion network, multi-track DP-IPD learning target, and variable microphone array handling, enable the IPDnet to achieve excellent sound source localization performance, even in challenging real-world scenarios.

The proposed approach represents a significant advancement in the field of spatial audio processing and has the potential to enhance a wide range of applications, such as smart home systems, teleconferencing, and augmented reality. As the research in this area continues to evolve, the insights and techniques presented in this paper will likely inspire further developments and contribute to the ongoing progress in sound source localization and related spatial audio technologies.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!


Related Papers


IPDnet: A Universal Direct-Path IPD Estimation Network for Sound Source Localization

Yabo Wang, Bing Yang, Xiaofei Li

Extracting direct-path spatial features is crucial for sound source localization in adverse acoustic environments. This paper proposes the IPDnet, a neural network that estimates the direct-path inter-channel phase difference (DP-IPD) of sound sources from microphone array signals. The estimated DP-IPD can be easily translated to source location based on the known microphone array geometry. First, a full-band and narrow-band fusion network is proposed for DP-IPD estimation, in which alternating narrow-band and full-band layers are responsible for estimating the rough DP-IPD information in one frequency band and capturing the frequency correlations of DP-IPD, respectively. Second, a new multi-track DP-IPD learning target is proposed for the localization of a flexible number of sound sources. Third, the IPDnet is extended to handle variable microphone arrays: once trained, it is able to process arbitrary microphone arrays with different numbers of channels and array topologies. Experiments on multiple-moving-speaker localization are conducted on both simulated and real-world data, which show that the proposed full-band and narrow-band fusion network and the proposed multi-track DP-IPD learning target together achieve excellent sound source localization performance. Moreover, the proposed variable-array model generalizes well to unseen microphone arrays.

5/14/2024

Inference-Adaptive Neural Steering for Real-Time Area-Based Sound Source Separation

Martin Strauss, Wolfgang Mack, María Luis Valero, Okan Kopuklu

We propose a novel Neural Steering technique that adapts the target area of a spatial-aware multi-microphone sound source separation algorithm during inference without the necessity of retraining the deep neural network (DNN). To achieve this, we first train a DNN aiming to retain speech within a target region, defined by an angular span, while suppressing sound sources stemming from other directions. Afterward, a phase shift is applied to the microphone signals, allowing us to shift the center of the target area during inference at negligible additional cost in computational complexity. Further, we show that the proposed approach performs well in a wide variety of acoustic scenarios, including several speakers inside and outside the target area and additional noise. More precisely, the proposed approach performs on par with DNNs trained explicitly for the steered target area in terms of DNSMOS and SI-SDR.

8/26/2024

Configurable DOA Estimation using Incremental Learning

Yang Xiao, Rohan Kumar Das

This study introduces a progressive neural network (PNN) model for direction of arrival (DOA) estimation, DOA-PNN, addressing the challenge of catastrophic forgetting when adapting to dynamic acoustic environments. While traditional methods such as GCC, MUSIC, and SRP-PHAT are effective in static settings, they perform worse in noisy, reverberant conditions. Deep learning models, particularly CNNs, offer improvements but struggle with a configuration mismatch between the training and inference phases. The proposed DOA-PNN overcomes these limitations by incorporating task-incremental continual learning, allowing for adaptation across varying acoustic scenarios with less forgetting of previously learned knowledge. Featuring task-specific sub-networks and a scaling mechanism, DOA-PNN efficiently manages parameter growth, ensuring high performance across incremental microphone configurations. We study DOA-PNN on simulated data under microphone settings that vary in inter-microphone distance. The studies reveal its capability to maintain performance with minimal parameter increase, presenting an efficient solution for DOA estimation.

8/27/2024


USDnet: Unsupervised Speech Dereverberation via Neural Forward Filtering

Zhong-Qiu Wang

In reverberant conditions with a single speaker, each far-field microphone records a reverberant version of the same speaker signal at a different location. In over-determined conditions, where there are multiple microphones but only one speaker, each recorded mixture signal can be leveraged as a constraint to narrow down the solutions to target anechoic speech and thereby reduce reverberation. Equipped with this insight, we propose USDnet, a novel deep neural network (DNN) approach for unsupervised speech dereverberation (USD). At each training step, we first feed an input mixture to USDnet to produce an estimate for target speech, and then linearly filter the DNN estimate to approximate the multi-microphone mixture so that the constraint can be satisfied at each microphone, thereby regularizing the DNN estimate to approximate target anechoic speech. The linear filter can be estimated based on the mixture and DNN estimate via neural forward filtering algorithms such as forward convolutive prediction. We show that this novel methodology can promote unsupervised dereverberation of single-source reverberant speech.

8/14/2024