TSE-PI: Target Sound Extraction under Reverberant Environments with Pitch Information

Read original: arXiv:2406.08716 - Published 6/14/2024 by Yiwen Wang, Xihong Wu

Overview

This paper introduces TSE-PI, a method for extracting a target sound from a reverberant audio environment using pitch information.
The key idea is to leverage the pitch of the target sound to help separate it from background noise and reverberation.
According to the paper's abstract, experiments on the FSD50K dataset show a 2.4 dB improvement in target sound extraction under reverberant conditions when pitch information and a Gammatone filterbank are incorporated.

Plain English Explanation

In many real-world audio scenarios, such as video calls or recordings made in a room, the target sound (e.g., a person's voice) is often obscured by background noise and by reverberation (reflections of the sound off walls and other surfaces). The paper presents a technique for extracting the target sound in these reverberant environments.

The key insight is to use information about the pitch (the highness or lowness) of the target sound. By tracking the pitch, the method can better distinguish the target sound from the background noise and reverberation, which often have different pitch characteristics. This allows the target sound to be more accurately separated and extracted.
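To make "pitch" concrete, here is a minimal sketch of a classical autocorrelation pitch estimator. It is not the paper's method (TSE-PI uses a neural network); the function name `estimate_f0`, the search range, and the synthetic test tone are all assumptions chosen for illustration.

```python
import math

def estimate_f0(signal, sample_rate, f_min=80.0, f_max=500.0):
    """Estimate the fundamental frequency (Hz) from the autocorrelation peak.

    Illustrative only: searches lags that correspond to f_min..f_max and
    returns the frequency whose lag gives the largest raw autocorrelation.
    """
    lag_min = int(sample_rate / f_max)
    lag_max = int(sample_rate / f_min)
    best_lag, best_score = lag_min, float("-inf")
    for lag in range(lag_min, lag_max + 1):
        # Raw (unnormalised) autocorrelation at this lag.
        score = sum(signal[i] * signal[i + lag] for i in range(len(signal) - lag))
        if score > best_score:
            best_lag, best_score = lag, score
    return sample_rate / best_lag

# Synthetic periodic sound: a 200 Hz fundamental plus one weaker harmonic.
sample_rate = 8000
tone = [
    math.sin(2 * math.pi * 200.0 * i / sample_rate)
    + 0.5 * math.sin(2 * math.pi * 400.0 * i / sample_rate)
    for i in range(2000)
]
f0 = estimate_f0(tone, sample_rate)
```

A periodic sound repeats after one pitch period, so the autocorrelation peaks at that lag. Reverberation and competing sources blur this peak, which is one reason a learned, conditioned estimator may be preferable in practice.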

TSE-PI is evaluated on the FSD50K dataset under reverberant conditions, where the abstract reports a 2.4 dB improvement from incorporating pitch information and a Gammatone filterbank. This suggests that pitch cues can be a valuable addition to sound extraction systems, especially in challenging acoustic environments.

Technical Explanation

The TSE-PI method combines two components. First, a pitch extraction network estimates the pitch of the target sound; according to the abstract, it is conditioned on the sound-class label through a Feature-wise Linearly Modulated (FiLM) layer. Second, an extraction network takes the reverberant mixture together with this pitch information and outputs the target sound.
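The paper's abstract states that pitch extraction is conditioned on the sound-class label through a Feature-wise Linearly Modulated (FiLM) layer. The general FiLM operation, a per-channel scale and shift chosen by the conditioning input, can be sketched as below. The function names, table values, and feature shapes are hypothetical, and the lookup tables stand in for what would be a learned layer applied to a one-hot label; this is not the authors' implementation.

```python
def class_conditioning(label_index, gamma_table, beta_table):
    """Look up the per-channel scale and shift for one sound class.

    With a one-hot class label, a learned linear layer reduces to a table
    lookup like this one (the table values here are made up).
    """
    return gamma_table[label_index], beta_table[label_index]

def film(features, gamma, beta):
    """Apply FiLM: out[c][t] = gamma[c] * features[c][t] + beta[c]."""
    return [
        [g * x + b for x in channel]
        for channel, g, b in zip(features, gamma, beta)
    ]

# Two feature channels with three time frames each (hypothetical values).
features = [[1.0, 2.0, 3.0], [0.5, -0.5, 0.0]]

# One row per sound class, one entry per channel.
gamma_table = [[1.0, 0.0], [2.0, 1.0]]
beta_table = [[0.0, 0.0], [-1.0, 0.5]]

gamma, beta = class_conditioning(1, gamma_table, beta_table)
modulated = film(features, gamma, beta)
```

Changing the class index changes which channels are amplified or suppressed, which is how a single network can be steered towards the pitch of the requested sound class.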

The pitch extraction network predicts the fundamental frequency (F0) of the target sound class. The extraction network is a modified Waveformer in which a learnable Gammatone filterbank replaces the convolutional encoder; it takes the reverberant audio mixture and the pitch information as inputs and uses them to separate the target sound.
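The abstract mentions a learnable Gammatone filterbank used in place of the convolutional encoder. As a rough illustration of what one such filter looks like, the sketch below builds the standard Gammatone impulse response with the Glasberg and Moore ERB bandwidth approximation. The function names, default order, duration, and centre frequencies are assumptions; in a learnable version, quantities such as centre frequency and bandwidth would presumably be trainable parameters, which this sketch does not model.

```python
import math

def erb(center_hz):
    """Equivalent rectangular bandwidth in Hz (Glasberg & Moore approximation)."""
    return 24.7 * (4.37 * center_hz / 1000.0 + 1.0)

def gammatone_ir(center_hz, sample_rate, duration_s=0.025, order=4,
                 bandwidth_scale=1.019):
    """Peak-normalised Gammatone impulse response.

    g(t) = t^(order-1) * exp(-2*pi*b*t) * cos(2*pi*center_hz*t),
    with b = bandwidth_scale * erb(center_hz).
    """
    b = bandwidth_scale * erb(center_hz)
    n_samples = round(duration_s * sample_rate)
    ir = []
    for n in range(n_samples):
        t = n / sample_rate
        envelope = t ** (order - 1) * math.exp(-2.0 * math.pi * b * t)
        ir.append(envelope * math.cos(2.0 * math.pi * center_hz * t))
    peak = max(abs(v) for v in ir)
    return [v / peak for v in ir]

# A tiny three-filter bank at hypothetical centre frequencies.
filterbank = [gammatone_ir(f, 16000) for f in (250.0, 1000.0, 4000.0)]
```

Convolving the waveform with each impulse response would give one encoder channel per filter. Compared with a free-form convolutional encoder, this constrains each channel to an auditory-style band-pass shape.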

The authors evaluate TSE-PI on the FSD50K dataset under reverberant conditions. The abstract reports a 2.4 dB improvement in target sound extraction when pitch information and the Gammatone filterbank are incorporated.
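Improvements in extraction quality are commonly reported in decibels, as with the 2.4 dB figure in the paper's abstract. The abstract does not name the exact metric, so the sketch below uses a plain signal-to-noise ratio as a stand-in to show how such an improvement is computed. The function names and the toy signals are assumptions.

```python
import math

def snr_db(reference, estimate):
    """Signal-to-noise ratio in dB of `estimate` against the clean `reference`.

    Assumes the estimate is not identical to the reference (non-zero error).
    """
    signal_power = sum(r * r for r in reference)
    noise_power = sum((r - e) ** 2 for r, e in zip(reference, estimate))
    return 10.0 * math.log10(signal_power / noise_power)

def improvement_db(reference, mixture, estimate):
    """How many dB the estimate gains over the unprocessed mixture."""
    return snr_db(reference, estimate) - snr_db(reference, mixture)

# Toy signals: the estimate halves the error that is present in the mixture.
reference = [1.0, -1.0, 1.0, -1.0]
mixture = [1.5, -0.5, 1.5, -0.5]
estimate = [1.25, -0.75, 1.25, -0.75]

gain = improvement_db(reference, mixture, estimate)

# Power ratio that corresponds to a 2.4 dB gain.
ratio_for_2_4_db = 10 ** (2.4 / 10)
```

Halving the error amplitude yields about 6 dB in this toy case. On the same scale, a 2.4 dB gain corresponds to a signal-to-error power ratio roughly 1.74 times better.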

Critical Analysis

The TSE-PI approach is a promising step towards more robust target sound extraction in reverberant environments. By leveraging pitch information, the method can better cope with reverberation, which the abstract notes significantly degrades the performance of existing TSE models.

One potential limitation is that the extraction stage depends on the quality of the pitch estimate, which may itself degrade in heavily reverberant or noisy mixtures. Additionally, the method may struggle when several sources in the mixture have similar pitch characteristics.

Further research could explore ways to make the pitch estimation more robust, such as by incorporating additional acoustic cues or using semi-supervised learning techniques. Investigating the performance of TSE-PI in more diverse and realistic reverberant environments would also be valuable.

Conclusion

TSE-PI presents a new approach to target sound extraction that leverages pitch information to improve performance in reverberant environments. By incorporating this additional acoustic cue, the method can more effectively separate the target sound from background noise and reverberation, as indicated by the 2.4 dB improvement reported on FSD50K.

This work highlights the potential benefits of using pitch information in audio signal processing tasks, and suggests that further research in this direction could lead to more robust and capable sound extraction systems, with applications in areas like teleconferencing, voice assistants, and audio production.

Related Papers

TSE-PI: Target Sound Extraction under Reverberant Environments with Pitch Information

Yiwen Wang, Xihong Wu

Target sound extraction (TSE) separates the target sound from the mixture signals based on provided clues. However, the performance of existing models significantly degrades under reverberant conditions. Inspired by auditory scene analysis (ASA), this work proposes a TSE model provided with pitch information named TSE-PI. Conditional pitch extraction is achieved through the Feature-wise Linearly Modulated layer with the sound-class label. A modified Waveformer model combined with pitch information, employing a learnable Gammatone filterbank in place of the convolutional encoder, is used for target sound extraction. The inclusion of pitch information is aimed at improving the model's performance. The experimental results on the FSD50K dataset illustrate 2.4 dB improvements of target sound extraction under reverberant environments when incorporating pitch information and Gammatone filterbank.

6/14/2024

Language-Queried Target Sound Extraction Without Parallel Training Data

Hao Ma, Zhiyuan Peng, Xu Li, Yukai Li, Mingjie Shao, Qiuqiang Kong, Ju Liu

Language-queried target sound extraction (TSE) aims to extract specific sounds from mixtures based on language queries. Traditional fully-supervised training schemes require extensively annotated parallel audio-text data, which are labor-intensive. We introduce a language-free training scheme, requiring only unlabelled audio clips for TSE model training by utilizing the multi-modal representation alignment nature of the contrastive language-audio pre-trained model (CLAP). In a vanilla language-free training stage, target audio is encoded using the pre-trained CLAP audio encoder to form a condition embedding for the TSE model, while during inference, user language queries are encoded by CLAP text encoder. This straightforward approach faces challenges due to the modality gap between training and inference queries and information leakage from direct exposure to target audio during training. To address this, we propose a retrieval-augmented strategy. Specifically, we create an embedding cache using audio captions generated by a large language model (LLM). During training, target audio embeddings retrieve text embeddings from this cache to use as condition embeddings, ensuring consistent modalities between training and inference and eliminating information leakage. Extensive experiment results show that our retrieval-augmented approach achieves consistent and notable performance improvements over existing state-of-the-art with better generalizability.

9/17/2024

Interaural time difference loss for binaural target sound extraction

Carlos Hernandez-Olivan, Marc Delcroix, Tsubasa Ochiai, Naohiro Tawara, Tomohiro Nakatani, Shoko Araki

Binaural target sound extraction (TSE) aims to extract a desired sound from a binaural mixture of arbitrary sounds while preserving the spatial cues of the desired sound. Indeed, for many applications, the target sound signal and its spatial cues carry important information about the sound source. Binaural TSE can be realized with a neural network trained to output only the desired sound given a binaural mixture and an embedding characterizing the desired sound class as inputs. Conventional TSE systems are trained using signal-level losses, which measure the difference between the extracted and reference signals for the left and right channels. In this paper, we propose adding explicit spatial losses to better preserve the spatial cues of the target sound. In particular, we explore losses aiming at preserving the interaural level (ILD), phase (IPD), and time differences (ITD). We show experimentally that adding such spatial losses, particularly our newly proposed ITD loss, helps preserve better spatial cues while maintaining the signal-level metrics.

8/2/2024

DENSE: Dynamic Embedding Causal Target Speech Extraction

Yiwen Wang, Zeyu Yuan, Xihong Wu

Target speech extraction (TSE) focuses on extracting the speech of a specific target speaker from a mixture of signals. Existing TSE models typically utilize static embeddings as conditions for extracting the target speaker's voice. However, the static embeddings often fail to capture the contextual information of the extracted speech signal, which may limit the model's performance. We propose a novel dynamic embedding causal target speech extraction model to address this limitation. Our approach incorporates an autoregressive mechanism to generate context-dependent embeddings based on the extracted speech, enabling real-time, frame-level extraction. Experimental results demonstrate that the proposed model enhances short-time objective intelligibility (STOI) and signal-to-distortion ratio (SDR), offering a promising solution for target speech extraction in challenging scenarios.

9/11/2024