Cross-domain Neural Pitch and Periodicity Estimation

Read original: arXiv:2301.12258 - Published 8/13/2024 by Max Morrison, Caedon Hsieh, Nathan Pruyne, Bryan Pardo


Overview

Pitch is a fundamental aspect of how we perceive audio signals.
Pitch contours are commonly used to analyze speech and music signals and as input features for many audio tasks.
This paper describes techniques to improve the accuracy of neural pitch and periodicity estimators, achieving state-of-the-art performance on both speech and music.
A novel entropy-based method is introduced for extracting periodicity and voiced-unvoiced classifications from statistical inference-based pitch estimators.
The paper shows how to train a neural pitch estimator to handle both speech and music data without performance degradation.
The estimator implementations run 11.2x faster than real-time on a CPU, approaching the speed of state-of-the-art DSP-based pitch estimators, and 408x faster than real-time on a GPU.
An open-source Python module called Pitch-Estimating Neural Networks (penn) is released for training, evaluating, and performing inference with pitch- and periodicity-estimating neural networks.

Plain English Explanation

Pitch is a fundamental aspect of how we perceive sound. It refers to whether a sound is high or low. Pitch contours are commonly used to analyze speech and music signals, and they are also used as input features for many audio-related tasks, such as music transcription, singing voice synthesis, and prosody editing.

In this paper, the researchers describe techniques they developed to improve the accuracy of widely-used neural pitch and periodicity estimators. Periodicity refers to the regularity of a sound wave. By using these techniques, the researchers were able to achieve state-of-the-art performance on pitch estimation for both speech and music.

The researchers also introduced a novel entropy-based method for extracting two things from statistical inference-based pitch estimators, such as neural networks: periodicity, and a per-frame decision about whether a sound is voiced (produced by vibrating vocal cords) or unvoiced. Additionally, they showed how to train a neural pitch estimator to handle both speech and music data without any loss in performance.

Importantly, the researchers' estimator implementations run very fast. On a CPU they approach the speed of state-of-the-art pitch estimators based on digital signal processing (DSP), and on a GPU they run hundreds of times faster than real time. This matters for real-time applications, where pitch must be estimated at least as quickly as the audio arrives.
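To make "faster than real time" concrete, the short sketch below converts a real-time factor into processing time. The factors 11.2 (CPU) and 408 (GPU) are the figures reported by the paper; the function names and the 60-second clip length are illustrative assumptions, not part of penn.

```python
def real_time_factor(audio_seconds, processing_seconds):
    """Seconds of audio processed per second of compute."""
    if processing_seconds <= 0:
        raise ValueError("processing_seconds must be positive")
    return audio_seconds / processing_seconds


def processing_seconds(audio_seconds, factor):
    """Compute time needed for a clip at a given real-time factor."""
    if factor <= 0:
        raise ValueError("factor must be positive")
    return audio_seconds / factor


# Hypothetical 60-second clip, using the factors reported in the paper
clip = 60.0
cpu_time = processing_seconds(clip, 11.2)   # a little over 5 seconds
gpu_time = processing_seconds(clip, 408.0)  # well under 1 second
print(f"CPU: {cpu_time:.2f} s, GPU: {gpu_time:.3f} s")
```

Any factor above 1 means the estimator keeps up with incoming audio. Actual latency also depends on buffering and hardware, which this arithmetic ignores.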

The researchers have released all of their code and models as an open-source Python module called Pitch-Estimating Neural Networks (penn), which allows others to train, evaluate, and use their pitch and periodicity estimating neural networks.

Technical Explanation

The paper focuses on improving the accuracy of neural pitch and periodicity estimators for both speech and music signals. The researchers developed several key techniques:

Improved Pitch Estimation Accuracy: The researchers applied various enhancements to widely-used neural pitch estimators to achieve state-of-the-art performance on both speech and music data. This included techniques like improved loss functions and data augmentation.
Novel Periodicity Extraction Method: The researchers introduced a novel entropy-based method for extracting periodicity and per-frame voiced-unvoiced classifications from statistical inference-based pitch estimators, such as neural networks. This allows for more accurate and robust periodicity estimation.
Cross-Domain Pitch Estimation: The researchers showed how to train a single neural pitch estimator to handle both speech and music data without any performance degradation. This "cross-domain" estimation capability is important for real-world applications.
Extremely Fast Inference: The researchers' estimator implementations run 11.2x faster than real-time on an Intel i9-9820X 10-core 3.30 GHz CPU and 408x faster than real-time on an NVIDIA GeForce RTX 3090 GPU. The CPU speed approaches that of state-of-the-art DSP-based pitch estimators, making the neural-network-based approach viable for real-time applications.
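The idea behind an entropy-based periodicity measure can be sketched as follows. The sketch assumes the network outputs a categorical distribution over pitch bins for each frame. A sharply peaked distribution (low entropy) suggests a clearly periodic frame, and a flat one (high entropy) suggests noise or silence. The normalization by the log of the bin count, the function names, and the 0.5 threshold are assumptions for illustration, not the exact formulation or defaults used in the paper or in penn.

```python
import math


def entropy_periodicity(probs):
    """Return 1 - H(p) / log(N) for a distribution over N pitch bins.

    Illustrative assumption: periodicity is one minus the entropy of the
    per-frame pitch distribution, normalized to the range [0, 1].
    """
    n = len(probs)
    if n < 2:
        raise ValueError("need at least two pitch bins")
    total = sum(probs)
    if total <= 0:
        raise ValueError("probabilities must sum to a positive value")
    # Renormalize defensively so the entropy is well defined
    p = [x / total for x in probs]
    # Shannon entropy in nats; bins with zero mass contribute nothing
    entropy = -sum(x * math.log(x) for x in p if x > 0.0)
    return 1.0 - entropy / math.log(n)


def is_voiced(probs, threshold=0.5):
    """Per-frame voiced decision by thresholding periodicity.

    The threshold is a hypothetical value; in practice it would be tuned.
    """
    return entropy_periodicity(probs) >= threshold


# Toy 4-bin frames: one confident (peaked), one uninformative (flat)
peaked = [0.97, 0.01, 0.01, 0.01]
flat = [0.25, 0.25, 0.25, 0.25]
print(entropy_periodicity(peaked), is_voiced(peaked))
print(entropy_periodicity(flat), is_voiced(flat))
```

A real estimator would use hundreds of pitch bins per frame and apply this frame by frame. The appeal of the approach is that it reuses the distribution the pitch estimator already produces, so no separate voicing model is needed.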

The researchers evaluated their techniques on standard speech and music pitch estimation benchmarks, demonstrating significant improvements over previous state-of-the-art methods. They also released their code and models as the open-source Pitch-Estimating Neural Networks (penn) library, allowing others to build upon their work.

Critical Analysis

The paper presents a thorough and well-designed set of techniques for improving neural-network-based pitch and periodicity estimation. The novel entropy-based periodicity extraction method is a particularly interesting contribution, as it allows for more robust and accurate periodicity estimation compared to previous approaches.

One potential limitation is that the paper does not provide a deep analysis of the failure cases or limitations of their proposed methods. While the researchers demonstrate state-of-the-art performance, it would be valuable to understand the specific scenarios where the estimators may struggle or produce inaccurate results.

Additionally, the paper does not explore the implications of the fast inference speeds for real-world applications. It would be informative to see examples of how the researchers' techniques could be leveraged in practical audio processing systems, such as robust speech separation models for similar-pitch speakers or automatic equalization of individual instrument tracks.

Overall, the paper presents a significant advancement in the field of neural pitch and periodicity estimation, with the open-sourcing of the penn library being a valuable contribution to the research community. Further exploration of the practical applications and limitations of the techniques could lead to even greater impact.

Conclusion

This paper introduces a set of techniques to improve the accuracy and speed of neural pitch and periodicity estimators for both speech and music signals. The researchers achieved state-of-the-art results on standard benchmarks and developed a novel entropy-based method for extracting periodicity and voiced-unvoiced classifications. Importantly, the researchers' estimator implementations run extremely fast, approaching the speed of state-of-the-art DSP-based pitch estimators.

The open-sourcing of the penn library allows others to build upon this work and leverage the researchers' pitch and periodicity estimation capabilities in a wide range of audio processing applications, from music transcription to singing voice synthesis and beyond. As pitch and periodicity are fundamental to our perception of sound, the techniques described in this paper have the potential to enable significant advancements in numerous audio-related fields.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!


Related Papers


Cross-domain Neural Pitch and Periodicity Estimation

Max Morrison, Caedon Hsieh, Nathan Pruyne, Bryan Pardo

Pitch is a foundational aspect of our perception of audio signals. Pitch contours are commonly used to analyze speech and music signals and as input features for many audio tasks, including music transcription, singing voice synthesis, and prosody editing. In this paper, we describe a set of techniques for improving the accuracy of widely-used neural pitch and periodicity estimators to achieve state-of-the-art performance on both speech and music. We also introduce a novel entropy-based method for extracting periodicity and per-frame voiced-unvoiced classifications from statistical inference-based pitch estimators (e.g., neural networks), and show how to train a neural pitch estimator to simultaneously handle both speech and music data (i.e., cross-domain estimation) without performance degradation. Our estimator implementations run 11.2x faster than real-time on an Intel i9-9820X 10-core 3.30 GHz CPU, approaching the speed of state-of-the-art DSP-based pitch estimators, or 408x faster than real-time on an NVIDIA GeForce RTX 3090 GPU. We release all of our code and models as Pitch-Estimating Neural Networks (penn), an open-source, pip-installable Python module for training, evaluating, and performing inference with pitch- and periodicity-estimating neural networks. The code for penn is available at https://github.com/interactiveaudiolab/penn.

8/13/2024

Period Singer: Integrating Periodic and Aperiodic Variational Autoencoders for Natural-Sounding End-to-End Singing Voice Synthesis

Taewoo Kim, Choongsang Cho, Young Han Lee

In this paper, we present Period Singer, a novel end-to-end singing voice synthesis (SVS) model that utilizes variational inference for periodic and aperiodic components, aimed at producing natural-sounding waveforms. Recent end-to-end SVS models have demonstrated the capability of synthesizing high-fidelity singing voices. However, owing to deterministic pitch conditioning, they do not fully address the one-to-many problem. To address this problem, we present the Period Singer architecture, which integrates variational autoencoders for the periodic and aperiodic components. Additionally, our methodology eliminates the dependency on an external aligner by estimating the phoneme alignment through a monotonic alignment search within note boundaries. Our empirical evaluations show that Period Singer outperforms existing end-to-end SVS models on Mandarin and Korean datasets. The efficacy of the proposed method was further corroborated by ablation studies.

9/12/2024

PeriodWave: Multi-Period Flow Matching for High-Fidelity Waveform Generation

Sang-Hoon Lee, Ha-Yeong Choi, Seong-Whan Lee

Recently, universal waveform generation tasks have been investigated conditioned on various out-of-distribution scenarios. Although GAN-based methods have shown their strength in fast waveform generation, they are vulnerable to train-inference mismatch scenarios such as two-stage text-to-speech. Meanwhile, diffusion-based models have shown their powerful generative performance in other domains; however, they stay out of the limelight due to slow inference speed in waveform generation tasks. Above all, there is no generator architecture that can explicitly disentangle the natural periodic features of high-resolution waveform signals. In this paper, we propose PeriodWave, a novel universal waveform generation model. First, we introduce a period-aware flow matching estimator that can capture the periodic features of the waveform signal when estimating the vector fields. Additionally, we utilize a multi-period estimator that avoids overlaps to capture different periodic features of waveform signals. Although increasing the number of periods can improve the performance significantly, this requires more computational costs. To reduce this issue, we also propose a single period-conditional universal estimator that can feed-forward parallel by period-wise batch inference. Additionally, we utilize discrete wavelet transform to losslessly disentangle the frequency information of waveform signals for high-frequency modeling, and introduce FreeU to reduce the high-frequency noise for waveform generation. The experimental results demonstrated that our model outperforms the previous models both in Mel-spectrogram reconstruction and text-to-speech tasks. All source code will be available at https://github.com/sh-lee-prml/PeriodWave.

8/15/2024

TSE-PI: Target Sound Extraction under Reverberant Environments with Pitch Information

Yiwen Wang, Xihong Wu

Target sound extraction (TSE) separates the target sound from the mixture signals based on provided clues. However, the performance of existing models significantly degrades under reverberant conditions. Inspired by auditory scene analysis (ASA), this work proposes a TSE model provided with pitch information named TSE-PI. Conditional pitch extraction is achieved through the Feature-wise Linearly Modulated layer with the sound-class label. A modified Waveformer model combined with pitch information, employing a learnable Gammatone filterbank in place of the convolutional encoder, is used for target sound extraction. The inclusion of pitch information is aimed at improving the model's performance. The experimental results on the FSD50K dataset illustrate 2.4 dB improvements of target sound extraction under reverberant environments when incorporating pitch information and Gammatone filterbank.

6/14/2024