PeriodWave: Multi-Period Flow Matching for High-Fidelity Waveform Generation

Read original: arXiv:2408.07547 - Published 8/15/2024 by Sang-Hoon Lee, Ha-Yeong Choi, Seong-Whan Lee

Overview

PeriodWave is a universal waveform generation model that uses flow matching and explicitly models the periodic structure of waveform signals at several periods.
It introduces a period-aware flow matching estimator, extends it to multiple non-overlapping periods, and uses a single period-conditional estimator so that all periods can be processed in parallel.
According to the paper's abstract, PeriodWave outperforms previous models in both Mel-spectrogram reconstruction and text-to-speech tasks.

Plain English Explanation

PeriodWave is a technique for generating high-quality audio waveforms that pays explicit attention to the repeating (periodic) patterns in sound, looking at them across several different period lengths.

Audio signals mix regular, repetitive patterns with irregular, noise-like elements. According to the paper, no earlier generator architecture explicitly disentangles the periodic part of high-resolution waveforms. PeriodWave addresses this by building period awareness into a flow matching model.

Flow matching trains a network to estimate a vector field that gradually turns random noise into a waveform. PeriodWave's estimator views the signal at several different periods, so that repeating patterns of different lengths can each be captured. It also splits the signal into frequency bands with a discrete wavelet transform to help model high-frequency detail.

Together, these choices let PeriodWave outperform earlier models in Mel-spectrogram reconstruction and text-to-speech, according to the authors. This represents an important advance in the field of high-fidelity waveform generation.

Technical Explanation

PeriodWave is a flow matching generative model for universal waveform generation that captures the periodic features of the waveform at multiple periods.

The key innovation is a period-aware flow matching estimator, which captures the periodic features of the waveform signal while estimating the vector fields. A multi-period version of the estimator uses several periods chosen to avoid overlaps, so that each period captures different periodic features.
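To make the multi-period idea concrete, here is a minimal NumPy sketch of folding a 1D waveform into 2D views, one per period. The period set `(1, 2, 3, 5, 7)`, the zero padding, and the helper names `reshape_by_period` and `unfold` are assumptions made for illustration; this is not the authors' implementation, only a sketch of the general reshaping trick that period-based audio models tend to use.

```python
import numpy as np

# Assumed period set for illustration; using 1 and small primes keeps
# the folded views from duplicating each other's structure.
PERIODS = (1, 2, 3, 5, 7)

def reshape_by_period(wave, period):
    """Fold a 1D waveform of length T into shape (ceil(T / period), period)."""
    wave = np.asarray(wave, dtype=np.float32)
    pad = (-len(wave)) % period  # samples needed to reach a multiple of period
    if pad:
        wave = np.pad(wave, (0, pad), mode="constant")  # zero padding (assumption)
    return wave.reshape(-1, period)

def unfold(folded, length):
    """Invert reshape_by_period by flattening and dropping the padding."""
    return folded.reshape(-1)[:length]

# Toy signal that repeats every 5 samples.
wave = np.sin(2 * np.pi * np.arange(20) / 5.0)
views = {p: reshape_by_period(wave, p) for p in PERIODS}
```

In the period-5 view, every row holds one full cycle of the toy sine, so each column is approximately constant. A network operating on that 2D view sees the repetition directly, which is the intuition behind looking at the same signal through several periods.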

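Specifically, the abstract describes the following components:

A period-aware flow matching estimator that captures the periodic features of the waveform signal when estimating the vector fields.
A multi-period estimator that uses non-overlapping periods to capture different periodic features. Increasing the number of periods improves performance but raises the computational cost.
A single period-conditional universal estimator that reduces this cost by running all periods in parallel through period-wise batch inference.
A discrete wavelet transform that losslessly disentangles the frequency information of the waveform for high-frequency modeling, together with FreeU to reduce high-frequency noise.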
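The paper's abstract also mentions using a discrete wavelet transform (DWT) to losslessly disentangle the frequency information of the waveform. The sketch below shows what "lossless" means here with a one-level Haar transform in NumPy. The choice of the Haar wavelet and the function names `haar_dwt` and `haar_idwt` are assumptions for illustration; the summary does not say which wavelet or how many levels PeriodWave uses.

```python
import numpy as np

def haar_dwt(x):
    """One level of the Haar DWT: returns (low, high) bands, each half as long."""
    x = np.asarray(x, dtype=np.float64)
    if len(x) % 2:
        raise ValueError("signal length must be even")
    even, odd = x[0::2], x[1::2]
    low = (even + odd) / np.sqrt(2.0)   # local averages: low-frequency band
    high = (even - odd) / np.sqrt(2.0)  # local differences: high-frequency band
    return low, high

def haar_idwt(low, high):
    """Inverse of haar_dwt: interleave the reconstructed even and odd samples."""
    out = np.empty(2 * len(low), dtype=np.float64)
    out[0::2] = (low + high) / np.sqrt(2.0)
    out[1::2] = (low - high) / np.sqrt(2.0)
    return out

rng = np.random.default_rng(0)
signal = rng.standard_normal(16)
low, high = haar_dwt(signal)
restored = haar_idwt(low, high)
```

Because the transform is orthogonal, `restored` matches `signal` up to floating-point error and the total energy of the two bands equals that of the input. That is why a model can work on the bands separately without discarding information.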

The estimator is trained with a flow matching objective, that is, by regressing the vector field that transports noise to the target waveform. The authors report that this allows PeriodWave to outperform previous models in both Mel-spectrogram reconstruction and text-to-speech tasks.
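As a rough illustration of what a flow matching objective looks like, the sketch below uses one common conditional flow matching formulation with a straight-line path between noise and data, plus a fixed-step Euler sampler. The value of `SIGMA_MIN`, the function names, and the `oracle` estimator (which cheats by knowing the target) are assumptions for demonstration only. Whether PeriodWave uses exactly this path and solver is not stated in this summary.

```python
import numpy as np

SIGMA_MIN = 1e-4  # assumed small noise floor at t = 1

def cfm_pair(x0, x1, t, sigma_min=SIGMA_MIN):
    """Point on the noise-to-data path at time t and its regression target.

    x0: noise, x1: data, both shaped (batch, samples); t: shaped (batch, 1).
    """
    x_t = (1.0 - (1.0 - sigma_min) * t) * x0 + t * x1
    target = x1 - (1.0 - sigma_min) * x0  # constant velocity along the path
    return x_t, target

def cfm_loss(estimator, x0, x1, t):
    """Mean squared error between the estimated and target vector fields."""
    x_t, target = cfm_pair(x0, x1, t)
    return float(np.mean((estimator(x_t, t) - target) ** 2))

def euler_sample(estimator, x0, steps):
    """Integrate dx/dt = estimator(x, t) from t = 0 to t = 1 with Euler steps."""
    x = x0.copy()
    dt = 1.0 / steps
    for i in range(steps):
        t = np.full((x.shape[0], 1), i * dt)
        x = x + dt * estimator(x, t)
    return x

rng = np.random.default_rng(0)
x1 = rng.standard_normal((2, 8))  # stand-in for target waveforms
x0 = rng.standard_normal((2, 8))  # Gaussian noise
t = rng.uniform(size=(2, 1))

def oracle(x, t, s=SIGMA_MIN):
    """Hypothetical perfect estimator: the true conditional field given x1."""
    return (x1 - (1.0 - s) * x) / (1.0 - (1.0 - s) * t)

loss = cfm_loss(oracle, x0, x1, t)
sample = euler_sample(oracle, x0, steps=4)
```

With the oracle, the loss is essentially zero and a few Euler steps carry the noise to the target, up to the small `SIGMA_MIN` residual. A trained network replaces the oracle and, in a vocoder setting, would also take the conditioning features (for example a Mel-spectrogram) as input.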

Critical Analysis

The paper evaluates PeriodWave against previous models on two tasks, Mel-spectrogram reconstruction and text-to-speech, and reports that it outperforms them on both.

However, the paper does not extensively discuss potential limitations or future research directions. For example, it would be valuable to understand how PeriodWave performs on specific audio domains (e.g. speech, music) or how it scales to longer audio durations.

Additionally, the paper could have provided more insight into what the estimator learns at each period, and how those period-wise features might be leveraged for related tasks such as pitch estimation.

Overall, PeriodWave represents a significant advancement in the field of high-fidelity audio generation, and the paper lays a strong foundation for future research in this area.

Conclusion

PeriodWave is a flow matching approach for generating high-quality audio waveforms that explicitly models the periodic features of real-world signals. By combining a period-aware, multi-period vector field estimator with a discrete wavelet transform and FreeU, it is reported to outperform previous models in Mel-spectrogram reconstruction and text-to-speech.

This work represents an important step forward in the field of generative audio modeling, with potential applications in areas like text-to-speech, voice conversion, and beyond. Further research is needed to fully understand the limitations and potential of this approach, but PeriodWave demonstrates the value of explicitly modeling the periodic structure of audio signals.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

PeriodWave: Multi-Period Flow Matching for High-Fidelity Waveform Generation

Sang-Hoon Lee, Ha-Yeong Choi, Seong-Whan Lee

Recently, universal waveform generation tasks have been investigated conditioned on various out-of-distribution scenarios. Although GAN-based methods have shown their strength in fast waveform generation, they are vulnerable to train-inference mismatch scenarios such as two-stage text-to-speech. Meanwhile, diffusion-based models have shown their powerful generative performance in other domains; however, they stay out of the limelight due to slow inference speed in waveform generation tasks. Above all, there is no generator architecture that can explicitly disentangle the natural periodic features of high-resolution waveform signals. In this paper, we propose PeriodWave, a novel universal waveform generation model. First, we introduce a period-aware flow matching estimator that can capture the periodic features of the waveform signal when estimating the vector fields. Additionally, we utilize a multi-period estimator that avoids overlaps to capture different periodic features of waveform signals. Although increasing the number of periods can improve the performance significantly, this requires more computational costs. To reduce this issue, we also propose a single period-conditional universal estimator that can feed-forward parallel by period-wise batch inference. Additionally, we utilize discrete wavelet transform to losslessly disentangle the frequency information of waveform signals for high-frequency modeling, and introduce FreeU to reduce the high-frequency noise for waveform generation. The experimental results demonstrated that our model outperforms the previous models both in Mel-spectrogram reconstruction and text-to-speech tasks. All source code will be available at https://github.com/sh-lee-prml/PeriodWave.

8/15/2024

Accelerating High-Fidelity Waveform Generation via Adversarial Flow Matching Optimization

Sang-Hoon Lee, Ha-Yeong Choi, Seong-Whan Lee

This paper introduces PeriodWave-Turbo, a high-fidelity and high-efficient waveform generation model via adversarial flow matching optimization. Recently, conditional flow matching (CFM) generative models have been successfully adopted for waveform generation tasks, leveraging a single vector field estimation objective for training. Although these models can generate high-fidelity waveform signals, they require significantly more ODE steps compared to GAN-based models, which only need a single generation step. Additionally, the generated samples often lack high-frequency information due to noisy vector field estimation, which fails to ensure high-frequency reproduction. To address this limitation, we enhance pre-trained CFM-based generative models by incorporating a fixed-step generator modification. We utilized reconstruction losses and adversarial feedback to accelerate high-fidelity waveform generation. Through adversarial flow matching optimization, it only requires 1,000 steps of fine-tuning to achieve state-of-the-art performance across various objective metrics. Moreover, we significantly reduce inference speed from 16 steps to 2 or 4 steps. Additionally, by scaling up the backbone of PeriodWave from 29M to 70M parameters for improved generalization, PeriodWave-Turbo achieves unprecedented performance, with a perceptual evaluation of speech quality (PESQ) score of 4.454 on the LibriTTS dataset. Audio samples, source code and checkpoints will be available at https://github.com/sh-lee-prml/PeriodWave.

8/16/2024

RFWave: Multi-band Rectified Flow for Audio Waveform Reconstruction

Peng Liu, Dongyang Dai, Zhiyong Wu

Recent advancements in generative modeling have significantly enhanced the reconstruction of audio waveforms from various representations. While diffusion models are adept at this task, they are hindered by latency issues due to their operation at the individual sample point level and the need for numerous sampling steps. In this study, we introduce RFWave, a cutting-edge multi-band Rectified Flow approach designed to reconstruct high-fidelity audio waveforms from Mel-spectrograms or discrete tokens. RFWave uniquely generates complex spectrograms and operates at the frame level, processing all subbands simultaneously to boost efficiency. Leveraging Rectified Flow, which targets a flat transport trajectory, RFWave achieves reconstruction with just 10 sampling steps. Our empirical evaluations show that RFWave not only provides outstanding reconstruction quality but also offers vastly superior computational efficiency, enabling audio generation at speeds up to 97 times faster than real-time on a GPU. An online demonstration is available at: https://rfwave-demo.github.io/rfwave/.

6/4/2024

Period Singer: Integrating Periodic and Aperiodic Variational Autoencoders for Natural-Sounding End-to-End Singing Voice Synthesis

Taewoo Kim, Choongsang Cho, Young Han Lee

In this paper, we present Period Singer, a novel end-to-end singing voice synthesis (SVS) model that utilizes variational inference for periodic and aperiodic components, aimed at producing natural-sounding waveforms. Recent end-to-end SVS models have demonstrated the capability of synthesizing high-fidelity singing voices. However, owing to deterministic pitch conditioning, they do not fully address the one-to-many problem. To address this problem, we present the Period Singer architecture, which integrates variational autoencoders for the periodic and aperiodic components. Additionally, our methodology eliminates the dependency on an external aligner by estimating the phoneme alignment through a monotonic alignment search within note boundaries. Our empirical evaluations show that Period Singer outperforms existing end-to-end SVS models on Mandarin and Korean datasets. The efficacy of the proposed method was further corroborated by ablation studies.

9/12/2024