Vocos: Closing the gap between time-domain and Fourier-based neural vocoders for high-quality audio synthesis

Read original: arXiv:2306.00814 - Published 5/30/2024 by Hubert Siuzdak


Overview

Recent advancements in neural vocoding have been driven by Generative Adversarial Networks (GANs) operating in the time domain.
This approach neglects the inductive bias offered by time-frequency representations, leading to redundant and computationally intensive upsampling operations.
Fourier-based time-frequency representation is an appealing alternative, aligning more accurately with human auditory perception and benefiting from fast algorithms.
Direct reconstruction of complex-valued spectrograms has been historically problematic due to phase recovery issues.
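To make the upsampling point concrete, the sketch below counts how long the internal sequence becomes at each stage of a time-domain generator versus a model that stays at the spectrogram frame rate. The hop length, frame count, and upsampling factors are hypothetical values chosen for illustration; they are not taken from the paper.

```python
# Hypothetical settings chosen for illustration only.
hop_length = 256                 # audio samples per spectrogram frame
num_frames = 100                 # length of the input feature sequence
upsample_factors = [8, 8, 2, 2]  # assumed stages; their product equals hop_length

def time_domain_resolutions(num_frames, factors):
    """Sequence length after each learned upsampling stage."""
    lengths = [num_frames]
    for factor in factors:
        lengths.append(lengths[-1] * factor)
    return lengths

time_domain = time_domain_resolutions(num_frames, upsample_factors)
# A Fourier-based model keeps every layer at the frame rate and leaves the
# expansion to audio samples to a single inverse transform at the end.
fourier_based = [num_frames] * len(time_domain)

print(time_domain)    # [100, 800, 6400, 12800, 25600]
print(fourier_based)  # [100, 100, 100, 100, 100]
```

Under these assumptions, the later layers of the time-domain generator process sequences up to 256 times longer than the input, which is where most of the extra computation would go.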

Plain English Explanation

The study presents a new model called Vocos that directly generates Fourier spectral coefficients, overcoming the challenges of previous time-domain neural vocoding approaches. Vocos not only matches the state-of-the-art in audio quality, but it also substantially improves computational efficiency, achieving an order of magnitude increase in speed compared to prevailing time-domain neural vocoding methods.

Time-frequency representations, like spectrograms, offer an intuitive way to understand audio signals. They align more closely with how humans perceive sound, and there are well-established fast algorithms for computing them. However, directly reconstructing the complex-valued spectrograms has been a challenge in the past, mainly due to issues with recovering the phase information.
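The role of phase can be seen in a small experiment. The sketch below, which assumes NumPy and uses an arbitrary random frame as a stand-in for audio, rebuilds a signal from its Fourier magnitudes twice: once with the true phase and once with the phase set to zero. It is an illustration of the general issue, not of any method from the paper.

```python
import numpy as np

N = 64                                    # arbitrary frame length for the demo
rng = np.random.default_rng(0)
x = rng.standard_normal(N)                # stand-in for one frame of audio

X = np.fft.rfft(x)                        # complex spectrum of the frame
magnitude, phase = np.abs(X), np.angle(X)

# Magnitude plus the true phase recovers the original frame.
x_true = np.fft.irfft(magnitude * np.exp(1j * phase), n=N)

# The same magnitudes with zero phase give a different waveform.
x_zero = np.fft.irfft(magnitude.astype(complex), n=N)

print(np.allclose(x_true, x))             # True
print(np.allclose(x_zero, x))             # False
```

Both reconstructions share exactly the same magnitude spectrum, so a model that only predicts magnitudes leaves the waveform underdetermined until the phase is supplied.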

Vocos addresses this by generating the Fourier spectral coefficients directly, rather than working in the time domain like previous neural vocoding approaches. This approach is more efficient while matching the audio quality of the best time-domain models. The researchers have open-sourced the code and model weights for Vocos, making them available for others to use and build upon.

Technical Explanation

The study proposes a new neural vocoding model called Vocos that directly generates Fourier spectral coefficients, overcoming the limitations of previous time-domain approaches. The key technical elements are:

Time-Frequency Representation: Vocos uses a Fourier-based time-frequency representation, which aligns more accurately with human auditory perception and benefits from well-established fast algorithms for its computation. This is in contrast to the time-domain approach used by many previous neural vocoding models.
Direct Spectrogram Generation: Vocos directly generates the complex-valued Fourier spectral coefficients, which an inverse Fourier transform then converts to a waveform, so no separate phase recovery step is needed.
Architecture and Training: Vocos uses a convolutional neural network architecture and is trained with a combination of a spectral reconstruction loss and adversarial training.
Computational Efficiency: By working directly with the Fourier spectral coefficients, Vocos is substantially more computationally efficient, running about an order of magnitude faster than prevailing time-domain neural vocoding approaches.
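The idea of generating spectral coefficients and inverting them can be sketched as follows. This is a minimal NumPy illustration, not the released implementation: the function names, frame size, and hop size are hypothetical, and the log-magnitude and phase parametrization is an assumption about how a network's outputs might be turned into complex coefficients. Here the "network outputs" are simply the true values taken from a test signal.

```python
import numpy as np

N_FFT, HOP = 64, 16                       # hypothetical frame and hop sizes
WINDOW = np.hanning(N_FFT + 1)[:-1]       # periodic Hann window

def stft(x):
    """Windowed frames -> complex spectrogram, shape (frames, N_FFT // 2 + 1)."""
    starts = range(0, len(x) - N_FFT + 1, HOP)
    frames = np.array([x[s:s + N_FFT] * WINDOW for s in starts])
    return np.fft.rfft(frames, axis=-1)

def istft_from_mag_phase(log_mag, phase):
    """Build complex coefficients from (log-magnitude, phase), then overlap-add."""
    # cos/sin maps any real-valued phase onto the unit circle, so the
    # phase input never has to be wrapped into (-pi, pi] beforehand.
    spec = np.exp(log_mag) * (np.cos(phase) + 1j * np.sin(phase))
    frames = np.fft.irfft(spec, n=N_FFT, axis=-1) * WINDOW
    length = HOP * (len(frames) - 1) + N_FFT
    audio, norm = np.zeros(length), np.zeros(length)
    for k, frame in enumerate(frames):
        audio[k * HOP:k * HOP + N_FFT] += frame
        norm[k * HOP:k * HOP + N_FFT] += WINDOW ** 2
    return audio / np.maximum(norm, 1e-8)  # undo the squared-window weighting

# Stand-in for network outputs: the true log-magnitude and phase of a signal.
rng = np.random.default_rng(0)
x = np.sin(2 * np.pi * 0.05 * np.arange(400)) + 0.1 * rng.standard_normal(400)
S = stft(x)
y = istft_from_mag_phase(np.log(np.abs(S)), np.angle(S) + 2 * np.pi)

print(np.allclose(y[N_FFT:-N_FFT], x[N_FFT:-N_FFT], atol=1e-6))
```

The phase is deliberately shifted by a full turn before inversion to show that an unwrapped phase value yields the same coefficients. All of the heavy lifting happens at the frame rate, with a single inverse transform and overlap-add producing the waveform.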

The researchers evaluate Vocos on standard benchmarks and demonstrate that it matches the state-of-the-art in audio quality, while also providing significant improvements in computational efficiency. This work builds on previous research exploring the role of time-frequency representations in audio processing tasks.

Critical Analysis

The researchers acknowledge that while Vocos provides a compelling alternative to time-domain neural vocoding, there are still some limitations and areas for further exploration:

Phase Recovery: Although Vocos avoids the phase recovery issues of previous time-frequency approaches, the researchers note that there is still room for improvement in accurately modeling the phase information.
Generalization: The evaluation of Vocos is primarily focused on standard benchmark datasets. Further research is needed to assess its performance and robustness in more diverse real-world settings, as highlighted in work on generalizing audio fake detection networks.
Interpretability: Like many deep learning models, the inner workings of Vocos may not be easily interpretable. Efforts towards end-to-end interpretable convolutional neural networks could help provide more insights into the model's behavior.
Multimodal Interactions: The study focuses solely on the audio domain, but there may be opportunities to explore multimodal interactions that could further enhance the performance and capabilities of neural vocoding models.

Overall, the Vocos model represents a significant advance in neural vocoding, demonstrating the potential of Fourier-based time-frequency representations to improve computational efficiency and audio quality. The open-sourcing of the model and code is a commendable step that will enable further research and development in this area.

Conclusion

The study introduces Vocos, a novel neural vocoding model that generates Fourier spectral coefficients directly, addressing the limitations of previous time-domain approaches. Vocos not only matches the state-of-the-art in audio quality but also substantially improves computational efficiency, achieving an order of magnitude increase in speed.

By leveraging the benefits of time-frequency representations and bypassing the phase recovery issues that have historically plagued this approach, Vocos represents a significant advancement in the field of neural vocoding. The open-sourcing of the model and code will enable further research and development, potentially leading to even more efficient and high-quality audio generation techniques in the future.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!


Related Papers


Vocos: Closing the gap between time-domain and Fourier-based neural vocoders for high-quality audio synthesis

Hubert Siuzdak

Recent advancements in neural vocoding are predominantly driven by Generative Adversarial Networks (GANs) operating in the time-domain. While effective, this approach neglects the inductive bias offered by time-frequency representations, resulting in redundant and computationally intensive upsampling operations. Fourier-based time-frequency representation is an appealing alternative, aligning more accurately with human auditory perception, and benefitting from well-established fast algorithms for its computation. Nevertheless, direct reconstruction of complex-valued spectrograms has been historically problematic, primarily due to phase recovery issues. This study seeks to close this gap by presenting Vocos, a new model that directly generates Fourier spectral coefficients. Vocos not only matches the state-of-the-art in audio quality, as demonstrated in our evaluations, but it also substantially improves computational efficiency, achieving an order of magnitude increase in speed compared to prevailing time-domain neural vocoding approaches. The source code and model weights have been open-sourced at https://github.com/gemelo-ai/vocos.

5/30/2024

FreeV: Free Lunch For Vocoders Through Pseudo Inversed Mel Filter

Yuanjun Lv, Hai Li, Ying Yan, Junhui Liu, Danming Xie, Lei Xie

Vocoders reconstruct speech waveforms from acoustic features and play a pivotal role in modern TTS systems. Frequency-domain GAN vocoders like Vocos and APNet2 have recently seen rapid advancements, outperforming time-domain models in inference speed while achieving comparable audio quality. However, these frequency-domain vocoders suffer from large parameter sizes, thus introducing extra memory burden. Inspired by PriorGrad and SpecGrad, we employ a pseudo-inverse to roughly estimate the amplitude spectrum as the initialization. This simple initialization significantly reduces the parameter demand of the vocoder. Based on APNet2 and our streamlined amplitude prediction branch, we propose FreeV. Compared with its counterpart APNet2, FreeV achieves a 1.8 times inference speed improvement with nearly half the parameters. Meanwhile, FreeV outperforms APNet2 in resynthesis quality, marking a step forward in pursuing real-time, high-fidelity speech synthesis. Code and checkpoints are available at: https://github.com/BakerBunker/FreeV

6/13/2024

MusicHiFi: Fast High-Fidelity Stereo Vocoding

Ge Zhu, Juan-Pablo Caceres, Zhiyao Duan, Nicholas J. Bryan

Diffusion-based audio and music generation models commonly perform generation by constructing an image representation of audio (e.g., a mel-spectrogram) and then converting it to audio using a phase reconstruction model or vocoder. Typical vocoders, however, produce monophonic audio at lower resolutions (e.g., 16-24 kHz), which limits their usefulness. We propose MusicHiFi, an efficient high-fidelity stereophonic vocoder. Our method employs a cascade of three generative adversarial networks (GANs) that convert low-resolution mel-spectrograms to audio, upsample to high-resolution audio via bandwidth extension, and upmix to stereophonic audio. Compared to past work, we propose 1) a unified GAN-based generator and discriminator architecture and training procedure for each stage of our cascade, 2) a new fast, near downsampling-compatible bandwidth extension module, and 3) a new fast downmix-compatible mono-to-stereo upmixer that ensures the preservation of monophonic content in the output. We evaluate our approach using objective and subjective listening tests and find our approach yields comparable or better audio quality, better spatialization control, and significantly faster inference speed compared to past work. Sound examples are at https://MusicHiFi.github.io/web/.

7/10/2024

An Investigation of Time-Frequency Representation Discriminators for High-Fidelity Vocoder

Yicheng Gu, Xueyao Zhang, Liumeng Xue, Haizhou Li, Zhizheng Wu

Generative Adversarial Network (GAN) based vocoders are superior in both inference speed and synthesis quality when reconstructing an audible waveform from an acoustic representation. This study focuses on improving the discriminator for GAN-based vocoders. Most existing Time-Frequency Representation (TFR)-based discriminators are rooted in the Short-Time Fourier Transform (STFT), which has a constant Time-Frequency (TF) resolution, linearly scaled center frequencies, and a fixed decomposition basis, making it incompatible with signals like singing voices that require dynamic attention for different frequency bands and different time intervals. Motivated by that, we propose a Multi-Scale Sub-Band Constant-Q Transform CQT (MS-SB-CQT) discriminator and a Multi-Scale Temporal-Compressed Continuous Wavelet Transform CWT (MS-TC-CWT) discriminator. Both CQT and CWT have a dynamic TF resolution for different frequency bands. Between them, CQT is better at modeling pitch information, while CWT is better at modeling short-time transients. Experiments conducted on both speech and singing voices confirm the effectiveness of our proposed discriminators. Moreover, the STFT, CQT, and CWT-based discriminators can be used jointly for better performance. The proposed discriminators can boost the synthesis quality of various state-of-the-art GAN-based vocoders, including HiFi-GAN, BigVGAN, and APNet.

4/29/2024