FA-GAN: Artifacts-free and Phase-aware High-fidelity GAN-based Vocoder

Read original: arXiv:2407.04575 - Published 7/8/2024 by Rubing Shen, Yanzhen Ren, Zongkun Sun

Overview

FA-GAN is a Generative Adversarial Network (GAN) based vocoder designed to generate high-fidelity speech audio with few spectral artifacts.
Its two key contributions are an anti-aliased twin deconvolution module in the generator, which suppresses aliasing artifacts caused by non-ideal upsampling layers, and a fine-grained multi-resolution real and imaginary loss, which assists the modeling of phase information.
Experiments show FA-GAN outperforms the compared vocoders in audio quality and artifact suppression, including on unseen speakers.

Plain English Explanation

The FA-GAN paper presents a new approach to generating realistic-sounding speech audio using a type of AI model called a Generative Adversarial Network (GAN). Generating high-quality synthetic speech has historically been challenging due to issues like "artifacts" - unwanted distortions or noises in the audio output.

FA-GAN targets two sources of these artifacts. The first is aliasing: the upsampling layers that stretch a compact spectrogram into a full waveform are not ideal, and they can introduce spurious high-frequency content. FA-GAN adds an anti-aliased twin deconvolution module to the generator to suppress it. The second is blurring, where fine spectral detail is smeared. To counter this, the researchers train the model with a loss computed on the real and imaginary parts of the spectrum at multiple resolutions, which helps the model capture phase information, a component of the signal that magnitude-only comparisons ignore.

Through experiments, the authors show that FA-GAN outperforms the vocoders it was compared against in audio quality and in reducing spectral artifacts, and that it holds up on speakers not seen during training.

Technical Explanation

FA-GAN follows the usual GAN vocoder setup: a generator network produces the speech waveform, and a discriminator network tries to distinguish the generated audio from real speech samples. According to the abstract, the first key innovation sits in the generator: an anti-aliased twin deconvolution module that suppresses the aliasing artifacts that non-ideal upsampling layers cause in high-frequency components.
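The paper's abstract traces aliasing artifacts to non-ideal upsampling layers in the generator. The internals of the anti-aliased twin deconvolution module are not described in this summary, so the following is only a generic signal-processing sketch of the underlying problem, not the authors' module. It upsamples a tone by zero insertion, which creates a spectral image at a high frequency, and then shows that a low-pass filter removes that image. All function names, the filter length, and the test frequencies are illustrative assumptions.

```python
import math

def zero_stuff(x, factor=2):
    """Naive upsampling: insert factor-1 zeros between consecutive samples."""
    y = [0.0] * (len(x) * factor)
    for i, v in enumerate(x):
        y[i * factor] = v * factor  # gain compensates for the inserted zeros
    return y

def lowpass_kernel(cutoff, taps=63):
    """Hann-windowed sinc low-pass filter; cutoff is in cycles per sample."""
    mid = (taps - 1) / 2
    h = []
    for n in range(taps):
        t = n - mid
        sinc = 2 * cutoff if t == 0 else math.sin(2 * math.pi * cutoff * t) / (math.pi * t)
        window = 0.5 - 0.5 * math.cos(2 * math.pi * n / (taps - 1))
        h.append(sinc * window)
    total = sum(h)
    return [v / total for v in h]  # normalize to unity gain at DC

def convolve_same(x, h):
    """Zero-padded FIR filtering with the output aligned to the input."""
    half = len(h) // 2
    out = []
    for n in range(len(x)):
        acc = 0.0
        for k, hk in enumerate(h):
            i = n + half - k
            if 0 <= i < len(x):
                acc += hk * x[i]
        out.append(acc)
    return out

def tone_magnitude(x, freq):
    """Length-normalized DFT magnitude of x at one frequency (cycles per sample)."""
    re = sum(v * math.cos(2 * math.pi * freq * n) for n, v in enumerate(x))
    im = sum(v * math.sin(2 * math.pi * freq * n) for n, v in enumerate(x))
    return math.hypot(re, im) / len(x)

# A tone at 0.2 cycles/sample becomes 0.1 cycles/sample after 2x upsampling;
# zero insertion also creates an unwanted image at 0.5 - 0.1 = 0.4 cycles/sample.
low_rate = [math.sin(2 * math.pi * 0.2 * n) for n in range(200)]
naive = zero_stuff(low_rate)
filtered = convolve_same(naive, lowpass_kernel(cutoff=0.25))

# Measure on the interior only, so filter edge transients are excluded.
image_before = tone_magnitude(naive[50:350], 0.4)
image_after = tone_magnitude(filtered[50:350], 0.4)
tone_after = tone_magnitude(filtered[50:350], 0.1)
print(f"image before filtering: {image_before:.4f}")
print(f"image after filtering:  {image_after:.4f}")
print(f"wanted tone after:      {tone_after:.4f}")
```

In a neural vocoder the upsampling is learned rather than fixed, but the same principle applies: content above the original band that is created by upsampling has to be filtered out, or it shows up as high-frequency artifacts.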

The second innovation is in the training objective. To alleviate blurring artifacts and enrich the reconstruction of spectral details, the authors propose a fine-grained multi-resolution real and imaginary loss, which operates on the real and imaginary parts of the spectrum rather than on its magnitude alone and thereby assists the modeling of phase information.
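The abstract names a fine-grained multi-resolution real and imaginary loss as the mechanism that assists phase modeling, but this summary does not give its exact formulation. As a rough sketch under stated assumptions, the code below uses an L1 distance on the real and imaginary parts of a Hann-windowed STFT, averaged over three hypothetical resolutions, and contrasts it with a magnitude-only loss. The function names, the `RESOLUTIONS` values, and the test signals are all illustrative, not taken from the paper.

```python
import cmath
import math

def stft(x, n_fft, hop):
    """Hann-windowed STFT computed by direct DFT; returns frames of complex bins."""
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / n_fft) for n in range(n_fft)]
    twiddle = [cmath.exp(-2j * math.pi * k / n_fft) for k in range(n_fft)]
    frames = []
    for start in range(0, len(x) - n_fft + 1, hop):
        seg = [x[start + n] * window[n] for n in range(n_fft)]
        frames.append([
            sum(seg[n] * twiddle[(k * n) % n_fft] for n in range(n_fft))
            for k in range(n_fft // 2 + 1)
        ])
    return frames

def bin_pairs(ref, gen, n_fft, hop):
    """Pair up corresponding STFT bins of the reference and generated signals."""
    return [
        (r, g)
        for frame_r, frame_g in zip(stft(ref, n_fft, hop), stft(gen, n_fft, hop))
        for r, g in zip(frame_r, frame_g)
    ]

def magnitude_l1(ref, gen, n_fft, hop):
    """Mean absolute difference of STFT magnitudes (phase is discarded)."""
    pairs = bin_pairs(ref, gen, n_fft, hop)
    return sum(abs(abs(r) - abs(g)) for r, g in pairs) / len(pairs)

def real_imag_l1(ref, gen, n_fft, hop):
    """Mean absolute difference of real and imaginary STFT parts (phase-sensitive)."""
    pairs = bin_pairs(ref, gen, n_fft, hop)
    return sum(abs(r.real - g.real) + abs(r.imag - g.imag) for r, g in pairs) / len(pairs)

def multi_resolution(loss_fn, ref, gen, resolutions):
    """Average a spectral loss over several (n_fft, hop) settings."""
    return sum(loss_fn(ref, gen, n_fft, hop) for n_fft, hop in resolutions) / len(resolutions)

RESOLUTIONS = [(32, 8), (64, 16), (128, 32)]  # hypothetical (n_fft, hop) pairs

# Two signals with the same frequency content; only the phase of the second
# component differs, which changes the waveform shape.
reference = [math.sin(2 * math.pi * 0.125 * t) + 0.5 * math.sin(2 * math.pi * 0.28125 * t)
             for t in range(256)]
shifted = [math.sin(2 * math.pi * 0.125 * t) + 0.5 * math.sin(2 * math.pi * 0.28125 * t + math.pi / 2)
           for t in range(256)]

mag_loss = multi_resolution(magnitude_l1, reference, shifted, RESOLUTIONS)
ri_loss = multi_resolution(real_imag_l1, reference, shifted, RESOLUTIONS)
print(f"magnitude-only loss: {mag_loss:.6f}")
print(f"real/imaginary loss: {ri_loss:.6f}")
```

The magnitude-only loss cannot tell the two signals apart, while the real and imaginary loss penalizes the phase mismatch. That is the general motivation for supervising real and imaginary parts when phase accuracy matters.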

Experiments compare FA-GAN with other GAN-based vocoders on objective and subjective speech quality metrics. The abstract reports that FA-GAN outperforms the compared approaches in audio quality and in alleviating spectral artifacts, and that it performs better when applied to unseen speakers.

Critical Analysis

The FA-GAN paper makes a compelling case that suppressing upsampling aliasing and supervising phase-related information both matter for the quality of GAN-based speech synthesis. However, the abstract does not report the computational cost of the added module and loss, which is relevant for a class of vocoders valued for fast inference.

Additionally, the paper does not provide much detail on the specific architectural choices and hyperparameters used for FA-GAN. More transparency around these implementation details would allow for better reproducibility and further research building on this work.

The abstract reports superior performance on unseen speakers, which is encouraging. It would also be valuable to see evaluations on a broader range of speech data: assessing performance and robustness across different languages, accents, and speaking styles would help validate the generalizability of the approach.

Conclusion

The FA-GAN paper introduces a GAN-based vocoder that generates high-fidelity speech with few spectral artifacts by combining an anti-aliased twin deconvolution module in the generator with a multi-resolution real and imaginary loss that assists phase modeling. Together, these address two longstanding problems in GAN-based speech synthesis: aliasing from upsampling and blurred spectral detail.

While FA-GAN still has some limitations, this research represents an important step forward in the field of AI-generated speech. The insights and techniques developed here could help inspire further advancements that bring us closer to natural-sounding, human-like synthetic voices.

Related Papers

FA-GAN: Artifacts-free and Phase-aware High-fidelity GAN-based Vocoder

Rubing Shen, Yanzhen Ren, Zongkun Sun

Generative adversarial network (GAN) based vocoders have achieved significant attention in speech synthesis with high quality and fast inference speed. However, there still exist many noticeable spectral artifacts, resulting in the quality decline of synthesized speech. In this work, we adopt a novel GAN-based vocoder designed for few artifacts and high fidelity, called FA-GAN. To suppress the aliasing artifacts caused by non-ideal upsampling layers in high-frequency components, we introduce the anti-aliased twin deconvolution module in the generator. To alleviate blurring artifacts and enrich the reconstruction of spectral details, we propose a novel fine-grained multi-resolution real and imaginary loss to assist in the modeling of phase information. Experimental results reveal that FA-GAN outperforms the compared approaches in promoting audio quality and alleviating spectral artifacts, and exhibits superior performance when applied to unseen speaker scenarios.

7/8/2024

Very Low Complexity Speech Synthesis Using Framewise Autoregressive GAN (FARGAN) with Pitch Prediction

Jean-Marc Valin, Ahmed Mustafa, Jan Buthe

Neural vocoders are now being used in a wide range of speech processing applications. In many of those applications, the vocoder can be the most complex component, so finding lower complexity algorithms can lead to significant practical benefits. In this work, we propose FARGAN, an autoregressive vocoder that takes advantage of long-term pitch prediction to synthesize high-quality speech in small subframes, without the need for teacher-forcing. Experimental results show that the proposed 600 MFLOPS FARGAN vocoder can achieve both higher quality and lower complexity than existing low-complexity vocoders. The quality even matches that of existing higher-complexity vocoders.

8/6/2024

JenGAN: Stacked Shifted Filters in GAN-Based Speech Synthesis

Hyunjae Cho, Junhyeok Lee, Wonbin Jung

Non-autoregressive GAN-based neural vocoders are widely used due to their fast inference speed and high perceptual quality. However, they often suffer from audible artifacts such as tonal artifacts in their generated results. Therefore, we propose JenGAN, a new training strategy that involves stacking shifted low-pass filters to ensure the shift-equivariant property. This method helps prevent aliasing and reduce artifacts while preserving the model structure used during inference. In our experimental evaluation, JenGAN consistently enhances the performance of vocoder models, yielding significantly superior scores across the majority of evaluation metrics.

6/11/2024

VNet: A GAN-based Multi-Tier Discriminator Network for Speech Synthesis Vocoders

Yubing Cao, Yongming Li, Liejun Wang, Yinfeng Yu

Since the introduction of Generative Adversarial Networks (GANs) in speech synthesis, remarkable achievements have been attained. In a thorough exploration of vocoders, it has been discovered that audio waveforms can be generated at speeds exceeding real-time while maintaining high fidelity, achieved through the utilization of GAN-based models. Typically, the inputs to the vocoder consist of band-limited spectral information, which inevitably sacrifices high-frequency details. To address this, we adopt the full-band Mel spectrogram information as input, aiming to provide the vocoder with the most comprehensive information possible. However, previous studies have revealed that the use of full-band spectral information as input can result in the issue of over-smoothing, compromising the naturalness of the synthesized speech. To tackle this challenge, we propose VNet, a GAN-based neural vocoder network that incorporates full-band spectral information and introduces a Multi-Tier Discriminator (MTD) comprising multiple sub-discriminators to generate high-resolution signals. Additionally, we introduce an asymptotically constrained method that modifies the adversarial loss of the generator and discriminator, enhancing the stability of the training process. Through rigorous experiments, we demonstrate that the VNet model is capable of generating high-fidelity speech and significantly improving the performance of the vocoder.

8/14/2024