Very Low Complexity Speech Synthesis Using Framewise Autoregressive GAN (FARGAN) with Pitch Prediction

Read original: arXiv:2405.21069 - Published 8/6/2024 by Jean-Marc Valin, Ahmed Mustafa, Jan Buthe

Overview

This paper presents a new approach called Framewise Autoregressive GAN (FARGAN) for very low complexity speech synthesis.
FARGAN uses a generative adversarial network (GAN) architecture to generate speech waveforms frame-by-frame, with a pitch prediction model to improve the quality.
The goal is to create a speech synthesis system that is computationally efficient and can run on resource-constrained devices.

Plain English Explanation

FARGAN is a new way to generate human-like speech using artificial intelligence. Instead of generating the entire speech waveform at once, FARGAN creates it one small piece (frame) at a time. This makes it much more efficient and able to run on devices with limited computing power, like smartphones.

To make the synthetic speech sound more natural, FARGAN also includes a pitch prediction model. Pitch is an important part of how we perceive speech, so predicting it accurately helps the generated audio sound more lifelike.

The key idea behind FARGAN is to use a type of AI model called a generative adversarial network (GAN) to generate the speech frame-by-frame. GANs work by having two neural networks compete against each other - one generates the speech, while the other tries to detect if it's real or fake. This competition pushes the generator to create more and more realistic speech over time.

By breaking speech synthesis into small steps and adding pitch prediction, the researchers created a system lightweight enough that it could potentially run on phones or smart speakers. That would make it practical to put high-quality speech interfaces on devices that cannot run larger models.

Technical Explanation

The FARGAN architecture consists of a generator network that produces speech frames and a discriminator network that tries to distinguish real speech from the generator's output. The generator is autoregressive: each new frame is conditioned on the output it has already generated. According to the paper's abstract, synthesis proceeds in small subframes and the model does not need teacher forcing.
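
The framewise autoregressive loop can be sketched in a few lines of Python. This illustrates only the control flow, not the paper's network: `toy_generator`, `FRAME_SIZE`, and `CONTEXT_SIZE` are hypothetical stand-ins, and the real generator would be a trained neural network rather than the fixed formula used here.

```python
FRAME_SIZE = 4    # samples produced per step (hypothetical, tiny for illustration)
CONTEXT_SIZE = 8  # past samples fed back to the generator (hypothetical)


def toy_generator(context, conditioning):
    """Stand-in for the neural generator: returns FRAME_SIZE new samples.

    Each new sample is half of a recent past sample plus the conditioning
    value. A real model would replace this with a learned function.
    """
    recent = context[-FRAME_SIZE:]
    return [0.5 * past + conditioning for past in recent]


def synthesize(conditioning_per_frame):
    """Generate a waveform one frame at a time, feeding output back in."""
    history = [0.0] * CONTEXT_SIZE  # start from silence
    output = []
    for conditioning in conditioning_per_frame:
        context = history[-CONTEXT_SIZE:]
        frame = toy_generator(context, conditioning)
        output.extend(frame)
        history.extend(frame)  # autoregression: the next step sees this frame
    return output


waveform = synthesize([1.0, 0.0])
print(waveform)
```

In this toy run the second frame is nonzero even though its conditioning value is zero, because it is computed from the first frame's output. That dependence on previously generated samples is what "autoregressive" means here.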

To improve the quality of the generated speech, FARGAN takes advantage of what the abstract calls long-term pitch prediction. Voiced speech is nearly periodic, so the signal one pitch period earlier is a good predictor of the current signal. Using this prediction means the generator has less to model from scratch, which helps it produce natural-sounding speech at low complexity.
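
In speech coding, "long-term prediction" usually means predicting the current signal from the signal one pitch period earlier. The sketch below shows that classical idea in isolation, assuming the pitch period is already known. The function name and its parameters are hypothetical, and this summary does not describe how FARGAN combines such a prediction with its network.

```python
def long_term_prediction(history, pitch_period, frame_size, gain=1.0):
    """Predict the next frame by repeating the signal from one pitch period ago.

    history:      past samples, oldest first
    pitch_period: pitch lag in samples (assumed to be known already)
    frame_size:   number of samples to predict
    gain:         scaling applied to the repeated samples
    """
    if pitch_period < 1 or pitch_period > len(history):
        raise ValueError("pitch_period must be between 1 and len(history)")
    extended = list(history)
    prediction = []
    for _ in range(frame_size):
        sample = gain * extended[-pitch_period]
        prediction.append(sample)
        # Appending the prediction lets frames longer than one period wrap around.
        extended.append(sample)
    return prediction


# A signal that repeats every 3 samples is predicted exactly.
periodic = [1, 2, 3, 1, 2, 3]
print(long_term_prediction(periodic, pitch_period=3, frame_size=4))
```

For a perfectly periodic input the prediction is exact. Real speech is only approximately periodic, so in practice a model still has to account for the difference between the prediction and the true signal.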

The key technical contributions of this work include:

A framewise autoregressive GAN approach to speech synthesis, which breaks down the task into smaller, more efficient steps.
A pitch prediction model that enhances the generated speech by providing accurate pitch information.
Extensive experiments demonstrating the high quality and low complexity of the FARGAN system, compared to previous state-of-the-art speech synthesizers.

The researchers report that the 600 MFLOPS FARGAN vocoder achieves both higher quality and lower complexity than existing low-complexity vocoders, and that its quality matches that of existing higher-complexity vocoders. This suggests that FARGAN could be a promising approach for deploying high-quality speech synthesis on resource-constrained devices.
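
To put the complexity in perspective, the 600 MFLOPS figure quoted in the paper's abstract can be converted into a per-sample budget. The short calculation below assumes a 16 kHz sample rate and a 10 ms frame. Both values are assumptions for illustration and are not stated in this summary.

```python
FARGAN_FLOPS_PER_SECOND = 600e6   # from the paper's abstract
ASSUMED_SAMPLE_RATE_HZ = 16_000   # assumption, not stated in this summary
ASSUMED_FRAME_MS = 10             # hypothetical frame duration


def flops_per_sample(flops_per_second, sample_rate_hz):
    """Average number of operations available for each output sample."""
    return flops_per_second / sample_rate_hz


def flops_per_frame(flops_per_second, frame_ms):
    """Average number of operations available for each frame of audio."""
    return flops_per_second * frame_ms / 1000


per_sample = flops_per_sample(FARGAN_FLOPS_PER_SECOND, ASSUMED_SAMPLE_RATE_HZ)
per_frame = flops_per_frame(FARGAN_FLOPS_PER_SECOND, ASSUMED_FRAME_MS)
print(f"{per_sample:,.0f} FLOPs per sample, {per_frame:,.0f} FLOPs per frame")
```

Under these assumptions the budget works out to 37,500 operations per output sample. A different sample rate or frame length changes the numbers proportionally.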

Critical Analysis

The paper provides a thorough evaluation of the FARGAN system, demonstrating its effectiveness in generating high-quality speech while maintaining low computational complexity. However, some potential limitations and areas for future research are worth considering:

The pitch prediction model, while helpful, may still have room for improvement. Investigating more advanced pitch estimation techniques could potentially further enhance the naturalness of the generated speech.
The paper focuses on general speech synthesis, but the techniques could potentially be extended to other audio domains, such as singing voice synthesis or music generation. Exploring these applications could broaden the impact of the FARGAN approach.
The paper does not delve into the subjective experience of the generated speech, such as its emotional expressiveness or suitability for different use cases. Further user studies and qualitative evaluations could provide valuable insights into the real-world implications of this technology.

Overall, the FARGAN system presents an innovative and promising approach to efficient speech synthesis, with potential for further refinement and expansion to other audio domains. As with any emerging technology, it will be important to consider the ethical implications and responsible deployment of such systems.

Conclusion

The FARGAN paper introduces a novel speech synthesis framework that achieves high-quality, low-complexity speech generation. By using a framewise autoregressive GAN architecture and incorporating a pitch prediction model, the researchers have created a system that can run efficiently on resource-constrained devices while maintaining compelling speech quality.

This work represents an important step forward in the development of speech synthesis technology, opening up new possibilities for deploying natural-sounding speech interfaces in a wide range of applications, from virtual assistants to accessibility tools. As the field of AI-generated audio continues to evolve, the FARGAN approach serves as an example of how thoughtful system design and technical innovation can help bridge the gap between research and real-world deployment.

Related Papers

Very Low Complexity Speech Synthesis Using Framewise Autoregressive GAN (FARGAN) with Pitch Prediction

Jean-Marc Valin, Ahmed Mustafa, Jan Buthe

Neural vocoders are now being used in a wide range of speech processing applications. In many of those applications, the vocoder can be the most complex component, so finding lower complexity algorithms can lead to significant practical benefits. In this work, we propose FARGAN, an autoregressive vocoder that takes advantage of long-term pitch prediction to synthesize high-quality speech in small subframes, without the need for teacher-forcing. Experimental results show that the proposed 600 MFLOPS FARGAN vocoder can achieve both higher quality and lower complexity than existing low-complexity vocoders. The quality even matches that of existing higher-complexity vocoders.

8/6/2024

Parallel Synthesis for Autoregressive Speech Generation

Po-chun Hsu, Da-rong Liu, Andy T. Liu, Hung-yi Lee

Autoregressive neural vocoders have achieved outstanding performance in speech synthesis tasks such as text-to-speech and voice conversion. An autoregressive vocoder predicts a sample at some time step conditioned on those at previous time steps. Though it synthesizes natural human speech, the iterative generation inevitably makes the synthesis time proportional to the utterance length, leading to low efficiency. Many works were dedicated to generating the whole speech sequence in parallel and proposed GAN-based, flow-based, and score-based vocoders. This paper proposed a new thought for the autoregressive generation. Instead of iteratively predicting samples in a time sequence, the proposed model performs frequency-wise autoregressive generation (FAR) and bit-wise autoregressive generation (BAR) to synthesize speech. In FAR, a speech utterance is split into frequency subbands, and a subband is generated conditioned on the previously generated one. Similarly, in BAR, an 8-bit quantized signal is generated iteratively from the first bit. By redesigning the autoregressive method to compute in domains other than the time domain, the number of iterations in the proposed model is no longer proportional to the utterance length but to the number of subbands/bits, significantly increasing inference efficiency. Besides, a post-filter is employed to sample signals from output posteriors; its training objective is designed based on the characteristics of the proposed methods. Experimental results show that the proposed model can synthesize speech faster than real-time without GPU acceleration. Compared with baseline vocoders, the proposed model achieves better MUSHRA results and shows good generalization ability for unseen speakers and 44 kHz speech.

6/6/2024

FA-GAN: Artifacts-free and Phase-aware High-fidelity GAN-based Vocoder

Rubing Shen, Yanzhen Ren, Zongkun Sun

Generative adversarial network (GAN) based vocoders have achieved significant attention in speech synthesis with high quality and fast inference speed. However, there still exist many noticeable spectral artifacts, resulting in the quality decline of synthesized speech. In this work, we adopt a novel GAN-based vocoder designed for few artifacts and high fidelity, called FA-GAN. To suppress the aliasing artifacts caused by non-ideal upsampling layers in high-frequency components, we introduce the anti-aliased twin deconvolution module in the generator. To alleviate blurring artifacts and enrich the reconstruction of spectral details, we propose a novel fine-grained multi-resolution real and imaginary loss to assist in the modeling of phase information. Experimental results reveal that FA-GAN outperforms the compared approaches in promoting audio quality and alleviating spectral artifacts, and exhibits superior performance when applied to unseen speaker scenarios.

7/8/2024

JenGAN: Stacked Shifted Filters in GAN-Based Speech Synthesis

Hyunjae Cho, Junhyeok Lee, Wonbin Jung

Non-autoregressive GAN-based neural vocoders are widely used due to their fast inference speed and high perceptual quality. However, they often suffer from audible artifacts such as tonal artifacts in their generated results. Therefore, we propose JenGAN, a new training strategy that involves stacking shifted low-pass filters to ensure the shift-equivariant property. This method helps prevent aliasing and reduce artifacts while preserving the model structure used during inference. In our experimental evaluation, JenGAN consistently enhances the performance of vocoder models, yielding significantly superior scores across the majority of evaluation metrics.

6/11/2024