Parallel Synthesis for Autoregressive Speech Generation

Read original: arXiv:2204.11806 - Published 6/6/2024 by Po-chun Hsu, Da-rong Liu, Andy T. Liu, Hung-yi Lee


Overview

Autoregressive speech synthesis models can generate highly natural speech, but the iterative generation process makes them slow and inefficient.
To address this issue, the paper proposes two new autoregressive schemes: frequency-wise autoregressive generation (FAR) and bit-wise autoregressive generation (BAR).
The proposed model splits the speech signal into frequency subbands or quantized bits and generates them one subband or bit at a time. Within each step, all time samples can be computed in parallel, which speeds up synthesis.
The model also employs a post-filter to sample high-quality audio signals from the output distributions.

Plain English Explanation

The paper presents a novel approach to speech synthesis that aims to be more efficient than traditional autoregressive models. Autoregressive models, which generate speech one sample at a time, can produce very natural-sounding speech, but they are slow because the synthesis time is proportional to the length of the speech utterance.

To address this issue, the proposed model takes a different approach. Instead of generating the speech samples one by one in the time domain, the model first splits the speech signal into different frequency subbands. It then generates the subbands one after another, each conditioned on the previously generated subband. In this Frequency-wise Autoregressive (FAR) approach, the sequential loop runs over subbands rather than over time samples, so all time samples within a subband can be computed in parallel, making the synthesis process much faster.

The model also uses a Bit-wise Autoregressive (BAR) approach, where it generates an 8-bit quantized speech signal one bit at a time, starting from the first bit. Here too the loop runs over bits rather than over time samples.

Finally, the model employs a post-filter to generate high-quality audio signals from the output distributions, ensuring that the synthesized speech sounds natural and human-like.

The experimental results show that the proposed model can synthesize speech faster than real-time, without the need for GPU acceleration. It also achieves better results than the baseline vocoders in MUSHRA listening tests, while maintaining good generalization to unseen speakers and 44 kHz speech.

Technical Explanation

The paper presents a new approach to autoregressive speech synthesis, called Frequency-wise Autoregressive (FAR) and Bit-wise Autoregressive (BAR) Speech Synthesis. Unlike traditional autoregressive models, which generate speech samples one by one in the time domain, the proposed model generates speech in the frequency domain or the quantized bit domain.

In the FAR approach, the speech signal is first split into different frequency subbands. The model then generates the subbands one after another, each conditioned on the previously generated subband. The subbands are therefore not independent of each other; the parallelism comes from the time axis, since all time samples of a subband can be computed at once. The full-band speech is then reconstructed using the generated subbands and a synthesis filter bank.
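The FAR loop described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the FFT-bin masking in `split_subbands` is an assumed stand-in for a real analysis/synthesis filter bank, and `predict_band` is a hypothetical model call that returns every time sample of one subband at once.

```python
import numpy as np

def split_subbands(x, num_bands):
    """Toy analysis step: partition the rFFT bins into contiguous bands.

    Assumed stand-in for a real analysis filter bank. Each band is returned
    as a full-length time-domain signal (no downsampling).
    """
    spectrum = np.fft.rfft(x)
    edges = np.linspace(0, len(spectrum), num_bands + 1).astype(int)
    bands = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        masked = np.zeros_like(spectrum)
        masked[lo:hi] = spectrum[lo:hi]
        bands.append(np.fft.irfft(masked, n=len(x)))
    return np.stack(bands)  # shape: (num_bands, num_samples)

def merge_subbands(bands):
    """Toy synthesis step: the masked bands sum back to the full-band signal."""
    return bands.sum(axis=0)

def far_generate(predict_band, conditioning, num_bands, num_samples):
    """Frequency-wise autoregressive loop.

    predict_band(prev_band, conditioning, k) is a hypothetical model call.
    The loop is sequential over bands only; each call covers all time samples.
    """
    prev = np.zeros(num_samples)
    bands = []
    for k in range(num_bands):
        prev = predict_band(prev, conditioning, k)
        bands.append(prev)
    return merge_subbands(np.stack(bands)), len(bands)
```

With an oracle `predict_band` that simply returns the true subband `k`, `far_generate` reproduces the input waveform in `num_bands` sequential steps, whatever the utterance length.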

The BAR approach is similar, but instead of generating subbands, the model generates an 8-bit quantized signal one bit at a time, starting from the first bit. Again, the sequential loop runs over bits rather than time samples, so each bit can be computed for all time samples in parallel.
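A small sketch of the bit-wise loop, under stated assumptions: the "first bit" is taken to be the most significant bit, the way the 8-bit codes are obtained from the waveform is left out, and `predict_bit` is a hypothetical model call rather than anything defined in the paper.

```python
import numpy as np

NUM_BITS = 8  # the signal is quantized to 8 bits

def to_bit_planes(codes):
    """Split uint8 codes into 8 binary planes, most significant bit first."""
    codes = np.asarray(codes, dtype=np.uint8)
    shifts = np.arange(NUM_BITS - 1, -1, -1, dtype=np.uint8)
    return (codes[None, :] >> shifts[:, None]) & 1  # shape: (8, num_samples)

def bar_generate(predict_bit, num_samples):
    """Bit-wise autoregressive loop: 8 sequential steps for any length.

    predict_bit(partial_codes, b) is a hypothetical model call that returns
    bit b (0 or 1) for every sample at once, given the bits decided so far.
    """
    partial = np.zeros(num_samples, dtype=np.uint8)
    for b in range(NUM_BITS):
        bit = np.asarray(predict_bit(partial, b), dtype=np.uint8)
        partial = (partial << 1) | bit  # append the new bit to every sample
    return partial
```

Feeding the planes from `to_bit_planes` back through `bar_generate` with an oracle predictor recovers the original codes, which shows that the eight planes carry the full quantized signal.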

By redesigning the autoregressive method to work in the frequency or bit domains, rather than the time domain, the number of iterations required is no longer proportional to the length of the speech utterance, but rather to the number of subbands or bits. This significantly increases the inference efficiency of the model.
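To make the scaling argument concrete, the sketch below counts sequential steps for both styles of generation. The 22,050 Hz sample rate and the utterance durations are illustrative assumptions, not figures from the paper; the count of 8 corresponds to the 8 bits used in BAR.

```python
def time_domain_steps(duration_s, sample_rate_hz):
    """Sequential steps for sample-by-sample generation: one per audio sample."""
    return round(duration_s * sample_rate_hz)

def far_bar_steps(num_units):
    """Sequential steps for FAR or BAR: one per subband or per bit."""
    return num_units

# Compare step counts for utterances of different lengths.
for duration_s in (1, 3, 10):
    print(duration_s, time_domain_steps(duration_s, 22_050), far_bar_steps(8))
```

The step count is not the same as wall-clock time: each FAR or BAR step processes the whole utterance, so a single step costs more than one time-domain step. The gain comes from that work being parallelizable across time.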

The paper also introduces a post-filter that samples high-quality audio signals from the output distributions of the FAR and BAR models. The training objective of this post-filter is designed to take into account the characteristics of the proposed autoregressive methods.

The experimental results show that the proposed model is able to synthesize speech faster than real-time, without the need for GPU acceleration. Compared with the baseline vocoders, the proposed model achieves better MUSHRA results and demonstrates good generalization to unseen speakers and 44 kHz speech.

Critical Analysis

The paper presents a novel and interesting approach to improving the efficiency of autoregressive speech synthesis models. By generating speech in the frequency domain or the quantized bit domain, rather than the time domain, the proposed model is able to achieve significant speedups in the synthesis process.

One potential limitation of the approach is that it may not capture certain temporal dependencies in the speech signal, which could impact the naturalness of the synthesized speech. The authors acknowledge this and suggest that incorporating some temporal context into the model could be an area for future research.

Additionally, the paper does not provide a detailed analysis of the computational complexity of the proposed model, which would be useful for understanding the practical implications of the approach. It would also be interesting to see how the model performs on larger and more diverse speech datasets, as the current evaluation is limited to a single dataset.

Overall, the Frequency-wise Autoregressive (FAR) and Bit-wise Autoregressive (BAR) Speech Synthesis approach presented in the paper is a promising step towards more efficient and high-quality speech synthesis. The novel design choices and the reported results suggest that this could be a valuable contribution to speech synthesis tasks such as text-to-speech and voice conversion.

Conclusion

The paper proposes a new approach to autoregressive speech synthesis, called Frequency-wise Autoregressive (FAR) and Bit-wise Autoregressive (BAR) Speech Synthesis, that aims to address the inefficiency of traditional autoregressive models. By generating speech in the frequency or quantized bit domains, rather than the time domain, the proposed model is able to achieve significant speedups in the synthesis process without compromising the quality of the generated speech.

The experimental results are promising, showing that the model can synthesize speech faster than real-time and achieve better MUSHRA results than the baseline vocoders. This approach represents an important step towards more efficient and high-quality speech synthesis and could have valuable applications in tasks such as text-to-speech and voice conversion.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!


Related Papers


Parallel Synthesis for Autoregressive Speech Generation

Po-chun Hsu, Da-rong Liu, Andy T. Liu, Hung-yi Lee

Autoregressive neural vocoders have achieved outstanding performance in speech synthesis tasks such as text-to-speech and voice conversion. An autoregressive vocoder predicts a sample at some time step conditioned on those at previous time steps. Though it synthesizes natural human speech, the iterative generation inevitably makes the synthesis time proportional to the utterance length, leading to low efficiency. Many works were dedicated to generating the whole speech sequence in parallel and proposed GAN-based, flow-based, and score-based vocoders. This paper proposed a new thought for the autoregressive generation. Instead of iteratively predicting samples in a time sequence, the proposed model performs frequency-wise autoregressive generation (FAR) and bit-wise autoregressive generation (BAR) to synthesize speech. In FAR, a speech utterance is split into frequency subbands, and a subband is generated conditioned on the previously generated one. Similarly, in BAR, an 8-bit quantized signal is generated iteratively from the first bit. By redesigning the autoregressive method to compute in domains other than the time domain, the number of iterations in the proposed model is no longer proportional to the utterance length but to the number of subbands/bits, significantly increasing inference efficiency. Besides, a post-filter is employed to sample signals from output posteriors; its training objective is designed based on the characteristics of the proposed methods. Experimental results show that the proposed model can synthesize speech faster than real-time without GPU acceleration. Compared with baseline vocoders, the proposed model achieves better MUSHRA results and shows good generalization ability for unseen speakers and 44 kHz speech.

6/6/2024

A Non-autoregressive Generation Framework for End-to-End Simultaneous Speech-to-Any Translation

Zhengrui Ma, Qingkai Fang, Shaolei Zhang, Shoutao Guo, Yang Feng, Min Zhang

Simultaneous translation models play a crucial role in facilitating communication. However, existing research primarily focuses on text-to-text or speech-to-text models, necessitating additional cascade components to achieve speech-to-speech translation. These pipeline methods suffer from error propagation and accumulate delays in each cascade component, resulting in reduced synchronization between the speaker and listener. To overcome these challenges, we propose a novel non-autoregressive generation framework for simultaneous speech translation (NAST-S2X), which integrates speech-to-text and speech-to-speech tasks into a unified end-to-end framework. We develop a non-autoregressive decoder capable of concurrently generating multiple text or acoustic unit tokens upon receiving fixed-length speech chunks. The decoder can generate blank or repeated tokens and employ CTC decoding to dynamically adjust its latency. Experimental results show that NAST-S2X outperforms state-of-the-art models in both speech-to-text and speech-to-speech tasks. It achieves high-quality simultaneous interpretation within a delay of less than 3 seconds and provides a 28 times decoding speedup in offline generation.

6/12/2024

Very Low Complexity Speech Synthesis Using Framewise Autoregressive GAN (FARGAN) with Pitch Prediction

Jean-Marc Valin, Ahmed Mustafa, Jan Buthe

Neural vocoders are now being used in a wide range of speech processing applications. In many of those applications, the vocoder can be the most complex component, so finding lower complexity algorithms can lead to significant practical benefits. In this work, we propose FARGAN, an autoregressive vocoder that takes advantage of long-term pitch prediction to synthesize high-quality speech in small subframes, without the need for teacher-forcing. Experimental results show that the proposed 600 MFLOPS FARGAN vocoder can achieve both higher quality and lower complexity than existing low-complexity vocoders. The quality even matches that of existing higher-complexity vocoders.

8/6/2024

Efficient Autoregressive Audio Modeling via Next-Scale Prediction

Kai Qiu, Xiang Li, Hao Chen, Jie Sun, Jinglu Wang, Zhe Lin, Marios Savvides, Bhiksha Raj

Audio generation has achieved remarkable progress with the advance of sophisticated generative models, such as diffusion models (DMs) and autoregressive (AR) models. However, due to the naturally significant sequence length of audio, the efficiency of audio generation remains an essential issue to be addressed, especially for AR models that are incorporated in large language models (LLMs). In this paper, we analyze the token length of audio tokenization and propose a novel Scale-level Audio Tokenizer (SAT), with improved residual quantization. Based on SAT, a scale-level Acoustic AutoRegressive (AAR) modeling framework is further proposed, which shifts the next-token AR prediction to next-scale AR prediction, significantly reducing the training cost and inference time. To validate the effectiveness of the proposed approach, we comprehensively analyze design choices and demonstrate the proposed AAR framework achieves a remarkable 35× faster inference speed and +1.33 Fréchet Audio Distance (FAD) against baselines on the AudioSet benchmark. Code: https://github.com/qiuk2/AAR.

8/20/2024